Research and optimization of an H.264/AVC motion estimation algorithm based on a 3G network

IT 13 019

Examensarbete 30 hp Mars 2013

Research and optimization of an H.264/AVC motion estimation algorithm based on a 3G network
Ou Yu

Institutionen för informationsteknologi Department of Information Technology

Abstract
Research and optimization of an H.264/AVC motion estimation algorithm based on a 3G network
Ou Yu

Teknisk- naturvetenskaplig fakultet UTH-enheten Besöksadress: Ångströmlaboratoriet Lägerhyddsvägen 1 Hus 4, Plan 0 Postadress: Box 536 751 21 Uppsala Telefon: 018 – 471 30 03 Telefax: 018 – 471 30 00 Hemsida: http://www.teknat.uu.se/student

The new video codec standard H.264/AVC was jointly developed by the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG) [1] [2]. It has higher coding efficiency than MPEG-4 and can therefore be applied to high-definition applications in low bit-rate wireless environments. [3] However, H.264/AVC places harsh requirements on the hardware, mainly because of the complexity of the algorithms it uses, and end devices such as smart phones usually do not have sufficient computing capability and are further restricted by limited battery power. As a result, it is crucial to reduce the computational complexity of the H.264/AVC codec while keeping the video quality unharmed. An analysis of the H.264/AVC coding algorithm shows that motion estimation (ME) consumes the largest share of the computing power, so optimizing the ME algorithm is essential for adapting H.264/AVC to real-time, low bit-rate video applications. In this thesis, the basic concepts and key technologies of H.264/AVC are introduced first. The existing block-matching ME algorithms are then systematically illustrated: their flow, the techniques involved, and the pros and cons of each. Next, UMHexagonS, a well-known algorithm now accepted by ITU-T, is introduced in detail, and the author explains from different aspects why this algorithm gains more efficiency than the others. Based on this analysis, the author proposes improvements to UMHexagonS that incorporate ideas from some classic ME algorithms. In the last phase of the thesis, both subjective and objective quality assessment experiments are used to examine the performance of the improved algorithm. The experiments show that the improved ME algorithm requires less computing power than UMHexagonS while keeping the video quality at the same level. The improved algorithm could be used in a wireless environment such as a 3G network.

Handledare (Supervisor): Huijuan Zhang   Ämnesgranskare (Reviewer): Ivan Christoff   Examinator (Examiner): Ivan Christoff   IT 13 019   Tryckt av (Printed by): Reprocentralen ITC

Table of Contents
1 Introduction ... 1
1.1 Background ... 1
1.2 Related Works ... 2
1.3 Thesis Outline ... 4
2 Principle of Block-Based Motion Estimation Algorithm ... 5
2.1 Introduction of the H.264/AVC Standard ... 5
2.2 Encoder and Decoder of the H.264/AVC Standard ... 6
2.3 Motion Estimation Theory ... 7
2.3.1 Basic concept on Motion Estimation ... 8
2.3.2 Key Principle of Motion Estimation ... 11
2.3.3 The Matching Criterion ... 12
2.4 Summary ... 13
3 Analysis of the Classic Motion Estimation Algorithm ... 15
3.1 Full Search ... 15
3.2 Three Step Search ... 16
3.3 Four Step Search ... 17
3.4 Block Based Gradient Descent Search ... 18
3.5 Diamond Search ... 18
3.6 Motion Vector Field Adaptive Search ... 19
3.7 Summary ... 21
4 Analysis and Optimization of Unsymmetrical Multi Hexagon Search ... 23
4.1 Analysis of Unsymmetrical Multi Hexagon Search ... 23
4.2 Optimization of Unsymmetrical Multi Hexagon Search ... 27
4.3 Optimization Details of Unsymmetrical Multi Hexagon Search ... 28
4.3.1 Optimization on Early Termination ... 28
4.3.2 Adoption of Movement Intensity ... 30
4.4 Implementation of the Algorithm ... 32
4.4.1 Starting Point Prediction ... 32
4.4.2 Search Pattern ... 33
4.4.3 Search Process ... 34
4.5 Summary ... 34
5 Experiment Proof of Improved Algorithm ... 37
5.1 Objective Quality Assessment based on JM Model ... 37
5.1.1 Design of the Experiment ... 37
5.1.2 Analysis of the Result ... 38
5.2 Subjective Quality Assessment Using Double Stimulus Impairment Scale ... 40
5.2.1 Design of Experiment ... 40
5.2.2 Analysis of the Result ... 41
5.3 Summary ... 42
6 Conclusion ... 43
References ... 44

Chapter 1 Introduction

1.1 Background

As more and more Telecommunication Service Providers (TSPs) start to promote their 3G network business, the coverage of 3G networks is rocketing up in China. [4] It is expected that two years from now, 9 out of 10 cellphones will use a 3G network. By then, streaming video, one of the most distinguishing features of 3G networks, will be the battlefield on which Telecommunication Service Providers try to maximize their profits. Traditional online streaming media usually uses early coding methods: the media files are quite small and, correspondingly, the image quality is quite poor. The new generation high-definition coding standard, H.264/AVC, can provide better video quality at the same bit rate, and a network abstraction layer has been added to the standard, which makes it more convenient to build Internet streaming applications. The technology is therefore well suited to 3G mobile multimedia usage. However, to apply the H.264/AVC coding standard efficiently on a 3G platform, four main obstacles need to be conquered: power-consumption control, error control, transmission rate control, and compression efficiency. The main power control problem is how to reduce the power that streaming media tasks require, so as to extend battery life. In a mobile wireless video transmission system, the mobile terminal not only needs to decode received video but also needs to encode video for sending, so the power control problem can be divided into two aspects. [5] Fault-tolerant technology is an essential part of wireless video transmission: because of the QoS (Quality of Service) of the 3G wireless channel, it is crucial for ensuring the accuracy and completeness of the transmitted data, and it usually includes lost-data recovery and concealment. Video encoding is also crucial for video communication, because the wireless channel bandwidth is limited; there should be a balance between video content size and output quality, while good and stable quality is ensured on the receiving end. Compression efficiency is an important video encoding parameter. Better compression efficiency means better video quality for the same video file size, but it is gained at the cost of a more complex coding algorithm, which consumes more computing power. For now, the CPU, memory and other hardware of an ordinary 3G phone cannot compete with a mainstream personal computer, so if the video coding algorithm is not adjusted, it will certainly not be suitable for mobile video platforms in 3G use, not to mention real-time encoding applications such as video calls. This thesis focuses on compression efficiency, trying to find a balanced way to simplify the complexity of the algorithm while not harming the video quality significantly.

1.2 Related Works

The data compression of video signals is carried out by reducing redundant signals. Video signals contain two types of redundancy: statistical redundancy (also known as spatial redundancy) and human visual redundancy. Spatial redundancy, or geometric redundancy, is caused by the correlation between adjacent pixels; this kind of redundancy can be removed by changing the mapping rule of the relevant pixels. For example, if the background of the video has only one color, there will be a lot of spatial redundancy. The other kind, psychological redundancy, is caused by the nature of the human visual system, which is not sensitive to a number of frequency components in the video; for example, humans cannot notice slight color changes. For both kinds of redundancy, the greater the amount of redundancy, the higher the achievable compression. At present, the bandwidth of 3G networks is still a bottleneck for high-quality video transmission. Therefore, improving the compression efficiency of video coding is very important, and many research papers focus on reducing the complexity of coding while ensuring the coding quality; most of them address fast motion estimation algorithms. During video compression, more than half of the time is spent on motion estimation (ME). The basic idea of motion estimation is first to divide each frame into non-overlapping macroblocks; each macroblock then finds its reference macroblock in certain frames. By doing this, a large amount of residual can be removed.

Block-matching motion estimation has advantages over other ME approaches, including recursive estimation, Bayesian estimation and the optical flow method, because its concept is more straightforward and it is easy to implement. Many researchers therefore put their energy into block-matching motion estimation, trying to make a breakthrough in video compression efficiency. There are several classic block-based motion estimation algorithms. Theoretically, the full search algorithm (FS) [6] is the most accurate block-matching algorithm, because it searches all candidate blocks pixel by pixel to find the best motion vector (MV). But limited by its high computational complexity, full search is not an ideal method for real-time usage. Later, fast search algorithms were designed to meet real-time requirements. Three-Step Search (TSS) [7] reduces the amount of computation by reducing the number of search points, but its relatively large initial search step impairs the performance. Other algorithms, such as New Three-Step Search (NTSS) [8], New Four-Step Search (NFSS) [9] and Block-Based Gradient Descent Search (BBGDS) [10] [11], make use of the center-biased motion vector distribution and greatly improve speed and efficiency for low-motion video. In October 1999, the Diamond Search (DS) [12] algorithm was adopted by the MPEG-4 verification model. Although the diamond method has better overall performance than the other algorithms and its application was a big success, its performance in some particular cases is limited: it does not provide a flexible way to deal with different video content and easily falls into local optima, which in turn impacts the search performance and coding efficiency. Hexagon Search (HEXBS) [13] is another advanced algorithm. It uses a relatively large search pattern and a fast-moving module to reduce the number of search steps to far fewer than Diamond Search, but HEXBS does not consider motion vector correlation either and thus cannot handle video with intense movement very well. The Motion Vector Field Adaptive Search Technique (MVFAST) and the Predictive Motion Vector Field Adaptive Search Technique (PMVFAST) [14] were included in the 2001 MPEG-4 video standard. Both algorithms use motion correlation and movement-related content features to choose different search modes; in addition, they use prediction vectors and other new concepts, such as early termination, to improve both search speed and video quality. But like the other algorithms mentioned above, these two algorithms cannot handle video with intense movement very well. In Chapter 3, all the algorithms mentioned above will be analyzed in detail, and based on this analysis a new, hybrid algorithm will be introduced which is more suitable for the 3G mobile platform.

1.3 Thesis Outline

Chapter 1 is the introduction. It covers the difficulties of applying video coding technology to 3G networks, summarizes the current research focus, and from this derives the focus of this thesis. Chapter 2 covers the basic principles and framework of H.264/AVC, especially motion estimation. Chapter 3 introduces several classic motion estimation algorithms and analyzes the pros and cons of each. In Chapter 4, the motion estimation algorithm UMHexagonS is analyzed, and based on it this thesis proposes some improvements; the chapter explains the whole improvement process in detail. Chapter 5 is the experimental proof: both objective and subjective methods are used to show that the improved algorithm reduces the complexity of motion estimation while keeping the video quality unharmed. Chapter 6 is the conclusion.


Chapter 2 Principle of Block-Based Motion Estimation Algorithm

2.1 Introduction of the H.264/AVC Standard

MPEG (Moving Picture Experts Group) and VCEG (Video Coding Experts Group) have jointly developed AVC (Advanced Video Coding), which outperforms earlier video codecs such as MPEG-4 and H.263. This codec is also known as ITU-T Rec. H.264 and as the MPEG-4 Part 10 standard; here, in short, we call it H.264/AVC or H.264. The international standard was adopted by ITU-T and officially promulgated in March 2003. It is widely believed that the promulgation of H.264 is a major event in the development of video compression coding, and its superior compression performance plays an important role in digital television broadcasting, video storage, real-time communication, network video streaming and multimedia messaging. Specifically, compared with other video coding technologies, H.264/AVC has the following characteristics:

1. Higher coding efficiency: compared with H.263, it can save approximately 50% of the bit rate while providing the same video quality.

2. Better video quality: H.264 can provide high-quality video images on low bit-rate channels, such as a 3G network.

3. Improved network adaptability: H.264 can work in real-time, low-latency communication applications (such as video conferencing), and can also be used for video storage or video streaming servers where delay is not critical.

4. Hybrid coding structure: like H.263, H.264 uses DCT-style transform coding combined with a DPCM coding structure. It also uses several advanced techniques, such as multi-mode motion estimation, intra prediction, multi-frame prediction, content-based variable length coding and a 4x4 two-dimensional integer transform, to improve the coding efficiency.

5. Fewer encoding options: H.263 often requires quite a lot of options to be set, which increases the difficulty of encoding. H.264 tries to be brief, "back to basics", and reduces the encoding complexity.

6. Adaptability to different occasions: H.264 can use different transmission and playback rates depending on the environment, and also provides a wealth of error-handling tools to control or eliminate packet loss and bit errors.

7. Error recovery: H.264 provides tools to solve the problem of packet loss in network transmission, especially in wireless networks, which have high bit error rates.

8. Higher complexity: H.264 improves its performance by increasing the complexity. It is estimated that the computational complexity of H.264 encoding is roughly three times that of H.263, and the decoding complexity roughly two times that of H.263.

2.2 Encoder and Decoder of the H.264/AVC Standard

H.264/AVC does not give a specific implementation of the encoder and decoder, but only provides a set of semantics and rules, so encoders and decoders from different providers can work within the predefined framework. This encourages positive competition among providers. The composition of the H.264/AVC encoder and decoder is illustrated below:

Chart 2.1 H.264 encoder

Chart 2.2 H.264 decoder

From the charts above, we can see that the main function modules of H.264/AVC are similar to those of former standards, e.g. H.261, H.263, MPEG-1 and MPEG-4; the main difference lies in the details of each module. The encoder uses a hybrid coding method of transform and prediction. The input field or frame F_n is processed in units of macroblocks, and there are two kinds of prediction at this stage: intra-frame prediction and inter-frame prediction. If intra-frame prediction is adopted, the predicted value PRED (represented as P) is calculated from previously coded samples within the current frame. The other way is inter-frame prediction, which is more precise and compression efficient: P is obtained through motion compensation (MC) from a reference frame, marked as F'_{n-1}, which can be either a past frame or a frame in the future. Subtracting PRED from the current block gives a residual block D_n. Through transform and quantization, the transform coefficients X are calculated, and after entropy coding and the addition of side information such as motion vectors and quantization parameters, the final stream can be transferred through the NAL for transmission or storage. In order to provide the reference image for further prediction, the encoder also needs the ability to rebuild the image: the residual is inverse quantized and inverse transformed to D'_n, the unfiltered frame uF'_n is obtained by adding D'_n to P, and after the deblocking (loop) filter we get the rebuilt frame F'_n. Chart 2.2 is the reverse process of Chart 2.1: the input is the H.264/AVC stream, and after the reverse process the current frame F'_n is reconstructed and the final image signal is output.

2.3 Motion Estimation Theory

Motion estimation is the process of predicting the movement of image content between frames, so that a smaller amount of information (the image change) can describe the entire image. Current motion estimation algorithms fall into three categories:

1. Pixel-based motion estimation. This type of algorithm uses individual pixels as the basic unit and describes the state of motion of each pixel. [15] It is highly precise, but its computational complexity is so high that real-time encoding is difficult to achieve, so it cannot be used in real-time scenarios.

2. Object-based motion estimation. Object-based motion estimation usually segments the video images into a number of objects and then tracks and matches these objects. It depends heavily on video object segmentation algorithms, which are not considered mature enough, so progress in object-based motion estimation research has been quite slow. In addition, the segmented objects may be of different sizes without any common rules, which raises the complexity of the algorithm; such designs have not reached practical use.

3. Block-based motion estimation. Block-based motion estimation uses blocks as the basic unit. Each block contains a number of pixels, and all pixels within a block are assumed to share the same state of motion. Because it works block by block, the computational complexity of motion estimation can be greatly reduced. In addition, there are studies on how to improve the encoding quality according to the characteristics of the human eye; for example, one coding method skips non-salient video content in order to allocate more bits to salient content and obtain better overall video quality.

This thesis focuses on block-based motion estimation; the next section introduces the basic principles of block-matching motion estimation.

2.3.1 Basic concept on Motion Estimation

Like any other video coding technology, H.264/AVC is built upon several basic concepts, which are introduced first.

Field and Frame

A field or a frame of the video can be used to generate an encoded image. Typically, video frames are of two types: progressive or interlaced. In traditional television, a frame is divided into two interlaced fields in order to reduce the flickering of the video image. Usually, video content with little motion should adopt frame coding mode, while content with fierce movement should use field coding mode.

Macroblock

A coded image can be divided into several macroblocks. Each macroblock consists of a 16×16 array of luminance (Y) pixels and a number of chroma pixels (Cr, Cb), depending on the indicator in the sequence header. Several macroblocks can form a slice. An I slice contains only I macroblocks; a P slice can contain P and I macroblocks; and a B slice can contain B and I macroblocks. An I macroblock can only use already coded pixels in the current slice for intra-frame prediction. A P macroblock can use previously coded images for inter-frame prediction. A macroblock can be partitioned into 16×16, 16×8, 8×16 or 8×8 blocks, and the 8×8 mode can be further divided into sub-blocks of 8×8, 8×4, 4×8 or 4×4. A B macroblock is similar to a P macroblock, but it can also use future images for inter-frame prediction.

Motion Vector

In inter-frame prediction, every MB is predicted from an equally sized MB in a reference frame, and the displacement between the two is called the Motion Vector (MV). It has 1/4-pixel accuracy for the luminance component and 1/8-pixel precision for the chroma components. As a result, the reference pixels might not actually exist in the reference frame (if the MV is not an integer); they are obtained by interpolation. The transmission of each MV requires a certain number of bits, especially for small block sizes. To reduce the bit rate, adjacent MVs can be used to predict the current MV, because adjacent MVs are highly correlated: only the difference (MVD) between the actual MV and the predicted MV (MVP) needs to be transmitted instead of the whole MV. In this way a large amount of bit rate can be saved.

Chart 2.3 Current MB and adjacent MBs (in same size)


As shown above, E is the current macroblock or sub-macroblock, and A, B and C are the adjacent blocks to the left, on top and to the top right. If there is more than one block to the left (Chart 2.4), the topmost of them is used as A, and among the blocks above, the leftmost one is used as B.

Chart 2.4 Current MB and adjacent MBs (in different sizes)

In Chart 2.4, if the block size is neither 16×8 nor 8×16, the MVP is the median of the MVs from A, B and C. If the block size is 16×8, the MVP of the upper part comes from B and that of the bottom part from A. If the block size is 8×16, the MVP of the left part comes from A and that of the right part from C.
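These selection rules can be expressed compactly in code. The sketch below is a simplified illustration, not the reference implementation: the MV type, the upperHalf/leftHalf flags and the way neighbour vectors are obtained are assumptions made for the example, and the component-wise median corresponds to the median operation used for the MVP.

#include <algorithm>

struct MV { int x; int y; };

// Median of three motion vectors, applied per component.
static MV median_mv(MV a, MV b, MV c) {
    MV m;
    m.x = std::max(std::min(a.x, b.x), std::min(std::max(a.x, b.x), c.x));
    m.y = std::max(std::min(a.y, b.y), std::min(std::max(a.y, b.y), c.y));
    return m;
}

// Simplified MVP selection for the current block E with neighbours
// A (left), B (top) and C (top right), following Chart 2.4.
// blockW x blockH is the partition size of E; upperHalf/leftHalf tell which
// half of a 16x8 or 8x16 partition is being predicted.
MV predict_mv(MV A, MV B, MV C, int blockW, int blockH,
              bool upperHalf, bool leftHalf) {
    if (blockW == 16 && blockH == 8)   // 16x8: top part from B, bottom from A
        return upperHalf ? B : A;
    if (blockW == 8 && blockH == 16)   // 8x16: left part from A, right from C
        return leftHalf ? A : C;
    return median_mv(A, B, C);         // all other sizes: median of A, B, C
}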

Motion Compensation

Motion compensation describes the process of turning the reference picture into the current picture. The segmentation of macroblocks (mentioned above) increases the correlation within each macroblock or sub-macroblock, thus making more efficient motion compensation possible; this is called tree-structured motion compensation. Each macroblock or sub-macroblock has its own motion compensation, and every MV has to be encoded, transferred and integrated into the output stream. For a large MB, the MV and segmentation type take only a small proportion of the stream, while the motion-compensated residual takes the biggest part, because the details of a large MB are more complex than those of a small one. In contrast, the MV and segmentation type take the biggest part for small MBs. So small MBs are suitable for image regions with much detail, and large MBs for regions with little or no detail. As shown in Chart 2.5, the H.264/AVC encoder chooses the segmentation type for each block of a residual frame that has not yet been through motion compensation. For the flat grey background it uses big blocks such as 16×16, but in the detailed parts, for example the face and hair, it uses small blocks to gain better encoding efficiency.

2.3.2 Key Principle of Motion Estimation

In normal cases, adjacent frames of a video are correlated, so redundancy exists, as described by Shannon information theory. As a matter of fact, there is redundancy in almost every video, and it is this redundancy that makes video compression and video encoding technology possible.

Chart 2.5 Residual frame

Using inter-frame prediction can remove the redundancy created by frame correlation. For still image regions, the same pixel in the previous frame is a useful reference for the current frame; for moving regions, the motion vector must be taken into account, so we need to find the matching pixel or MB in a previous or future frame that serves as a reference for the current frame. The process of finding this matching pixel or MB is called motion estimation. Motion estimation removes a large amount of redundancy, thus lowering both the amount of information to encode and the encoding time. As shown in Chart 2.6, the practice of motion estimation is first to divide the image into individual MBs. Assuming all the pixels within an MB share the same MV, the encoder searches for the matching MB in a previous or future frame based on a predefined matching criterion; the vector between the current MB and the matched MB is the motion vector. The next step is to calculate the motion-compensated residual, which is then transformed, quantized, encoded and transferred.

Chart 2.6 Illustration of motion estimation

2.3.3 The Matching Criterion

There are four common matching criteria for block matching [16]: the Mean Square Error (MSE), the Mean Absolute Difference (MAD), the Normalized Cross-Correlation Function (NCCF), and the Sum of Absolute Differences (SAD).

Mean Square Error (MSE)

MSE(i,j) = (1/MN) \sum_{m=1}^{M} \sum_{n=1}^{N} [f_k(m,n) - f_{k-1}(m+i, n+j)]^2        (2-1)

Here (i,j) is the candidate motion vector, f_k and f_{k-1} are the grey values of the pixels in the current and previous frame, and M×N is the size of the block. The MB with the lowest MSE is selected as the matched MB.

Mean Absolute Difference (MAD)

MAD(i,j) = (1/MN) \sum_{m=1}^{M} \sum_{n=1}^{N} |f_k(m,n) - f_{k-1}(m+i, n+j)|        (2-2)

The MB with the lowest MAD is the matched MB.

Normalized Cross-Correlation Function (NCCF)

NCCF(i,j) = \sum_{m=1}^{M} \sum_{n=1}^{N} f_k(m,n) f_{k-1}(m+i, n+j) / ( [\sum_{m=1}^{M} \sum_{n=1}^{N} f_k^2(m,n)]^{1/2} [\sum_{m=1}^{M} \sum_{n=1}^{N} f_{k-1}^2(m+i, n+j)]^{1/2} )        (2-3)

The MB with the highest NCCF is the matched MB.

Sum of Absolute Differences (SAD)

SAD(i,j) = \sum_{m=1}^{M} \sum_{n=1}^{N} |f_k(m,n) - f_{k-1}(m+i, n+j)|        (2-4)

The MB with the lowest SAD is selected as the matched MB. In practice, the choice of matching criterion does not play a vital role in the precision of the matching process. Because SAD is easy to implement and requires little computing power, it is the criterion most often chosen.
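As an illustration, the SAD criterion of equation (2-4) can be computed in a few lines. The sketch below assumes frames stored as row-major 8-bit luma planes with a common stride; it is not taken from any reference encoder, and the caller is responsible for keeping the displaced block inside the frame.

#include <cstdint>
#include <cstdlib>

// SAD between the MxN block at (bx, by) in the current frame and the block
// displaced by the candidate motion vector (i, j) in the previous frame.
// Both frames are row-major 8-bit luma planes with the same stride.
int block_sad(const uint8_t* cur, const uint8_t* prev, int stride,
              int bx, int by, int M, int N, int i, int j) {
    int sad = 0;
    for (int m = 0; m < M; ++m) {
        const uint8_t* c = cur  + (by + m) * stride + bx;
        const uint8_t* p = prev + (by + m + j) * stride + (bx + i);
        for (int n = 0; n < N; ++n)
            sad += std::abs(int(c[n]) - int(p[n]));   // |f_k - f_{k-1}|
    }
    return sad;   // the candidate with the smallest SAD is the match
}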

2.4 Summary

This chapter introduced the basic concepts of the H.264/AVC video encoding technology, and key techniques such as motion compensation and motion estimation were introduced in detail.


Chapter 3 Analysis of the Classic Motion Estimation Algorithm

Since the birth of video codec technology, motion estimation has been one of its most important elements; the efficiency of the motion estimation algorithm largely decides the success or failure of a video encoding technology. In the development of block-matching motion estimation, many innovative, highly efficient algorithms have appeared, and many of them have since been replaced by even more efficient ones, but their underlying ideas have become classics. This chapter analyzes some of these classic motion estimation algorithms.

3.1 Full Search


Chart 3.1 Full search process

Chart 3.1 illustrates the process of finding the best motion vector in the Full Search algorithm. The algorithm starts from the origin of the search window (generally the upper left), and inside a predefined search box every possible candidate block is compared with the current block; the most appropriate prediction block is then chosen. The offset between the two blocks is the motion vector MV(u, v). Full search examines all possible blocks, so its accuracy is the best, but the search volume is too large and it is difficult to meet the needs of real-time encoding.
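The exhaustive search is essentially two nested loops over the search window. The following sketch is illustrative only: a fixed 16×16 block size and a row-major 8-bit luma layout are assumptions, and boundary handling is left to the caller.

#include <cstdint>
#include <cstdlib>
#include <climits>

struct MV { int x; int y; };

// 16x16 SAD between the block at (bx, by) in cur and the block displaced
// by (dx, dy) in prev; both planes are row-major with the same stride.
static int sad16(const uint8_t* cur, const uint8_t* prev, int stride,
                 int bx, int by, int dx, int dy) {
    int s = 0;
    for (int m = 0; m < 16; ++m)
        for (int n = 0; n < 16; ++n)
            s += std::abs(int(cur[(by + m) * stride + bx + n]) -
                          int(prev[(by + m + dy) * stride + bx + n + dx]));
    return s;
}

// Full search over a (2R+1) x (2R+1) window centred on (bx, by).
// The caller must guarantee the window stays inside the previous frame.
MV full_search(const uint8_t* cur, const uint8_t* prev, int stride,
               int bx, int by, int R) {
    MV best = {0, 0};
    int bestSad = INT_MAX;
    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx) {
            int s = sad16(cur, prev, stride, bx, by, dx, dy);
            if (s < bestSad) { bestSad = s; best = {dx, dy}; }  // keep best MV
        }
    return best;
}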

3.2 Three Step Search

The process of the three-step search algorithm is shown in Chart 3.2. The center point (i, j) of a 16×16 search window is set to the origin (0,0), and the initial search step size is set to half of the maximum step. The MAD values of point (i, j) and its eight neighboring points are calculated, and the point with the smallest MAD becomes the new search center. The step size is then halved and the process is repeated until the final point with the smallest MAD is found; its offset is the motion vector. The three-step search algorithm is highly efficient, but because its search process is so simple, it cannot be guaranteed that the optimal motion vector will be found.

Chart 3.2 Three step search process


3.3 Four Step Search

The four-step search method improves on the three-step search method. Chart 3.3 shows the search process.
1. Set the center of the search box as the origin and evaluate the nine points of a 5x5 pattern inside the 15x15 search window. If the point with the minimum SAD is the center of the pattern, go to step 4; otherwise go to step 2.
2. Set the best matching point from step 1 as the new search center, still using the 5x5 pattern. Two cases arise: if the optimal point of the previous step was in the middle of an edge of the pattern, only the 6 unsearched points need SAD comparison; if it was on a corner, only the 5 unsearched points need to be checked. After the comparison, if the optimal point is the center, go to step 4; otherwise go to step 3.
3. Repeat step 2, then jump to step 4.
4. Reduce the pattern to a small 3x3 box and compare the nine detection points; the point with the minimum matching error is the best matching point.

Chart 3.3 Four step search process


3.4 Block Based Gradient Descent Search

The Block Based Gradient Descent Search (BBGDS) uses a special search mode that concentrates on the central location: it checks nine points in each search box, as shown in Chart 3.4, and after the first step each move adds only three or five new search points. Once the point with the smallest BDM is exactly at the center of the search box, or the search reaches the boundary of the search window, the algorithm stops. BBGDS is very suitable for scenarios with little image movement, but for video with fierce movement this algorithm is far from satisfactory.

Chart 3.4 BBGDS process

3.5 Diamond Search

The Diamond Search (DS) was introduced in 1997 by Shan Zhu and Kai-Kuang Ma. It is one of the classic motion estimation algorithms and still has a very wide range of applications today. The main factors that affect the speed and effectiveness of the algorithm are the shape and size of the search pattern, so the diamond search method uses two search patterns, a big diamond and a small diamond. The big diamond pattern checks the center point and eight surrounding points, while the small diamond checks the center point along with four adjacent points. The detailed search process, illustrated in Chart 3.5 below, runs as follows.
1. Starting from the center of the search window, the algorithm first uses the big diamond pattern to check the nine points. If the point with the smallest SAD is the center of the pattern, jump to step 3; if not, continue with step 2.
2. Starting from the point with the smallest SAD in the previous step, repeat the search using the big diamond pattern, comparing the SAD of the newly covered points with that of the center point. If the optimal point is the center, jump to step 3; otherwise repeat this step.
3. Change to the small diamond pattern and check the points not yet searched. The point with the smallest SAD is the optimal matching point. The total number of search points is 9+n(3,5)+4.

Chart 3.5 DS process
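The two-pattern loop described above can be sketched as follows. The cost callable is assumed to wrap a SAD computation such as the sad16 helper in Section 3.1; the sketch ignores search-window clipping and re-evaluates some already-visited points for simplicity, so it illustrates the pattern logic rather than an optimized implementation.

#include <climits>
#include <functional>

struct MV { int x; int y; };

// Diamond search sketch. 'cost' returns the SAD of the candidate MV (dx, dy).
MV diamond_search(const std::function<int(int, int)>& cost) {
    static const MV LDSP[9] = {{0,0},{0,-2},{1,-1},{2,0},{1,1},
                               {0,2},{-1,1},{-2,0},{-1,-1}};    // large diamond
    static const MV SDSP[5] = {{0,0},{0,-1},{1,0},{0,1},{-1,0}}; // small diamond

    MV center = {0, 0};
    // Steps 1-2: repeat the large diamond until its best point is the centre.
    for (;;) {
        MV best = center;
        int bestCost = INT_MAX;
        for (const MV& p : LDSP) {
            int c = cost(center.x + p.x, center.y + p.y);
            if (c < bestCost) { bestCost = c; best = {center.x + p.x, center.y + p.y}; }
        }
        if (best.x == center.x && best.y == center.y) break;  // minimum at centre
        center = best;                                        // move the pattern
    }
    // Step 3: one final small diamond around the last centre.
    MV best = center;
    int bestCost = INT_MAX;
    for (const MV& p : SDSP) {
        int c = cost(center.x + p.x, center.y + p.y);
        if (c < bestCost) { bestCost = c; best = {center.x + p.x, center.y + p.y}; }
    }
    return best;   // motion vector of the matched block
}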

3.6 Motion Vector Field Adaptive Search

All the block-matching methods mentioned above use fixed search patterns and search strategies regardless of the nature of the video content. As a result, many improved algorithms have appeared that take advantage of the temporal and spatial properties of the image sequence and of human visual characteristics. Two methods are common among them. One is rapid background detection. In most video sequences the background takes up the biggest part of the image, and if we can detect it quickly we can reduce the computing time by a big margin. For example, we can directly calculate the SAD value of the zero vector (the starting point), and if it is less than a certain threshold T we terminate the search immediately and take the zero vector as the final motion vector. In this way a single check locates the optimal point, which improves the efficiency of the algorithm. The other common method is based on predicting the complexity of the current block's movement, applying different search patterns to different levels of movement. If the motion vectors of the adjacent blocks are comparatively large, the current block is considered to be in fierce movement; in this case a big search pattern such as DS or hexagon search is applied, otherwise a small search pattern is enough to complete the search. This is the basic idea of the Motion Vector Field Adaptive Search Technique (MVFAST), which was quite a breakthrough in fast motion estimation algorithms.

1. Still MB detection. Most still MBs in any video sequence have the MV (0,0). Those still MBs can be detected using SAD: if the SAD of the (0,0) candidate is smaller than a threshold T, the MB is considered still and the search ends immediately, which is called early termination; (0,0) is then the motion vector. In MVFAST this threshold is set to 512 and is configurable; if it is set to 0, the early-termination step is skipped.

2. Movement intensity. In MVFAST, the movement intensity of an MB is defined by the MVs of its Region of Support (ROS): the adjacent MBs on the left, top and top right. Assuming V_i = (x_i, y_i) is the MV of MB1, MB2, MB3, let

L_i = |x_i| + |y_i|,   L = \max_i(L_i)

Then the movement intensity of the current MB is defined as

Movement intensity = low,    if L < L_1
                   = medium, if L_1 \le L < L_2        (3.1)
                   = high,   if L \ge L_2

where L_1 and L_2 can be pre-defined.

Chart 3.4 Region of support

3. Starting search point. The starting search point depends on the movement intensity of the current MB. If it is medium or low, (0,0) is used as the starting point. If the movement intensity is high, the SADs of the candidates that the three adjacent MBs point to are calculated, and the one with the lowest SAD is chosen as the starting search point.
4. Search pattern. Two search patterns are used in MVFAST: a big diamond search and a small diamond search. As explained above, an MB with high movement intensity uses the big diamond pattern; otherwise the small diamond search is applied.
MVFAST takes adjacent MBs into consideration to determine the starting search point, and it uses different search patterns for different situations. As a result, it strikes a balance between speed and quality.
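The decisions described in items 1-4 can be combined into a short sketch. The MV type and the diamond-search callables are placeholders, and the values chosen for L1 and L2 are arbitrary examples (the text only says they can be pre-defined); only the threshold 512 comes from the description above.

#include <algorithm>
#include <cstdlib>

struct MV { int x; int y; };

enum class Intensity { Low, Medium, High };

// Movement intensity of the current MB from its Region of Support (ROS):
// the left, top and top-right neighbours, as in Chart 3.4.
Intensity movement_intensity(const MV ros[3], int L1, int L2) {
    int L = 0;
    for (int i = 0; i < 3; ++i)
        L = std::max(L, std::abs(ros[i].x) + std::abs(ros[i].y)); // L = max(|x|+|y|)
    if (L < L1) return Intensity::Low;
    if (L < L2) return Intensity::Medium;
    return Intensity::High;
}

// Skeleton of the MVFAST decision flow for one macroblock.
// sad_at(mv) is assumed to return the SAD of the candidate mv;
// small_diamond/big_diamond stand for the two diamond searches.
template <typename SadFn, typename SmallFn, typename BigFn>
MV mvfast_decide(const MV ros[3], SadFn sad_at,
                 SmallFn small_diamond, BigFn big_diamond,
                 int L1 = 2, int L2 = 8, int T = 512) {
    // 1. Still-block detection: accept (0,0) immediately if it is good enough.
    if (sad_at(MV{0, 0}) < T)
        return MV{0, 0};                          // early termination

    Intensity in = movement_intensity(ros, L1, L2);

    // 2. Starting point: (0,0) for low/medium intensity, otherwise the ROS
    //    predictor with the lowest SAD.
    MV start{0, 0};
    if (in == Intensity::High) {
        start = ros[0];
        for (int i = 1; i < 3; ++i)
            if (sad_at(ros[i]) < sad_at(start)) start = ros[i];
    }

    // 3.-4. Pattern selection: big diamond for intense motion, small otherwise.
    return (in == Intensity::High) ? big_diamond(start) : small_diamond(start);
}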

3.7 Summary

This chapter first introduced several classic fast block-matching motion estimation algorithms, focusing on their search patterns, motion estimation strategies and detailed search processes, and then analyzed the strengths and weaknesses of each. It can be concluded that the FS algorithm has the highest accuracy, but also the highest amount of computation. The MVFAST algorithm contributes a valuable idea, "stop when it is good enough", which saves a large amount of search volume. This analysis lays a good foundation for the introduction of the UMHexagonS motion estimation algorithm in the next chapter.


Chapter 4 Analysis and Optimization of Unsymmetrical Multi Hexagon Search

4.1 Analysis of Unsymmetrical Multi Hexagon Search

Some of the motion estimation algorithms mentioned above, such as TSS, FSS, DS and HS, all aim to reduce the search volume by limiting the number of search points. These algorithms are efficient when the video frames are small, but for large frames and a larger search range such fast algorithms tend to fall into local optima, which seriously affects the coding efficiency. Therefore, this chapter focuses on the Unsymmetrical Multi Hexagon Search (UMHexagonS) [17] algorithm. This algorithm can save up to 90% of the computing complexity compared to Full Search, and it uses multi-level search patterns of different shapes, which helps it avoid local optima. UMHexagonS also uses SAD as its matching criterion, and it uses an early-termination mechanism. In most cases the best matching point is very close to the initial prediction point, which means that much of the motion estimation search is superfluous. In this early-termination mechanism the threshold values are mainly determined by two factors: the adjustment factors (\beta_1, \beta_2) and the predicted motion cost mcost_{pred}. The thresholds are defined as follows:

Threshold_A = mcost_{pred} \times (1 + \beta_1)        (4.1)

Threshold_B = mcost_{pred} \times (1 + \beta_2)        (4.2)

If Threshold_B is met, the algorithm skips Step 3 and jumps directly to Step 4_2. If only Threshold_A is met, it performs Step 4_1 before Step 4_2. The whole process is illustrated in Chart 4.1.


Chart 4.1 Flow chart of UMHexagonS
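A sketch of how the two thresholds of equations (4.1) and (4.2) steer the flow in Chart 4.1 is shown below. The names mcost, mcost_pred, beta1 and beta2 are placeholders for the values the encoder actually computes; only the branching follows the description above, and the relative size of the two thresholds depends on how the β factors are chosen.

// Hedged sketch of the early-termination branching in UMHexagonS.
// mcost is the cost of the current predictor, mcost_pred the predicted cost.
enum class NextStep { Step3, Step4_1, Step4_2 };

NextStep early_termination(double mcost, double mcost_pred,
                           double beta1, double beta2) {
    double thresholdA = mcost_pred * (1.0 + beta1);   // equation (4.1)
    double thresholdB = mcost_pred * (1.0 + beta2);   // equation (4.2)

    if (mcost < thresholdB)     // "perfect" start: skip step 3, go to step 4_2
        return NextStep::Step4_2;
    if (mcost < thresholdA)     // acceptable start: step 4_1, then step 4_2
        return NextStep::Step4_1;
    return NextStep::Step3;     // otherwise run the full search pipeline
}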

There are mainly four steps.

1. Starting search point prediction

Starting search point prediction takes advantage of the correlation between adjacent MBs and the current MB. It has four prediction modes:

Median Prediction (MP). MP uses the median of the MVs of the MBs on the left (A), top (B) and top right (C) as the predicted MV:

MV_{predMP} = median[MV_A, MV_B, MV_C]        (4.3)

Chart 4.2 Median Prediction

Uplayer Prediction (UP). In H.264/AVC there are 7 partition modes for an MB, from 16×16 down to 4×4. UP uses the MV already found for the partition one level larger that covers the current block as the predicted MV. For example, if the current block is 8×16, the MV of the corresponding 16×16 block is used as its reference:

MV_{predUP} = MV_{uplayer}        (4.4)

Chart 4.3 Uplayer Prediction

Corresponding-block Prediction (CP). CP uses the MB at the same position in the previous frame as its reference. This is more suitable for images with little movement:

MV_{predCP} = MV_{CP}        (4.5)

Chart 4.4 Corresponding-block Prediction

Neighboring Reference-frame Prediction (NRP). NRP uses the MV found in one previous reference frame to predict the MV in another. Assuming the current frame is at time t and the match is sought in the reference frame at time t', the MV already found towards frame t'+1 can be scaled as a prediction:

MV_{predNRP} = MV_{NR} \times (t - t') / (t - t' - 1)        (4.6)

Chart 4.5 Neighboring Reference-frame Prediction

2. Asymmetric cross search. Horizontal movement in video content is more frequent than vertical movement, so an asymmetric cross search gains better accuracy and efficiency. As shown in step 2 of Chart 4.6, the horizontal search range is double the vertical search range.
3. Uneven multi-level hexagon search. This step has two sub-steps. First, a 5×5 square search from (-2,-2) to (2,2) is performed, and the best point found becomes the center of the next sub-step, a 16-point uneven multi-level hexagon search. Once one hexagon layer is finished, the search range is doubled and a bigger hexagon layer is searched. This process reduces the possibility of falling into a local optimum.
4. Extended hexagon search. There are also two sub-steps. First, the best matching point from step 3 is set as the center of the search, and the pattern is changed into a small 6-point hexagon. When the best matching point of the hexagon is located, a small cross search is performed around it to refine the result. In this way the best matching MB and its MV are found.

Chart 4.6 UMHexagonS process

UMHexagonS uses multi-layered search patterns of different kinds to improve the accuracy of the search while keeping the computing requirements low; it can save about 90 percent of the computing power compared to full search. UMHexagonS has been officially adopted by the H.264/AVC standardization group.

4.2 Optimization of Unsymmetrical Multi Hexagon Search

UMHexagonS has several advantages over other algorithms. First, it has a complete starting point prediction system. This system saves a considerable amount of computing power, although the start point prediction itself also consumes some computation. It also has a variety of search patterns to handle different scenarios: it uses the uneven multi-level hexagon search for large-scale search and the hexagon pattern for detailed search. The combination of the two reduces unnecessary search steps while maintaining search accuracy. Finally, the introduction of early termination is quite a success, and its two thresholds further save computing power.

Because of these advantages over other algorithms, this thesis uses UMHexagonS as the model for further optimization. A good algorithm is not necessarily perfect in every case, and after closer examination this algorithm can be optimized in the following ways. First, UMHexagonS is not specially designed for mobile usage. The quantization parameter (QP) in mobile usage is usually high, which means the tolerance for error is comparatively high; as a result, some parts of UMHexagonS are over-designed. For example, after early termination it still needs to perform a cross search to locate the matching block. Here we can use the MVFAST concept of "stop when it is good enough" to optimize the algorithm. Secondly, UMHexagonS uses fixed values for its threshold settings. Video is by nature not static and can change beyond anyone's guess, so it is not reasonable to use one fixed number for all kinds of video. Last, when the condition for early termination is met, in other words when the SAD is smaller than the threshold, the matching block is assumed to be nearby and the hexagon search pattern is applied. But in some cases, for example when the background is a gradient from black to dark grey, the left side of the image may already meet the condition for early termination while the matching block is on the right side of the image; the hexagon pattern then has to search from left to right, which wastes a lot of computing power. It is doubtful that SAD alone can determine whether the matching block is nearby. Summing up these ideas, it is possible to further optimize UMHexagonS.

4.3 Optimization Details of Unsymmetrical Multi Hexagon Search

4.3.1 Optimization on Early Termination

In UMHexagonS, two thresholds are applied for early termination: when the predicted value is under the lower threshold, the system considers this the perfect starting point and performs only a small cross search; if the value is under the higher threshold, somewhat above the lower one, the system considers this an acceptable point and performs a hexagon search. However, for a fast motion estimation algorithm this process is too complex: even when a point meets the criterion for early termination, UMHexagonS still performs a hexagon search and a small cross search, which are quite time consuming. Taking the core thought of MVFAST, stop when it is good enough, into consideration, we change the details of early termination in UMHexagonS. The higher threshold is kept untouched, while the lower threshold is changed into an acceptance threshold: if the predicted value is lower than this threshold, the whole search ends immediately and the predictor becomes the MV. We also change the fixed threshold into an adaptive threshold:

T = \min(SAD_{above}, SAD_{aboveright}, SAD_{left}, SAD_{front}, 512)        (4.7)

where SAD_{above}, SAD_{aboveright}, SAD_{left} and SAD_{front} stand for the SAD of the blocks above, above right, to the left, and at the same position in the previous frame. The reason for an adaptive threshold is that video content is flexible, and a fixed threshold cannot handle all situations; with the adaptive threshold, the algorithm can adjust itself in reaction to different video content.

The improved algorithm flow is illustrated in Chart 4.7.


Chart 4.7 Optimized process (1)
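Equation (4.7) translates directly into a small helper. The variable names below are illustrative; in a real encoder the neighbouring SAD values would come from the bookkeeping of already coded blocks, similar to the frame_info structure used in Section 4.4.

#include <algorithm>

// Adaptive early-termination threshold of equation (4.7).
// sad_above, sad_above_right and sad_left are the SADs of the neighbouring
// blocks in the current frame; sad_front is the SAD of the co-located block
// in the previous frame. 512 is kept as an upper bound, as in MVFAST.
int adaptive_threshold(int sad_above, int sad_above_right,
                       int sad_left, int sad_front) {
    return std::min({sad_above, sad_above_right, sad_left, sad_front, 512});
}

// Usage in the modified early termination: if the SAD of the predicted
// starting point is below T, the search stops and the predictor becomes
// the final motion vector.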

4.3.2 Adoption of Movement Intensity

In most cases, the matching MB is very close to the starting search point; in other words, much of the subsequent search process is often not actually needed. This is the reason early termination is adopted in UMHexagonS. But then another question arises: how to choose the criterion for early termination. With a comparatively high threshold, the computing time is shortened but precision suffers; with a low threshold, more computing time is required. Moreover, by the nature of video content, the movement of a block is not strongly related to its SAD value. So here we use another term to describe the movement of the video: movement intensity. When the movement intensity of the current MB is low, the matching MB is considered to be nearby, and this kind of criterion is more reasonable than SAD. Different from MVFAST, we use only two levels of movement intensity, high and low, determined by the adjacent MBs MB1, MB2 and MB3. With the MV of MB1, MB2, MB3 represented as V_i = (x_i, y_i), let

L_i = |x_i| + |y_i|,   L = \max_i(L_i)

Then the movement intensity of the current MB is

Movement intensity = low,  if L < L_1
                   = high, if L \ge L_1        (4.8)

where L_1 is set to 5.

Chart 4.8 Region of Support

If L < L_1, the current MB is in low movement and we can assume that the matching MB is nearby, so a hexagon search is enough to find the result. If not, the target may be at a distance, and we have to follow the steps mentioned above. The final process is illustrated in Chart 4.9.


Chart 4.9 Optimized process (2)
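A compact sketch of the decision logic in Chart 4.9, combining the adaptive threshold of equation (4.7) with the two-level movement intensity of equation (4.8), is given below. The search routines are placeholder callables and the MV type is an assumption; L1 = 5 follows the value chosen above.

#include <algorithm>
#include <cstdlib>

struct MV { int x; int y; };

// Two-level movement intensity of equation (4.8), with L1 = 5.
bool low_movement(const MV ros[3], int L1 = 5) {
    int L = 0;
    for (int i = 0; i < 3; ++i)
        L = std::max(L, std::abs(ros[i].x) + std::abs(ros[i].y)); // L = max(|xi|+|yi|)
    return L < L1;
}

// Skeleton of the optimized search for one macroblock (Chart 4.9).
// sad_at(mv) returns the SAD of a candidate MV; hexagon_search and
// umhexagons_pipeline stand for the detailed searches described earlier.
template <typename SadFn, typename HexFn, typename PipeFn>
MV optimized_search(const MV& predicted, const MV ros[3], int T,
                    SadFn sad_at, HexFn hexagon_search,
                    PipeFn umhexagons_pipeline) {
    // 1. Early termination with the adaptive threshold T of equation (4.7):
    //    a good enough predictor ends the whole search immediately.
    if (sad_at(predicted) < T)
        return predicted;

    // 2. Low movement intensity: the match is assumed to be nearby,
    //    so a single hexagon refinement around the predictor is enough.
    if (low_movement(ros))
        return hexagon_search(predicted);

    // 3. Otherwise fall back to the full UMHexagonS step sequence.
    return umhexagons_pipeline(predicted);
}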

4.4 Implementation of the Algorithm

The improved algorithm flow was explained in detail above. Based on it, this section describes the implementation.

4.4.1 Starting Point Prediction

The first step is starting point prediction. There are four modes in UMHexagonS. Taking median prediction as an example, the algorithm first gets the MVs of the three adjacent MBs and then takes their median as the result. The implementation in the UMHexagonS source code is as follows:

void Get_MVp(const int x, const int y, MV *pre_mv, int &mvx, int &mvy, uint32 *sad = NULL)
{
    uint32 num[10];
    if (sad == NULL)
        sad = num;

    if (y > 0) {                                  // predict MV of the above MB
        pre_mv[0] = frame_info.mv[x][y-1];
        sad[0]    = frame_info.sad[x][y-1];
    } else {
        pre_mv[0].dx = pre_mv[0].dy = 0;
        sad[0] = 0;
    }

    if (x > 0) {                                  // predict MV of the left MB
        pre_mv[1] = frame_info.mv[x-1][y];
        sad[1]    = frame_info.sad[x-1][y];
    } else {
        pre_mv[1].dx = pre_mv[1].dy = 0;
        sad[1] = 0;
    }

    if (x > 0 && y
