Efficient fast mode decision using mode complexity for multi-view video coding

J. Cent. South Univ. (2014) 21: 4244−4253 DOI: 10.1007/s11771-014-2421-6

WANG Feng-sui(王凤随)1, 2, SHEN Qing-hong(沈庆宏)1, DU Si-dan(都思丹)1
1. School of Electronic Science and Engineering, Nanjing University, Nanjing 210023, China;
2. College of Electrical Engineering, Anhui Polytechnic University, Wuhu 241000, China
© Central South University Press and Springer-Verlag Berlin Heidelberg 2014

Abstract: Variable block-size motion estimation (ME) and disparity estimation (DE) are adopted in multi-view video coding (MVC) to achieve high coding efficiency. However, they also introduce much higher computational complexity into the coding system, which hinders the practical application of MVC. An efficient fast mode decision method using mode complexity is proposed to reduce this computational complexity. In the proposed method, the mode complexity is first computed using the spatial, temporal and inter-view correlation between the current macroblock (MB) and its neighboring MBs. Based on the observation that the direct mode is highly likely to be the optimal mode, the mode complexity is always checked in advance against a predefined threshold to provide an efficient early termination opportunity. If this early termination condition is not met, MBs are classified into three mode types according to the value of the mode complexity, i.e., simple mode, medium mode and complex mode, to speed up the encoding process by reducing the number of variable block modes that must be checked. Furthermore, for the simple and medium mode regions, the rate distortion (RD) cost of mode 16×16 in the temporal prediction direction is compared with that in the disparity prediction direction, to determine in advance whether the optimal prediction direction is temporal, so that unnecessary disparity estimation can be skipped.
Experimental results show that, compared with the full mode decision (FMD) in the MVC reference software, the proposed method reduces the computational load by 78.79% and the total bit rate by 0.07% on average, while incurring only a negligible loss of PSNR (about 0.04 dB on average).

Key words: multi-view video coding; mode decision; mode complexity; computational complexity

1 Introduction

With the development of cameras and displays, new multimedia applications such as three-dimensional TV (3DTV) and free-viewpoint TV (FTV) have been emerging, which provide people with a highly-welcome 3D stereoscopic experience and the freedom of selecting the viewpoint [1−2]. The key to these new multimedia applications is multi-view video, which is captured by a set of video cameras from various viewpoints at the same time. However, as the number of views increases, the amount of multi-view data grows linearly compared with traditional single-view video, consuming a large amount of bandwidth. Hence, it is indispensable to compress multi-view video data for efficient storage and transmission. To this end, the joint video team (JVT) of ITU-T VCEG and ISO/IEC MPEG has standardized multi-view video

coding (MVC) as a new extension of H.264/AVC (i.e., Annex H) [3]. Joint multi-view video coding (JMVC) has been developed as the MVC reference software. Compared with H.264/AVC, the MVC standard not only adopts intricate intra prediction using spatial correlation and variable block-size ME using temporal correlation within a single view, but also uses variable block-size DE using inter-view correlation between neighboring views to achieve higher coding efficiency [4]. Although these techniques achieve the highest possible coding efficiency, they result in extremely large computational complexity, which obstructs the practical application of MVC. Therefore, it is necessary to design an algorithm that reduces the computational complexity with minimal loss of image quality. In recent years, some fast mode decision algorithms for MVC have been presented [5−10]. DING et al [5] proposed a fast content-aware ME algorithm that fully utilizes the inter-view correlation. As DE is different from ME in MVC, a selective DE algorithm

Foundation item: Project(08Y29-7) supported by the Transportation Science and Research Program of Jiangsu Province, China; Project(201103051) supported by the Major Infrastructure Program of the Health Monitoring System Hardware Platform Based on Sensor Network Node, China; Project(61100111) supported by the National Natural Science Foundation of China; Project(BE2011169) supported by the Scientific and Technical Supporting Program of Jiangsu Province, China
Received date: 2013−06−24; Accepted date: 2013−11−28
Corresponding author: DU Si-dan, Professor, PhD; Tel: +86−13951939745; E-mail: [email protected]; SHEN Qing-hong, Associate Professor, PhD; Tel: +86−13905164251; E-mail: [email protected]


was proposed by HUO et al [6] to reduce the computational complexity of inter-view prediction, based on the observation that the contribution of DE to the coding efficiency depends on the temporal level of the current picture. CHAN et al [7] proposed a fast mode decision for MVC based on a set of dynamic thresholds, which were determined by on-line statistical analysis of the motion and disparity costs of the first group of pictures (GOP) in each view. According to the mode distribution correlation between neighboring views, SHEN et al [8] proposed a fast mode size decision algorithm for inter-frame coding to reduce the computational complexity of ME and DE in MVC. ZENG et al [9] proposed an early termination scheme for skip mode, which checks whether the RD cost of skip mode is below an adaptive threshold to provide a possible early termination chance. SEO and SOHN [10] proposed fast algorithms for MVC that skip disparity estimation for inter-view prediction, based on the amount of motion measured through motion activity and skip mode along the temporal axis. In this work, a more efficient mode decision algorithm for MVC was proposed. First, the mode complexity was computed using the spatial, temporal and inter-view correlation between the current MB and its neighboring MBs, and compared with a predefined threshold as an efficient early termination scheme. If the mode complexity was smaller than the threshold, the direct mode was selected as the optimal mode and the mode decision process was terminated early. If this early termination condition was not satisfied, all the prediction block modes were divided into three mode classes on the basis of the mode complexity, i.e., simple mode, medium mode and complex mode. Each class was assigned only specific candidate modes, to further reduce the number of variable block modes.
For the MBs in the simple mode region, only mode size 16×16 was checked and the other mode sizes were skipped; for the MBs in the medium mode region, mode size 8×8 was skipped; for the MBs in the complex mode region, all mode sizes were checked. In addition, disparity estimation skipping was applied to the simple and medium mode regions by comparing the RD cost of mode 16×16 in the temporal prediction direction with that in the disparity prediction direction, so that unnecessary disparity estimation could be skipped. For the complex mode, DE was always tested and the mode with the minimum RD cost was selected as the optimal mode. Experimental results have demonstrated that the proposed method can greatly reduce the computational complexity with negligible loss of coding efficiency, compared with FMD in JMVC.


2 Overview of mode decision in MVC

To reduce the temporal and inter-view redundancies in multi-view video and improve coding efficiency, the MVC reference software, JMVC, adopts the hierarchical B picture (HBP) [11] prediction structure from HHI. Figure 1 shows an example of the HBP prediction structure with 8 views and G=8 (the GOP length). All pictures can be classified into two classes: the anchor pictures (i.e., the pictures at T0 and T8) and the non-anchor pictures, which are located between two adjacent anchor pictures (i.e., the pictures at T1, T2, …, T7). The MVC standard adopts inter-view prediction via DE to reduce the redundancy among neighboring views. Full inter-view prediction is applied to every other view, i.e., V1, V3, V5 and V7 in Fig. 1. For the anchor pictures, which serve as points of synchronization and random access, such as the pictures at T0 and T8 in Fig. 1, inter-view prediction is performed regardless of view order. For the other views, i.e., V0, V2, V4 and V6 in Fig. 1, temporal prediction via ME is conducted for the non-anchor pictures only, by referring to the neighboring temporal pictures within the same view. Furthermore, a view type can be determined from the picture types of its anchor pictures. For example, the view type of V0 is called I-view, the view type of V2 is P-view, and the view type of V1 is named B-view. In JMVC 8.0, a large set of modes is provided to exploit the spatial, temporal and inter-view correlation in multi-view video with various kinds of motion content. For DE and ME, JMVC provides seven variable block sizes: 16×16, 16×8, 8×16 and 8×8, where each 8×8 block can be further divided into 8×4, 4×8 or 4×4 blocks, as shown in Fig. 2. The 8×8 block and its sub-block sizes are

Fig. 1 Hierarchical B-picture prediction structure in MVC



Fig. 2 Variable block sizes for motion estimation and disparity estimation in MVC

jointly called P8×8. For inter prediction, there are eleven candidate modes: Direct, Inter16×16, Inter16×8, Inter8×16, Inter8×8, Inter8×4, Inter4×8, Inter4×4, Intra16×16, Intra8×8 and Intra4×4. Note that the direct mode is a particular case of block 16×16, which directly derives the motion/disparity vector from the spatial, temporal or inter-view prediction in MVC. The direct mode fully utilizes the motion correlation between the current MB and its neighboring MBs to further improve the coding efficiency. To achieve higher coding efficiency, a Lagrangian rate-distortion optimization (RDO) function [12−13] is employed to select the mode with the minimum RD cost as the optimal mode:

J(s, c, MODE | λ_MODE) = SSD(s, c, MODE | Q) + λ_MODE · R(s, c, MODE)    (1)

where J(s, c, MODE | λ_MODE) is the RD cost of the mode MODE; s and c denote the original luma block and the reconstructed block after mode coding, respectively; λ_MODE is the Lagrangian multiplier; Q is the quantization parameter; R(s, c, MODE) is the number of bits needed to code the headers, motion/disparity vectors, residual, etc.; and SSD(s, c, MODE | Q) is the sum of squared differences between the original and reconstructed MB. To maximize the coding efficiency, JMVC adopts full mode decision with the RDO function to identify the optimal mode. In detail, full mode decision exhaustively checks all the prediction modes, computes the RD cost of each mode, and then selects the one with the minimum RD cost as the optimal mode. Since JMVC supports many prediction modes and the computation of the RD cost for each mode is time-consuming, the computational complexity of the FMD is extremely heavy. Therefore, a fast mode decision algorithm is highly desirable.
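As a sketch of how FMD operates, the exhaustive loop below evaluates every candidate mode and keeps the one minimizing Eq. (1). The function `encode_with_mode` is a hypothetical stand-in for the actual JMVC encoding path, and the function names are assumptions of this sketch, not the reference software's API.

```python
# A sketch of full mode decision (FMD): every candidate mode is evaluated and
# the one minimizing the Lagrangian RD cost of Eq. (1) is kept.
# `encode_with_mode(mb, mode)` is a hypothetical stand-in that returns the
# distortion (SSD) and the bits spent on headers, vectors and residual.

INTER_MODES = ["Direct", "Inter16x16", "Inter16x8", "Inter8x16", "Inter8x8",
               "Inter8x4", "Inter4x8", "Inter4x4",
               "Intra16x16", "Intra8x8", "Intra4x4"]

def rd_cost(ssd, rate_bits, lambda_mode):
    """Eq. (1): J = SSD + lambda_MODE * R."""
    return ssd + lambda_mode * rate_bits

def full_mode_decision(mb, lambda_mode, encode_with_mode):
    """Exhaustively check all candidate modes; return (best_mode, best_cost)."""
    best_mode, best_cost = None, float("inf")
    for mode in INTER_MODES:
        ssd, rate_bits = encode_with_mode(mb, mode)
        cost = rd_cost(ssd, rate_bits, lambda_mode)
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost
```

The fast method proposed in Section 3 prunes exactly this loop: it either terminates early on the direct mode or restricts the candidate list, instead of visiting all eleven modes.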

3 Proposed fast mode decision method for multi-view video coding

3.1 Motivation

First, it is known that larger block sizes are more

suitable for coding homogeneous regions with slow motion, while small block sizes fit complex regions with fast motion. Since the direct mode is a particular case of block 16×16, it is intuitively suited for coding homogeneous regions without motion or with slow motion. Such scenes are often encountered in natural video sequences, which means that the direct mode should be highly likely to be the optimal mode. The direct mode provides good coding performance and requires little computational complexity in single-view video coding [14−15]. Motivated by this observation, it is expected that the same might also hold for MVC. To verify this intuition, extensive experiments have been conducted to obtain the distribution of the optimal mode in the non-anchor pictures of the B-view by using full mode decision on a set of multi-view video sequences with various kinds of motion content, as listed in Table 1. The experimental conditions are as follows: each test sequence is encoded using the HBP prediction structure under G=16 and Q=32, RDO and context-adaptive binary arithmetic coding (CABAC) are enabled, and the search range of ME and DE is ±64. The distribution of the optimal mode is documented in Table 2. One can see from Table 2 that the direct mode is the dominant optimal mode for MVC, especially for homogeneous sequences with slow

Table 1 Multi-view video sequences
Sequence      Resolution  Frames  Characteristic
Flamenco1     320×240     250     Medium local motion
Race1         640×480     250     Outdoor, fast camera movement
Ballroom      640×480     250     Large disparity, rotated motion
Exit          640×480     250     Large disparity, smooth motion
Akko & Kayo   640×480     250     Medium detail and object motion
Rena          640×480     250     Medium detail and object motion
Uli           1024×768    250     Complex background

Table 2 Distribution of optimal mode in MVC (%)
Sequence      Direct  ME    DE   Intra
Flamenco1     80.8    17.2  1.8  0.2
Race1         83.2    14.2  2.3  0.3
Ballroom      77.2    20.7  1.9  0.2
Exit          86.9    12.6  0.5  0.0
Akko & Kayo   77.8    19.4  2.5  0.3
Rena          81.1    15.1  3.3  0.5
Uli           65.6    30.6  3.3  0.5
Average       78.9    18.6  2.2  0.3


motion. For example, in the sequence "Exit", 86.9% of the MBs select the direct mode as the optimal mode. After computing the RD costs of all coding modes, many MBs are finally decided as direct mode because they belong to the background or a motionless object. This observation implies that the ME and DE computation of an MB can be entirely saved if the direct mode can be pre-decided, which would greatly reduce the computational complexity. Second, for the non-anchor frames in the B-view, the MVC standard adopts inter-view prediction via DE to remove the redundancy among neighboring views, besides temporal prediction via ME to remove the temporal redundancy between pictures. Inter-view prediction improves the coding efficiency of MVC but leads to much higher computational complexity than single-view coding, owing to the exhaustive search for inter-view prediction. The runtime proportions of ME and DE in the B-view are shown in Fig. 3. It is easily observed from Fig. 3 that DE occupies almost one half of the total runtime. One can further see from Table 2 that the average probability of ME being the optimal choice is 18.6%, while that of DE is only 2.2%. In other words, after computing the RD costs of all prediction modes, temporal prediction via ME turns out to be much more likely to give a better prediction than inter-view prediction via DE. The reason is that many real-world video sequences contain substantial amounts of background and motionless objects, so the correlation between two consecutive pictures from the same view is usually much stronger than that between two pictures captured at the same time from different views. Although the computational complexity for the B-view is nearly twice that of single-view coding, the disparity vectors used for inter-view prediction are rarely selected.
Hence, if we can determine ahead of time for an MB whether the optimal prediction direction is in the temporal prediction direction or not, the computational complexity for MVC

Fig. 3 Runtime proportion for ME and DE in B-view


will be reduced by omitting the unnecessary computation of the RD costs of the prediction modes in the inter-view direction. Third, MVC exploits various prediction block sizes to more accurately capture the real motion or disparity in real-world video sequences and represent object movement or depth information, so as to reduce the residual energy. A small block size (such as the P8×8 blocks, including the 8×4, 4×8 and 4×4 sub-blocks) fits complex textured regions with fast motion, while a large block size (for instance, 16×16) is more suitable for a homogeneous region with slow motion. In other words, for the MBs in regions with homogeneous motion and texture, the optimal prediction mode sizes are usually large. The proportions of the optimal mode distribution and encoding time under FMD are shown in Fig. 4. Due to limited space, only the distributions for "Ballroom" and "Exit" at Q=32 are shown. The results show that most of the MBs select 16×16 (in fact, the direct mode is also a particular case of block 16×16) as the optimal mode size, while the proportion of MBs coded in the other mode sizes (i.e., 16×8, 8×16 and 8×8) is very low. On the other hand, the encoding time for the other mode sizes is very high, especially for P8×8. Although the P8×8 mode size is rarely selected as the optimal mode, the

Fig. 4 Proportion of optimal mode distribution and encoding time (Q=32): (a) Ballroom; (b) Exit


coding time wasted on the P8×8 mode size accounts for a very high proportion (more than 50%) of the whole encoding time. The reason is that the P8×8 mode size needs to estimate its various sub-modes separately, and the computation for each sub-mode is very heavy. This observation tells us that deciding at the beginning of the mode decision process whether the 16×16 or P8×8 prediction size should be checked for each MB has great significance for reducing the computation of mode analysis. In this work, a mode complexity parameter is extracted from the previously coded pictures and views using the spatial, temporal and inter-view correlation, and with this parameter the candidate modes can be reduced to a limited number. Finally, multi-view video contains a large amount of spatial, temporal and inter-view correlation. From the temporal point of view, slow-motion or motionless scenes are often encountered in real-world video sequences. From the spatial point of view, homogeneous areas are frequently present in the spatial domain. From the inter-view point of view, the video contents are similar among views. As a result, the coding information of the current MB, such as RD cost, prediction modes and motion vectors, can be effectively shared and reused from the adjacent MBs in the current view and its neighboring views. Thus, the design of a fast mode decision algorithm should take advantage of the above-mentioned correlations.

3.2 Proposed fast algorithm

3.2.1 Weighted mode complexity

Since a large amount of spatial, temporal and inter-view correlation exists in MVC, the coding information of the current MB is highly correlated to that


of its spatially, temporally and inter-view adjacent MBs. Hence, a weighted mode complexity (WMC) parameter is proposed to estimate the mode characteristics of the current MB based on the mode context of the MBs in previously coded frames or views. First, from the spatial and temporal points of view, there is high spatial correlation between the current MB (i.e., MB0 in Fig. 5) and its spatially adjacent MBs (i.e., the left, top and top-right MBs in Fig. 5), and strong temporal correlation with the temporally adjacent MBs in both the forward and backward pictures. The temporal-spatial-adjacent MBs of the current MB are shown in Fig. 5, where MB4 is the MB at the same position as the current MB in the forward or backward picture, and MBi (i=5, 6, …, 12) are the 8 neighbors of MB4. Accordingly, the temporal-spatial mode complexity T is computed as the average of the forward temporal-spatial mode complexity T−1 and the backward temporal-spatial mode complexity T+1:

T = (T−1 + T+1) / 2    (2)

Second, since the video contents are similar among neighboring views, the prediction mode of an MB in the current view is very similar to that of the corresponding MB in a neighboring view. Motivated by this observation, an inter-view-spatial mode complexity is proposed, as shown in Fig. 6. It should be pointed out that MB4 is the corresponding MB of the current MB in the forward or backward view, which is located by means of the global disparity vector (GDV) [16]. Since the GDV, measured in units of MB size, is not exactly the disparity between the current MB and its corresponding MB, the modes of the corresponding MB and eight of its neighboring

Fig. 5 Temporal-spatial-adjacent MBs of current MB: (a) Forward picture T−1; (b) Current picture T; (c) Backward picture T+1

Fig. 6 Inter-view-spatial-adjacent MBs of current MB: (a) Forward view V−1; (b) Current view V; (c) Backward view V+1



MBs (i.e., MBi (i=5, 6, …, 12), the 8 neighbors of MB4 shown in Fig. 6) are used to estimate the mode characteristics of the MB. Two corresponding MBs can be found, one in the forward view and one in the backward view. Therefore, similar to the temporal-spatial mode complexity T, the inter-view-spatial mode complexity V is computed as the average of the forward inter-view-spatial mode complexity V−1 and the backward inter-view-spatial mode complexity V+1:

V = (V−1 + V+1) / 2    (3)

As mentioned above, the forward temporal-spatial mode complexity T−1, the backward temporal-spatial mode complexity T+1, the forward inter-view-spatial mode complexity V−1 and the backward inter-view-spatial mode complexity V+1 can be computed as

x = (1/N) Σ_{i=1}^{N} K_i^x · ω_i^x    (4)

where x denotes the neighboring picture, including the forward temporal picture T−1, the backward temporal picture T+1, the forward inter-view picture V−1 and the backward inter-view picture V+1, i.e., x ∈ {T−1, T+1, V−1, V+1}; i denotes the index of the MB, i.e., i=1, 2, …, 12; and N is the number of MBs, equal to 12. K_i^x is the mode-weight factor of MBi in picture x, documented in Table 3 and assigned according to the complexity of each mode; generally speaking, the larger the mode-weight factor, the more complex the MB. ω_i^x is the MB-weight factor of MBi in picture x, defined in Table 4 for each MB. This weight factor is designed based on the following observation: the closer an adjacent MB is to the current MB, the larger its MB-weight factor should be. Hence, the weights ω_i^x were determined empirically from extensive experiments.

Table 3 Mode-weight factors of each mode
Mode     Direct  16×16  16×8  8×16  8×8  Intra
K_i^x    0       1      2     2     3    4

Table 4 MB-weight factors of temporal-spatial-adjacent and inter-view-spatial-adjacent MBs
i   ω_i^x   i    ω_i^x
1   2       7    0.3
2   2       8    0.96
3   0.96    9    0.96
4   2       10   0.3
5   0.3     11   0.96
6   0.96    12   0.3

In the above process, two mode complexity parameters (i.e., the temporal-spatial mode complexity T and the inter-view-spatial mode complexity V) are developed individually using the spatial, temporal and inter-view correlation. The question now is how to determine a suitable adaptive mode complexity WMC that reflects the content characteristics of the current MB. To achieve a good trade-off between computational complexity and coding efficiency, the adaptive weighted mode complexity WMC is determined as

WMC = (T + V) / 2    (5)
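To make Eqs. (2)-(5) concrete, the sketch below computes the WMC from the optimal modes of the 12 adjacent MBs in each of the four neighboring pictures, then applies the early termination and classification described in Sections 3.2.2 and 3.2.3. The weight tables follow Tables 3 and 4; the threshold values TD=0.075 (given in Section 3.2.2) and T1=1, T2=2 (implied by the all-16×16 and all-16×8/8×16 assumptions of Section 3.2.3, since the Table 4 weights sum to 12), as well as all function names and the input representation, are assumptions of this sketch.

```python
# Sketch of the weighted mode complexity (WMC) of Eqs. (2)-(5) and the mode
# classification of Eq. (6). Input: for each of the four neighboring pictures
# (T-1, T+1, V-1, V+1), a dict mapping MB index 1..12 to its optimal mode.

K_MODE = {"Direct": 0, "16x16": 1, "16x8": 2, "8x16": 2, "8x8": 3, "Intra": 4}  # Table 3
OMEGA = {1: 2, 2: 2, 3: 0.96, 4: 2, 5: 0.3, 6: 0.96,                            # Table 4
         7: 0.3, 8: 0.96, 9: 0.96, 10: 0.3, 11: 0.96, 12: 0.3}

def picture_complexity(modes):
    """Eq. (4): (1/N) * sum of K_i * omega_i over the N = 12 adjacent MBs."""
    return sum(K_MODE[modes[i]] * OMEGA[i] for i in range(1, 13)) / 12.0

def weighted_mode_complexity(fwd_t, bwd_t, fwd_v, bwd_v):
    t = (picture_complexity(fwd_t) + picture_complexity(bwd_t)) / 2  # Eq. (2)
    v = (picture_complexity(fwd_v) + picture_complexity(bwd_v)) / 2  # Eq. (3)
    return (t + v) / 2                                               # Eq. (5)

# Assumed threshold values: TD from Section 3.2.2; T1, T2 follow from the
# all-16x16 / all-16x8-8x16 assumptions because the Table 4 weights sum to 12.
T_D, T1, T2 = 0.075, 1.0, 2.0

def classify(wmc):
    """Early direct mode decision (Section 3.2.2) and Eq. (6) classification."""
    if wmc < T_D:
        return "direct"   # early termination: code with direct mode only
    if wmc < T1:
        return "simple"   # check mode size 16x16 only
    if wmc < T2:
        return "medium"   # skip P8x8
    return "complex"      # check all mode sizes
```

For example, if every one of the 48 neighboring MBs was coded as Direct, WMC = 0 and the mode decision stops at the direct mode; if all were coded as 16×16, WMC = 1, the boundary of the simple-mode region.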

3.2.2 Early direct mode decision

The direct mode provides good coding performance and requires little complexity in both mono-view video coding and MVC [17]. Once the direct mode can be pre-decided, the variable-size ME and DE computation for an MB can be entirely saved. As a result, an early direct mode decision method is developed as follows, making full use of the spatial, temporal and inter-view correlation in multi-view video. In the proposed method, the adaptive weighted mode complexity WMC of the current MB is first calculated according to Eq. (5) and then compared with a fixed threshold TD. If the WMC is smaller than the threshold, the MB is considered suitable to be coded with the direct mode, and the checking process of the remaining modes is skipped. The threshold TD was empirically set to 0.075 across the different sequences in our experiments. With this setting, when the spatially adjacent MBs (i.e., MBi (i=1, 2, 3) in Fig. 5 or Fig. 6), the corresponding MB (i.e., MB4), all of the nearest neighboring MBs (i.e., MBi (i=6, 8, 9, 11)) and more than two of the sub-nearest neighboring MBs (i.e., MBi (i=5, 7, 10, 12)) select the direct mode as their optimal mode, the current MB has an extremely high probability of choosing the direct mode as the optimal mode, and the mode decision process can be terminated early.

3.2.3 Selective variable mode size

MVC adopts variable prediction mode sizes to enhance coding efficiency. The price paid for higher coding efficiency is higher computational complexity. In JMVC, the prediction mode sizes can be classified into three types: large size (16×16), medium size (16×8 and 8×16) and small size (P8×8). Generally, large sizes are usually selected for MBs in regions with homogeneous motion, while small sizes are chosen for MBs with complex motion. In fact, for MBs with homogeneous motion and texture, only a few prediction mode sizes are needed.
To attain higher coding time savings while keeping almost the same coding efficiency, the mode sizes having limited contribution to the coding efficiency should be omitted. To this end, the thresholds T1 and T2 are set based on WMC to decide whether an MB belongs to a region of the simple



mode, the medium mode or the complex mode:

MB mode = Simple mode,   if TD ≤ WMC < T1
          Medium mode,   if T1 ≤ WMC < T2    (6)
          Complex mode,  if T2 ≤ WMC

where the threshold T1 is determined by assuming that all the MBs (i.e., MBi (i=1, 2, …, 12) in Fig. 5 or Fig. 6) are coded with mode 16×16 (namely, all the mode-weight factors in Eq. (4) are equal to 1). Similarly, T2 is determined by assuming that all the MBs are coded with mode 16×8 or 8×16 (i.e., all the mode-weight factors in Eq. (4) are equal to 2). The proposed selective mode size algorithm is as follows. For the MBs in a region with simple mode, only mode size 16×16 is checked, and the other mode sizes are skipped; for the MBs in a region with medium mode, prediction mode size 8×8 is skipped and the other modes (i.e., 16×16, 16×8 and 8×16) are tested; for the MBs in a region with complex mode, all prediction mode sizes are tested. In this way, most of the MBs select 16×16 as the optimal mode size, while mode size P8×8, which is rarely selected but occupies more than one half of the coding time, is excluded as far as possible. Hence, much of the computation for ME and DE

can be greatly reduced. In summary, the above-mentioned algorithms are depicted in the flowchart shown in Fig. 7. Note that for some special MBs, such as the boundary MBs located in the first row, the first column and the last column, the neighboring MBs are not completely available; in this case, FMD is conducted.

3.2.4 Early disparity estimation skipping

Inter-view prediction via DE enhances the coding efficiency of MVC, but brings about much higher computational complexity. The proportion of DE time is almost the same as that of ME time for the B-view. However, the inter-view prediction direction is rarely chosen as the optimal prediction direction after calculating the RD costs of all prediction modes. Therefore, it is necessary to decide early whether DE should be performed, so that the time-consuming DE checking process can be saved. Temporal prediction is generally the most efficient prediction in MVC, but it is sometimes necessary to adopt both ME and DE, rather than ME alone, to achieve better prediction results. To decide whether DE can be skipped, it is observed that motion vectors are most likely to be selected in

Fig. 7 Flowchart of proposed early direct mode decision and selective variable mode size algorithms



homogeneous regions with slow motion, while disparity vectors are employed in complex regions with fast motion. Temporal prediction cannot yield good performance in fast-motion regions, because the correlation between the current picture and its neighboring pictures decreases and the RD cost increases, owing to the larger motion vectors of the small variable mode sizes. In other words, the temporal prediction direction is more likely to be optimal for large mode sizes (simple mode), while the inter-view prediction direction is more likely to be selected for small mode sizes (complex mode). Furthermore, a DE skipping algorithm is proposed based on the observation, made in our experiments, that whether mode size 16×16 selects inter-view prediction as its optimal prediction direction is a significant reference for the other mode sizes. In other words, if mode 16×16 adopts inter-view prediction, the other mode sizes are also very likely to select the same prediction direction, so the prediction result of mode 16×16 can be used to decide whether DE is performed for the other mode sizes. Since the optimal prediction direction is acquired by comparing the RD costs of ME and DE, the RD cost of mode 16×16 in the temporal direction is compared with that in the inter-view direction to determine whether the other modes perform DE. Considering that the current MB has strong temporal correlation with the temporally adjacent MBs in the forward picture, as shown in Fig. 5, the forward temporal RD cost LT−1 of mode 16×16 can be computed as the weighted average of the RD costs of those temporally adjacent MBs in the forward picture that select mode 16×16 as the optimal mode. Similarly, the backward temporal RD cost LT+1 can be obtained. The temporal RD cost LT is then determined as the average of the forward and backward temporal RD costs:

LT = (LT−1 + LT+1) / 2    (7)

Similarly, in Fig. 6, the inter-view RD cost LV is computed as the average of the forward inter-view RD cost LV−1 and the backward inter-view RD cost LV+1:

LV = (LV−1 + LV+1) / 2    (8)

As mentioned above, LT−1, LT+1, LV−1 and LV+1 can be computed as

L_x = [ Σ_{i=4}^{12} δ_i^x · ω_i^x · RDcost(MODE16)_i^x ] / [ Σ_{i=4}^{12} δ_i^x · ω_i^x ]    (9)
where x has the same meaning as in Eq. (4), i.e., x ∈ {T−1, T+1, V−1, V+1}; i is the index of the MB, here i=4, 5, 6, …, 12; and RDcost(MODE16)_i^x and ω_i^x denote the RD cost of mode 16×16 and the MB-weight factor of the neighboring MBi located in picture x, as shown in Fig. 5 and Fig. 6, respectively. The weights ω_i^x are documented in Table 4 and take the same values as in Eq. (4). δ_i^x is an indicator factor, defined as

δ_i^x = 1, if the optimal mode of MBi is mode 16×16
        0, otherwise    (10)

δ_i^x denotes whether MBi in picture x selects mode 16×16 as the optimal mode; in other words, only the RD cost of an MBi in picture x that chooses mode 16×16 as its optimal mode has reference value for the current MB. The proposed early disparity estimation skipping algorithm is described as follows. For MBs in a region with complex mode, disparity estimation is always performed; for MBs in a region with simple or medium mode, disparity estimation is performed only if LV is smaller than LT, and is otherwise skipped.
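The early DE-skipping test of Eqs. (7)-(10) can be sketched as follows: for each neighboring picture, the RD cost of mode 16×16 is predicted as a weighted average over the MBs that actually chose mode 16×16, and DE is skipped for simple/medium-mode MBs when the predicted inter-view cost LV is not smaller than the temporal cost LT. The input representation, function names, and the exact tie-breaking of the skip condition are assumptions of this sketch.

```python
# Sketch of early DE skipping via Eqs. (7)-(10). `neighbors` maps MB index
# 4..12 to (optimal_mode, rd_cost_of_mode_16x16) for one neighboring picture.

OMEGA = {4: 2, 5: 0.3, 6: 0.96, 7: 0.3, 8: 0.96, 9: 0.96,   # Table 4, i = 4..12
         10: 0.3, 11: 0.96, 12: 0.3}

def predicted_cost(neighbors):
    """Eq. (9): weighted average RD cost over MBs whose optimal mode is 16x16
    (the delta indicator of Eq. (10) selects those MBs)."""
    num = sum(OMEGA[i] * cost for i, (mode, cost) in neighbors.items()
              if mode == "16x16")
    den = sum(OMEGA[i] for i, (mode, _) in neighbors.items()
              if mode == "16x16")
    return num / den if den > 0 else float("inf")  # no reference available

def skip_disparity_estimation(fwd_t, bwd_t, fwd_v, bwd_v, mode_class):
    """Return True when DE can be skipped for this MB."""
    if mode_class == "complex":
        return False                                         # DE always performed
    l_t = (predicted_cost(fwd_t) + predicted_cost(bwd_t)) / 2  # Eq. (7)
    l_v = (predicted_cost(fwd_v) + predicted_cost(bwd_v)) / 2  # Eq. (8)
    return l_v >= l_t   # temporal direction predicted optimal: skip DE
```

In this sketch a simple- or medium-mode MB keeps DE only when the predicted inter-view cost undercuts the temporal one, matching the observation that disparity vectors are rarely selected in such regions.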
