Video Compression System for Mobile Devices Edmund S Jackson, Roger Peplow
Abstract: This paper presents a video compression system expressly designed for mobile devices. Bandwidth is no longer the factor preventing the deployment of mobile, video-capable devices; the limiting factor is now computational resources. Cognisant of the low computational ability of mobile devices, this paper proposes a system of modest complexity.
Index Terms: Mobile Devices, Video Compression, Low Complexity Video
Manuscript received June 9, 2003. E. S. Jackson is a postgraduate student at the University of Natal, Durban (phone: 0312602731; fax: 0312602727; email: [email protected]). R. Peplow is an Associate Professor in the School of Electrical, Electronic and Computer Engineering, at the University of Natal, Durban. This work is sponsored by Thales Advanced Engineering, Armscor and The National Research Fund of South Africa.

I. INTRODUCTION

A. Mobile Video Requirements
2.5G and 3G mobile cellular networks continue to increase the available channel bandwidth with the aim of supporting the deployment of multimedia services. Standard 2G GSM networks allow data communication at up to 9.6 kbps. Video cannot be transmitted at reasonable resolution and frame rate over such a bandwidth-limited channel using existing schemes. GPRS services have increased the available bandwidth to 115 kbps, which is amply sufficient for transmission of video at a resolution suitable for display on mobile devices (QCIF format). 3G networks support wide area data rates of 384 kbps. In addition, wireless LAN protocols such as Bluetooth and IEEE 802.11b provide ample bandwidth to devices such as PDAs for the transmission of video. However, despite the bandwidth sufficiency of modern cellular and wireless LAN channels, devices supporting video transmission are not yet available.
This is mainly due to the extreme computational burden of video compression. Standard video compression systems such as H.263, H.263+, MPEG2 and MPEG4 all rely on complex motion estimation schemes to achieve compression. The computing power required to execute these algorithms is evidenced by the fact that personal computers have only recently become able to compress video in real time. The computational resources available to typical mobile devices are significantly less abundant than those of a PC. This technical barrier has prevented the deployment of video-capable devices in cellular and other mobile networks.

B. Premise of Proposed Algorithm
A complexity study of MPEG2, MPEG4 and H.263+ has revealed that the major computational exercise concerns motion estimation and compensation (ME/MC). For this reason ME/MC is entirely abandoned in favour of frame differencing. A statistical study of these difference frames reveals that significant coefficients are spatially clustered. Further considerations show that a source partitioning approach to difference frame coding could yield improved Rate-Distortion (RD) behaviour.
In order to realize this improvement, a recently proposed RD estimation scheme [1], [2] is combined with a numerical search [3] to provide a globally optimal bit allocation structure [4]. All of these elements are combined into a system that partitions difference frames into tiles, performs RD estimation on each tile, allocates the available rate between the tiles, codes each separately using SPIHT [5], and then entropy encodes the output stream [6].

C. Outline of Paper
Section II provides background on the theory of operation and capabilities of the current standard video compression systems. The complexity of these schemes is also explored. Section III details the enabling observation for the new algorithm, which is the statistical behaviour of coefficients in video residual frames. Section IV presents the compression algorithm constructed from these considerations. Sections V and VI then compare the computational burden and compression performance of the proposed scheme to the standard schemes. Finally, conclusions are presented in Section VII.

II. STANDARD VIDEO COMPRESSION SCHEMES
A. Premise of Standardised Methods
The previous generation of image compression systems, characterized by JPEG, depends on the discrete cosine transform (DCT) for its fundamental operation. As a result, the current generation of video coders, MPEG2 [6], MPEG4 [7][8] and H.263+ [9], all rely on the DCT. A detailed description of these algorithms is provided in [10]. Broadly speaking, however, the strategy is to perform temporal decorrelation, followed by spatial decorrelation. Temporal decorrelation refers to any technique that decreases the correlation between successive frames in a video sequence. Spatial decorrelation is the coding of spatial correlation structures in the image. This spatial coding is performed by algorithms very similar to JPEG: segmenting the image into 8x8 pixel blocks, applying the DCT individually to each block, and quantizing the results. The temporal decorrelation integrates with this framework by means of block-based ME/MC. By comparing successive frames in terms of the 8x8 DCT blocks, an estimate of the motion may be made. Thus only the motion vectors of the blocks, and the difference between successive blocks, need be transmitted, reducing the explicit information greatly.

B. Complexity of Standardised Methods
In order to determine the feasibility of implementing these algorithms on a mobile platform, an empirical complexity study was performed, the results of which are given in Section V. These results show that for a general class of video sequences, at any bit rate, the ME/MC process consumes at least 80% of the computational time in modern algorithms. This is a result of the block searching and matching process described above. MPEG2 does not conform to this pattern; however, as this algorithm is much older and, as Section VI shows, produces inferior RD performance, it is not considered further.

III. DIFFERENCE FRAMES
In order to produce a low complexity system, it was decided to abandon the ME/MC paradigm. The conceptually and computationally simplest temporal decorrelation method is to code the difference between successive frames. The resulting frame is referred to as the difference, residual, or error frame. These frames display several useful statistical properties that promote succinct description.

A. Clustering of Significant Coefficients
An original frame and its difference residual frame are shown below, from the standard 'Hallmonitor' test sequence.
Figure 1: Residual Frame #103 from 'Hallmonitor'

This figure clearly indicates that the significant coefficients in the image are spatially clustered in the residual frame. The two men are the only moving objects in the scene. Consider one of the men. The pixels representing him are given by the set:

$$\chi(x_i, y_j) = \{\, p_{x,y} \mid x, y \in \mathrm{object}_i \,\}$$

which represents the projection of the 3D object in real space onto the 2D imaging plane. As the man is localized in real space, his projection is localized in the imaging plane. Thus, the only pixels which may change as a result of his motion are localized within the same area. Under the widely adopted assumption that significant changes between frames are usually a result of object motion, which has been shown to produce localized significant coefficients, it is a fair assumption that difference frames will usually exhibit localized significance.

B. Set Partitioning
Shannon's source coding theorem [11] states that a Gaussian source of variance $\sigma^2$ may be represented to within a fidelity $D$ (a mean square error metric) with $R$ bits per pixel, according to the following relationship:

$$R(D) = \max\left\{0,\ \frac{1}{2}\log_2\frac{\sigma^2}{D}\right\},$$

or conversely:

$$D = \sigma^2 \cdot 2^{-2R}.$$

Consider a source C that is partitioned equally into two subsources A and B, and coded such that the same number of bits is used to describe source C as to describe subsources A and B together. If sources A and B are coded with a total rate of $R$ bpp, divided between them such that A receives $nR$ and B receives $(1-n)R$ bpp, then the total number of bits required is

$$nR \cdot \frac{N}{2} + (1-n)R \cdot \frac{N}{2} = \frac{R \cdot N}{2} \text{ bits}, \quad n \in [0,1],$$

with $N$ being the number of pixels in source C. In order for source C to be coded with the same number of bits, a rate of

$$\frac{R \cdot N}{2} \cdot \frac{1}{N} = \frac{R}{2} \text{ bpp}$$

must be used. The difference in distortion between the two representations is given by:

$$\Delta D = \sigma_C^2 \cdot 2^{-2\left(\frac{R}{2}\right)} - \frac{1}{2}\left(\sigma_A^2 \cdot 2^{-2nR} + \sigma_B^2 \cdot 2^{-2(1-n)R}\right).$$

By taking the derivative of this quantity with respect to $n$, and equating to zero, the bit partition yielding the maximum distortion difference is found to be:

$$n = \frac{1}{2}\left[1 + \frac{1}{R}\log_2\frac{\sigma_A}{\sigma_B}\right].$$

At this partition each of the two subsource terms equals $\sigma_A \sigma_B \cdot 2^{-R}$. Substituting this back into the original equation and solving yields a maximum distortion difference of

$$\Delta D_{max} = 2^{-R}\left(\sigma_C^2 - \sigma_A \sigma_B\right).$$

This indicates that source partitioning with optimal bit allocation will yield a distortion decrease in the case of a variance concentration in either subsource A or B, relative to source C, which will be the case when significant coefficients are spatially clustered. Thus source partitioning presents RD advantages for difference residual frames.

C. Multiple Image Source Partitioning
Figure 2: 'Akiyo' Difference Frame #100

Figure 2 above shows the 100th difference residual frame from the 'Akiyo' standard sequence, partitioned into 64 tiles. In order to demonstrate that the clustering of significant coefficients discussed above translates to a variance concentration, two neighbouring tiles from Figure 2 and their variances are shown below.

Source      Mean     Variance
Source C    22.46    2215.82
Source A    38.38    2609.58
Source B    11.18    1243.32

Table III.1: 'Akiyo' Image Tile Variances

Adopting the symbols from above:

$$\sigma_C^2 = 2215.82, \qquad \sigma_A \sigma_B = 1801.26.$$

Hence, a gain of

$$\Delta D_{max} = 2^{-R}\left(\sigma_C^2 - \sigma_A \sigma_B\right) = 2^{-R}(414.56)$$

may be expected. This is not entirely accurate, as the actual images shown above do not necessarily conform to Shannon's Gaussian assumptions. However, these measurements do provide the expectation of a coding gain, should proper rate-distortion estimation be applied to the tiles of the partitioned image.

IV. PROPOSED VIDEO COMPRESSION SCHEME
A video compression scheme, shown in Figure 3, has been built on these principles.

Figure 3: System Block Diagram (blocks: Input, Frame(n), Frame(n-1), Image Tiling, RD Estimation, Wavelet Transform, Bit Allocation, SPIHT, Arithmetic Coder)

This section will briefly describe the component parts. Temporal decorrelation is achieved through difference frames. This is to gain computational advantage over traditional video coders. Thereafter the image is segmented into tiles, in order to capture the clustering of significant coefficients and so gain the coding advantage outlined in III.B. A rate-distortion (RD) estimation of each tile is then performed using the ρ-domain method [1][2]. The argument for source partitioning presented in III.B was based on the assumption of a Gaussian source. Although this assumption is commonplace, it is not necessarily accurate, particularly for small sources such as image tiles. Thus, a true RD estimation must be performed to determine the actual RD properties of each tile.
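The differencing and tiling steps described in this section can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, the frames are synthetic QCIF-sized greyscale arrays (144 x 176), and the 64 tiles are assumed to form a regular 8 x 8 grid, which the paper does not state explicitly.

```python
import numpy as np

def difference_frame(current, previous):
    """Residual between two successive greyscale frames, kept as signed integers."""
    return current.astype(np.int16) - previous.astype(np.int16)

def split_into_tiles(frame, rows=8, cols=8):
    """Split a 2-D frame into rows*cols equal tiles, in row-major tile order."""
    h, w = frame.shape
    if h % rows or w % cols:
        raise ValueError("frame dimensions must be divisible by the tile grid")
    th, tw = h // rows, w // cols
    # Reshape to (rows, th, cols, tw), then bring the two grid axes to the front.
    return frame.reshape(rows, th, cols, tw).swapaxes(1, 2).reshape(rows * cols, th, tw)

def tile_variances(tiles):
    """Variance of the residual values inside each tile."""
    return tiles.reshape(tiles.shape[0], -1).var(axis=1)

# Synthetic example: only one 18 x 22 region changes between the two frames.
rng = np.random.default_rng(0)
previous = rng.integers(0, 256, size=(144, 176), dtype=np.uint8)
current = previous.copy()
current[36:54, 44:66] = rng.integers(0, 256, size=(18, 22), dtype=np.uint8)

residual = difference_frame(current, previous)
tiles = split_into_tiles(residual)
variances = tile_variances(tiles)
print("tile with the largest variance:", int(np.argmax(variances)))
```

In this synthetic case all of the residual energy falls in a single tile, which is the variance concentration that the tiling stage is intended to expose to the bit allocation stage.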
Based on the RD estimate of each tile, an optimal bit allocation between the tiles is found, using a combination of the Lagrange multiplier technique and a bisection search. This technique is similar to that proposed in [3]. Each tile is then wavelet transformed and compressed using the SPIHT algorithm, to the bit rate already specified by the bit allocation algorithm. It was decided to use a wavelet-based, rather than DCT-based, scheme, as wavelet schemes offer many advantages. In particular, SPIHT offers an embedded output bitstream, which allows the precise bit rate called for by the RD estimation and optimal bit allocation stage to be accurately met. The SPIHT output streams of the tiles are then concatenated and entropy encoded using an arithmetic encoder.

A. RD Estimation and Bit Allocation
The crucial element in the system is the RD estimation / bit allocation stage. The RD estimation is achieved by using a slightly modified implementation of He and Mitra's ρ-domain algorithm [1]. This algorithm combines the most accurate RD estimation known to the authors with extremely low complexity. Thereafter a numerical, bisection-based bit allocation algorithm is performed to optimally distribute the bit rate between the tiles. Our work with these algorithms has been detailed in [4].

B. Spatial Coding
The spatial coding stage consists of the wavelet transform, SPIHT, and arithmetic coder. All of these components are standard and well described in the literature.

C. Complexity
The algorithm chosen for each block in Figure 3 has been carefully selected from the possibilities presented in the literature, with computational performance of paramount concern.
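The bit allocation stage of Section IV.A can be illustrated with a small sketch of a Lagrange multiplier search driven by bisection. It is not the authors' implementation: the paper uses ρ-domain RD estimates [1][2], whereas this sketch substitutes the Gaussian model $D = \sigma^2 \cdot 2^{-2R}$ from Section III.B so that the result can be checked against the closed-form partition derived there. The function names and the total rate of 2 bpp are hypothetical, and all tiles are assumed to have the same number of pixels.

```python
import math

def rates_for_slope(variances, lam):
    """Per-tile rates (bpp) at which each active tile has RD slope -lam.

    Under the assumed Gaussian model D = var * 2**(-2R), the slope condition
    dD/dR = -lam gives R = 0.5 * log2(2 * ln(2) * var / lam), clipped at zero.
    """
    return [max(0.0, 0.5 * math.log2(2.0 * math.log(2.0) * v / lam)) for v in variances]

def allocate_bits(variances, total_rate, iterations=100):
    """Bisection on the Lagrange multiplier until the tile rates sum to total_rate."""
    lo = 1e-12                                  # very small slope: very high rates
    hi = 2.0 * math.log(2.0) * max(variances)   # slope at which every rate is zero
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)                # geometric midpoint (rates are linear in log lam)
        if sum(rates_for_slope(variances, mid)) > total_rate:
            lo = mid                            # budget exceeded: increase the slope
        else:
            hi = mid
    return rates_for_slope(variances, hi)

# Source A and Source B variances from Table III.1; R is a hypothetical budget.
var_a, var_b = 2609.58, 1243.32
R = 2.0
rate_a, rate_b = allocate_bits([var_a, var_b], R)

# Closed-form partition from Section III.B for comparison.
n_closed_form = 0.5 * (1.0 + math.log2(math.sqrt(var_a / var_b)) / R)
average_distortion = 0.5 * (var_a * 2 ** (-2 * rate_a) + var_b * 2 ** (-2 * rate_b))
print("share of rate given to A:", rate_a / R, "closed form:", n_closed_form)
```

For two tiles the search reproduces the partition $n$ of Section III.B, and the resulting average distortion equals $\sigma_A \sigma_B \cdot 2^{-R}$. The same routine extends to any number of tiles, and tiles whose variance is too small to justify any bits are assigned a rate of zero.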
V. COMPUTATIONAL RESULTS
The primary purpose of this project is to produce a low complexity video system. The reference software codecs of the standardized techniques H.263+ [12], MPEG2 [13] and MPEG4 [14] have been obtained and executed. Using the Microsoft Profiler (V6), a computational profile of each has been produced. The tables below display the results obtained for two common test sequences, 'Akiyo' and 'Hallmonitor', at various bit rates. For each algorithm, the first column displays the total execution time on a single test computer (Intel Pentium 4, 1.6 GHz, 512 MB RAM, running Windows XP Pro. 2002 SP1), and the second column displays the percentage of this time consumed in motion estimation and compensation routines.
Rate (kbps)   H.263+ Time (ms)   H.263+ ME/MC (%)   MPEG4 Time (ms)   MPEG4 ME/MC (%)
15            26137              85.33              X                 X
20            38667              83.47              X                 X
25            38814              82.94              X                 X
40            39128              86.08              33615             89.30
50            39138              84.70              X                 X

Table V.1: Execution Time Data for H.263+ and MPEG4, 'Akiyo' Sequence

Rate (kbps)   H.263+ Time (ms)   H.263+ ME/MC (%)   MPEG4 Time (ms)   MPEG4 ME/MC (%)
15            15596              64.12              X                 X
20            38450              82.95              X                 X
25            39543              82.82              X                 X
40            39323              85.02              39010             82.27
50            42120              79.79              X                 X

Table V.2: Execution Time Data for H.263+ and MPEG4, 'Hallmonitor' Sequence

The publicly available reference software for MPEG4 is unable to produce data except at 40 kbps, hence the missing data in the tables above.

Rate (kbps)   MPEG2 Time (ms)   MPEG2 ME/MC (%)
40            25112             40.33
60            25574             40.11
100           25239             40.33

Table V.3: Execution Time Data for MPEG2, 'Akiyo' Sequence

MPEG2 is designed for video broadcast systems and is thus unable to produce the low bit rate video under consideration, due to frame rate constraints. The performance of our algorithm is tabulated below:

HALLMONITOR
Rate (kbps)   Time (ms)   SPIHT (%)
15            2701        34.23
20            2725        38.99
25            2865        40.59
40            5995        68.75
50            6329        67.23

AKIYO
Time (ms): 2792, 2889, 2968, 6209
SPIHT (%): 39.80, 42.33, 42.74, 67.67

Table V.4: Execution Time Data for Proposed Scheme, 'Akiyo' and 'Hallmonitor'

As indicated, at the low bit rates of interest, the proposed algorithm is an order of magnitude faster than the standardized techniques. As the rate increases, so does the computational complexity. This is a result of the SPIHT compression routine: the multiple tree searching routines in SPIHT become burdensome at high rates.

VI. RATE DISTORTION PERFORMANCE RESULTS
The visual quality of the schemes is compared next, under the same experimental conditions.

AVERAGE PSNR (dB)
Rate (kbps)   H.263+   MPEG4
15            30.54    X
20            31.38    X
25            32.15    X
40            34.41    34.43
50            35.17    X

Prop.: 26.63, 32.60, 36.18, 38.01

Table VI.1: RD Performance Comparison, 'Akiyo'
AVERAGE PSNR (dB)
Rate (kbps)   H.263+   Prop.   MPEG4
15            29.32    24.37   X
20            29.68    28.86   X
25            29.95    29.86   X
40            31.53    33.84   33.23
50            32.60    34.90   X

Table VI.2: RD Performance Comparison, 'Hallmonitor'

As the tables above show, the proposed algorithm produces a stream of similar visual quality to the standardized techniques for these sequences. This is due to the use of the wavelet transform for the spatial compression stage in our scheme, whereas the standard techniques rely on the older discrete cosine transform (DCT) based technologies. The advantage held by wavelet techniques over DCT techniques is well established in the still image compression literature, and is reflected in the latest JPEG2000 still image compression standard, which utilizes a wavelet method rather than the DCT method used previously.
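The paper does not spell out how the average PSNR figures were computed. A minimal sketch of one common convention is given below, assuming 8-bit frames (peak value 255) and an average taken over the per-frame PSNR values; both choices, and the function names, are assumptions rather than details taken from the paper.

```python
import numpy as np

def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two frames of equal size."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")          # identical frames
    return float(10.0 * np.log10(peak ** 2 / mse))

def average_psnr(reference_frames, reconstructed_frames):
    """Mean of the per-frame PSNR values over a sequence (assumed convention)."""
    values = [psnr(a, b) for a, b in zip(reference_frames, reconstructed_frames)]
    return float(np.mean(values))

# Tiny synthetic check: a constant error of 5 grey levels on every pixel.
original = np.zeros((144, 176), dtype=np.uint8)
decoded = np.full((144, 176), 5, dtype=np.uint8)
print("PSNR (dB):", psnr(original, decoded))
```

Averaging per-frame PSNR gives a different number from computing PSNR on the sequence-wide mean squared error, so the convention matters when comparing figures across codecs.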
The average PSNR obscures many characteristics of the stream. The following curves display the PSNR of each frame in the 'Akiyo' sequence at 40 kbps, for our algorithm and MPEG4 respectively.

Figure 4: Frame by Frame PSNR Comparison (PSNR in dB, 30 to 42, against frame number, 1 to 99; curves: Proposed, MPEG4; annotation: #60)

A clear disadvantage of our algorithm is the lack of interframe distortion control. The algorithm outputs a nearly constant bit rate stream, but distortion control is not performed. As an RD estimate of each frame has been calculated, this should be a trivial matter to rectify. This is an avenue of future research.

The following two frames are taken from the 'Akiyo' sequence, compressed to 40 kbps by our algorithm and MPEG4, and are included to give visual verification of our numerical results.

Figure 5: Akiyo Frame #100 Compressed with MPEG4 (left) and Proposed (right)

VII. CONCLUSIONS
A need exists to produce mobile devices capable of producing and communicating video data. The first requirement, wireless channels of sufficient bandwidth, has recently been met. However, the computational burden imposed by standard video compression techniques precludes their implementation on computationally constrained platforms, such as a cellphone or PDA. All of the standard techniques are constructed around a paradigm of block-based DCT for spatial compression, and motion estimation and compensation for temporal compression. Motion estimation and compensation requires multiple block matching searches, and has been found to represent the bulk of the computational burden of such schemes.
We have abandoned this standard construction with the aim of gaining computational advantage. Our proposed scheme uses frame differencing in preference to ME/MC. Advantage is taken of the properties of these difference residual frames, through source partitioning, RD estimation and optimal bit allocation. This process is significantly less time consuming than ME/MC. Thereafter wavelet-based still image compression techniques are applied to compress each residual frame. Wavelet transform based compression is chosen for its proven RD performance, which is again confirmed by the fine visual quality of the compressed video.
The proposed scheme displays excellent RD performance, as a result of the use of RD-optimal wavelet spatial compression, as well as greatly reduced computational times, as a result of using frame differencing rather than motion estimation and compensation. This algorithm has strong potential for implementation on a mobile wireless device.

ACKNOWLEDGEMENTS
Many thanks are due to Thales Advanced Engineering and Armscor, for their continued financial and technical support of this project.

REFERENCES
[1] Z. He and S.K. Mitra, "A Unified Rate-Distortion Analysis Framework for Transform Coding," IEEE Trans. Circ. Syst. for Video Tech., Dec 2001.
[2] Z. He and S.K. Mitra, "ρ-Domain Rate-Distortion Analysis and Rate Control for Visual Coding and Communication," Ph.D. Thesis, University of California at Santa Barbara, April 2001.
[3] K. Ramchandran and M. Vetterli, "Best Wavelet Packet Bases in a Rate-Distortion Sense," IEEE Trans. Image Proc., Vol. 2, No. 2, April 1993.
[4] E. Jackson, R. Peplow, "Fast Rate-Distortion Estimation and Optimization for Wavelet Video Coding," 3rd International Symposium on Image and Signal Processing and Analysis (IEEE), Rome, 2003.
[5] A. Said and W. A. Pearlman, "A New, fast and efficient image codec based on set partitioning in hierarchical trees," IEEE Trans. Circuits Syst. Video Technol., vol. 6, pp. 243-250, June 1996.
[6] "Generic Coding of Moving Pictures and Associated Audio," ISO/IEC IS 13818.
[7] "Information technology - Coding of audio-visual objects - Part 1: Systems," ISO/IEC IS 14496-1:2001.
[8] "Information technology - Coding of audio-visual objects - Part 2: Visual," ISO/IEC IS 14496-2:2001.
[9] ITU-T/SG15, "Video Codec Test Model, TMN8," ITU, Portland, June 1997.
[10] W. Effelsberg, R. Steinmetz, "Video Compression Techniques," dpunkt-Verlag für digitale Technologie, 1998.
[11] T. Berger, J. Gibson, "Lossy Source Coding," IEEE Trans. on Info. Theory, Vol. 44, No. 6, Oct 1998.
[12] M. Gallant, G. Cote, B. Errol, "H.263+ TMN2.0 Reference Software," University of British Columbia, Canada, http://www.ee.ubc.ac.ca/image, 1997.
[13] "MPEG2 Encoder / Decoder, Version 1.2," MPEG Software Simulation Group, http://www.mpeg.org/MSSG/, July 19, 1996.
[14] "(MPEG4) Video Reference Software Version: MicrosoftFPDAM11.0000703 Version 2," ISO/IEC IS 144962 PDAM 1.0000703, July 2000, http://www.iso.ch/iso/en/ittf/PubliclyAvailableStandards/144965_Compressed_directories/Visual/

Edmund S. Jackson was born in Harare, Zimbabwe in 1979. He obtained his BScEng (Electronic) Summa Cum Laude from the University of Natal, Durban, South Africa in 2001. He is pursuing his MScEng at Natal University in the area of signal processing for low bit rate video. He has published one international conference paper to date. Mr Jackson is a student member of the IEEE and SAIEE.

Roger Peplow is an Associate Professor in the School of Electrical, Electronic and Computer Engineering at the University of Natal, Durban.