A Novel Parallel JPEG Compression System Based on FPGA

Journal of Computational Information Systems 7:3 (2011) 697-706 Available at http://www.Jofcis.com

Xi CHEN 1,2, Lin ZENG 2, Qinglin ZHANG 2,†, Wenxuan SHI 2

1 Institutes of Microelectronics and Information Technology, Wuhan University, Wuhan 430079, China
2 School of Electronic and Information, Wuhan University, Wuhan 430079, China

Abstract

A novel parallel JPEG compression system based on FPGA is developed. To enhance the efficiency of the system, 8 lines of image data are buffered in the on-chip memory of the FPGA, which not only reduces the complexity of the hardware system but also decreases the output delay of JPEG compression. Furthermore, each image of a video is divided into four parts, and these four parts are compressed in parallel following the base principle of the JPEG compression algorithm. Actual tests show that the proposed parallel JPEG compression system based on FPGA can process images at a resolution of 1280*1024 pixels at 500 frames/s in real time and efficiently.

Keywords: JPEG Compression; Parallel; Real-time; FPGA

1. Introduction

Machine vision is a continuously growing field of research dealing with the processing and analysis of image data, and it plays a key role in the development of intelligent systems. In machine vision systems, the detected objects are transformed into video image signals by CMOS or CCD cameras. With the development of sensor and electronics technology, more and more engineers prefer high-resolution, high-speed cameras to obtain real-time video, such as the CMOS high-speed camera EoSens MC1362, which offers a resolution of 1280*1024 pixels at up to 500 full frames/s. How to store and transmit such high-throughput data in real time therefore becomes a new challenge.

In machine vision systems, static image compression is a research hotspot; engineers usually employ the JPEG or JPEG2000 method to compress static images. The major difference between JPEG and JPEG2000 is that JPEG adopts the Discrete Cosine Transform (DCT), while JPEG2000 adopts Embedded Zerotree Wavelet encoding. In 2004, Ebrahimi [1] reported that JPEG had a slight quality edge at low compression ratios (below 20:1), while JPEG2000 was the clear winner at medium and high compression ratios. Yang [2] also pointed out that at low compression ratios JPEG is the better choice compared with JPEG2000. Furthermore, JPEG compression, which employs the DCT and Huffman coding, produces little intermediate data and needs little memory, so it is more convenient to realize than JPEG2000 compression. In this paper, ways to realize JPEG compression with higher speed and less memory space

Corresponding author. Email addresses: [email protected] (Qinglin ZHANG).

1553-9105/ Copyright © 2011 Binary Information Press March, 2011


based on FPGA are discussed. The paper is organized as follows: in section 2, the base principle of JPEG compression is introduced. In section 3, a method to buffer the incoming video data 8 lines at a time in ping-pong memories is proposed to improve the efficiency of compression with less memory, and a novel method to compress video data in parallel is put forward, by which higher-throughput data can be handled. In section 4, results comparing the base JPEG compression and the improved parallel JPEG compression are presented. Finally, section 5 concludes the paper.

2. Base Principle of JPEG Compression

The JPEG standard is based on the Discrete Cosine Transform (DCT). It gives a lot of flexibility in obtaining a desired compression ratio (CR). As presented in figure 1, the base principle of JPEG compression for color images comprises four main operations: color space conversion and downsampling, DCT-2D, quantization, and Zig-Zag scanning with entropy coding [3] [4].

[Figure 1 block diagram: RGB (R, G, B) input → color space conversion to YCbCr (lossless) → downsampling (lossy) → DCT-2D (lossless) → quantization using a quantization table (lossy) → Zig-Zag scanning → entropy coding using a coding table (lossless) → JPEG compressed data.]

Fig. 1 Base Principle Architecture of JPEG Compression

2.1. Color Space Conversion and Downsampling

The three-dimensional RGB space is commonly used to represent color, while the three-dimensional YCbCr space is adopted in JPEG compression. So when the base principle of JPEG compression is applied to a color static image, it is necessary to convert the RGB color space into the YCbCr color space, which is given by [1]:

\[
\begin{pmatrix} Y \\ C_b \\ C_r \end{pmatrix} =
\begin{pmatrix}
 0.29900 &  0.58700 &  0.11400 \\
-0.16874 & -0.33126 &  0.50000 \\
 0.50000 & -0.41869 & -0.08131
\end{pmatrix}
\begin{pmatrix} R \\ G \\ B \end{pmatrix} +
\begin{pmatrix} 0 \\ 128 \\ 128 \end{pmatrix}
\tag{1}
\]
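Equation (1), together with the 50% chroma reduction described below, can be sketched as a short reference model. Python is used here purely for illustration; the paper's implementation is fixed-point Verilog hardware, and the function names and the horizontal-only decimation pattern are assumptions, since the exact subsampling layout is not specified.

```python
def rgb_to_ycbcr(r, g, b):
    """Per-pixel RGB -> YCbCr conversion following equation (1)."""
    y  =  0.29900 * r + 0.58700 * g + 0.11400 * b
    cb = -0.16874 * r - 0.33126 * g + 0.50000 * b + 128
    cr =  0.50000 * r - 0.41869 * g - 0.08131 * b + 128
    return y, cb, cr

def downsample_chroma(row):
    """Keep every other chroma sample in a line (a 50% reduction of Cb/Cr;
    the exact subsampling pattern is assumed for illustration)."""
    return row[::2]
```

For pure white (255, 255, 255) this yields Y ≈ 255 with both chroma components at the neutral value 128, matching the +128 offsets in equation (1).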

Considering that the human visual system is less sensitive to the chrominance information (Cb, Cr) than to the luminance information (Y), the data of Cb and Cr are downsampled by 50% to increase the compression ratio.

2.2. DCT-2D

The Discrete Cosine Transform (DCT) represents the image as a sum of sinusoids of varying magnitudes and frequencies. The DCT calculation is fairly complex; in fact, it is the most costly step in JPEG compression.

X. Chen et al. /Journal of Computational Information Systems 7:3 (2011) 697-706

699

The DCT is used to produce uncorrelated coefficients, allowing effective compression because each coefficient can be treated independently without affecting compression efficiency. The human visual system is strongly dependent on the spatial frequencies within an image; in fact, it is more sensitive to the lower frequencies than to the higher ones. Thus we can discard information that is not perceptible to the human visual system and keep the information that is important to it. The DCT-2D is computed as follows: first, the image data is divided into non-overlapping 8*8 blocks; second, every 8*8 block is transformed by the two-dimensional Discrete Cosine Transform, which is given by

\[
F(u,v) = \frac{1}{4} C(u) C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y) \cos\!\left[\frac{\pi(2x+1)u}{16}\right] \cos\!\left[\frac{\pi(2y+1)v}{16}\right] \tag{2}
\]

\[
C(w) = \begin{cases} \dfrac{1}{\sqrt{2}} & w = 0 \\ 1 & \text{otherwise} \end{cases}, \qquad w \in \{u, v\} \tag{3}
\]
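A direct evaluation of equation (2) can serve as a reference model for checking a hardware datapath. The sketch below (Python for illustration; this is not the paper's pipelined FPGA implementation) computes the 2-D DCT of one 8*8 block:

```python
import math

def dct2d_8x8(block):
    """Naive 2-D DCT of an 8x8 block per equation (2); block is a list of 8 rows."""
    def c(w):
        # Normalization factor of equation (3)
        return 1 / math.sqrt(2) if w == 0 else 1.0
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = 0.0
            for x in range(8):
                for y in range(8):
                    s += (block[x][y]
                          * math.cos(math.pi * (2 * x + 1) * u / 16)
                          * math.cos(math.pi * (2 * y + 1) * v / 16))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out
```

For a constant block f(x, y) = k, all the energy lands in the DC term, F(0, 0) = 8k, and every other coefficient is zero, which illustrates the decorrelation property described above.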

The result of this equation is an 8*8 matrix representing the frequency domain of the pixel values of the original 8*8 block. Most of the image information is retained in only a portion of the matrix.

2.3. Quantization

Quantization is used to allow a better compression ratio; it is the operation that introduces information loss into the JPEG compression process. The goal of the quantization step is to generate a sparse matrix, which allows a large compression rate in the entropy coding operation. Quantization is defined as division of each DCT coefficient by the corresponding quantization value S(u,v), followed by rounding to the nearest integer [4]:

\[
F'(u,v) = \mathrm{Round}\!\left(\frac{F(u,v)}{S(u,v)}\right) \tag{4}
\]

The matrix of F'(u,v) is represented as follows:

\[
\{F'(u,v)\} =
\begin{bmatrix}
F'(0,0) & F'(0,1) & \cdots & F'(0,7) \\
F'(1,0) & F'(1,1) & \cdots & F'(1,7) \\
\vdots  & \vdots  & \ddots & \vdots  \\
F'(7,0) & F'(7,1) & \cdots & F'(7,7)
\end{bmatrix} \tag{5}
\]

F'(u,v) is also called the quantized DCT coefficient. In matrix (5), the coefficient F'(0,0) is referred to as the DC coefficient; the others are referred to as AC coefficients. Two kinds of tables are needed in the quantization step: the luminance table and the chrominance table. The typical quantization tables are given below. Because division operations are not efficient in hardware, in most implementations they are replaced with multiplication and shift operations. For example, dividing an output DCT coefficient by the quantization value 13 can be expressed as

\[
\frac{DCT}{13} = \frac{DCT}{13} \times \frac{2^{16}}{2^{16}} \approx \frac{DCT}{2^{16}} \times \left\lfloor \frac{2^{16}}{13} \right\rfloor = \frac{DCT \times 5041}{2^{16}} \tag{6}
\]

700

X. Chen et al. /Journal of Computational Information Systems 7:3 (2011) 697-706

where ⌊·⌋ represents truncation to an integer value. So the DCT coefficient is actually multiplied by 5041, which is stored in the proposed implementation as the corresponding quantization value, and then the least significant 16 bits are discarded by a shift operation.

Table 1.1 Luminance Table

16  11  10  16  24  40  51  61
12  12  14  19  26  58  60  55
14  13  16  24  40  57  69  56
14  17  22  29  51  87  80  62
18  22  37  56  68 109 103  77
24  35  55  64  81 104 113  92
49  64  78  87 103 121 120 101
72  92  95  98 112 100 103  99

Table 1.2 Chrominance Table

17  18  24  47  99  99  99  99
18  21  26  66  99  99  99  99
24  26  56  99  99  99  99  99
47  66  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
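The multiply-and-shift replacement of equation (6) is easy to check against ordinary division. The sketch below is a reference model in Python (in the hardware, the reciprocal is a stored constant and the shift costs only wiring); note that this truncating form can differ from exact rounded division by one unit for some inputs.

```python
def quantize_mul_shift(dct_coeff, q):
    """Approximate dct_coeff / q via a precomputed reciprocal and a 16-bit
    shift, as in equation (6)."""
    recip = (1 << 16) // q          # e.g. floor(2**16 / 13) = 5041
    return (dct_coeff * recip) >> 16

# quantize_mul_shift(100, 13) -> 7   (exact division: 100 / 13 = 7.69...)
```

Dividing by a power of two, as with the quantization value 16, is exact, since the reciprocal 2^16/16 has no truncation error.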

2.4. Zig-Zag Scanning and Entropy Coding

After quantization, the DC coefficient and the AC coefficients of each 8*8 block are read in Zig-Zag order, as depicted in figure 2.

Fig. 2 Zig-Zag Scanning
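The traversal order of figure 2 can be generated by walking the anti-diagonals of the block and alternating direction on each one; a sketch (Python for illustration):

```python
def zigzag_order(n=8):
    """Return the (row, col) visit order of an n x n block in JPEG zig-zag order."""
    order = []
    for s in range(2 * n - 1):                 # s = row + col indexes an anti-diagonal
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Odd diagonals run top-right to bottom-left; even ones the reverse.
        order.extend(diag if s % 2 else diag[::-1])
    return order

# zigzag_order()[:4] -> [(0, 0), (0, 1), (1, 0), (2, 0)]
```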

Initially, a new DC value is calculated by differential pulse-code modulation (DPCM). The DC coefficient is the first value in the matrix; its encoded value is the difference between the current DC coefficient and the DC coefficient of the previous block, as shown in figure 3. If there is no previous block, the previous value is set to zero.

Fig. 3 DPCM Coding
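The DPCM step of figure 3 reduces to one subtraction per block; a sketch (Python for illustration):

```python
def dpcm_dc(dc_coeffs):
    """Differentially encode a sequence of per-block DC coefficients.
    The predictor for the first block is zero."""
    prev = 0
    diffs = []
    for dc in dc_coeffs:
        diffs.append(dc - prev)   # difference from the previous block's DC
        prev = dc
    return diffs

# dpcm_dc([15, 17, 14]) -> [15, 2, -3]
```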


Afterwards, the remaining values in the matrix, called the AC coefficients, are encoded slightly differently, using an 8-bit value represented as RRRRSSSS. The run-length, the 4-bit RRRR value, is the number of zeros preceding a non-zero value in the zig-zag reading of the matrix. The non-zero value is then coded by size, the 4-bit SSSS value, as described for the difference magnitude. Finally, the DC coefficient and the AC coefficients are coded with Huffman codes. Since this is not the focus of our research, it is not described in detail in this paper.

3. Improved JPEG Architecture Based on FPGA

Section 2 introduced the base algorithms of JPEG compression; this section discusses how to improve the efficiency of compression in an FPGA hardware implementation of those algorithms, particularly when dealing with higher-throughput video data.

3.1. Modified Input Buffer Module

A real-time JPEG compression system employs ping-pong memories as the image buffer: one of the memories receives incoming image data, while the other is connected to the JPEG compression modules to make image data available. The size of each input buffer is commonly set to match the size of one image frame [5], which has two disadvantages. First, the FPGA system must attach large off-chip caches, such as two pieces of SDRAM, which increases system complexity. Second, JPEG compression can start only after a whole frame has been captured in the ping-pong memories; that is, the output delay of the compressed image data is longer than the frame period. As discussed above, the image data is divided into non-overlapping 8*8 blocks during JPEG compression, namely JPEG compression operates block by block.
In practice, the image data enters the JPEG compression system row by row, so the ping-pong memories can store the image data 8 rows at a time rather than one frame at a time. Taking the MC1362 camera as an example, it offers monochrome video at a maximum resolution of 1280[H]*1024[V]. In a common JPEG compression system the ping-pong memories occupy 2.5MBytes (1280*1024*2), while in the modified JPEG compression system they occupy only 20KBytes (1280*8*2), a reduction to 1/128 of the common size. With the development of microelectronics, most Field Programmable Gate Array (FPGA) devices contain large on-chip memories; for example, the FPGA EP2S60 produced by Altera [6] contains 318KBytes, far exceeding the needed 20KBytes. Therefore, it is possible to buffer the image data in the on-chip memory of the FPGA, which not only reduces the complexity of the hardware system, but also decreases the output delay of JPEG compression.

3.2. High Frame Rate Real-Time JPEG Compression System

Take the MC1362 camera as an example again: when it operates at the full resolution of 1280*1024 pixels at 500 full frames per second, the maximum throughput reaches 625MBytes/s, much larger than 182.80MBytes/s, the maximum throughput of a JPEG hardware system employing the base principle of JPEG compression [7]. So a single high-speed JPEG compression module cannot


meet the real-time compression needs of the camera output. Since the efficiency of traditional sequential JPEG compression is insufficient, the most effective way to speed up is parallel processing: if sequential operation is converted into multiple parallel computations, the overall performance of the system improves several times. In this section, a convenient parallel-processing method for JPEG compression is proposed: each line of the input image is divided among multiple sub-images, and these sub-images are compressed in parallel, thereby increasing the efficiency of compression. Finally, a high-frame-rate real-time JPEG compression system based on this method is realized on an FPGA.

3.2.1. Method of High Frame Rate Image Parallel Compression

The maximum throughput of a JPEG hardware system with a single channel of high-speed JPEG compression is 182.80MBytes/s; if the system processes four channels in parallel, the maximum throughput will exceed the 625MBytes/s of the camera MC1362. The most important question, then, is how to divide the image into four channels. As discussed in section 3.1, the image data enters the ping-pong buffer of the compression system line by line, so each line can be sampled; the next question is how many sample points should be taken for the most suitable parallel JPEG compression. Taking parallel processing with two channels as an example, three different partition methods are depicted in figure 4.
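The buffer and throughput figures quoted above can be reproduced with a few lines of arithmetic. The sketch below (Python for illustration) takes the 182.80MBytes/s single-channel figure from [7] as given:

```python
import math

W, H, FPS = 1280, 1024, 500            # MC1362 full-resolution mode, 1 byte/pixel

# Ping-pong buffer sizes (two banks each)
frame_buffer = W * H * 2               # whole-frame buffering: 2621440 bytes (2.5MB)
line8_buffer = W * 8 * 2               # 8-line buffering: 20480 bytes (20KB)
ratio = frame_buffer // line8_buffer   # reduction factor: 128

# Channels needed to absorb the camera's raw throughput
camera_mb_s = W * H * FPS / 2**20      # 625.0 MBytes/s
channels = math.ceil(camera_mb_s / 182.80)   # 4 parallel channels
```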

[Figure 4: three ways to split a line of n pixels between two channels. (a) alternating single pixels, even samples to one channel and odd samples to the other; (b) alternating groups of eight pixels; (c) a single split at the line midpoint, pixels 1..n/2 to one channel and pixels n/2+1..n to the other.]

Fig. 4 Partition Methods of Parallel Compression Data

In figure 4(a), the image data is divided into two parts by even and odd points, which is the easiest way to implement. However, the DCT operates on 8*8 blocks and achieves compression by removing the redundant information within a block. If the image data is divided by method 4(a), the original relationship among the 8 pixels in each line is disrupted, which degrades the performance of JPEG compression. So method 4(a) is infeasible.


In figure 4(b), the image data is divided into two parts by groups of eight points. However, obtaining the DC difference values across the two parallel compression channels, and combining the compressed streams of the two compression modules into a whole image, are too difficult to realize in a real-time JPEG compression system.

In figure 4(c), the image data is divided into two parts of n/2 points each; that is, each image is divided into two pictures at its center, and the two pictures are compressed separately. After decompression, the two parts are recombined into the whole picture. According to the above analysis, method (c) has the disadvantage that the parts must be stitched back into the whole picture after decompression; however, high-frame-rate parallel compression is convenient to implement with method (c), so method (c) is employed in our parallel JPEG compression system.

3.2.2. Design of Input Module and Control Module

As discussed in section 3.2.1, the original image of 1280*1024 pixels is divided into four sub-images with the same resolution of 320*1024, and the four sub-images are compressed simultaneously by four channels of JPEG compression modules in the parallel compression system. Adopting the buffering method introduced in section 3.1, compression of the image starts once 8 lines are cached in the ping-pong buffer. The data output from the high-speed camera is transferred with a width of 64 bits; every 8 adjacent pixels of one line are output in parallel, so the 1280 pixels of a line are transferred in 160 clock cycles. Each line of the image is divided into 4 groups: the first group of 40 data words is saved in the first buffer, the second group of 40 data words in the second buffer, and so on, as depicted in figure 5.

[Figure 5: under the line-valid signal, 64-bit data words 1-40 are buffered in the first RAM, words 41-80 in the second RAM, words 81-120 in the third RAM, and words 121-160 in the fourth RAM.]

Fig. 5 Method of Storing Image in the Buffer
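The grouping of figure 5 amounts to steering each 64-bit word by its position in the line. A behavioural sketch follows (Python for illustration; on the FPGA this is an address counter plus per-RAM write enables):

```python
def split_line(words, groups=4, group_len=40):
    """Distribute one line of 64-bit words into per-channel buffers:
    words 0-39 go to channel 0, words 40-79 to channel 1, and so on."""
    assert len(words) == groups * group_len        # 160 words for a 1280-pixel line
    return [words[g * group_len:(g + 1) * group_len] for g in range(groups)]

line = list(range(160))                # stand-in for the word indices of one line
bufs = split_line(line)
# bufs[0][0] == 0, bufs[1][0] == 40, bufs[3][-1] == 159
```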

Therefore, the size of each ping-pong RAM is 2*8*40*64 bits = 5120 Bytes. Considering that the RAM size must be a power of two, the real size of each ping-pong RAM is 8KBytes, and the total size of all ping-pong RAMs is 32KBytes, which the FPGA's on-chip resources can easily supply. For convenient addressing, each RAM is organized as 2*8*64*64 bits, and data is stored in the first 40 storage locations of each line. When 8 lines of data have been captured, the control module sends a signal to the four channels of JPEG compression modules, and each coding module automatically reads data from the buffer for compression. The parallel image compression system, which handles images of 1280*1024 pixels at 500 frames per second, is shown in figure 6. The compressed data from the 4 channels of JPEG coding modules are stored in different spaces of the SDRAM by the JPEG data storage and management module; in other words, an image of 1280*1024 pixels is compressed as 4 sub-images of 320*1024 resolution.
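The RAM sizing above can be verified quickly (Python for illustration):

```python
banks, lines, words, width = 2, 8, 40, 64    # ping-pong banks, lines, 64-bit words per line

data_bits = banks * lines * words * width    # live data per channel: 5120 bytes
data_bytes = data_bits // 8

# Rounded up to a power-of-two word count (64 words per line) for easy addressing
ram_bits = banks * lines * 64 * width        # 8192 bytes per channel
ram_bytes = ram_bits // 8
total_bytes = 4 * ram_bytes                  # 32768 bytes across the four channels
```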


Fig. 6 Method of Four Channels Parallel JPEG Compression

4. Synthesis Results and Experimental Results

4.1. Synthesis Results

The main blocks of the JPEG compressor for four-channel parallel processing have been described in Verilog HDL and synthesized for Altera FPGAs. Table 4.1 presents a summary of the synthesis results for the complete four-channel parallel JPEG compression architecture on an Altera Stratix II EP2S60 FPGA.

Table 4.1 Synthesis Results of the JPEG Compressor for Four Channels Parallel Processing

Resource            Used Resources    Total Resources    Percentage
LC Combinationals   5862              48352              12.12%
LC Registers        4976              48352              10.29%
Block Memory Bits   335168 bits       2544192 bits       13.17%
DSP Elements        64                288                22.22%

As shown in table 4.1, the percentage of used resources is well below 25% of the total for every resource type; if the system were extended to eight parallel channels, which could also be conveniently realized on an Altera Stratix II EP2S60 FPGA, the data throughput of the system would double.


4.2. Experimental Results

The four-channel parallel JPEG compression system is realized on an Altera Stratix II EP2S60 FPGA. In the experiments, the system captures image data from the camera MC1362, compresses it in real time, and stores the compressed data stream in the SDRAM. After 100 frames of image data have been processed, the system transfers the compressed data from the SDRAM to a computer over USB 2.0; the computer then decompresses the data and stitches the four parts of each image into the whole one. The results of parallel compression of images of an Image Resolution Test Card and a CPU fan are shown in figure 7: the left pictures are the respective results of the four channels, and the right pictures are the stitched combinations of the four parts. Table 4.2 gives the resulting compression ratios.

Fig. 7 The Results of Four Channels and the Results of Splicing One

Table 4.2 The Results of Compression Ratios

Image       Part 1   Part 2   Part 3   Part 4   Total    Ratio     Base JPEG   Ratio
            (Bytes)  (Bytes)  (Bytes)  (Bytes)  (Bytes)  (partly)  (Bytes)     (base)
Test Card   25999    27463    26855    24922    105239   12.45     105116      12.47
Fan         16981    19104    19018    12736    67839    19.32     67497       19.42

As the above analysis shows, the whole image video is turned into four part-videos by the system


based on the method of high-frame-rate image parallel compression; yet this method completely resolves the problem of compressing high-frame-rate images in real time at the cost of only a slight loss of compression ratio, which has an important impact on the field of machine vision.

5. Conclusion

In this work, how to improve the efficiency of JPEG compression has been investigated. A novel method is proposed that compresses video data in parallel while buffering the incoming video data 8 lines at a time in ping-pong memories. In this method, each image of a video is divided into four parts, and the four parts are compressed in parallel following the base principle of JPEG compression. The method is realized on the Altera Stratix II FPGA EP2S60, and the experimental results show that the designed parallel image compression system can efficiently process images of 1280*1024 pixels at 500 frames per second in real time.

Acknowledgement

This work is supported by the Fundamental Research Funds for the Wuhan Universities of China.

References

[1] F. Ebrahimi. JPEG vs JPEG2000: An objective comparison of image encoding quality. http://www.genista.com/scientific%20articles/jpgvsj2k.pdf.
[2] Xiaofan Yang, Zhongshi He, Sen Bo, Shuyu Chen, and Qingsheng Zhu. The Coding Principle of JPEG2000 with Advantages and Disadvantages. Computer Science, 29(4):82-83, 2002.
[3] Luciano Volcan Agostini, Ivan Saraiva Silva, and Sergio Bampi. Multiplierless and fully pipelined JPEG compression soft IP targeting FPGAs. Microprocessors and Microsystems, 31:487-497, 2007.
[4] He Junhui, Tang Shaohua, Li Bin. Model-based steganalytic method towards color JPEG images. Journal of Computational Information Systems, 3(6):2293-2302, 2007.
[5] Hongwei Zhang, Jifu Sun, and Changnin Huang. Optimizing and Implementation of JPEG Image Compression Technology. Spacecraft Recovery & Remote Sensing, 29(4):18-23, 2008.
[6] Altera Inc. Stratix II Device Handbook. 2007.
[7] Qinglin Zhang. Research on Some Key Technology of High-speed Image Processing Platform in Machine Vision. PhD thesis, Wuhan University, Wuhan, 2010.
