Optimizing Visible Objects Embedding Towards Realtime Interactive Internet TV

Bin Yu, Klara Nahrstedt
Computer Science Department, University of Illinois at Urbana-Champaign
{binyu, klara}@cs.uiuc.edu

ABSTRACT

Embedding new visible objects such as video or images into MPEG video has many applications in newscasting, pay-per-view, Interactive TV and other distributed video applications. Because the embedded foreground content interferes with the original motion compensation process of the background stream, we need to decode macroblocks in I and P frames via motion compensation and re-encode all macroblocks with broken reference links via motion re-estimation. Although previous work has explored DCT-compressed domain algorithms and provided a heuristic approach for motion re-estimation, the computation-intensive motion compensation step remains largely unoptimized and still prevents efficient realtime embedding. In this work, we optimize previous work to enable realtime embedding that can be applied to Interactive Internet TV applications. We study the motion compensation process and show that on average up to 90% of the decoded macroblocks are never used. To exploit this phenomenon, we propose to buffer a GOP (Group-Of-Pictures) of frames and apply a backtracking process that identifies the minimum set of macroblocks which need to go through the decoding operation. At the price of a one-GOP delay, this approach greatly speeds up the whole embedding process and enables on-line software embedding capable even of processing HDTV streams. Further optimizations are discussed and a real-world application scenario is presented. Experimental results confirm that this approach is much more efficient than previous solutions and yields equally good video quality.

1. INTRODUCTION

The MPEG2 standard [1] has become the most widely used format for video transmission and storage in various video applications, such as Standard Definition TV and High Definition Digital TV (HDTV, [2]) broadcast, Internet Video-On-Demand, interactive TV and video conferencing. At the same time, broadband networking technologies such as cable modem [3] and xDSL [4] are bringing Internet connections of 10 Mbps or higher bandwidth to our homes and office buildings at acceptable and falling prices. Higher-capacity technologies such as Fast Ethernet [5] and Gigabit Ethernet [6] can carry multiple high quality video streams simultaneously. Meanwhile, high-end personal computers are becoming more powerful, with processors running at over 1 GHz and large memory and storage devices. With all these technologies, we can expect much better video distribution applications offering higher visual

quality, greater interactivity, more flexibility and finer service customization. Aiming at the same goal, two originally separate approaches, PC plus Internet and TV plus Set-Top-Box, are rapidly converging to provide a common set of services, including pay-per-view, watching multiple channels via Picture-in-Picture (PiP), web browsing, email access, and stock/weather/sports/news updates through tickers on the screen. However, for lack of efficient software solutions capable of compositing high-volume video streams on-line, most current TV programs are edited with special hardware at the TV station or in the Set-Top-Box. The major drawback of this end-to-end approach is that it does not scale easily to multiple video sources. Therefore, the huge amount of data content and the joint computing power of the Internet cannot be fully utilized. Also, since all the video processing is done in the closed world from the TV station to the Set-Top-Box through predefined operation modules, third-party service providers are prevented from offering more flexible and customizable services to the users. Against this background, we have started a project to provide an open, flexible and scalable software solution for composing multiple MPEG video streams on-line. Our particular vision is that almost all the value-adding services mentioned above can be characterized as one problem – the visible objects embedding (VOE) problem. VOE refers to any operation that embeds visible objects (video, images or even text) into the original video stream, including content merging applications such as PiP, logo insertion and captioning, and control interface presentation such as displaying menus and buttons on the TV screen for the user to interact with. Therefore, our major task maps to constructing a realtime VOE algorithm for digital video streams. Note that overlaying can easily be done at the end host via pixel-domain overlapping in the window system or in Set-Top-Boxes, but we argue that a pure MPEG-in-MPEG-out scenario with the TV as an output device has its own advantages in terms of scalability, interactivity and customization. To verify our approach, we will present and concentrate on the Interactive Internet TV application using MPEG streams. We currently choose MPEG2 because it is the standard for digital TV broadcast and is widely used in most video distribution applications today. In this application, we merge visible objects into streams in the MPEG compressed domain. Compressed-domain video composition algorithms have been studied before [7, 8, 9] and have achieved 10% to 30% speed-up compared to spatial-domain approaches. However, we discover that they

still suffer from the computation-intensive macroblock compensation operation, and so are relatively slow and inefficient. Consequently, they are generally considered only for off-line editing such as watermarking, rather than for on-line processing as in interactive video applications. Therefore, we have to greatly optimize previous work to enable realtime VOE processing for TV streams. We will discuss in detail how we can reduce this workload to the minimum amount necessary, leading to an average 10-times speed-up of the motion compensation process, along with other optimization choices that reduce computation complexity at the price of reduced visual quality. This paper is organized as follows. First, in section 2, we describe the wrong reference problem we must solve for compressed domain VOE and previous solutions. Then in section 3 we present a more efficient and flexible approach for realtime VOE processing. Section 4 is dedicated to an example scenario that demonstrates how our approach can greatly improve distributed video applications in many ways. Analysis and experimental results are presented in section 5, and finally section 6 concludes this paper and outlines future work.

2. BACKGROUND AND PREVIOUS WORK

2.1 Visible Object Embedding in MPEG Compressed Domain

Figure 1 shows the primary operations in the VOE process. Our goal is to embed visible objects into an MPEG stream directly in the compressed domain. To achieve this goal we take the original MPEG compressed frame and parse through the macroblock headers to reach the so-called MC domain. The MC domain refers to macroblock headers with information such as motion vectors and prediction errors in quantized DCT format. After inverse VLC, the VOE process executes motion compensation to reach the RD domain. The RD domain is the "Reconstructed DCT Domain", which contains the DCT-formatted values for each image block. At this point we overlap the additional visible objects, in DCT representation, over the original frame in the foreground area. However, the merging causes major problems because of the temporal dependencies between MPEG frames: it leaves some of the macroblocks with wrong references. For the VOE process to work correctly, this problem needs to be solved by redoing motion estimation on RD domain data to get new MC domain data. In this paper, we will show how to perform the foreground overlapping and correct wrong references while executing VOE.

[Figure 1: The Video Objects Embedding Process – the original MPEG-compressed stream goes through header parsing and inverse VLC to MC domain data, then motion compensation to RD domain data; the additional visible objects are overlapped in the foreground area, wrong references are fixed, and the composite MC domain data is packed and VLC-coded into the result MPEG-compressed stream]

2.2 Object Overlapping and the Wrong Reference Problem

The visible information overlapping process itself is very simple. The result pixels are normally a combination of the foreground and background pixels, i.e.,

P_new(i, j) = α · P_f(i, j) + (1 − α) · P_b(i, j)        (1)

where P_new, P_f, P_b and α are the new pixels, foreground pixels, background pixels and the transparency factor respectively. Since this is a linear operation, it can also be done in the DCT domain as follows:

DCT(P̄_new) = α · DCT(P̄_f) + (1 − α) · DCT(P̄_b)        (2)

where P̄ represents the block-wise data in DCT format. For opaque overlapping, α is 1 and the overlapped background image is discarded. Note that we assume the foreground object to be overlaid is available in the DCT domain, which is true for many video and image formats, such as MPEG, H.263, MJPEG and JPEG. For text data or other images such as bitmaps, we first need to apply the DCT to convert them into a "VOE-friendly" format; that is, we need to provide a motion mode and DCT-formatted image data, which will be discussed in detail after the main algorithm is presented. After executing equation (2) we receive changed macroblocks in the foreground area.
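For concreteness, the following is a minimal numpy sketch of the blending in equations (1)-(2); the 8x8 arrays stand in for DCT coefficient blocks, and the function name is ours, not part of any MPEG library.

import numpy as np

def blend_blocks(dct_fg, dct_bg, alpha):
    """Equation (2): alpha-blend a foreground block over a background block
    directly on the DCT coefficients.  Because the DCT is linear, this is
    equivalent to blending the pixels as in equation (1)."""
    return alpha * dct_fg + (1.0 - alpha) * dct_bg

fg = np.random.randn(8, 8)   # stand-in for a DCT-coded foreground block
bg = np.random.randn(8, 8)   # stand-in for a DCT-coded background block
assert np.allclose(blend_blocks(fg, bg, 1.0), fg)   # opaque overlay keeps only the foreground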

[Figure 2: The Wrong Reference Problem – a prediction chain across frames I → P1 → P2; BG is the background frame, FG the embedded foreground area; MB2 in P2 predicts from MB1 in P1, which in turn predicts from MB0 in the I frame, while other macroblocks such as MB3 and MB4 are unaffected]

At this point the Wrong Reference Problem occurs, because some of the changed macroblocks are used as prediction references for future macroblocks. More specifically, consider the chain of I → P1 → P2 frames in Figure 2. The foreground area is the smaller rectangle (FG) and the background frame area is the large rectangle (BG). The small squares represent macroblocks (MBs). Assume MB2 in frame P2 uses the data of MB1 in frame P1 as a content predictor. Due to the VOE, the data in MB1 is changed by the content of the overlapping objects. If MB2 uses this new data for its prediction and adds the original prediction error, the final result is obviously wrong. To make things worse, this wrongly decoded macroblock may itself be used as a reference for other macroblocks in later frames, causing the error to propagate all the way down the motion prediction link until the next I frame appears. To fix MB2's MC data, we need to know MB2's RD domain data, and therefore MB1's. Also note that MB1 itself may rely on the data of another macroblock (MB0 in Figure 2) for prediction reference. Therefore, to know MB1's value, MB0 will also

need to be decoded to RD domain. In the worst case, for a GOP pattern of IBBPBBPBBPBBPBB, the data of one macroblock in the first I frame may be referenced 4 times to derive the content of macroblocks in all 4 following P frames and of many other macroblocks in B frames. With a maximum search distance of 16 macroblocks used by the encoder when searching for the optimal prediction block, this means the prediction link may stride 4×16 = 64 macroblocks across the frame. Therefore, in previous work all the macroblocks in the I and P frames are decoded to RD domain in case future macroblocks need them. In the following discussion, we will call macroblocks like MB0 or MB1 "d-MBs" since their Data needs to be decoded to RD domain for future reference by other macroblocks. Similarly, we will call macroblocks like MB2 "c-MBs" since their reference blocks are wrong and so their MC data has to be Changed.

2.3 Previous Solutions

Many MPEG compressed domain algorithms have been developed, among which [7], [8] and [9] provide a good starting point for our work. In [7], Noguchi, Messerschmitt and Chang thoroughly defined how to do motion compensation and motion estimation in the DCT domain. The major difficulty for DCT-domain algorithms comes from the fact that the reference blocks are often not aligned with the regular 8x8 image block boundaries. Therefore, their value has to be derived in the DCT domain from 2 or 4 neighboring image blocks. Moreover, when the prediction is based on an interlaced reference, 2 such blocks need to be reconstructed to form a 16x8 block, which is then vertically sub-sampled to get the final reference block. For the motion estimation part for c-MBs, the standard block searching approach used by MPEG encoders is not suitable because of the massive computation involved. Instead, the authors proposed a much cheaper heuristic way of obtaining the motion vectors by "inference". Specifically, they examine only the 2 most likely candidate reference macroblocks located at the edge of the foreground window, which greatly simplifies the motion re-estimation process without causing much perceptible image quality degradation. Although the [7] approach saves some computation compared to the brute-force spatial domain approach (10% to 30% speed-up according to [7]), it is still a costly process overall, and the motion compensation part for getting the RD domain data becomes the most costly bottleneck. Even with this speedup over the spatial domain approach, realtime processing is still not readily feasible. For example, for an HDTV stream with a resolution of 1920x1088, the decoding step alone is not feasible in realtime on general purpose single-processor PCs. Based on [7], Meng and Chang [8] proposed a faster solution in the sub-domain of visible watermark embedding. The situation there is different in that, instead of merging the watermarks with the original MPEG frame according to equation (2), the watermark block is simply added to the original background data:

E'_ij = E_ij + W_ij        (3)

E'_ij = E_ij − W_reference + W_ij        (4)

For intra-coded macroblocks, equation (3) is used to add the watermark value W_ij to the image value E_ij of the ij-th block. For inter-coded macroblocks, equation (4) is used, because we first need to subtract the watermark W_reference that was added to the reference macroblock from the original prediction error E_ij. This way, the motion compensation work for reference macroblocks is eliminated: we only need to know how they have changed, and we can adjust the prediction errors based on changes calculated using only the embedded watermark. After this work, [9] pointed out that the [8] approach may not work if the embedded watermark or caption is so strong that equation (3) makes E_ij overflow the [16, 235] bound for a pixel's luminance value. In this case the result is normalized, but then the changes made to the prediction error also depend on the original reference macroblock's data. Therefore, to truly prevent overflow, the maximum luminance value of the reference macroblock is needed again and the savings of [8] are no longer available. The authors of [9] avoided this problem heuristically by using the average luminance of the background block, which is the DC coefficient in the DCT domain and can be acquired easily, to estimate the caption threshold. The potential problem is that when the maximum pixel value differs much from the average, overflow will still occur, but the authors claim that the resulting image is of acceptably high quality. Though the algorithm from [9] works fine for embedding text-format information, it cannot be applied to more general image/text overlapping cases where the foreground image must be shown clearly and stably and the background image has to be weighted much less. Formally, if we keep the original motion vectors, then the difference between the original and new prediction errors (DCT(P̄_diff)) is the difference between the original reference macroblock value (DCT(P̄_b)) and its new value (DCT(P̄_new)) after VOE, as in equation (5):

DCT(P̄_diff) = DCT(P̄_new) − DCT(P̄_b)
            = α · DCT(P̄_f) + (1 − α) · DCT(P̄_b) − DCT(P̄_b)
            = α · (DCT(P̄_f) − DCT(P̄_b))        (5)

Therefore, to know the changes made to the background block, we still need to decode the background block in the first place. An extreme case is opaque overlapping (α = 1): the resulting block after overlapping is simply the foreground block, and the prediction error needs to be adjusted by the difference between the foreground block (DCT(P̄_f)) and the background block (DCT(P̄_b)). In summary, for general video/image overlapping we must have the original values of both the reference macroblocks and the dependent macroblocks in RD domain via motion compensation, and so cannot avoid it as in [8, 9]. Therefore, we have to face the computation-intensive motion compensation operation and find a new solution that minimizes the amount of computation involved.
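To make the contrast between [8, 9] and the general case concrete, here is a small sketch of the two adjustment rules, assuming all blocks are already available as DCT coefficient arrays; the function names and signatures are illustrative, not taken from any of the cited systems.

def watermark_adjust(err_dct, w_dct, w_ref_dct=None):
    """Equations (3)/(4) in [8]: the prediction error is corrected using only
    the watermark data, so the reference block never has to be decoded."""
    if w_ref_dct is None:                      # intra-coded macroblock, eq. (3)
        return err_dct + w_dct
    return err_dct - w_ref_dct + w_dct         # inter-coded macroblock, eq. (4)

def general_overlay_correction(fg_dct, ref_bg_dct, alpha):
    """Equation (5): for general alpha blending the correction term depends on
    the original background reference block, which must therefore be decoded."""
    return alpha * (fg_dct - ref_bg_dct)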

3. OUR SOLUTION

3.1 The Backtracking Process

After the thorough discussion above, it is clear that to greatly accelerate the VOE process, we have to reduce the expensive motion compensation work done for all macroblocks in I and P frames. The key observation is that the previous work does much more than the minimum decoding necessary for rendering correct results. Originally, the reconstructed RD domain data serves two goals: getting the correct values of the d-MBs and c-MBs, and calculating new MC domain data through motion estimation. Since [7] simplified the motion estimation part by only examining the edge macroblocks of the background frame surrounding the foreground area, the second goal can be satisfied as long as we always decode these edge macroblocks – what we call the "gold edge". For the first goal, when there are only a few c-MBs, the corresponding number of d-MBs will also be small, and the total number of macroblocks that need reconstruction is much smaller than the number of all macroblocks in I and P frames. Our experimental results also confirm that only the macroblocks surrounding the foreground window are heavily affected by the VOE process, and most other macroblocks are not used at all. The only reason why the [7] solution makes the effort to completely reconstruct all reference frames is that the future motion prediction pattern is not known in advance. For example, in Figure 2, when we decode frame P1, we do not know that only a few d-MBs like MB1 will be used as references for future macroblocks like MB2 in frame P2.

The insight behind our approach is that, for most distributed video applications, various delays exist, such as queueing delay inside the network, buffering/synchronization delay at the receiver side, and delays due to jitter control and traffic policing. Therefore, in many cases the user may accept the VOE service even if it introduces a slightly larger delay, or a larger response time for interactive applications, as long as the extra content is embedded in an impromptu manner. In other words, we take the delay as another QoS (Quality of Service) dimension that can be flexibly controlled, and try to balance it against other QoS dimensions such as computation complexity and realtime-ness according to the user's preference.

Under this assumption, we can simulate "predicting the future" by "buffering the past". That is, we decode each frame to the MC domain and buffer the motion vectors and quantized DCT coefficients. After we have gone through the whole GOP, all the c-MBs can be identified by testing whether a macroblock's motion vectors point somewhere inside the foreground area. Similarly, d-MBs can be identified if some future d-MBs or c-MBs use them as prediction references. Therefore, we can define a "backtracking" operation as follows:

From the last B frame to the first I frame in a GOP, for each macroblock MB_current in the frame, if it uses reference macroblock(s) MB_reference in the foreground area, then we mark MB_current as a c-MB and MB_reference as a d-MB. Also, for every d-MB, all the macroblocks it refers to are marked as d-MBs. In addition, all macroblocks on the gold edge are marked as d-MBs by default.

This way, all the "active" prediction links relevant to the motion compensation are identified. Typically these links take the form "c-MB → d-MB → ... → d-MB", such as the link MB2 → MB1 → MB0 in Figure 2. Figure 3 shows what a typical discovered inverse prediction link looks like. Note that at each step the number of macroblocks may double or quadruple, since the prediction does not follow regular 8x8 block boundaries.

[Figure 3: A typical inverse motion prediction link discovered by backtracking – a c-MB near the foreground window traced back through a fanning-out chain of d-MBs across the background frame]

Once the c-MBs and d-MBs are marked, we resume the decoding and overlaying process from the I frame at the beginning of this GOP. Since by this time we already have the MC domain data, we can continue the VOE processing as follows: only for the c-MBs and d-MBs do we perform motion compensation to get their RD domain data, and only for the c-MBs do we perform motion estimation to get their new motion modes and prediction errors. For all other macroblocks we do nothing, and their MC domain data is used directly in the later re-encoding phase. We also want to point out that this tracking is always needed within the motion compensation process of any solution, so it does not incur extra cost; we are merely batching up the tracking for a GOP of frames until the end of the GOP instead of doing it separately for each frame. A great saving can be expected since only c-MBs and d-MBs are decoded through motion compensation, and these are typically far fewer than the total number of macroblocks in I and P frames. This VOE solution is especially useful for timely delivery of multiple multimedia streams from different sources to a single TV receiver. In addition, the idea of identifying the minimum set of macroblocks for motion compensation through backtracking can also be applied to many other video manipulation operations such as video clipping and video juxtaposition.
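A minimal sketch of this backtracking pass follows; the frame and macroblock bookkeeping (in_fg, on_gold_edge, references) is assumed to be provided by the surrounding MC-domain decoder, and the sketch only illustrates the marking logic defined above.

def backtrack_gop(frames, in_fg, on_gold_edge, references):
    """One backward pass over a buffered GOP, following the definition above.
    `frames` lists the macroblock ids of each frame in display order;
    in_fg(mb), on_gold_edge(mb) and references(mb) are assumed callbacks."""
    c_mbs, d_mbs = set(), set()
    for frame in reversed(frames):            # last B frame ... first I frame
        for mb in frame:
            if on_gold_edge(mb):              # always decoded, for motion re-estimation
                d_mbs.add(mb)
            refs = list(references(mb))
            broken = [r for r in refs if in_fg(r)]
            if broken:                        # a reference falls in the foreground area
                c_mbs.add(mb)
                d_mbs.update(broken)
            if mb in d_mbs:                   # a d-MB drags its own references along
                d_mbs.update(refs)
    return c_mbs, d_mbs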

3.2 Optimizations

Since the VOE process is quite complicated and involves many operations, there are many places where we can make trade-offs and optimize towards a certain QoS metric. The basic idea is that we can further decrease the computation complexity through heuristic methods at the price of a slight degradation in the resulting visual quality. The following are three examples.

3.2.1 Bi-directional to Uni-directional Prediction

For a c-MB that is bi-directionally predicted, an interesting phenomenon is that very likely only one of its prediction links points into the foreground area while the other does not. The physical meaning can be explained as follows. Assume a jumping ball in a video moving directly towards the foreground area; Figure 4 shows the positions of the 3 macroblocks that describe this scenario in three neighboring P-B-P frames. When encoding the macroblock in the middle B frame, the encoder may choose to predict it bi-directionally and use the average of the two macroblocks (P_MB_1 and P_MB_2) in the 2 P frames. Since P_MB_2

is located inside the foreground area, B_MB will be marked as a c-MB and the 2 P frame macroblocks will be marked as d-MBs. Since B_MB is not very different from the 2 P macroblocks, we can exploit this phenomenon to reduce processing time by discarding the broken prediction. That is, we change the macroblock in the B frame to be uni-directionally predicted using only the macroblock that is not in the foreground area. The same prediction error can be used since all 3 macroblocks have essentially the same content, so the only operation we need to perform is to change the motion compensation mode from bi-directional to uni-directional and delete one motion vector. In this example, we only need to change B_MB to be uni-directionally predicted from P_MB_1, and none of these 3 macroblocks will be marked as c-MB or d-MB at all. Although the resulting picture has slightly poorer quality, our experimental results show that the difference is not obvious to human eyes and the saving in processing power justifies this cost.

[Figure 4: Bi-directional prediction – a jumping ball moves toward the foreground area; B_MB in the B frame is predicted from P_MB_1 and P_MB_2 in the surrounding P frames, with P_MB_2 inside the FG window]
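A sketch of this conversion, assuming a simple dictionary representation of a macroblock's motion data (our own illustrative structure, not MPEG syntax):

def drop_broken_direction(mb, in_fg):
    """Optimization 3.2.1 (sketch): if a bi-directionally predicted macroblock
    has exactly one reference inside the foreground area, switch it to
    uni-directional prediction from the intact reference and keep the original
    prediction error, so no c-MB/d-MB marking is needed for it."""
    if mb['mode'] != 'bidirectional':
        return False
    fwd_broken, bwd_broken = in_fg(mb['fwd_ref']), in_fg(mb['bwd_ref'])
    if fwd_broken == bwd_broken:              # both broken or both intact: leave it alone
        return False
    if fwd_broken:
        mb['mode'], mb['fwd_ref'] = 'backward', None   # keep only the backward vector
    else:
        mb['mode'], mb['bwd_ref'] = 'forward', None    # keep only the forward vector
    return True                               # the prediction error is reused unchanged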

3.2.2 Sensitive Area

Sometimes the foreground window occupies only a small portion of the background frame, so many macroblocks of the background stream may never be affected by the VOE operation. Therefore, we define a "sensitive area" for each frame for the backtracking process. A "d-sensitive-area" specifies the area that may contain d-MBs, and a "c-sensitive-area" the area that may contain c-MBs. The benefit of defining sensitive areas is that it saves the decoding work needed for macroblocks outside them. If a whole slice is "insensitive", we can copy the slice from the input stream directly to the output stream without even decoding it to the MC domain. If part of a slice lies inside the sensitive region, we still need to decode that slice to the MC domain, but macroblocks outside the sensitive area do not need to be checked for c-MBs or d-MBs in the backtracking process.

For convenience of discussion, we define S to be the maximum search distance used by the encoder during motion estimation (normally 16 macroblocks, or 256 pixels) and define "Region-i" (i ≥ 0) as follows: Region-0 is the foreground window plus the "gold edge" – the one ring of macroblocks in the background frame surrounding the foreground window. Region-i is obtained by enlarging Region-(i−1) in all 4 directions by a step size of S; if the enlarged region exceeds the frame boundary, the surplus part is cut off. This is shown in Figure 5.

[Figure 5: Sensitive Regions – candidate macroblocks for backtracking; Region-0 = FG + gold edge, and Region-1, Region-2, Region-3, ... each grow the previous region by the step size S, cut off at the frame boundary]

For every inter-coded frame, the c-MBs will only refer to macroblocks inside Region-0 (the bi-directional case can be eliminated by transforming to uni-directional prediction), and so the c-sensitive-area for every frame is Region-1. The d-sensitive-area changes as the backtracking process progresses – the further back we go, the wider the area of macroblocks that may be touched by the prediction link. For the last P frame, the d-MBs are located inside the foreground area plus the gold edge (as needed for motion re-estimation), which is Region-0. Therefore, the final sensitive area for the last P frame is the superset of its c- and d-sensitive-areas, which is Region-1. All these macroblocks may need macroblocks in the second-to-last P frame as references, and those reference macroblocks (candidate d-MBs) are at most S away from the gold edge. Therefore, the d-sensitive-area for the second-to-last P frame is S larger than Region-1, which is Region-2. By similar deduction, we get the final sensitive area for every frame as shown in Table 1. The numbers are Region indices ("-" means not applicable), and the final sensitive area of each frame is the superset of that frame's c- and d-sensitive-areas.

Frame  I  B  B  P  B  B  P  B  B  P  B  B  P  B  B
c      -  1  1  -  1  1  1  1  1  1  1  1  1  1  1
d      5  -  -  4  -  -  3  -  -  2  -  -  0  -  -
Final  5  1  1  4  1  1  3  1  1  2  1  1  1  1  1

Table 1: Sensitive regions for each frame

We will see that when the foreground window is located on the boundary of the background frame, which is the normal case, many slices and macroblocks fall in the insensitive area and require little decoding/encoding work.
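The Region-i construction is easy to express in code. The sketch below works in pixel coordinates and assumes the gold edge is already folded into the Region-0 rectangle; the names and the rectangle representation are ours.

def sensitive_regions(region0, frame_w, frame_h, s=256, count=5):
    """Sketch of the Region-i construction from Section 3.2.2.  `region0` is the
    (x, y, w, h) rectangle of the foreground window plus the gold edge; S
    defaults to 256 pixels (16 macroblocks).  Each Region-i grows Region-(i-1)
    by S on every side, clipped to the frame boundary."""
    x, y, w, h = region0
    regions = [region0]
    for _ in range(count):
        x0, y0 = max(0, x - s), max(0, y - s)
        x1, y1 = min(frame_w, x + w + s), min(frame_h, y + h + s)
        x, y, w, h = x0, y0, x1 - x0, y1 - y0
        regions.append((x, y, w, h))
    return regions

# Example: HD frame, 480x256 foreground window (plus one-macroblock edge) in a corner.
print(sensitive_regions((0, 0, 480 + 16, 256 + 16), 1920, 1088))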

3.2.3 Shortening the Delay

The longer delay and/or response time caused by buffering a GOP of frames is the major cost of our approach. Since delay = GOP_size / frame_rate, for a video clip at 30 frames per second, a GOP of 15 frames means a 0.5-second delay. As analyzed above, this delay buys knowledge about future reference patterns so that we do not have to do unnecessary decoding. However, we can mitigate the delay in two ways. First, we can select a shorter GOP size at encoding time to facilitate video editing, or insert a transcoding proxy to shorten the GOP size; the delay then shrinks accordingly. For a GOP size of 6 frames at 30 frames per second, the delay is only 200 ms. If the frame rate is higher, the delay is also shorter. Of course, a shorter GOP or a higher frame rate means a larger bandwidth requirement for the same video quality, or reduced video quality at the same bandwidth. Therefore, the choice of GOP size is an engineering trade-off between video quality, bandwidth requirement and processing delay, and we propose to adapt it according to user preference and resource availability. Second, since the last P frames have a relatively small sensitive area, we can start the backtracking process earlier if necessary by assuming that all the macroblocks in that sensitive area are d-MBs. For example, after we have decoded the second-to-last P frame, if we assume all macroblocks in its d-sensitive-area are d-MBs, we can start the backtracking immediately instead of waiting for the remaining 4 B frames and one P frame to arrive. This 5/15 = 33% reduction in waiting time comes at the price of having more d-MBs and hence more motion compensation decoding. The earlier we start the backtracking, the more d-MBs we decode without using them. If we push this to the extreme, all the macroblocks in the I frame are assumed to be d-MBs, and we are back to the same situation as in [7]: no delay and no saving in decoding. This provides another way of balancing delay and processing time: the more processing power we have, the smaller the delay we can achieve.

3.3 Discussions

Though the general idea of our approach is straightforward, we need to be careful about several details for an efficient implementation. Currently we consider only opaque embedding, which applies to most situations and can be extended to semi-transparent embedding.

3.3.1 Streaming Mode

The first problem is how to perform VOE filtering with the backtracking delay in a streaming mode. A naive way is to buffer a whole GOP, run the backtracking process to mark c-MBs and d-MBs, resume the decoding and VOE process, and re-encode the GOP, all before working on the next GOP. Obviously, the VOE gateway would never be able to keep up with the incoming realtime stream. We solve this problem by "pipelining" the decoding and encoding operations along the sequence of frames. In the initialization phase, a GOP of frames is decoded to the MC domain and buffered, and the backtracking is done at this point. Starting from the second GOP, we decode one new frame of the next GOP to the MC domain and simultaneously perform the embedding and re-encoding for one old frame of the current GOP. This process can continue for the whole stream endlessly, and at any specific time we are decoding one frame and encoding another. The only difference from normal filtering is that the frame being VOE-processed is always one GOP-length "older" than the frame being decoded.
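A sketch of this pipelined loop, with the decoding, backtracking and embedding steps abstracted as assumed callbacks:

from collections import deque

def voe_stream(gops, decode_to_mc, backtrack, embed_and_encode):
    """Sketch of the pipelined streaming mode in Section 3.3.1.  `gops` yields
    lists of compressed frames, one list per GOP; the three callbacks wrap the
    MC-domain decoder, the backtracking pass and the overlay/re-encode step.
    While one frame of the next GOP is decoded, one GOP-older frame of the
    current GOP is embedded and emitted."""
    gops = iter(gops)
    current = deque(decode_to_mc(f) for f in next(gops))    # prime: buffer one whole GOP
    marks = backtrack(current)
    for gop in gops:
        upcoming = deque()
        for frame in gop:
            upcoming.append(decode_to_mc(frame))             # decode one new frame ...
            yield embed_and_encode(current.popleft(), marks) # ... emit one old frame
        current, marks = upcoming, backtrack(upcoming)
    while current:                                           # flush the final GOP
        yield embed_and_encode(current.popleft(), marks)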

3.3.2 New Content Preparation

We divide the preparation of the new content to be embedded into three cases.

I. The first case is video merging, where the new content is also an MPEG video stream with the same frame rate and GOP pattern but a smaller frame size. This foreground video may come from downscaling another TV program, or from smaller-frame video clips or digital camcorders. In opaque overlaying, the foreground video macroblocks are not affected by the background stream and so can retain their motion modes. The 2 MPEG streams' GOP distributions need to be matched exactly – for example, I frames need to be embedded into I frames. This may require delaying one of the 2 streams if they are not synchronized in advance. Also, since the foreground macroblocks will be treated the same as original background stream macroblocks, they will be decoded by future players using the background stream's quantization table. As a result, if the 2 streams' quantization tables are different, we need to decode the foreground video to the MC domain and re-quantize it, and the results are then re-encoded like other macroblocks from the background stream. If the 2 streams happen to use the same quantization table, we can copy the foreground video's slices directly to the output stream. This "bit string copy" saves a lot of computation and simplifies the VOE implementation, since we do not need to take care of decoding and motion compensation for the new stream.

II. The second case is image embedding, where the original image format may be JPEG or simply raw pixels such as bitmaps. To embed this type of static content into an MPEG stream, we need to prepare the image in MC domain format, as sketched below. The image data is DCT transformed and quantized if necessary, using the same quantization table as the background stream. Then we have to make up macroblock headers for it. This is done following a "show once, copy many" approach: for the first I frame, the content is encoded as intra-coded macroblocks; for the following frames, we put "dummy copy macroblocks" in the foreground area. A "dummy copy macroblock" always uses forward prediction with a zero motion vector and zero prediction errors. In our implementation, the "skipped macroblock" syntax is used to achieve this effect wherever feasible. This results in the embedded images being refreshed at GOP boundaries. Depending on how frequently we want the images to be updated, we can also encode intra-coded macroblocks of the image content in P frames for a higher refresh rate, or even in B frames if this does not exceed the maximum number of intra-coded macroblocks allowed in B frames by MPEG.

III. For text embedding, we currently use the same method as for image embedding. Text is first rendered to raw pixels and then DCT transformed before embedding; the remaining work is the same as in the case above. For transparent text embedding, we plan to incorporate the results from [9].
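As referenced in case II above, a rough sketch of how a raw image could be brought into a "VOE-friendly" form (block DCT plus quantization with the background stream's table); the helper names are hypothetical and the macroblock header construction is omitted.

import numpy as np
from scipy.fft import dctn

def prepare_image_blocks(image, quantize):
    """Cut the raw image into 8x8 blocks, DCT-transform them and quantize with
    the background stream's table (the `quantize` callback is assumed), so they
    can be spliced in as intra-coded macroblocks in the I frame; subsequent
    frames only carry zero-motion "dummy copy" macroblocks."""
    h, w = image.shape
    blocks = {}
    for y in range(0, h - h % 8, 8):
        for x in range(0, w - w % 8, 8):
            block = image[y:y + 8, x:x + 8].astype(float)
            blocks[(y, x)] = quantize(dctn(block, type=2, norm='ortho'))
    return blocks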

4. APPLICATION SCENARIO: INTERACTIVE TV

We have derived an efficient realtime VOE algorithm; the next question is what benefits it brings to distributed video applications. To justify our approach, we present a new Interactive TV scenario enabled by our VOE algorithm and compare it with the conventional TV rendering approach.

[Figure 6: A typical architecture for TV program delivery – the TV station broadcasts via satellite dish or cable network to a Set-Top-Box, which sends YUV signals to the TV set; an Internet connection reaches email and HTTP servers; the user interacts through a remote controller]

Figure 6 illustrates a typical scenario for conventional TV program broadcasting. The TV program is transmitted to the end host via satellite or cable broadcast, and then decoded by a Set-Top-Box or an equivalent internal decoding device of the TV set. Meanwhile, the Internet connection brings in useful information from data servers on the world wide web, such as email or HTTP servers. Typically a viewer interacts with the Set-Top-Box via buttons on a remote controller. To provide PiP functions or present command menus, all data streams are decoded to image pixels and then composed by the Set-Top-Box. Similar effects can be achieved by the windowing system on PCs with the same kind of pixel-domain overlay operation if the output is a PC monitor. Note that this requires all the overlay operations to be done after the TV stream is decoded, so all information processing is pushed to a single "hot spot", such as the Set-Top-Box in this example. A major problem is that this centralized scheme does not scale well when the number of incoming streams grows and the ways of presenting them together become more diversified, which is very likely as the Internet continues to grow exponentially. In that case, the hot spot will soon "catch fire" under the burden of maintaining connection states, synchronizing video/data streams, responding to user requests, content processing, and so on. Currently this problem is solved by sacrificing flexibility and customization. For example, the viewer cannot access email while watching TV, and the content source and the position of the overlay window in PiP are limited to predefined settings. Also, as mentioned in the introduction, we cannot generally program the service in an open and customizable way. For example, if the user wants the foreground window to be smaller or at a different position on the screen, or wants to show graphic data other than email or web pages from the Internet, there is no way to "enter" the Set-Top-Box and make this happen. To fundamentally solve this problem, we propose an alternative setup as shown in Figure 7. The TV programs are captured as MPEG2 streams, and after all the VOE processing the resulting MPEG2 stream is decoded into TV format. New DV transmission standards such as IEEE 1394 [10] even allow the MPEG2 stream to be sent directly to 1394-enabled TV sets, saving the cost of an extra decoding device. Many more information sources are added in this picture, and several changes have been made to accommodate these information streams and distribute the computation cost. First, an "Internet Request/Reply" component cooperates with a network of application-level VOE processing gateways placed inside the Internet to collect and process data coming from various kinds of servers based on the user's requests. Even the TV station could utilize this Internet channel for more interactivity with the TV viewers. This service model fits well into many existing architectures, such as Active Networks [11] and Content Service Networks [12], and the video composition can be done wherever it is most appropriate in terms of efficiency, cost and

convenience. Secondly, a "Content Preparation" component collects data from devices in the Home Network and transforms it into a more "VOE-friendly" format, that is, DCT domain data plus certain motion modes. Lastly, the viewer has many more choices of how to talk to the VOE system. Mice and keyboards convey much more information than remote controllers, and very customized, friendly GUIs can be supported via hand-held devices.

[Figure 7: A new setup for distributed VOE services – web, video, image, email and data servers feed Internet VOE service gateways; the TV station's program is captured as an MPEG2 stream, passes through VOE processing together with content prepared from Home Network devices (DVD/VCR player, storage device, digital camera, refrigerator, telephone), and is decoded to YUV signals for the TV set; the user interacts through keyboard, mouse or hand-held devices]

In general, our fast VOE algorithm brings the following benefits to many video applications:

• Computation Distribution. Since our algorithm can process streaming video in realtime, it decreases the resource requirements at application-level gateways and enables load balancing for efficient distributed video applications. For example, multiple video streams are merged at the most appropriate distributed nodes on the path rather than at the client, and the resulting stream is kept in standard MPEG2 format. This greatly reduces the state maintenance burden at intermediate gateways and end hosts in terms of multiple connections, inter-stream synchronization and buffer management. This scalability allows multiple servers on the Internet to cooperate in a "multi-sink" (reverse multicast) tree fashion. On the other hand, multicast sessions also benefit, since the embedding can be done right before the multicast point to save the end hosts' efforts. Furthermore, our algorithm enables a cost-effective way to insert information such as logos and other location/time relevant metadata for security or copyright purposes in an on-demand manner, since the new content becomes an integral part of the MPEG stream before the stream arrives at the client player. Also note that at the end hosts, the computation can be divided into parallel software components, which is suitable for utilizing future uniform and cheap computing power.

• On-line VOE Processing for High Quality Video. From the very beginning, we have been after a fast approach that can work in realtime for fairly complicated VOE operations on high quality MPEG streams. The speedup given by "doing the minimum necessary computation" can achieve many goals not possible with previous approaches. For example, although single-processor PCs are not fast enough even to decode and play HDTV MPEG streams, we can embed visible objects into such a stream directly in the compressed domain
in realtime because the total amount of "real work" required is well affordable. As TV streams move to high definition, the cost-effectiveness of our approach becomes more and more obvious. Also, current video editing software relies on spatial domain algorithms to render high quality results, and it requires a tedious compilation process before even a small change can be evaluated. With our algorithm, the user can first "preview" the overall editing effects in realtime, and use spatial domain processing only for the final production.

• More Interactivity and Customization. With a fast VOE algorithm, the bottom line is that we can provide all the functionality current Set-Top-Boxes offer, such as presenting multiple programs via Picture-in-Picture, accessing email and the web on the TV screen, and embedding stock/weather tickers. Furthermore, because of the flexibility of the software approach, we can do much better by allowing customization of the number, position, resolution and content source of the small overlay window in PiP, or even mimic a "desktop" with buttons and icons on the TV screen for the user to interact with. More importantly, since our algorithm is based on the open MPEG2 standard and the VOE interface can be described in a simple and standardized way, it enables an open VOE service model: application service providers could provide standard VOE processing services, and third-party content providers only need to provide the content together with a standard description of how this content should be presented. This programmability provides a much higher level of control and customization than the current closed approaches.

• Flexible Resource and QoS Control. The VOE process is generally complicated and computation-intensive, so there are many places where different resource requirements and QoS settings are conflicting,

such as occupied bandwidth, delay, jitter, response time, processor cycles and memory space. Fortunately, fine-grained control of resource utilization over a wide parameter space is possible, and it is the user's requirements and preferences that ultimately decide in which way the algorithm should be optimized.

5. EXPERIMENTAL RESULTS

We have implemented the first version of the realtime VOE filter, with functions such as video Picture-in-Picture and image embedding, as part of the Active Space project [13]. HDTV-quality MPEG2 streams are multicast over a Gigabit LAN, and client PCs receive the stream and output it to dedicated plasma displays. Our VOE service gateway intercepts the stream, provides the VOE service in realtime, and multicasts the resulting stream to the clients. We analyze the performance of the proposed VOE approach in 3 respects.

• Realtime Processing. The VOE service gateway runs on a general purpose PC with a single Pentium IV 1.4 GHz processor and 512 MB of memory. The filter is written in Visual C++ and runs on the Windows 2000 platform. As expected, we can perform many VOE functions in realtime, introducing an extra delay of at most 0.5 seconds. After initialization, the processor utilization levels off at about 70% on average when no other applications are running on the gateway machine. For this experiment, we chose the background stream to be an HD-quality (1920x1088, 30 frames per second) stream "Trees1.mpg", and the foreground stream to be a standard resolution MPEG video "football_sd.mpg" (480x256, 30 frames per second). Figure 8 is a screen shot of the resulting stream. Note that the foreground window can be located anywhere in the frame as long as it follows the 16x16 macroblock boundaries.

[Figure 8: An example screen shot for PiP]

• Computational Complexity Reduction.

As we have analyzed above and as also pointed out by [7], the most expensive work in the [7] approach is the decoding of reference macroblocks from the MC domain to the RD domain, and our approach attacks exactly this problem, based on the observation that normally a great portion of these macroblocks are not used at all. To demonstrate this, we only need to compare the number of macroblocks decoded to the RD domain by the 2 approaches. In the [7] approach, all the macroblocks in all I and P frames are counted, so for the "football in trees" example in Figure 8 the total number of macroblocks decoded to RD domain for 500 frames would be (1920 × 1088)/(16 × 16) × (6/15) × 500 = 1632000. On the other hand, for our approach the total count depends on the nature of the background stream, since we only work on affected macroblocks. For 3 different background streams, the total number of macroblocks reconstructed through motion compensation is given in Table 2.

                    Football   Stars     Trees
Previous approach   1632000    1632000   1632000
Our approach        139455     55509     56613

Table 2: Comparison of the number of macroblocks that need motion compensation

From the table we see that with our approach the most expensive motion compensation operation needs to be done for less than 10% of the macroblocks of the previous approach, and for background streams such as Stars and Trees this percentage is even smaller. To further explain the saving, we have plotted the distribution of c-MBs and d-MBs over the whole background area, as shown in Figures 9 and 10. As expected, c-MBs only appear around the foreground area, and the number of d-MBs also decreases dramatically farther away from the foreground area.

[Figure 9: Distribution of c-MBs]

[Figure 10: Distribution of d-MBs]
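For reference, the arithmetic behind the 1632000 baseline in Table 2:

# Reference-frame macroblocks decoded for 500 HD frames by the previous approach:
mbs_per_frame = (1920 * 1088) // (16 * 16)   # 8160 macroblocks per 1920x1088 frame
i_and_p_share = 6 / 15                       # 1 I + 5 P frames in a 15-frame GOP
print(int(mbs_per_frame * i_and_p_share * 500))   # -> 1632000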

As discussed in section 3.2, the bi-directional to uni-directional prediction optimization can eliminate one potential c-MB and at least one potential d-MB per converted macroblock, and so reduces the amount of computation needed. Figure 11 compares the number of c-MBs in each frame with and without this optimization for the "football-in-stars" scenario. For I and P frames there are no bi-directionally predicted macroblocks at all, so the two curves coincide there. For B frames, however, the reduction is on average more than 50%.

[Figure 11: Effects of the "bi-directional to uni-directional prediction" optimization]

• Resulting Video Quality. Since in general we use the same DCT domain approach as [7] except for the changes in reconstruction, we can expect the resulting image quality to follow the same analysis as in [7]. The quality degradation comes primarily from DCT quantization, the simplified motion estimation and some of the optimizations we discussed. For the visual quality test, because the foreground video is not changed in opaque overlaying, we only calculated the pSNR (peak signal-to-noise ratio, equation (6)) values for the background streams:

pSNR(X, Y) = 20 · log10( 255 / sqrt( (1/(MN)) · Σ_{i=1..M} Σ_{j=1..N} (X_ij − Y_ij)^2 ) )        (6)

Specifically, we used a standard MPEG2 decoder to decode both the original background streams and the resulting streams after embedding to the pixel domain, and used their difference (excluding the foreground area) as the noise value. From the results shown in Figure 12, we can see that the I frames are not affected at all, while the pSNR of the other frames varies for different video clips: the more motion there is, the more noise results from the VOE process.

[Figure 12: pSNR of background image for 3 video clips]

In addition, we have done a thorough subjective test. Over 100 viewers have seen the resulting stream of our realtime VOE, and no perceptible quality degradation was observed.
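For completeness, a small sketch of the pSNR measurement in equation (6), computed here with numpy over luminance arrays (the foreground area is assumed to be masked out by the caller):

import numpy as np

def psnr(original, embedded):
    """Equation (6): peak signal-to-noise ratio between the original background
    luminance and the luminance after embedding, both given as arrays of
    8-bit sample values."""
    diff = original.astype(float) - embedded.astype(float)
    mse = np.mean(diff ** 2)
    return float('inf') if mse == 0 else 20.0 * np.log10(255.0 / np.sqrt(mse))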

6. CONCLUSION AND FUTURE WORK

Visible Object Embedding is a very useful video compositing operation that appears in many video applications. A software realtime VOE algorithm in the MPEG compressed domain has great benefits and can be immediately applied to many scenarios in people's lives. In this paper, we proposed a new cost-effective and flexibly controllable approach for doing VOE in the MPEG compressed domain. Compared with previous work, our approach greatly reduces computational complexity by eliminating as many unnecessary motion compensation operations as possible, at the cost of a small amount of delay. Specifically, a backtracking operation determines the set of macroblocks whose image values are needed for future reference and the set whose prediction references are wrong. This way, only the minimum amount of work necessary for correct embedding results is done, and for the most expensive motion compensation part the savings can be over 90%. The other property of our approach is that we have flexible control over how to balance different QoS metrics, such as delay, computational complexity and bandwidth, and an optimal service setting can be easily configured based on the user's preference. We have implemented a first version of the realtime VOE filter that can embed video and images in realtime even for high definition MPEG streams, discussed many subtle problems we have met, and proposed several ways of optimizing the VOE process based on our first-hand experiences.

7. REFERENCES

[1] ISO/IEC International Standard 13818. Generic coding of moving pictures and associated audio information. 1994.
[2] C. Bezzan. High definition TV: its history and perspective. Telecommunications Symposium, 1990 (ITS '90) Symposium Record, SBT/IEEE International, 1990.
[3] A. Dutta-Roy. An overview of cable modem technology and market perspectives. IEEE Communications Magazine, Volume 39, Issue 6, June 2001, pages 81-88.
[4] Emerging high-speed xDSL access services: architectures, issues, insights, and implications. IEEE Communications Magazine, Volume 37, Issue 11, November 1999, pages 106-114.
[5] J. Spragins. Fast Ethernet: Dawn of a New Network [New Books and Multimedia]. IEEE Network, Volume 10, Issue 2, March-April 1996, page 4.
[6] Gigabit Ethernet. Tutorial Guide: ISCAS 2001, The IEEE International Symposium on Circuits and Systems, 2001, pages 9.4.1-9.4.16.
[7] Y. Noguchi, D. G. Messerschmitt and S.-F. Chang. MPEG video compositing in the compressed domain. Circuits and Systems, 1996 (ISCAS '96), 1996 IEEE International Symposium on, Volume 2, 1996.
[8] J. Meng and S.-F. Chang. Embedding Visible Watermark in Compressed Video Stream. Proceedings, 1998 International Conference on Image Processing (ICIP '98), Chicago, Illinois, October 1998.
[9] Jongho Nang, Ohyeong Kwon and Seungwook Hong. Caption processing for MPEG video in MC-DCT compressed domain. Proceedings of the 8th ACM International Conference, October 2000.
[10] D. Thompson. IEEE 1394: changing the way we do multimedia communications. IEEE Multimedia, Volume 7, Issue 2, April-June 2000, pages 94-100.
[11] D. Tennenhouse, J. Smith, W. Sincoskie, D. Wetherall and G. Minden. A Survey of Active Network Research. IEEE Communications Magazine, Vol. 35, No. 1, pages 80-86, January 1997.
[12] Wei-Ying Ma, Bo Shen and Jack Brassil. Content Services Network: the Architecture and Protocols. The 6th International Workshop on Web Caching and Content Distribution, pages 83-101, Boston, June 2000.
[13] GAIA: Active Spaces for Ubiquitous Computing. http://devius.cs.uiuc.edu/gaia/.