AVPUC: Automatic Video Production with User Customization

Bin Yu, Klara Nahrstedt
Department of Computer Science, University of Illinois at Urbana-Champaign,
Siebel Center, 201 N. Goodwin, Urbana, Illinois 61801, U.S.A.
binyu, [email protected]

ABSTRACT

Nowadays, multiple video cameras are employed for live broadcast and recording of almost all major social events, and these camera streams are aggregated and rendered into one video program for the audience. While this content composition process aims at presenting the most interesting perspective of an event, it raises the problem of how to customize the final video program to different audience interests without requiring too much input from the audience. The goal of this work is to address this problem with the Automatic Video Production with User Customization (AVPUC) system, which separates video stream interestingness comparison from video program production to provide room for maximal customization. Human-controlled video selection and automatic video evaluation are combined to support video content customization while reducing redundant audience input. Preliminary evaluation results confirm that AVPUC's capture-evaluation-render model for video production improves audiences' satisfaction with customized multi-perspective viewing of social events.

1. INTRODUCTION

Nowadays, multiple video cameras are employed for live broadcast and recording of almost all major social events, such as sports games, artistic performances, fashion shows, major ceremonies, or remote conferences. Each camera is either manually operated by a professional videographer or automatically controlled by computer programs to track a particular person/object or monitor a particular aspect of the event from a specific view angle. However, with traditional content production systems, the original video streams captured by these cameras are not directly accessible to TV audiences. They are all sent to the TV studio, where the producer crew decides which stream(s) are sent out to the audience and when. Production rules [3] have been developed over the years on how this decision should be made to avoid low quality video, such as the triangle principle for shooting dialogues. On top of these rules, the producer crew adds its subjective judgment on which stream(s) are most relevant to the event and convey the most interesting information to the audience. As a result, the final program presented to users is a composition of video segments cut from different camera streams at different time periods.

One obvious problem with this production process is that audiences cannot customize the program they watch: although streams from multiple view angles are captured, a user cannot specify the view angle he wants at a particular time. For example, consider the broadcast of a football game between Illinois and Indiana. There are many places where users may want to customize the program: audiences from each state may want to view their own players more often than the other team; the favorite player may differ within each group; young viewers may prefer scenes with more action and close-up views, while senior viewers may want more overview shots. The same problem exists in video conferencing applications between distributed groups. For instance, assume a group of people in a meeting is transmitted to remote participants.
One person may be drawing on the whiteboard, and others may be sitting around the table, asking the speaker questions or conversing with each other. Such a scenario cannot be fully captured with one camera stream, yet when multiple cameras are set up, different remote participants may want to view from different angles. For example, some may want to see the speaker's image, some may want a close-up view of the drawing on the whiteboard, while others may be interested in the reactions of the people sitting around the table. Therefore, we are faced with the problem of composing these streams so that a customized set of streams is presented to each remote participant.

Motivated by these problems, we propose the Automatic Video Production with User Customization (AVPUC) system, which composes multiple camera streams depicting the same event from multiple perspectives into a video program that is customized to each audience's preferences. While there are many research issues associated with such a content capturing, distribution and composition system, our contributions focus on three specific goals:

1. New Capture-Evaluation-Render Model. A new production model is needed to open up the black box of single-point, TV-studio-like production, so that more original video footage becomes available to the audience and the camera stream selection/composition process can be better customized to user preferences. Specifically, we need to answer questions such as where the stream selection decision is made, and when and how the actual stream composition should be done.

2. Semi-Automatic Video Production. We aim at designing a video program production algorithm so that each audience member receives a customized program. A balance has to be kept between computer-assisted automation and user customization during the production process: users should be able to control what they get by issuing commands via the user interface, yet they should not be buried in tedious, routine interaction operations. Therefore, an integrated framework is needed that combines production rules and human input in producing the final video program, so that audiences can exercise control as they wish without being distracted from enjoying the content.

3. New User Interface Design. Most people are accustomed to the TV remote control interface with a few number keys and navigation buttons, and users in a video conference interact with the video streams even less. Yet user input is very important for customizing the video selection process. Therefore, we need to design a user interface that can be easily mastered by any user, yet is powerful enough for users to express their preferences.

In addition, we also want to point out the research issues that fall outside the scope of this work:

1. Network Bandwidth Constraint. In this work, we assume that the transmission channel's bandwidth is sufficient for streaming the video streams and that the delay is acceptable for real-time interaction. This assumption is already valid for cable modem and xDSL users at home with connections of up to 3-4 Mbps, and we believe the bandwidth bottleneck will eventually be removed by future infrastructure upgrades and technology advances.

2. Automatic Video/Audio Analysis. We also assume that video/audio analysis technologies allow us to perform certain analyses on the captured video/audio streams, such as object tracking and identification. With the help of context information, such tasks can already be done using visual and aural analysis techniques. For example, in [21], the authors present algorithms to track the positions of the players and the soccer ball using a single camera stream, and can even tell whether a player is holding the ball. In [16], human motion is identified in video shot by a single camera. In [20], audio analysis locates the speaker in a recorded meeting.

The rest of this paper is organized as follows. In the next section we discuss how this work fits into the related work, Section 3 presents the AVPUC solution, Section 4 describes our implementation and evaluation, and Section 5 concludes our contributions and discusses future directions.

2. RELATED WORK

2.1. Interactive Digital TV

Interactive Digital TV (iTV for short) is an emerging TV service provided in recent years by service providers such as OpenTV [17] and MSN TV [14]. Typical services include overlay of information related to the current video content, Email and Web access, electronic games, content recording, etc., which are not available in traditional TV services. Most relevant to our work, iTV allows the audience to choose view angles. For example, with the OpenTV service, the audience can choose from four or eight camera views when watching Australian rugby [17]. Despite iTV's popularity, the current multi-view service is still very limited due to the architecture of the TV service provision system. Figure 1 illustrates how typical iTV providers set up their network to provide such multi-angle viewing service when broadcasting a dance performance. All camera streams are sent to the TV studio, where a human editor selects 4 or 8 streams and composes them with the split-screen method. The composite stream is then sent to the audience, who can select a particular view via a user control device. We can see several places where improvement can be made: a) only a few (typically four) candidate streams are available, and they are selected in the TV studio rather than according to audiences' preferences; b) the user interface only allows the audience to choose one stream out of the candidates, so not much user preference can be expressed; and c) no automation is available to help the audience in switching streams, which becomes a real problem when the number of candidate streams is larger. In Section 3, we discuss how the AVPUC system addresses these problems.

Figure 1. Typical iTV Service Provision Setup: single camera angle streams are sent to the TV studio, where a human editor and switch compose four of them into a split-screen stream; the composite stream reaches the audience's set-top box or PC over cable, the Internet, or satellite, and user control feedback selects one of the views.

2.2. Aggregation of Multiple Correlated Video Streams

There has been continuing interest over the last decade in aggregating multiple correlated video streams depicting the same event into a single video program, and three directions have been studied most heavily.

One direction is to simply combine all streams, or the key objects (such as a talking head) extracted from every stream, into one video frame with a split-screen layout [1, 3, 11, 15]. The advantage of this approach is that each viewer sees all camera streams at the same time and gets an overview of the captured activity. However, because the result is multicast to all, a user cannot customize the received stream. Also, because of the limited screen space and bandwidth, only a few (normally fewer than 4) streams can be shown at acceptable quality; more streams would require video wall displays and very high speed connectivity.

The second approach relies on reconstructing a 3D virtual world from multiple camera streams and then rendering a particular perspective on user demand [8, 9, 18]. The benefit of this method is that users get a sense of tele-immersion by selecting their preferred views within the 3D virtual world, such as the angle and distance from which they want to watch. However, one obvious problem is that a large amount of computation and bandwidth is needed to render realistic views. For example, as reported in [8], with more than 5 camera streams the system can only support fewer than 5 frames per second. Moreover, the reconstruction normally requires a large number of camera streams as input, and these cameras have to be precisely calibrated, which greatly increases the setup overhead of such systems. We argue that such high-fidelity tele-immersion may not justify its setup overhead and resource cost, and that camera streams shot from the most likely view angles already achieve the level of customization necessary for most applications.

The third approach is to capture human activities by automating camera control and then selecting the best stream based on videography principles given by professional videographers [18, 7, 13, 22]. For example, Polycom [18] employs voice-based automatic speaker tracking to control a pan-tilt-zoom camera. The advantage is that human labor is saved and recording daily activities becomes simple and cheap, and the composition operations are less expensive than in the two approaches above in terms of computation and bandwidth. However, the reported systems have only been tested on simple scenarios such as lecture presentations and meetings. The authors believe that with today's video/audio analysis technology, purely automated camera management and video production still cannot render satisfactory programs for more complex scenarios such as sports games and art performances, and some degree of human involvement from the users remains necessary.

In summary, with AVPUC we want to combine the advantages of these approaches while avoiding their shortcomings. In particular, we adopt the video composition approach to reduce resource and setup requirements, and we combine users' input with automated stream selection based on videography principles to render a professional video program. Moreover, AVPUC supports each user in further customizing the resulting stream for himself, using the system-generated video program as a starting point, which is not available in any of the aforementioned systems.

2.3. Sharing of a Panoramic Camera

Another set of related work studies how to share a panoramic camera stream between multiple users; example projects are FLYSPEC [2] and ShareCam [6]. Their research is primarily about how to share access to the critical resource (the panoramic camera stream) among multiple users, and they propose combining the majority of users' requests in controlling the camera. In AVPUC, the camera streams are multicast to all subscribing users, so there is no similar problem of exclusive access. Instead, each remote user controls which camera streams to subscribe to and how these streams are composed, based on his personal preference.

3. PROPOSED SOLUTION

In this section, we first describe the general capturing-evaluation-composition model of the AVPUC system, then discuss the internal video scoring mechanism, and finally describe how users input commands via the user interface to customize the composed program.

3.1. Capturing-Evaluation-Composition Model of the AVPUC System

Figure 2. Architecture of the AVPUC System: Content Capturing Stations perform automatic feature analysis and manual annotation and multicast single camera angle streams combined with meta-data streams over the transmission channel; Content Evaluation Stations combine automatic and manual scoring and publish rating streams; each Personal Production Station performs stream selection and video rendering driven by user input and returns user control feedback.

Figure 2 shows the proposed architecture of the AVPUC system, using the example of broadcasting a dance performance. Overall, the AVPUC system is composed of four categories of components:

3.1.1. Content Capturing Station (upper left in Figure 2)

A content capturing station is an independent service provider with the purpose of a) capturing video content and b) providing meta-data that describes the features of the captured content.

For the content capturing task, each content capturing station sets up its own set of cameras at the scene being captured. Dedicated videographers may operate these cameras on site or via remote controls, or the cameras may be automatically controlled by computer programs. For the feature extraction task, each captured video/audio stream is first analyzed by computer programs and then further annotated by human operators to extract useful meta-data about the stream. A feature can be a Boolean or Scalar value answering a particular question about each data unit of the video/audio stream, and for different types of content the features may be defined differently. For example, for a dance performance, a Boolean feature may be whether a particular dancer is well framed in the camera view, or whether he is facing this camera in this video frame, and a Scalar feature may be how many people are included in this camera view. For a group conference, on the other hand, a Boolean feature may be whether the person in this camera view is speaking, and a Scalar feature may be the volume of the speaking person. Note that some desired features can be set easily by a human operator in real time, yet cannot be accurately evaluated with today's computer technology. We therefore envision that at the beginning of AVPUC deployment both computer-based and manual feature annotation will be used, and that human labor can be reduced as video analysis technology improves. The user interface of the capturing station is discussed in Section 3.3. Note that a feature is a timed variable because its value may change over time, so the features together form a meta-data stream of their own; MPEG-7 [9] is well suited to encoding this set of feature descriptions. The meta-data stream is generated in real time as the content is captured and coupled with its corresponding video/audio stream at the Content Capturing Station for multicast over the transmission channel. Also note that all the camera streams are open to public subscription, by either content evaluation stations (described below) or ordinary audiences, which is not the case in today's TV industry. In addition, as shown in Figure 2, multiple content capturing stations may capture the same event, competing on their special view angles and on the quality of their annotations in terms of accuracy, usefulness and response time.

3.1.2. Transmission Channel (center in Figure 2)

The transmission channel serves three purposes in the AVPUC system:

1. The raw video content, plus the features in the meta-data stream, from the Content Capturing Stations needs to be multicast to all subscribing Content Evaluation Stations and Personal Production Stations.
2. The rating streams from Content Evaluation Stations need to be multicast to all subscribing Personal Production Stations.
3. The feedback streams from the Personal Production Stations may need to be sent to Content Capturing Stations and Content Evaluation Stations so that they can analyze user preferences.

Different distribution media, such as cable networks, the Internet or satellite, may be used in combination in the AVPUC system to serve these three purposes. Note that feedback explosion may occur at Content Capturing Stations and Content Evaluation Stations when there is too much incoming user feedback, and we assume some feedback aggregation proxy network will be employed to solve this problem.
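To make these data flows concrete, the following is a minimal C++ sketch of what the per-frame data units on the transmission channel might carry; the struct and field names are our own illustration (the paper itself suggests MPEG-7 for encoding the feature descriptions and does not prescribe an internal layout).

    #include <string>
    #include <variant>
    #include <vector>

    // Illustrative (hypothetical) in-memory layouts for the three kinds of data
    // units carried over the transmission channel; field names are our own.
    struct Feature {
        std::string name;                  // e.g. "number of dancers in view"
        std::variant<bool, double> value;  // Boolean or Scalar feature value
    };

    struct FeatureSample {                 // one element of a meta-data stream (3.1.1)
        int streamId;                      // which camera stream it annotates
        long frameNumber;                  // features are timed variables
        std::vector<Feature> features;
    };

    struct RatingSample {                  // one element of a rating stream (3.1.3)
        int streamId;
        long frameNumber;
        double interestScore;              // interest score published by the evaluation station
    };

    struct FeedbackMessage {               // user control feedback sent upstream (3.1.2)
        int audienceId;
        int preferredStreamId;             // e.g. the stream the user "nailed down"
    };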
3.1.3. Content Evaluation Station (bottom in Figure 2)

Content evaluation stations provide the service of evaluating the content streams from content capturing stations and publishing rating streams corresponding to those content streams. Each rating stream is a sequence of interest scores, one for each data unit of the corresponding video/audio content stream, where the interest score represents the interest or importance level of a stream segment. In other words, the higher a stream scores, the more likely it is that this stream should be selected when rendering the program to the audience (the exact rendering process is described in subsection 3.2). The scores are normalized into the range [0, 1], and this range is consistent across all content evaluation stations. The scores are generated by automatic analysis and by human evaluators; details of this scoring process are given in subsection 3.2. The published rating streams are subscribed to by audiences and used to decide which camera streams should be selected for rendering. This decoupling of content stream comparison from mixing defers the actual stream filtering and composition to each audience's personal computing device, which allows maximal customization of the finally rendered program.

3.1.4. Personal Production Station (upper right in Figure 2)

Each audience member owns a production station that receives the content and meta-data streams from the Content Capturing Stations and the rating streams from the Content Evaluation Stations, and performs the tasks of modifying the rating scores based on the audience's preferences, comparing the streams by their interest scores, and composing a final video program rendered on the user interface. Other controls may also be rendered on the user interface for users to input commands; details of the user interface design are discussed in subsection 3.3. The production decision (which stream(s) are shown, and when) is essentially based on the feature meta-data streams, the rating streams, and user input. The goal is to select the stream that is most desirable to the audience. In the next subsection, we present in detail how the rating scores are derived and how the production decision is made, and in subsection 3.3 we describe how audiences can interact with the personal production station user interface to customize the final program they see on their screens.

3.2. Semi-Automatic Video Program Production

In the AVPUC system, video program production is divided into two steps: generation of stream interest scores, and composition of high scoring streams into the video program.

3.2.1. Generation of Interest Scores

As described above, in the AVPUC system the interest or importance level of a particular stream segment is represented by an interest score. We define six kinds of interest scores; their relation is shown in Figure 3.

Figure 3. Relation of Interest Scores: the Automatic Evaluator Score and Manual Evaluator Score combine into the Final Evaluator Score, which together with the Automatic Audience Score and Manual Audience Score yields the Final Audience Score.

- Automatic Evaluator Score SAE – the interest score given by evaluation rules defined by human evaluators for each frame (including video and audio data) in each stream at the content evaluation station. Formally, assume that for a set of streams, n Boolean features and m Scalar features are defined, denoted fB1, fB2, ..., fBn and fS1, fS2, ..., fSm respectively. A weight is set at the content evaluation station for each Boolean feature, denoted wB1, wB2, ..., wBn. For each Scalar feature fSj, whose value lies between Minj and Maxj, a set of kj non-overlapping ranges is defined between Minj and Maxj: [rj0 = Minj, rj1], [rj1, rj2], ..., [rj(kj-1), rjkj = Maxj], and the l-th range of the j-th Scalar feature has a corresponding weight wSjl. Each weight wBi or wSjl is an integer in the range [0, MAX_WEIGHT], where MAX_WEIGHT is a predefined threshold. Since all weight values have an upper bound, the value of SAE over all frames in all streams evaluated by the same content evaluation station also has an upper bound MAX_CES, against which SAE is normalized into the range [0, 1]. SAE for a particular frame F is then given by the algorithm in pseudo code below. Intuitively, camera streams differ in their feature values, and by assigning higher weights to a particular set of features, the SAE of some streams becomes higher than that of others. Note that calculating the SAE value for every frame would lead to much redundant computation, because the feature values normally change only every tens or hundreds of frames. Therefore, in our implementation we only check whether any feature value has changed and then adjust SAE accordingly (a code sketch of this computation appears after the list of score definitions below).

Algorithm for Calculating SAE for Frame F
    SAE = 0;
    For each i from 1 to n
        If (fBi is true for frame F) Then
            SAE = SAE + wBi;
        End If
    End For
    For each j from 1 to m
        For each l from 1 to kj
            If (frame F's fSj falls in the range [rj(l-1), rjl]) Then
                SAE = SAE + wSjl;
            End If
        End For
    End For
    SAE = SAE / MAX_CES;
    Output SAE;

- Manual Evaluator Score SME – the interest score assigned by the human evaluator at the content evaluation station to adjust the automatic evaluator score for each frame (including video and audio data) in each stream. As discussed above, this is still necessary because humans can capture many interesting or important perspectives that are not easily identified by computers. For the dance performance example, a human evaluator may find a particular view angle to be very "visually pleasant", which is hard to quantify in computer automation. SME is therefore designed to let evaluators add their subjective opinions easily. SME is also in the range [0, 1], and we describe how the evaluator can adjust the SME value via the newly designed user interface in subsection 3.3.

- Final Evaluator Score SFE – the final interest score published by the content evaluation station for each frame (including video and audio data) in each stream. SFE is given by the equation SFE = SAE + SME. SFE is updated when SAE changes because of a feature value change or when SME is updated by the human evaluator, and its value falls in the range [0, 2].

- Automatic Audience Score SAA – the interest score generated by automatic analysis of a video/audio stream based on audience-defined rules, for each frame (including video and audio data) in each stream, at the audience's personal production station. This score is generated with the same algorithm used for the Automatic Evaluator Score SAE; the only difference is that the weights are set by the audience instead of the evaluator. It is intended only for advanced audiences who want to let the computer system automate the customization of their video program.

- Manual Audience Score SMA – the interest score assigned by the audience at the personal production station to adjust the Automatic Audience Score and the Final Evaluator Score for each frame (including video and audio data) in each stream. This is the instrument each audience member uses to express his personal preference when customizing his own video program, and we describe how this is done via the newly designed user interface in subsection 3.3.

- Final Audience Score SFA – the final interest score used to compare video frames across video streams and select the streams to be composed into the final video program. SFA is given by the equation SFA = SFE + SAA + SMA. SFA is updated when SFE is updated at the content evaluation station, when SAA changes because of a feature value change, or when SMA is updated by the human audience, and its value falls in the range [0, 4].
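As a concrete illustration of this scoring pipeline, here is a minimal C++ sketch, assuming the features and weights are kept in simple in-memory structures; the type and function names are ours, not those of the AVPUC implementation.

    #include <vector>

    // Hypothetical in-memory representation of the features and weights used for scoring.
    struct BooleanFeature { bool value; int weight; };            // weight wBi in [0, MAX_WEIGHT]
    struct ScalarRange    { double low, high; int weight; };      // range [rj(l-1), rjl] with weight wSjl
    struct ScalarFeature  { double value; std::vector<ScalarRange> ranges; };

    // Automatic score SAE (or SAA, when the audience sets the weights) for one frame,
    // following the pseudo code in 3.2.1; MAX_CES normalizes the result into [0, 1].
    double automaticScore(const std::vector<BooleanFeature>& fB,
                          const std::vector<ScalarFeature>& fS,
                          double MAX_CES) {
        double s = 0.0;
        for (const auto& b : fB)
            if (b.value) s += b.weight;
        for (const auto& f : fS)
            for (const auto& r : f.ranges)
                if (f.value >= r.low && f.value <= r.high) s += r.weight;
        return s / MAX_CES;   // in practice only recomputed when a feature value changes
    }

    // Score combination: SFE in [0, 2] at the content evaluation station,
    // SFA in [0, 4] at the personal production station.
    double finalEvaluatorScore(double sAE, double sME) { return sAE + sME; }
    double finalAudienceScore(double sFE, double sAA, double sMA) { return sFE + sAA + sMA; }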

In summary, two similar steps happen at both the content evaluation stations and the personal production stations: a) generation of an automatic score based on the content stream's feature meta-data and the pre-defined feature weights, and b) specification of a manual score by a human operator (an evaluator or an advanced audience member). The only difference between the two stations is whether the feature weights and the manual score are set by a professional evaluator or by an audience member. The professional evaluator needs to consider the common interest of all the audiences subscribing to his service. For example, when broadcasting a football game between Illinois and Indiana, a content evaluation station can label itself as an Illinois Fan TV Service and give more preference to the camera streams tracking players from Illinois. In addition, the evaluator is more dedicated to the stream selection work, such as looking for more interesting scenes by inspecting multiple camera streams, or changing the weights more frequently to reflect subtle changes in the camera streams so that more interesting streams get higher scores. An audience member, on the other hand, is focused on enjoying the video content, but still wants to customize the video program based on the evaluator's suggestions by lightly interacting with the user interface. This difference in the level of involvement with the video composition process leads to different user interface designs at the content evaluation station and the personal production station, as described in subsection 3.3. For now, we assume the Final Audience Score SFA is already given for a set of streams, and discuss how it can be used for video program composition at the personal production station.

3.2.2. Score-based Video Composition

In the video composition process, we need to answer the following questions: which streams should be presented to the audience at a particular time point, and, when more than one stream is to be presented, how they should be laid out. Based on the answers, we select the rendering method. We adopt several example rendering methods that TV audiences are already familiar with, as shown in Figure 4.

Figure 4. Rendering Methods Used in AVPUC: Single, PiP, Up-Down, 2-PiP, and Split Screen for 4.

The AVPUC system decides which method to use in the following way. Assume we already have the final audience scores for n streams, and that the five top ranking streams are streams 1, 2, 3, 4 and 5, with corresponding scores SFA-1, SFA-2, SFA-3, SFA-4 and SFA-5. The selection algorithm used by the personal production station is given below:

Selection Algorithm for Deciding Which Stream(s) to Render
    If (SFA-1 / SFA-2 > 2 and SFA-2 / SFA-3 > 2) Then
        /* PiP is used when two streams' scores are much higher than the others,
           yet these two streams' scores also differ greatly */
        Method = "PiP";
    Else If (SFA-1 / SFA-2 < 2 and SFA-2 / SFA-3 > 2) Then
        /* Up-Down is used when two streams' scores are much higher than the others,
           and these two streams' scores are close to each other */
        Method = "Up-Down";
    Else If (SFA-1 / SFA-2 > 2 and SFA-2 / SFA-3 < 2 and SFA-3 / SFA-4 > 2) Then
        /* 2-PiP is used when three streams' scores are much higher than the others,
           yet the top ranking stream's score is also much higher than the second and third streams' */
        Method = "2-PiP";
    Else If (SFA-1 / SFA-2 < 2 and SFA-2 / SFA-3 < 2 and SFA-3 / SFA-4 < 2 and SFA-4 / SFA-5 > 2) Then
        /* Split Screen for 4 is used when four streams' scores are much higher than the others */
        Method = "Split Screen for 4";
    Else
        /* Single is used by default */
        Method = "Single";
    End If
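The same selection rule can be written compactly in C++; this is a sketch with names of our own, and the threshold of 2 is the illustrative constant from the listing above.

    #include <algorithm>
    #include <functional>
    #include <string>
    #include <vector>

    // Sketch of the composition-method selection above; `sFA` holds the final
    // audience scores of all streams, and only the five highest matter here.
    std::string selectCompositionMethod(std::vector<double> sFA) {
        std::sort(sFA.begin(), sFA.end(), std::greater<double>());
        sFA.resize(5, 0.0);                                  // pad if fewer than 5 streams
        auto much = [](double a, double b) {                 // "a is much higher than b"
            return b > 0.0 ? (a / b > 2.0) : (a > 0.0);
        };
        if (much(sFA[0], sFA[1]) && much(sFA[1], sFA[2]))
            return "PiP";                                    // two dominant streams, far apart
        if (!much(sFA[0], sFA[1]) && much(sFA[1], sFA[2]))
            return "Up-Down";                                // two dominant streams, close scores
        if (much(sFA[0], sFA[1]) && !much(sFA[1], sFA[2]) && much(sFA[2], sFA[3]))
            return "2-PiP";                                  // three dominant streams
        if (!much(sFA[0], sFA[1]) && !much(sFA[1], sFA[2]) &&
            !much(sFA[2], sFA[3]) && much(sFA[3], sFA[4]))
            return "Split Screen for 4";                     // four dominant streams
        return "Single";                                     // default
    }

Applied to the score columns of Table 1 below, this function reproduces the composition methods listed there.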

Table 1 gives a few example score values and the corresponding composition methods used:

    Case     SFA-1   SFA-2   SFA-3   SFA-4   SFA-5   Composition Method
    Case 1   3.2     1.5     0.6     0.5     0.4     PiP
    Case 2   3.1     2.3     1.1     0.9     0.8     Up-Down
    Case 3   2.8     1.3     1.2     0.4     0.3     2-PiP
    Case 4   3.3     2.9     2.7     2.5     1.1     Split Screen for 4
    Case 5   3.5     1.5     1.2     1.1     1.0     Single

Table 1. Example Final Audience Scores and Corresponding Composition Methods

Note that the threshold constant may be set differently to achieve different composition effects; the constant threshold values used in the algorithm above only illustrate how the AVPUC system determines which composition method to use. There can also be many more composition methods, but we only want to illustrate the idea of AVPUC with the five simple methods above. The rules on which method should be used under what conditions may also differ, which will be the subject of future user studies.

3.3. User Interface Design

3.3.1. Content Capturing Station User Interface

As shown in Figure 5, the user interface at the content capturing station is quite simple. The preview of the camera stream is shown on the left, and the automatically extracted features and manually defined features are listed on its right. The user interface also allows the human operator to add/remove features (via the "New Features…" button) and to set feature values (via the up and down triangle buttons beside the feature values).


Figure 5. Content Capturing Station User Interface
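The annotation operations behind this interface (adding or removing features and nudging feature values) can be sketched as a small panel object; this is a hypothetical illustration with names of our own, assuming the features are kept in a simple name-to-value map.

    #include <map>
    #include <string>
    #include <variant>

    // Hypothetical sketch of the annotation state behind the capturing-station UI.
    struct AnnotationPanel {
        std::map<std::string, std::variant<bool, double>> features;  // current feature values

        void addFeature(const std::string& name, std::variant<bool, double> initial) {
            features[name] = initial;                  // "New Features..." button
        }
        void removeFeature(const std::string& name) { features.erase(name); }

        void nudge(const std::string& name, double step) {            // up/down triangle buttons
            if (auto it = features.find(name); it != features.end())
                if (auto* v = std::get_if<double>(&it->second)) *v += step;
            // Boolean features would be set through an analogous control.
        }
    };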

3.3.2. Evaluator User Interface and Generation of Final Evaluator Score SFE


Figure 6. Evaluator User Interface

Figure 6 shows the evaluator user interface. The screen resolution is assumed to be 1024 by 768 pixels. At the bottom, 10 (or more, if more camera streams are available) camera streams are shown as image sequence icons (160 by 120 pixels) that are updated periodically (e.g., every half second). Each icon is assigned a sequence number shown in its top left corner. At any time, one of these icons can be selected by a left mouse click or by pressing the number key of that icon's sequence number. The selected icon is marked by a dark frame around it, and the corresponding camera stream is played back in full motion (e.g., 30 frames per second) in a larger window (320 by 240 pixels) at the top left. To the right of this playback window, the names and values of the features of the selected camera stream (annotated at the content capturing station) and the feature weights and value ranges (assigned by the evaluator at the content evaluation station) are displayed. The evaluator can click the up and down triangles beside each feature weight to increase or decrease it. For example, in Figure 6, the seventh feature is "Number of dancers in view", and the feature value for the current frame is "9". It is a Scalar feature with two value ranges, [0, 1] and [2, 10], with corresponding weights of 8 and 1; this setting indicates a preference for close-up shots showing only one dancer. Note that the evaluator cannot change the features or their values, because they are set at the content capturing station, but he can change the weights and the value ranges of Scalar features (not shown in this figure) to control the generation of the Automatic Evaluator Score. In addition, the evaluator interface allows the evaluator to change the Manual Evaluator Score in the following ways:

1. Left click on a selected icon: increases that stream's manual score by a predefined amount, such as 0.1.
2. Right click on a selected icon: decreases that stream's manual score by a predefined amount, such as 0.1.
3. Double left click on a selected icon: "nails down" that stream by setting its manual score to 1.0 and resetting the other streams' manual scores to 0. Because the automatic score is at most 1.0, this ensures that the nailed stream has the highest Final Evaluator Score.

Any other operation with the user interface resets the manual scores of all streams to their previous values. Overall, the evaluator's task is to set the weights of the automatically recognized features, and to monitor the streams and manually adjust the scores based on his judgment as he watches them being played back.
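The three mouse operations on the Manual Evaluator Score can be sketched as follows; this is a minimal illustration with function names of our own, using the step of 0.1 and the [0, 1] clamping described above.

    #include <algorithm>
    #include <vector>

    // Sketch of the Manual Evaluator Score (SME) updates triggered from the evaluator UI.
    constexpr double STEP = 0.1;

    void leftClick(std::vector<double>& sME, int selected) {
        sME[selected] = std::min(1.0, sME[selected] + STEP);   // raise the manual score
    }

    void rightClick(std::vector<double>& sME, int selected) {
        sME[selected] = std::max(0.0, sME[selected] - STEP);   // lower the manual score
    }

    void doubleLeftClick(std::vector<double>& sME, int selected) {
        std::fill(sME.begin(), sME.end(), 0.0);   // "nail down": clear all other streams
        sME[selected] = 1.0;                      // SAE is at most 1.0, so this stream's SFE is highest
    }

The "nail down" operation in the audience's basic interaction mode (Section 3.3.3) works analogously, except that it sets both the Automatic and Manual Audience Scores of the chosen stream to 1.0.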

3.3.3. Audience User Interface

The user interface at the audience's personal production station has three modes: watching mode, basic interaction mode and advanced interaction mode. The watching mode is designed for passive audiences who do not want to do anything but watch the given video program as determined by the content evaluation station's ratings. The user interface therefore shows nothing but the rendered video program in full screen; no figure is shown here to save space.

Figure 7. Basic Interaction User Interface at Personal Production Station

Figure 7 shows the basic interaction mode of the audience user interface. The final video program is rendered at the top center in high resolution (640 by 480 pixels), and below it is a list of icons representing the 5 currently top scoring streams, ordered from highest scoring on the left to lowest scoring on the right. As in the evaluator user interface, these icons are assigned sequence numbers written in their top left corners. In this mode, the automatic audience score is simply set to 0, and the rendered program is determined by the rating streams from the content evaluation station plus audience input. Because we assume an ordinary audience member will only use a TV-like remote control to input commands in the basic interaction mode, user input takes the form of button presses only. Specifically, the "nail down" operation is designed as follows: when the user presses a number key on the remote control, both the Automatic Audience Score and the Manual Audience Score of the corresponding stream are set to 1.0 and those of the other streams are reset to 0, so that this stream is selected for rendering. When the "reset" key is pressed, the scores are restored to their original values.

The advanced interaction mode of the personal production station user interface is the same as that of the content evaluation station, because an advanced audience member essentially does the same task as the evaluator: setting up detailed feature weights to affect the Automatic Audience Score SAA, and inspecting all camera streams and changing the Manual Audience Score SMA to reward interesting streams. The only difference is that the resulting score is customized to the audience's personal taste. Overall, the resulting video program is a combined decision based on the set of feature values provided by the content capturing station, the evaluator's and audience's settings of feature weights, and their subjective judgments of the interestingness of the streams, input through the user interface.

4. IMPLEMENTATION AND EVALUATION

4.1. Implementation Details

We have implemented the AVPUC system on our Local Area Network in Microsoft Visual C++. For the video content, because it is very difficult for us to acquire multiple streams monitoring a real-world event (e.g. a dance performance or a soccer game), we currently resort to a 3D animation program called D-player [5] that was originally built for street dance training. It allows users to select arbitrary view angles, so we captured two dance clips, each consisting of 10 image sequences from 10 view angles for 3 minutes, as already shown in Figure 6. Each captured video stream is 640x480 pixels in resolution, 30 frames per second, MJPEG encoded, with a bitrate of 8.8 Mbps per stream. The feature meta-data stream is currently manually annotated with the content capturing user interface to demonstrate the idea, and we are in the process of incorporating video analysis tools into the system for automatic feature analysis.

4.2. Performance Test

We use a desktop PC with a Pentium IV 3.2 GHz processor and 1 GByte of memory to store the 10 video streams and to act as both the content capturing station and the content evaluation station. Another PC with the same configuration is used as the personal production station. The computational cost of video composition is obviously low, because all the editing operations can be done by replacing some pixels in one video frame with pixels from another frame, as opposed to reconstructing a virtual 3D world and generating pixel values for virtual view points. Our system does have the additional cost of transmitting all the icon streams, especially for the evaluator user interface. Overall, our current system supports real-time video streaming and composition at 30 frames per second at both the evaluator and audience stations, while according to the latest statistics given in [8] only 6 frames per second can be supported for a session of size 4 (4 streams). The average response time from a user input to the change in the final video program (if any) is less than 0.2 second, which reflects the simplicity of the score-based production framework.

4.3. Usability Study

For the usability study, the authors used recorded camera streams of the dance and manually labeled meta-data streams. The content evaluation streams were also pre-calculated by the authors and stored. We invited 12 volunteer viewers to use the AVPUC system's personal production station interface to watch the dance clip and answer some survey questions. All viewers were graduate students in the engineering school, aged between 22 and 30, 8 male and 4 female. 80% of the users answered positively to the question "Is the experience better or worse than watching TV?", while the majority answer to the "most desired improvement" question was "more key shortcuts". One subject commented that "it is like making and watching a TV show at the same time…" The first lessons we learned from the user study are:

1. The amount of user interaction is closely related to how interested the viewer is in the presented content.

2. Because the user interface is new to most users, whose previous experience is with interfaces for TV watching or video games, it takes some time for users to become familiar with the interface and to interact with it actively.

3. The amount of user interaction is also related to the convenience of the user interface. For example, one of the primary interactions in the basic interaction mode is the setting of the manual audience score. Initially this was done with the mouse, and several users found it inconvenient. We later changed all interactions to be triggered by pressing buttons, and later users performed the interactions much more frequently than earlier users.

We acknowledge that these experiments are still preliminary. We are in the process of setting up a conference room equipped with 10 cameras to capture meetings from various angles. We will then use the AVPUC system for live transmission and recording of meetings, which will allow us to report more quantitative evaluation results of our system soon. We will also invite viewers from different groups to use the user interface and collect their feedback.

5. CONCLUSION

This paper presents the Automatic Video Production with User Customization system, which combines automatic, analysis-based video evaluation with human judgment input through the user interface to aggregate multiple camera streams into a video program. Our major contribution is a new content capturing-evaluation-composition architecture, with score-based video production and an innovative user interface, that maximizes the level of user customization while minimizing unnecessary user interaction. We are now working in the following directions to improve it:

- Group conferencing experiment. We are installing 15 cameras in a conference room with a round table and several plasma displays, and we plan to capture regular meetings for live broadcast and on-demand viewing. A more mature usability evaluation will be performed.

- New production algorithms. The current production algorithm is very simple, and we want to enhance it by considering a group of audiences' preferences. The comparison of multiple camera streams based on extracted features is also very similar to the web document retrieval problem, and we will borrow ideas from that field to exploit the relations between camera streams (such as the relation between camera positions). In addition, the audiences' input may be used by the content evaluation stations to adapt their settings, and the content capturing stations may even use it to automate camera control.

- User interface design. We will continue to improve the user interface design so that users can express more preferences in a more convenient way. When multiple displays are available, we also need to adapt the AVPUC user interface to take full advantage of them. Moreover, we will explore VCR-like functionality.

- Streaming protocol and coding standard. There is a lot of work to do in terms of streaming protocol design and video/audio coding to save network bandwidth and improve quality of service when we deploy our system on a Wide Area Network. For example, we may dynamically change the video frame rate and the number of icons on the audience user interface to adapt to changing network bandwidth.

REFERENCES

1. A. Mankin, L. Gharai, R. Riley, M. P. Maher and J. Flidr, "The Design of a Digital Amphitheater", In Proceedings of the 10th International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV00), Chapel Hill, North Carolina, USA, 2000.
2. Q. Liu, D. Kimber, J. Foote, L. Wilcox and J. Boreczky, "FLYSPEC: A Multi-User Video Camera System with Hybrid Human and Automatic Control", In Proceedings of the Tenth ACM International Conference on Multimedia (MM02), Juan les Pins, France, 2002.
3. D. Arijon, "Grammar of the Film Language", Focal Press, 1976.
4. D. G. Boyer and M. E. Lukacs, "The Personal Presence System: A Wide Area Network Resource for the Real Time Composition of Multipoint Multimedia Communications", In Proceedings of the Second ACM International Conference on Multimedia (MM94), San Francisco, California, USA, 1994.
5. "D-player", http://www.idance.co.kr/dplayer_frame.htm (in Korean).
6. D. Song and K. Goldberg, "ShareCam Part I: Interface, System Architecture, and Implementation of a Collaboratively Controlled Robotic Webcam", In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, Nevada, USA, 2003.
7. E. Machnicki and L. A. Rowe, "Virtual Director: Automating a Webcast", In Proceedings of SPIE Multimedia Computing and Networking (MMCN2002), San Jose, California, USA, 2002.
8. H. Harlyn Baker, N. Bhatti, D. Tanguay, I. Sobel, D. Gelb, M. E. Goss, J. MacCormick, K. Yuasa, W. B. Culbertson, and T. Malzbender, "Computation and Performance Issues in Coliseum: An Immersive Videoconferencing System", In Proceedings of the Eleventh ACM International Conference on Multimedia (MM03), Berkeley, California, USA, 2003.
9. "International Organisation for Standardisation: Overview of the MPEG-7 Format", http://www.cselt.it/mpeg/standards/mpeg-7/mpeg-7.htm, 2001.
10. J. Lanier, "National Tele-Immersion Initiative", http://www.advanced.org/teleimmersion2.html.
11. K. Watabe, S. Sakata, K. Maeno, H. Fukuoka, and T. Ohmori, "Distributed Desktop Conferencing System with Multiuser Multimedia Interface", In IEEE Journal on Selected Areas in Communications, Vol. 9, No. 4, 1991.
12. M. Chen, "Design of a Virtual Auditorium", In Proceedings of the Ninth ACM International Conference on Multimedia (MM01), Ottawa, Ontario, Canada, 2001.
13. M. Gleicher, R. Heck, and M. Wallick, "A Framework for Virtual Videography", In Proceedings of the 2nd International Symposium on Smart Graphics, Hawthorne, New York, USA, 2002.
14. "MSN TV", http://www.msntv.com/pc/.
15. M. T. Sun, A. C. Loui and T. C. Chen, "A Coded-Domain Video Combiner for Multipoint Continuous Presence Video Conferencing", In IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, No. 6, 1997.
16. N. Krahnstoever, M. Yeasin, and R. Sharma, "Towards a Unified Framework for Tracking and Analysis of Human Motion", In Proceedings of the International Conference on Computer Vision, Workshop on Detection and Recognition of Events in Video, Vancouver, Canada, 2001.
17. "OpenTV", http://www.opentv.com/onair/player.html?section=sports&file=sports/foxtel.asx.
18. "PolyCom", http://www.polycom.com.
19. P. Kelly, A. Katkere, D. Kuramura, S. Moezzi, S. Chatterjee, and R. Jain, "An Architecture for Multiple Perspective Interactive Video", In Proceedings of the Third ACM International Conference on Multimedia (MM95), San Francisco, California, USA, 1995.
20. R. Cutler, Y. Rui, A. Gupta, J. Cadiz, I. Tashev, L. W. He, A. Colburn, Z. Zhang, Z. Liu, and S. Silverberg, "Distributed Meetings: A Meeting Capture and Broadcasting System", In Proceedings of the Tenth ACM International Conference on Multimedia (MM02), Juan les Pins, France, 2002.
21. Y. Ohno, J. Miura, and Y. Shirai, "Tracking Players and Estimation of the 3D Position of a Ball in Soccer Games", In Proceedings of the International Conference on Pattern Recognition (ICPR'00), Volume 1, Barcelona, Spain, 2000.
22. Y. Rui, L. W. He, A. Gupta and Q. Liu, "Building an Intelligent Camera Management System", In Proceedings of the Ninth ACM International Conference on Multimedia (MM01), Ottawa, Ontario, Canada, 2001.