Video Summarization Using Clustering


Tommy Chheng
Department of Computer Science
University of California, Irvine
[email protected]

Abstract

In this paper, we approach the problem of video summarization. We propose an automated algorithm to identify the unique segments of a video. The video segments are separated using k-means clustering, with the Euclidean distance between histograms of the corresponding segments as the distance metric. YouTube videos are used to test our procedure.

1 Introduction

We have seen YouTube and other media sources pushing the bounds of video consumption in the past few years. As media sources compete for more of a viewer's time every day, one possible alleviation is a video summarization system. A movie teaser is an example of a video summary, but not everyone has the time to edit their videos into a concise version. See [2] for a more detailed description of the problem statement. This paper presents a fast and efficient algorithm for creating a video summary using k-means clustering with RGB histograms. It is aimed particularly at low-quality media, specifically YouTube videos.

2 Approach

An outline of our system is as follows:

1. Split the input file into fixed-length time segments f0 ... fn.
2. Take the first frame of each segment as representative of that segment; call these frames x0 ... xn.
3. Compute the histograms y0 ... yn from x0 ... xn.
4. Cluster the histograms y0 ... yn into k groups using k-means, with Euclidean distance as the error function.
5. Round-robin segment selection: iterate through the k clusters, selecting one segment at random from each cluster and adding it to a list l, until the desired number of segments is chosen.
6. Join the segments in l together to generate a video summary.

A diagram of the system can be seen in Figure 1.
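As an illustration of step 5, the round-robin selection can be sketched in plain Python; the helper name `round_robin_select` and the seeded RNG are our own choices for this sketch, not part of the original implementation:

```python
import random

def round_robin_select(clusters, n_desired, seed=0):
    """Pick segments round-robin across clusters until n_desired are chosen.

    clusters: list of clusters, each a list of segment indices.
    Returns the chosen segment indices, sorted into playback order.
    """
    rng = random.Random(seed)
    # Shuffle each non-empty cluster so the pick within a cluster is random.
    pools = [rng.sample(c, len(c)) for c in clusters if c]
    chosen = []
    i = 0
    while len(chosen) < n_desired and any(pools):
        pool = pools[i % len(pools)]
        if pool:
            chosen.append(pool.pop())
        i += 1
    # Sort so the joined summary preserves the original temporal order.
    return sorted(chosen)

# Example: 3 clusters, 4 desired segments -> each cluster contributes first.
picked = round_robin_select([[0, 1], [2], [3, 4, 5]], 4)
```

Because every cluster is visited before any cluster contributes twice, each scene type is represented in the summary before repeats appear.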

Figure 1: Overview

Figure 2: RGB Histogram

2.1 Feature Selection

We selected RGB color histograms as our feature due to their global nature and speed of processing. In Rui's unified video summarization system [2], he cites histograms as a good trade-off between accuracy and speed. Valdés' work [1] for the TRECVID 2007 Rushes Task also found histogram-based video summarization methods comparable to other features without the performance cost. One notable attribute of histograms is their global content: a histogram is a frequency representation that compresses the information of a video frame into a vector, where each entry is the count of pixels of a given color. Histograms lose spatial information, but in a task like video summarization, spatial information may not be needed. The majority of YouTube videos are of lower quality, so extracting more sophisticated features tends to be difficult. Histograms can perform well because they do not attempt to infer any semantic meaning from the segments.
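A minimal sketch of this feature, assuming frames are available as plain lists of (r, g, b) tuples rather than decoded video (a real system would pull pixels from a decoder); each channel is quantized into a few bins and the three per-channel histograms are concatenated into one vector:

```python
def rgb_histogram(pixels, bins=8):
    """Concatenated per-channel color histogram of a frame.

    pixels: iterable of (r, g, b) tuples with values in 0..255.
    Returns a list of 3 * bins counts: R bins, then G bins, then B bins.
    """
    hist = [0] * (3 * bins)
    width = 256 // bins  # size of each intensity bin
    for r, g, b in pixels:
        hist[r // width] += 1             # red channel bin
        hist[bins + g // width] += 1      # green channel bin
        hist[2 * bins + b // width] += 1  # blue channel bin
    return hist

# A tiny all-red "frame": every pixel lands in the top red bin.
frame = [(255, 0, 0)] * 10
h = rgb_histogram(frame, bins=4)
# -> [0, 0, 0, 10, 10, 0, 0, 0, 10, 0, 0, 0]
```

Quantizing into a small number of bins is what makes the comparison robust to the compression noise typical of low-quality web video.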

2.2 K-means Clustering

For our task, we chose an unsupervised learning approach because of the lack of prior knowledge about Internet videos. We use k-means clustering to group together related scenes.

2.2.1 Algorithm

We want to group all the similar histograms into k clusters. Each histogram is representative of its corresponding video segment. Our version of the k-means algorithm is defined below:

1. Select k random centroid points in our multi-dimensional space.
2. Compute the distance from each histogram to every cluster centroid.
3. Assign each histogram to the cluster that minimizes the error function.

4. Recompute the cluster centroids.
5. Check whether the centroids have converged; if not, go to step 2.
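The loop above can be sketched in plain Python, assuming histograms are equal-length numeric lists; seeding the centroids by sampling input points (rather than truly random coordinates) is a simplification of our own for this sketch:

```python
import random

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster histogram vectors; returns a cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: initial centroids
    assign = None
    for _ in range(max_iter):
        # steps 2-3: assign each point to its nearest centroid
        new_assign = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                      for p in points]
        if new_assign == assign:
            break  # step 5: assignments stable, so centroids have converged
        assign = new_assign
        # step 4: recompute each centroid as the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assign

# Two obvious groups of toy 2-D "histograms"; labels pair {0,1} and {2,3}.
labels = kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], 2)
```

In the real system each point is a full RGB histogram vector rather than a 2-D toy point, but the loop structure is identical.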

2.2.2 Error function

We use Euclidean distance as our error function. This is the general approach when directly comparing histograms.

$$S = \sqrt{\sum_{i=1}^{I} (x_i - y_i)^2} \qquad (1)$$

We also experimented with cosine similarity and saw no noticeable difference in the clustering output.
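Both metrics can be sketched side by side; the function names are our own. Cosine similarity measures the agreement of histogram shapes while Euclidean distance is also sensitive to absolute counts, but at a fixed resolution (constant pixel count per frame) the two tend to rank frame pairs similarly, which is consistent with what we observed:

```python
import math

def euclidean_distance(x, y):
    # Equation (1): square root of summed squared bin differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):
    # 1 - cosine similarity, so that 0 means identical direction.
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

# Two similar histograms and one dissimilar histogram: both metrics agree
# that h2 is closer to h1 than h3 is.
h1, h2, h3 = [10, 0, 2], [9, 1, 2], [0, 10, 2]
print(euclidean_distance(h1, h2) < euclidean_distance(h1, h3))  # -> True
print(cosine_distance(h1, h2) < cosine_distance(h1, h3))        # -> True
```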

3 Results

We selected k = 8 as our k-means parameter and used 20 segments for the output video.

3.1 Dataset

We processed the following YouTube videos in our system. All of the videos are 320x240:

1. MotoGP: a recent round of the world motorcycle racing series, representing a typical sports video.
2. "Chad Vader": a typical comedy video.
3. Tour of LA beaches: a semi-edited amateur web video.
4. A Man vs. Wild episode.

3.2 Clusters Generated

We see some interesting and useful results. In the Tour of LA beaches video shown in Figure 3, the clustering grouped the beach, boardwalk, and indoor scenes into separate clusters. This makes a good summary for viewers because it shows all the major sections of the video clip. When we clustered the MotoGP clip, the algorithm separated the action footage from the pit-stand footage. This is particularly useful for viewers who only want to watch the race and not the pit stand. In Figure 5, for the Chad Vader video clip, all the credits were separated into one cluster. This has a negative side effect for summary creation: since we use a round-robin approach for segment joining, the credits were dispersed throughout the summary. The Man vs. Wild episode was also correctly clustered into distinct segments. It helped that the uniquely identifying segments shared strong color similarity: when Bear (the host) was in the desert, the colors had higher intensity, while in the Florida everglades, the colors had lower intensity.

3.3 Performance

The majority of our runtime is spent in processing overhead, including histogram extraction. In each iteration of k-means clustering, the n frames are compared against the k centroids. The number of iterations is roughly constant; it took approximately 10 iterations to converge. This gives an O(kn) runtime for the clustering algorithm, which is certainly scalable for production use.

Figure 3: Tour of LA beaches clusters: Each row is a cluster.

Figure 4: MotoGP clusters

Figure 5: Chad Vader clusters

Figure 6: Man vs Wild clusters

Name                  Video Duration   Processing Time
MotoGP                9:53             15 seconds
Chad Vader            5:33             22 seconds
Tour of LA beaches    8:46             20 seconds
Man vs Wild Episode   50:00            2 minutes 59 seconds

Figure 7: Performance runtime

4 Problems

4.1 Repeated segments

We run into problems with repeated segments when dealing with static images in videos. When a static image is present for a long time, two or more segments are created from it. During clustering, all of the segments containing the static image are grouped into the same cluster, and with round-robin segment fetching, these static images end up scattered throughout the summary video. This was the case in the Tour of the LA Beaches video, as seen in Figure 3.

4.2 Background

In the MotoGP video clip, the majority of the segments contain the road in the background, and our algorithm grouped most of these shots into one cluster. The intended behavior would be to capture the different teams in different clusters, because each team has a unique color scheme. However, the background dominated and grouped most of these segments together. It would be interesting future work to see whether two levels of clustering would help: one pass over the initial segments and another sub-clustering within each cluster.

5 Conclusion

We have presented a system to automatically create a summarized video from a YouTube video. K-means is a simple and effective method for clustering similar frames together. Our system is modular in design, so future work can be developed by substituting in various components. Instead of histograms, future work could try other features such as motion vectors or even audio. Nevertheless, we have demonstrated that a simple feature with a simple unsupervised learning technique can be a good starting point for a video summarization system.

Acknowledgments

Thanks to Deva Ramanan and the CS273 class for the experience in Machine Learning.

References

[1] Víctor Valdés and José M. Martínez. On-line video skimming based on histogram similarity. In TVS '07: Proceedings of the International Workshop on TRECVID Video Summarization, pages 94-98, New York, NY, USA, 2007. ACM.

[2] Yong Rui, Ziyou Xiong, Regunathan Radhakrishnan, Ajay Divakaran, and Thomas S. Huang. Unified framework for video summarization. MERL, Sept 2004. http://www.merl.com/publications/TR2004-115/.
