Multimedia Information Retrieval
Dr. Qi Tian Department of Computer Science The University of Texas at San Antonio
[email protected] October 31, 2007
Outline
- Overview
- Problems
- Our Work
- Trends and Directions
Qi Tian Univ. of Texas at San Antonio
Multimedia Information Retrieval
Motivation
- With the explosive growth of digital media data, there is a huge demand for new tools and systems that enable average users to more efficiently and effectively search, access, process, manage, author, and share digital media content.
Overview
Multimedia Information Retrieval
Text-based Information Retrieval
- Too many images to annotate
- High cost of human interpretation
- Subjectivity of visual content, e.g., "A picture is worth a thousand words"
Content-based Retrieval
- Automatically retrieves images, video, and audio based on the visual and audio content
History
- Conference on Database Techniques for Pictorial Applications, 1979
- NSF workshop in 1992
- A more active field since 1997, when the Internet and web browsing became popular
Overview
Multimedia Information Retrieval
A VERY DIVERSIFIED FIELD!
Data Types
- Text, hypertext, image, audio, graphics, animation, paintings, video/movie, rich text, spreadsheets, slides, combinations of these, and user interaction
Research Problems
- Systems, content, services, user, evaluation, implementation, social/business, applications
Methodologies
- Database, information retrieval, signal and image processing, graphics, vision, human-computer interaction, machine learning, statistical modeling, data mining, pattern analysis, data fusion, social sciences, and domain knowledge for applications
Overview
Multimedia Information Retrieval
- Overview
- Content-based Image Retrieval
- Content-based Video Retrieval
- Content-based Audio Retrieval
Approaches
Hierarchical Levels
- High-level: bridge the semantic gap; integration of context and content; hybrid (text and content) approaches
- Mid-level: active learning, boosting, incremental learning
- Low-level: feature extraction and representation; dimension reduction and selection
Overview
Multimedia Information Retrieval
Content-based Image Retrieval
System diagram: image DB with metadata → off-line, automatic feature extraction → user interface with feature weighting and similarity ranking (implemented in C/C++ and Visual C++).
Features:
- Color: color histogram, color moments, color correlogram
- Texture: Tamura texture, co-occurrence matrices, Gabor features, wavelet moments
- Shape: Fourier descriptor
- Structure: edge-based features
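The low-level color features and similarity ranking can be sketched as follows. This is a minimal Python illustration, not the system's actual C/C++ implementation; the bin count and histogram-intersection measure are common choices, assumed here for concreteness.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Joint RGB color histogram: quantize each channel into `bins` levels,
    count pixels per (r, g, b) bin, and normalize to sum to 1."""
    quantized = (image.astype(np.uint32) * bins) // 256        # bin index in [0, bins)
    codes = (quantized[..., 0] * bins + quantized[..., 1]) * bins + quantized[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def rank_by_similarity(query_hist, db_hists, weights=None):
    """Rank database images by (optionally weighted) histogram intersection
    with the query; higher intersection means more similar."""
    if weights is None:
        weights = np.ones_like(query_hist)
    scores = [np.sum(weights * np.minimum(query_hist, h)) for h in db_hists]
    return sorted(range(len(db_hists)), key=lambda i: -scores[i])
```

The `weights` argument corresponds to the feature-weighting step in the diagram: the user interface can re-weight feature dimensions between queries.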
Approaches
Many CBIR systems have been built. The growing list: ADL, AltaVista Photofinder, Amore, ASSERT, BDLP, Blobworld, CANDID, C-bird, Chabot, CBVQ, DrawSearch, Excalibur Visual RetrievalWare, FIDS, FIR, FOCUS, ImageFinder, ImageMiner, ImageRETRO, ImageRover, ImageScape, Jacob, LCPD, MARS (UIUC), MetaSEEk, MIR, NETRA, Photobook, Picasso, PicHunter, PicToSeek, QBIC (IBM Almaden), Quicklook2, SIMBA, SQUID, Surfimage, SYNAPSE, TODAI, VIR Image Engine, VisualSEEk, VP Image Retrieval System, WebSEEK (Columbia), WebSeer, WISE (Stanford), ...
Overview
Multimedia Information Retrieval
Content-based Video Retrieval
Traditional Video Retrieval
- Query by textual keyword
Automatic Visual Concept Detection
- e.g., indoor/outdoor, sky, car, building, US flag
- Example concepts: Airplane, Building, Car, Crowd, Desert, Explosion, Outdoor, People, Vehicle, Violence
Video Retrieval - Scene
- How to recognize a scene? Context:
  - Use proto-concepts to describe context
  - Use machine learning to link context to concepts
(Example frames annotated with proto-concepts such as Sky and Water.)
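One way to read "machine learning to link context to concepts" is: describe a frame by a histogram over proto-concepts, then learn a mapping from that context vector to a scene label. The sketch below uses a nearest-centroid classifier as a deliberately simple stand-in; the proto-concept names and the classifier choice are illustrative assumptions, not the talk's actual method.

```python
import numpy as np

# Hypothetical proto-concept vocabulary (names are illustrative only).
PROTO_CONCEPTS = ["sky", "water", "grass", "sand", "road"]

def train_centroids(context_vectors, labels):
    """Learn one mean proto-concept histogram (context) per scene label."""
    centroids = {}
    for label in set(labels):
        rows = [v for v, l in zip(context_vectors, labels) if l == label]
        centroids[label] = np.mean(rows, axis=0)
    return centroids

def classify_scene(context, centroids):
    """Assign the scene whose centroid is nearest in context space."""
    context = np.asarray(context)
    return min(centroids, key=lambda c: np.linalg.norm(context - centroids[c]))
```

A real system would replace the centroid rule with a stronger learner (e.g., an SVM) over the same context representation.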
Overview
Multimedia Information Retrieval
Content-based Audio Retrieval
- Searching sounds by their features in the waveform, statistics, or transform domains:
  - Short-time energy
  - Zero-crossing rate
  - Pitch period
  - MFCC
  - Spectrogram
  - LPC
- Audio classes: speech, music, environmental audio, silence
Applications
- Entertainment
  - Film making: searching sound effects
  - TV/radio studio: editing programs
  - Karaoke, music stores, or online shopping: query by humming the melody
- Audio/video archive management
  - Segmenting and indexing raw recordings
  - Searching and browsing audio/video clips
- Surveillance
  - Monitoring criminal or emergency events (e.g., glass breaking, explosion, cry, shot, ...)
  - Film rating
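Two of the waveform-domain features above, short-time energy and zero-crossing rate, are simple enough to sketch directly (frame length and hop size are assumed values, not from the talk):

```python
import numpy as np

def short_time_energy(signal, frame_len=256, hop=128):
    """Mean squared amplitude per frame -- separates speech/music from silence."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.mean(f ** 2) for f in frames])

def zero_crossing_rate(signal, frame_len=256, hop=128):
    """Fraction of sign changes per frame -- tends to be higher for noisy
    or unvoiced sounds than for voiced speech and tones."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.mean(np.abs(np.diff(np.sign(f))) > 0) for f in frames])
```

Thresholding these two features already gives a rough speech / music / silence segmentation; MFCC, spectrogram, and LPC features refine it in the transform domain.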
MIR
Top 10 Problems in MIR
Bridge the Semantic Gap
- Between high-level concepts (sites, objects, events) and low-level visual/audio features (color, texture, shape and structure, layout; motion; audio pitch, energy, etc.)
How to Best Combine Human Intelligence and Machine Intelligence
- Keep the human in the loop, e.g., relevance feedback
New Query Paradigms
- Query by keywords, similarity, sketching an object, sketching a trajectory, painting a rough image, etc. Can we think of useful new paradigms?
Multimedia Data Mining
- Searching for interesting/unusual patterns and correlations in multimedia has many important applications, including web search engines and intelligence data
- Work to date on data mining has been mainly on text data
How to Use Unlabeled Data
- Active learning, e.g., in relevance feedback
- Label propagation, e.g., image/video annotation
Xiong, Zhou, Tian, Rui, and Huang, "Semantic Retrieval of Video," IEEE Signal Processing Magazine, March 2006
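Keeping the human in the loop via relevance feedback is often implemented as query-point movement. Below is a minimal Rocchio-style sketch; the parameter values are conventional defaults assumed for illustration, not taken from the talk:

```python
import numpy as np

def rocchio_update(query, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query feature vector toward user-marked relevant examples
    and away from non-relevant ones (classic query-point movement)."""
    q = alpha * np.asarray(query, dtype=float)
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q = q - gamma * np.mean(nonrelevant, axis=0)
    return q
```

Each feedback round re-ranks the database with the updated query, so a few clicks of "relevant / not relevant" steer the search without any retraining of the feature extractor.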
MIR
Top 10 Problems (continued)
Using Virtual Reality Visualization to Help
- Can we use 3D audio/visual visualization techniques to help a user navigate through the data space to browse and retrieve? e.g., 3D MARS
Incremental Learning
- Change the parameters of the retrieval algorithms incrementally, without starting from scratch every time new data arrives
Structuring Very Large Databases
- Researchers in audio/visual scene analysis and in databases and information retrieval should collaborate closely to find good ways of structuring very large multimedia databases for efficient retrieval and search
Performance Evaluation
- e.g., TRECVID for video retrieval; how about image retrieval?
What Are the Killer Applications of Multimedia Retrieval?
- e.g., medical multimedia document management
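The incremental-learning problem above (update parameters, never retrain from scratch) can be illustrated with the smallest possible case: a class prototype maintained as a running mean. This is a generic sketch, not a specific algorithm from the talk:

```python
import numpy as np

class IncrementalMean:
    """Running estimate of a class prototype that absorbs new samples one
    at a time, without revisiting earlier data."""
    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)

    def update(self, x):
        self.n += 1
        self.mean = self.mean + (np.asarray(x) - self.mean) / self.n  # running mean
        return self.mean
```

The same pattern extends to covariances (Welford's method) and to incrementally updated projections, which is what makes retrieval models adaptable as a collection grows.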
Approaches
Our Recent Work
Pipeline (useful in other applications): Image Database → Feature Extraction → Data Modeling (dimension reduction, statistical estimation) → Similarity Estimation (classification, ranking/indexing) → User, with discriminant-analysis training in the loop.
Approaches
Our Recent Work in CBIR
- Semantic Subspace Projection: bridge the semantic gap
- Hybrid Discriminant Analysis: learn high-dimensional data with small samples
- Adaptive Discriminant Projection: adaptively learn a projection from the data distribution
- Distance Measures for Similarity Estimation: investigate the relations between probability distributions, distance metrics, and mean estimation
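For context on the discriminant-analysis family above, here is the classical two-class Fisher discriminant, the baseline these hybrid/adaptive methods extend. The regularizer on the within-class scatter is a standard trick for the small-sample, high-dimension setting; this is background material, not the talk's own algorithms:

```python
import numpy as np

def fisher_direction(X1, X2, reg=1e-6):
    """Two-class Fisher discriminant: w = Sw^-1 (m1 - m2), the direction
    that best separates the class means relative to within-class scatter.
    `reg` regularizes Sw so it stays invertible with few samples."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False) * (len(X1) - 1)   # within-class scatter, class 1
    S2 = np.cov(X2, rowvar=False) * (len(X2) - 1)   # within-class scatter, class 2
    Sw = S1 + S2 + reg * np.eye(X1.shape[1])
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)
```

Projecting relevant vs. non-relevant images onto `w` gives a one-dimensional score that can drive ranking in relevance feedback.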
Our Recent Work in CBIR
Related Publications
Journals
- IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)
- IEEE Transactions on Circuits and Systems for Video Technology (CSVT)
- IEEE Multimedia
- International Journal of Computer Vision (IJCV)
- Pattern Recognition (PR)
- ACM Transactions on Knowledge Discovery from Data (TKDD)
- IEEE/ACM Transactions on Computational Biology and Bioinformatics
Conferences
- ACM Multimedia
- IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- International Conference on Pattern Recognition (ICPR)
- Others, including ICME, ICIP, ICASSP, MIR, CIVR
Current Directions
- Web Image Search and Mining
- Image Annotation
- Affective Video Retrieval
- Information Fusion in MIR
- Integration of Context and Content for Multimedia Management
- Multimodal Emotion Recognition
Trends and Directions
Web Search
- Web Search 1.0: traditional text retrieval
- Web Search 2.0: page-level relevance ranking
- Web Search 3.0: object-level structured search
Object-Level Vertical Search
- MSRA Libra (http://libra.msra.cn/)
- Live Product Search (http://products.live.com)
Trends and Directions
Image Annotation
Photo sharing through the Internet has become a common practice.
- flickr.com: 19.5 million photos (30% growth/month) as of 2005
- Photo.net and airliners.net: millions of images
Most image search engines rely on textual descriptions of the images, e.g., Google, Yahoo, MSN.
In general, people do not spend time labeling or annotating their personal photos.
Trends and Directions
Image Annotation
Can a computer do this?
(Example output: Building, Sky, Lake, Landscape, Tree)
Image Annotation System
- A statistical model that relates words to image features
- Given an image, extract feature vectors
- Descriptive words: the top words ranked according to likelihood
Current work
- Li & Wang (alipr.com, 2006); Blei & Jordan (2003); Vasconcelos (UCSD-SML, 2007); Zhang et al. (MSRA, 2005); Li et al. (MSRA, 2007)
Promising direction: web image annotation, an integration of IR and content analysis
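The "statistical model relating words to image features" can be made concrete with a heavily simplified stand-in: fit one diagonal Gaussian over image features per word, then annotate a new image with the words under which its feature vector is most likely. This toy model is an illustration of the likelihood-ranking idea, not the cited systems:

```python
import numpy as np

def train_word_models(features, word_lists):
    """Fit a diagonal Gaussian over image features for each annotation word."""
    models = {}
    words = {w for ws in word_lists for w in ws}
    for w in words:
        rows = np.array([f for f, ws in zip(features, word_lists) if w in ws])
        models[w] = (rows.mean(axis=0), rows.std(axis=0) + 1e-6)
    return models

def annotate(feature, models, top_k=3):
    """Return the top-k words ranked by Gaussian log-likelihood of the feature."""
    def loglik(w):
        mu, sd = models[w]
        return -np.sum(np.log(sd) + 0.5 * ((feature - mu) / sd) ** 2)
    return sorted(models, key=loglik, reverse=True)[:top_k]
```

Real annotation systems replace the per-word Gaussian with mixture models or topic models over region features, but the train/rank structure is the same.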
Trends and Directions
Affective Video Retrieval
Affective:
- "a feeling or emotion as distinguished from cognition, thought, or action"
Real Multimedia Retrieval
- Search for a subset of the nicest holiday pictures to show to friends
- Select the most appropriate background music for a given situation
- Search for the most impressive video clips
- Search for the most appealing photographs of one and the same content
- Search for all the film comedies I like most
Alternative approach: search by
- Mood
- Match to the user's profile (like/dislike, interest/no interest)
→ Affective Video Retrieval
Trends and Directions
Underlying Idea
Video
- Temporal content flow
- Continuous transitions from one affective state to another
Temporal measurement of
- Arousal (as a function of time t)
- Valence (as a function of time t)
Combining the two curves yields the affect curve, a trajectory in the 2D arousal-valence space.
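The affect-curve construction can be sketched directly: smooth the per-frame arousal and valence measurements (affective states change gradually) and stack them into a 2D trajectory. The moving-average smoother and window size are illustrative assumptions:

```python
import numpy as np

def smooth(series, window=5):
    """Moving-average smoothing, since affective states change gradually."""
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode="same")

def affect_curve(arousal, valence, window=5):
    """Combine per-frame arousal and valence measurements into one
    trajectory through the 2D arousal-valence affect space."""
    return np.column_stack([smooth(arousal, window), smooth(valence, window)])
```

Retrieval can then match whole affect curves (e.g., "find clips whose curve ends in high-arousal, positive-valence territory") rather than individual frames.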
Trends and Directions
Affective Media Content Characterization (Hanjalic & Xu)
Pipeline: media → feature extraction → Arousal = f1(feature values), Valence = f2(feature values) → mapping AV values onto the 2D affect space → affective content characterization
Example regions of the affect space: suspense/horror; hilarious fun; a bit somber, medium excitement; romantic "feel-good"
Trends and Directions
Information Fusion in MIR
Fusion: "a merging of diverse, distinct, or separate elements into a unified whole" (Merriam-Webster dictionary).
- Feature extraction module:
  - Multiple features → vectors
  - Concatenated into a single vector
  - Feature fusion: a more discriminating hyperspace can be found in the new vector
- Matching module:
  - One type of classifier for multiple features, or
  - Multiple types of classifiers for one feature, or
  - Both
  - The output scores can be combined
- Decision module:
  - The output decisions of the classifiers can be combined
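The three fusion levels above can be sketched in a few lines; the rules shown (concatenation, sum/mean of scores, majority vote) are the standard textbook instances of each level, assumed for illustration:

```python
import numpy as np

def feature_fusion(feature_vectors):
    """Feature-level fusion: concatenate per-feature vectors into one."""
    return np.concatenate([np.asarray(v, dtype=float) for v in feature_vectors])

def score_fusion(scores, rule="sum"):
    """Matching-level fusion: combine normalized classifier scores by rule."""
    ops = {"sum": np.sum, "max": np.max, "min": np.min, "mean": np.mean}
    return ops[rule](scores)

def decision_fusion(decisions):
    """Decision-level fusion: majority vote over binary classifier outputs."""
    return int(sum(decisions) > len(decisions) / 2)
```

Note that score fusion presumes the scores are already normalized to a common scale; otherwise one classifier's range dominates the rule.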
Trends and Directions
Information Fusion in MIR: Two Forms
- Multi-modality
  - e.g., a video clip carries visual, audio, and textual information
  - Multi-modality fusion occurs at the feature extraction module
  - A single source may also be represented by multiple features, e.g., color image → color, texture, shape
- Multi-classifier (ensembles of classifiers)
  - A set of classifiers trained to solve the same problem
  - Applied to a single source or multiple sources of information
  - A single type of base classifier, or different types of classifiers (Bayesian, k-NN, SVM)
Trends and Directions
Information Fusion in MIR: Fusion Schemes
- The predictions of multiple classifiers are integrated into one fused decision by a fusion scheme
  - The outputs of the different classifiers need to be normalized
- Rule-based
  - The decision is made by a simple operation on the outputs of all classifiers
  - e.g., Max, Min, Sum, Mean (matching module)
  - e.g., AND, OR (decision module)
- Learning-based
  - The outputs of all classifiers are fed into a learning process to obtain the final decision
  - e.g., decision tree, neural network
- No scheme is guaranteed to be the best, empirically or theoretically
Applications
- Multimodal biometrics fusion (face, fingerprint, iris, palmprint, voice, hand geometry)
- Audio/visual fusion for multimodal emotion recognition
Trends and Directions
Integration of Context and Content for Multimedia Management
(Figure from Carlson & Hatfield, 1992)
Trends and Directions
Integration of Context and Content for Multimedia Management
- An increasing amount of active research in this direction
- Crucial to human-human communication and to human understanding of multimedia
  - Without context, it is difficult for a human to recognize various objects
- Enables (semi-)automatic content analysis and indexing methods to become more powerful in managing multimedia data
- Contextual information
  - e.g., cell ID for mobile phone location, GPS integrated in a digital camera, camera parameters, time information, and the identity of the producer
Trends and Directions
Integration of Context and Content for Multimedia Management (T-MM Special Issue)
Topics of interest include:
- Contextual metadata extraction
- Models for temporal context, spatial context, imaging context (e.g., camera metadata), social context, and so on
- Web context for online multimedia annotation, browsing, sharing, and reuse
- Context tagging systems, e.g., geotagging, voice annotation
- Context-aware inference algorithms
- Context-aware multi-modal fusion systems (text, document, image, video, metadata, etc.)
- Models for combining contextual and content information
- Context-aware interfaces and collaboration
- Novel methods to support and enhance social interaction, including social/affective computing and experience capture
- Applications such as using context and similarity for face and location identification
- Context-aware mobile media technology and applications
- Using context to browse and navigate large media collections
Projects in High Impact
Features: web-based, large-scale, user-participated
- A combination of the Internet and multimedia
Examples:
- Photo search: home photo management, mobile photo search, face search
- Million Book Project: tremendous space for search and multimedia data mining
- Human-centered multimedia search: search over UCC data (user preference, profile, and opinion); social search (public relations, names, personalization)
Acknowledgement
Funding Agencies
- Army Research Office (ARO)
- Department of Homeland Security (DHS)
- San Antonio Life Science Institute (SALSI)
- Center for Infrastructure Assurance and Security (CIAS)
Current Collaborators
- Academia: University of Illinois, University of Amsterdam, Chinese Academy of Sciences, University of Science and Technology of China, Institute for Infocomm Research (Singapore)
- Industry: HP Labs (Palo Alto, CA), Microsoft Research (MSR) Redmond, Microsoft Research Asia (MSRA), NEC Labs America, Kodak Research Lab, IBM T.J. Watson Research Center
Students
- Jerry Yu (2007, Kodak Research, NJ)
- Yijuan Lu (2008 expected)
- Yuning Xu
Q & A
Thank you! Questions?