Semantics of Visual Discrimination John R. Smith IBM T. J. Watson Research Center
[email protected]
February 2015 © 2015 IBM Corporation
Many Thanks To … IBM T. J. Watson - Multimedia Research Team
Liangliang Cao
Quoc-Bao Nguyen
Michele Merler
Noel Codella
Wei Liu
Rosario Uceda-Sosa
IBM T. J. Watson - Exploratory Computer Vision Team
Matthew Hill
Sharath Pankanti
Rogerio Feris
Quanfu Fan
Nalini Ratha
Chiao-Fe Shu
Chung-Ching Lin
Lisa Brown
IBM Research Collaborators
Gang Hua
Shih-Fu Chang
John Kender
Daniel Ellis
Felix Yu
Big News Event in 2005: How many cameras?
Similar Big News Event Eight Years Later: How many cameras?
Cannot count them all … image and video are here!
Massive Multimedia is the Biggest Wave of All!
[Figure: growth of Data Volume (Giga, Tera, Peta, Exa) by decade (1990’s, 2000’s, 2010’s, 2020’s) across structured data, text, audio, image and video, alongside Computation (Mips, # algorithms; Models, Features) and Sophistication (Expressiveness, Low to High). Per item: Video (movie) vs. Text (book), 1000x.]
Callouts (Safety / Security, Medical, Customer, Media):
• 10s of millions of cameras
• 1B medical images
• 1B camera phones
• 100 video hrs/minute
• Wide Area Imagery: 100’s TB per day
• Digital Marketing: 12% of video views
• Enterprise Video: used by 1/3 of enterprises
Key Messages:
• Image and video data is growing in volume and importance
• Manual analysis is not scalable … cannot keep up
• Vision system development has historically required deep expertise
Expert Analysis → Data-driven Visual Learning
• Increasing availability of labeled training data
• Emergence of increasingly sophisticated visual learning algorithms
Focus needed on Visual Semantic Modeling
• How to best model the visual world (concepts and relationships)
• How to combine visual semantic modeling with learning systems
Use Cases
Multiple Industry Problems Require Large-Scale Visual Content Analysis
Safety and Security (IBM Intelligent Video Analytics)
• Volume: 10K’s managed cameras per city
• Velocity: real-time alerts, 20M video events/day
• Variety: street scenes, rail stations, crowds, people, environmental conditions
• Veracity: analysis of complex activities (trip wires, abandoned objects)
Healthcare Cognitive Systems (IBM Watson Research)
• Volume: 1B medical images per year (growing 20-40% /yr)
• Velocity: 50K radiology images per day per radiology dept.
• Variety: images, video, text, patient records, cases, scientific literature, ontologies/semantics
• Veracity: subjective interpretation across millions of categories (modalities, body views, organ systems, pathologies, anomalies)
Enterprise Content Management (IBM IMARS)
• Volume: 70PB broadcast/yr, 40K hrs per news archive
• Velocity: 100 video hrs/min to YouTube
• Variety: mobile, user generated, professional
• Veracity: robust content extraction for objects, places, scenes, activities, people
Retail and Mobile Commerce (IBM System V)
• Volume: 500B consumer photos/yr
• Velocity: 100M customers per week for large retailers
• Variety: transient and dynamic content
• Veracity: predicting consumer attributes from diverse sources including visual data (images and video)
[Figure: analysis tasks for Images, Video and Multimedia placed on a time scale (us, ms, …, sec, min, hr, day, wk, mo, yr) running from Data-in-Motion to Data-at-Rest: Content-Filtering, Real-time Alerting, Content Classification, Behavior Analysis, Cross-Camera Mining, Activity Based Intelligence, Content-based Search, Real World Events, Broadcast Monitoring]
Safety and Security: Urban Surveillance
[Figure labels: Trip Wires, Offline Training, Object Detection, Forensic Search, Semantic Extraction, Machine Learning, Alerts, Attributes, Object Tracking, Visual Recognition, Pattern Discovery, Activity Detection]
• Expert developed algorithms
• Limited object detection and tracking
• Limited robustness to environment
• Manual tuning for new deployments

• Large-scale data-driven visual learning
• Semantic extraction including attributes
• Integrated analysis across cameras
• Self-configuration, tuning, adaptation
IBM CONFIDENTIAL
IBM Intelligent Video Analytics (IVA) for Smarter Cities
CAPTURE, ANALYZE, DECIDE, ACT
Visual Learning
• Vehicle detection/classification
• Person and face attributes
• Activities and behaviors
Traditional Computer Vision
• Real world metrics (speed, size)
• Trip wires and safety regions
• Object detection and tracking
Media-in-the-Wild
Traditional Media Archives
• Limited indexing and search of news archives, TV programs, movies
• Relies heavily on related information (e.g., metadata, speech transcript, user tags)
User Generated Content
• Want to understand diverse user generated content at a semantic level (e.g., sports, activities, life events)
• Extraction of visual insights
• Discovery of patterns across users, segments, time, geographies
Visual search uses thousands of Visual Classifiers across dozens of Facets (Objects, Scenes, Locations, Activities, People, Events, etc.)
Visual Content Extraction: Scenes, Objects, People, Actions, Activities, Events
Example labels: City scene, Adult person, Parade marching, Playing Fetch, At a Wedding, Shopping, At a Concert, Cake, Waving, People Gathering
Example: Visual Recognition for Image Semantic Indexing
Visual Semantic Label Categories, e.g., Surfing
Demo
Image Classification Results using Visual Semantic Faceted Hierarchy
[OBJECT] Ambulance Vehicle
[OBJECT] Wind-Powered Boat
More Examples of Classification Results using Semantic Faceted Hierarchy
[ANIMAL] Dog
[ANIMAL] Walrus
Machine Learning Shifts Effort to Organizing Image Data for Training
Accurate recognition of sports by training 150 categories
Examples: Hot Air Ballooning, Hang Gliding, Figure Skating, Skiing, Equestrian, Softball
Semantic Search is supported by Automatically Extracted Semantic Labels (Boolean = AND/OR)
• Car AND Street Scene
• Beach OR Sunset
Semantic Indexing applies to Video Content Extraction and Search
• Boating
• Parade AND Urban Scene
Other activities: Running, Skiing, …
Other Combining Functions: AND, OR, X, MIN
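One way to picture Boolean search over extracted labels is to combine per-label classifier confidence scores and rank images by the result. The sketch below is illustrative only: the slides do not define the combining functions, so scoring AND with the minimum, OR with the maximum, and X with the product is an assumption, and the images, labels and scores are made up.

```python
from math import prod

# Hypothetical classifier confidence scores in [0, 1] for three images.
SCORES = {
    "img1": {"Car": 0.9, "Street Scene": 0.8, "Beach": 0.1, "Sunset": 0.2},
    "img2": {"Car": 0.7, "Street Scene": 0.3, "Beach": 0.6, "Sunset": 0.9},
    "img3": {"Car": 0.2, "Street Scene": 0.9, "Beach": 0.8, "Sunset": 0.1},
}

# Assumed combining functions: MIN for AND, MAX for OR, product for X.
COMBINERS = {"AND": min, "OR": max, "X": prod}


def search(scores, labels, op):
    """Rank images by the combined score of the requested labels."""
    combine = COMBINERS[op]
    ranked = [(combine([s[label] for label in labels]), image)
              for image, s in scores.items()]
    # Highest combined score first; ties broken by image name.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [image for _, image in ranked]


car_and_street = search(SCORES, ["Car", "Street Scene"], "AND")
beach_or_sunset = search(SCORES, ["Beach", "Sunset"], "OR")
```

With these made-up scores, the AND query favors the image where both labels score high, while the OR query favors any image with one strong label.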
On-line photos and videos have enormous potential as a rich source of information about consumers
• ¼ Trillion photos hosted on Facebook from 1.3 Billion users
• 100 Hours of video are uploaded to YouTube every minute
• 51% of the Class of 2015 (high school) use Instagram daily
• 80% of Pinterest users are female
. . . and growing
Growing Amount of Consumer Images and Video is Source for Insights
Consumer Photos, Pins, Likes
• Pins / Re-pins
• Likes / Dislikes
• Tweets
• Favorites
Visual Classifiers
• Access Public Visual Content
• Perform Semantic Classification
Semantics: Products, Styles, Logos, Brands, Designs
Example labels: Style, Kitchen Gallery, Dream Home, Wedding
Visual Extraction
• Preferred Styles
• Hobbies/interests
• Life Events
• Products
Consumer Targeting, Marketing and Planning
• Personalization
• Promotions
• Campaigns
• Planning
Healthcare: Radiology Image Analysis
Medical Insights
• Specialized algorithms per medical modality and disease
• Limited scaling and coverage
• Bottleneck in algorithm development by computer vision experts

• Diverse visual extraction across modality, anatomy, pathology
• Address wide spectrum across patient image data, images and figures in medical literature, visual knowledge repositories
• Beyond radiology to analyze images broadly in medicine
Multi-modal Analysis helps Build Medical Knowledge and Aid Diagnosis
[Figure labels: Multimedia Medical Data, Multimodal Context Data, Medical Knowledge, Patient Records, Massive Machine Learning*]
Medical Semantics*
• Medical Image Taxonomy (RadLex, IRMA): Classification (modality, view, organ system, artifacts), e.g., X-Ray (0.9), Chest (0.8), Lungs (0.8)
• Anatomy/Features (FMA, OBO): Semantic Tagging (body/organ regions, local features), e.g., Cloudy (0.3), Shadowed (0.3), Darkened (0.1)
• Pathology/Disease (UMLS, ICD, ICF): Anomaly Analysis (pathology, disease), e.g., Pneumonia (0.3), Tuberculosis (0.1), Normal (0.4)
Actionable Information
• Annotated References
• Selected References
• Categorical Examples
• Automatic Triage and Retrieval
• Computer Aided Diagnosis
• Coincidental Diagnosis
• Patient-Centric Medicine
Huge Diversity of Imagery in Medicine Spans Modality, Anatomy, Disease
Multi-Modal Information: Image, Text
Visual Classification
• Modality: X-ray, CT, MRI, …
• Region: Pelvis, Head, Chest, …
• Disease: Tuberculosis, Cancer, Collapse, …
NIH PubMed Medical Modality Classification: Visual Recognition
• Extracting knowledge from NLM/NIH PubMed: 14.2 million articles (3.8 million free full-text articles)
• Rich contextual information: text + figures + captions
• Challenge: given an unknown PubMed image, determine its medical category automatically
• Automatically classifying millions of published medical images by modality, region, disease along with associated text builds multi-modal knowledge
Medical Knowledge Management – Automatic Medical Image Classification
Medical image taxonomy
• Modality: X-ray, CT, MRI, …
• Region: Pelvis, Head, Chest, …
• Disease: Tuberculosis, Cancer, Collapse, …
Example categories: MRI Brain Axial, MRI Knee, PET Color, DX Appendage, DX Torso, DX Cervical Spine
Source: NIH PubMed figures + captions
ImageCLEF Medical Image Classification*
Goal: automatically classify images into correct medical category
• Data-set of NIH PubMed articles with 305,000 images
• IBM achieved #1 performance in 2012 and 2013
• Top performance in each task (visual, text, combined)
[Figure: bar chart of Mean Accuracy (0 to 90) for IBM and Non-IBM entries on the Combined, Text and Visual tasks; diagram labels: feature, feature, feature, Fusion, “X-ray”]
* http://imageclef.org/2013/medical
Massive-Scale Multimedia Semantic Modeling
© 2014 IBM Corporation
Visual Recognition Technical Foundation
Multi-layer Learning Architecture for Image and Video Analysis
[Figure: multi-layer architecture over Images and Video; labels grouped below]
• Features: Color, Texture, Edges, Shape, Motion, Regions, Moving Objects, Tracks, Background, Camera Motion, Shot Boundaries, Scene Dynamics, Energy, Zero-crossings, Frequencies, Spectrum
• Models: SVMs, Ensemble Classifiers, Expectation Maximization, Regression, Decision Tree, K-means, Factor Graph, AdaBoost, GMM, Bayes Net, Markov Model, Neural Net, Deep Learning, Nearest Neighbor, Clustering, Active Learning
• Training data: Labeled Data (Positive Examples, Negative Examples), Unlabeled Data
• Semantics: Scenes, Locations, Settings, Places, Objects, Vehicles, Cars, Living Objects, Animals, People, Faces, Actions, Activities, Behaviors, Events
• Visual Recognition = predicting “semantic” labels for unknown images and video
Data-driven Machine Learning to Create Visual Semantic Classifiers
• Natural Photos and Video
• Medical Modalities & Viewpoints
Same Approach
Rich Set of Visual Features is Needed to Learn Semantic Discrimination
Aardvark [1] Spatial Scale Global (0.86)
Even-Toed Ungulate [14] Local Binary Pattern Global (0.68)
Primate [1] Local Binary Pattern Global (0.69)
Seal, Sea Lion, Walrus [3] Local Binary Pattern Global (0.83)
Bat [1] Thumbnail Vector Global (0.93)
Meat-Eater [12] Color Correlogram Global (0.68)
Rabbit, Pika, Hare [2] Color Moments Grid (0.79)
Toothless Mammal [2] Local Binary Pattern Global (0.85)
Sea Cow [1] Color Histogram Global (0.95)
Whale, Dolphin, Porpoise [5] Color Histogram Global (0.95)
Elephant [1] Spatial Scale Global (0.81)
Odd-Toed Ungulate [4] Edge Histogram Layout (0.79)
Increasing Availability of Labeled Data is Accelerating Visual Recognition
• Scenes and Objects: SUN Database: 131,072 images, 908 scenes, 3,819 objects (http://groups.csail.mit.edu/vision/SUN/)
• Objects and Actions: PASCAL VOC: 11,530 images with 27,450 ROI annotated objects and 6,929 segmentations; 20 objects, 10 actions, segmentations (http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2012/)
• Scenes: INRIA Holidays Data Set: 1,491 images, 500 queries, 991 search images (http://lear.inrialpes.fr/~jegou/data.php)
• Objects: ImageNet linguistic visual objects (http://image-net.org/)
   • Full data: 15,589 WordNet synsets
   • ILSVRC: 1,000 object categories and 1.2 million images
Visual Classifier Learning Allows Metrics-based Optimization
[Figure: Visual Classifier Learning / Ensemble Visual Classifier Training pipeline]
• Training Images and Video: Positive Examples, Negative Examples
• Visual Feature Extraction: color, edges, texture, …, CNN features
• Visual Model Training (linear SVM): models 1, 2, …, N per feature
• Visual Model Fusion (non-linear SVM)
• Discriminative Model
• Metrics-based Validation: Precision vs. Recall, Accuracy vs. Score, Avg. Precision vs. Score, F1 vs. Rank, Precision vs. Score, Score vs. Rank
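The validation curves named on this slide are built from a few standard quantities. As a minimal sketch, assuming a list of ground-truth flags already sorted by descending classifier score, precision, recall and F1 at a rank cutoff and non-interpolated average precision can be computed as follows. The function names and the example ranking are illustrative, not taken from the IBM system.

```python
def precision_recall_f1_at(ranked_labels, k):
    """Precision, recall and F1 over the top-k results (k >= 1).

    ranked_labels: list of booleans (True = positive), sorted by
    descending classifier score.
    """
    total_pos = sum(ranked_labels)
    true_pos = sum(ranked_labels[:k])
    precision = true_pos / k
    recall = true_pos / total_pos if total_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_precision(ranked_labels):
    """Mean of the precision values at the rank of each positive.

    Assumes every positive appears somewhere in the ranked list.
    """
    hits = 0
    total = 0.0
    for rank, is_pos in enumerate(ranked_labels, start=1):
        if is_pos:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Made-up ranking: positives at ranks 1, 3 and 4.
ranking = [True, False, True, True, False]
p3, r3, f3 = precision_recall_f1_at(ranking, 3)
ap = average_precision(ranking)
```

Sweeping `k` over all ranks gives the points of an F1 vs. Rank or Precision vs. Recall curve.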
Visual Recognition Example
• Upload Unknown Images
• Visual Recognition Results
Demo
Visual Indexing and Search using Semantic Faceted Taxonomy
[Figure: pipeline from Images and Videos to Visual Semantic Search; function names from the figure in parentheses]
• Visual Semantic Taxonomy: scenes, objects, events, people, activities
• Training Examples: Positive (+), Negative (-)
• Temporal Segmentation (parseVideo)
• Shot Boundary Determination (parseVideo)
• Key Frame Extraction (parseSequence)
• Extract Features (ExtractFeatures): color, texture, edges, shape, …
• Learn Visual Classifiers (LearnClassifiersFromImages, LearnClassifiersFromVectors): Classifier Learning, Visual Classifiers
• Extract Concepts (ConceptExtract): Visual Concept Extraction
• Visual Semantic Search: Visual Query [C1 AND C2] OR [C3]
Different Approaches can be used for Managing and Selecting Negatives for Visual Discriminative Learning
Positive Examples: Whale
Negative Examples, by approach:
• Common background of all images (unlabeled positive images are included as negatives)
• Traditional taxonomy mutual exclusivity (unlabeled positive images are included as negatives)
• Faceted classification ensures mutual exclusivity within facets (allows correct positives and negatives)
Example label sets shown: Whale; Person, Dog, Whale; Zebra, Nature, Sunset; Zebra, Cat, Koala
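To make the difference between the approaches concrete, the sketch below contrasts background negatives with faceted negatives for the concept Whale. It is a hedged illustration only: the facet assignments, file names and annotations are invented, and the selection rules are one plausible reading of the slide, not the system's actual logic.

```python
# Hypothetical facet assignment for a handful of labels.
FACET_OF = {
    "Whale": "Animal", "Zebra": "Animal", "Cat": "Animal", "Koala": "Animal",
    "Nature": "Setting", "Sunset": "Setting",
    "Person": "Person",
}

# Hypothetical, possibly incomplete, image annotations.
LABELS = {
    "a.jpg": {"Whale"},
    "b.jpg": {"Zebra", "Nature"},
    "c.jpg": {"Sunset"},            # may show an animal nobody labeled
    "d.jpg": {"Cat"},
    "e.jpg": {"Person", "Koala"},
}


def positives(concept):
    """Images explicitly labeled with the concept."""
    return sorted(img for img, labs in LABELS.items() if concept in labs)


def background_negatives(concept):
    """Every image not labeled with the concept (may hide unlabeled positives)."""
    return sorted(img for img, labs in LABELS.items() if concept not in labs)


def faceted_negatives(concept):
    """Only images labeled with a sibling concept from the same facet."""
    facet = FACET_OF[concept]
    siblings = {l for l, f in FACET_OF.items() if f == facet and l != concept}
    return sorted(img for img, labs in LABELS.items()
                  if concept not in labs and labs & siblings)


whale_pos = positives("Whale")
whale_bg = background_negatives("Whale")
whale_faceted = faceted_negatives("Whale")
```

Here `c.jpg` has no label in the Animal facet, so the faceted rule leaves it out instead of treating it as a negative for Whale.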
Large-Scale Semantic Modeling
What label?
a) Man b) Dog c) Frisbee d) Beach e) Playing
FACETS: [Person] [Animal] [Object] [Setting] [Activity]
Need to learn and assign labels from multiple facets!
Facets can be Nested in Hierarchical Faceted Classification Scheme
[Figure: hierarchy of facets and concepts; recoverable labels listed below]
• Top-level facets: People, Objects, Activities, Settings
• Concepts under People: Human, NonHuman
• Nested facets: Number, Pose, State, Expression, Gender, View, Age, Affiliation
• Number: Individual, Couple, Group, …
• View: Portrait, Upper body, Full body, Hands, Unknown view, …
Design principles:
• Each facet provides an alternative partitioning of its parent
• Sibling concepts are designed to be mutually exclusive
• Children concepts are complete with respect to parent concept
• Simple reasoning to select positive and negative training data for each semantic concept
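The "simple reasoning" over a nested faceted hierarchy can be sketched as follows: a label implies all of its ancestors, positives for a concept are images carrying it directly or by implication, and negatives are images carrying a sibling from the same facet. The hierarchy fragment, image names and function names below are hypothetical and only illustrate the idea.

```python
# Hypothetical fragment: concept -> (parent concept, facet under that parent).
PARENT = {
    "Individual": ("Human", "Number"),
    "Couple": ("Human", "Number"),
    "Group": ("Human", "Number"),
    "Portrait": ("Human", "View"),
    "Upper body": ("Human", "View"),
    "Full body": ("Human", "View"),
}


def ancestors(concept):
    """All concepts above the given one in the hierarchy."""
    found = set()
    while concept in PARENT:
        concept = PARENT[concept][0]
        found.add(concept)
    return found


def implied(labels):
    """Labels plus everything they imply through their ancestors."""
    full = set(labels)
    for label in labels:
        full |= ancestors(label)
    return full


def siblings(concept):
    """Other concepts in the same facet of the same parent."""
    if concept not in PARENT:
        return set()
    key = PARENT[concept]
    return {c for c, k in PARENT.items() if k == key and c != concept}


def training_sets(concept, annotations):
    """Split images into positives and sibling-based negatives."""
    pos, neg = [], []
    sibs = siblings(concept)
    for image, labels in sorted(annotations.items()):
        full = implied(labels)
        if concept in full:
            pos.append(image)
        elif full & sibs:
            neg.append(image)
    return pos, neg


ANNOTATIONS = {
    "1.jpg": {"Portrait", "Individual"},
    "2.jpg": {"Full body", "Group"},
    "3.jpg": {"Couple"},
    "4.jpg": {"Upper body"},
}
portrait_sets = training_sets("Portrait", ANNOTATIONS)
group_sets = training_sets("Group", ANNOTATIONS)
human_sets = training_sets("Human", ANNOTATIONS)
```

Images with no label in the relevant facet (for example `3.jpg` for the View facet) are neither positive nor negative, which avoids training on unlabeled positives.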
Facets can be Nested in Hierarchical Faceted Classification Scheme (continued)
[Figure: the same faceted hierarchy, with concepts marked Positive and Negative]
Key Challenge in Visual Discriminative Learning is Managing and Selecting Negative Training Examples
[Figure: Semantic Learning pipeline, with the Training Setup marked as the open question]
• Training Setup ?: Positive Examples, Negative Examples ???
• Visual Feature Extraction: color, edges, shape, texture
• Visual Model Training / Unit Model Training: models 1, 2, …, N per feature
• Unit Model Validation
• Fusion Model Training
• Ensemble Classifier Learning, Discriminative Model
• Metrics-based Validation: Precision vs. Recall, Accuracy vs. Score, Avg. Precision vs. Score, F1 vs. Rank, Precision vs. Score, Score vs. Rank
Approaches for Labeling Training Images for Discriminative Learning
Label matrix: K Categories (rows A to M) by N Images (columns 1 to 12)

   1  2  3  4  5  6  7  8  9 10 11 12
A  N  Y  N  N  N  N  Y  N  Y  N  N  N
B  N  N  Y  N  N  N  N  N  N  N  N  N
C  N  N  N  N  N  N  Y  N  N  Y  N  N
D  Y  N  N  N  N  N  N  N  N  N  N  Y
E  N  N  N  N  N  Y  N  Y  N  N  N  N
F  N  N  N  N  N  N  N  N  N  N  N  N
G  N  N  N  Y  Y  N  N  N  N  N  N  N
H  N  N  N  N  N  N  N  N  N  N  N  N
I  N  N  N  N  N  N  N  N  N  N  N  N
J  N  Y  N  N  N  N  N  N  N  N  Y  N
K  N  N  N  N  N  N  Y  N  N  N  N  N
L  N  N  N  N  N  N  N  N  Y  N  N  N
M  N  N  N  Y  N  N  N  N  N  N  N  N
Approaches for Labeling Training Images for Discriminative Learning
[Figure: the same K Categories by N Images label matrix as on the previous slide, with every cell labeled Y or N]
Exhaustive: label all N images by all K categories
• N × K is large (does not scale!)
• Assume a very “efficient” person can label 100K images per day per concept
• 1M image training set
• 10K concepts
• 0.1 concept per day per person
• Need 100K person-days
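The arithmetic behind the exhaustive-labeling estimate can be checked with a few lines. This is a small sketch using only the figures stated on the slide; the function and variable names are illustrative.

```python
def exhaustive_labeling_cost(n_images, n_concepts, images_per_person_day):
    """Cost of labeling every image for every concept.

    images_per_person_day: images one person can label per day for a
    single concept (the slide assumes 100K).
    """
    total_labels = n_images * n_concepts                    # N * K decisions
    days_per_concept = n_images // images_per_person_day    # one person, one concept
    concepts_per_person_day = images_per_person_day / n_images
    person_days = n_concepts * days_per_concept
    return {
        "total_labels": total_labels,
        "days_per_concept": days_per_concept,
        "concepts_per_person_day": concepts_per_person_day,
        "person_days": person_days,
    }


# Figures from the slide: 1M images, 10K concepts, 100K images/day/concept.
cost = exhaustive_labeling_cost(1_000_000, 10_000, 100_000)
```

One person needs 10 days to cover a single concept across 1M images, which is 0.1 concept per person-day, so 10K concepts require 100K person-days for 10 billion label decisions.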
Approaches for Labeling Training Images for Discriminative Learning
[Figure: K Categories (A to M) by N Images (1 to 12) label matrix in which only some cells are labeled Y or N]
Exhaustive: label all N images by all K categories N *K is large (does not scale!) Binary: label p positives (p