Semantics of Visual Discrimination

Semantics of Visual Discrimination John R. Smith IBM T. J. Watson Research Center [email protected]

February 2015 © 2015 IBM Corporation

Many Thanks To …

IBM T. J. Watson - Multimedia Research Team: Liangliang Cao, Quoc-Bao Nguyen, Michele Merler, Noel Codella, Wei Liu, Rosario Uceda-Sosa

IBM T. J. Watson - Exploratory Computer Vision Team: Matthew Hill, Sharath Pankanti, Rogerio Feris, Quanfu Fan, Nalini Ratha, Chiao-Fe Shu, Chung-Ching Lin, Lisa Brown

IBM Research Collaborators: Gang Hua, Shih-Fu Chang, John Kender, Daniel Ellis, Felix Yu

Big News Event in 2005

How many cameras?

Similar Big News Event Eight Years Later

How many cameras?

Cannot count them all … image and video are here!

Massive Multimedia is the Biggest Wave of All!

[Figure: data volume (Giga, Tera, Peta, Exa) and expressiveness (Low to High) by decade (1990’s, 2000’s, 2010’s, 2020’s), rising from structured data through text, audio, and image to video. Per item: video (movie) is 1000x text (book). Side annotations: >> Computation (Mips), >> Sophistication (# algorithms), Features, Models.]

• Safety / Security: 10s of millions of cameras
• Medical: 1B medical images
• Media: 1B camera phones, 100 video hrs/minute
• Wide Area Imagery: 100’s of TB per day
• Digital Marketing: 12% of video views
• Enterprise Video: used by 1/3 of enterprises

Key Messages:

• Image and video data is growing in volume and importance
• Manual analysis is not scalable … cannot keep up
• Vision system development has historically required deep expertise

Expert Analysis → Data-driven Visual Learning

• Increasing availability of labeled training data
• Emergence of increasingly sophisticated visual learning algorithms

Focus needed on Visual Semantic Modeling

• How to best model visual world (concepts and relationships)
• How to combine visual semantic modeling with learning systems

Use Cases

Multiple Industry Problems Require Large-Scale Visual Content Analysis

Safety and Security (IBM Intelligent Video Analytics)
• Volume: 10K’s managed cameras per city
• Velocity: real-time alerts, 20M video events/day
• Variety: street scenes, rail stations, crowds, people, environmental conditions
• Veracity: analysis of complex activities (trip wires, abandoned objects)

Healthcare Cognitive Systems (IBM Watson Research)
• Volume: 1B medical images per year (growing 20-40% /yr)
• Velocity: 50K radiology images per day per radiology dept.
• Variety: images, video, text, patent records, cases, scientific literature, ontologies/semantics
• Veracity: subjective interpretation across millions of categories (modalities, body views, organ systems, pathologies, anomalies)

Enterprise Content Management (IBM IMARS)
• Volume: 70PB broadcast/yr, 40K hrs per news archive
• Velocity: 100 video hrs/min to YouTube
• Variety: mobile, user generated, professional
• Veracity: robust content extraction for objects, places, scenes, activities, people

Retail and Mobile Commerce (IBM System V)
• Volume: 500B consumer photos/yr
• Velocity: 100M customers per week for large retailers
• Variety: transient and dynamic content
• Veracity: predicting consumer attributes from diverse sources including visual data (images and video)

[Figure: analysis tasks for Images, Video, and Multimedia placed on a time scale (us, ms, sec, min, hr, day, wk, mo, yr). Data-in-Motion: Content-Filtering, Real-time Alerting, Behavior Analysis. Data-at-Rest: Content Classification, Content-based Search, Cross-Camera Mining, Activity Based Intelligence, Broadcast Monitoring, Real World Events.]

Safety and Security: Urban Surveillance

[Figure labels: Trip Wires, Object Detection, Object Tracking, Alerts; Offline Training, Machine Learning, Semantic Extraction, Attributes, Forensic Search, Visual Recognition, Pattern Discovery, Activity Detection]

• Expert developed algorithms
• Limited object detection and tracking
• Limited robustness to environment
• Manual tuning for new deployments

• Large-scale data-driven visual learning
• Semantic extraction including attributes
• Integrated analysis across cameras
• Self-configuration, tuning, adaptation

IBM Intelligent Video Analytics (IVA) for Smarter Cities

CAPTURE, ANALYZE, DECIDE, ACT

Visual Learning
• Vehicle detection/classification
• Person and face attributes
• Activities and behaviors

Traditional Computer Vision
• Real world metrics (speed, size)
• Trip wires and safety regions
• Object detection and tracking

Media-in-the-Wild

Traditional Media Archives
• Limited indexing and search of news archives, TV programs, movies
• Relies heavily on related information (e.g., metadata, speech transcript, user tags)

User Generated Content
• Want to understand diverse user generated content at a semantic level (e.g., sports, activities, life events)
• Extraction of visual insights
• Discovery of patterns across users, segments, time, geographies

Visual search uses thousands of Visual Classifiers across dozens of Facets (Objects, Scenes, Locations, Activities, People, Events, etc.)

Visual Content Extraction: Scenes, Objects, People, Actions, Activities, Events

Example labels: City scene, Adult person, Parade marching, Playing Fetch, At a Wedding, Shopping, At a Concert, Cake, Waving, People Gathering

Example: Visual Recognition for Image Semantic Indexing

Visual Semantic Label Categories (Demo), e.g., Surfing

Image Classification Results using Visual Semantic Faceted Hierarchy

• [OBJECT] Ambulance Vehicle
• [OBJECT] Wind-Powered Boat

More Examples of Classification Results using Semantic Faceted Hierarchy

• [ANIMAL] Dog
• [ANIMAL] Walrus

Machine Learning Shifts Effort to Organizing Image Data for Training

Accurate recognition of sports by training 150 categories, e.g., Hot Air Ballooning, Hang Gliding, Figure Skating, Skiing, Equestrian, Softball

Semantic Search is Supported by Automatically Extracted Semantic Labels (Boolean = AND/OR)

Example queries: Car AND Street Scene; Beach OR Sunset

Semantic Indexing applies to Video Content Extraction and Search

Example queries: Boating; Parade AND Urban Scene

Other activities: Running, Skiing, …

Other Combining Functions: AND, OR, X, MIN
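The combining functions named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not define the operators, so here AND and OR are taken as Boolean tests on thresholded scores, X as the product of scores, and MIN as the minimum score. The function names, the 0.5 threshold, and the example shots are hypothetical.

```python
from math import prod


def combine(scores, how="MIN", threshold=0.5):
    """Combine per-concept classifier scores (each in [0, 1]) into one value.

    Assumed semantics (not taken from the slides):
      AND / OR : Boolean test on thresholded scores, returned as 1.0 or 0.0
      X        : product of the scores
      MIN      : minimum of the scores
    """
    if not scores:
        raise ValueError("need at least one score")
    if how == "AND":
        return float(all(s >= threshold for s in scores))
    if how == "OR":
        return float(any(s >= threshold for s in scores))
    if how == "X":
        return prod(scores)
    if how == "MIN":
        return min(scores)
    raise ValueError(f"unknown combining function: {how}")


def rank(shots, concepts, how):
    """Order shot names by the combined score of the queried concepts."""
    combined = {
        name: combine([scores[c] for c in concepts], how)
        for name, scores in shots.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical classifier outputs for two video shots.
shots = {
    "shot_a": {"Parade": 0.9, "Urban Scene": 0.8},
    "shot_b": {"Parade": 0.7, "Urban Scene": 0.2},
}
ranking = rank(shots, ["Parade", "Urban Scene"], "MIN")
```

Soft functions such as X and MIN keep a graded ranking, whereas thresholded AND and OR only split results into matches and non-matches.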

On-line photos and videos have enormous potential as a rich source of information about consumers

• ¼ Trillion photos hosted on Facebook from 1.3 Billion users
• 100 Hours of video are uploaded to YouTube every minute
• 51% of the Class of 2015 (high school) use Instagram daily
• 80% of Pinterest users are female

. . . and growing

Growing Amount of Consumer Images and Video is Source for Insights

• Consumer Photos, Pins, Likes: Pins / Re-pins, Likes / Dislikes, Tweets, Favorites
• Visual Classifiers: Access Public Visual Content, Perform Semantic Classification (Semantics, Products, Styles, Logos, Brands, Designs)
• Examples: Kitchen Gallery, Dream Home, Wedding
• Visual Extraction: Preferred Styles, Hobbies/interests, Life Events, Products
• Consumer Targeting, Marketing and Planning: Personalization, Promotions, Campaigns, Planning

Healthcare: Radiology Image Analysis

• Specialized algorithms per medical modality and disease
• Limited scaling and coverage
• Bottleneck in algorithm development by computer vision experts

Medical Insights

• Diverse visual extraction across modality, anatomy, pathology
• Address wide spectrum across patient image data, images and figures in medical literature, visual knowledge repositories
• Beyond radiology to analyze images broadly in medicine

Multi-modal Analysis helps Build Medical Knowledge and Aid Diagnosis

[Figure: Multimedia Medical Data and Multimodal Context Data (Medical Knowledge, Patient Records) feed Massive Machine Learning*, organized by Medical Semantics*, to produce Actionable Information.]

• Medical Image Taxonomy (RadLex, IRMA): Classification (modality, view, organ system, artifacts), e.g., X-Ray (0.9), Chest (0.8), Lungs (0.8)
• Anatomy / Features (FMA, OBO): Anomaly Analysis (body/organ regions, local features), e.g., Cloudy (0.3), Shadowed (0.3), Darkened (0.1)
• Pathology / Disease (UMLS, ICD, ICF): Semantic Tagging (pathology, disease), e.g., Pneumonia (0.3), Tuberculosis (0.1), Normal (0.4)

Actionable Information: Automatic Triage and Retrieval, Computer Aided Diagnosis, Coincidental Diagnosis, Patient-Centric Medicine (Annotated References, Selected References, Categorical Examples)

Huge Diversity of Imagery in Medicine Spans Modality, Anatomy, Disease

Visual Classification
• Modality: X-ray, CT, MRI
• Region: Pelvis, Head, Chest
• Disease: Tuberculosis, Cancer, Collapse

Multi-Modal Information: Text, Image

NIH PubMed Medical Modality Classification (Visual Recognition):
• Extracting knowledge from NLM/NIH PubMed: 14.2 million articles (3.8 million free full-text articles)
• Rich contextual information: text + figures + captions
• Challenge: given an unknown PubMed image, determine medical category automatically
• Automatically classifying millions of published medical images by modality, region, disease along with associated text builds multi-modal knowledge

Medical Knowledge Management – Automatic Medical Image Classification

Medical image taxonomy
• Modality: X-ray, CT, MRI
• Region: Pelvis, Head, Chest
• Disease: Tuberculosis, Cancer, Collapse

Example categories: MRI Brain Axial, MRI Knee, PET Color, DX Appendage, DX Torso, DX Cervical Spine

Source: NIH PubMed figures + captions

ImageCLEF Medical Image Classification*

Goal: automatically classify images into correct medical category
• Data-set of NIH PubMed articles with 305,000 images
• IBM achieved #1 performance in 2012 and 2013
• Top performance in each task (visual, text, combined)

[Chart: Mean Accuracy (0 to 90) for IBM and Non-IBM entries on the Combined, Text, and Visual tasks. Inset: several per-feature models combined by Fusion to predict a label such as “X-ray”.]

* http://imageclef.org/2013/medical

Massive-Scale Multimedia Semantic Modeling

© 2014 IBM Corporation

Visual Recognition Technical Foundation

Multi-layer Learning Architecture for Image and Video Analysis

[Figure: layered architecture with the following layers.]

• Images and Video
• Features: Color, Texture, Edges, Shape, Motion, Background, Camera Motion, Shot Boundaries, Energy, Regions, Moving Objects, Zero-crossings, Frequencies, Spectrum, Tracks, Scene Dynamics
• Models: Expectation Maximization, SVMs, Ensemble Classifiers, Regression, Decision Tree, K-means, Factor Graph, AdaBoost, GMM, Active Learning, Bayes Net, Deep Learning, Markov Model, Neural Net, Nearest Neighbor, Clustering
• Training data: Labeled Data (Positive Examples, Negative Examples), Unlabeled Data
• Semantics: Scenes, Locations, Settings, Activities, Events, People, Objects, Faces, Behaviors, Places, Actions, Cars, Animals, Vehicles, Living Objects

• Visual Recognition = predicting “semantic” labels for unknown images and video

Data-driven Machine Learning to Create Visual Semantic Classifiers

Natural Photos and Video

Medical Modalities & Viewpoints

Same Approach

Rich Set of Visual Features is Needed to Learn Semantic Discrimination

Category                        Feature                        Score
Aardvark [1]                    Spatial Scale Global           0.86
Even-Toed Ungulate [14]         Local Binary Pattern Global    0.68
Primate [1]                     Local Binary Pattern Global    0.69
Seal, Sea Lion, Walrus [3]      Local Binary Pattern Global    0.83
Bat [1]                         Thumbnail Vector Global        0.93
Meat-Eater [12]                 Color Correlogram Global       0.68
Rabbit, Pika, Hare [2]          Color Moments Grid             0.79
Toothless Mammal [2]            Local Binary Pattern Global    0.85
Sea Cow [1]                     Color Histogram Global         0.95
Whale, Dolphin, Porpoise [5]    Color Histogram Global         0.95
Elephant [1]                    Spatial Scale Global           0.81
Odd-Toed Ungulate [4]           Edge Histogram Layout          0.79

Increasing Availability of Labeled Data is Accelerating Visual Recognition

Scenes and Objects:
• SUN Database: 131,072 images, 908 scenes, 3,819 objects (http://groups.csail.mit.edu/vision/SUN/)

Objects and Actions:
• PASCAL VOC: 11,530 images with 27,450 ROI annotated objects and 6,929 segmentations (http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2012/)
• 20 objects, segmentations, 10 actions

Scenes:
• INRIA Holidays Data Set: 1,491 images, 500 queries, 991 search images (http://lear.inrialpes.fr/~jegou/data.php)

Objects:
• ImageNet linguistic visual objects (http://image-net.org/)
• Full data: 15,589 WordNet synsets
• ILSVRC: 1,000 object categories and 1.2 million images

Visual Classifier Learning Allows Metrics-based Optimization

[Figure: Visual Classifier Learning pipeline.]

• Training Images and Video: Positive Examples, Negative Examples
• Visual Feature Extraction: color, edges, texture, …, CNN features
• Visual Model Training (linear SVM): models 1, 2, …, N
• Visual Model Fusion (non-linear SVM): Ensemble Visual Classifier Training, producing the Discriminative Model
• Metrics-based Validation: Precision vs. Recall, Accuracy vs. Score, Avg. Precision vs. Score, F1 vs. Rank, Precision vs. Score, Score vs. Rank
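The fusion step can be illustrated with a deliberately simplified sketch. The slide names a non-linear SVM for Visual Model Fusion; the code below swaps in a plain weighted average of per-feature scores (a common late-fusion baseline) only to show the data flow from unit models to one ensemble score. The function name, feature names, and weights are hypothetical.

```python
def late_fusion(unit_scores, weights=None):
    """Fuse per-feature classifier scores for one image into a single score.

    unit_scores: dict mapping feature name -> score from that feature's model.
    weights:     optional dict of non-negative weights (for example, each
                 model's validation accuracy); defaults to equal weights.

    NOTE: a weighted average is an assumption made for illustration. It is
    a stand-in for the non-linear SVM fusion named on the slide.
    """
    if not unit_scores:
        raise ValueError("need at least one unit model score")
    if weights is None:
        weights = {name: 1.0 for name in unit_scores}
    total = sum(weights[name] for name in unit_scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * score for name, score in unit_scores.items()) / total


# Hypothetical scores from four unit models for one image.
scores = {"color": 0.2, "edges": 0.4, "texture": 0.6, "cnn": 0.8}
equal = late_fusion(scores)
cnn_only = late_fusion(scores, {"color": 0, "edges": 0, "texture": 0, "cnn": 1})
```

Choosing the weights on held-out data is one way to connect this step to the metrics-based validation listed on the slide.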

Visual Recognition Example (Demo)

Upload Unknown Images

Visual Recognition Results

Visual Indexing and Search using Semantic Faceted Taxonomy

[Figure: processing pipeline; steps are shown with API-style labels.]

• Images and Videos
• Temporal Segmentation (parseVideo)
• Shot Boundary Determination (parseVideo)
• Key Frame Extraction (parseSequence)
• Extract Features (ExtractFeatures): color, texture, edges, shape, …
• Learn Visual Classifiers (LearnClassifiersFromImages, LearnClassifiersFromVectors) from Positive and Negative Training Examples
• Extract Concepts (ConceptExtract) using the Visual Classifiers
• Visual Semantic Search: Visual Query [C1 AND C2] OR [C3]

Visual Semantic Taxonomy: scenes, objects, events, people, activities

Different Approaches can be used for Managing and Selecting Negatives for Visual Discriminative Learning

• Common background of all images (unlabeled positive images are included as negatives). Positive Examples: Whale. Negative Examples: Person, Dog, Whale (marked X).
• Traditional taxonomy mutual exclusivity (unlabeled positive images are included as negatives). Positive Examples: Whale. Negative Examples: Zebra, Nature, Sunset.
• Faceted classification ensures mutual exclusivity within facets (allows correct positives and negatives). Positive Examples: Whale. Negative Examples: Zebra, Cat, Koala.

Large-Scale Semantic Modeling

What label?

a) Man       [Person]
b) Dog       [Animal]
c) Frisbee   [Object]
d) Beach     [Setting]
e) Playing   [Activity]

Need to learn and assign labels from multiple facets!
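A minimal sketch of assigning one label per facet, assuming each concept belongs to exactly one facet and each classifier returns a score in [0, 1]. The function name, the 0.5 threshold, the scores, and the extra concept "Woman" are hypothetical and serve only to show two concepts competing inside the same facet.

```python
def label_by_facet(scores, facet_of, threshold=0.5):
    """Return the best-scoring concept for every facet.

    scores:   dict concept -> classifier score for one image
    facet_of: dict concept -> facet name
    A facet is omitted when none of its concepts reaches the threshold.
    """
    best = {}  # facet -> (concept, score)
    for concept, score in scores.items():
        facet = facet_of[concept]
        if score >= threshold and score > best.get(facet, ("", 0.0))[1]:
            best[facet] = (concept, score)
    return {facet: concept for facet, (concept, _) in best.items()}


# Facet membership follows the slide; "Woman" is an added, hypothetical concept.
facet_of = {
    "Man": "Person", "Woman": "Person", "Dog": "Animal",
    "Frisbee": "Object", "Beach": "Setting", "Playing": "Activity",
}
# Hypothetical classifier scores for the example photo.
scores = {
    "Man": 0.9, "Woman": 0.3, "Dog": 0.8,
    "Frisbee": 0.7, "Beach": 0.85, "Playing": 0.6,
}
labels = label_by_facet(scores, facet_of)
```

Because concepts compete only within their own facet, the image can carry Man, Dog, Frisbee, Beach, and Playing at the same time.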

Facets can be Nested in Hierarchical Faceted Classification Scheme

[Figure: tree of facets and concepts.]

• Facets: People, Objects, Activities, Settings, Number, Pose, State, Expression, Gender, Age, Affiliation, View
• Concepts shown: Human, NonHuman; Individual, Couple, Group; Portrait, Upper body, Full body, Hands, Unknown view, …

• Each facet provides an alternative partitioning of its parent
• Sibling concepts are designed to be mutually exclusive
• Children concepts are complete with respect to parent concept
• Simple reasoning to select positive and negative training data for each semantic concept

Facets can be Nested in Hierarchical Faceted Classification Scheme

[Figure: tree of facets and concepts, with a Positive / Negative legend for training examples.]

• Facets: People, Objects, Activities, Settings, Number, Pose, State, Expression, Gender, Age, Affiliation, View
• Concepts shown: Human, NonHuman; Individual, Couple, Group; Portrait, Upper body, Full body, Hands, Unknown view, …

• Each facet provides an alternative partitioning of its parent
• Sibling concepts are designed to be mutually exclusive
• Children concepts are complete with respect to parent concept
• Simple reasoning to select positive and negative training data for each semantic concept
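One way to read the "simple reasoning" rule is sketched below, under stated assumptions: positives for a concept are images labeled with that concept (or a descendant), negatives are images labeled with one of its mutually exclusive siblings, and images with no label in that facet are left out rather than treated as negatives. The tree fragment (Number and View with the children shown) and all image names are hypothetical.

```python
def descendants(concept, children):
    """Return the concept together with everything below it in the tree."""
    found = {concept}
    for child in children.get(concept, ()):
        found |= descendants(child, children)
    return found


def select_training(concept, children, image_labels):
    """Pick positive and negative images for one concept.

    children:     dict parent -> list of child concepts (siblings are
                  assumed mutually exclusive, as the slide states)
    image_labels: dict image name -> set of concept labels
    """
    parent = {c: p for p, kids in children.items() for c in kids}
    siblings = [c for c in children[parent[concept]] if c != concept]
    pos_concepts = descendants(concept, children)
    neg_concepts = set().union(*(descendants(s, children) for s in siblings))

    positives, negatives = [], []
    for image, labels in image_labels.items():
        if labels & pos_concepts:
            positives.append(image)
        elif labels & neg_concepts:
            negatives.append(image)
        # Images with no label in this facet are skipped: their status is unknown.
    return sorted(positives), sorted(negatives)


# Hypothetical fragment of the faceted hierarchy.
children = {
    "Number": ["Individual", "Couple", "Group"],
    "View": ["Portrait", "Upper body", "Full body", "Hands"],
}
image_labels = {
    "img1": {"Individual", "Portrait"},
    "img2": {"Group"},
    "img3": {"Couple", "Full body"},
    "img4": {"Portrait"},  # no Number label, so unusable for "Individual"
}
pos, neg = select_training("Individual", children, image_labels)
```

Skipping unlabeled images is what avoids the problem noted earlier in the deck, where unlabeled positives end up among the negatives.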

Key Challenge in Visual Discriminative Learning is Managing and Selecting Negative Training Examples

[Figure: Semantic Learning pipeline, with question marks on the Training Setup.]

• Training Setup: Positive Examples (?), Negative Examples (???)
• Visual Feature Extraction: color, edges, shape, texture
• Unit Model Training (Visual Model Training): models 1, 2, …, N
• Unit Model Validation
• Fusion Model Training (Ensemble Classifier Learning): Discriminative Model
• Metrics-based Validation: Precision vs. Recall, Accuracy vs. Score, Avg. Precision vs. Score, F1 vs. Rank, Precision vs. Score, Score vs. Rank
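Two of the validation curves listed above reduce to short computations over a ranked list. The sketch below assumes binary ground-truth labels (1 = positive, 0 = negative) and classifier scores where higher means more confident. The function names and sample data are hypothetical, and average precision is computed in its usual non-interpolated form.

```python
def ranked_labels(scores, labels):
    """Ground-truth labels ordered by descending classifier score."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    return [label for _, label in pairs]


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive example."""
    ranked = ranked_labels(scores, labels)
    n_pos = sum(ranked)
    if n_pos == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, is_pos in enumerate(ranked, start=1):
        if is_pos:
            hits += 1
            total += hits / rank
    return total / n_pos


def f1_at_rank(scores, labels, k):
    """F1 when the top-k ranked items are predicted positive."""
    ranked = ranked_labels(scores, labels)
    n_pos = sum(ranked)
    hits = sum(ranked[:k])
    if hits == 0:
        return 0.0
    precision, recall = hits / k, hits / n_pos
    return 2 * precision * recall / (precision + recall)


# Hypothetical validation set: four images, two of them true positives.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
ap = average_precision(scores, labels)
f1_top2 = f1_at_rank(scores, labels, 2)
```

Sweeping `k` over every rank yields the F1 vs. Rank curve, and the same ranked list underlies the precision-recall plot.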

Approaches for Labeling Training Images for Discriminative Learning

K Categories (rows A to M) × N Images (columns 1 to 12)

      1   2   3   4   5   6   7   8   9   10  11  12
A     N   Y   N   N   N   N   Y   N   Y   N   N   N
B     N   N   Y   N   N   N   N   N   N   N   N   N
C     N   N   N   N   N   N   Y   N   N   Y   N   N
D     Y   N   N   N   N   N   N   N   N   N   N   Y
E     N   N   N   N   N   Y   N   Y   N   N   N   N
F     N   N   N   N   N   N   N   N   N   N   N   N
G     N   N   N   Y   Y   N   N   N   N   N   N   N
H     N   N   N   N   N   N   N   N   N   N   N   N
I     N   N   N   N   N   N   N   N   N   N   N   N
J     N   Y   N   N   N   N   N   N   N   N   Y   N
K     N   N   N   N   N   N   Y   N   N   N   N   N
L     N   N   N   N   N   N   N   N   Y   N   N   N
M     N   N   N   Y   N   N   N   N   N   N   N   N

Approaches for Labeling Training Images for Discriminative Learning

K Categories (rows A to M) × N Images (columns 1 to 12)

      1   2   3   4   5   6   7   8   9   10  11  12
A     N   Y   N   N   N   N   Y   N   Y   N   N   N
B     N   N   Y   N   N   N   N   N   N   N   N   N
C     N   N   N   N   N   N   Y   N   N   Y   N   N
D     Y   N   N   N   N   N   N   N   N   N   N   Y
E     N   N   N   N   N   Y   N   Y   N   N   N   N
F     N   N   N   N   N   N   N   N   N   N   N   N
G     N   N   N   Y   Y   N   N   N   N   N   N   N
H     N   N   N   N   N   N   N   N   N   N   N   N
I     N   N   N   N   N   N   N   N   N   N   N   N
J     N   Y   N   N   N   N   N   N   N   N   Y   N
K     N   N   N   N   N   N   Y   N   N   N   N   N
L     N   N   N   N   N   N   N   N   Y   N   N   N
M     N   N   N   Y   N   N   N   N   N   N   N   N

• Exhaustive: label all N images by all K categories
• N × K is large (does not scale!)
• Assume a very “efficient” person can label 100K images per day per concept
• 1M image training set
• 10K concepts
• 0.1 concept per day per person
• Need 100K person-days
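The arithmetic behind these figures can be checked with a short calculation. The function and parameter names are hypothetical; the inputs are the numbers stated on the slide.

```python
def exhaustive_labeling_cost(n_images, n_concepts, images_per_person_day):
    """Cost of labeling every image for every concept.

    Returns (concepts finished per person per day, total person-days).
    """
    # One person needs this many days to label the full set for ONE concept.
    days_per_concept = n_images / images_per_person_day
    concepts_per_person_day = 1 / days_per_concept
    person_days = n_concepts * days_per_concept
    return concepts_per_person_day, person_days


# Slide figures: 1M images, 10K concepts, 100K images per person per day per concept.
rate, person_days = exhaustive_labeling_cost(
    n_images=1_000_000, n_concepts=10_000, images_per_person_day=100_000
)
```

With these inputs each concept takes 10 days of one person's time, which gives the 0.1 concept per day and the 100K person-days quoted above.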

Approaches for Labeling Training Images for Discriminative Learning

[Figure: K Categories (A to M) × N Images (1 to 12) label matrix, only partially filled with Y and N labels.]

• Exhaustive: label all N images by all K categories
• N × K is large (does not scale!)
• Binary: label p positives (p