Distributed Vision Processing in Smart Camera Networks ICASSP 2009 Taipei April 20, 2009
Hamid Aghajan, Stanford University, USA (aghajan AT stanford.edu)
Part 2/4: Smart Cameras http://wsnl.stanford.edu/ICASSP09/
Our Lab Students (credits for the results presented): Chen Wu, Jingyu Cui, Amir Khalili, Nan Hu, Tianshi Gao, Stephan Hengstler, Huang Lee, Itai Katz, Tommi Maataa (Philips, TU Eindhoven, Netherlands), Linda Tessens, Marleen Morbee (Ghent U., Belgium)
Y2E2 iRoom
EE Dept. - WSNL: Wireless Sensor Networks Lab
WSNL - Stanford
Distributed Vision Processing
2
Syllabus
• Smart cameras
• Case Study – Human pose analysis
Technology Cross-Roads

Sensor Networks
• Wireless communication
• Networking

Image Sensors
• Rich information
• Low power, low cost

Signal Processing
• Embedded processing
• Collaborative methods

Smart Camera Networks → Distributed Vision Processing
Architecture? Algorithms? Applications?
• Scene understanding
• Human gesture

Potential impact on design methodologies in each discipline
Vision
• Rich content – Window to the world
• Unobtrusive interface – Non-user-wearable
• Context-based processing – Many applications: versatile high-level interfacing with common vision blocks (assisted living, gaming, retail ads, avatars)
Face profile: remote gaming
Multi-Camera Vision
• Added coverage – Areas of interest – Occlusion handling
• 3D reconstruction
• Added confidence – Event interpretation
• Role selection – Large-area view: location; close-up view: pose
Examples: smart homes (user behavior modeling), tele-presence, assisted living
Smart Environments
• Observe → interpret → build up behavior models → react
• Quantitative knowledge + qualitative assessment (sensing → processing → context → behavior model)
• Responsive to events – Adapt services – Employ additional sensors – Send alerts
• Interactive – Based on gesture, location, region of interest of user
• Self-configure, discover the interests, adapt to user
Vision can play an enabling role
Vision – Potentials
• Assistive technologies: response systems, companion robots
• Robotics
• Surveillance: event detection, identification / tracking, large-scale deployments
• Tele-presence: virtual reality, gaming over network
• Multimedia
• Human Computer Interaction: immersive virtual reality, non-restrictive interface, occupancy sensing

Enabling technologies:
o Vision processing
o Wireless sensor network
o Embedded computing
o Signal processing

Common core: Vision and Multi-modal Sensor Network
Smart Camera Networks
Rich design space driven by application requirements:
• Camera node: Energy consumption? Camera orientation? Placement?
• Vision system: Mono or stereo? Resolution? Field-of-view?
• Data exchange: Type of data? Traffic load? Data aggregation?
• Distributed observations: Which cameras sense? Network topology?
• Task: Tracking? Counting?
• Vision algorithm: Local vs. central processing
• Application requirements: Accuracy? Coverage? Network lifetime?
Classical Multi-Camera Application: Surveillance
Classical Multi-Camera Application: Surveillance

Network Intelligence   Network Objective     Required Bandwidth
high                   Event description     low (~10 KB/s)
medium                 Object description    ~1 MB/s
medium                 Object detection      moving scenes
~none                  Raw video stream      high (~10 MB/s)
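The bandwidth figures on this slide (~10 KB/s for event descriptions, ~10 MB/s for raw video) are order-of-magnitude values; a quick back-of-the-envelope check reproduces them. The per-frame sizes below (CIF color frames, a 1-bit foreground mask, a 300-byte text description) are illustrative assumptions, not figures from the slides:

```python
# Rough per-frame data sizes for each level of in-network intelligence,
# assuming CIF (352x288) video at 30 fps. Frame sizes are assumptions
# chosen to show the orders of magnitude.

def rate_bytes_per_s(bytes_per_frame, fps=30):
    return bytes_per_frame * fps

CIF_PIXELS = 352 * 288                        # 101,376 pixels

raw = rate_bytes_per_s(CIF_PIXELS * 3)        # 24-bit color, uncompressed
objects = rate_bytes_per_s(CIF_PIXELS // 8)   # 1-bit foreground mask
events = rate_bytes_per_s(300)                # short text description

print(f"raw video  : {raw / 1e6:.1f} MB/s")
print(f"object mask: {objects / 1e6:.2f} MB/s")
print(f"event text : {events / 1e3:.1f} KB/s")
```

The results (about 9 MB/s, 0.4 MB/s, 9 KB/s) match the table's order-of-magnitude entries.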
Smart Cameras
• Image sensor: CIF (352x288), VGA (640x480)
• Radio: 802.11 and 802.15.4; data rate kB/s to MB/s
• Processing unit: 32-bit RISC, 20-200 MHz
• Storage: SRAM & Flash (MBs)
• Energy source
Resource constraints: computation, energy, communication bandwidth
Big Picture
• Process locally … Fuse globally
– Move away from streaming raw video
– “Smart” cameras: local processing power
(Processing levels: low → intermediate → high)
Big Picture
• Process locally … Fuse globally
– Algorithm design depends on system and application:
  • Network's scale and size
  • Available bandwidth
  • Processing powers (embedded vs. central)
  • Application requirements (accuracy, latency, data fusion level)
Multi-camera hardware & network ↔ Vision algorithms:
  • Local processing and centralized processing
  • Communication bandwidth
  • Latency of real-time results
  • Resolution in image view and time
  • Temporal alignment (synchronization)
  • Camera view overlaps, data redundancies
  • Data exchange methods
Big Picture
• Process locally … Fuse globally
– Different levels of local processing:
  • Extract generic features (e.g. silhouette, edges) – low order-of-magnitude data reduction from raw video
  • Report mid-level objects (e.g. segment area of interest) – high order-of-magnitude data reduction
  • Decision-level processing (e.g. classify an action) – small number of information bytes
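The first level above (generic features) can be illustrated with a minimal background-subtraction silhouette sketch; the threshold value and the toy 8x8 frame are assumptions for demonstration only:

```python
import numpy as np

# Minimal sketch of the lowest local-processing level: extract a binary
# silhouette by background subtraction, the kind of generic feature a
# smart camera would transmit instead of raw video.

def silhouette(frame, background, thresh=30):
    """Return a 1-bit foreground mask from grayscale frame/background."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return (diff > thresh).astype(np.uint8)

# toy 8x8 "frame": a bright 3x3 blob on a dark background
bg = np.zeros((8, 8), dtype=np.uint8)
fr = bg.copy()
fr[2:5, 2:5] = 200
mask = silhouette(fr, bg)
print(mask.sum())        # 9 foreground pixels
```

Packed as one bit per pixel, the mask is already an order of magnitude smaller than the raw frame.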
The Issue of Privacy
– Cameras:
  • Offer a non-wearable sensing option (unobtrusive)
  • However, are often regarded as rather invasive sensing
– Privacy concerns MUST be addressed for home applications
Added motivation for “Smart” Cameras
The Issue of Privacy
• Smart cameras + a multi-layered privacy handling approach:
  – Turn video into text in normal state (as well as at alerts)
  – Map person’s gesture onto: silhouette, avatar
• Alert mechanism:
  – Implement multi-level alert system (green / yellow / red)
  – Activate voice communication first to check status
  – Image query only possible by authorized nurse / family
  – Raw video saved locally for post-event analysis / diagnosis
Distributed Processing
(Processing levels: low → intermediate → high)
Distributed Processing
Distribution across space
WiCa: NXP Semiconductor Research
Layered Processing

Description layers and processing layers (distribution across processors):
• Description Layer 4: Actions, labels (global result G)
• Processing Layer 3: Interpretative (reasoning processor, PC)
• Description Layer 3: Poses, attributes (estimates E1, E2, E3)
• Processing Layer 2: Collaborative (fusion processor, embedded or PC)
• Description Layer 2: Low-level features (f11, f12, …, fused into F1, F2, F3)
• Processing Layer 1: Distributed (pixel processor, SIMD)
• Description Layer 1: Image / video (regions R1, R2, R3)

Multi-camera networks (Camera 1, 2, 3): distribution across space
Fusion and Feedback
Feedback (active vision):
• Initialize in-node feature extraction
• Focus on what is important
• Assign tasks to cameras
The Big Picture
Network feedback (robustness, efficiency)
• Low-level vision: appearance
• High-level: activity interpretation
Interfacing Vision
• What accuracies / observation frequencies are needed?
• Task assignment to cameras
• Priorities of parameters to extract
• Process based on available contextual information
Interfacing Vision

Queries (context, persistence, behavior attributes) enter at the top of the same layered architecture:
• Description Layer 4: Actions, labels (global result G)
• Processing Layer 3: Interpretative (reasoning processor, PC)
• Description Layer 3: Poses, attributes (estimates E1, E2, E3)
• Processing Layer 2: Collaborative (fusion processor, embedded or PC)
• Description Layer 2: Low-level features (f11, f12, …, fused into F1, F2, F3)
• Processing Layer 1: Distributed (pixel processor, SIMD)
• Description Layer 1: Image / video (regions R1, R2, R3)
Syllabus
• Smart cameras
• Case Study – Human pose analysis
Posture Estimation – Review
• Discriminative → template-based (bottom-up): detect each body part as a unit
• Generative → model-based (top-down): find the best model to match the composition of all parts
• Combined: discriminative for body parts, generative for whole-body configuration

Multi-View Issues
Opportunities:
• Complementary info
• Occlusion handling
• Outlier rejection
• Distributed processing
Challenges:
• Correspondence
• Redundant data
• Misleading info in some images
• Communication (bandwidth, latency)
Posture Estimation

Bottom-up: each camera (CAM1 … CAMn) extracts image features.
Top-down: the 3D model proposes a new configuration (optimization), whose similarity to the image features is evaluated; if good enough, the 3D model is output.
Multi-View Camera Networks
• Combine bottom-up and top-down approaches
  – Powerful local image processor
  – Limited communication
  ◦ Discriminative (template-based) for body parts
  ◦ Generative (model-based) for whole-body configuration
• Vision processing options:
  – Segmentation with generic features
  – Opportunistic segmentation: detection of body parts
Pose Estimation – Top-Down Approach
• 3D model → 2D projections of edges and silhouettes
• Validate 2D projections with image observations
+ Easy to handle occlusions
– Difficult to optimize: non-convex
– Time-consuming to calculate projections and evaluate them
Pose Estimation – Bottom-Up Approach
• Look for body part candidates in images
• Assemble 2D/3D models from body part candidates
+ Distributes more computation to the images (body part candidates, local assemblage)
– Difficult to handle occlusions without knowing the relative configuration of body parts
– No direct mapping from the 2D assemblage to the 3D model
Model-based Fusion
• Motivation to build a human model:
  – A concise reference for merging information from cameras
  – Universal interface for different gesture interpretation applications
  – Allows new viewing angles in the virtual domain
  – Facilitates active vision methods:
    • Focus on what is important
    • Exchange descriptions only relevant to the model
    • Develop more detail in time
    • Initialize next operations (segmentation, motion tracking, etc.)
  – Helps address privacy concerns in various applications
http://wsnl.stanford.edu/videos/gesture/rotate2.avi
Case Study: Pose Analysis
• What is the problem we try to solve?
  – Reconstruction of a detailed dynamic body model: does it have to be real-time?
  – Detection of certain poses (gesture control, fallen, …): how critical is a missed detection, or a false alarm?
  – Extraction of long-term behavior routines: can we afford short-term mistakes, and ignore low-confidence frames?
• System constraints?
  – Real-time, frames per second
  – Local versus central processing power
  – Communication bandwidth
  – Latency
Case Study: Pose Analysis

Graphical model: body parts x1 … x10 as nodes, with pairwise potentials ψ_ij(x_i, x_j); kinematic edges impose angle and distance constraints.
Optimal solution for body model reconstruction: each camera sends silhouette and edge maps (distributed), and the optimization runs centrally.
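The pairwise kinematic potentials ψ_ij(x_i, x_j) on the body graphical model can be sketched as follows. The Gaussian form, the σ value, and the toy joint coordinates are assumptions for illustration, not the authors' exact formulation:

```python
import math

# Sketch of a pairwise kinematic potential: adjacent joints of the body
# model should sit one limb-length apart; deviation is penalized with an
# (assumed) Gaussian falloff.

def psi(xi, xj, limb_len, sigma=0.05):
    d = math.dist(xi, xj)
    return math.exp(-((d - limb_len) ** 2) / (2 * sigma ** 2))

def model_score(joints, edges):
    """Product of pairwise potentials over kinematic edges."""
    s = 1.0
    for (i, j, length) in edges:
        s *= psi(joints[i], joints[j], length)
    return s

# two joints exactly one limb-length apart score higher than a stretched pair
joints_good = {1: (0.0, 0.0, 0.0), 2: (0.3, 0.0, 0.0)}
joints_bad = {1: (0.0, 0.0, 0.0), 2: (0.6, 0.0, 0.0)}
edges = [(1, 2, 0.3)]
assert model_score(joints_good, edges) > model_score(joints_bad, edges)
```

Maximizing the product of such potentials (plus image-likelihood terms from the silhouette and edge maps) is what the central optimization carries out.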
Smart Cameras - Communication Constraints

WiCa 1.1 node: image sensor, IC3D (SIMD pixel processor), CPLD, DPRAM, SD slot; ZigBee radio (AquisGrain 2.0) with three channels.

Requirements:
• Real-time: 30 fps, latency of 10 ms
• Wireless link: 100 kbps data per channel / 30 fps ≈ 400 B/frame

Joint work with NXP Semiconductors, The Netherlands
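The ~400 B/frame budget follows directly from the channel rate; the ellipse record size used below (5 floats per ellipse) is an assumption to show what fits in that budget:

```python
# Per-frame budget on a 100 kbps ZigBee-class channel at 30 fps.
CHANNEL_BPS = 100_000
FPS = 30
budget_bytes = CHANNEL_BPS // 8 // FPS
print(budget_bytes)                     # 416 bytes/frame, i.e. ~400 B

# Assumed ellipse record: (cx, cy, major, minor, angle) as 4-byte floats.
ELLIPSE_BYTES = 5 * 4
print(budget_bytes // ELLIPSE_BYTES)    # ~20 ellipses fit per frame
```

This is why the cameras send ellipse parameters rather than masks or raw pixels.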
Case Study: Human Pose
Generic Features

Generic features for body parts:
• More processing power at cameras
• Limited communication bandwidth

Distributed processing (segmentation, local at each camera): background subtraction → rough segmentation → POEM: refine color models → watershed segmentation → ellipse fitting, using the previous color distribution.

Collaborative processing (model fitting, central): combine 3 views to get the 3D skeleton geometric configuration; score test configurations using PSO; generate and update test configurations until the stop criteria are met; then maintain the current model and update the 3D model (color/texture, motion) and the previous geometric configuration and motion, which feed back to local processing at the cameras.

• Ellipse parameters are sent to the central processor: reduced data communication load
Segmentation - Generic Features

Pipeline: images → background subtraction (background markers) and optical flow estimation → markers for the person (foreground markers) → watershed segmentation (with info from the model) → K-means clustering (color) → body part segments → ellipse fitting and attribute extraction.
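The last pipeline stage summarizes each body-part segment as an ellipse. A standard way to do this, used here as a sketch and not necessarily the authors' exact method, is via second-order image moments of the segment mask:

```python
import numpy as np

# Fit an ellipse to a binary segment mask using second-order moments:
# the centroid gives the center, the eigenvectors/eigenvalues of the
# pixel covariance give the axes and orientation.

def fit_ellipse(mask):
    """Return (cx, cy, major, minor, angle) of the moment-fitted ellipse."""
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()
    x, y = xs - cx, ys - cy
    cov = np.cov(np.stack([x, y]))
    evals, evecs = np.linalg.eigh(cov)          # ascending eigenvalues
    angle = np.arctan2(evecs[1, 1], evecs[0, 1])
    major, minor = 2 * np.sqrt(evals[1]), 2 * np.sqrt(evals[0])
    return cx, cy, major, minor, angle

# a horizontal bar should fit an ellipse elongated along x
mask = np.zeros((20, 40), dtype=np.uint8)
mask[8:12, 5:35] = 1
cx, cy, major, minor, angle = fit_ellipse(mask)
print(round(cx, 1), round(cy, 1))               # centroid near (19.5, 9.5)
```

Five numbers per segment is exactly the kind of concise description that fits the per-frame bandwidth budget.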
Distributed Processing
http://wsnl.stanford.edu/videos/gesture/ellipfull1.avi

• Initialize from model, or refresh (k-means)
• Refine color models (adaptivity)
• Enforce spatial connectivity for ambiguous pixel colors
• Concise description of segments

Same pipeline, with feedback from the 3D human body model (previous color distribution; previous geometric configuration and motion): background subtraction → rough segmentation → POEM: refine color models → watershed segmentation → ellipse fitting. Collaborative processing then scores test configurations using PSO for model fitting, combining the 3 views into the 3D skeleton geometric configuration.
Distributed Processing
Collaborative Model Fitting
• Exchange segments and attributes, combine to reconstruct a 3D model
• Subject’s information mapped and maintained in the model:
  – Geometric configuration: dimensions, lengths, angles
  – Color / texture / motion of different segments

Particle Swarm Optimization (with goodness of ellipse fits to segments) → parameters of body parts → projection on image planes
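A minimal particle swarm optimizer sketch for this fitting stage: each particle is a candidate pose vector, scored by how well its projected ellipses match the observed segments. Here the fitness is a stand-in (distance to a hypothetical target pose), and the PSO constants are common textbook defaults, not the values used in the actual system:

```python
import random

# Minimal PSO: particles move under inertia plus attraction to their own
# best position (pbest) and the swarm's best position (gbest).

def pso(fitness, dim, n_particles=20, iters=60, w=0.7, c1=1.5, c2=1.5):
    pos = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    gbest = max(pbest, key=fitness)[:]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * random.random() * (pbest[i][d] - pos[i][d])
                             + c2 * random.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if fitness(pos[i]) > fitness(pbest[i]):
                pbest[i] = pos[i][:]
                if fitness(pbest[i]) > fitness(gbest):
                    gbest = pbest[i][:]
    return gbest

random.seed(0)
target = [0.3, -0.2, 0.5]        # hypothetical "true" pose angles
fit = lambda p: -sum((a - b) ** 2 for a, b in zip(p, target))
best = pso(fit, dim=3)
print([round(v, 2) for v in best])
```

In the real system, the fitness would be the ellipse-fit score evaluated across all camera views.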
Collaborative Model Fitting

The skeleton is parameterized by joint angles (θ1, φ1 … θ4, φ4) in the world frame (x, y, z); each camera (CAM1, CAM2, CAM3) contributes its fitted ellipses.
• Red: projection of skeleton on image plane
• Green: region of arms grown from red lines
• Blue: ellipses from segmentation
• Score = Area(ellipses falling within green polygons) / Area(green polygons)
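With rasterized masks, the score above is a simple area ratio. The shapes below are toy stand-ins for the green polygons and blue ellipses:

```python
import numpy as np

# Fit score: fraction of the predicted region ("green polygon") covered
# by observed segmentation ellipses ("blue"), computed on boolean masks.

def fit_score(region_mask, ellipse_mask):
    inter = np.logical_and(region_mask, ellipse_mask).sum()
    return inter / region_mask.sum()

region = np.zeros((50, 50), bool)
region[10:30, 10:30] = True          # predicted arm region: 400 px
ellipse = np.zeros((50, 50), bool)
ellipse[10:30, 10:20] = True         # observed segment covers half of it
print(fit_score(region, ellipse))    # 0.5
```

A pose hypothesis whose projected regions are well covered by observed ellipses in all views gets a high score.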
Model-based Pose – Generic Features
Frame 105
Model-based Pose – Generic Features
Opportunistic Features
• Generic features are used to detect body parts
  – Can we improve using more specific features for each body part?
Case Study: Human Pose
Opportunistic Segmentation / Opportunistic Reconstruction
• Different features for body parts
• Limited communication bandwidth

Extracted features: head candidates, hands candidates, torso width, line segments delineating the upper body, skeletons of thighs and calves.
http://wsnl.stanford.edu/videos/gesture/features3.avi
Collaborative Processing
• Multi-camera validation
  – Outlier rejection
  – Occlusion handling
• Model construction
Example: Camera 3 will not participate in hand modeling
Collaborative Model Construction
• Cameras 1, 2, 3: head candidates → head position
• Cameras 1, 2, 3: torso width → torso orientation
• Cameras 1, 2, 3: hands candidates → hands positions → occlusion inference for both arms
• Cameras 1, 2, 3: distance maps from line segments delineating the upper body → arms angle configurations
• Cameras 1, 2, 3: orientation of thighs and calves → legs angle configurations
Collaborative Model Construction
[Figure: torso angle α estimated from projected torso widths w1, w2 observed across camera views]
Collaborative Model Construction
http://wsnl.stanford.edu/videos/gesture/combine1.avi
Collaborative Model Construction
http://wsnl.stanford.edu/videos/gesture/rotate2.avi
http://wsnl.stanford.edu/videos/gesture/jogging1.avi
Collaborative Model Construction
http://wsnl.stanford.edu/videos/gesture/pang.avi
Communication Load
• Data record per frame
• Each pixel requires many processing passes → line memory limitation
Embedded Implementation
• Further hardware constraints:
  – Multiple image passes are required
  – Line memory available on WiCa allows ~1 pass per full frame
  – This imposes a severe limit on the algorithm
• Process a small subset of features
Real-Time WiCa Implementation
30 Frames per Second
http://wsnl.stanford.edu/videos/gesture/realtime1.avi
Ping Pong
Ping Pong
ICDSC (International Conference on Distributed Smart Cameras), Sept 2007, Vienna, Austria
http://wsnl.stanford.edu/videos/gesture/realtime2.avi
Spatiotemporal Smoothing
Two-camera feature fusion and temporal smoothing (vs. no smoothing)
http://wsnl.stanford.edu/videos/ballgame/comparison32.avi
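The temporal-smoothing step can be sketched as an exponential moving average over per-frame joint estimates, suppressing jitter from low-confidence frames. The filter form, the α value, and the toy angle sequence are assumptions; the actual smoother used may differ:

```python
# Exponential moving average over a per-frame sequence of joint estimates.
# alpha controls how much each new (possibly noisy) frame is trusted.

def smooth(estimates, alpha=0.4):
    out = [estimates[0]]
    for x in estimates[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out

noisy = [10.0, 40.0, 12.0, 38.0, 11.0]      # jittery elbow-angle estimates
print([round(v, 1) for v in smooth(noisy)]) # → [10.0, 22.0, 18.0, 26.0, 20.0]
```

The smoothed sequence varies over a much smaller range than the raw estimates, at the cost of some lag.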
WiCa Implementation
Case Study: Pose Analysis
• May only need a high-level posture state in some applications – e.g. assisted living
Publications http://wsnl.stanford.edu