Distributed Vision Processing in Smart Camera Networks ICASSP 2009 Taipei April 20, 2009

Hamid Aghajan Stanford University, USA aghajan AT stanford.edu stanford edu

Part 2/4: Smart Cameras http://wsnl.stanford.edu/ICASSP09/

Our Lab Students (credits for the results presented): Chen Wu, Jingyu Cui, Amir Khalili, Nan Hu, Tianshi Gao, Stephan Hengstler, Huang Lee, Itai Katz, Tommi Maataa (Philips, TU Eindhoven, Netherlands), Linda Tessens, Marleen Morbee (Ghent U., Belgium)

Y2E2 iRoom

EE Dept. – WSNL: Wireless Sensor Networks Lab

WSNL - Stanford

Distributed Vision Processing


Syllabus
• Smart cameras
• Case Study – Human pose analysis


Technology Cross-Roads
Smart Camera Networks lie at the intersection of:
• Sensor Networks: wireless communication, networking
• Image Sensors: rich information; low power, low cost
• Signal Processing: embedded processing, collaborative methods
• Vision Processing: scene understanding, human gesture
Distributed Vision Processing: architecture? algorithms? applications?
Potential impact on design methodologies in each discipline


Vision
• Rich content – window to the world
• Unobtrusive interface – non-user-wearable (e.g., assisted living)
• Context-based processing – many applications: versatile high-level interfacing with common vision blocks (gaming, retail ads, avatars; face profile: remote gaming)


Multi-Camera Vision
• Added coverage – areas of interest; occlusion handling (smart homes: user behavior modeling)
• 3D reconstruction (tele-presence)
• Added confidence – event interpretation
• Role selection – large-area view: location; close-up view: pose (assisted living)


Smart Environments
• Observe → interpret → build up behavior models → react
• Quantitative knowledge + qualitative assessment (sensing → processing → context → behavior model)
• Responsive to events
– Adapt services
– Employ additional sensors
– Send alerts
• Interactive
– Based on gesture, location, region of interest of user
• Self-configure, discover the interests, adapt to user
Vision can play an enabling role


Vision - Potentials
• Robotics: assistive technologies, response systems, companion robots
• Surveillance: event detection, identification / tracking, large-scale deployments
• Multimedia: tele-presence, virtual reality, gaming over network
• Human Computer Interaction: immersive virtual reality, non-restrictive interface, occupancy sensing
Enabling technologies for a vision and multi-modal sensor network:
o Vision processing
o Wireless sensor network
o Embedded computing
o Signal processing


Smart Camera Networks
Rich design space driven by application requirements:
• Camera node: energy consumption? data aggregation?
• Vision system: mono or stereo? resolution? field-of-view?
• Data exchange: type of data? traffic load?
• Distributed observations – task: tracking? counting? camera orientation? placement? which cameras sense? network topology?
• Vision algorithm: local vs. central processing
• Application requirements: accuracy? coverage? network lifetime?

Classical Multi-Camera Application: Surveillance


Classical Multi-Camera Application: Surveillance

Network Intelligence | Network Objective  | Required Bandwidth
high                 | Event description  | low (~10 KB/s)
medium               | Object description | ~1 MB/s
medium               | Object detection   | moving scenes only
~none                | Raw video stream   | high (~10 MB/s)
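The raw-video row can be sanity-checked with back-of-envelope arithmetic; the resolution and pixel format below are assumptions (CIF, uncompressed YUV 4:2:2), chosen only to show the order of magnitude:

```python
# Back-of-envelope bandwidth for an uncompressed video stream.
# Assumed: CIF resolution, YUV 4:2:2 (2 bytes/pixel), 30 fps.
def raw_video_rate(width=352, height=288, bytes_per_pixel=2, fps=30):
    """Uncompressed stream rate in bytes per second."""
    return width * height * bytes_per_pixel * fps

rate = raw_video_rate()
print(f"raw CIF stream: {rate / 1e6:.1f} MB/s")  # ~6.1 MB/s, same order as the ~10 MB/s row
```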


Smart Cameras
• Image sensor: CIF (352x288), VGA (640x480)
• Radio: data rates kB/s to MB/s; 802.11 and 802.15.4
• Processing unit: 32-bit RISC, 20-200 MHz
• Storage: SRAM & Flash (MBs)
• Energy source
Resource constraints: computation, energy, communication bandwidth


Big Picture
• Process locally … Fuse globally
– Move away from streaming raw video
– "Smart" cameras: local processing power

[Spectrum of local processing: Low → Intermediate → High]


Big Picture
• Process locally … Fuse globally
– Algorithm design dependent on system and application:
• Network's scale and size
• Available bandwidth
• Processing power (embedded vs. central)
• Application requirements (accuracy, latency, data fusion level)
Considerations linking the multi-camera hardware & network with the vision algorithms:
• Local processing and centralized processing
• Communication bandwidth
• Latency of real-time results
• Resolution in image view and time
• Temporal alignment (synchronization)
• Camera view overlaps, data redundancies
• Data exchange methods


Big Picture
• Process locally … Fuse globally
– Different levels of local processing:
• Extract generic features (e.g. silhouette, edges) – low order-of-magnitude data reduction from raw video
• Report mid-level objects (e.g. segment area of interest) – high order-of-magnitude data reduction
• Decision-level processing (e.g. classify an action) – small number of information bytes
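The three levels above can be made concrete with illustrative payload sizes; the frame format and the counts (10 ellipses, 5 floats each, a 2-byte decision) are assumptions for scale, not the deck's numbers:

```python
# Illustrative per-frame payload at each local-processing level (assumed CIF frames).
W, H = 352, 288
raw_frame      = W * H * 2          # raw YUV 4:2:2 frame
silhouette     = W * H // 8         # 1-bit foreground mask, bit-packed
ellipse_params = 10 * 5 * 4         # e.g. 10 ellipses x 5 float32 (cx, cy, a, b, angle)
decision       = 2                  # e.g. a pose-class label plus a confidence byte

for name, n in [("raw", raw_frame), ("silhouette", silhouette),
                ("mid-level", ellipse_params), ("decision", decision)]:
    print(f"{name}: {n} bytes/frame")
```

Each step drops the payload by orders of magnitude, which is the whole argument for in-node processing.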


The Issue of Privacy
– Cameras:
• Offer a non-wearable sensing option (unobtrusive)
• However, are often regarded as a rather invasive sensing modality
– Privacy concerns MUST be addressed for home applications: added motivation for "Smart" Cameras


The Issue of Privacy
• Smart cameras + a multi-layered privacy handling approach:
– Turn video into text in the normal state (as well as at alerts)
– Map the person's gesture onto: silhouette, avatar
• Alert mechanism:
• Implement a multi-level alert system (green – yellow – red)
• Activate voice communication first to check status
• Image query only possible by an authorized nurse / family
• Raw video saved locally for post-event analysis / diagnosis
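One way to read this policy is as a small state machine; the sketch below is hypothetical (the escalation rules and per-level data releases are my assumptions, not the authors' specification):

```python
# Hypothetical sketch of a green/yellow/red alert policy with per-level data release.
# Escalation rules and released data types are illustrative assumptions.
LEVELS = ["green", "yellow", "red"]

def escalate(level, event):
    """Move up one level on an anomaly; reset to green on all-clear."""
    if event == "all_clear":
        return "green"
    if event == "anomaly":
        return LEVELS[min(LEVELS.index(level) + 1, 2)]
    return level

def allowed_data(level, viewer_authorized):
    """Text descriptions in normal state, voice at yellow, images only at red and only for authorized viewers."""
    if level == "green":
        return "text"
    if level == "yellow":
        return "voice"
    return "image" if viewer_authorized else "voice"

state = escalate(escalate("green", "anomaly"), "anomaly")
print(state, allowed_data(state, viewer_authorized=True))  # red image
```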


Distributed Processing

[Spectrum of local processing: Low → Intermediate → High]



Distributed Processing

Distribution across space

WiCa: NXP Semiconductor Research


Layered Processing
Distribution across processors; multi-camera networks provide distribution across space.
• Processing Layer 1: Distributed – pixel processor (SIMD). Description Layer 1: image / video
• Processing Layer 2: Collaborative – fusion processor (embedded or PC). Description Layer 2: low-level features
• Processing Layer 3: Interpretative – reasoning processor (PC). Description Layer 3: poses, attributes
• Description Layer 4: actions, labels


Fusion and Feedback

Feedback:
• Initialize in-node feature extraction
• Focus on what is important
• Assign tasks to cameras

Active vision


The Big Picture

Network Feedback (robustness, efficiency)
• Low-level vision: appearance
• High-level: activity interpretation


Interfacing Vision

• What accuracies / observation frequencies are needed?
• Task assignment to cameras
• Priorities of parameters to extract
• Process based on available contextual information


Interfacing Vision

Queries (context, persistence, behavior attributes) enter at the top of the layered stack:
• Processing Layer 1: Distributed – pixel processor (SIMD). Description Layer 1: image / video
• Processing Layer 2: Collaborative – fusion processor (embedded or PC). Description Layer 2: low-level features
• Processing Layer 3: Interpretative – reasoning processor (PC). Description Layer 3: poses, attributes
• Description Layer 4: actions, labels


Syllabus
• Smart cameras
• Case Study – Human pose analysis


Posture Estimation – Review
• Discriminative -> template-based
• Generative -> model-based
◦ Bottom-up
◦ Top-down
• Combined
◦ Discriminative for body parts: detect each body part as a unit
◦ Generative for the whole-body configuration: find the best model to match the composition of all parts

Multi-View Issues
Opportunities:
• Complementary info
• Occlusion handling
• Outlier rejection
• Distributed processing
Challenges:
• Correspondence
• Redundant data
• Misleading info in some images
• Communication (bandwidth, latency)


Posture Estimation
Bottom-up: CAM1, CAM2, …, CAMn each extract image features.
Top-down: a 3D model drives new-configuration optimization; each candidate configuration is evaluated for similarity against the image features until it is good enough, then the 3D model is output.


Multi-View Camera Networks
• Combine bottom-up and top-down approaches
– Powerful local image processor
– Limited communication
◦ Discriminative (template-based) for body parts
◦ Generative (model-based) for the whole-body configuration
• Vision processing options:
– Segmentation with generic features
– Opportunistic segmentation – detection of body parts


Pose Estimation – Top-Down Approach
• 3D model -> 2D projections of edges and silhouettes
• Validate 2D projections with image observations
+ Easy to handle occlusions
- Difficult to optimize: non-convex
- Time-consuming to calculate the projections and evaluate them


Pose Estimation – Bottom-Up Approach
• Look for body part candidates in images
• Assemble 2D/3D models from the body part candidates
+ Distributes more computation into the images (body part candidates, local assemblage)
- Difficult to handle occlusions without knowing the relative configuration of body parts
- Not direct to map from the 2D assemblage to the 3D model


Model-based Fusion
• Motivation to build a human model:
◦ A concise reference for merging information from cameras
◦ A universal interface for different gesture interpretation applications
◦ Allows new viewing angles in the virtual domain
◦ Facilitates active vision methods: focus on what is important; exchange only descriptions relevant to the model; develop more detail in time; initialize next operations (segmentation, motion tracking, etc.)
◦ Helps address privacy concerns in various applications
http://wsnl.stanford.edu/videos/gesture/rotate2.avi


Case Study: Pose Analysis
• What is the problem we try to solve?
◦ Reconstruction of a detailed dynamic body model – does it have to be real-time?
◦ Detection of certain poses (gesture control, fallen, …) – how critical is a missed detection? Or a false alarm?
◦ Extraction of long-term behavior routines – can we afford short-term mistakes? Can we ignore low-confidence frames?
• System constraints?
◦ Real-time, frames per second
◦ Local versus central processing power
◦ Communication bandwidth
◦ Latency


Case Study: Pose Analysis
Graphical model: nodes x_1, …, x_10 represent body parts; kinematic edges carry pairwise potentials ψ_ij^K(x_i, x_j) encoding angle and distance constraints.
Each camera sends its silhouette and edge maps (distributed); the optimization that reconstructs the body model runs centrally.


Smart Cameras - Communication Constraints
WiCa 1.1 node components: image sensor, IC3D (SIMD pixel processor), CPLD, DPRAM, SD slot; ZigBee radio (AquisGrain 2.0) with three ZigBee channels.
Requirements:
• Real-time: 30 fps, latency of 10 ms
• Wireless link: 100 kbps data per channel / 30 fps ≈ 400 B/frame

Joint work with NXP Semiconductors, The Netherlands


Case Study: Human Pose


Generic Features
Generic features for body parts:
• More processing power at the cameras
• Limited communication bandwidth
Distributed processing (segmentation) – color segmentation and ellipse fitting in local processing: background subtraction → rough segmentation → POEM: refine color models → watershed segmentation → ellipse fitting, using the previous color distribution and the previous geometric configuration and motion from the 3D human body model.
Collaborative processing (PSO model fitting): combine the 3 views to get the 3D skeleton geometric configuration; generate test configurations, score each test configuration using local processing from the other cameras, and update it; check the stop criteria; maintain the current model and update the 3D model (color/texture, motion).
• Ellipse parameters are sent to the central processor – reduced data communication load
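The in-node ellipse fit can be sketched with image moments; this is a common lightweight choice for embedded segmentation, not necessarily the exact method used on the WiCa, and the test blob below is illustrative:

```python
import numpy as np

# Sketch: fit an ellipse to a binary segment via second-order image moments,
# yielding the compact (cx, cy, a, b, angle) record a camera could transmit.
def fit_ellipse(mask):
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()
    cov = np.cov(np.stack([xs - cx, ys - cy]))
    evals, evecs = np.linalg.eigh(cov)             # eigenvalues ascending
    a, b = 2 * np.sqrt(evals[::-1])                # semi-axes (2-sigma extent)
    angle = np.arctan2(evecs[1, 1], evecs[0, 1])   # major-axis orientation
    return cx, cy, a, b, angle

mask = np.zeros((40, 40), dtype=bool)
mask[18:22, 5:35] = True                           # a wide horizontal blob
cx, cy, a, b, angle = fit_ellipse(mask)
print(round(cx, 1), round(cy, 1))                  # 19.5 19.5
```

Five floats per segment (20 bytes) comfortably fit the per-frame link budget.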


Segmentation - Generic Features
images → background subtraction → foreground; markers for the background and for the person
images → optical flow estimation → markers
K-means clustering (color) → watershed markers
watershed segmentation (markers + info from the model) → body part segments → ellipse fitting and attribute extraction
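A minimal stand-in for the background-subtraction stage is plain frame differencing; the threshold and static background model are illustrative simplifications of whatever the deck's pipeline actually uses:

```python
import numpy as np

# Minimal background subtraction by frame differencing against a stored background.
# Threshold and background model are illustrative assumptions.
def foreground_mask(frame, background, thresh=25):
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > thresh

bg = np.full((4, 4), 100, dtype=np.uint8)
frame = bg.copy()
frame[1:3, 1:3] = 200                       # a bright "person" enters the scene
print(foreground_mask(frame, bg).sum())     # 4 foreground pixels
```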


Distributed Processing
http://wsnl.stanford.edu/videos/gesture/ellipfull1.avi
• Initialize from the model, or refresh (k-means)
• Refine color models (adaptivity)
• Enforce spatial connectivity for ambiguous pixel colors
• Concise description of segments
Distributed processing (segmentation): background subtraction → rough segmentation → POEM: refine color models → watershed segmentation → ellipse fitting, with feedback from the 3D human body model (previous color distribution; previous geometric configuration and motion).
Collaborative processing (PSO model fitting): combine the 3 views to get the 3D skeleton geometric configuration; generate, score, and update test configurations using local processing from the other cameras; check the stop criteria; update the 3D model (color/texture, motion).


Distributed Processing


Collaborative Model Fitting ¾Exchange segments and attributes, combine to reconstruct a 3D model ¾Subject’s information mapped and maintained in the model: d l • •

Geometric configuration: dimensions, lengths, angles Color / texture / motion of different segments

Particle Swarm Optimization (with goodness of ellipse fits to segments)


Projection on image planes


Parameters of body parts
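The PSO fitting step can be sketched generically; this is textbook particle swarm optimization with illustrative parameters, not the authors' exact variant, maximizing whatever pose score (e.g. goodness of ellipse fits) is plugged in:

```python
import random

# Generic particle-swarm-optimization sketch: maximize score() over a pose vector.
# Swarm size, inertia w, and acceleration constants c1/c2 are illustrative.
def pso(score, dim, n=20, iters=50, lo=-1.0, hi=1.0, w=0.7, c1=1.5, c2=1.5):
    rng = random.Random(0)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
    vel = [[0.0] * dim for _ in range(n)]
    pbest = [p[:] for p in pos]                      # per-particle best positions
    pscore = [score(p) for p in pos]
    g = max(range(n), key=lambda i: pscore[i])
    gbest, gscore = pbest[g][:], pscore[g]           # swarm-wide best
    for _ in range(iters):
        for i in range(n):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            s = score(pos[i])
            if s > pscore[i]:
                pscore[i], pbest[i] = s, pos[i][:]
                if s > gscore:
                    gscore, gbest = s, pos[i][:]
    return gbest, gscore

# Toy objective: maximum 0 at the origin stands in for a pose-fitness score.
best, val = pso(lambda p: -sum(x * x for x in p), dim=3)
print(val)  # close to 0
```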


Collaborative Model Fitting
[Figure: 3D skeleton in world coordinates (x, y, z, origin O), parameterized by joint angles θ1…θ4 and φ1…φ4; its ellipses are observed by CAM1, CAM2, and CAM3.]

• Red: projection of the skeleton on the image plane
• Green: region of arms grown from the red lines
• Blue: ellipses from segmentation
• Score = Area(ellipses falling within green polygons) / Area(green polygons)
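The score is a straightforward area ratio over binary masks; the toy masks below are illustrative:

```python
import numpy as np

# The slide's fitness: fraction of the predicted arm region (green polygon,
# here a mask) that is covered by segmented ellipses (blue, also a mask).
def score(region_mask, ellipse_mask):
    region = region_mask.sum()
    if region == 0:
        return 0.0
    return float(np.logical_and(region_mask, ellipse_mask).sum()) / region

region = np.zeros((10, 10), bool); region[2:8, 2:8] = True    # 36-pixel arm region
ellipse = np.zeros((10, 10), bool); ellipse[2:8, 2:5] = True  # ellipses cover 18 of them
print(score(region, ellipse))  # 0.5
```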


Model-based Pose – Generic Features (Frame 105)


Model-based Pose – Generic Features


Opportunistic Features
• Generic features are used to detect body parts
– Can we improve by using more specific features for each body part?


Case Study: Human Pose


Opportunistic Segmentation / Opportunistic Reconstruction
• Different features for different body parts
• Limited communication bandwidth
Extracted features: head candidates, hand candidates, torso width, line segments delineating the upper body, skeletons of the thighs and calves.
http://wsnl.stanford.edu/videos/gesture/features3.avi


Collaborative Processing
• Multi-camera validation
– Outlier rejection
– Occlusion handling
• Model construction (e.g., camera 3 will not participate in hand modeling)
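Multi-camera outlier rejection can be sketched with a robust consensus over the candidate estimates reported by the cameras; the median-absolute-deviation rule and its threshold are illustrative, not the deck's actual test:

```python
import statistics

# Hedged sketch: drop per-camera candidates far from the robust consensus (median)
# before fusing. The MAD-based threshold k is an illustrative choice.
def reject_outliers(values, k=3.0):
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1e-9
    return [v for v in values if abs(v - med) <= k * mad]

# Three cameras agree; one reports a misleading candidate.
print(reject_outliers([1.0, 1.1, 0.9, 5.0]))  # [1.0, 1.1, 0.9]
```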


Collaborative Model Construction
• Cameras 1, 2, 3: head candidates → head position
• Cameras 1, 2, 3: torso width → torso orientation
• Cameras 1, 2, 3: hand candidates → hand positions, with occlusion inference for both arms
• Cameras 1, 2, 3: distance maps from the line segments delineating the upper body → arm angle configurations
• Cameras 1, 2, 3: orientation of the thighs and calves → leg angle configurations


Collaborative Model Construction
[Figure: torso angle α estimated from the observed torso widths w1, w2 across cameras 1–4.]


Collaborative Model Construction

http://wsnl.stanford.edu/videos/gesture/combine1.avi


Collaborative Model Construction

http://wsnl.stanford.edu/videos/gesture/rotate2.avi


http://wsnl.stanford.edu/videos/gesture/jogging1.avi


Collaborative Model Construction

http://wsnl.stanford.edu/videos/gesture/pang.avi


Communication Load
• Data record per frame
• Each pixel requires many processing passes
• Line memory limitation

Embedded Implementation
• Further hardware constraints:
– Multiple image passes are required
– The line memory available on the WiCa allows ~1 pass per full frame
– This imposes a severe limit on the algorithm
• Process a small subset of features


Real-Time WiCa Implementation


30 Frames per Second

http://wsnl.stanford.edu/videos/gesture/realtime1.avi


Ping Pong


Ping Pong

ICDSC (International Conference on Distributed Smart Cameras) Sept 2007, Vienna, Austria

http://wsnl.stanford.edu/videos/gesture/realtime2.avi


Spatiotemporal Smoothing

Two-camera feature fusion and temporal smoothing, compared against no smoothing:
http://wsnl.stanford.edu/videos/ballgame/comparison32.avi
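A simple stand-in for the temporal smoothing step is an exponential moving average over per-frame estimates; the actual spatiotemporal filter is not specified on the slide, and the smoothing factor below is illustrative:

```python
# Exponential moving average over a per-frame feature estimate (e.g. a joint angle).
# alpha trades responsiveness against jitter; the value is illustrative.
def smooth(samples, alpha=0.3):
    out, state = [], None
    for x in samples:
        state = x if state is None else alpha * x + (1 - alpha) * state
        out.append(state)
    return out

print(smooth([0, 10, 10, 10]))  # a step input converges gradually toward 10
```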


WiCa Implementation


Case Study: Pose Analysis • May only need high-level posture state in some applications – E.g. assisted living


Publications http://wsnl.stanford.edu

