3D Computer Vision
Federico Tombari, Ph.D – [email protected]
University of Bologna

Summary

- Part 1: Sensors and Representations
- Part 2: In-depth stereo vision
- Part 3: Tasks and applications

Part 1 – Sensors and Representations

- 3D sensors
  - Active
  - Passive
- Data representations
  - Unorganized
  - Organized
  - Range maps
- Transformations

[Diagram: 3D Computer Vision – Acquisition, Reconstruction, Recognition]

3D sensors

Goal: create a point cloud of (samples from) the surface of an object/scene.

- Collection of distance measurements from the sensor to the surface
- Distances are then transformed into 3D coordinates (x, y, z) by means of calibration information
- Usually, 3D sensors acquire only a single view of the object (2.5D data)
- Some sensors also acquire information concerning color or light intensity (RGB-D data)



Contact sensors

Active sensors:

- LIDAR, rangefinders
- Time-of-Flight cameras
- Laser triangulation
- Structured light
- Medical imaging (CT, MRI)

Passive sensors:

- Stereo
- Structure-from-motion
- Shape-from-shading, shape-from-silhouette, shape-from-defocus, ...

LIDAR

LIDAR: Light Detection And Ranging. A light pulse is emitted from the sensor, the round-trip time t is measured, and the distance is computed as d = c·t / 2. The longer the time, the further away the point is from the sensor.

- Usually visible or near-infrared light is used
- Arrays of emitters are employed together to yield a set of simultaneous range measurements (a 3D slice)
- Slices can be swept using a motor to yield a slice array

Pros:
- Long range (hundreds of meters / km)
- Works indoor/outdoor
- Real-time (e.g. 100 Hz slices)

Cons:
- Accuracy is an issue due to the speed of light (3·10^8 m/s -> 1 mm every ~3.3 ps)
- Cost
- Color/intensity information is usually not provided

Example sensors: Velodyne, SICK LMS500
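As an illustration of the time-of-flight relation above, here is a minimal NumPy sketch (not part of the original slides) that converts measured round-trip times into ranges with d = c·t/2; the example values are illustrative only.

import numpy as np

C = 3.0e8  # speed of light [m/s]

def tof_to_range(round_trip_times_s):
    """Convert round-trip pulse times [s] into distances [m] using d = c*t/2."""
    t = np.asarray(round_trip_times_s, dtype=np.float64)
    return C * t / 2.0

# Example: a 1 ns timing error already corresponds to ~15 cm of range error,
# which is why picosecond-level timing is needed for millimeter accuracy.
print(tof_to_range([1e-9, 6.67e-9, 1e-6]))  # ~[0.15, 1.0, 150.0] m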

Time-of-Flight camera

A particular LIDAR device yielding a full 2D array of range measurements. A light pulse is emitted through an infrared illuminator; then either:

- Phase-shift measurement of the returning light pulse at each pixel (Photonic Mixer Devices, Canesta Vision (now Microsoft), SwissRanger)
- «Range-gated imager»: each pixel has a shutter that starts closing when the light pulse is emitted; the less light received, the further away the point is from the sensor (ZCam by 3DV Systems, now Microsoft)

Pros:
- No motor needs to be employed
- Real-time (30-100 fps)
- Cost effective

Cons:
- Low resolution
- Dark, non-reflective objects are hard to acquire
- Hardly works under sunlight (not usable outdoors)
- Multiple reflections can yield false measurements
- Interference between different sensors

Example sensor: MESA SwissRanger 4000

Laser triangulation

Laser + camera system:
- A laser dot or stripe is emitted onto the scene
- The camera locates the dot/stripe
- The distance is determined via triangulation of the position of the dot (emitting angle, receiving angle and baseline need to be known)

[Figure: laser-camera triangulation geometry]

Pros:
- Accuracy (tens of micrometers)

Cons:
- Limited range
- Slow scanning time (often requires a static scene)
- Cost ($$)
- No color information; needs to be paired with a color camera

Example sensor: Minolta Vivid 9i

Structured light

Camera + projector system, similar to laser triangulation, where the laser stripe is replaced by a stripe projected by a light projector. Projecting a set of stripes (a 2D pattern) allows for multiple sampling, hence a full 2D range image can be acquired at once (but different fringes may be confused with each other). Using infrared projection and two cameras (one in the infrared, one in the visible band) yields accurate RGB-D data.

Pros:
- Relatively cheap
- Sub-millimeter accuracy (down to tens of micrometers)
- Real-time

Cons:
- Limited range
- Hardly usable outdoors or in the presence of other light sources
- Highly dependent on the object surface characteristics (e.g. reflective, translucent, ...)
- Interference between different sensors

Example sensor: Microsoft Kinect

Stereo vision

Two (or more) cameras. The cameras have to be synchronized, especially in the presence of non-static scenes. Depth is retrieved via triangulation of the point projections on the two views.

- Correspondence problem!

Pros:
- Cheap
- Passive
- Real-time
- Color/intensity can be directly associated to range data (RGB-D)

Cons:
- Low accuracy (tends to fail on low-textured regions, repetitive patterns and depth borders)
- A projector can help by adding texture to the scene to improve accuracy on low-textured regions

[Figure: two-view triangulation geometry]

Example sensor: Videre Design

Spacetime stereo

Stereo using spatial and temporal information [Davis05][Zhang03].

- To gather information, the appearance must change over time (but not the geometry!)
- A random pattern is projected to augment each frame with a different texture (no structured light, no interference)
- More accurate than standard stereo, but depth must be constant in time (static objects)
- Matching is performed over a joint spatio-temporal window

Structure-from-motion

Monocular system: instead of extending the views spatially, they are extended temporally.

- Requires either the surface or the camera to move
- Based on tracking and matching features

Pros:
- Cheap and simple hardware (only one camera needed)
- RGB-D data
- Solving SfM also yields the camera pose at each time instant

Cons:
- Highly dependent on the available motion that the object/camera can undergo
- Sparse depth information

3D Data Representations

3D data may be represented in different formats:

- Unorganized
  - Point cloud: a set of vertices
  - Polygon mesh: a set of vertices and their connections (topology)
- Organized
  - (Binary) voxelized cloud: a 3D regular grid of (binary) density values
  - Range image: a 2D regular grid (an image) of 3D coordinates

3D data may be:

- Pure 3D: represents only the geometry of the scene
- RGB-D: stores the geometry of the scene as well as the intensity of the 3 color channels

3D data may represent:

- A full 360° (3D) view of an object/scene
- A 2.5D view of an object/scene

Point cloud

The point cloud is the most common 3D data representation in computer vision. It is just a collection of 3D coordinates:

- Unorganized: hence, nearest neighbor searches are costly
  - An index structure must be built (e.g. kd-tree or octree)
- No topology
  - Inner and outer parts of the surface are hard to discriminate
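To make the indexing point concrete, here is a minimal sketch (not part of the original slides, assuming SciPy is available) of nearest neighbor search on an unorganized point cloud via a kd-tree.

import numpy as np
from scipy.spatial import cKDTree

cloud = np.random.rand(100000, 3)     # toy unorganized point cloud (N x 3)
tree = cKDTree(cloud)                 # build the spatial index once, O(N log N)

query = np.array([[0.5, 0.5, 0.5]])
dist, idx = tree.query(query, k=5)    # 5 nearest neighbors, ~O(log N) per query
print(idx, dist)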

Polygon mesh

The polygon mesh is the most common 3D data representation in computer graphics. It is a collection of 3D vertices and of faces connecting them:

- Unorganized: hence, nearest neighbor searches are costly
  - An index structure must be built (e.g. kd-tree or octree)
- Has topology
  - Essential for rendering
  - Provides surface orientation

[Figure: polygon mesh and its rendering]

Range image

The range image is a useful representation for efficient 3D data processing. The term is a bit ambiguous, as it is used to denote either:

- a single-channel image whose pixels encode the distance of the scene point from the sensor
- a three-channel image whose pixels encode the coordinates of the scene points

It is an organized representation:

- Nearest neighbor searches can be carried out efficiently by exploiting the image lattice
- No topology, but it is easy to create it from the lattice
- Only 2.5D views: it also provides the sensor position (the point (0,0,0))

Image from Stuttgart Range Image Database (http://range.informatik.uni-stuttgart.de/)
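As an illustration of the representation, a minimal sketch (not part of the original slides) of turning a single-channel range image into an organized cloud of (x, y, z) points by back-projection through assumed pinhole intrinsics fu, fv, u0, v0 (placeholder values below):

import numpy as np

def range_image_to_points(depth, fu, fv, u0, v0):
    """depth: HxW array of z values [m]; returns an HxWx3 array of (x, y, z) coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - u0) * depth / fu
    y = (v - v0) * depth / fv
    return np.dstack((x, y, depth))

# Example: the organized HxW lattice is preserved, so image-like neighborhood
# queries remain cheap on the resulting cloud.
pts = range_image_to_points(np.full((480, 640), 2.0), fu=525.0, fv=525.0, u0=319.5, v0=239.5)
print(pts.shape)  # (480, 640, 3)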

From range images to meshes

Voxelized clouds

Voxelization is a useful representation for efficient 3D data processing. It represents surfaces with a regular grid of voxels (volumetric pixels).

It is an organized representation:

- 3D coordinates are implicitly defined by the index in the grid
- Nearest neighbor searches can be carried out efficiently by exploiting the regular structure
- No topology, but it is easy to create it from the grid
- Suitable also for full 3D views
- It is also the output of some sensors (e.g. metal detectors, body scanners)
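A minimal sketch (not part of the original slides) of binary voxelization: a point cloud is quantized onto a regular grid with a user-chosen voxel size (the "step" of the conversion diagram below); names and values are illustrative.

import numpy as np

def voxelize(points, voxel_size):
    """points: Nx3 array; returns a binary 3D occupancy grid and its origin."""
    origin = points.min(axis=0)
    idx = np.floor((points - origin) / voxel_size).astype(int)   # grid index of each point
    dims = idx.max(axis=0) + 1
    grid = np.zeros(dims, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True                 # mark occupied voxels
    return grid, origin

grid, origin = voxelize(np.random.rand(1000, 3), voxel_size=0.05)
print(grid.shape, grid.sum())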

Changing representation

[Diagram: conversions between Point Cloud, Range Image, Voxelized Cloud and Polygon Mesh. Edge labels: points from pixel data; points from voxel coordinates; surface reconstruction (Marching Cubes, Poisson, ...); quantize (choice of step); discard faces; triangulation on the lattice; triangulation on the grid; need a vantage point.]

Bibliography

[Davis05] J. Davis, D. Nehab, R. Ramamoorthi, S. Rusinkiewicz, "Spacetime stereo: a unifying framework for depth from triangulation", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27(2), 2005.
[Zhang03] L. Zhang, B. Curless, S. Seitz, "Spacetime stereo: shape recovery for dynamic scenes", Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2003.

Part 2 – In-depth stereo vision

- Why stereo?
- Epipolar geometry
- Algorithms
- Real-time stereo

The single-view case

From a single viewpoint, an observer cannot univocally determine the distance between himself and the visible points of the real world. In this case human vision relies on «monocular cues» (parallelism, similarity, relative movement, ...) to estimate distances. This does not always work – there are lots of ambiguities (see picture).

Using multiple viewpoints

On the other hand, by using multiple views (e.g. two) the ambiguities can be resolved, so that it is possible to reconstruct the 3D geometry of the observed scene. Problem solved? Not so easy: there is still the correspondence problem!

[Figure: with a single view, the same image point pL is compatible with different 3D points P, P' at depths d, d'; with two views, the pair of projections (pL, pR) disambiguates them]

Ideal stereo setup

Ideal stereo setup:
- Optical axes are parallel
- Same focal length
- Image planes are coplanar
- The two Reference Frames have parallel axes

Under these conditions, the two Reference Frames only differ by a translation along the x direction equal to b. b stands for baseline and it is a characteristic parameter of a stereo setup.

[Figure: two pinhole cameras with optical centers OL, OR separated by the baseline b, both with focal length f, observing a point P projected onto pL = (uL, vL) and pR = (uR, vR)]

Ideal stereo setup (2)

Under the ideal setup, the projections of P = (x, y, z) on the two image planes are:

vL = vR = y·f / z
uL = xL·f / z
uR = xR·f / z

The disparity is defined as d = uL − uR, and since xL − xR = b:

d = (f / z)·(xL − xR) = b·f / z   =>   z = b·f / d

- Big disparity: close point
- Small disparity: far away point
- d = 0: infinite depth (horizon)

Given two homologous (i.e. corresponding) points (pL, pR), each being the projection of the same scene point P, and knowing the parameters b and f, we can directly compute the depth of P.

Given pL, how can we determine its homologous pR (and vice versa)? This is the correspondence problem.
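A minimal sketch (not part of the original slides) of the relation z = b·f / d above; the baseline and focal length values are placeholders.

import numpy as np

def disparity_to_depth(d_px, baseline_m, focal_px):
    """Convert disparity [pixels] to depth [m]; d = 0 maps to infinite depth (horizon)."""
    d = np.asarray(d_px, dtype=np.float64)
    with np.errstate(divide="ignore"):
        return np.where(d > 0, baseline_m * focal_px / d, np.inf)

# Example with an assumed setup: b = 0.12 m, f = 700 px.
print(disparity_to_depth([70, 7, 0], baseline_m=0.12, focal_px=700.0))  # [1.2, 12.0, inf] m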

Disparity map

A grayscale image displaying at each pixel the associated disparity value.

For visualization purposes, disparity values are remapped to the range [0, 255]. Color mapping is also effective in conveying scene depth.

Tsukuba dataset: left image and ground-truth disparity map. Max disparity: 15. Scale factor: 16.
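A minimal sketch (not part of the original slides) of the remapping used for display: a map with maximum disparity 15, stretched with a scale factor of 16 and clipped to the 8-bit range, as in the Tsukuba example above.

import numpy as np

def disparity_for_display(disp, scale=16):
    """Scale integer disparities for visualization and clip to [0, 255]."""
    return np.clip(np.asarray(disp) * scale, 0, 255).astype(np.uint8)

print(disparity_for_display([0, 7, 15]))  # [  0 112 240]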

Correspondence search constraints

To retrieve depth via stereo, a point in one image has to be associated to its homologous in the other image (correspondence problem). This problem is hard, since in principle each point could match any point in the other view. Correspondence («stereo matching») algorithms exploit constraints aimed at reducing the number of potential candidates:

- epipolar constraint
- disparity range constraint
- smoothness constraint
- uniqueness constraint
- ordering constraint

Epipolar geometry

- The epipoles are determined by the intersection of the segment joining the two optical centers with the image planes
- For different 3D points, the epipolar lines rotate around the epipole
- P lies on the ray through OL and pL; the projection of this ray onto the right image is the epipolar line lR, so pR must belong to lR
- Thus, corresponding points must lie on conjugate epipolar lines

[Figure: epipolar geometry. eL, eR: epipoles; lL, lR: epipolar lines; the plane through P, OL and OR is the epipolar plane]

Epipolar constraint

- The epipolar constraint reduces the correspondence problem from 2D to 1D (search along epipolar lines only)
- If conjugate epipolar lines are collinear and parallel to the horizontal image axis, the candidates to search are easy to determine (e.g. pL(i*, j*) -> pR(i*, j))
- This happens in the ideal stereo setup; in real systems it is impossible to recreate these conditions via mechanical alignment
- We need to obtain it via software: stereo rectification [Trucco98]

Stereo calibration and rectification

Stereo camera calibration:

1. Calibration of each view:
   a) Estimate the K matrix (intrinsic parameter matrix, or camera calibration matrix):

      K = | fu  s   u0 |
          | 0   fv  v0 |
          | 0   0   1  |

      where fu = f / pw, fv = f / ph, (u0, v0) are the pixel coordinates of the image center and s is the skew factor.

   b) Estimate the lens distortion parameters: radial distortion (k1, k2, k3, ...) and tangential distortion (t1, t2, ...).

2. Estimate the extrinsic parameters relating the two views: the rotation matrix R (3x3) and the translation vector T, which define the transformation with respect to an absolute 3D Reference Frame.

This way, we are able to obtain:
- Perspective projection matrices (ppm): P = K · [R; T]
- Rectification homographies: 3x3 matrices which warp each view so that the two views are rectified with respect to each other.
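A hedged sketch (not part of the original slides) of the same pipeline using OpenCV, assuming the intrinsics K1, K2, the distortion coefficients and the extrinsics R, T between the views have already been estimated (e.g. with cv2.stereoCalibrate); all numeric values below are placeholders.

import numpy as np
import cv2

K1 = K2 = np.array([[700.0, 0.0, 320.0],
                    [0.0, 700.0, 240.0],
                    [0.0, 0.0, 1.0]])
dist1 = dist2 = np.zeros(5)               # k1, k2, t1, t2, k3
R = np.eye(3)                             # rotation between the two cameras
T = np.array([[-0.12], [0.0], [0.0]])     # assumed 12 cm baseline along x
size = (640, 480)

# Rectification transforms R1, R2 and the new projection matrices P1, P2.
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(K1, dist1, K2, dist2, size, R, T)

# Per-view warping maps; each image is remapped before stereo matching.
map1x, map1y = cv2.initUndistortRectifyMap(K1, dist1, R1, P1, size, cv2.CV_32FC1)
map2x, map2y = cv2.initUndistortRectifyMap(K2, dist2, R2, P2, size, cv2.CV_32FC1)
# left_rect = cv2.remap(left, map1x, map1y, cv2.INTER_LINEAR)
# right_rect = cv2.remap(right, map2x, map2y, cv2.INTER_LINEAR)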

Disparity range constraint

The possible disparity values are reduced from all candidates on the scanline to [dmin, dmax]. This is equivalent to setting a minimum and maximum depth range for the sensor. As a consequence, there are fewer match ambiguities and a reduced computational burden (complexity of most algorithms: O(w × h × (dmax − dmin))).

[Figure: the disparity range [dmin, dmax] around pR corresponds to the depth range [zmax, zmin]]

Horizontal offset

The horizontal («x») offset is an alternative representation of the disparity range constraint (ox / dmax instead of dmin / dmax), i.e. the search range becomes [ox, ox + dmax].

- Horopter (stereo depth of field): the corresponding depth range [zmin, zmax]
- Note: keeping the same number of elements within the disparity range and varying the offset modifies the depth range dz in the 3D scene (higher ox: smaller dz, higher depth resolution)

[Figure: disparity range [ox, ox + dmax] and the corresponding depth range [zmax, zmin]]

Stereo taxonomy [Scharstein02]

Area-based (dense):

- Local methods: the correspondence problem is solved at each point by employing only the pixels in a neighborhood of the point.
  - Fast
  - Problems along low-textured areas / depth borders
- Global methods: the correspondence problem is interpreted as a cost minimization problem computed on a graph which takes into account all pixels of the image. Each disparity thus depends on all other pixels in the image.
  - Better results in terms of accuracy of the retrieved disparity map
  - Higher computational cost
- Semi-global methods: the cost minimization problem is defined not on a graph over all image pixels (typically a loopy graph) but on each scanline (acyclic graph).
  - Good accuracy/speed trade-off

Feature-based (sparse, only computed on features)

Block-based stereo

[Figure: matching cost S(i, j, d) as a function of d over the disparity range; the WTA (Winner-Take-All) approach picks the disparity đ at the best-scoring candidate]

- Uses square windows of size (2W+1) x (2W+1)
- At each point of coordinates (i, j) of the reference image L, a matching function is computed over all the candidates of the other image R belonging to the disparity range:

  for all (i, j) in L, d in [dmin, dmax]:
  S(i, j, d) = Σ_{m=−W..W} Σ_{n=−W..W} ε( L(i+m, j+n), R(i+m, j+n+d) )

[Figure: the (2W+1)x(2W+1) window centered at (i, j) in L is compared with the windows centered at the candidate columns, from j+dmin to j+dmax, on the same row of R]

Matching measures

Matching measures are typically either similarity (affinity)-based or dissimilarity (distortion)-based. The former are generally based on cross-correlation, while the latter are derived from the Lp-norm distance.

One of the most used similarity measures is the NCC (Normalized Cross-Correlation), i.e. the inner product of the two window vectors L(i, j) and R(i, j, d) divided by the product of their norms:

NCC(i, j, d) = [ Σ_{m,n} L(i+m, j+n) · R(i+m, j+n+d) ] / sqrt( Σ_{m,n} L(i+m, j+n)² · Σ_{m,n} R(i+m, j+n+d)² )

with m, n ranging over −W..W. Being a similarity measure, the disparity value chosen for the current point corresponds to the maximum of the NCC values computed over the disparity range:

d(i, j) = argmax_{d in [dmin, dmax]} NCC(i, j, d)

NCC holds the property of being invariant to a constant linear transformation between the two images: L = a·R.

Matching measures (2)

Traditional dissimilarity measures are derived from the Lp norm of the difference between the two vectors L(i, j) and R(i, j, d) representing the two square windows on which the disparity is being evaluated:

d_p(i, j, d) = || L(i, j) − R(i, j, d) ||_p^p = Σ_{m,n} | L(i+m, j+n) − R(i+m, j+n+d) |^p

The most commonly used dissimilarity measures are the SSD (Sum of Squared Differences), obtained with p = 2:

SSD(i, j, d) = Σ_{m,n} ( L(i+m, j+n) − R(i+m, j+n+d) )²

and the SAD (Sum of Absolute Differences), with p = 1:

SAD(i, j, d) = Σ_{m,n} | L(i+m, j+n) − R(i+m, j+n+d) |

Being dissimilarity measures, the disparity value associated to the current point corresponds to the minimum of the SSD/SAD values computed over the disparity range:

d(i, j) = argmin_{d in [dmin, dmax]} d_p(i, j, d)

Main problems

Disparity variations within the window: the block-based algorithm relies on the assumption that disparity is constant within the window (fronto-parallel surfaces). This is not true in real scenes whenever we are considering points along depth borders (occluding boundaries).

A window covering regions at different depths does not have a corresponding window in the other image, since these regions will either be non-contiguous or occlude each other.

This implies higher ambiguity in determining the maximum/minimum of the matching measure being used and, thus, inaccuracies in localizing the occluding boundaries (smearing/fattening problem).

Main problems (2)

Occlusions: areas of the scene visible only in the reference image (occluded by foreground objects), for which no correct match exists in the other view.

Main problems (3)

- Low-textured regions and repetitive patterns along epipolar lines
- Choice of the window size: a bigger window has a higher SNR (the window «captures» more useful appearance information, especially on low-textured regions) but lower spatial resolution (details are not retrieved). The bigger the window, the more likely the constant-disparity assumption will be violated.

Facing photometric distortions

Photometric distortions are common, due to illumination differences between the two images (non-Lambertian surfaces, specularities, ...) and to different intrinsic camera parameters (gain, exposure, ...). To compensate for these differences, typical solutions are:

- Band-pass filters (e.g. the LOG – Laplacian of Gaussian – operator) applied to both images, typically as a pre-processing step before stereo matching
- Subtraction of the mean value computed on a square window (e.g. [DiStefano04, Faugeras93]), also as a pre-processing step
- Matching measures that are more robust to photometric distortions

ZNCC, ZSSD, ZSAD

A transformation that can be applied to the images to increase the robustness of stereo matching in the presence of photometric distortions is to subtract from each point P the average intensity value computed on a square window centered at P. This approach compensates for constant intensity offsets. In practice, the transformation can either be applied as a pre-processing step prior to stereo matching, or be embedded directly in the matching measures. When applied to the Lp-norm based measures, we obtain the Zero-mean Sum of Squared Differences (ZSSD) and the Zero-mean Sum of Absolute Differences (ZSAD):

ZSSD(i, j, d) = Σ_{m,n} ( [ L(i+m, j+n) − L̄(i, j) ] − [ R(i+m, j+n+d) − R̄(i, j+d) ] )²

ZSAD(i, j, d) = Σ_{m,n} | [ L(i+m, j+n) − L̄(i, j) ] − [ R(i+m, j+n+d) − R̄(i, j+d) ] |

where L̄(i, j) and R̄(i, j+d) denote the mean intensities over the two windows.

ZNCC, ZSSD, ZSAD (2)

Applying the same approach to the NCC measure, we obtain the Zero-mean Normalized Cross-Correlation (ZNCC):

ZNCC(i, j, d) = Σ_{m,n} [ L(i+m, j+n) − L̄(i, j) ] · [ R(i+m, j+n+d) − R̄(i, j+d) ] / sqrt( Σ_{m,n} [ L(i+m, j+n) − L̄(i, j) ]² · Σ_{m,n} [ R(i+m, j+n+d) − R̄(i, j+d) ]² )

ZNCC is thus invariant to affine intensity transformations between the two images: L = a·R + b.

«Interest operators»

- Only certain «interest points» are selected (the others are discarded), on which stereo matching is then computed
- These points are selected a priori as reliable for correspondence matching (e.g. characterized by enough texture)
- Example: the Moravec operator [Hanna85][Moravec79], based on the intensity variation of a pixel P over a neighborhood N(P) (from 3x3 up to 11x11). Eight directional variations are computed as the sum of squared differences between adjacent pixels along the 8 directions, e.g. for one diagonal direction:

  s1(P) = Σ_{(i,j) in N(P)} ( I(i, j) − I(i+1, j+1) )²

- The intensity variation is defined as s(P) = min( s1(P), ..., s8(P) )
- Interest points are those whose intensity variation is above a threshold
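A minimal sketch (not part of the original slides; window size and threshold are illustrative) of a Moravec-style interest measure: for each pixel, sum squared differences between adjacent pixels along 8 directions over a small neighborhood, keep the minimum, and threshold it.

import numpy as np

def moravec(img, w=1, threshold=100.0):
    """Return a boolean mask of interest points of a grayscale image."""
    img = img.astype(np.float64)
    h, wd = img.shape
    dirs = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]
    score = np.zeros((h, wd))
    for y in range(w + 1, h - w - 1):
        for x in range(w + 1, wd - w - 1):
            ssd_per_dir = []
            for dy, dx in dirs:
                s = 0.0
                for i in range(-w, w + 1):              # neighborhood N(P)
                    for j in range(-w, w + 1):
                        s += (img[y + i, x + j] - img[y + i + dy, x + j + dx]) ** 2
                ssd_per_dir.append(s)
            score[y, x] = min(ssd_per_dir)              # intensity variation s(P)
    return score > threshold                            # interest point mask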


Disparity filtering methods

Analysis of the matching-cost curve over d at each stereo correspondence.

[Figure: cost curves over d for a reliable correspondence, a depth border and a low-textured region]

Disparity filtering methods (2)

- Ratio between the best and the second-best maxima
- Analysis of the peak spread

[Figure: examples of the two criteria on matching-cost curves over d]

Disparity filtering methods (3)

Ratio between the best and the second-best maxima:

f(x_locmax) / f(x_max) > TH_RATIO,  with TH_RATIO in (0, 1)   =>   the correspondence is invalid

Analysis of the peak spread:

( f'(x_max^-) − f'(x_max^+) ) / 2 < TH_PEAK   =>   the correspondence is invalid (the peak is too flat)

Left-Right consistency check

- Compute two disparity maps, using as reference image both the left image (LR map) and the right image (RL map)
- Validate correspondences only if they are coherent over both views: if, according to the LR map, pR is the best match for pL, then according to the RL map, pL must be the best match for pR [Fua93]

[Figure: LR and RL disparity maps computed with the left and the right image as reference]

Left-Right consistency check (2)

Useful to filter out correspondence errors due to occlusions: if pL is not visible in R, during the first (LR) pass it will be wrongly matched to some point of R which, in turn, if visible in L, will have its own homologous p'L != pL.
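A minimal sketch (not part of the original slides) of the check on two disparity maps. It assumes the standard convention from the ideal-setup slide (d = uL − uR, positive disparities, so a left-image pixel matches a right-image pixel d columns to the left); adapt the sign if your convention differs.

import numpy as np

def lr_consistency_mask(disp_lr, disp_rl, tol=1):
    """disp_lr, disp_rl: HxW disparity maps (left and right reference). Returns a validity mask."""
    h, w = disp_lr.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    x_right = xs - np.rint(disp_lr).astype(int)            # homologous column in the right image
    valid = (x_right >= 0) & (x_right < w)
    back = np.zeros(disp_lr.shape, dtype=float)
    back[valid] = disp_rl[np.nonzero(valid)[0], x_right[valid]]
    return valid & (np.abs(disp_lr - back) <= tol)          # keep only coherent correspondences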

Some results (local algorithms)

[Figure: ground truth vs. block-based vs. variable support (Shiftable Windows [Bobick99]) vs. advanced local (Segment support [Tombari07])]

Some results (global algorithms)

[Figure: ground truth vs. Belief Propagation [Klaus06] vs. Graph Cuts [Kolmogorov01]]

The stereo dilemma

Accurate algorithms exist... but for slight improvements in accuracy we currently pay a high price in computation time.

[Chart: accuracy (% of correct disparities) vs. efficiency (computation time). Roughly: block matching runs in real-time with the lowest accuracy; SO/DP (scanline optimization, dynamic programming) take seconds; variable-support methods take minutes; global methods take hours and come closest to 100% (ground truth)]

Real-time stereo

Even considering the simplest algorithm (i.e. block matching), the complexity of the stereo algorithm is quite high: O(M² × N² × D), for M×M windows on N×N images with D disparities. For real-time applications, we need computational optimizations:

- Coarse-to-fine search: the disparity is estimated at low resolution and then refined (exploring only a small disparity range) at high resolution
- Hardware acceleration: DSP [Faugeras93b], FPGA [Woodfill98, Corke99, Jia03], GPU or dedicated chips [Vanderval01]
- Incremental schemes to avoid the redundant operations involved in the computation of the cost function C(i, j, d) [Faugeras93b, DiStefano04, Fua93]
- Parallelization via SIMD (Single Instruction Multiple Data) instruction sets for multimedia data (e.g. MMX), available on current general-purpose CPUs [DiStefano04]

Box-Filtering [McDonnell81]

The SAD cost over a (2n+1) x (2n+1) window,

SAD(x, y, d) = Σ_{i,j=−n..n} | L(x+j, y+i) − R(x+d+j, y+i) |

can be updated incrementally when the window moves down by one row:

SAD(x, y+1, d) = SAD(x, y, d) + U(x, y+1, d)

U(x, y+1, d) = Σ_{j=−n..n} | L(x+j, y+n+1) − R(x+d+j, y+n+1) | − Σ_{j=−n..n} | L(x+j, y−n) − R(x+d+j, y−n) |

i.e. the row entering the window from below is added and the row leaving it from above is subtracted.

[Figure: the (2n+1)x(2n+1) windows in the Left and Right images sliding from row y to row y+1]

Box-Filtering (2)

The row update U(x, y+1, d) can itself be computed incrementally along x from its value at the previous column:

U(x, y+1, d) = U(x−1, y+1, d)
  + | L(x+n, y+n+1) − R(x+d+n, y+n+1) | − | L(x−n−1, y+n+1) − R(x+d−n−1, y+n+1) |
  − | L(x+n, y−n) − R(x+d+n, y−n) | + | L(x−n−1, y−n) − R(x+d−n−1, y−n) |

SAD(x, y+1, d) = SAD(x, y, d) + U(x, y+1, d)

In the slides these four corner contributions are denoted A−A', B−B', C−C', D−D', where A, B, C, D are the corner pixels of the sliding window in L (columns x−n−1 and x+n, rows y−n and y+n+1) and A', B', C', D' the corresponding pixels in R (columns x+d−n−1 and x+d+n). In this way, each cost value is obtained with a constant number of operations, independent of the window size.

[Figure: corner pixels A, B, C, D in the Left image and A', B', C', D' in the Right image]

Bibliography – Part 2

[Bobick99] A. Bobick, S. Intille, "Large occlusion stereo", International Journal of Computer Vision, 33(3):181–200, 1999.
[Corke99] P. Corke, P. Dunn, J. Banks, "Frame-rate stereopsis using non-parametric transforms and programmable logic", Proc. IEEE Conf. on Robotics and Automation, 1999.
[DiStefano04] L. Di Stefano, M. Marchionni, S. Mattoccia, "A Fast Area-Based Stereo Matching Algorithm", Image and Vision Computing, vol. 22, no. 12, Oct. 2004.
[Faugeras93] O. Faugeras et al., "Real time correlation-based stereo: algorithm, implementations and applications", INRIA Rapport de recherche N. 2013, 1993.
[Faugeras93b] O. Faugeras et al., "Real time correlation-based stereo: algorithm, implementations and applications", INRIA Rapport de recherche N. 2013, 1993.
[Fua93] P. Fua, "A parallel stereo algorithm that produces dense depth maps and preserves image features", Machine Vision and Applications, 1993.
[Hanna85] M.J. Hanna, "SRI's Baseline Stereo System", Proc. Image Understanding Workshop, 1985.
[Jia03] Y. Jia, Y. Xu, W. Liu, C. Yang, Y. Zhu, X. Zhang, L. An, "A Miniature Stereo Vision Machine for Real-Time Dense Depth Mapping", 3rd Int. Conf. on Computer Vision Systems, 2003.
[Klaus06] A. Klaus, M. Sormann, K. Karner, "Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure", Int'l Conf. on Pattern Recognition, 2006.
[Kolmogorov01] V. Kolmogorov, R. Zabih, "Computing visual correspondence with occlusions using graph cuts", Proc. Int. Conf. on Computer Vision, 2001.
[Moravec79] H.P. Moravec, "Visual mapping by a robot rover", Proc. 6th Int. Joint Conf. on Artificial Intelligence, 1979.
[McDonnell81] M. McDonnell, "Box-filtering techniques", Computer Graphics and Image Processing, 17:65–70, 1981.
[Scharstein02] D. Scharstein, R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms", International Journal of Computer Vision, 47(1/2/3):7–42, 2002.
[Tombari07] F. Tombari, S. Mattoccia, L. Di Stefano, "Segmentation-based adaptive support for accurate stereo correspondence", IEEE Pacific-Rim Symposium on Image and Video Technology, 2007.
[Trucco98] E. Trucco, A. Verri, "Introductory Techniques for 3-D Computer Vision", Prentice Hall, 1998.
[Vanderval01] G. Van der Wal, M. Hansen, M. Piacentino, "The ACADIA Vision Processor", 5th Int. Workshop on Computer Architecture for Machine Perception, 2001.
[Woodfill98] J. Woodfill, B. Von Herzen, "Real-time stereo vision on the PARTS reconfigurable computer", Proc. IEEE Symp. on FPGAs for Custom Computing Machines, 1998.

Part 3 – Tasks and applications

3D Computer Vision tasks:
- Registration
- SLAM
- Retrieval
- Recognition
- Semantic segmentation

Applications

3D registration

Alignment of partially overlapping 2.5D views. Useful to yield a high-definition, fully-3D reconstruction of an object from views acquired from different viewpoints.

- Pairwise registration
- Registration from multiple views

Coarse-to-fine approach

- Coarse registration provides an initial guess for the set of views that need to be registered
  - By hand
  - By matching features
- Fine registration is generally based on Iterative Closest Point (ICP [Besl92]); see the sketch below
  - It will diverge if the initial guess is not reliable enough or the data is noisy

[Figure: coarse registration followed by fine (ICP-based) registration]
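A minimal sketch (not part of the original slides) of one ICP-style iteration: find closest-point correspondences with a kd-tree, then estimate the aligning rigid transform via SVD (Kabsch/Umeyama); real ICP implementations add outlier rejection and convergence tests.

import numpy as np
from scipy.spatial import cKDTree

def icp_step(src, dst):
    """One iteration: returns R (3x3), t (3,) aligning src towards dst (Nx3 and Mx3 arrays)."""
    idx = cKDTree(dst).query(src)[1]          # closest point in dst for each src point
    d = dst[idx]
    src_c, d_c = src - src.mean(0), d - d.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ d_c)   # SVD of the cross-covariance matrix
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = d.mean(0) - R @ src.mean(0)
    return R, t

# Iterate: src = src @ R.T + t, until the transform (or the residual) stops changing.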

Registration from multiple views

Global selection and registration of unordered input views [Huber03]:

1. Coarse pairwise registration among all the unordered input views
2. Fine pairwise registration
3. Global selection of a consistent set of pairwise registrations
4. Global refinement

[Figure: unordered input views; the view graph after global selection, with edges weighted by pairwise registration quality (values 0.0–0.7); the model after global refinement (Scanalyze)]

Results

[Figure: registration results on spacetime stereo data and on data from the Kinect sensor]

SLAM (1)

Simultaneous Localization and Mapping:
- incrementally build a map of the agent's surroundings (mapping)
- localize the agent within that map (localization)

Odometry, inertial sensing:
- Measurement drift
- Visual odometry [Nistèr 04][Konolige 06]

3D / photometric sensors:
- Laser scanner
- Sonar
- Stereo [Sim 06]
- Visual sensors (vSLAM) [Karlsson 05][Folkesson 05]
  - Landmark initialization?
  - 6DOF SLAM
- monoSLAM [Davison 03][Eade 06][Clemente 07]
  - Visual odometry + single camera

Credits: J.B. Hayet

SLAM (2)

[Diagram: EKF-based SLAM pipeline. Odometry and geometric/photometric data feed landmark extraction and data association; the Extended Kalman Filter is updated via odometry and via landmark re-observation, followed by the mapping update. Landmarks should be re-observable, distinctive and stationary. Local vs. global consistency is handled via loop closure and bundle adjustment.]

SLAM (3)

- MonoSLAM converging to Structure-from-Motion [Strasdat 10]
  - PTAM [Klein 07], DTAM [Newcombe 11]
- 6DOF SLAM with RGB-D sensors
  - KinectFusion [Newcombe 11b]
  - RGB-D dense point cloud mapping [Henry 11]

3D Shape Retrieval / Categorization

Unlike recognition, which identifies specific object instances (my cup, that teddy bear), shape retrieval/categorization associates a category label to a given query model. There is typically no clutter and no occlusion, but high intra-class variance.

[Figure: CUPS and HUMANS categories – a human... or a cup?]

Shape retrieval via global descriptors

1. Computation of the model descriptors (offline)
2. Computation of the query descriptor
3. kNN matching (see the sketch below)
4. Retrieval / categorization
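A minimal sketch (not part of the original slides; descriptor size, labels and data are illustrative) of descriptor-based retrieval: model descriptors are indexed offline, the query descriptor is matched to its k nearest neighbors, and the query is assigned the most frequent category label among them.

import numpy as np
from scipy.spatial import cKDTree
from collections import Counter

model_descriptors = np.random.rand(500, 64)             # toy global descriptors (offline)
model_labels = np.random.choice(["cup", "human"], 500)
index = cKDTree(model_descriptors)

def categorize(query_descriptor, k=5):
    _, idx = index.query(query_descriptor, k=k)          # kNN matching in descriptor space
    return Counter(model_labels[idx]).most_common(1)[0][0]

print(categorize(np.random.rand(64)))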

3D Object Recognition

Determine the presence of a model in a scene and estimate its 6DOF pose.

Challenges:
- Clutter, occlusions
- Sensor noise: missing parts, holes (transparent/dark objects), artifacts
- Dealing with large model libraries

To deal with clutter and occlusions, object descriptors are «shrunk» to a small region around interest points (local descriptors) [Tombari10]. Otherwise, area-based approaches (template matching) can be used [Hinterstoisser12].

Courtesy of S. Hinterstoisser

Results

[Figure: recognition results on Kinect data, spacetime stereo, synthetic data and real-time stereo]

Semantic segmentation

Goal: determine 3D connected components with specific properties or belonging to a particular semantic category.

- Applications: urban/indoor scene understanding, robot localization and navigation
- Also useful as the first step of an object categorization/recognition algorithm

Approaches:

1. Segmentation + segment classification
   - Euclidean clustering (see the sketch below)
   - Smooth region growing (clustering neighboring points on smooth surfaces)
   - Exploiting prior knowledge (e.g. dominant plane(s))
2. Pointwise feature classification + grouping
   - Inference on a loopy graph (MRF) [Tombari11]
   - Associative Markov Networks (AMN) [Anguelov05][Triebel07][Munoz09]
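A minimal sketch (not part of the original slides; radius and minimum size are illustrative) of Euclidean clustering: points whose neighbors lie within a distance threshold are grouped into connected components via a kd-tree and a breadth-first expansion.

import numpy as np
from scipy.spatial import cKDTree

def euclidean_clustering(points, radius=0.05, min_size=10):
    """points: Nx3 array. Returns per-point cluster labels and the ids of large-enough clusters."""
    tree = cKDTree(points)
    labels = -np.ones(len(points), dtype=int)
    current = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        queue = [seed]
        labels[seed] = current
        while queue:
            p = queue.pop()
            for nb in tree.query_ball_point(points[p], radius):   # neighbors within radius
                if labels[nb] == -1:
                    labels[nb] = current
                    queue.append(nb)
        current += 1
    sizes = np.bincount(labels)
    return labels, np.flatnonzero(sizes >= min_size)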

Applications – robotics

- Autonomous mobile robots – AMR (navigation)
- Object recognition, grasping and manipulation (social robotics)

Applications – video surveillance

- Tracking and motion detection
- Behavior analysis
- Retail intelligence
- Crowd monitoring
- People counting

Other applications

- Shape retrieval (web)
- High-definition 3D model acquisition (computer graphics)
- Biometric systems (e.g. face recognition)
- Medical imaging (MRI, CT, PET, x-ray, ultrasound, ...)

[Figure: the Michelangelo project, 3D face recognition, Google 3D Warehouse, 3D medical imaging]

And yet...

- Autonomous vehicle navigation (AVN)
- Augmented reality
- Human-computer interaction
- Videogaming, entertainment
- ...

[Figure: autonomous vehicle navigation; augmented reality by Lego and Intel; Microsoft Xbox]

Point Cloud Library

The reference open source community for 3D computer vision and robotic perception.

Includes modules for:
- Keypoint extraction (pcl_keypoint)
- Global/local descriptors (pcl_features)
- Object recognition in clutter (pcl_recognition)
- Surface registration (pcl_registration)
- Cloud segmentation (pcl_segmentation)
- And many more (machine learning, stereo, I/O, ...)

www.pointclouds.org

PCL modules: 22 code libraries, plus separate CUDA/GPU/Mobile modules.

Bibliography – part 3a

[Anguelov 05] D. Anguelov, B. Taskar, V. Chatalbashev, D. Koller, D. Gupta, G. Heitz, A. Ng, "Discriminative learning of markov random fields for segmentation of 3-d scan data", Proc. CVPR, 2005.
[Clemente 07] L. Clemente, A. J. Davison, I. Reid, J. Neira, J. Tardòs, "Mapping large loops with a single hand-held camera", Proc. Conf. Robotics: Science and Systems (RSS), 2007.
[Davison 03] A. J. Davison, "Real-time simultaneous localisation and mapping with a single camera", Proc. ICCV, 2003.
[Eade 06] E. Eade, T. Drummond, "Scalable monocular SLAM", Proc. Conf. on Computer Vision and Pattern Recognition, 2006.
[Folkesson 05] J. Folkesson, P. Jensfelt, H. Christensen, "Vision SLAM in the Measurement Subspace", IEEE Int. Conf. on Robotics and Automation (ICRA), 2005.
[Henry 11] P. Henry, M. Krainin, E. Herbst, X. Ren, D. Fox, "RGB-D Mapping: Using Depth Cameras for Dense 3D Modeling of Indoor Environments", Proc. Int. Symp. on Experimental Robotics, 2010.
[Hinterstoisser 12] S. Hinterstoisser, C. Cagniart, S. Ilic, P. Sturm, N. Navab, P. Fua, V. Lepetit, "Gradient Response Maps for Real-Time Detection of Texture-Less Objects", IEEE Trans. on Pattern Analysis and Machine Intelligence, 2012.
[Huber03] D.F. Huber, M. Hebert, "Fully automatic registration of multiple 3D data sets", Image and Vision Computing, 21:637–650, 2003.

Bibliography – part 3b

[Karlsson 05] N. Karlsson, E. D. Bernardo, J. Ostrowski, L. Goncalves, P. Pirjanian, M. E. Munich, "The vSLAM algorithm for robust localization and mapping", Proc. Int. Conf. on Robotics and Automation (ICRA), 2005.
[Klein 07] G. Klein, D. W. Murray, "Parallel tracking and mapping for small AR workspaces", Proc. Int. Symp. on Mixed and Augmented Reality (ISMAR), 2007.
[Konolige 06] K. Konolige, M. Agrawal, R.C. Bolles, C. Cowan, M. Fischler, B. Gerkey, "Outdoor mapping and navigation using stereo vision", Proc. Int. Symp. on Experimental Robotics (ISER), 2006.
[Munoz 09] D. Munoz, J. A. Bagnell, N. Vandapel, M. Hebert, "Contextual classification with functional max-margin markov networks", Proc. CVPR, 2009.
[Newcombe 11] R.A. Newcombe, S.J. Lovegrove, A.J. Davison, "DTAM: Dense Tracking and Mapping in Real-Time", IEEE International Conference on Computer Vision (ICCV), 2011.
[Newcombe 11b] R.A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A.J. Davison, P. Kohli, J. Shotton, S. Hodges, A. Fitzgibbon, "KinectFusion: Real-Time Dense Surface Mapping and Tracking", Proc. Int. Symp. on Mixed and Augmented Reality (ISMAR), 2011.
[Nistèr 04] D. Nistèr, O. Naroditsky, J. Bergen, "Visual odometry", Proc. Conf. on Computer Vision and Pattern Recognition (CVPR), 2004.
[Strasdat 10] H. Strasdat, J.M.M. Montiel, A. J. Davison, "Real-time Monocular SLAM: Why Filter?", Proc. ICRA, 2010.

Bibliography – part 3c

[Sim 06] R. Sim, J. J. Little, "Autonomous vision-based exploration and mapping using hybrid maps and rao-blackwellised particle filters", Proc. Conf. on Intelligent Robots and Systems (IROS), 2006.
[Triebel 07] R. Triebel, R. Schmidt, O. M. Mozos, W. Burgard, "Instance-based AMN classification for improved object recognition in 2d and 3d laser range data", Proc. Int. Conf. on Artificial Intelligence, 2007.
[Tombari10] F. Tombari, S. Salti, L. Di Stefano, "Unique signatures of histograms for local surface description", Proc. Europ. Conf. on Computer Vision (ECCV), Springer-Verlag, Berlin, Heidelberg, pp. 356–369, 2010.
[Tombari 11] F. Tombari, L. Di Stefano, "3D Data Segmentation by Local Classification and Markov Random Fields", Proc. Conf. on 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), 2011.