Urban scene understanding from aerial and ground LIDAR data

Machine Vision and Applications DOI 10.1007/s00138-010-0279-7

ORIGINAL PAPER

Eunyoung Kim · Gérard Medioni

Received: 23 April 2009 / Revised: 14 April 2010 / Accepted: 4 June 2010 © Springer-Verlag 2010

Abstract We present a framework to segment cultural and natural features, given 3D aerial scans of a large urban area and (optionally) registered ground level scans of the same area. This system provides a primary step toward the ultimate goal of detecting every object from a large number of varied categories, from antennas to power plants. Our framework first identifies local patches of the ground surface and the roofs of buildings. This is accomplished by tensor voting, which infers surface orientation from neighboring regions as well as local 3D points. We then group adjacent planar surfaces with consistent pose to find surface segments and classify them as either terrain or roofs of buildings. The same approach is also applied to delineate vertical faces of buildings, as well as free-standing vertical structures such as fences. The inferred large structures are then used as geometric context to segment linear structures, such as power lines, and structures attached to walls and roofs from the remaining unclassified 3D points in the scene. We demonstrate our system on real LIDAR datasets acquired from typical urban regions with areas of a few square kilometers each, and provide a quantitative analysis of performance using externally provided ground truth.

Keywords Large scale range image processing · Segmentation

E. Kim (B) · G. Medioni Computer Science Department, University of Southern California, Los Angeles, CA, USA e-mail: [email protected] G. Medioni e-mail: [email protected]

1 Introduction

Recent advances in light detection and ranging (LIDAR) technology provide the ability to collect 3D data over large areas with excellent resolution and accuracy. The precision of LIDAR data is now high enough to capture not only common large structures (e.g., buildings), but also very small or narrow objects such as antennas and A/C units (≤30 cm) in an urban scene. This improvement enables 3D-based urban scene understanding for various applications such as urban planning, autonomous robotic navigation, environment monitoring and 3D city modeling. Processing these valuable 3D datasets in order to extract static entities such as buildings, doors, mailboxes and vegetation is still a formidable challenge. We address here the bottom-up segmentation process that produces bounding boxes for objects, which can then be further labeled by object recognition methods [3,6].

Figure 1 shows a representative urban region that contains a variety of common structures such as densely built buildings and houses, vegetation and small objects (e.g., fire hydrants). The area of the region is 2,196 × 2,997 m², and the 3D points captured from it amount to 2.5 GB (aerial) and 12 GB (ground) of binary data. Specifically, we process the entire region by dividing it into uniform small blocks (50 × 50 m²), each having up to a few million 3D points. The density of the 3D datasets varies from less than one point/m² to hundreds of points/m².

We process the 3D LIDAR datasets by enforcing general context constraints and provide a structured input which can be used by an object recognition system. Figure 2 shows a block diagram of our system. The system starts by identifying the terrain and the roof surfaces of the buildings (houses) densely built in common urban regions.



Fig. 3 3D point cloud with non-uniform density: sparse aerial data (rectangle) and dense ground level data (circle)

Fig. 1 Example of typical urban area

Given the aerial LIDAR data, surface orientation is inferred by locally fitting a planar surface to the 3D points. We then compute surface segments, which are groups of planar patches having similar pose. Each surface segment corresponds to either the ground or the roof of a large structure. Tensor voting [10] is used in both processes, surface orientation inference and grouping, to infer surface segments robustly by gathering surface information from neighboring regions as well as local 3D points.

We then read the ground-based LIDAR data, if present. For the areas we have processed, the aerial and ground datasets are registered; how to register the two datasets is an interesting problem, beyond the scope of this paper. A challenging issue in using both aerial and ground-based LIDAR datasets is the non-uniform sampling of the 3D measurements, as illustrated in Fig. 3. The aerial data is very sparse but uniformly distributed, while the ground data is very dense and accurate, but irregularly sampled due to the viewpoint.

Fig. 2 Flowchart of the proposed approach (bounding boxes on structures: results of object segmentation)

The system also extracts another important component of large structures: walls. Our planar patch grouping approach contributes to inferring vertical planar surfaces from non-uniform density 3D point clouds. The ground surface and the major components of buildings (roofs and walls) then provide spatial context to delineate smaller objects. We describe our approach to finding instances of the following classes: line-like objects (e.g., power lines and lamp posts), objects on top of buildings (e.g., dishes and antennas) and objects attached to walls (e.g., A/C units). Figure 4 displays the output of the system on area 1 in Fig. 1. In total, 21,337,155 (aerial) and 81,396,776 (ground) 3D points are taken from the region.


Fig. 4 Experimental results of our system: the ground (purple) and roof surfaces of the structures with rectangle representation for every planar patch, and the inferred vertical walls (yellow bounding boxes with 20 cm width)

The rest of the paper is organized as follows: Sect. 2 briefly describes related work. The details of the planar patch grouping module are presented in Sect. 3. In Sects. 4 and 5, the methods for wall detection and urban object segmentation are described, respectively. Section 6 presents experimental results, and the paper concludes with a discussion of future work.

2 Related work

Many researchers have proposed a variety of methods for building and terrain segmentation using aerial LIDAR data. The most widely used approach for these applications is planar patch grouping, as most buildings and terrain are nearly planar. The experimental comparison in [12] demonstrates that surface-based approaches generally perform best for complex urban scenes. In [2], planar surfaces are detected and classified as roofs based on RANSAC-based plane fitting and the gradient orientation. Jiang et al. [5] suggested a segmentation method that measures the local properties of the surface patch around each point and groups points with relatively consistent curvature into one region. The automatic building construction methods from airborne LIDAR data proposed in [14,16] are based on region-growing algorithms using plane-fitting techniques. Verma et al. [14] presented a terrain and building segmentation approach that detects planar patches using Total Least Squares (TLS) and then groups consistent neighboring patches together within a region-growing framework. In [9], the same approach is used for the initial planar patch fitting and grouping; the planar patches are then refined using MLESAC and EM algorithms, and the segmentation result is further improved by applying belief propagation (BP). Model-fitting based methods have also been widely used for building segmentation. Luo and Gavrilova [8] and You et al. [15] suggested building modeling systems that extract the building structures by fitting user-defined geometric primitives to the aerial LIDAR data.

Several works detect specific objects (or geometric features) by point-level classification. In [1], Anguelov et al. introduced a framework based on a subclass of Markov Random Fields (MRF) for segmenting 3D data into four classes (ground, building, tree and shrubbery). Lalonde et al. proposed a method for segmenting LIDAR data into three classes using local 3D point statistics [7]: saliency features are computed from the distribution of 3D points, and a Gaussian mixture model, fitted offline with the EM algorithm, provides a parametric model of the saliency feature distribution. A directional Associative Markov Network (AMN) based approach is proposed in [11]; this work aims at classifying a 3D point cloud into five categories (wire, pole/trunk, scatter, ground and facade) based on contextual information between local features.

The major contribution of our system over these approaches is that it identifies the vertical walls of buildings and segments small urban objects based on context constraints with large structures, both of which are rarely addressed in previous work.

3 Ground and roof segmentation

To identify the ground and roof surfaces of the buildings from the given aerial 3D point cloud, we first infer surface orientation by fitting planar patches, and then group adjacent patches with consistent pose. We use tensor voting (TV) [10], a well-established methodology for inferring the properties of geometric features in many applications [13], to infer surface properties robustly from noisy data.

3.1 Tensor voting in 3D

This section gives an overview of TV. TV is a perceptual organization framework that infers salient geometric structures such as points, curves and surfaces based on the support that the tokens which comprise them receive from their neighbors in N dimensions, without a predefined model or iteration. It requires only one parameter, which defines the scale of voting. TV consists of two steps: (1) data encoding and (2) tensor voting (communication with neighbors). All input tokens, a set of unoriented or oriented points, are encoded as second-order symmetric tensors; each tensor is equivalent to an ellipsoid whose major axis serves as the orientation of the token and whose axis length represents the saliency. Then, each input tensor collects the tensor votes cast by its neighbors by aligning the voting fields along the orientation of the tensors. The voting fields are defined by a saliency decay function that attenuates the saliency with the length and curviness of the smoothest path connecting the receiver P and the voter Q:

DF(s, ρ) = exp(−(s² + cρ²)/σ²)    (1)

where ρ = 2 sin θ / l and s = θ l / sin θ (see Fig. 5). After voting, the tensor can be decomposed as

T = λ1 e1 e1^T + λ2 e2 e2^T + λ3 e3 e3^T
  = (λ1 − λ2) e1 e1^T + (λ2 − λ3)(e1 e1^T + e2 e2^T) + λ3 (e1 e1^T + e2 e2^T + e3 e3^T)

where the λi are the eigenvalues in decreasing order and the ei are the corresponding eigenvectors. From this decomposition, we can infer three types of information:

• Pointness (ball tensor): no orientation; saliency is λ3
• Curveness (plate tensor): orientation is e3 and saliency is λ2 − λ3
• Surfaceness (stick tensor): orientation is e1 and saliency is λ1 − λ2.

Fig. 5 Illustration of decay function
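As a concrete illustration of this decomposition (our sketch, not part of the original system), the ball, plate and stick saliencies of an accumulated 3 × 3 tensor can be read off its eigendecomposition with numpy:

import numpy as np

def decompose_tensor(T):
    """Split a 3x3 symmetric tensor into stick/plate/ball components.

    Following the decomposition above: stick saliency lam1-lam2 with normal e1,
    plate saliency lam2-lam3 with tangent e3, ball saliency lam3 (no orientation).
    """
    # eigh returns eigenvalues in ascending order for a symmetric matrix
    w, v = np.linalg.eigh(T)
    lam3, lam2, lam1 = w                        # ascending -> lam3 <= lam2 <= lam1
    e1, e3 = v[:, 2], v[:, 0]
    return {
        "surfaceness": (lam1 - lam2, e1),       # stick component: surface normal e1
        "curveness":   (lam2 - lam3, e3),       # plate component: curve tangent e3
        "pointness":   (lam3, None),            # ball component: no orientation
    }

# Example: a neighborhood that votes mostly along the z axis looks surface-like
T = np.diag([0.1, 0.2, 2.0])
info = decompose_tensor(T)
print(info["surfaceness"])                      # high saliency, normal close to (0, 0, 1)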

Fig. 6 Overview of surface inference (For illustration purposes, we use a 2D grid representation instead of 3D voxels)


3.2 Surface orientation inference

Unlike other existing approaches [14,16] that identify a planar patch for every uniformly partitioned voxel, we use a volume-tree based approach in order to fit multi-scale planar patches, expecting the following advantages: (1) we can infer surface orientation from irregularly sampled 3D points by fitting multi-scale planar patches, and (2) we can improve computational performance by fitting a single large planar patch to a vast extent of flat area instead of fitting a number of small patches. Figure 6 shows the process flow of our approach. Given an input 3D point cloud (Fig. 6a), the planar surface fitting starts from a large volume, and the volume is repeatedly divided into several small volumes until the estimated planar surface locally fits the 3D points in the volume well. The details of the proposed approach are described in Algorithm 1. The entire 3D dataset is first uniformly partitioned into a voxel set Vp with σx × σy × σz voxels (e.g., 6 × 6 × 6 in Fig. 6). The resolution σx × σy × σz of the voxelization process is highly correlated with the scale of the planar patch to be fitted and with the number of neighbors that influence the inferred surface orientation. At each level, the representative 3D point from every voxel is assigned to the input tokens of TV. In the TV framework, each token, which is equivalent to a region, collects surface information from its neighboring regions. Then, after decomposition, the e1 eigenvector of the tensor is assigned to the normal N of the planar patch in the corresponding voxel. TV serves to infer the surface orientation from neighboring surfaces, while conventional approaches estimate it only from the local points in the voxel. In Fig. 6, the red points (or blue points in level 3) illustrate the representative points from the voxels, having a preferred orientation inferred by TV.

Urban scene understanding from aerial and ground LIDAR data

Algorithm 1 Pseudocode for surface orientation inference
1: Set of planar patches Π (Π = ∅)
2: Construct a root volume Vp of the volume tree with σx × σy × σz voxels
3: Map the 3D points in P to the voxel set Vp
4: For every voxel in Vp, infer surface orientation by the TV process
5: for every voxel v ∈ Vp s.t. length(v) ≥ τmin do
6:   Compute a plane model π
7:   if π is well fitted to the 3D points in v then
8:     Π = π ∪ Π
9:   else
10:    Divide v into σx × σy × σz voxels and apply the same process from line 3 to the new voxels (v is regarded as Vp)
11:  end if
12: end for
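For concreteness, the following short sketch mirrors the recursion of Algorithm 1 (our illustration, not the authors' code): the subdivision uses 2 × 2 × 2 children rather than σx × σy × σz, the surface normal comes from the local covariance instead of a full tensor voting pass, and the acceptance test is the distance/flatness check described in the next paragraph.

import numpy as np

TAU_U = 0.5     # uncertainty of the 3D data (m), as in Sect. 3.2
TAU_MIN = 0.5   # minimum voxel size (m)

def fit_patches(points, origin, size, patches):
    """Recursively fit planar patches to `points` inside a cubic volume."""
    if len(points) < 3 or size < TAU_MIN:
        return
    centroid = points.mean(axis=0)
    cov = np.cov((points - centroid).T)
    w, v = np.linalg.eigh(cov)                      # ascending eigenvalues
    normal = v[:, 0]                                # smallest-variance direction
    dist = np.abs((points - centroid) @ normal)     # point-to-plane distances
    flat_enough = np.sqrt(w[0]) <= TAU_U            # spread along the normal is small
    if dist.max() <= TAU_U and flat_enough:
        patches.append((centroid, normal, size))    # accept one large patch
        return
    # otherwise split the volume into 2x2x2 children and recurse
    half = size / 2.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                o = origin + half * np.array([dx, dy, dz])
                mask = np.all((points >= o) & (points < o + half), axis=1)
                fit_patches(points[mask], o, half, patches)

# Toy usage on a roughly flat 50 m block
patches = []
pts = np.random.rand(1000, 3) * 50.0
pts[:, 2] = 0.05 * np.random.randn(1000)
fit_patches(pts, origin=np.zeros(3), size=50.0, patches=patches)
print(len(patches), "planar patches")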

Now, we verify whether the estimated planar patch is fully supported by the 3D points in the voxel, so that those points can be replaced with the planar patch. The first condition is that every point in the region must lie within distance τu of the plane, where τu corresponds to the uncertainty of the 3D dataset; otherwise, the candidate is discarded. We also compute the three eigenvectors (e1, e2, e3) of the covariance matrix of the 3D points in the region. The candidate is also rejected when e1 does not match the normal N of the candidate and the scale corresponding to e3 is not small enough for the region to be considered flat (≤ τu). If no planar patch fits the region (e.g., the red region in Fig. 6), the voxel is recursively divided into smaller voxels and the planar fitting approach is applied again in order to find smaller planar patches well fitted to the region. During the splitting process, the resolution of the voxels at the next level is adaptively determined according to the user-defined parameter τmin, which is closely related to the resolution of the 3D point cloud. The hierarchical splitting process continues as long as the size of the estimated voxel at the next level is greater than τmin, so that the cell length never becomes smaller than the point resolution. Note that all parameters are fixed in all experiments (τmin = τu = 0.5 m).

3.3 Planar patch grouping

To infer smooth surfaces from the planar patches, we use a connected-component region-growing method that aggregates contiguous planar patches with similar pose. The TV framework is exploited to prevent false boundary detection caused by local discontinuities. TV measures a surface saliency that is highly correlated with the uniformity of the data forming the surface. Each planar patch becomes an input token with a preferred orientation. After the TV process, we observe that the magnitude of the surface saliency of a planar patch reflects the similarity in pose with its neighboring surfaces, as shown in Fig. 7. Based on this observation, we can define a lower bound on the surface saliency required for patches to belong to the same surface segment.

Fig. 7 Surface saliency of the planar patches extracted from the 3D point cloud shown in Fig. 2



Let τg denote this lower bound: adjacent planar patches whose surface saliency is higher than τg are grouped together. To compute τg, suppose that there is a token P and neighboring tokens Q which cast their information to the token P, and that the tokens in Q and the token P form a smooth surface with a maximum difference in pose θt. Given the token P and θt, we artificially generate a set of neighbors Q and then compute the surface saliency of P, which corresponds to τg, in (2σ + 1) × (2σ + 1) × (2σ + 1) voxels. Using Q = {q1, . . . , qk}, τg is λ1 − λ2 of TP:

TP = Σk S(s, ρ) Tk = Σk exp(−(s² + cρ²)/σ²) Tk,    (2)

where Tk is the tensor of qk, TP is the tensor of P, ρ = 2 sin θt / l, s = θt l / sin θt, and l = dist(qk, P).

In this framework, the threshold τg works properly because (1) the surface saliency is computed by the decay function [10], (2) the input tokens have uniform density, and (3) all input tokens are re-scaled to collect surface information from σ × σ × σ neighboring voxels, where σ is the scale of the TV process. In the experiments, θt is set to π/9 and the corresponding τg is 25.

After the TV process, all planar patches are sorted in decreasing order of surface saliency. Then, the planar patch πi with the highest saliency starts to extend its region by adding adjacent planar patches. If a neighboring planar patch πj has saliency higher than τg, it is added to πi without any extra processing. Otherwise, a 2-ring local test is applied to find the exact boundary of the segment πi: to belong to πi's segment, πj and its neighboring patches must have a pose consistent with πi's segment. This process is applied iteratively until every planar patch is classified into a surface segment. Among the segmented surfaces {γ1, . . . , γm}, the ground region γg is identified based on the height and size of the region. Each remaining region γi (γi ≠ γg) that has an area larger than a threshold and no 3D points below its non-boundary region is classified as a roof of a building.
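As an illustration of the grouping step (our simplified sketch: patch adjacency and post-voting saliencies are assumed to be given, and the 2-ring pose test applied below τg is omitted), the region growing can be written as a breadth-first traversal of the patch adjacency graph:

from collections import deque

def group_patches(saliency, neighbors, tau_g=25.0):
    """Group adjacent planar patches into surface segments.

    saliency:  dict patch_id -> surface saliency (lambda1 - lambda2 after TV)
    neighbors: dict patch_id -> iterable of adjacent patch ids
    Returns a list of segments (lists of patch ids).
    """
    unassigned = set(saliency)
    segments = []
    # process seeds in decreasing order of saliency
    for seed in sorted(saliency, key=saliency.get, reverse=True):
        if seed not in unassigned:
            continue
        segment, queue = [seed], deque([seed])
        unassigned.discard(seed)
        while queue:
            p = queue.popleft()
            for q in neighbors.get(p, ()):
                # patches above the lower bound tau_g join without further tests;
                # the paper additionally applies a 2-ring pose test below tau_g
                if q in unassigned and saliency[q] >= tau_g:
                    unassigned.discard(q)
                    segment.append(q)
                    queue.append(q)
        segments.append(segment)
    return segments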

4 Wall identification

Inferring the presence of a wall is a crucial component of urban scene understanding in that (1) it delineates the boundary of a building, (2) it provides a geometric constraint to recognize objects attached to walls, and (3) a wall is itself an important part of an object such as a building. This section addresses our wall detection method, which identifies any free-standing wall normal to the ground surface, such as facades and fences. The planar patch grouping method provides the initial wall hypotheses by identifying planar structures orthogonal to the ground surface from a point cloud that contains unorganized 3D points with various depth errors.

4.1 Building wall detection

For every region γi classified as a roof surface, we can derive the plausible positions of the walls from the boundary of the region γi. Let βi denote a 2D boundary map of γi, with every cell in βi initialized to 0. Every 3D point belonging to the region γi is mapped to a cell (x, y) in the map βi and its height is assigned to βi(x, y); when several points fall into the same cell, βi(x, y) keeps the minimum height. After the assignment, for every occupied cell (xo, yo) whose 8-neighbors are fully occupied, βi(xo, yo) is reset to 0, so that the occupied cells in the map βi correspond to the boundary of the region γi. Finally, we obtain a 3D point cloud Pi that probably contains the vertical faces of the building γi by gathering the 3D points whose height is lower than βi.

Then, the module computes a group of vertical wall candidates Wi by inferring planar surfaces from the point cloud Pi, as described in Algorithm 2. Pi is obtained from both sparse aerial data and irregularly sampled ground data; consequently, the 3D points in Pi have non-uniform density and different depth uncertainties. Therefore, we use the planar patch grouping approach proposed in Sect. 3 with multiple uncertainties (φ1, . . . , φn | φ1 < · · · < φn) to extract as many wall candidates as possible. Similarly, multiple minimum sizes of planar patches (ω1, . . . , ωn | ω1 < · · · < ωn) are defined to compensate for the error by limiting the scale of the planar structure when a larger error is allowed.

We observe that a wall is a linear structure when projected down onto the ground plane. So, after estimating the initial wall candidates C, we estimate the final candidates Wi by projecting the 3D points of each candidate in C onto the ground and fitting a 2D line to the projected points. Each wall candidate corresponds to a plane orthogonal to the ground plane. The wall candidates in Wi have infinite extent, so the next step determines the boundary of each candidate using the point cloud Pi; the point cloud Pi is thereby classified into the wall candidates in Wi. The overall process is given in Algorithm 3. The wall candidates are sorted by the size of the corresponding planar surface in decreasing order, so that 3D points are first assigned to the wall candidate with the largest extent, since a large candidate coincides with the actual wall with high confidence. From the candidate w with the largest region, we identify the 3D points closely distributed around the candidate w, which represent the wall candidate.
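A compact sketch of the boundary-map construction described above (our illustration; the 0.5 m cell size matches τu, and roof_points is an assumed N × 3 array of the roof region's points):

import numpy as np

def boundary_map(roof_points, cell=0.5):
    """2D boundary map beta_i of a roof region: keep the minimum height per
    cell, then zero out cells whose 8-neighborhood is fully occupied."""
    xy = np.floor(roof_points[:, :2] / cell).astype(int)
    xy -= xy.min(axis=0)                       # shift indices to start at 0
    w, h = xy.max(axis=0) + 1
    beta = np.zeros((w, h))
    occupied = np.zeros((w, h), dtype=bool)
    for (x, y), z in zip(xy, roof_points[:, 2]):
        if not occupied[x, y] or z < beta[x, y]:
            beta[x, y] = z                     # keep the minimum height
        occupied[x, y] = True
    interior = occupied.copy()
    for (x, y) in zip(*np.nonzero(occupied)):
        nb = occupied[max(x - 1, 0):x + 2, max(y - 1, 0):y + 2]
        if nb.size < 9 or not nb.all():        # array border or hole nearby
            interior[x, y] = False             # -> this cell stays as boundary
    beta[interior] = 0                         # keep heights only on boundary cells
    return beta, occupied & ~interior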


Algorithm 2 Pseudocode for wall candidate identification
1: Set of vertical walls Wi (Wi = ∅), wall candidates C (C = ∅), valid point cloud Pwp (Pwp = Pi)
2: for j = 0 to n do
3:   Infer planar surfaces Λ from Pwp by running the planar patch grouping with φj and ωj
4:   for every planar surface λ ∈ Λ do
5:     Compute the dominant orientation of λ by removing noisy planar patches from λ
6:     if λ is orthogonal to the ground surface γg then
7:       Compute a plane model for λ with the dominant orientation
8:       C = λ ∪ C
9:       Pwp = Pwp \ Pseg, where Pseg are the 3D points in λ
10:    end if
11:    for every candidate c ∈ C do
12:      1. Let Pc be the 3D points on the candidate c
13:      2. Project every 3D point in Pc onto the ground and fit a 2D line to the projected points
14:      3. Refine the plane model of c
15:      4. Wi = c ∪ Wi
16:    end for
17:  end for
18: end for
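Lines 12–14 of Algorithm 2 reduce to fitting a vertical plane through the 2D line traced by the candidate on the ground. A minimal sketch, assuming a z-up frame with a horizontal ground plane (a simplification of the γg test in the algorithm):

import numpy as np

def wall_plane_from_points(points):
    """Fit a vertical wall plane to candidate 3D points (our sketch of
    lines 12-14 of Algorithm 2). Assumes a z-up frame, so 'orthogonal to
    the ground' means the plane normal is horizontal."""
    xy = points[:, :2]                              # project onto the ground plane
    centroid = xy.mean(axis=0)
    w, v = np.linalg.eigh(np.cov((xy - centroid).T))
    line_dir = v[:, 1]                              # dominant direction of the wall trace
    normal_2d = np.array([-line_dir[1], line_dir[0]])
    normal = np.array([normal_2d[0], normal_2d[1], 0.0])   # horizontal normal -> vertical plane
    point_on_plane = np.array([centroid[0], centroid[1], points[:, 2].mean()])
    return normal / np.linalg.norm(normal), point_on_plane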

Algorithm 3 Pseudocode to define the spatial extent of the wall candidate
1: Sort the wall candidates by the size of the planar surface
2: for every candidate w ∈ Wi do
3:   Let Pw denote the set of 3D points supporting the candidate w; initially Pw = ∅
4:   for every 3D point p ∈ Pi do
5:     if distance(p, w) ≤ τu then
6:       Pw = p ∪ Pw
7:     end if
8:   end for
9:   1. Project the 3D points in Pw onto the wall w and transform them to the wall coordinate system
10:  2. Construct a 2D occupancy map on the wall
11:  3. Map the transformed points to the occupancy grid
12:  4. Group adjacent cells horizontally to determine the width of the wall
13: end for
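The occupancy-grid steps of Algorithm 3 can be sketched as follows (our simplification: the supporting points are assumed to be already expressed in the wall's 2D u–z coordinates, and only the horizontal grouping of step 4 is shown):

import numpy as np

def wall_extent(uz_points, seed_u, cell=0.5):
    """Horizontal extent of a wall candidate (simplified step 4 of Algorithm 3).

    uz_points: Nx2 array of supporting points in the wall plane (u along the
               wall, z up). seed_u: a u value known to lie on the roof-derived
               part of the wall, used as the growing seed.
    Returns (u_min, u_max) or None if the seed column is empty.
    """
    u_idx = np.floor(uz_points[:, 0] / cell).astype(int)
    occupied_cols = set(u_idx.tolist())              # columns containing any point
    seed = int(np.floor(seed_u / cell))
    if seed not in occupied_cols:
        return None
    lo = hi = seed
    while (lo - 1) in occupied_cols:                 # grow left over adjacent cells
        lo -= 1
    while (hi + 1) in occupied_cols:                 # grow right over adjacent cells
        hi += 1
    return lo * cell, (hi + 1) * cell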

Then, these 3D points are projected onto the candidate plane and transformed into the wall coordinate system, which coincides with a 2D plane coordinate system (the xz plane in our implementation). Now, all supporting points Pw lie in the 2D plane coordinate system of w. To find the exact shape of the candidate occupied by 3D points, we construct a 2D occupancy map with cell length τu, and all supporting points in Pw are partitioned into the grid cells. We start the cell grouping from the cells occupied by 3D points belonging to the region of w and continue until there are no more occupied cells in the neighborhood. The final width of the grouped cells determines the extent of the wall.

We use two different wall representations: a mesh and a rectangle representation. The mesh representation of the wall is constructed by triangulating the adjacent occupied cells; its advantage is that it captures the exact shape of the wall. We can also infer a rectangle representation of the candidate from the estimated pose of the wall and βi.

The majority of buildings in an urban area have a consistent orientation. To enforce this observation, we compute the saliency of the orientation of every wall using Eq. 3; the orientation with the maximum saliency is classified as the dominant orientation υd, and the other walls are then updated to υd:

sal(υ) = Σ_{υk ∈ N(υ)} (υk · υ) + Σ_{υk ∈ N(υ′)} (υk · υ′),    (3)

where N(υ) is the set of neighboring vectors of υ and υ′ is the vector orthogonal to υ on the ground plane, i.e., υ · υ′ = 0 and υ′ · υg = 0 (υg is the orientation of the ground).

4.2 Wall detection for free-standing objects

The wall identification approach described in Sect. 4.1 can be applied to the unclassified 3D points in order to recognize the vertical faces of non-building objects in the scene. However, in this case, we use strict values for the two free parameters of the planar patch grouping process, since there are no geometric constraints that guarantee the existence of a face such as a roof. In other words, the wall candidates of non-building objects are extracted only from dense and accurate 3D point clouds, which capture the inherent shape of the vertical planar structures. Figure 8 shows a fence detected by our method.



Fig. 8 Identification of free standing walls

5 Small object segmentation

This section introduces our approach to small object segmentation. We identify groups of 3D points most likely representing 3D objects on large structures in urban areas, for example, linear structures and objects on the roofs/walls of buildings. Given urban 3D datasets, we face the following common challenges in small object segmentation: (1) objects are in a noisy and cluttered environment (Fig. 9a), and (2) objects are represented by very few points due to their small size (Fig. 9c, d). Our approach deals with these challenges by using the semantic context provided by the large structures detected so far in order to infer the presence of smaller objects.

Fig. 9 Challenges in object segmentation with LIDAR datasets

5.1 Linear structure identification

This section focuses on identifying linear structures in the scene, especially vertical poles (e.g., poles of signs and street lights) and horizontal curves (e.g., power lines and arms of street lights), from the remaining 3D point cloud Puc, i.e., the points not yet classified as roofs/walls.


Algorithm 4 provides pseudocode for our approach to linear structure identification. Given the unclassified point set Puc, we first construct a voxel set Vuc of the point cloud Puc, so that spatial relationships between 3D points can be inferred efficiently from the voxel set Vuc. For every voxel, we compute tensor information from the covariance matrix of the local 3D points in the voxel in order to infer the property of a linear structure in the cell. Given the three eigenvectors e1, e2, e3 and the corresponding scales σ1, σ2, σ3 (σ1 > σ2 > σ3), (σ1 − σ2)/(σ2 − σ3) represents the saliency of a linear structure in the cell; if this saliency is greater than a threshold, the voxel is considered to contain a linear component with orientation e1.

The voxels with a linear structure are sorted by the saliency of the structure. Starting from the linear structure with the highest saliency, we extend it by grouping neighboring 3D points that have a consistent linear property. Because we aim to identify both straight and curved linear structures, which correspond to groups of smoothly connected local line segments, the linear structures of contiguous voxels are refined based on the linear structure of the current voxel. The current linear component predicts several possible linear components for its neighboring voxels by Gaussian sampling around it, and it is extended to a neighboring voxel when one of the proposed linear components coincides with the linear property of the 3D points in that voxel. The resulting set VN in Algorithm 4 contains a group of 3D points representing a single line structure.

To identify a vertical linear structure standing straight on the ground surface, we enforce the constraints that all such structures are connected to the ground surface and that every local linear component grouped together must have an orientation orthogonal to the ground surface. Similarly, for horizontal curve identification, we only use the local linear components parallel to the ground surface.


Algorithm 4 Pseudocode of linear structure identification
1: Set of line structures (ls) LS (LS = ∅)
2: Construct a uniform voxel set Vuc
3: Map the 3D points in Puc to Vuc
4: For every voxel, compute tensor information using the 3D points in the voxel
5: Sort the voxels in Vuc by line-ness
6: while there is a voxel v ∈ Vuc s.t. v contains an ls do
7:   New voxel set VN; initially VN = v
8:   while there is a voxel κ ∈ VN s.t. κ contains an ls do
9:     for every voxel κn ∈ 26-neighbors of κ do
10:      Predict line components for the voxel κn
11:      Identify the line component l that maximizes the consistency with the 3D points in κn among the predicted components
12:      if l fits the 3D points in κn then
13:        ls in κn = l
14:        VN = κn ∪ VN
15:      end if
16:    end for
17:  end while
18:  LS = VN ∪ LS
19: end while
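A small sketch of the per-voxel line-ness test used in line 4 of Algorithm 4 (our illustration; it assumes the (σ1 − σ2)/(σ2 − σ3) saliency ratio of Sect. 5.1 and estimates the tensor from the local covariance):

import numpy as np

def lineness(points):
    """Line saliency and orientation of one voxel's points (illustrative sketch).

    Uses the eigenvalues sigma1 >= sigma2 >= sigma3 of the local covariance:
    the voxel is line-like when (sigma1 - sigma2) / (sigma2 - sigma3) is large,
    and the line orientation is the dominant eigenvector e1.
    """
    if len(points) < 3:
        return 0.0, None
    w, v = np.linalg.eigh(np.cov((points - points.mean(axis=0)).T))
    s3, s2, s1 = w                                  # ascending order
    saliency = (s1 - s2) / max(s2 - s3, 1e-9)       # guard against division by zero
    return saliency, v[:, 2]                        # e1: dominant direction

# Example: points sampled along a vertical pole plus a little noise
t = np.linspace(0.0, 3.0, 200)
pole = np.stack([0.02 * np.random.randn(200),
                 0.02 * np.random.randn(200), t], axis=1)
sal, e1 = lineness(pole)
print(sal > 10.0, e1)                               # strongly line-like, e1 close to (0, 0, 1)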

Additional geometric constraints can be applied to segment more specific objects. For example, to extract power lines from the unclassified 3D points, a large height difference from the ground surface is required as an additional constraint.

5.2 Object segmentation on roofs and walls

This section presents our approach to grouping 3D points that represent small objects of interest attached to geometric components of large structures. An additional challenge is that most of these objects are vertically (or horizontally) disconnected, as shown in Fig. 9b and c, because the LIDAR datasets are acquired from very limited viewpoints in the air and on the ground. Our basic idea to overcome this challenge is to infer the spatial extent of an object from its top surface, which is mostly visible from viewpoints in the air and on the ground. Because we know the orientation of the surface on which objects rest, we can identify the visible surface by grouping the 3D points farthest from that surface.

This approach is applied to unclassified 3D points in the aerial LIDAR data in order to segment objects on roofs such as chimneys, antennas and dishes. For every roof surface, we extract the 3D points higher than the roof surface and partition them into voxels at different scales (long and narrow cells) for computational efficiency. Then, we determine the spatial extent (width and depth) of each object by inferring the top surface of the object. The height of the object corresponds to the distance between the roof surface and the top surface of the object. Finally, the 3D points in the estimated volume are classified as one single object on the roof.

The same approach is also used to identify objects extruding from the vertical walls of buildings, such as A/C units.

For every wall, the outgoing vector υow of the wall is computed based on the volume of the building inferred from the roof and wall information, and we then segment the 3D points on the wall using the position of the wall and υow. After extracting the 3D points on the wall, the point cloud is divided into small voxels. We then group 3D points starting from the most protruding surface of every object to define the dimensions of the object. In our experiments, the proposed method provides reasonable segmentation results in spite of the fact that the aerial and ground LIDAR datasets contain very few 3D points on the vertical parts of objects.
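As a rough illustration of the roof case (our sketch, not the authors' implementation: it assumes a horizontal roof at height roof_z and groups off-roof points into per-object boxes by 2D connected components on a coarse grid, rather than the multi-scale voxels described above):

import numpy as np
from collections import deque

def objects_on_roof(points, roof_z, cell=0.5, min_pts=5):
    """Segment points above a horizontal roof into per-object bounding boxes."""
    above = points[points[:, 2] > roof_z + 0.1]          # 10 cm tolerance above the roof
    if len(above) == 0:
        return []
    ij = np.floor(above[:, :2] / cell).astype(int)
    cells = {}
    for idx, key in enumerate(map(tuple, ij)):
        cells.setdefault(key, []).append(idx)
    boxes, seen = [], set()
    for start in cells:                                   # 2D connected components
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            cx, cy = queue.popleft()
            comp.extend(cells[(cx, cy)])
            for nx in (cx - 1, cx, cx + 1):
                for ny in (cy - 1, cy, cy + 1):
                    if (nx, ny) in cells and (nx, ny) not in seen:
                        seen.add((nx, ny))
                        queue.append((nx, ny))
        pts = above[comp]
        if len(pts) >= min_pts:
            boxes.append((pts.min(axis=0), pts.max(axis=0)))   # axis-aligned object volume
    return boxes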

6 Experimental results

We have tested the proposed system on several massive LIDAR datasets that capture vast urban areas (approximately 7 km² each), as illustrated in Fig. 1. These datasets are acquired from one airborne scanner and four car-mounted TITAN scanners, facing left, right, forward-up, and forward-down, with positional information acquired from the Global Positioning System (GPS) [4]. The 3D point clouds are registered into a common coordinate system, the Universal Transverse Mercator (UTM) coordinate system, which is a grid-based method of specifying locations on the surface of the earth. The reported registration error between airborne and car-mounted scans is 0.05 m. The aerial LIDAR dataset has an almost uniform resolution of approximately 25 points per m², while the resolution of the ground level dataset varies from 0.1 points/m² to hundreds of points/m² according to the viewing direction. A very limited number of ground truth labels for some objects in Area 1 (Fig. 1) is available; some regions are hidden as sequestered data for independent evaluation. Because the entire 3D dataset is too large to be processed at once, we partition the volume into small blocks (50 × 50 m² in our experiments).



Fig. 10 Qualitative result on roof identification

Fig. 11 Experimental results of large structure identification

Table 1 Quantitative analysis on power line identification

         Radius of line (m)   Distance error (m)   Orientation error (degree)
line 1   0.34                 0.509                3.68
line 2   0.497                0.78                 1.92
line 3   0.247                0.446                6.71
line 4   0.607                0.13                 1.05

Table 2 Quantitative analysis on small object segmentation

Category    nb of G.T   Scale of G.T (unit: m)   D.R (%)
dish        3           1.3–1.7                  100
chimney     13          0.27–3.6                 100
A/C units   18          0.19–2.2                 100
antenna     2           0.24–0.53                100

G.T ground truth, D.R detection rate


Fig. 12 Experimental result of object segmentation on walls: ground truth of A/C units (green box) and our results (red box)

Every block partially overlaps its eight neighbors along its boundary to allow an efficient merging process. Each block is processed in parallel, and we then merge the consistent results in the overlapping regions.

The result of roof identification is compared to the ground truth in the evaluation region (Area 1), as depicted in Fig. 10. For intuitive comparison, for each inferred planar surface, we found the corresponding ground truth using its position and assigned the same color to the surface. Under the assumption that the roofs are flat, the ground truth roofs have 0.21 m error on average, and the results inferred by our approach show an average error of 0.5 m compared to the ground truth. Figure 11 shows another qualitative result of roof and terrain segmentation, on area 2 in Fig. 1. The region covers a 700 × 800 m² area and contains many small houses in close proximity, with nearby vegetation. For visualization, we compute four vertices for every planar patch using the 3D points in the patch and use an Oriented Bounding Box (OBB) representation with 20 cm width for walls.
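For completeness, a tiny sketch of the overlapping block decomposition (our illustration; the 50 m block size is from the text, while the 5 m overlap margin is an assumed value not stated in the paper):

import numpy as np

def split_into_blocks(points, block=50.0, overlap=5.0):
    """Yield ((x0, y0), point subset) tiles of an x-y axis-aligned grid.

    Each block is extended by `overlap` meters on every side so that
    neighboring blocks share a margin and their results can be merged.
    NOTE: the 5 m overlap is an assumed value, for illustration only.
    """
    mins = points[:, :2].min(axis=0)
    maxs = points[:, :2].max(axis=0)
    for x0 in np.arange(mins[0], maxs[0], block):
        for y0 in np.arange(mins[1], maxs[1], block):
            lo = np.array([x0 - overlap, y0 - overlap])
            hi = np.array([x0 + block + overlap, y0 + block + overlap])
            mask = np.all((points[:, :2] >= lo) & (points[:, :2] < hi), axis=1)
            if mask.any():
                yield (x0, y0), points[mask]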

Fig. 13 Experimental results of object segmentation on roofs


Table 3 Computational time (unit: s)

nb of points                  S.I   W.I    L.I                O.W   O.R   T
143,643 (A)  1,074,287 (G)    3.3   15.4   0.6 (V), 0.2 (H)   0.2   0.2   19.9
143,388 (A)  1,600,652 (G)    3.1   14.1   1.9 (V), 0.8 (H)   0.1   0.1   20.1
288,172 (A)  2,429,229 (G)    3.5   15.1   1.5 (V), 0.6 (H)   0.2   1.7   22.6

A aerial data, G ground data, S.I surface identification, W.I wall identification, L.I linear structure identification, V vertical, H horizontal, O.W object segmentation on walls, O.R object segmentation on roofs, T total processing time

We compared the inferred lines to the ground truth; the quantitative analysis is given in Table 1. The ground truth contains only 4 power lines correctly visible in the 3D datasets. The table gives the average errors in distance and orientation between the ground truth and the estimated power lines. The output of object segmentation is also validated against the ground truth of some objects and, despite the challenges mentioned in Sect. 5, it detects every ground truth object, as shown in Table 2. That is, given the ground truth of antennas, dishes, chimneys and A/C units, we compare the segmented bounding volumes to the ground truth objects, and the comparison shows that these volumes match every ground truth object in terms of size and position. Figures 12 and 13 display some experimental results of object segmentation on walls and roofs, together with the ground truth; the color-coded bounding boxes represent the inferred volumes of the objects on the large structures.

Table 3 shows examples of the processing time on several blocks. Most of this time is spent on iterative planar patch grouping. The average processing time of our system on area 2 is 19.4 s for each 50 × 50 m² block. Our module is implemented on a PC with two 3.0 GHz CPUs and 8 GB of RAM.

7 Conclusion

This paper presents an overall framework of context-based object segmentation for 3D urban scene understanding, given aerial and ground level LIDAR datasets. The proposed system includes three major components: (1) Ground and roof surface identification: the ground and the roofs of buildings are identified by fitting planar patches to the 3D points and grouping smooth planar surfaces together. (2) Wall identification: we propose an efficient wall detection method that finds vertical planar structures well fitted to the 3D point cloud. (3) Urban object segmentation: meaningful objects are segmented from the unclassified 3D points based on context constraints. We evaluated our system on LIDAR datasets, and the experimental results indicate that it provides useful object segmentation and large structure identification outputs for urban scene understanding. Using the results of object segmentation, we intend to extend our work to 3D object classification in order to recognize urban infrastructure in the environment.

References

1. Anguelov, D., Taskar, B., Chatalbashev, V., Koller, D., Gupta, D., Heitz, G., Ng, A.: Discriminative learning of Markov random fields for segmentation of 3D scan data. In: Proceedings of IEEE CVPR, pp. 169–176 (2005)
2. Nardinocchi, C., Scaioni, M., Forlani, G.: Building extraction from LIDAR data. In: IEEE/ISPRS Joint Workshop on Remote Sensing and Data Fusion over Urban Areas, pp. 79–84 (2001)
3. Funkhouser, T., Kazhdan, M., Min, P., Shilane, P.: Shape-based retrieval and analysis of 3D models. CACM 48(6), 58–64 (2005)
4. Golovinskiy, A., Kim, V.G., Funkhouser, T.: Shape-based recognition of 3D point clouds in urban environments. In: ICCV (2009)
5. Jiang, J., Zhang, Z., Ming, Y.: Data segmentation for geometric feature extraction from LIDAR point clouds. In: Proceedings of IEEE International Geoscience and Remote Sensing Symposium, pp. 3277–3280 (2005)
6. Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in cluttered 3D scenes. IEEE TPAMI 21(5), 433–449 (1999)
7. Lalonde, J.F., Vandapel, N., Huber, D., Hebert, M.: Natural terrain classification using three-dimensional ladar data for ground robot mobility. J. Field Robotics 23(10), 839–861 (2006)
8. Luo, Y., Gavrilova, M.L.: 3D building reconstruction from LIDAR data. Lect. Notes CS 3980, 431–439 (2006)
9. Matei, B.C., Sawhney, H.S., Samarasekera, S., Kim, J., Kumar, R.: Building segmentation for densely built urban regions using aerial LIDAR data. In: Proceedings of IEEE CVPR, pp. 1–8 (2008)
10. Medioni, G., Lee, M.S., Tang, C.K.: A Computational Framework for Segmentation and Grouping. Elsevier Science Inc., New York (2000)
11. Munoz, D., Vandapel, N., Hebert, M.: Directional associative Markov network for 3-D point cloud classification. In: Proceedings of 3DPVT (2008)
12. Sithole, G., Vosselman, G.: Experimental comparison of filter algorithms for bare-earth extraction from airborne laser scanning point clouds. ISPRS J. Photogramm. Remote Sens. 59(1–2), 85–101 (2004)
13. Tang, C.K., Medioni, G.: Curvature-augmented tensor voting for shape inference from noisy 3D data. IEEE TPAMI 24(6), 858–864 (2002)
14. Verma, V., Kumar, R., Hsu, S.: 3D building detection and modeling from aerial LIDAR data. In: Proceedings of IEEE CVPR, pp. 2213–2220 (2006)
15. You, S., Hu, J., Neumann, U., Fox, P.: Urban site modeling from LIDAR. Lect. Notes CS 2669, 579–588 (2003)
16. Zhang, K., Yan, J., Chen, S.C.: Automatic construction of building footprints from airborne LIDAR data. IEEE Trans. Geosci. Remote Sens. 44(9), 2523–2533 (2006)


Author Biographies

Eunyoung Kim received her B.S. and M.S. degrees in Information and Communication Engineering from Sungkyunkwan University in 2003 and 2005, respectively. She is currently pursuing a Ph.D. degree in Computer Science at the Viterbi School of Engineering, University of Southern California (USC). Her research interests include object categorization and modeling in range images.

Gérard Medioni received his Diplome d’Ingenieur degree from the Ecole Nationale Supérieure des Télécommunications, Paris, France, in 1977 and M.S. and Ph.D. degrees from the University of Southern California (USC), Los Angeles, in 1980 and 1983, respectively. Since then, he has been with the USC, where he is currently a Professor of computer science and electrical engineering, a Codirector of the Institute for Robotics and Intelligent Systems, and a Codirector of the USC Games Institute. From 2001 to 2007, he was the Chairman of the Department of Computer Science, USC. He is also an Associate Editor for the Image and Vision Computing Journal, Associate Editor for the Pattern Recognition and Image Analysis Journal, and Associate Editor for the International Journal of Image and Video Processing. His research interests include computer vision, particularly edge detection, stereo and motion analysis, shape inference and description, and system integration. He served as Program Co-chair of the 1991 IEEE Computer Vision and Pattern Recognition (CVPR) Conference in Hawaii and the 1995 IEEE Symposium on Computer Vision in Miami, FL; General Co-chair of the 1997 IEEE CVPR Conference in Puerto Rico; Conference Co-chair of the 1998 ICPR Conference in Australia; General Co-chair of the 2001 IEEE CVPR Conference in Kauai, HI; and General Co-chair of the 2007 IEEE CVPR Conference in Minneapolis, MN. He serves as General Co-chair of the 2009 IEEE CVPR Conference in Miami.

