A 3D Laser and Vision Based Classifier

Bertrand Douillard, Alex Brooks, Fabio Ramos
Australian Centre for Field Robotics, Sydney, Australia
{b.douillard,a.brooks,f.ramos}@acfr.usyd.edu.au

Abstract—This paper presents a method for modelling semantic content in scenes, in order to facilitate urban driving. More specifically, it presents a 3D classifier based on Velodyne data and monocular color imagery. The system contains two main components: a ground model and an object model. The ground model is a novel extension of elevation maps using Conditional Random Fields. It allows estimation of ground type (grass vs. asphalt) in addition to modelling the geometry of the scene. The object model involves two segmentation procedures. The first is a novel extension of elevation maps to a hierarchical clustering algorithm. The second is a new algorithm for defining regions of interest in images, which reasons jointly in the 3D Cartesian frame and the image plane. These two procedures provide a segmentation of the objects in the 3D laser data and in the images. Based on the resulting segmentation, object classification is implemented using a rule-based system to combine binary deterministic and probabilistic features. The overall 3D classifier is tested on logs acquired by the MIT Urban Grand Challenge 2007 vehicle. The classifier achieves an accuracy of 89% on a set of 500 scenes involving 16 classes. The proposed approach is evaluated against seven other standard classification algorithms, and is shown to produce superior performance.

I. INTRODUCTION

As formulated by the CMU team after it won the DARPA Urban Challenge 2007 [1], next generation systems will require richer semantic representations: “[the vehicle] Boss has a very primitive notion of what is and is not a vehicle: if it is observed to move within some small time window and is in a lane or parking lot, then it is a vehicle; otherwise it is not. Time and location are thus the only elements that Boss uses to classify an object as a vehicle. This can cause unwanted behavior; for example, Boss will wait equally long behind a stopped car (appearing reasonable) and a barrel (appearing unreasonable), while trying to differentiate between them. A richer representation including more semantic information will enable future autonomous vehicles to behave more intelligently”.

This paper presents a method for providing a rich semantic representation for outdoor urban environments, based on monocular color imagery and 3D laser data (the output of the system is illustrated in Fig. 4 and 5). As highlighted by the CMU team, such representations are a necessity for enhancing the general level of autonomy of robotic systems. To the authors’ knowledge, this work represents the first attempt to classify logs recorded by vehicles involved in the Urban Grand Challenge 2007.

II. RELATED WORK

In the context of laser-based classification, the use of Markov Random Fields (MRFs) has recently gained popularity [2], [3], [4]. Each laser return is represented by a node in an MRF.

Note that object segmentation cannot be extracted from these models without further reasoning. A combination of laser and vision sensors forms the most standard multi-modal setup for classification in robotics applications. The MRF framework mentioned above has also been used with vision and 3D laser data [5]. Unlike the approach described in this paper, this MRF-based approach does not attempt to use geometry to segment objects in the scene. A number of approaches have opted for a similar strategy to the one adopted in this work. For instance, in [6], obstacles are first extracted by analyzing the terrain surface scanned by a downward looking laser. Classification is then based on the combination of geometrical and color features. In [7], a trail identification algorithm first exploits the laser information to make a decision with respect to the sub-type of environment the robot is in. It then knows what to look for in the vision data to complete the identification of a trail.

The contributions of this work are the following: (1) it proposes an extension of elevation maps by means of conditional random fields (CRFs) which allows estimation of the ground type in the vicinity of the robot in addition to modelling the geometry of the scene; (2) it proposes an extension of elevation maps to a hierarchical clustering technique; (3) it introduces an approach for extracting regions of interest (ROIs) in an image, based on joint reasoning in the 3D Cartesian frame and in the image plane; (4) it experimentally demonstrates the benefit of the integration of logic in classification systems.

III. MODELLING THE GROUND

Building the ground model involves (i) separating the ground from other objects, then (ii) estimating semantic labels for areas of the ground. The first step is achieved by using a laser scan (as shown in Fig. 1(a)) to build an elevation map [8] (as shown in Fig. 1(c)), then estimating unobserved data (as shown in Fig. 1(d)). The second step involves using vision data (Fig. 1(b)) to estimate semantic labels for observed regions, and interpreting the grid as a probabilistic network in order to propagate label estimates to unobserved regions.

A. Building an Elevation Map

Elevation maps are 2½D grid-based terrain models. In our implementation, each grid covers an area of 40 by 40 meters with a resolution of 40 centimeters, resulting in 10,000 cells. Ground and non-ground regions are segmented as follows. First, cells which are directly observed by the laser are classified [8]. It can be seen from Figure 1(c) that a large number of cells do not receive enough returns to allow computation of the terrain elevation (cells left white).
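As a rough sketch of the grid construction, the snippet below bins Velodyne returns into a 40 m × 40 m grid at 40 cm resolution and marks each observed cell as ground or occupied using a simple height-variation test. The actual cell classification follows [8]; the flatness criterion, the minimum return count and the thresholds here are illustrative assumptions only.

```python
import numpy as np

def build_elevation_map(points, size_m=40.0, res_m=0.4,
                        ground_height_range=0.15, min_returns=3):
    """Bin 3D laser returns (N x 3 array of x, y, z in the vehicle frame,
    vehicle at the grid centre) into a 2.5D elevation grid.

    ground_height_range and min_returns are illustrative values, not taken
    from the paper."""
    n = int(size_m / res_m)                     # 100 x 100 = 10,000 cells
    half = size_m / 2.0
    min_z = np.full((n, n), np.inf)
    max_z = np.full((n, n), -np.inf)
    counts = np.zeros((n, n), dtype=int)

    ix = ((points[:, 0] + half) / res_m).astype(int)
    iy = ((points[:, 1] + half) / res_m).astype(int)
    valid = (ix >= 0) & (ix < n) & (iy >= 0) & (iy < n)
    for x, y, z in zip(ix[valid], iy[valid], points[valid, 2]):
        counts[x, y] += 1
        min_z[x, y] = min(min_z[x, y], z)
        max_z[x, y] = max(max_z[x, y], z)

    # 0 = unobserved, 1 = ground, 2 = occupied
    labels = np.zeros((n, n), dtype=int)
    observed = counts >= min_returns
    flat = (max_z - min_z) <= ground_height_range
    labels[observed & flat] = 1
    labels[observed & ~flat] = 2
    return labels, min_z, max_z, counts
```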

(a) Input 3D Laser Data

(b) Input Image

(c) Under-Sampled Elevation Map

(d) Interpolated Elevation Map

Fig. 1. (a) A single Velodyne scan (colors mapped to height), and (b) an image of the same scene. (c) shows the elevation map built from the laser data. (d) shows the elevation map after interpolation. In the bottom two figures, yellow cells are identified as belonging to the ground, and dark green cells as occupied. The blue triangles indicate the position of the vehicle carrying the sensors. The units of the axes are meters.

As a consequence, the next step is to propagate ground labels to unobserved regions. This operation is implemented as an interpolation process [9]. Its output is illustrated in Fig. 1(d).

B. Cell-wise Semantic Classification of Ground Cells

The map building process described above exploits the geometric information provided by the laser to reconstruct the terrain surface. In order to go beyond a purely geometric representation, the next step is to infer a semantic label for each ground cell. In our implementation, this involves the use of color imagery to classify cells into one of two classes: asphalt or grass. These two classes were chosen because they are the two main ground types in the dataset used for our experiments. Moving to a dataset which involves more than two ground types would not require any algorithmic changes.

Classification of ground cells proceeds according to three main steps: (1) generation of Regions of Interest (ROIs) in the image, (2) feature extraction within each ROI, and (3) feature-based classification. Each ROI is obtained by first projecting the 3D center of a cell into the image plane (the projection mechanisms are detailed in [10]) and then defining a rectangular region around the projected point. The size of this region is adjusted as a function of depth using a simple linear mapping between depth and ROI size. ROIs whose sizes are below a pre-defined threshold (5×5 pixels in our implementation) are disregarded. Examples of ground ROIs are shown in Fig. 2(a). Feature extraction is then performed in each of the ROIs. Since in most cases asphalt and grass can be discriminated based on their colors (grey and green respectively), we use color features for classification. Specifically, we use normalized HSV histograms in which each bin represents one channel. Note that due to perspective constraints, the ROIs of the cells in the background are too small and are disregarded. As a consequence, vision features cannot be extracted for these ground cells. The ground cells left white in Fig. 2(b) are in this case.

Having defined the ROIs and computed the features, the next operation is classification. This is performed using Logitboost [11], implemented with decision stumps as weak classifiers.
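To make the ROI generation and colour feature extraction described above concrete, the sketch below projects a cell centre with a pinhole intrinsics matrix K (the paper's projection mechanism follows [10]), sizes the ROI linearly with depth, and builds a concatenated, normalised HSV histogram. The depth-to-size mapping, bin counts and thresholds are illustrative assumptions; the resulting feature vectors would then be fed to the boosted classifier.

```python
import numpy as np
import cv2

def cell_roi(cell_center_cam, K, img_shape, base_size=200.0, min_size=5):
    """Project a ground cell centre (x, y, z in the camera frame, z forward)
    and return a depth-scaled rectangular ROI, or None if it is too small."""
    x, y, z = cell_center_cam
    if z <= 0:
        return None
    u, v, w = K @ np.array([x, y, z])
    u, v = u / w, v / w
    side = int(base_size / z)                   # ROI side length in pixels
    if side < min_size:
        return None
    u0, v0 = max(int(u - side / 2), 0), max(int(v - side / 2), 0)
    u1, v1 = min(u0 + side, img_shape[1]), min(v0 + side, img_shape[0])
    if u1 - u0 < min_size or v1 - v0 < min_size:
        return None
    return u0, v0, u1, v1

def hsv_histogram(image_bgr, roi, bins=8):
    """Concatenated, normalised per-channel HSV histogram of an ROI."""
    u0, v0, u1, v1 = roi
    hsv = cv2.cvtColor(image_bgr[v0:v1, u0:u1], cv2.COLOR_BGR2HSV)
    feats = []
    for c, rng in zip(range(3), [(0, 180), (0, 256), (0, 256)]):
        h, _ = np.histogram(hsv[:, :, c], bins=bins, range=rng)
        feats.append(h / max(h.sum(), 1))
    return np.concatenate(feats)
```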

This produces a probabilistic output: a distribution over possible class labels. The estimated class label of a cell is obtained as the label with the largest probability. The output of the cell-wise classification process is illustrated in Figure 2(b). In our tests, care is given to using balanced training sets, that is, training sets containing the same number of samples in each class, which helps to obtain accurate Logitboost models.

C. Propagating Ground Label Estimates

As can be seen in Fig. 2(b), the majority of the ground cells are left white since ROIs could not be generated and hence label estimates could not be computed. This section proposes a technique to propagate the label estimates to these unobserved cells. In particular, the terrain grid is interpreted as a probabilistic network modelled as a Conditional Random Field (CRF). In this network, one node represents one ground cell and is connected to the eight neighboring cells. Conditional random fields are undirected graphs which model the structure of the joint probability distribution on a set of random variables [12]. In this work, the CRF is used to simultaneously account for the correlations between neighbouring cells and the direct observations of cell probabilities. Learning and inference in this probabilistic model of the ground follow standard methods and are not described here due to space constraints; they are detailed in [9]. The output of the inference process is illustrated in Fig. 2(c).
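The CRF learning and inference details are deferred to [9]. As a loose illustration of how label estimates can spread across the 8-connected grid, the sketch below clamps the cells that received a cell-wise estimate and iteratively averages class distributions into the remaining cells; this is a deliberately simplified stand-in, not the CRF inference actually used.

```python
import numpy as np

def propagate_labels(probs, observed, n_iters=50):
    """Simplified stand-in for the CRF inference of Sec. III-C: iteratively
    replace each unobserved cell's class distribution with the mean of its
    8-neighbours, keeping observed cells fixed.

    probs: (H, W, n_classes) array (unobserved cells can start uniform).
    observed: (H, W) boolean mask of cells with a cell-wise estimate."""
    H, W, C = probs.shape
    out = probs.copy()
    offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
               if (di, dj) != (0, 0)]
    for _ in range(n_iters):
        acc = np.zeros_like(out)
        cnt = np.zeros((H, W, 1))
        for di, dj in offsets:
            src = out[max(di, 0):H + min(di, 0), max(dj, 0):W + min(dj, 0)]
            acc[max(-di, 0):H - max(di, 0),
                max(-dj, 0):W - max(dj, 0)] += src
            cnt[max(-di, 0):H - max(di, 0),
                max(-dj, 0):W - max(dj, 0)] += 1
        mean = acc / cnt
        out[~observed] = mean[~observed]
    return out
```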

(a) Cell-wise Semantic Classification of Ground Cells

(b) Cell-wise Classification

(c) Ground Labels Propagation

Fig. 2. Example of ground cell classification; same scene as in Fig. 1. (a) Estimates displayed in the image plane and indicated by the color of the ROI; each ROI corresponds to a rectangle. In all the displays, grey refers to the class asphalt and green to the class grass. (b) Estimates displayed in the 3D scene and indicated by the colors of the ground cells. In this scene, the estimates obtained by cell-wise classification are correct. (c) Estimate propagation with the CRF ground model: the predicted class labels are correct.

IV. MODELLING OBJECTS

Once the ground has been identified, non-ground objects are segmented in the laser and vision data. Based on this segmentation, classification is performed using a rule-based system.

A. Object Segmentation in the Elevation Map

The segmentation of the elevation map is performed as a two-pass clustering process: a first pass in 2D and a second in 3D. During the first pass, cells identified as occupied (dark green cells in Fig. 1(d)) are clustered based on a simple neighbourhood criterion: occupied cells in contact with each other are defined to be in the same cluster. This operation provides an initial set of clusters. During the second pass, each cluster is treated independently. The laser returns in each cluster are processed to produce a new elevation map with a finer resolution: 20 cm instead of 40 cm. This local elevation map is then refined further: each grid cell is divided vertically, creating a set of 3D voxels with a side length of 20 cm. Voxels containing no returns are disregarded. The remaining voxels are clustered as follows: voxels in contact with one another are clustered together. An example of 3D clusters obtained with this process is displayed in Fig. 3.
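A minimal sketch of the second clustering pass, assuming the returns of one initial 2D cluster are given as an N×3 array: SciPy's connected-component labelling is used as the "voxels in contact" criterion.

```python
import numpy as np
from scipy import ndimage

def cluster_voxels(points, voxel_m=0.2):
    """Voxelise the returns of one initial 2D cluster at 20 cm and group
    voxels that touch each other into 3D clusters (sketch of Sec. IV-A)."""
    idx = np.floor(points / voxel_m).astype(int)
    idx -= idx.min(axis=0)                      # shift to non-negative indices
    occ = np.zeros(idx.max(axis=0) + 1, dtype=bool)
    occ[tuple(idx.T)] = True
    # Connected components over occupied voxels; a 3x3x3 structure means
    # voxels sharing a face, edge or corner belong to the same cluster.
    labels, n_clusters = ndimage.label(occ, structure=np.ones((3, 3, 3)))
    point_cluster = labels[tuple(idx.T)]        # cluster id for each return
    return point_cluster, n_clusters
```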

(a)

(b)

Fig. 3. Example output of the 3D segmentation process. These two figures correspond to the left part of the scene presented in Fig. 2. In this instance, the segmentation procedure is able to correctly separate a tree, a car and a pedestrian, with the canopy of the tree reaching over the pedestrian and the car. This case demonstrates the ability of the segmentation algorithm to represent overhanging structures.

B. ROI Definition for Object Classification

To exploit the vision data, a ROI in the image is associated with each 3D cluster. This is performed by means of an algorithm composed of two main parts: (1) ROI definition by point fitting, and (2) occlusion detection using approximate ray-tracing. Examples of ROIs generated by the algorithm can be seen in Fig. 4(b) and 5(b); the same figures also show examples of laser returns projected into the image using the procedure developed in [10].

The initial definition of the ROI associated with a given 3D cluster is obtained by first projecting the laser returns in the cluster into the image. This produces a set of points in the image plane. A grid is defined on top of these points and the cells of this grid which contain more than Np points are marked with a one (Np is set to 20 in our implementation); the other cells are marked with a zero. A greedy algorithm then attempts to grow an axis-aligned rectangle which contains only cells marked with a one. This results in a set of possible rectangles. The rectangle which contains the largest number of returns is retained as a candidate ROI.

The 3D clusters are processed in a near-to-far fashion. This sequence provides a mechanism to reason about occlusions. In the 3D scene, objects in the foreground are less likely to be occluded than objects in the background. ROIs associated with foreground objects are therefore constructed first, to allow incremental checking for occlusions. This is done by checking whether each newly-constructed ROI intersects with previously-constructed ROIs (that is, ROIs corresponding to clusters closer to the vehicle). If an intersection is detected, the size of the new ROI is decreased to ensure zero overlap with existing ROIs. This sequence of computations constitutes an approximate ray-tracing algorithm, detecting clusters which are on the same line of sight with respect to the vehicle. Due to perspective constraints, ROIs in the background of the image may be too small to allow the extraction of meaningful vision features. As a consequence, ROIs which cover an area smaller than 20×20 cm are disregarded. In this case, vision features cannot be extracted and the associated object is not classified. Such cases occur mostly for objects in the background of the scene, and do not prevent the system from processing objects in the immediate neighbourhood of the vehicle.
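A sketch of the greedy rectangle growing described above, under the assumption that growth proceeds from a seed cell by extending one side at a time while the newly added row or column is entirely marked one. The paper grows a set of candidate rectangles and keeps the one containing the most returns; the occlusion-based shrinking step is omitted here.

```python
import numpy as np

def grow_roi(occupancy, seed):
    """Greedy axis-aligned rectangle growing (sketch of Sec. IV-B).

    occupancy: 2D 0/1 array of grid cells (cells with >= Np projected
    returns marked one).  seed: (row, col) of a cell marked one.
    Returns the rectangle as (r0, r1, c0, c1), inclusive bounds."""
    H, W = occupancy.shape
    r0 = r1 = seed[0]
    c0 = c1 = seed[1]
    grown = True
    while grown:
        grown = False
        if r0 > 0 and occupancy[r0 - 1, c0:c1 + 1].all():
            r0 -= 1
            grown = True
        if r1 < H - 1 and occupancy[r1 + 1, c0:c1 + 1].all():
            r1 += 1
            grown = True
        if c0 > 0 and occupancy[r0:r1 + 1, c0 - 1].all():
            c0 -= 1
            grown = True
        if c1 < W - 1 and occupancy[r0:r1 + 1, c1 + 1].all():
            c1 += 1
            grown = True
    return r0, r1, c0, c1
```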

C. Feature-Based Object Classification

We now describe how ROIs are classified into semantically meaningful object categories useful for autonomous urban driving. The aim of this section is not to propose a new classification algorithm, but rather to experimentally compare a set of classifiers and suggest directions for improving their performance. Specifically, this section explores ways to exploit logical dependencies in the feature space. The motivation for doing so can be understood with an example. Consider the class “tree”. According to the English dictionary, a tree is defined as: “a woody plant that has a single main stem or trunk, with a distinct elevated crown”.

This definition can be reformulated using a set of attributes connected by the logical operator “and”:

∀x : isWoody(x) ∧ hasTrunk(x) ∧ hasDistinctElevatedCrown(x) ⇒ isTree(x)    (1)

The terms on the left hand side can be thought of as binary features extracted from the data. Given such binary features, the formula defines a simple classifier. By means of logical reasoning, this classifier generates a mapping to class labels, and the operator “∧” (logical “and”) provides a mechanism to exploit the logical dependencies in the feature space.
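As an illustration of how such rules act as a classifier over binary features, the sketch below encodes Eq. (1) as a rule table. Only the "tree" rule comes from the text; the other rule, the feature names it uses and the fall-back label are hypothetical placeholders, and the actual 21 features and rule set are those of [9].

```python
# Hypothetical rule table in the spirit of Eq. (1).
RULES = [
    ("tree", {"isWoody", "hasTrunk", "hasDistinctElevatedCrown"}),  # Eq. (1)
    ("car", {"isCarSized", "isOnAsphalt"}),                         # placeholder
]

def classify(features, rules=RULES, default="Large Object"):
    """features maps binary feature names to True/False.  The first rule
    whose required features are all true fires; otherwise a more general
    label is returned, loosely mirroring the fall-back behaviour described
    in Sec. V-B."""
    for label, required in rules:
        if all(features.get(f, False) for f in required):
            return label
    return default

# Example: a segment that is woody, has a trunk and a distinct elevated crown
print(classify({"isWoody": True, "hasTrunk": True,
                "hasDistinctElevatedCrown": True}))   # -> "tree"
```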

This section describes a system which reproduces such dictionary-like definitions of classes, and experimentally evaluates the effect on performance.

The logical definition of a class requires a set of binary features. A set of twenty-one binary features was defined based on the 3D point clouds. Each feature was designed to capture one particular aspect of a point cloud. For example, one feature is named isTall and is equal to one when the maximum height in the point cloud is above a pre-defined threshold (2.2 meters in our implementation). This feature is deterministic and tuned by hand; others are probabilistic and trained with labelled data, such as the outputs of Logitboost binary classifiers. The feature isGreen, for example, is trained to detect image patches whose color matches the foliage of a tree. Examples of the output of the system are displayed in Fig. 4 and 5. The full set of features is described in [9].

Ideally, the set of logical rules for combining the features would be learned from data; however, this is a challenging task. Two approaches for automated logical rule learning from data were evaluated. The first was based on interpreting a set of rules as a decision tree [13] and learning the structure of the tree. The set of rules obtained with this approach did not reproduce intuitive definitions of object classes such as the one in Eq. 1. Consequently, the ability of these rules to generalize was limited (as detailed in Sec. V-C). The second approach was the statistical structure learning algorithm described in [14]. The output of this algorithm is a graph in which the links encode the relationships between nodes representing features and nodes representing class labels. The graphs obtained were sparse, with a number of class labels not linked to any features. Hence, this second approach could not capture the logical structure of the problem either. By including human domain knowledge in the set of rules, a more effective and intuitive set of rules can be generated. The full set of rules is described in [9].

V. RESULTS

The dataset used in the experimental analysis was acquired by the team representing MIT in the Urban Challenge 2007 [10]. It corresponds to a 20 minute test run in the city of Boston, MA, USA. It contains 9077 images obtained with a monocular color camera, each image being associated with a set of 3D laser returns acquired with a Velodyne sensor. An image and the associated set of Velodyne scans form what we will refer to as a scene.

Cell-wise Classification. Accuracy: 97.7%
Truth \ Inferred    Asphalt    Grass    Precision    Recall
Asphalt             93823      145      99.8%        97.7%
Grass               2172       5105     70.2%        97.2%

Label Propagation. Accuracy: 99.4%
Truth \ Inferred    Asphalt    Grass    Precision    Recall
Asphalt             16086      55       99.7%        99.7%
Grass               46         84       65.1%        60.4%

TABLE I
GROUND CLASSIFIER PERFORMANCE

A set of 500 scenes evenly spaced across the dataset was manually labeled (representing about one million labeled laser returns). All performance evaluations are computed using 5-fold cross-validation.

A. Ground Classification Results

Table I shows the ground classification results using the Logitboost classifier described in Sec. III-B. The left part of the table provides the confusion matrix and the right part the precision and recall values. For comparison purposes, KNN and SVM classifiers are trained for the same task. For each of these two classifiers, parameters are tuned to maximise classification performance. The best accuracy of the KNN classifier is 91.3%; the best accuracy obtained with the SVM classifier is 94.5%. The Logitboost classifier described in this paper achieves the best performance: 97.7%. The output of the classifier is illustrated in Fig. 2(b).

We now evaluate the propagation mechanism described in Section III-C. The evaluation is performed as follows. In each test scene, 80% of the hand-labeled ground cells are processed with the ground classifier. This initialises the terrain grid with an incomplete set of estimates, as displayed in Fig. 2(b). The propagation algorithm is then run, and the remaining 20% of the labeled ground cells are used to compute the accuracy of that prediction. Table I provides the results of the evaluation. The overall prediction accuracy is high (99.4%) but is skewed by predictions in the class “asphalt”, which contains many more samples. The precision and recall values show that the performance for grass is lower. Grass is more difficult because the data tend to be more sparse (asphalt tends to be densely sampled because there is always densely-sampled asphalt near the car). When the data are sparsely sampled, predictions must be made based on fewer actual observations, implying higher uncertainty in the estimates. The output of the prediction process is illustrated in Fig. 2(c).

B. Object Classification

The performance of the rule-based system described in Section IV-C is summarised in Table II, which presents precision and recall values for the classification of 14 classes. The corresponding accuracy is 89.0%. This performance constitutes a central result of this work. Since the classifier is based on logical rules, it can logically determine when it cannot assess the precise class of an object, and can instead return a more general label such as “Large Object” or “Large Structure”. This aspect of the system is presented in detail in [9]. Sample outputs of the classification system are illustrated in Figure 4.

(a)

(b)

Fig. 4. Semantic representation of scene 22 (one of the 500 labeled scenes contained in the dataset). (a) 3D view of the inferred class labels.

The blue triangle indicates the vehicle’s position. The ground model developed in Sec. III has been used to classify the ground cells. Note that the green patch in the 3D plot corresponds to the grass which can be seen on the right of the image, demonstrating correct ground classification in this scene. The object model developed in Sec. IV has been used to classify the objects above the ground. (b) displays the inferred labels as well as the ROIs and the projected laser returns. The colour of each ROI matches the colour of the associated object in the 3D plot. All objects in the scene are correctly classified. Also, note that the segmentation algorithm has been able to detect the overhanging structure of the tree on the right. The objects plotted with a blue colour in the 3D representation correspond to objects for which a ROI could not be generated in the image due to perspective constraints, as discussed in Sec. IV-B.

(a)

(b)

Fig. 5. Semantic representation of scene 141, in which all classifications are correct. Note that the sets of projected returns are shifted slightly to the right with respect to the associated objects, due to timestamping issues when acquiring the data. A number of pedestrians, in particular those in the background, did not generate enough laser returns for a sufficiently large ROI to be built. As a result, they do not appear in the model of the scene. The ground classification is correct. Note that the ground classifier has been trained to classify lane markings as part of the class “asphalt” by including training samples corresponding to road markings. In the 3D part of this display, a beige cube appears in the middle of the road. It was computed during the generation of the elevation map and corresponds to an incorrect minimum and maximum height measured in the cell. These spurious measurements are due to laser beams hitting road markings at a grazing angle (this phenomenon was also observed in [10]). This cube obscures the cells behind it, which are consequently evaluated as occluded during the interpolation of the ground height. The interpolation algorithm is, in the case of these cells, not able to evaluate a height and thus leaves a trail of white cells behind the cube.

Class              Precision (in %)    Recall (in %)    Number of instances
Car                96.7                83.3             648
Tree Trunk         83.3                100              12
Tree               92.0                96.0             364
Bush               91.0                57.4             89
Vegetation         96.3                90.9             300
Sign               95.0                94.4             300
Pole               75.9                100              29
Pedestrian         82.1                92.7             123
Large Object       96.6                98.6             354
Large Structure    93.0                100              100
Wall               89.8                81.2             420
Building           92.5                88.6             134
Truck              12.7                93.3             110
Other              55.6                60.3             134

TABLE II
OBJECT CLASSIFIER PERFORMANCE

Classifier        Features                              5-fold cross-validation accuracy (in %)
KNN               Spin Images [15]                      30.4
SVM               Spin Images [15]                      32.0
KNN               Spherical Harmonic descriptor [16]    52.6
KNN               Binary Features                       79.9
Decision Trees    Binary Features                       80.2
SVM               Binary Features                       80.7
Logitboost        Binary Features                       81.9
Rule-based        Binary Features                       89.0

TABLE III
OBJECT CLASSIFICATION COMPARISONS

C. Comparison With Other Techniques

Using identical features, the rule-based system is compared against the following standard classifiers: KNN, SVM, Logitboost, and Decision Trees; and also against classifiers based on 3D point cloud descriptors such as the Spin Image or the Spherical Harmonic descriptor. The full analysis of these results is presented in [9]. Here, the main aspects of this comparison are summarized. The superior performance of the rule-based system can be explained as follows. Introducing logical reasoning decreases the number of admissible feature combinations. The view of logic as a set of constraints in the feature space is also developed in [17]. A total of 21 binary features generate 2²¹ = 2,097,152 code words which classifiers such as KNN, SVM or Logitboost have to separate. On the other hand, the set of rules in the proposed system results in 45 possible feature combinations. The compression factor associated with the use of logic can be approximated here as 2²¹ / 45 ≈ 46,000. In other words, human domain knowledge has been used to drastically reduce the set of feature combinations which the classifier must consider.

The poor results for the Spherical Harmonic Representation and the Spin Image suggest that the Velodyne data is insufficiently dense to allow accurate object classification based on laser data alone. The comparison with other approaches highlights the need to incorporate visual information and more compact features such as the binary features used here.

VI. CONCLUSION

This paper proposed a 3D classifier based on two main components: a ground model and an object model. The 3D classifier was tested on logged data acquired by the vehicle representing MIT during the Urban Grand Challenge 2007. The classifier achieved an accuracy of 89% on a set of 500 scenes involving 16 classes. The proposed approach outperformed the seven other classifiers tested.

VII. ACKNOWLEDGEMENT

The authors would like to thank Albert Huang for providing software and data and sharing his expertise, and Roman Katz for useful discussions. This work is supported by the ARC Centre of Excellence programme, the Australian Research Council (ARC), and the New South Wales (NSW) State Government.

REFERENCES

[1] C. Urmson et al., “Autonomous driving in urban environments: Boss and the Urban Challenge,” Journal of Field Robotics, Special Issue on the 2007 DARPA Urban Challenge, Part I, vol. 25, no. 8, pp. 425–466, June 2008.
[2] D. Anguelov et al., “Discriminative learning of Markov random fields for segmentation of 3D scan data,” in Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
[3] D. Munoz, N. Vandapel, and M. Hebert, “Directional associative Markov network for 3-D point cloud classification,” in International Symposium on 3D Data Processing, Visualization and Transmission, 2008.
[4] R. Triebel, K. Kersting, and W. Burgard, “Robust 3D scan point classification using associative Markov networks,” in Proc. of the IEEE International Conference on Robotics & Automation (ICRA), 2006.
[5] I. Posner, M. Cummins, and P. Newman, “Fast probabilistic labeling of city maps,” in Proc. of Robotics: Science and Systems, 2008.
[6] A. Castano, A. Talukder, and L. Matthies, “Obstacle detection and terrain classification for autonomous off-road navigation,” Autonomous Robots, vol. 18, pp. 81–102, 2005.
[7] C. Rasmussen, “A hybrid vision + ladar rural road follower,” in Proc. of the IEEE International Conference on Robotics & Automation (ICRA), 2006.
[8] S. Thrun et al., “Winning the DARPA Grand Challenge,” Journal of Field Robotics, 2006.
[9] B. Douillard, “Vision and laser based classification in urban environments,” Ph.D. dissertation, School of Aerospace and Mechanical Engineering, The University of Sydney, 2009.
[10] J. Leonard et al., “A perception-driven autonomous urban vehicle,” Journal of Field Robotics, vol. 25, no. 10, pp. 727–774, October 2008.
[11] J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: A statistical view of boosting,” The Annals of Statistics, vol. 28, no. 2, 2000.
[12] J. Lafferty, A. McCallum, and F. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proc. of the International Conference on Machine Learning (ICML), 2001.
[13] R. O. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. Wiley-Interscience, 2001.
[14] M. Schmidt, K. Murphy, G. Fung, and R. Rosales, “Structure learning in random fields for heart motion abnormality detection,” in Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[15] A. Johnson and M. Hebert, “Surface matching for object recognition in complex three-dimensional scenes,” Image and Vision Computing, vol. 16, pp. 635–651, 1998.
[16] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz, “Rotation invariant spherical harmonic representation of 3D shape descriptors,” in Proceedings of the 2003 Eurographics/ACM SIGGRAPH Symposium on Geometry Processing, 2003.
[17] C. Kemp, “The acquisition of inductive constraints,” Ph.D. dissertation, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, 2007.