Body Part Tracking with Random Forests and Particle Filters

Author: Gabriel Walters

Body Part Tracking with Random Forests and Particle Filters


Anonymous Author(s) Affiliation Address email


Abstract

Low-cost depth cameras have created new opportunities for human-computer interaction (HCI), where users can interact through the camera with the swipe of a hand. The key to producing compelling interactions is for users to feel that they are accurately tracked, both spatially and temporally. While people tend to adjust for errors in tracked position, they generally find discontinuities in velocity unacceptable. This paper introduces an approach that combines per-pixel classification of body parts with multiple particle filters for tracking critical body parts. The results show smoother temporal continuity.


1  Introduction

Pose estimation and tracking from a depth camera has many applications in HCI, from games where users control the game with their bodies to doctors scrolling through medical information without having to put down a scalpel. The per-pixel classification approach proposed by Shotton et al. [1] is effective at predicting joint positions for a full body, but it does not consider any temporal information. As a result, the estimated joint positions can change dramatically from one frame to the next. Users generally find this unacceptable for interactions involving the relative position of their hands; for instance, controlling an onscreen pointer with your hand is incredibly frustrating if it jumps around while your hand is moving smoothly along an arc. Per-pixel classification is also unable to handle occlusion: if there are no visible pixels for a body part, it cannot make a prediction. An example is one hand passing in front of the other.

Per-pixel classification falls into the larger category of object recognition, which tries to recognize objects from a single image. The two other main categories of pose estimation and tracking are particle filters and graphical models for modelling temporal and spatial dependencies. Most object tracking research uses particle filters, also known as Sequential Monte Carlo, to estimate possible positions of an object over a sequence of images. Particle filters are effective at tracking simple objects such as points in space, but become exponentially more computationally expensive when tracking higher-dimensional representations such as a human pose [3]. Temporal and spatial models use Markov random fields to represent the hierarchical dependencies between joints over time [4] [5]. While these approaches produce excellent results, their computational cost makes them infeasible for real-time tracking.
In this paper I present a slight modification to the random forest classifier presented by Shotton et al. to determine the probability of a pixel being a particular body part. This is then combined with a particle filter to track body parts over time. The observation model is simply the probability of a pixel being classified as the body part being tracked. This approach produces lower worst-case errors for joint positions and velocities and is more robust to occlusion.


2  Generating Data

The data used for this paper was synthesized using standard computer graphics tools. All parts of the pipeline use open datasets and tools. Step 1 is generating a human mesh with MakeHuman (http://www.makehuman.org/). This body mesh is imported into Blender3d (www.blender.org) using the MakeHuman plugin. To classify each body part, a texture is mapped onto the mesh where each colour represents a different body part. The remaining inputs are a skeleton and animation from the CMU mocap database (http://mocap.cs.cmu.edu/). A custom script loads all the required inputs and renders a depth image along with the corresponding colour-coded image. This process is outlined in Figure 1.


Figure 1: Pipeline for generating the depth image and colour-coded image. The inputs are a body type mesh, colour map texture, skeleton and animation.

This process allows for hundreds of thousands of unique poses and their classifications to be rendered. Due to processing limitations, this paper focuses on 10-20 animations with 500 frames each. The images were rendered at 256x192.

3  Per Pixel Body Part Classifier

A random forest classifier is trained to classify each valid pixel in the depth image with a body part. The feature representation used for classification is similar to that of Shotton et al. [1] but uses a fixed-size feature vector. This allows off-the-shelf high-performance random forest learners to be used, which was critical to the project because it allowed the classifier to be trained in a reasonable timeframe. The random forest learner used for this project was based on the standard approach of maximizing the information gain for each new node added to the tree. For each training depth image, 500 random pixels on the character were selected. The features were generated from the depth image and the labels were generated from the colour-coded image. The labels were body part IDs in the range 0 to 19.

The feature vector for each pixel contains two parts: a local patch of relative depths and the relative position of the pixel compared to the rest of the body. Figure 2 shows both feature representations. The local patch representation is simply the image patch around the pixel with the depth of the pixel subtracted. This assumes that the distance of the character to the camera is always the same. To handle characters at different depths, the size of the patch in the depth image must be adjusted by the distance of the character. The features passed to the learner


and classifier are resized using standard computer graphics techniques to generate a patch of the correct dimensions. It is reasonable to assume uniform scaling across the entire character; therefore, the resizing only needs to be done once for the entire depth image. At a depth where the character spans most of the image, a window of 30x30 was used.

The relative position feature vector is the position of a pixel relative to the center of the body. The center is simply the mean pixel value in x, y, z coordinates for the entire character. Pixels with large depths are considered to be part of the background and are ignored. The standard deviation with respect to x, y, z is then determined from the center of the character. The feature vector is the position of a pixel relative to the center, normalized by the standard deviation. See Figure 2 and formula 1.

feature_2 = (p_pixel − µ_body) / σ_body    (1)
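As an illustrative sketch (not the paper's actual implementation), formula 1 can be computed in plain Python, assuming foreground pixels arrive as (x, y, z) tuples and a hypothetical max_depth threshold separates the character from the background:

```python
def relative_position_feature(pixels, max_depth=4000):
    """Relative-position feature (formula 1): (p_pixel - mu_body) / sigma_body.

    pixels is a list of (x, y, z) tuples; pixels deeper than max_depth
    (a hypothetical threshold) are treated as background and ignored.
    """
    body = [p for p in pixels if p[2] < max_depth]
    n = len(body)
    # Center of the body: per-axis mean over the character's pixels.
    mu = [sum(p[i] for p in body) / n for i in range(3)]
    # Per-axis standard deviation about the center (falls back to 1.0
    # when an axis has zero spread, to avoid division by zero).
    sigma = [((sum((p[i] - mu[i]) ** 2 for p in body) / n) ** 0.5) or 1.0
             for i in range(3)]
    # Normalized offset of each body pixel from the center.
    return [[(p[i] - mu[i]) / sigma[i] for i in range(3)] for p in body]
```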

A comparison between using only local features and using the relative position of a pixel together with local features can be found in Section 5. The key intuition is that the relative position feature was required for the random forest learner to learn the overall topology of a character, which allowed better distinction between the different segments of the torso. This was not required by Shotton et al. [1] because their approach has an arbitrarily large window in which any two pixels can be compared; it can be assumed that their approach learnt the topology of the torso by evaluating pixels near and above the top and bottom of the body. Random forests contain many decision trees generated from random subsets of the data. Therefore, the probability of a pixel being classified as type b can be approximated by the number of votes for b over the total number of votes.
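The vote-based probability estimate can be sketched directly; here tree_predictions is a hypothetical list of per-tree labels for one pixel:

```python
from collections import Counter

def vote_probability(tree_predictions, part_id):
    """Approximate P(pixel is part_id) as the fraction of trees voting for it."""
    votes = Counter(tree_predictions)
    return votes[part_id] / len(tree_predictions)

# 7 of 10 trees vote for body part 3, so the estimate is 0.7.
p = vote_probability([3, 3, 7, 3, 12, 3, 3, 7, 3, 3], 3)
```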


Figure 2: Local features of a patch are shown on the left. The relative position of a pixel is shown on the right.

4  Particle Filter Tracker

Particle filters, also known as Sequential Monte Carlo, provide a Bayesian framework for object tracking. By using the random forest as the observation model it is possible to track each body part with an independent particle filter. For most tracking applications the observation model cannot distinguish between two objects being tracked; in these cases the particle filter has to be multimodal to allow particles to jump between different objects, and particles tend to jump when one object is close to another. Because the random forest observation model does distinguish between object instances (for example, the left hand is distinguished from the right hand), we can use an independent particle filter for each body part being tracked. This allows a very simple particle filter to be used.


For time t, let x_t^b be the state of body part b and let z_t be the depth image. The sequence of states up until time t is x_{0:t}^b = {x_0^b ... x_t^b} and the sequence of images is z_{0:t} = {z_0 ... z_t}. The state x_t^b is the position and velocity of body part b at time t. The probability of the image at time t for a given body part is p(z_t | x_t^b, b) and is referred to as the observation model. The probability of the state at time t given the previous state at t−1 is p(x_t^b | x_{t−1}^b) and is called the transition distribution. A Gaussian transition distribution was chosen for the velocity with standard deviation σ_p^b. The sampled position is computed from the previous position, and the velocity is sampled from the Gaussian distribution. σ_p^b is learnt from the distribution of velocities in the mocap animations. By applying Bayes rule and Bayesian recursion we get the following formula:

p(x_t^b | z_{0:t}, b) = p(z_t | x_t^b, b) p(x_t^b | z_{0:t−1}, b) / p(z_t | z_{0:t−1}, b)
                      = p(z_t | x_t^b, b) ∫ p(x_t^b | x_{t−1}^b) p(x_{t−1}^b | z_{0:t−1}, b) dx_{t−1}^b / ∫ p(z_t | x_t^b, b) p(x_t^b | z_{0:t−1}, b) dx_t^b    (2)
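The Gaussian transition distribution described above can be sketched for a single coordinate (a sketch only; the tracker's actual state is a 2D position and velocity):

```python
import random

def sample_transition(state, sigma_v):
    """Draw x_t from p(x_t | x_{t-1}) for one coordinate.

    The velocity is perturbed by Gaussian noise with standard deviation
    sigma_v (learnt from the mocap velocity distribution), and the
    position is advanced by the sampled velocity.
    """
    pos, vel = state
    new_vel = random.gauss(vel, sigma_v)
    return (pos + new_vel, new_vel)
```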

By tracking each body part with its own particle filter, each part has its own likelihood distribution p(z_t | x_t^b, b). The likelihood is the probability of a pixel being of type b, which is the ratio of votes received from the random forest classifier over the total number of votes.

The posterior p(x_{0:t}^b | z_{0:t}, b) for each body part is approximated with a set of N particles {x_t^{b,i}}_{i=1..N}. Each candidate particle is sampled from a proposal distribution, which was selected to be the transition distribution. Selecting the transition distribution results in a bootstrap filter where the weight of each particle is the likelihood of the observation multiplied by the previous weight:

w_t^{b,i} = w_{t−1}^{b,i} p(z_t | x_t^b, b)    (3)

The weights are renormalized after sampling and propagate through time along with the particles. The state of body part b is simply the expectation of its distribution:

E[x_{0:t}^b] = Σ_{i=1}^N w_t^{b,i} x_t^{b,i}    (4)

The initial distribution, p(x_0^b), is initialized with velocity zero and positions of random pixels corresponding to the body part being tracked. Sequential importance sampling is the simplest particle filter and was adequate for short sequences. The algorithm below is run each frame for each body part.

1) For i = 1, ..., N, draw samples from the proposal distribution, in our case the transition distribution p(x_t^b | x_{t−1}^b).

2) For i = 1, ..., N, update the importance weights: ŵ_t^{b,i} = w_{t−1}^{b,i} p(z_t | x_t^b, b)

3) For i = 1, ..., N, normalize the importance weights: w_t^{b,i} = ŵ_t^{b,i} / Σ_{j=1}^N ŵ_t^{b,j}
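Steps 1-3 can be combined into a single per-frame update. The sketch below is one-dimensional and assumes a hypothetical likelihood callable, which in this paper's setting would return the random forest vote fraction at a particle's position:

```python
import random

def sis_step(particles, weights, likelihood, sigma_v):
    """One sequential-importance-sampling update for a single body part.

    particles  - list of (position, velocity) states
    weights    - normalized importance weights from the previous frame
    likelihood - maps a state to p(z_t | x_t, b), e.g. the random forest
                 vote fraction at the particle's position (hypothetical)
    """
    new_particles, new_weights = [], []
    for (pos, vel), w in zip(particles, weights):
        # 1) Propose from the transition distribution (bootstrap proposal).
        v = random.gauss(vel, sigma_v)
        state = (pos + v, v)
        new_particles.append(state)
        # 2) Update the importance weight with the observation likelihood.
        new_weights.append(w * likelihood(state))
    # 3) Renormalize the weights.
    total = sum(new_weights)
    new_weights = [w / total for w in new_weights]
    # State estimate: expectation of the particle distribution (formula 4).
    estimate = sum(w * s[0] for s, w in zip(new_particles, new_weights))
    return new_particles, new_weights, estimate
```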

5  Experiments

5.1  Random Forests

The two feature sets described in Section 3 were evaluated by training two random forests. A dataset of 20 poses was selected from the CMU database; 1000 random pixels were selected for each pose and a 30x30 window was used for the local patch. On a standard desktop, the random forest training took approximately 12 hours. This dataset is significantly smaller than the 100,000 poses used by Shotton et al. [1]. Even with such a small dataset, the classifiers performed exceptionally well on a test set of 30 similar poses. All training and evaluation was done with synthetic data generated from the same body type. Figure 4 shows the per body part accuracy for the local patch feature set and the local patch with relative pixel position feature set. The local patch with relative position information outperformed for all body parts. Figure 3 displays the qualitative differences between the two feature representations.


Figure 3: Four example test poses. For each pose, the ground truth is on the left, followed by the depth image, the local patch classification, and the local + relative pixel position classification.


Figure 4: Accuracy of the random forest classifier for the two feature representations.

5.2  Particle Filter Tracking


Figure 5: Three consecutive frames of the head, hands and feet being tracked by independent particle filters.

The particle filter tracking was evaluated with 3 short sequences of poses. The actions included squatting, jumping jacks and boxing. Figure 5 shows three frames of the boxing sequence being tracked by 500 particles for the head, hands and feet. For the random forest classifier, the position of


Figure 6: Average and maximum per-pixel errors for each body part being tracked. Velocities are measured in pixels per frame. On average the particle filter tracker does worse at predicting the center of the body part, but in worst-case scenarios its maximum error is lower, as seen by the large error on the left hand.

a body part was determined by the mean location of all pixels with that classification. For the particle filter it was the expectation of the distribution. Experiments showed that the particle filter had higher errors for both positions and velocities; however, the particle filter had lower maximum errors in worst-case scenarios. Figure 6 shows the mean and maximum errors for the three sequences. Since the random forest classifier only classifies pixels that are on the character, it produces more accurate results on average. However, when it fails to correctly classify any pixel for a body part, it has no reasonable estimate to fall back on.
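The classifier baseline used for comparison can be sketched as follows (an illustrative sketch, with hypothetical argument names); it also exposes the failure mode described above, where no pixel receives the label:

```python
def classifier_estimate(pixel_positions, labels, part_id):
    """Mean (x, y) location of all pixels classified as part_id.

    Returns None when the forest classifies no pixel as part_id --
    the case where the classifier has no estimate to fall back on.
    """
    pts = [p for p, label in zip(pixel_positions, labels) if label == part_id]
    if not pts:
        return None
    return (sum(x for x, _ in pts) / len(pts),
            sum(y for _, y in pts) / len(pts))
```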

6  Discussion

My work has shown that random forest classifiers can be used as the observation model for a particle filter. Unfortunately, it has not conclusively shown that a particle filter will improve body part tracking in the average case. However, it has made a strong case for further research, which must include evaluation on larger datasets with both synthetic and real data. Testing with real data was outside the scope of this project because it required additional infrastructure for background removal and other image preprocessing. Future work will need to address these limitations. All work was implemented in Python and did not run in real time; however, the random forest classifier and simple particle filter could run in real time on a GPU. Further work will include GPU implementations of the algorithms and an evaluation of their performance.

A key advantage of particle filters is that they handle occlusion and temporal discontinuities better than per-pixel classification. If several pixels of a small body part are classified incorrectly, the random forest estimate can jump in a single frame, whereas the particle filter will add the observations to the distribution without committing to the incorrect result. Some may argue that with a sufficiently well-trained classifier this would not be an issue. However, occlusion is a real problem that cannot be fixed by improving the classifier; instead, it must be addressed by a temporal model such as the particle filter approach presented in this paper.

References

[1] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In Proc. CVPR, 2011.

[2] T.B. Moeslund, A. Hilton, and V. Krüger. A survey of advances in vision-based human motion capture and analysis. CVIU 103(2-3):90-126, November-December 2006.

[3] K. Okuma, A. Taleghani, N. de Freitas, J. Little, and D. Lowe. A boosted particle filter: multitarget detection and tracking. In Proc. ECCV, May 2004.

[4] Y. Wu, G. Hua, and T. Yu. Tracking articulated body by dynamic Markov network. In Proc. ICCV, pp. 1094-1101, 2003.

[5] D. Anguelov, B. Taskar, V. Chatalbashev, D. Koller, D. Gupta, and A. Ng. Discriminative learning of Markov random fields for segmentation of 3D scan data. In Proc. CVPR, 2005.

