Semantic Image Segmentation with Deep Learning

Torr Vision Group, Engineering Department

Semantic Image Segmentation with Deep Learning
Sadeep Jayasumana, 07/10/2015

Collaborators: Bernardino Romera-Paredes, Shuai Zheng, Philip Torr


Live Demo - http://crfasrnn.torr.vision/


Outline
• Semantic segmentation
• Why?
• CNNs for pixel-wise prediction
• CRFs
• CRF as RNN
• Conclusion


Semantic Segmentation • Recognizing and delineating objects in an image → classifying each pixel in the image
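Concretely, a pixel-wise classifier produces a score for every (pixel, class) pair, and the segmentation is the per-pixel argmax over classes. A minimal NumPy sketch (the label set and score values are illustrative, not from any real model):

```python
import numpy as np

# Illustrative label set and a tiny 2x3 "image" of per-class scores:
# scores[c, y, x] = score for class c at pixel (y, x).
labels = ["background", "cat", "tree", "person"]
scores = np.array([
    [[0.9, 0.1, 0.2], [0.8, 0.3, 0.1]],   # background
    [[0.0, 0.7, 0.6], [0.1, 0.5, 0.2]],   # cat
    [[0.1, 0.1, 0.1], [0.0, 0.1, 0.6]],   # tree
    [[0.0, 0.1, 0.1], [0.1, 0.1, 0.1]],   # person
])

# Semantic segmentation = one class label per pixel.
label_map = scores.argmax(axis=0)          # shape (2, 3)
print(label_map)                           # [[0 1 1], [0 1 2]]
```

Everything downstream in the talk is about producing better `scores` (CNN) and a better final assignment than an independent per-pixel argmax (CRF).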


Why Semantic Segmentation? • To help partially sighted people by highlighting important objects in their glasses


Why Semantic Segmentation? • To let robots segment objects so that they can grasp them


Why Semantic Segmentation? • Road scene understanding • Useful for autonomous navigation of cars and drones

Image taken from the Cityscapes dataset.


Why Semantic Segmentation? • Useful tool for editing images


Why Semantic Segmentation? • Medical purposes: e.g. segmenting tumours, dental cavities, ...

Image taken from Mauricio Reyes

ISBI Challenge 2015, dental x-ray images


But How? • Deep convolutional neural networks are successful at learning good representations of visual inputs.

• However, here we have a structured output.


CNN for Pixel-wise Labelling • Usual convolutional networks

• Fully convolutional networks

Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015.
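The "fully convolutional" idea can be seen in miniature: a fully connected classifier applied at every spatial position computes exactly the same thing as a 1×1 convolution, so a classification network can be converted into a dense-prediction network. A NumPy sketch of that equivalence (all shapes and weights here are illustrative random values):

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out, H, W = 8, 3, 4, 5
features = rng.normal(size=(C_in, H, W))   # feature map from a conv net
weight = rng.normal(size=(C_out, C_in))    # fully connected classifier weights

# (a) Apply the FC classifier independently at every pixel.
fc_per_pixel = np.empty((C_out, H, W))
for y in range(H):
    for x in range(W):
        fc_per_pixel[:, y, x] = weight @ features[:, y, x]

# (b) The same computation as a single 1x1 "convolution":
#     a contraction over the channel dimension only.
conv1x1 = np.tensordot(weight, features, axes=([1], [0]))

assert np.allclose(fc_per_pixel, conv1x1)
```

The real FCN additionally upsamples the coarse output with learned deconvolutions and skip connections; this sketch only shows the FC-to-conv conversion step.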


Fully Convolutional Networks [Long et al., CVPR 2015]

+ Significantly improved the state of the art in semantic segmentation.
− Poor object delineation: e.g. spatial consistency is neglected.

[Figure: input image, FCN result, ground truth]


Conditional Random Fields (CRFs) • A CRF can account for contextual information in the image

Coarse output from the pixel-wise classifier → MRF/CRF modelling → output after the CRF inference


Conditional Random Fields (CRFs)

Xi ∈ {bg, cat, tree, person, …}

• Define a discrete random variable Xi for each pixel i.
• Each Xi can take a value from the label set.
• Connect the random variables to form a random field (MRF).


• Most probable assignment given the image → segmentation.


Finding the Best Assignment

Pr(X = x | I) = Pr(X1 = x1, …, Xn = xn | I) = (1/Z) exp(−E(x | I))

• Maximize Pr(X = x | I) → minimize E(x | I).
• So we have formulated the problem as an energy minimization.
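The equivalence is easy to check numerically: the Gibbs distribution Pr(x) = exp(−E(x))/Z is monotone decreasing in E, so the most probable labelling is exactly the minimum-energy one. A tiny sketch with illustrative energy values:

```python
import numpy as np

# Energies of four candidate labellings (illustrative values).
energies = np.array([3.2, 1.1, 4.0, 2.5])

# Gibbs distribution: Pr(x) = exp(-E(x)) / Z.
unnorm = np.exp(-energies)
probs = unnorm / unnorm.sum()

# Maximizing the probability is the same as minimizing the energy.
assert probs.argmax() == energies.argmin()
```

This is why the rest of the talk never touches Z: finding the argmax of Pr does not require computing the normalizer.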


Energy Function

E(x | I) = Σi ψu(xi) + Σi<j ψp(xi, xj)

Unary energy ψu(xi = l):
• Your label doesn't agree with the initial classifier → you pay a penalty.

Pairwise energy ψp(xi = l, xj = l′):
• You assign different labels to two very similar pixels → you pay a penalty.
• How do you measure similarity?
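Putting the two terms together, the energy of a candidate labelling can be sketched as below. Everything here is illustrative: the unary costs stand in for the initial classifier's (negative log) scores, and similarity is a Gaussian on a 1-D colour difference, one common choice (the dense-CRF kernels discussed next also use pixel position):

```python
import numpy as np

def crf_energy(labels, unary_cost, colours, w=1.0, sigma=0.5):
    """E(x) = sum_i psi_u(x_i) + sum_{i<j} psi_p(x_i, x_j).

    labels:     (N,)   label index per pixel
    unary_cost: (N, L) cost of each label at each pixel
    colours:    (N,)   1-D stand-in for pixel appearance
    """
    n = len(labels)
    # Unary: disagree with the initial classifier -> pay a penalty.
    energy = sum(unary_cost[i, labels[i]] for i in range(n))
    # Pairwise: different labels on two similar pixels -> pay a penalty
    # proportional to their similarity.
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] != labels[j]:
                similarity = np.exp(-((colours[i] - colours[j]) ** 2)
                                    / (2 * sigma ** 2))
                energy += w * similarity
    return energy

# Three pixels, two labels; pixels 0 and 1 have nearly the same colour.
unary = np.array([[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]])
colours = np.array([0.2, 0.25, 0.9])

# Labelling the two similar pixels alike is cheaper than splitting them.
e_consistent = crf_energy(np.array([0, 0, 1]), unary, colours)
e_split = crf_energy(np.array([0, 1, 1]), unary, colours)
assert e_consistent < e_split
```

The pairwise term is what restores the spatial consistency that the bare FCN output lacks.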


Dense CRF Formulation [Krähenbühl & Koltun, NIPS 2011]

• Pairwise energies are defined for every pixel pair in the image:

E(x) = Σi ψu(xi) + Σi<j ψp(xi, xj, I)

• Exact inference is not feasible.
• Use approximate mean-field inference: approximate P(X = x) = (1/Z) exp(−E(x)) by a product of independent per-pixel marginals, Q(x) = Πi Qi(xi).
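A toy version of the mean-field update can be sketched as follows. Assumptions to be loud about: a tiny 1-D "image", a single Gaussian appearance kernel standing in for the dense CRF's spatial and bilateral kernels (which are computed with efficient high-dimensional filtering in the real method), and a Potts compatibility. Each iteration filters the current marginals Q against every other pixel, adds the unaries, and renormalises:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_field(unary, colours, n_iters=5, w=1.0, sigma=0.5):
    """Toy mean-field inference for a small fully connected CRF.

    unary:   (N, L) unary energies psi_u(x_i = l)
    colours: (N,)   pixel appearance (stand-in for bilateral features)
    Returns Q, the (N, L) approximate marginals.
    """
    n, L = unary.shape
    # Dense Gaussian affinity between every pixel pair (no self-affinity).
    k = np.exp(-((colours[:, None] - colours[None, :]) ** 2)
               / (2 * sigma ** 2))
    np.fill_diagonal(k, 0.0)

    Q = softmax(-unary)                   # initialise from the unaries
    for _ in range(n_iters):
        message = k @ Q                   # filter: expected neighbour labels
        # Potts compatibility: expected penalty for each label at each pixel.
        pairwise = w * (message.sum(axis=1, keepdims=True) - message)
        Q = softmax(-(unary + pairwise))  # add unaries, renormalise
    return Q

# Three pixels, two labels; the middle pixel is ambiguous on its own
# but very similar in appearance to pixel 0.
unary = np.array([[0.2, 1.0], [0.9, 0.8], [1.0, 0.2]], dtype=float)
colours = np.array([0.1, 0.15, 0.9])
Q = mean_field(unary, colours)
assert Q[1].argmax() == Q[0].argmax() == 0   # pulled toward its neighbour
```

Note how the ambiguous pixel ends up agreeing with the pixel it resembles, which is exactly the smoothing effect the dense CRF contributes on top of the coarse classifier output.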


Fully Connected CRFs as a CNN

[Diagram: one mean-field iteration decomposed into CNN operations — inputs U (unaries), Q (current marginals), I (image): Bilateral filtering → Conv → Conv → + → SoftMax]


CRF as a Recurrent Neural Network

[Diagram: mean-field iteration — U, Q, I → Bilateral → Conv → Conv → + → SoftMax]

• Each of these blocks is differentiable → we can backprop.


CRF as a Recurrent Neural Network

[Diagram: Image → Unaries → CRF iteration, repeated (CRF as RNN) → SoftMax → Output]

• Each block is differentiable → we can backprop through the whole network.
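The recurrence structure can be sketched as below: the hidden state is the marginals Q, the same mean-field step (with shared parameters) is applied at every time step, and unrolling it gives an RNN that gradients can flow through. This is a toy NumPy stand-in for the idea, not the authors' implementation; the affinity matrix `k` and all values are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_field_step(unary, Q, k, w=1.0):
    """One mean-field iteration: filter Q, add unaries, renormalise.
    This is the 'RNN cell'; its parameters (k, w) are shared across steps."""
    message = k @ Q
    pairwise = w * (message.sum(axis=1, keepdims=True) - message)  # Potts
    return softmax(-(unary + pairwise))

def crf_as_rnn(unary, k, n_steps=5):
    """Unroll the iteration as a recurrent net: hidden state = Q,
    the same differentiable step applied n_steps times."""
    Q = softmax(-unary)            # initial state from the unaries
    for _ in range(n_steps):       # in training, backprop runs through this loop
        Q = mean_field_step(unary, Q, k)
    return Q

# Toy problem: 3 pixels, 2 labels; pixels 0 and 1 are highly similar.
unary = np.array([[0.2, 1.0], [0.9, 0.8], [1.0, 0.2]], dtype=float)
k = np.array([[0.00, 0.99, 0.28],
              [0.99, 0.00, 0.32],
              [0.28, 0.32, 0.00]])
Q = crf_as_rnn(unary, k)
assert Q.argmax(axis=1).tolist() == [0, 0, 1]
```

Because every operation inside the loop is differentiable, an autodiff framework can propagate the training loss back through all the iterations into both the CRF parameters and the CNN that produced the unaries.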


Putting Things Together

[Diagram: FCN → CRF-RNN]


Experiments

FCN [Long et al., 2015]: 68.3
FCN + CRF [Chen et al., 2015]: 69.5
FCN + CRF-RNN (ours): 72.9


Try our demo: http://crfasrnn.torr.vision
Code & model: https://github.com/torrvision/crfasrnn

Shuai Zheng, Bernardino Romera-Paredes, Philip Torr


Examples

http://pp.vk.me/c622119/v622119584/20dc3/7lS5BU2Bp_k.jpg


Examples

http://media1.fdncms.com/boiseweekly/imager/mountain-bikers-are-advised-to-dism/u/original/3446917/walk_thru_sheep_1_.jpg


Examples

http://img.rtvslo.si/_up/upload/2014/07/22/65129194_tour-3.jpg


Examples

http://www.toxel.com/wp-content/uploads/2010/11/bike05.jpg


Not-so-good examples

http://www.independent.co.uk/incoming/article10335615.ece/alternates/w620/planecat.jpg


Not-so-good examples

http://i1.wp.com/theverybesttop10.files.wordpress.com/2013/02/the-world_s-top-10-best-images-of-camouflage-cats-5.jpg?resize=375,500


Tricky examples

http://se-preparer-aux-crises.fr/wp-content/uploads/2013/10/Golum.png


Tricky examples

https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRf4J7Hszkc8Wf6riVUX-cV_K-un8LJy5dYIBW1KDIn6i7UCzGHpg


Tricky examples

http://i.huffpost.com/gen/1478236/thumbs/s-DIRD6-large640.jpg


Conclusion
• CNNs yield a coarse prediction on pixel-labelling tasks.
• CRFs improve the result by accounting for contextual information in the image.
• Learning the whole pipeline end-to-end significantly improves the results.

[Diagram: CNN → CRF]

Thank You!
