Torr Vision Group, Engineering Department
Semantic Image Segmentation with Deep Learning
Sadeep Jayasumana, 07/10/2015
Collaborators: Bernardino Romera-Paredes, Shuai Zheng, Philip Torr
Live Demo - http://crfasrnn.torr.vision/
Outline
• Semantic segmentation
• Why?
• CNNs for pixel-wise prediction
• CRFs
• CRF as RNN
• Conclusion
Semantic Segmentation
• Recognizing and delineating objects in an image
• Classifying each pixel in the image
Why Semantic Segmentation? • To help partially sighted people by highlighting important objects in their glasses
Why Semantic Segmentation? • To let robots segment objects so that they can grasp them
Why Semantic Segmentation?
• Road scene understanding
• Useful for autonomous navigation of cars and drones
Image taken from the Cityscapes dataset.
Why Semantic Segmentation? • Useful tool for editing images
Why Semantic Segmentation? • Medical purposes: e.g. segmenting tumours, dental cavities, ...
Image taken from Mauricio Reyes
ISBI Challenge 2015, dental x-ray images
But How?
• Deep convolutional neural networks are successful at learning good representations of visual input.
• However, here we have a structured output: a label for every pixel.
CNN for Pixel-wise Labelling
• Usual convolutional networks
• Fully convolutional networks
Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015.
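The key idea behind fully convolutional networks is that a fully connected classifier head is equivalent to a 1×1 convolution, so the same classifier can be slid over an arbitrarily sized feature map to produce a coarse score map. A minimal NumPy sketch of this equivalence (toy shapes; all names are illustrative, not the paper's code):

```python
import numpy as np

def fc_as_1x1_conv(features, W, b):
    """Apply a linear classifier (W: classes x channels, b: classes) at every
    spatial location of a feature map (channels x H x W), i.e. as a 1x1
    convolution. Returns a coarse score map of shape (classes x H x W)."""
    C, H, Wd = features.shape
    flat = features.reshape(C, H * Wd)   # channels x (H*W)
    scores = W @ flat + b[:, None]       # classes x (H*W)
    return scores.reshape(-1, H, Wd)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))    # toy 8-channel feature map
W = rng.standard_normal((3, 8))          # 3-class linear classifier head
b = np.zeros(3)
score_map = fc_as_1x1_conv(feat, W, b)   # one 3-way score vector per location
```

Running the classifier at every location is what turns an image classifier into a coarse segmenter; the score map is then upsampled back to full image resolution.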
Fully Convolutional Networks [Long et al., CVPR 2015]
+ Significantly improved the state of the art in semantic segmentation.
− Poor object delineation: spatial consistency is neglected.
[Figure: input image, FCN result, ground truth]
Conditional Random Fields (CRFs) • A CRF can account for contextual information in the image
[Figure: coarse output from the pixel-wise classifier → MRF/CRF modelling → output after CRF inference]
Conditional Random Fields (CRFs)
• Define a discrete random variable Xi for each pixel i.
• Each Xi takes a value from the label set {bg, cat, tree, person, …}, e.g. Xi = bg, Xj = cat.
• Connect the random variables to form a random field (MRF).
• The most probable assignment given the image → the segmentation.
Finding the Best Assignment
Pr(X1 = x1, …, XN = xN | I) = Pr(X = x | I) = (1/Z(I)) exp(−E(x | I))
• Maximize Pr(X = x | I) → minimize E(x | I).
• So we have formulated the problem as an energy minimization.
E(x | I) = Σi ψ_unary(xi) + Σi<j ψ_pairwise(xi, xj)

• Unary energy ψ_unary(xi = l): your label doesn’t agree with the initial classifier → you pay a penalty.
• Pairwise energy ψ_pairwise(xi = l, xj = l′): you assign different labels to two very similar pixels → you pay a penalty. How do you measure similarity?
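To make the two energy terms concrete, here is a naive NumPy sketch of the total energy of a labelling, using a Gaussian kernel over pixel position and colour as the similarity measure (all names and parameter values here are illustrative assumptions, not the actual model):

```python
import numpy as np

def crf_energy(labels, unary, positions, colours, w=1.0, theta_p=3.0, theta_c=10.0):
    """E(x) = sum_i psi_unary(x_i) + sum_{i<j} psi_pairwise(x_i, x_j).
    unary[i, l] is the cost of giving pixel i label l (e.g. the negative log
    of the initial classifier's score). The pairwise term charges a penalty,
    weighted by a Gaussian similarity kernel, whenever two similar pixels
    receive different labels (a Potts model)."""
    n = len(labels)
    energy = sum(unary[i, labels[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] != labels[j]:
                d_pos = np.sum((positions[i] - positions[j]) ** 2)
                d_col = np.sum((colours[i] - colours[j]) ** 2)
                energy += w * np.exp(-d_pos / (2 * theta_p ** 2)
                                     - d_col / (2 * theta_c ** 2))
    return energy

# Four pixels: two bright ones close together, two dark ones close together.
positions = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
colours = np.array([[200.0], [198.0], [30.0], [32.0]])
unary = np.array([[0.2, 1.5], [0.3, 1.2], [1.4, 0.1], [1.6, 0.2]])
consistent = crf_energy([0, 0, 1, 1], unary, positions, colours)
inconsistent = crf_energy([0, 1, 1, 1], unary, positions, colours)
```

Labelling the two nearby, similar pixels differently (`inconsistent`) incurs both a higher unary cost and a pairwise penalty, so its energy comes out larger.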
Dense CRF Formulation [Krähenbühl & Koltun, NIPS 2011]
• Pairwise energies are defined for every pixel pair in the image:
E(x) = Σi ψ_unary(xi) + Σi<j ψ_pairwise(xi, xj, I)
• Exact inference is not feasible.
• Use approximate mean-field inference: approximate Pr(X | I) by a factorized distribution Q(X) = Πi Qi(Xi).
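One mean-field update follows directly from the factorized approximation: each Qi is set to a softmax of its unary energy plus messages gathered from all other pixels through the pairwise kernel. The dense O(n²) kernel product below is only a toy; the actual algorithm replaces it with fast high-dimensional (bilateral) filtering. All names are illustrative:

```python
import numpy as np

def mean_field_step(Q, unary, K, mu):
    """One mean-field update for a dense CRF.
    Q:     n x L current pixel marginals.
    unary: n x L unary energies.
    K:     n x n pairwise similarity kernel (zero diagonal).
    mu:    L x L label compatibility (Potts: mu[l, l'] = 1 if l != l')."""
    messages = K @ Q                  # n x L: sum_j k(i, j) * Q_j(l')
    energy = unary + messages @ mu.T  # add compatibility-transformed messages
    Q_new = np.exp(-energy)
    return Q_new / Q_new.sum(axis=1, keepdims=True)  # normalise (softmax)

# Toy problem: 4 pixels, 2 labels; pixels 0, 1 (and weakly 3) prefer label 0.
unary = np.array([[0.1, 2.0], [0.2, 1.8], [2.0, 0.1], [0.9, 1.1]])
K = 0.3 * (np.ones((4, 4)) - np.eye(4))  # every pair weakly connected
mu = 1.0 - np.eye(2)                     # Potts compatibility
Q = np.exp(-unary)
Q /= Q.sum(axis=1, keepdims=True)        # initialise with softmax of unaries
for _ in range(3):
    Q = mean_field_step(Q, unary, K, mu)
```

Each iteration pulls each pixel's marginal toward agreement with its similar neighbours while staying anchored to the unaries.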
Fully Connected CRFs as a CNN
[Figure: one mean-field iteration expressed as CNN operations — inputs U (unaries), Q (current marginals), I (image): Bilateral filtering → Conv → Conv → + → SoftMax]
CRF as a Recurrent Neural Network
[Figure: the mean-field iteration (U, Q, I → Bilateral → Conv → Conv → + → SoftMax) as one repeatable block]
• Each of these blocks is differentiable → we can backprop.
CRF as a Recurrent Neural Network
[Figure: Image → Unaries → SoftMax → CRF iteration, repeated (CRF as RNN) → Output]
• Each of these blocks is differentiable → we can backprop.
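Because a mean-field iteration is a fixed, differentiable computation, running it T times with shared parameters is exactly an RNN unrolled for T steps. A plain-NumPy sketch of the unrolled inference (no autodiff here; the names and the toy dense kernel are illustrative assumptions):

```python
import numpy as np

def crf_as_rnn(unary, K, mu, n_iters=5):
    """Unrolled mean-field inference: Q_{t+1} = f(Q_t, U, I), repeated
    n_iters times with shared parameters, like an RNN. Since every step
    (filtering, compatibility transform, addition, softmax) is
    differentiable, the loop can be trained end-to-end by backprop."""
    Q = np.exp(-unary)
    Q /= Q.sum(axis=1, keepdims=True)        # Q_0 = softmax of unaries
    for _ in range(n_iters):
        energy = unary + (K @ Q) @ mu.T      # message passing + compatibility
        Q = np.exp(-energy)
        Q /= Q.sum(axis=1, keepdims=True)    # softmax normalisation
    return Q

unary = np.array([[0.1, 2.0], [0.2, 1.8], [2.0, 0.1]])
K = 0.3 * (np.ones((3, 3)) - np.eye(3))  # toy similarity kernel
mu = 1.0 - np.eye(2)                     # Potts compatibility
Q = crf_as_rnn(unary, K, mu, n_iters=5)
```

The fixed iteration count with tied weights is what makes the whole CRF inference behave as one recurrent layer sitting on top of the CNN.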
Putting Things Together
[Figure: FCN followed by CRF-RNN, as one end-to-end network]
Experiments (mean IoU, %)
FCN [Long et al., 2014]: 68.3
FCN + CRF [Chen et al., 2015]: 69.5
FCN + CRF-RNN (ours): 72.9
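Assuming the scores above are the standard mean intersection-over-union (IoU) metric for semantic segmentation, here is a minimal sketch of how it is computed from predicted and ground-truth label maps (toy data; illustrative names):

```python
import numpy as np

def mean_iou(pred, gt, n_classes):
    """Per class, IoU = |pred==c AND gt==c| / |pred==c OR gt==c|;
    the mean is taken over classes that appear in either map."""
    ious = []
    for c in range(n_classes):
        intersection = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:
            ious.append(intersection / union)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1])   # toy flattened label maps
gt = np.array([0, 1, 1, 1])
score = mean_iou(pred, gt, 2)   # class 0: 1/2, class 1: 2/3 -> mean 7/12
```

Because IoU penalises both false positives and false negatives per class, it rewards exactly the sharper object delineation that the CRF step provides.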
Try our demo: http://crfasrnn.torr.vision
Code & model: https://github.com/torrvision/crfasrnn
Shuai Zheng, Bernardino Romera-Paredes, Philip Torr
Examples (image sources)
http://pp.vk.me/c622119/v622119584/20dc3/7lS5BU2Bp_k.jpg
http://media1.fdncms.com/boiseweekly/imager/mountain-bikers-are-advised-to-dism/u/original/3446917/walk_thru_sheep_1_.jpg
http://img.rtvslo.si/_up/upload/2014/07/22/65129194_tour-3.jpg
http://www.toxel.com/wp-content/uploads/2010/11/bike05.jpg

Not-so-good examples
http://www.independent.co.uk/incoming/article10335615.ece/alternates/w620/planecat.jpg
http://i1.wp.com/theverybesttop10.files.wordpress.com/2013/02/the-world_s-top-10-best-images-of-camouflage-cats-5.jpg?resize=375,500

Tricky examples
http://se-preparer-aux-crises.fr/wp-content/uploads/2013/10/Golum.png
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRf4J7Hszkc8Wf6riVUX-cV_K-un8LJy5dYIBW1KDIn6i7UCzGHpg
http://i.huffpost.com/gen/1478236/thumbs/s-DIRD6-large640.jpg
Conclusion
• CNNs yield a coarse prediction on pixel-labelling tasks.
• CRFs improve the result by accounting for contextual information in the image.
• Learning the whole pipeline end-to-end significantly improves the results.
[Figure: CNN → CRF]
Thank You!