Efficient similar images research TER 2012
Anthony Biga, Iliasse Hassala, Amine Oueslati and Paraita Wohler UNSA UFR Sciences - Master IFI/MBDS
2012 June 04th
AB, IH, AO and PW (Master IFI/MBDS)
Efficient similar images research
2012 June 04th
1 / 35
How to efficiently find similar images?
Presentation
Used in the industry:
- Picasa (Google)
- iPhoto/Aperture (Apple)
Problems:
- CPU intensive
- memory consuming
- doesn't seem to scale well
Presentation
Finding similar images can be separated into 2 stages:
Extraction of image features
For each image we extract interest points, which represent characteristic parts of the image. For every interest point, we compute its features, which form a vector (SURF algorithm).
Similarity search
We compare the features of every image with those of every other image, using the K-Nearest Neighbors algorithm to determine similarities.
Workflow
images set → features extraction → similarity search → every similarity pair
1. SURF and K-NN
   - SURF
   - K-NN
2. Our workflow
   - SURF: C++/OpenCL, Java
   - K-NN: Java thread pool, Hadoop map-reduce
   - Hadoop Map-Reduce
3. Benchmarks and results
4. Conclusion
Speeded Up Robust Features (SURF)
Used for:
- camera calibration
- 3D reconstruction
- object recognition
- discrete image correspondences
Basically 6 sequential steps:
1. compute the integral image
2. calculate the Hessian determinant
3. apply Gaussian filters
4. select the best interest points
5. compute the orientation
6. normalize the vectors
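To make step 1 concrete, here is a minimal Python sketch of an integral image (summed-area table). It is only an illustration and not the code of clsurf or JOpenSURF; the function names `integral_image` and `box_sum` are assumptions. Once the table is built, the sum of any rectangular region costs four lookups, which is what makes the box filters used by SURF cheap to evaluate.

```python
def integral_image(img):
    """Return the summed-area table of a 2D list of numbers.

    The table has one extra zero row and column, so that
    ii[y][x] is the sum of all pixels above and to the left of (y, x).
    """
    height = len(img)
    width = len(img[0]) if height else 0
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of the pixels in rows top..bottom-1 and columns left..right-1."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]
```

For example, on a 3x3 image, `box_sum(ii, 1, 1, 3, 3)` returns the sum of the lower-right 2x2 block without looping over its pixels.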
Example
False positive :
Example
Good match :
K-Nearest Neighbor
The k-NN search is a problem found in many domains, such as data compression, DNA sequencing and image retrieval.
The problem:
- a set of n elements in a d-dimensional space E
- a query element q ∈ E
- a similarity function ∆
- k smaller than n
k-NN algorithm
Apply ∆ between q and each of the n elements, and return the k elements most similar to q.
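The definition above can be sketched in a few lines of Python. This is a brute-force illustration only: the names `knn` and `euclidean` are assumptions, and the Euclidean distance is just one possible choice for ∆ (a smaller distance meaning a higher similarity).

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(elements, q, k, delta=euclidean):
    """Apply delta between q and every element, keep the k closest ones."""
    return heapq.nsmallest(k, elements, key=lambda e: delta(e, q))
```

The cost is one evaluation of ∆ per element, so n evaluations for each query q.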
K-Nearest Neighbor
Custom version of the k-NN for comparing 2 images img1 and img2:
- We choose a similarity function which returns true if the Euclidean distance between two points is ≤ ε. Then the two points are similar.
- For each point of img1, we compare it with each point of img2 until the similarity function returns true.
- Finally, img1 is similar to img2 if we find k or more similar points.
In this case k doesn't represent the k most similar points but just k pairs of similar points.
Brute-force algorithm: compute the distance from every img1 descriptor to every img2 descriptor.
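The custom comparison described on this slide can be sketched as follows. This is a hedged illustration and not the project's Java code: the names `points_similar` and `images_similar` are assumptions, and descriptors are represented as plain tuples of floats.

```python
import math


def points_similar(p, q, eps):
    """True if the Euclidean distance between p and q is at most eps."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) <= eps


def images_similar(desc1, desc2, k, eps):
    """True if at least k points of img1 have a similar point in img2."""
    matches = 0
    for p in desc1:
        # any() stops scanning img2 at the first point similar to p.
        if any(points_similar(p, q, eps) for q in desc2):
            matches += 1
            if matches >= k:
                return True
    return False
```

In the worst case every descriptor of img1 is compared with every descriptor of img2, which is the brute-force cost mentioned above.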
K-Nearest Neighbor problem
Find the optimal value for k.
The number of descriptors depends on the image. Some images have only a small number of descriptors.
⇒ If k is greater than the number of descriptors of an image, that image will never be chosen.
clsurf
An OpenCL implementation of the SURF algorithm.
http://code.google.com/p/clsurf/
clsurf has been developed by the Northeastern University Computer Architecture Research Group. The application should run correctly on NVIDIA and ATI GPUs without any changes.
Extracts as much parallelism as possible from the SURF algorithm:
- compute integral image
- calculate hessian determinant
- select the best interest points
- normalize vectors
Has a large number of tunable parameters to change the precision and the number of descriptors ⇒ impacts the performance
Restriction
Compatibility: the device must support version 1.2 of CUDA or later.
JOpenSURF
A Java implementation of SURF
http://code.google.com/p/jopensurf/
What does JOpenSURF do?
- SURF
- Matching points finding
- Graphic representation
Matching points
K-nn: Java thread pool
Problem
K-nn needs a lot of memory
Solution
Do not load the entire file in memory
K-nn optimisation
- The descriptors file is accessed n² times
- How to minimize the K-nn computing time?
Solution
- Do not compute the distances twice for the same pair of images
- Use a thread pool
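Both ideas of the solution can be sketched together in Python. This is an illustration under assumptions, not the project's Java thread pool: the name `all_similar_pairs` is hypothetical, `images` is assumed to map an image name to its descriptors, and `is_similar` stands for any symmetric comparison function. Enumerating unordered pairs gives n(n-1)/2 comparisons instead of n².

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def all_similar_pairs(images, is_similar, workers=4):
    """Compare each unordered pair of images once, using a thread pool.

    images: dict mapping an image name to its descriptors.
    is_similar: callable(descriptors_a, descriptors_b) -> bool, assumed symmetric.
    """
    # combinations() yields (a, b) but never (b, a) or (a, a).
    pairs = list(combinations(sorted(images), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = pool.map(lambda p: is_similar(images[p[0]], images[p[1]]), pairs)
        return [pair for pair, ok in zip(pairs, flags) if ok]
```

Note that in CPython, threads do not run CPU-bound Python code in parallel, so this sketch only shows the structure; Java threads do not have that limitation.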
Hadoop Map/Reduce
Overview
- Programming model for processing large data sets
- Typically used to do distributed computing on clusters
- Implementations exist in many programming languages. A popular free implementation is Apache Hadoop.
- The model is inspired by the map and reduce functions of functional programming
"Map" step
- The master node takes the input
- Divides it into smaller sub-problems
- Distributes them to worker nodes
- Each worker node processes its smaller problem and passes the answer back to its master node
"Reduce" step
- The master node collects the answers
- Combines them to form the output
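The two steps can be imitated in a single Python process to show the data flow. This sketch is not Hadoop code and involves no cluster: the name `map_reduce` and the example mapper and reducer are assumptions chosen for illustration. The mapper emits key/value pairs, the values are grouped by key, and the reducer combines each group.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper on every record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):  # "Map" step
            groups[key].append(value)      # grouping by key (shuffle)
    # "Reduce" step: combine the values collected for each key.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_mapper(record):
    """Hypothetical mapper: emit (image name, 1) for each descriptor."""
    image, descriptors = record
    return [(image, 1) for _ in descriptors]


def sum_reducer(key, values):
    """Hypothetical reducer: total number of descriptors for one image."""
    return sum(values)
```

Here `map_reduce(records, count_mapper, sum_reducer)` counts the descriptors of each image, even when an image appears in several records.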
Hadoop Block Nested Loop Join (Pairwise)
- Used to join two sets R and S
- Partition R and S, each into n equal-sized disjoint blocks
- Perform a block nested loop join (BNLJ) for each possible pair of blocks (Ri, Sj)
- Get the k-nn results from the n local k-nn results for every record in R
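The block-wise join can be sketched sequentially in Python to show why merging local results gives the global k-nn. This is an illustration under assumptions, not the Hadoop implementation: the names `split_blocks` and `block_knn_join` are hypothetical, records of R are assumed hashable and distinct, R and S are assumed non-empty, and `dist` is any distance function.

```python
import heapq


def split_blocks(items, n):
    """Split items into at most n contiguous blocks of near-equal size."""
    size = -(-len(items) // n)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def block_knn_join(R, S, k, n, dist):
    """For every r in R, return its k nearest records of S, block by block."""
    candidates = {r: [] for r in R}
    for r_block in split_blocks(R, n):
        for s_block in split_blocks(S, n):
            for r in r_block:
                # Local k-nn of r inside this single block of S.
                local = heapq.nsmallest(k, s_block, key=lambda s: dist(r, s))
                candidates[r].extend(local)
    # Global k-nn of r: the best k among the local results of all S blocks.
    return {
        r: heapq.nsmallest(k, found, key=lambda s: dist(r, s))
        for r, found in candidates.items()
    }
```

Each global nearest neighbour of r is necessarily among the k nearest of its own block, which is why keeping only k candidates per block loses nothing.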
1st Job
- Performs the K-nn search
- The Mapper takes < Imgi ; DescriptorsList >
- Produces all the possible pairs of images and descriptors
- The Reducer computes the local K-nn
- The 1st Job produces < Img1i ; Img2j >, which are the most similar images
2nd Job
- Performs a filtering process
- Eliminates duplicated entries and produces the most similar images
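A filtering pass of this kind could look like the Python sketch below. The slide does not say exactly which entries count as duplicates, so this assumes that (a, b) and (b, a) describe the same pair and that an image paired with itself is dropped; the name `filter_pairs` is hypothetical.

```python
def filter_pairs(pairs):
    """Remove repeated and mirrored pairs, keeping first-seen order."""
    seen = set()
    result = []
    for a, b in pairs:
        # Canonical form: the smaller name first, so (b, a) equals (a, b).
        key = (a, b) if a <= b else (b, a)
        if a != b and key not in seen:  # assumption: self-pairs are dropped
            seen.add(key)
            result.append(key)
    return result
```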
Problem of Scaling
- The HBNLJ algorithm doesn't scale well with multidimensional data
- Lack of disk space (e.g. for 1600 pictures and a 1.4 GB input file, it needs more than 7 TB of intermediate space)
Solution!
- Modify the output of the Map phase of the first round: instead of < img1i ; img2j ; descriptorsList > we output < img1i ; img2j ; offsetbegin ; offsetend >
- The reducer will compute the K-nn using the initial input file
- We trade the disk space gluttony for CPU + I/O time
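The offset idea can be sketched in Python as follows. The real input file format is not given in the slides, so this assumes a hypothetical text format with one image per line, `name<TAB>v,v,...;v,v,...`; the names `build_offset_index` and `read_descriptors` are also assumptions. Only the byte range of each descriptor list is passed around, and the descriptors are read back from the input file when they are needed.

```python
def build_offset_index(path):
    """Map each image name to the (begin, end) byte range of its line."""
    index = {}
    with open(path, "rb") as f:
        while True:
            begin = f.tell()
            line = f.readline()
            if not line:
                break
            name = line.split(b"\t", 1)[0].decode()
            index[name] = (begin, f.tell())
    return index


def read_descriptors(path, begin, end):
    """Read one image's descriptors back from the input file by offset."""
    with open(path, "rb") as f:
        f.seek(begin)
        line = f.read(end - begin).decode()
    _, payload = line.rstrip("\n").split("\t", 1)
    return [[float(v) for v in d.split(",")] for d in payload.split(";")]
```

A pair of images then costs two small integers per image in the intermediate data, at the price of an extra seek and read in the reducer.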
Benchmarks
What we tested:
- integrity check
- SURF implementations
- different combinations of integrity check + SURF
- KNN implementations
Data set:
- heterogeneous set of images
- consists of 5 different directories based on size (100, 200, 400, 800 and 1600 images)
Benchmarks
Platform for SHA-1+SURF benchmarking:
CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
RAM: 18480828 kB
GPU: NVidia Corporation GF108 (Quadro 600)
OS: Fedora 16 x86_64
Benchmark protocol:
- 3 iterations at night, to reduce side effects
- ant clean compile after every iteration
Integrity and SURF results
[Figure: two plots of time (seconds) against number of images (0 to 1600). Panels: "SHA-1 computation time" and "SURF computation time". Legend entries: C++, OpenCL, Java.]
KNN Java implementation results
[Figure: "KNN behaviour". Time in seconds (0 to 250) against number of images (0 to 1600).]
Conclusion
Conclusion and thanks
- We implemented a complete workflow based on multiple technologies
- Huge speed gain expected with GPGPU. Speedup of 3 measured (in comparison with Java)
- We learned a lot from PhDs and researchers (EPW, CafeIn, RoundTable...)