Mix and Match: Joint Model for Clothing and Attribute Recognition

Kota Yamaguchi¹ (http://vision.is.tohoku.ac.jp/~kyamagu)
Takayuki Okatani¹ ([email protected])
Kyoko Sudo² ([email protected])
Kazuhiko Murasaki² ([email protected])
Yukinobu Taniguchi³ ([email protected])

¹ Tohoku University, Sendai, Japan
² NTT, Yokosuka, Japan
³ Tokyo University of Science, Tokyo, Japan
Abstract

This paper studies clothing and attribute recognition in the fashion domain. Specifically, we turn our attention to the compatibility of clothing items and attributes: for example, people do not wear a skirt and a dress at the same time, yet a jacket and a shirt are a preferred combination. We consider such inter-object or inter-attribute compatibility in the recognition problem, and formulate a Conditional Random Field (CRF) that seeks the most probable combination in the given picture. The model takes into account the location-specific appearance with respect to the human body and the semantic correlation between clothing items and attributes, which we learn using the max-margin framework. We evaluate our model using two datasets that resemble realistic application scenarios: on-line social networks and shopping sites. The empirical evaluation shows that our model effectively improves recognition performance over baselines, including the state-of-the-art feature designed exclusively for clothing recognition. The results also suggest that our model generalizes well to different fashion-related applications.

1 Introduction

Clothing recognition has recently been getting increasing attention in the vision community, perhaps due to its usefulness in real-world applications such as on-line social networking [24], e-commerce [3, 4, 8, 16, 21], trend analysis [20], and personal fashion recommendation [12, 18]. This paper studies clothing detection, an essential component of the above applications and of more advanced clothing analysis such as parsing [5, 11, 14, 17, 25], style understanding [9, 10], and attribute recognition [1, 2, 30]. In this paper, we specifically aim at answering whether the following hypothesis is correct: taking into account multiple clothing items or attributes together improves the detection accuracy over detecting individual items independently. The intuition is that people do not wear a skirt and a dress at the same time, and thus there is an exclusive relationship between these items. We illustrate such clothing compatibility in Fig 1. Our detection framework explicitly considers such inter-object or inter-attribute relationships.

© 2015. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.

Figure 1: Some clothing pairs are compatible while others are not. In this paper, we aim to take advantage of such relationships in clothing recognition.

Figure 2: Pearson correlation between clothing items in the Chictopia dataset (glasses, hat, shirt, top, dress, jacket, vest, coat, scarf, tights, pants, skirt, shorts, socks, boots, shoes, bag, necklace). Notice exclusive blocks, e.g., shirt, top, and dress.

Inter-object relationships were previously utilized in clothing parsing as appearance-level compatibility between adjacent pixels or regions [5, 14, 23, 25]. Our study is distinct from clothing parsing in that we consider compatibility between clothing items at the semantic level; i.e., people do not wear a dress and a skirt together not because the items are visually distinct, but because of their functionality. Our approach is a second-order joint model based on a Conditional Random Field (CRF) over the combination of clothing items or attributes. Given an image, we consider the probability distribution over clothing items and output the maximum a posteriori (MAP) assignment as a detection result. In the unary term, we take advantage of the strong contextual relationship between the location of clothing and the human body. Correlation between items is explicitly modeled in the second-order term of our model (see Fig 2). We empirically show that our approach outperforms independent baselines, including an approach based on the state-of-the-art feature for clothing recognition [25]. Our model is similar to part-aligned attribute detectors [27, 28, 29] in that we take advantage of pose-aligned features to recognize detailed attributes. However, our main focus in this paper is the joint modeling of inter-object or inter-attribute relationships. Our model shares the same spirit as Wang's work [22] in that both aim at jointly modeling inter-label relationships, but this paper focuses more on empirical studies in realistic scenarios with real-world data.
Also, by taking advantage of the relatively stable pose variation in fashion pictures, we apply a simple yet effective deterministic approach to compute localized image features based on Convolutional Neural Networks [7]. To study clothing detection in a realistic scenario, we use two datasets, each with a different application in mind: the Chictopia dataset [24] for clothing detection in fashion blogs, and the Dress dataset for attribute recognition. The successful empirical results suggest that our model generalizes well to two different fashion applications.

Figure 3: Our datasets consider two scenarios: (a) the Chictopia dataset considers automatic clothing tagging in fashion blogs, and (b) the Dress dataset considers attribute recognition in e-commerce.

Figure 4: Label distribution in the Chictopia and Dress datasets.

We summarize our contributions in the following.
• CRF-based detection that takes into account inter-label correlation
• A simple yet effective deterministic approach to localize regions of interest in fashion problems
• Empirical evaluation using the Chictopia and Dress datasets that confirms the generalizability of the proposed model in different fashion-recognition problems

2 Dataset

Chictopia dataset The Chictopia dataset is a collection of blog posts from Chictopia, an on-line social networking site specialized for fashion. Using the publicly available data [24], we first applied a human-pose detector [25, 26] and kept only images with a standing person. The detected bounding boxes around the body are later used to compute image features. Next, we searched for images with at least two clothing keywords under 18 categories (see Fig 7) in the metadata, and identified 26,8124 usable images. Fig 3(a) shows an example picture. Note that clothing tags are noisy and far from perfect: it is common to observe missing items, especially minor items such as necklace or socks. Sometimes conflicting items appear together, such as shoes and boots, due to the tagging format on Chictopia, where users can associate clothing keywords with free-form text. However, we did not apply any manual correction to these noisy data, for scalability in a realistic scenario. Tag statistics are shown in Fig 4. We also use the publicly available Fashionista dataset [23], also collected from Chictopia, to learn the human-body detector and the spatial prior on clothing.

Dress dataset The Dress dataset consists of 712 images of dress products we collected from an e-commerce site. For each of the dress images, we manually gave bounding-box annotations for 8 parts of the dress (top, skirt, neck, sleeve, ornament, ribbon, belt, and pocket). Only the top and skirt appear in every picture; the other parts may not be present. We manually annotated

each part with detailed binary attributes, such as round-neck or V-neck for the neck part. In the initial annotation, we had 58 attributes in total. Out of these, we chose 26 for evaluation, removing infrequent attributes that occurred in less than 4% of the dataset. The bounding-box annotations are used to learn the localized image feature we discuss in Sec 4. Fig 3(b) shows an example picture from the Dress dataset. In the Dress dataset, the product image does not always contain a person, and sometimes both frontal and rear views appear side by side. We first apply a human-body detector trained on the Fashionista dataset and calculate features from only one of the views. Although the bounding box does not align perfectly when a person is not present, we could obtain reasonable bounding boxes around the dress region¹.

3 Joint detection model

Let us denote a set of labels by Y ≡ {y_i}, y_i ∈ {0, 1}, where i is one of the clothing items or attributes, such as shirt or skirt:plain. Given a feature X ≡ {x_i}, we define our joint probability distribution over labels by a log-linear model:

    ln P(Y|X) ≡ Σ_i w_i φ(x_i, y_i) + Σ_{(i,j)∈V} w_{i,j} ψ(y_i, y_j) − ln Z,    (1)

where we denote the model parameters by w ≡ [w_i, w_{i,j}], ∀i, j, the normalization constant by Z, and the set of label pairs by V. Our model consists of the unary term φ(x_i, y_i), which considers the likelihood of assigning a label given a feature, and the binary term ψ(y_i, y_j), which considers the inter-label relationships.
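As a concrete illustration of Eq 1, the unnormalized log-linear score of one candidate assignment can be sketched as follows. The function name and data layout are our own, not the authors' implementation:

```python
import numpy as np

def joint_log_score(y, unary_logp, w_unary, w_pair, psi, pairs):
    """Unnormalized log-linear score of Eq 1 for one label assignment.

    y          : binary label assignment, length n
    unary_logp : log p(y_i | x_i) for y_i in {0, 1}, shape (n, 2)
    w_unary    : per-label unary weights w_i, shape (n,)
    w_pair     : per-pair weights w_ij, dict keyed by (i, j)
    psi        : callable psi(i, j, y_i, y_j), the pairwise potential
    pairs      : list of label pairs (i, j) in V
    """
    # Unary part: sum_i w_i * phi(x_i, y_i), with phi = ln p(y_i | x_i)
    score = sum(w_unary[i] * unary_logp[i, y[i]] for i in range(len(y)))
    # Pairwise part: sum_{(i,j) in V} w_ij * psi(y_i, y_j)
    score += sum(w_pair[(i, j)] * psi(i, j, y[i], y[j]) for i, j in pairs)
    return score
```

The normalizer ln Z is omitted since it is constant in Y and irrelevant for MAP inference.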

3.1 Data likelihood

In this paper, we use logistic regression for each label, expressed by:

    φ(x_i, y_i) ≡ ln p(y_i | x_i),    (2)
    p(y_i = 1 | x_i) ≡ σ(a_i^T x_i + b_i),    (3)

where a_i and b_i are the regression parameters for each item and σ is the sigmoid function. We learn the logistic regression [6] from the training examples. The unary term can be thought of as a regular appearance-based detector; our joint model augments the prediction of the unary term with inter-label correlation. Note that it is possible to directly use x_i in the potential term of Eq 1. We chose not to do so, in order to include additional non-linearity in the model and to keep learning computationally tractable.
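A minimal sketch of the per-label unary term of Eqs 2 and 3. The paper learns logistic regression with LIBLINEAR [6]; here we substitute plain gradient descent in NumPy purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=500):
    """Fit a_i, b_i of Eq 3 by gradient descent (illustrative only;
    the paper uses the LIBLINEAR solver [6])."""
    a = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ a + b)
        grad = p - y                      # gradient of the log loss
        a -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return a, b

def unary_potential(x, a, b, yi):
    """phi(x_i, y_i) = ln p(y_i | x_i) of Eq 2."""
    p1 = sigmoid(x @ a + b)
    return np.log(p1 if yi == 1 else 1.0 - p1)
```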

3.2 Inter-label correlation

We use the normalized Pearson correlation for the binary term:

    ψ(y_i, y_j) ≡ ln ½(1 + c_{i,j})   if y_i = y_j,
                  ln ½(1 − c_{i,j})   otherwise,    (4)

¹ We also tried to learn a dress detector based on region proposals, but we did not observe any better result than with the human-body detector, perhaps due to the lack of training data. In this paper, we chose to use a human-body detector.

Figure 5: Pearson correlation between dress attributes in the Dress dataset. Some attributes appear across parts (e.g., plain pattern) and give a strong positive correlation. Notice the negative correlation within a part block, indicating that some attributes are exclusive of each other (e.g., neck shapes).

where c_{i,j} is the Pearson correlation between items i and j in the training examples. The binary term enforces the inter-label correlation in the assignment. We visualize in Fig 2 and 5 the correlation between annotations in the Chictopia and Dress datasets. In Chictopia, we can clearly observe exclusiveness within the upper body (shirt, top, dress), lower body (dress, pants, skirt, shorts), and footwear (boots, shoes). The positive correlation between tops and bottoms indicates that they are likely worn together. In Dress, we can observe exclusive groups (e.g., neck shapes, skirt length). Some attributes are positively correlated (e.g., top and skirt tend to have the same pattern; flower and lace are likely to appear together). Our joint model encompasses such second-order information.
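The correlation table c_{i,j} and the pairwise potentials of Eq 4 can be computed directly from the binary training-label matrix. A sketch; the epsilon guard against log(0) at |c| = 1 is our addition, not discussed in the paper:

```python
import numpy as np

def pairwise_potentials(Y):
    """Pearson correlations c_ij and pairwise potentials of Eq 4.

    Y : (n_examples, n_labels) binary label matrix from training data.
    Returns (C, psi) with psi[i, j, yi, yj] the Eq 4 potential value.
    """
    C = np.corrcoef(Y, rowvar=False)      # c_ij, shape (n_labels, n_labels)
    n = C.shape[0]
    psi = np.empty((n, n, 2, 2))
    eps = 1e-12                           # guard: log(0) when |c_ij| = 1
    for yi in (0, 1):
        for yj in (0, 1):
            agree = (yi == yj)
            psi[:, :, yi, yj] = np.log(np.maximum(
                0.5 * ((1 + C) if agree else (1 - C)), eps))
    return C, psi
```

Perfectly anti-correlated labels thus receive a very large penalty for being assigned together, which is exactly the exclusiveness the model exploits.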

3.3 MAP inference

Given a feature X, we can detect clothing items by MAP inference over the joint model:

    Y* ∈ arg max_Y P(Y|X).    (5)

We use loopy belief propagation [15] to approximately solve Eq 5.

3.4 Max-margin learning

We learn the model parameters w with the Structural SVM framework [19]. Let us denote the concatenation of potential functions by Ψ(X, Y), so that Eq 1 is expressed in the linear form ln P(Y|X) = w^T Ψ(X, Y) − ln Z. Using the margin-rescaling formulation, our learning problem can be expressed by the following optimization:

    min_{w,ξ} ½‖w‖² + (C/N) Σ_k ξ_k    (6)
    s.t. ∀k, ∀Y: w^T (Ψ(X_k, Y_k) − Ψ(X_k, Y)) ≥ Δ(Y_k, Y) − ξ_k,

where we denote the slack variables by ξ ≡ {ξ_k}, the loss function by Δ(Y_k, Y), and the number of training examples by N. C is a constant parameter that balances the regularization and the loss term. The intuition of this objective is that we constrain w such that, for any training example k, the true assignment Y_k produces the maximum log-linear score with a margin against any other incorrect assignment Y. We solve Eq 6 using the cutting-plane algorithm for a general loss [19]. In this paper, we propose to use the class-weighted zero-one loss:

    Δ(Y_k, Y) ≡ ½ Σ_i δ_k(y_{k,i}, y_i),    (7)

    δ_k(y_{k,i}, y_i) ≡ N / (N − N_i)   if y_{k,i} = 0, y_i = 1,
                        N / N_i         if y_{k,i} = 1, y_i = 0,
                        0               otherwise,    (8)

where N_i is the number of positive examples for label i in the training set. The purpose is to penalize more heavily an incorrect prediction for an infrequent label i. Note that both of our datasets form a long-tailed distribution: the majority of labels do not appear frequently, while the rest appear more often (Fig 4). Without considering the class balance in the loss, the learned model tends to find a trivial parameter setting that always assigns 0 to a rare class, such as hat or vest.
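The class-weighted loss of Eqs 7 and 8 is straightforward to compute. A sketch, assuming the inverse-frequency weights of Eq 8 (variable names are ours):

```python
def class_weighted_loss(Y_true, Y_pred, N_pos, N):
    """Class-weighted zero-one loss of Eqs 7-8.

    Y_true, Y_pred : binary label vectors for one example
    N_pos          : N_i, positive-example count per label (training set)
    N              : total number of training examples
    """
    loss = 0.0
    for i, (yt, yp) in enumerate(zip(Y_true, Y_pred)):
        if yt == 0 and yp == 1:
            loss += N / (N - N_pos[i])   # false positive
        elif yt == 1 and yp == 0:
            loss += N / N_pos[i]         # false negative: costly for rare labels
    return 0.5 * loss
```

Missing a rare label (small N_i) thus costs far more than missing a frequent one, which counteracts the trivial all-zero solution described above.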

4 Localized image feature

Most of the existing work in clothing recognition requires an accurate pose estimation beforehand [13, 23, 25], because clothing items are worn on specific body parts. However, that approach has a drawback: failure in part localization can easily lead to incorrect recognition results. In this paper, instead of relying on the pose estimation of every body part, we relax this localization requirement to only a bounding box around the human body, and rather use a deterministic approach to define a region of interest. Given a bounding box around the full human body, we extract an image patch from a specific location relative to the body bounding box. This simple approach yields a surprisingly good result, as we show in the experiments. We learn the relative locations from training data. We show the relative locations of the bounding boxes used for Chictopia and Dress in Fig 6. For Chictopia, we first calculate the average tight bounding box from the pixel-wise annotations in the Fashionista dataset, make the boxes symmetric, and enlarge the box size by 40%. We apply the same procedure for the Dress dataset, except that there is no pixel-wise annotation there.

Figure 6: Relative location of part bounding boxes for (a) Chictopia clothing and (b) Dress attributes.

Figure 7: Average shape of each item within the relative bounding box in the Fashionista dataset [23]. Our deterministic localization covers sufficient regions of interest.

The major concern with this approach is the recall under pose variation. However, we observed from the data that in fashion pictures people do not make significant pose changes, and we are able to extract sufficient coverage of the part regions by this simple deterministic localization. Using the pixel-wise annotation in the Fashionista dataset, we show the average shape of each Chictopia item at the relative locations in Fig 7, which confirms that our simple localization sufficiently covers the appearance of the target items. A by-product of our deterministic approach is that we can localize regions in constant time. Once the region of interest is determined, we extract the CNN feature (fc7 of AlexNet, 4096 dimensions) learned from ImageNet [7], and use it as the feature input to the logistic regression in Eq 2.
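The deterministic localization amounts to a fixed affine mapping from the body bounding box to each part patch. A sketch with hypothetical relative-box values (the paper learns these offline from training data; the CNN step, cropping the patch and taking AlexNet fc7 activations via Caffe [7], is omitted here):

```python
def localize_part(body_box, rel_box):
    """Map a part box, specified relative to the body bounding box,
    to absolute image coordinates (the deterministic localization of Sec 4).

    body_box : (x, y, w, h) of the detected full-body bounding box
    rel_box  : (rx, ry, rw, rh) part location as fractions of body_box;
               values used here are illustrative, not the learned boxes
    """
    x, y, w, h = body_box
    rx, ry, rw, rh = rel_box
    # Constant-time mapping: no per-image pose estimation required
    return (x + rx * w, y + ry * h, rw * w, rh * h)
```

For example, a hypothetical "skirt" box covering the middle half of the body width and a fifth of its height maps a (100, 50, 200, 400) body box to the patch (150, 50, 100, 80).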

5 Experimental results

We compare the following methods in the experiments.
• Style Descriptor: the hand-crafted, pose-estimation-based image feature proposed in the state-of-the-art clothing parsing work [25]. We learn logistic regression for prediction.
• CNN Global: we predict labels by logistic regression from the CNN feature calculated from the full-body bounding box, without part localization.
• CNN Local: we predict by logistic regression using the localized CNN feature. This is equivalent to removing the second-order term in Eq 1.
• CNN Local CRF: our joint detection model described in Eq 1.

The comparison to the Style Descriptor measures the relative performance of our CNN-based detection over the state of the art. The comparison between the CNN Global, CNN Local, and CRF models measures how much the localized image feature and the joint model improve detection performance. Our experimental protocol is based on 10-fold cross-validation. From the Chictopia dataset, we randomly sample 9,000 training examples and 1,000 testing examples, each with at least 2 clothing tags. From the Dress dataset, we make a 90% train / 10% test split. We measure


Table 1: Total performance evaluation.

(a) Clothing detection in the Chictopia dataset

Method            Accuracy       Precision      Recall         F1
Style-Descriptor  0.690 ± 0.002  0.345 ± 0.003  0.646 ± 0.009  0.450 ± 0.004
CNN-Global        0.740 ± 0.003  0.390 ± 0.004  0.573 ± 0.009  0.464 ± 0.005
CNN-Local         0.768 ± 0.004  0.436 ± 0.006  0.622 ± 0.007  0.512 ± 0.005
CNN-Local-CRF     0.782 ± 0.003  0.456 ± 0.005  0.595 ± 0.008  0.516 ± 0.004

(b) Attribute prediction in the Dress dataset

Method            Accuracy       Precision      Recall         F1
Style-Descriptor  0.781 ± 0.015  0.526 ± 0.035  0.661 ± 0.021  0.585 ± 0.028
CNN-Global        0.837 ± 0.006  0.632 ± 0.021  0.723 ± 0.013  0.674 ± 0.015
CNN-Local         0.840 ± 0.007  0.638 ± 0.020  0.727 ± 0.020  0.680 ± 0.018
CNN-Local-CRF     0.843 ± 0.007  0.652 ± 0.023  0.708 ± 0.013  0.678 ± 0.016

the recognition performance in terms of accuracy, precision, recall, and F1. We repeat this procedure 10 times and report the average with standard deviation. The performance is summarized in Table 1. We also show the precision of individual items in Fig 8. We first observe that the CNN-based feature outperforms the Style Descriptor designed for clothing recognition. Using the localized image feature (CNN Local), we further boost the performance over the full-body feature (CNN Global). This effect is apparent for smaller items appearing at the extremities of the body, such as glasses, hat, or boots in Chictopia. Note that the Style Descriptor performs better than CNN Global for footwear because of its dependency on precise human-pose estimation. However, our CNN-based feature beats this baseline with a simple localization approach. Compared to Chictopia, we observed less significant improvement on the Dress dataset, perhaps due to the lack of training examples (e.g., neck:camisole).

Our CRF model improves over CNN Local in accuracy and precision, with a small loss in recall. This can be explained by the inter-label correlation in our model successfully suppressing conflicting predictions (e.g., tights and pants, round-neck and V-neck) made by the independent model (CNN Local). Precision is typically more valuable than recall in a detection problem, since recall can easily be improved by predicting everything as positive. Our model successfully pushes the independent predictions from CNN-based features towards a higher-precision regime.

Fig 9 shows a few qualitative results. There is often noise in the annotation (e.g., jacket and vest in the second row). Detecting with the global feature tends to produce false positives for small items (e.g., hat). Our CRF model produces mostly similar results to CNN Local, except that the model lowers the detection confidence of incompatible labels (e.g., top vs. dress, short-sleeves vs. 1/2-sleeves).
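For reference, the four reported metrics can be computed over all (image, label) decisions as below. Micro-averaging across all decisions is our assumption; the paper does not spell out the averaging scheme here:

```python
import numpy as np

def multilabel_metrics(Y_true, Y_pred):
    """Micro-averaged accuracy, precision, recall, and F1 over all
    (example, label) decisions. The averaging scheme is an assumption,
    not something the paper specifies."""
    Y_true = np.asarray(Y_true)
    Y_pred = np.asarray(Y_pred)
    tp = np.sum((Y_true == 1) & (Y_pred == 1))
    fp = np.sum((Y_true == 0) & (Y_pred == 1))
    fn = np.sum((Y_true == 1) & (Y_pred == 0))
    tn = np.sum((Y_true == 0) & (Y_pred == 0))
    acc = (tp + tn) / Y_true.size
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1
```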
Discussion Our joint model successfully improves the precision of predictions, but the drawback is that we are more likely to miss less frequent combinations, such as a layered combination of shirt and top. For an application where higher recall is important, independent prediction of items may still be preferable, since joint prediction suppresses such cases in the long tail. Predicting a combination in the long tail is inherently a difficult problem.

In this work, we did not fine-tune the CNN feature, mainly due to the noise in the data (Chictopia dataset) and the limited data size (Dress dataset). CNN features are known to perform excellently

when there is a high-quality, supervised dataset, but such a dataset does not exist for a new application problem such as fashion. In that sense, this data bottleneck is always a challenge when applying deep models to a new domain.

Figure 8: Precision for individual items or attributes: (a) clothing-detection precision in the Chictopia dataset; (b) attribute-prediction precision in the Dress dataset. Compared methods: Style-Descriptor, CNN-Global, CNN-Local, CNN-Local-CRF.

6 Conclusion and future work

We proposed a joint clothing detection model that considers the inter-label correlation of items. The model also takes advantage of the spatial prior of clothing with respect to the human body. The empirical study using two realistic datasets reveals that our model performs best among the baseline approaches, including the state-of-the-art feature. Our model also successfully pushes independent logistic-regression predictions to a higher-precision regime using second-order relationships between labels. Though we specifically studied clothing detection in this paper, our CRF framework can be applied to other category-specific applications, such as product catalogs on e-commerce sites or in used-car markets, and we hope to extend our work to such applications. It is also future work to study the precise effect of pose variation in fashion applications and, if needed, to improve our model accordingly [28]. We would also like to see how an optimized CNN architecture [11] trained on a large amount of fashion images affects clothing and attribute detection performance.

Figure 9: Qualitative results for (a) clothing detection and (b) attribute prediction, comparing ground truth against Style-Descriptor, CNN-Global, CNN-Local, and CNN-Local-CRF. False positives are marked red. Items are ordered by detection confidence; the CRF model sorts items by marginal probability.

References

[1] Lukas Bossard, Matthias Dantone, Christian Leistner, Christian Wengert, Till Quack, and Luc Van Gool. Apparel classification with style. In ACCV, pages 1–14, 2012.
[2] Huizhong Chen, Andrew Gallagher, and Bernd Girod. Describing clothing by semantic attributes. In ECCV, pages 609–623, 2012.
[3] George A. Cushen and Mark S. Nixon. Mobile visual clothing search. In ICME Workshops, pages 1–6, 2013.
[4] Wei Di, Catherine Wah, Anurag Bhardwaj, Robinson Piramuthu, and Neel Sundaresan. Style finder: Fine-grained clothing style detection and retrieval. In CVPR Workshops, pages 8–13, 2013.
[5] Jian Dong, Qiang Chen, Wei Xia, Zhongyang Huang, and Shuicheng Yan. A deformable mixture parsing model with parselets. In ICCV, 2013.
[6] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[7] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, pages 675–678, 2014.
[8] Yannis Kalantidis, Lyndon Kennedy, and Li-Jia Li. Getting the look: Clothing recognition and segmentation for automatic product suggestions in everyday photos. In ICMR, pages 105–112, 2013.
[9] M. Hadi Kiapour, Kota Yamaguchi, Alexander C. Berg, and Tamara L. Berg. Hipster wars: Discovering elements of fashion styles. In ECCV, 2014.
[10] Iljung S. Kwak, Ana C. Murillo, Peter N. Belhumeur, David Kriegman, and Serge Belongie. From bikers to surfers: Visual recognition of urban tribes. In BMVC, 2013.
[11] Xiaodan Liang, Si Liu, Xiaohui Shen, Jianchao Yang, Luoqi Liu, Liang Lin, and Shuicheng Yan. Deep human parsing with active template regression. TPAMI, 2015.
[12] Si Liu, Jiashi Feng, Zheng Song, Tianzhu Zhang, Hanqing Lu, Changsheng Xu, and Shuicheng Yan. Hi, magic closet, tell me what to wear! In ACM Multimedia, pages 619–628, 2012.
[13] Si Liu, Zheng Song, Guangcan Liu, Changsheng Xu, Hanqing Lu, and Shuicheng Yan. Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set. In CVPR, pages 3330–3337, 2012.
[14] Si Liu, Jiashi Feng, Csaba Domokos, Hui Xu, Junshi Huang, Zhenzhen Hu, and Shuicheng Yan. Fashion parsing with weak color-category labels. IEEE Transactions on Multimedia, 16(1), January 2014.
[15] Joris M. Mooij. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. Journal of Machine Learning Research, 11:2169–2173, August 2010.
[16] Rasmus Rothe, Marko Ristin, Matthias Dantone, and Luc Van Gool. Discriminative learning of apparel features. In Machine Vision Applications, 2015.
[17] Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, and Raquel Urtasun. A high performance CRF model for clothes parsing. In ACCV, 2014.
[18] Edgar Simo-Serra, Sanja Fidler, Francesc Moreno-Noguer, and Raquel Urtasun. Neuroaesthetics in fashion: Modeling the perception of fashionability. In CVPR, 2015.
[19] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, pages 1453–1484, 2005.
[20] Sirion Vittayakorn, Kota Yamaguchi, Alexander C. Berg, and Tamara L. Berg. Runway to realway: Visual analysis of fashion. In WACV, 2015.
[21] Xianwang Wang and Tong Zhang. Clothes search in consumer photos via color matching and attribute learning. In ACM Multimedia, 2011.
[22] Yang Wang and Greg Mori. A discriminative latent model of object classes and attributes. In ECCV, 2010.
[23] Kota Yamaguchi, M. Hadi Kiapour, Luis E. Ortiz, and Tamara L. Berg. Parsing clothing in fashion photographs. In CVPR, pages 3570–3577, 2012.
[24] Kota Yamaguchi, Tamara L. Berg, and Luis E. Ortiz. Chic or social: Visual popularity analysis in online fashion networks. In ACM Multimedia, 2014.
[25] Kota Yamaguchi, M. Hadi Kiapour, Luis E. Ortiz, and Tamara L. Berg. Retrieving similar styles to parse clothing. TPAMI, 2014.
[26] Yi Yang and Deva Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR, pages 1385–1392, 2011.
[27] Ning Zhang, Ryan Farrell, Forrest Iandola, and Trevor Darrell. Deformable part descriptors for fine-grained recognition and attribute prediction. In ICCV, pages 729–736, 2013.
[28] Ning Zhang, Jeff Donahue, Ross Girshick, and Trevor Darrell. Part-based R-CNNs for fine-grained category detection. In ECCV, pages 834–849, 2014.
[29] Ning Zhang, Manohar Paluri, Marc'Aurelio Ranzato, Trevor Darrell, and Lubomir Bourdev. PANDA: Pose aligned networks for deep attribute modeling. In CVPR, pages 1637–1644, 2014.
[30] Weipeng Zhang, Jie Shen, Guangcan Liu, and Yong Yu. A latent clothing attribute approach for human pose estimation. In ACCV, 2014.
