Fashion Style Generator

Shuhui Jiang¹ and Yun Fu¹,²
¹Department of Electrical and Computer Engineering, Northeastern University, Boston, MA 02115, USA
²College of Computer and Information Science, Northeastern University, Boston, MA 02115, USA
{shjiang,yunfu}@ece.neu.edu

Abstract

In this paper, we focus on a new problem: applying artificial intelligence to automatically generate fashion style images. Given a basic clothing image and a fashion style image (e.g., leopard print), we generate a clothing image with that style in real time with a neural fashion style generator. Fashion style generation is related to recent artistic style transfer works, but has its own challenges. The synthetic image should preserve the design of the basic clothing while blending the new style pattern onto the clothing. Neither existing global nor patch based neural style transfer methods can solve these challenges well. In this paper, we propose an end-to-end feed-forward neural network which consists of a fashion style generator and a discriminator. The global and patch based style and content losses calculated by the discriminator alternately back-propagate through the generator network and optimize it. The global optimization stage preserves the clothing form and design, and the local optimization stage preserves the detailed style pattern. Extensive experiments show that our method outperforms the state-of-the-art.

1 Introduction

Applying artificial intelligence to solve problems in the art and fashion fields has attracted a lot of attention, for example in fashion style classification [Ma et al., 2017; Kiapour et al., 2014; Jiang et al., 2016a], clothing parsing [Yamaguchi et al., 2013; Yamaguchi et al., 2012], clothing retrieval [Jiang et al., 2016b] and recommendation [Fu et al., 2017]. In this paper, we focus on a novel problem: fashion style generation. It is different from existing online clothing design tools¹,², which directly put a picked icon on the basic clothing. As shown in Figure 1 (b), with a basic clothing image and a style image as inputs, we automatically generate a clothing image blended with the new style while preserving the basic design. The definition of style in this paper is similar to that in recent neural style transfer works [Gatys et al., 2015]. Taking Van Gogh's Starry Night as an example style image, style lies between the low-level color and texture (e.g., blue and yellow color, rough or smooth texture) and the high-level objects (e.g., house and mountain). Style is a relatively abstract concept.

¹https://www.customink.com/lab?ref=nav_v2
²http://www.ooshirts.com

Figure 1: Fashion style generator framework overview. (a) Framework of the training stage; (b) examples of fashion style generation. The input X consists of a set of clothing patches X^(1) and full clothing images X^(2). The system consists of two components: an image transformation network G that serves as the fashion style generator, and a discriminator network D that calculates both global and patch based content and style losses. G is a convolutional encoder-decoder network parameterized by weights θ. Six shirts generated with different styles by our method are shown as examples. (We highly recommend zooming in on all figures in the color version for more details.)

Fashion style generation has at least two practical uses. Designers could quickly see how a clothing item looks in a given style to facilitate the design process.
Shoppers could synthesize a clothing image with their ideal style and apply clothing retrieval tools [Jiang et al., 2016b] to search for similar items.

Fashion style generation is related to existing neural style transfer works [Gatys et al., 2015; Li and Wand, 2016a; Efros and Freeman, 2001], but has its own challenges. In fashion style generation, the synthetic clothing image should blend the style of the style image while preserving the original form and shape of the clothing.

Figure 2: Limitations of applying the global [Gatys et al., 2015] (a) and patch [Li and Wand, 2016a] (b) based neural style transfer methods to fashion style generation. The left two columns are the input content and style images. The right three columns are synthetic results at different iterations. In (a), we apply the global method to artistic style transfer in the first row and to fashion style generation in the second row. In (b), we apply the patch method to face-to-face transfer in the first row and to fashion style generation in the second row. This figure demonstrates that applying global or patch based methods may fail to synthesize high-quality fashion style images.

Very few works have focused on fashion style generation. To the best of our knowledge, there is no publication so far; we only found an unpublished course project, which applies Gatys's neural style transfer work [Gatys et al., 2016] to fashion style transfer³. [Gatys et al., 2016] performed artistic style transfer, combining the content of one image with the style of another by jointly minimizing the content reconstruction loss and the style reconstruction loss. Although [Gatys et al., 2016] produces high-quality results in painting style transfer, it is computationally expensive, since each step of the optimization requires forward and backward passes through the pretrained network. Meanwhile, existing works are mainly focused on painting or other applications, which may not well capture the challenges of the fashion style generation task.

Existing neural style transfer works mainly consist of two kinds of approaches: global and patch based. Global (i.e., full image) based methods [Gatys et al., 2015; Johnson et al., 2016; Gatys et al., 2016; Ulyanov et al., 2016] achieve impressive results in artistic style transfer, but with limited fidelity in local detail, especially for high-resolution images. As shown in Figure 2 (a), the global structure of the content images (i.e., buildings and T-shirt) is well preserved; however, the detailed structures of the style images are not well blended onto the T-shirt. We can see that the yellow stars are transferred onto the background instead of the T-shirt.

Patch based approaches, such as deep Markovian models [Li and Wand, 2016a; Li and Wand, 2016b; Ding et al., 2016], capture the statistics of local patches and assemble them into high-resolution images. While they achieve high fidelity of details, additional guidance is required if the global structure should be reproduced [Efros and Freeman, 2001; Li and Wand, 2016a; Li and Wand, 2016b]. As shown in Figure 2 (b), patch based approaches preserve both global and local structure well only when the style and content images have similar structure, such as in face-to-face transfer.

³http://personal.ie.cuhk.edu.hk/~lz013/papers/fashionstyle_poster.pdf
However, in fashion style generation, the style image is not necessarily a clothing image and does not necessarily share a similar structure with the content image. Without additional global guidance, the global structure of the synthetic image would be destroyed. For example, in the second row of Figure 2 (b), the global structure of the left part of the synthetic clothing is destroyed during the synthesis process.

To address the above challenges, we propose an end-to-end feed-forward neural network for fashion style generation. We combine the benefits of both global and patch based methods while avoiding their disadvantages. As shown in Figure 1, the inputs consist of a set of clothing patches and full images. There are two components: an image transformation network G that serves as the fashion style generator, and a discriminator network D that calculates both global and patch based content and style reconstruction losses. Furthermore, an alternating global-patch back-propagation strategy is proposed to optimize the generator to preserve both global and local structures. In the online generation stage, we only need to do the forward propagation, which makes it hundreds of times faster than existing methods requiring both forward and backward passes [Li and Wand, 2016a; Gatys et al., 2016]. Experimental results demonstrate that, in both speed and quality, the proposed method outperforms the state-of-the-art in the fashion style generation task.

2.1 Problem Formulation

For an input clothing image q and a style image y_s, we want to synthesize a clothing image ŷ through a style generator G. ŷ blends the style of y_s onto q while preserving the form and design of q. We achieve this by off-line training the parameters θ of G with a set of clothing images X and the style image y_s.

Recently, a wide variety of feed-forward image transformation tasks have been solved by training deep convolutional neural networks [Johnson et al., 2016; Li and Wand, 2016b]. A general feed-forward network consists of an image transformation network G and a discriminator network D. For style transfer/generation, G serves as the style generator. The content and style reconstruction losses of D iteratively back-propagate and optimize θ. In online generation, G transforms the input clothing image q into the output clothing image ŷ via the mapping ŷ = f_θ(q). Thus, we do not need to do back-propagation, which facilitates real-time generation. However, as discussed above, neither the existing global [Johnson et al., 2016] nor patch [Li and Wand, 2016b] based methods can well solve the challenges in fashion style generation. Therefore, we propose to jointly consider the global and patch reconstruction losses when optimizing G to overcome the shortcomings of global or patch based methods. The main purpose of the global based optimization is to preserve the global form and design of the basic clothing, while the main purpose of the patch based optimization is to preserve the local details of the style pattern.

2.2 Architecture

The flowchart in Figure 1 shows the training stage of our system. Different from existing works that use either only full images or only patches, the input X of our training stage consists of a set of clothing patches X^(1) and full clothing images X^(2). X^(1) and X^(2) are applied in the patch and global based optimization stages, respectively.
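As a concrete illustration of how the two input sets could be assembled, here is a minimal sketch, assuming a PyTorch-style implementation (the paper's own code is in Torch). The patch size and stride follow the settings reported in Section 3.1; the image tensors and resolutions used here are placeholders.

```python
import torch
import torch.nn.functional as F

def crop_patches(shop_image, patch_size=128, stride=16):
    """Crop overlapping patches X^(1) from a clean online-shopping image.

    shop_image: float tensor of shape (3, H, W).
    Returns a tensor of shape (num_patches, 3, patch_size, patch_size).
    """
    patches = shop_image.unfold(1, patch_size, stride).unfold(2, patch_size, stride)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, 3, patch_size, patch_size)

# X^(1): patches cropped from online-shopping images (clean background, front pose).
x1 = crop_patches(torch.rand(3, 256, 256))  # placeholder image; yields 81 patches

# X^(2): full street images from Fashion 144k, resized to the low-resolution
# input size used in the global optimization stage.
x2 = F.interpolate(torch.rand(1, 3, 600, 400), size=(72, 72),
                   mode='bilinear', align_corners=False)
```

The clean shop patches let the patch stage focus on local texture, while the noisier full-body street images keep the global stage robust, matching the motivation above.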
The patch images are cropped from the online shopping clothing dataset [Hadi Kiapour et al., 2015; Jiang et al., 2016b]. They usually have clean backgrounds and front poses, which makes it much easier to focus on the details of the local clothing structure. The whole clothing images are from the Fashion 144k dataset [Simo-Serra and Ishikawa, 2016]. They usually have complex backgrounds and different poses, which makes the model more robust to noise and better able to preserve the global clothing structure.

Our system is an end-to-end feed-forward neural network that consists of an image transformation network G with parameters θ, which serves as the fashion style generator, and a discriminator network D. G consists of encoder and decoder parts: the encoder En encodes the input image as a vector and the decoder De decodes the vector back into an image. D consists of the global loss network φ and the patch loss networks ϕ_s and ϕ_c for style and content, respectively. The reconstruction loss back-propagates and optimizes θ so that the synthetic image preserves both global structure and local details.

As mentioned in [Johnson et al., 2016], pretrained convolutional neural networks are able to extract perceptual information and encode semantics. Therefore, we utilize a pretrained image classification network (i.e., VGG-19) [Simonyan and Zisserman, 2014; Li et al., 2016] as the initialization of En. The VGG network is also utilized as the global loss network φ and the patch content loss network ϕ_c. For the patch style loss network ϕ_s, since existing networks are mainly trained for whole images, instead of directly applying an existing pretrained discriminator network, we apply generative adversarial training [Goodfellow et al., 2014] to learn the parameters of ϕ_s and initialize De simultaneously. After the initialization, an alternating patch-global training strategy is applied to optimize the generator parameters θ.

2.3 Objective Function of Discriminator

As discussed above, the loss function L of the discriminator D is defined as a weighted combination of the patch based loss L^(1) and the global based loss L^(2):

$$
L(\hat{y}, y_c, y_s) = L^{(1)}(\hat{y}, y_c, y_s) + \lambda L^{(2)}(\hat{y}, y_c, y_s)
= \underbrace{l^{(1)}_{style} + \lambda_1 l^{(1)}_{content}}_{patch}
+ \lambda \underbrace{\left( \lambda_2 l^{(2)}_{style} + \lambda_3 l^{(2)}_{content} \right)}_{global},
\qquad (1)
$$

where λ, λ_1, λ_2 and λ_3 are tuning parameters that adjust the weights. Given an input training clothing image x ∈ X, ŷ is the synthetic image output by the generator through the mapping ŷ = f_θ(x), y_s is the input style image, and y_c is the clothing content image. In the patch optimization stage, y_c = x ∈ X^(1), while in the global optimization stage, y_c is a higher-resolution version of the image x ∈ X^(2).

Both L^(1) and L^(2) consist of two parts: the content and the style reconstruction loss. The content losses l^(1)_content(ŷ, y_c) and l^(2)_content(ŷ, y_c) capture the distances between the perceptual features of y_c and ŷ, for patch and global respectively. The style losses l^(1)_style(ŷ, y_s) and l^(2)_style(ŷ, y_s) capture the distances between the mid-level features of y_s and ŷ, for patch and global respectively. In the following, we introduce l^(2)_content, l^(2)_style, l^(1)_content, and l^(1)_style one by one.

As discussed above, we apply a pretrained convolutional neural network (i.e., VGG-19) as the global loss network φ. The deeper layers of φ extract perceptual information and encode the semantics of the content. Thus, measuring the perceptual similarity of y_c and ŷ as the content loss is more informative than encouraging a pixel-based match.
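To make the role of the weights concrete, the sketch below evaluates the combination in Eq. (1), assuming the four individual loss terms (defined in Eqs. (2)-(5) below) have already been computed as scalar tensors. The PyTorch framing and the placeholder λ values are assumptions, not the paper's tuned settings.

```python
import torch

def combined_loss(l1_style, l1_content, l2_style, l2_content,
                  lam=1.0, lam1=1.0, lam2=1.0, lam3=1.0):
    """Eq. (1): weighted combination of the patch loss L^(1) and global loss L^(2).

    The four arguments are precomputed scalar loss terms; lam, lam1, lam2, lam3
    correspond to the tuning weights lambda, lambda_1, lambda_2, lambda_3.
    """
    patch_term = l1_style + lam1 * l1_content              # L^(1)
    global_term = lam2 * l2_style + lam3 * l2_content      # L^(2)
    return patch_term + lam * global_term

# Toy usage with dummy loss values; in the alternating optimization of
# Section 2.4, only one of the two terms is back-propagated in a given stage.
loss = combined_loss(torch.tensor(0.8), torch.tensor(0.4),
                     torch.tensor(0.6), torch.tensor(0.3))
```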
The middle layers of φ, in contrast, extract mid-level feature representations that characterize the image style. Thus, we measure the middle-layer similarity of y_s and ŷ as the style loss. Let φ_j and φ_k be the activations of the j-th (deeper) and k-th (middle) layer of the network φ, and let C_j × H_j × W_j be the shape of the feature map of the j-th layer. In order to obtain a high-resolution output image, we assign y_c to be the higher-resolution version of the input image x ∈ X^(2). The global content loss is the normalized squared Euclidean distance between feature representations:

$$
l^{(2)}_{content}(\hat{y}, y_c) = \frac{1}{C_j H_j W_j} \left\| \phi_j(\hat{y}) - \phi_j(y_c) \right\|_2^2,
\qquad (2)
$$

and for the global style loss, we use the Frobenius norm of the difference of the Gram matrices [Gatys et al., 2015]:

$$
l^{(2)}_{style}(\hat{y}, y_s) = \frac{1}{C_k H_k W_k} \left\| \mathrm{Gram}^{\phi}_k(\hat{y}) - \mathrm{Gram}^{\phi}_k(y_s) \right\|_F^2.
\qquad (3)
$$

Different from l^(2)_content and l^(2)_style, which are computed on the same loss network φ, the patch losses l^(1)_content and l^(1)_style are computed on the patch content loss network ϕ_c and the patch style loss network ϕ_s, respectively. Assume we extract N patches from a full image and denote by Ψ(·) the patches extracted from an image. For the content loss, we calculate the Euclidean distance between feature representations in a similar way as Eq. (2):

$$
l^{(1)}_{content}(\hat{y}, y_c) = \frac{1}{N} \left\| \varphi_c(\Psi(\hat{y})) - \varphi_c(\Psi(y_c)) \right\|_2^2,
\qquad (4)
$$

where Ψ(ŷ) and Ψ(y_c) are the patches extracted from ŷ and y_c.

For the patch style loss network ϕ_s, since existing networks are mainly trained for full images, instead of directly applying an existing pretrained discriminator network, we apply a Generative Adversarial Network (GAN) [Goodfellow et al., 2014; Radford et al., 2015] to learn ϕ_s and meanwhile initialize the parameters of the decoder De of the generator. We will describe this in the next subsection. After obtaining ϕ_s, we apply a hinge loss to measure the style loss as in [Li and Wand, 2016b]:

$$
l^{(1)}_{style}(\hat{y}, y_s) = \frac{1}{N} \sum_{i=1}^{N} \max(0,\, 1 - 1 \times s_i),
\qquad (5)
$$

where s_i denotes the classification score of the i-th neural patch. More details can be found in [Li and Wand, 2016b].

2.4 Optimization of Generator

In this section, we describe the strategy to optimize the parameters θ of the style generator G using the loss L calculated by the discriminator:

$$
\hat{\theta} = \arg\min_{\theta} \mathbf{E}_{x, y_s, y_c} \left[ L\big(f_\theta(x), y_s, y_c\big) \right],
\qquad (6)
$$

where E_{x,y_s,y_c} is the expectation estimated over the training set {x, y_s, y_c}, x ∈ X.

We first describe utilizing GAN [Goodfellow et al., 2014; Radford et al., 2015] to learn the patch style network ϕ_s and meanwhile initialize the parameters of the decoder De. The inputs of this stage are the image patches X^(1) and the style image y_s. As described above, the parameters of En, the global loss network φ and the local content loss network ϕ_c are initialized by VGG. We keep En unchanged in this step. GAN estimates generative models via an adversarial process; the training procedure for G is to maximize the probability of D making a mistake. The objective function is:

$$
\min_{G} \max_{D} V(D, G) = \mathbf{E}_{x \sim p_{data}(x)}[\log D(x)]
+ \mathbf{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].
\qquad (7)
$$

In a traditional GAN, z is random noise. In our work, we replace z with the encoded feature of the input image produced by En, in the spirit of a VAE [Kingma and Welling, 2013]. The detailed theoretical analysis can be found in [Goodfellow et al., 2014]. Figure 3 shows three examples of generated patches with the style "Chinese knot" after the initialization of ϕ_s and De. At this point, all parts of the networks are initialized.

Figure 3: Examples of generated style patches. The inputs are image patches and the style image "Chinese knot".
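As a concrete illustration of the loss terms in Eqs. (2), (3) and (5), here is a minimal PyTorch-style sketch (an assumption; the paper's implementation uses Torch). The normalization constants are folded into the mean, and the feature maps are assumed to be precomputed activations of the loss networks.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Gram matrix of a feature map of shape (B, C, H, W), as used in Eq. (3)."""
    b, c, h, w = feat.size()
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def global_content_loss(phi_j_yhat, phi_j_yc):
    """Eq. (2): normalized squared distance between deep-layer activations."""
    return F.mse_loss(phi_j_yhat, phi_j_yc)

def global_style_loss(phi_k_yhat, phi_k_ys):
    """Eq. (3): squared Frobenius distance between Gram matrices of
    middle-layer activations."""
    return F.mse_loss(gram_matrix(phi_k_yhat), gram_matrix(phi_k_ys))

def patch_style_loss(scores):
    """Eq. (5): hinge loss over the classification scores s_i that the patch
    discriminator phi_s assigns to the N neural patches of the synthetic image."""
    return torch.clamp(1.0 - scores, min=0.0).mean()
```

The patch content loss of Eq. (4) follows the same pattern as global_content_loss, applied to the activations of ϕ_c on the extracted patches.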
From Figure 3, we can see that the generator blends the style of the Chinese knot onto the clothing patches in detail.

Next, we describe the alternating global-patch back-propagation algorithm for optimizing θ. The discriminator networks are unchanged during the optimization. The alternating global-patch back-propagation iterates the following two steps for T iterations.

(1) Global back-propagation: In the global back-propagation step, θ_{t+1} can be obtained by using the least-squares error of the global loss between iterations t+1 and t, e^(2)_{t+1} = L^(2)_{t+1} − L^(2)_{t}, to train the generator f_θ(x). We employ a gradient descent (GD) algorithm to minimize this error; θ_t is updated by repeating the following τ^(2) times:

$$
\theta_t = \theta_t - \eta^{(2)} \frac{\partial \left\| e^{(2)}_{t+1} \right\|_2^2}{\partial \theta_t},
\qquad (8)
$$

where η^(2) is the learning rate.

(2) Patch back-propagation: In the patch back-propagation step, θ_{t+1} can be obtained by using the least-squares error of the patch loss between iterations t+1 and t, e^(1)_{t+1} = L^(1)_{t+1} − L^(1)_{t}, to train the generator f_θ(x). θ_t is updated by repeating the following τ^(1) times:

$$
\theta_t = \theta_t - \eta^{(1)} \frac{\partial \left\| e^{(1)}_{t+1} \right\|_2^2}{\partial \theta_t},
\qquad (9)
$$

where η^(1) is the learning rate. The optimization procedure is summarized in Algorithm 1.

Algorithm 1 Alternating Patch-Global Back-propagation
INPUT: X^(1), X^(2), y_s, T, τ^(1), τ^(2), VGG network parameters.
1: Initialize the weights of En, φ, ϕ_c by VGG.
2: Apply GAN to initialize De and ϕ_s.
3: for t = 1, 2, ..., T do
4:   % update θ by global loss back-propagation.
5:   for m = 1, 2, ..., τ^(2) do
6:     Calculate the global loss by Eqs. (1), (2), (3).
7:     Update θ_t by Eq. (8).
8:   end for
9:   % update θ by patch loss back-propagation.
10:  for m = 1, 2, ..., τ^(1) do
11:    Calculate the patch loss by Eqs. (1), (4), (5).
12:    Update θ_t by Eq. (9).
13:  end for
14:  Update θ_{t+1} = θ_t.
15: end for
OUTPUT: Style generator parameters θ̂ = θ_t.

3 Experiments

3.1 Experimental Details

Dataset and Data Processing: Our training dataset contains two parts: the Fashion 144k dataset as full-image inputs [Simo-Serra and Ishikawa, 2016] and 300 online shopping images as patch inputs, randomly selected from the Online Shopping dataset [Hadi Kiapour et al., 2015]. Existing patch based works point out that even a small number of training images (e.g., 100 images) can still produce good results [Li and Wand, 2016b]. The Fashion 144k dataset consists of 144,169 user posts with images, collected from the largest fashion website, chictopia.com. The Online Shopping dataset consists of 404,683 shop photos from 25 different online clothing retailers. Our testing data are 100 images randomly collected from online shopping websites. In the experiments, we apply 6 style images, as shown in the second-to-last row of Figure 1. They are "blue and white porcelain", "bear", "wave", "Chinese knot", "leopard print" and "starry night".

Figure 4: Synthetic fashion style images by the 5 compared methods: Neural ST, MRFCNN, Feed S, MGAN and Ours. The first column shows the input style images "wave" and "bear". The second column shows the four input content images. For MGAN and Ours, we enlarge the regions in red frames to show more details.

The sizes of the input and output images in training follow existing global and patch based works [Johnson et al., 2016; Li and Wand, 2016b]. The style images are color images of shape 3×256×256. For full images, the low-resolution inputs are of shape 3×72×72 and the high-resolution inputs are of shape 3×288×288.
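As a concrete illustration of Algorithm 1, here is a minimal sketch of the alternating global-patch loop, written against PyTorch with placeholder networks, data and learning rates (all assumptions; the paper's implementation uses Torch, and the difference-of-losses update of Eqs. (8)-(9) is simplified here to plain gradient descent on the current loss).

```python
import torch
import torch.nn as nn

# Placeholder generator and loss terms; in the full system these are the
# encoder-decoder generator G and the loss terms of Eqs. (1)-(5).
generator = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))
def global_loss(y_hat, y_c, y_s):   # stands in for lambda2*l2_style + lambda3*l2_content
    return ((y_hat - y_c) ** 2).mean()
def patch_loss(y_hat, y_c, y_s):    # stands in for l1_style + lambda1*l1_content
    return ((y_hat - y_c) ** 2).mean()

y_s = torch.rand(1, 3, 256, 256)                               # style image
full_batches = [torch.rand(4, 3, 72, 72) for _ in range(4)]    # X^(2) batches
patch_batches = [torch.rand(4, 3, 128, 128) for _ in range(4)] # X^(1) batches

opt_global = torch.optim.SGD(generator.parameters(), lr=1e-7)  # eta^(2), illustrative
opt_patch = torch.optim.SGD(generator.parameters(), lr=2e-2)   # eta^(1), illustrative

T, tau1, tau2 = 1, 4, 4   # small illustrative iteration counts
for t in range(T):
    # (1) Global back-propagation: preserve the clothing form and design.
    for m in range(tau2):
        x2 = full_batches[m % len(full_batches)]
        opt_global.zero_grad()
        global_loss(generator(x2), x2, y_s).backward()
        opt_global.step()
    # (2) Patch back-propagation: preserve the detailed style pattern.
    for m in range(tau1):
        x1 = patch_batches[m % len(patch_batches)]
        opt_patch.zero_grad()
        patch_loss(generator(x1), x1, y_s).backward()
        opt_patch.step()
```

Since the generator is already initialized by the adversarial patch stage, T, τ^(1) and τ^(2) can be kept small in practice (see Section 3.4).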
For patch images, the patches are of shape 3×128×128. They are cropped from full online shopping images with a fixed stride, which is 16 in our work. Since the image transformation networks are fully convolutional, at the test stage they can be applied to images of any resolution.

Network details: The generator network G takes a VGG-19 relu4_1 encoding of an image and directly decodes it to the pixels of the synthetic image. For the decoder De and the patch style loss network ϕ_s, like [Radford et al., 2015; Wu et al., 2016], we use batch normalization (BN) and LReLU to improve training. The style loss is computed at the VGG-19 layer relu2_2, and the content loss is computed at the VGG-19 layer relu5_1.

Training details: For the global stage back-propagation, the maximum number of iterations is set to 40,000 and a batch size of 4 is applied. These settings give roughly 1.5 epochs over all the training data. For the patch stage back-propagation, we test 1 to 10 epochs over all the patches. The optimization is based on Adam [Kingma and Ba, 2014] with a learning rate of 1×10⁻³. No weight decay or dropout is used. The training is implemented using Torch [Collobert et al., 2011] and cuDNN [Chetlur et al., 2014]. Training for each style takes around 7 hours on a single GTX Titan X GPU.

3.2 Compared Methods

Although there are very few publications fully focused on the fashion style generation task, to evaluate the effectiveness of our proposed method we take the four most related global or patch based neural style transfer works as our baseline methods, as follows:

Neural ST [Gatys et al., 2015]: Gatys et al. performed artistic neural style transfer by synthesizing a new image that matches both the content of the content image and the style of the style image.

MRFCNN [Li and Wand, 2016a]: Li et al. combined generative Markov random field (MRF) patch based models and discriminatively trained deep convolutional neural networks (dCNNs) for synthesizing 2D images.

Feed S [Johnson et al., 2016]: Johnson et al. proposed a feed-forward network to solve the optimization problem of [Gatys et al., 2015] in real time at the test stage.

MGAN [Li and Wand, 2016b]: Li et al. proposed a Markovian patch based feed-forward network for artistic style transfer. This work is similar to the initialization of the patch loss network in our work.

Ours: The whole pipeline of our framework.

In Neural ST and MRFCNN, both forward and backward propagation are applied when generating the testing results. For Feed S and MGAN, we train the feed-forward networks with the same clothing datasets as in our work. We tried different parameter settings and report the best results obtained by each method. For the comparison methods, we run the code released by the authors.

Figure 5: Illustration of synthetic clothing at different iterations (0, 1000, 2000, 3000 from left to right) of the global back-propagation after the patch based initialization. The global iterations gradually add the style pattern to the parts of the images destroyed by the patch initialization. We enlarge the parts in the red frames to show more details.

3.3 Experimental Results

Figure 4 compares our results with the compared methods Neural ST, MRFCNN, Feed S and MGAN. In Neural ST and MRFCNN, we set the iteration number to 200. In Feed S, we set the iteration number to 40,000, which is almost 2.5 epochs.
In MGAN, we set the iteration number to 3000, which is almost 10 epochs. In Ours, we set T = 1 and τ^(1) = τ^(2) = 3000. We remove the backgrounds of the clothing images with image matting algorithms for better visualization.

When comparing the feed-forward based methods (Feed S, MGAN and Ours), we find that MGAN and Ours better preserve the detailed textures of the style images compared with the global based Feed S. For example, the claws of the waves and the bear hair are very clear. Since our network is initialized by the patch based network, the difference in texture between MGAN and Ours is not large. However, as discussed above, patch based methods may not well preserve the global structure of the full image. For example, in the first row of MGAN, the areas in the red frames are not well synthesized. In our method, these areas are better blended with the style patterns. This shows the effectiveness of considering both global and local characteristics in our method.

Neural ST and MRFCNN are not feed-forward networks. Generally, besides the speed, we have similar observations. In MRFCNN, although the generated images preserve the textures, they may lose the original global structure. For example, in the two images generated with the "bear" style by MRFCNN, even the heads of the bears are transferred.

3.4 Discussion of Speed and Complexity

Neural ST and MRFCNN are computationally expensive, since each step of the optimization requires forward and backward passes through the pretrained network. With the feed-forward network, since we do not need to do back-propagation in the test stage, the test speed is hundreds of times faster. For the training stage, the most time-consuming part is the patch discriminator network initialized by GAN. The time complexity of this step is the same as in [Li and Wand, 2016b]; it is mainly affected by the number of training iterations and the batch size. In our work, it takes about 5 hours for the initialization. After initialization, the speed is affected by the alternating iteration number T and the iteration numbers τ^(1) and τ^(2) in the patch and global back-propagation. Since the generator is already initialized, we set T, τ^(1) and τ^(2) to small numbers. It takes about 2 hours for the following optimization.

3.5 Discussion of Our Method

To evaluate the effectiveness of the alternating patch-global back-propagation, in Figure 5 we show the images generated by only utilizing the patch back-propagation (iteration 0) and after global back-propagation iterations 1000, 2000 and 3000. The global back-propagation gradually blends the style onto the destroyed parts caused by the patch initialization, which shows the effectiveness of the patch-global optimization strategy.

We also discuss the weight λ in our objective function, Eq. (1). We tune λ through different settings of the learning rates η^(1) and η^(2) in Eqs. (8) and (9). The initial learning rate η^(1) in the patch optimization is 0.02. We fix η^(1) and tune η^(2) of the global optimization from e⁻⁵ to e⁻⁹. If we set the learning rate too large, the network does not converge and the output image is blurry, without style patterns blended in. We achieve good results with η^(2) around e⁻⁷. Comparing η^(1) and η^(2), we observe that the patch loss plays a more important role than the global loss.

3.6 Limitation

Our work still has some limitations. First, similar to the patch based method MGAN [Li and Wand, 2016b], we may fail to generate the style texture on the clothing if a very large area of the image is non-textured and plain.
Second, the colors are sometimes less accurate, because the network may preserve some of the original colors of the content image. Third, the resolution of the generated clothing images is still lower than that of real clothing.

4 Conclusion

In this paper, we focused on fashion style generation, which is a relatively new topic in the artificial intelligence field. We pointed out the challenges of fashion style generation compared with existing artistic neural style transfer: the synthetic image should preserve the design of the basic clothing and meanwhile blend in the detailed style. We analyzed the shortcomings of existing global and local neural style transfer methods when directly applied to our task. To address the challenges, we proposed an end-to-end neural fashion style generator, together with an alternating patch-global back-propagation strategy. Experiments and analysis show that our model outperforms the state-of-the-art.

References

[Chetlur et al., 2014] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.

[Collobert et al., 2011] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.

[Ding et al., 2016] Zhengming Ding, Ming Shao, and Yun Fu. Deep robust encoder through locality preserving low-rank dictionary. In Proceedings of ECCV, pages 567-582. Springer, 2016.

[Efros and Freeman, 2001] Alexei A Efros and William T Freeman. Image quilting for texture synthesis and transfer. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pages 341-346. ACM, 2001.

[Fu et al., 2017] Jingtian Fu, Jia Jia, Yihui Ma, Fanhang Meng, and Huan Huang. A virtual personal fashion consultant: Learning from the personal preference of fashion. In Proceedings of AAAI. AAAI Press, 2017.

[Ma et al., 2017] Yihui Ma, Jia Jia, Suping Zhou, Jingtian Fu, Yejun Liu, and Zijian Tong. Towards better understanding the clothing fashion styles: A multimodal deep learning approach. In Proceedings of AAAI. AAAI Press, 2017.

[Gatys et al., 2015] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.

[Gatys et al., 2016] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of CVPR, pages 2414-2423, 2016.

[Goodfellow et al., 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of NIPS, pages 2672-2680, 2014.

[Hadi Kiapour et al., 2015] M Hadi Kiapour, Xufeng Han, Svetlana Lazebnik, Alexander C Berg, and Tamara L Berg. Where to buy it: Matching street clothing photos in online shops. In Proceedings of ICCV, pages 3343-3351, 2015.

[Jiang et al., 2016a] Shuhui Jiang, Ming Shao, Chengcheng Jia, and Yun Fu. Consensus style centralizing auto-encoder for weak style classification. In Proceedings of AAAI, pages 1223-1229. AAAI Press, 2016.

[Jiang et al., 2016b] Shuhui Jiang, Yue Wu, and Yun Fu. Deep bi-directional cross-triplet embedding for cross-domain clothing retrieval. In Proceedings of MM, pages 52-56. ACM, 2016.
[Johnson et al., 2016] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. arXiv preprint arXiv:1603.08155, 2016.

[Kiapour et al., 2014] M Hadi Kiapour, Kota Yamaguchi, Alexander C Berg, and Tamara L Berg. Hipster wars: Discovering elements of fashion styles. In Proceedings of ECCV, pages 472-488. Springer, 2014.

[Kingma and Ba, 2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Kingma and Welling, 2013] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[Li and Wand, 2016a] Chuan Li and Michael Wand. Combining Markov random fields and convolutional neural networks for image synthesis. arXiv preprint arXiv:1601.04589, 2016.

[Li and Wand, 2016b] Chuan Li and Michael Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. arXiv preprint arXiv:1604.04382, 2016.

[Li et al., 2016] J. Li, T. Zhang, W. Luo, J. Yang, X.T. Yuan, and J. Zhang. Sparseness analysis in the pretraining of deep neural networks. IEEE Transactions on Neural Networks and Learning Systems, DOI: 10.1109/TNNLS.2016.2541681, 2016.

[Radford et al., 2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[Simo-Serra and Ishikawa, 2016] Edgar Simo-Serra and Hiroshi Ishikawa. Fashion style in 128 floats: Joint ranking and classification using weak data for feature extraction. In Proceedings of CVPR, pages 298-307, 2016.

[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[Ulyanov et al., 2016] Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In Proceedings of ICML, 2016.

[Wu et al., 2016] Yue Wu, Jun Li, Yu Kong, and Yun Fu. Deep convolutional neural network with independent softmax for large scale face recognition. In Proceedings of MM, pages 1063-1067. ACM, 2016.

[Yamaguchi et al., 2012] Kota Yamaguchi, M Hadi Kiapour, Luis E Ortiz, and Tamara L Berg. Parsing clothing in fashion photographs. In Proceedings of CVPR, pages 3570-3577. IEEE, 2012.

[Yamaguchi et al., 2013] Kota Yamaguchi, M Hadi Kiapour, and Tamara L Berg. Paper doll parsing: Retrieving similar styles to parse clothing items. In Proceedings of ICCV, pages 3519-3526, 2013.