# Dense Interspecies Face Embedding

Sejong Yang (Yonsei University, sejong.yang@yonsei.ac.kr), Subin Jeon (Yonsei University, subinjeon@yonsei.ac.kr), Seonghyeon Nam (York University, snam0331@gmail.com), Seon Joo Kim (Yonsei University, seonjookim@yonsei.ac.kr)

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

Dense Interspecies Face Embedding (DIFE) is a new direction for understanding the faces of various animals by extracting common features among animal faces, including the human face. There are three main obstacles for interspecies face understanding: (1) the lack of animal data compared to human data, (2) the ambiguous connection between the faces of different animals, and (3) extreme shape and style variance. To cope with the lack of data, we utilize multi-teacher knowledge distillation from CSE and StyleGAN2, requiring no additional data or labels. We then synthesize pseudo-paired images through latent space exploration of StyleGAN2 to find implicit associations between different animal faces. Finally, we introduce the semantic matching loss to overcome the problem of extreme shape differences between species. To quantitatively evaluate our method against possible previous methodologies such as unsupervised keypoint detection, we perform interspecies facial keypoint transfer on MAFL and AP-10K. Furthermore, results of other applications such as interspecies face image manipulation and dense keypoint transfer are provided. The code is available at https://github.com/kingsj0405/dife.

1 Introduction

Face understanding is a longstanding research topic in computer vision, and there have been extensive studies and applications such as face recognition, facial expression estimation, face photo manipulation, and virtual humans [6, 47, 4, 51, 3]. The vast majority of successful works on faces have focused on human faces due to their practical implications for a variety of applications and the accessibility of a large amount of annotated data. There have been few attempts at understanding the faces of other animal species to facilitate more applications such as ecological research, pet identification, and farm automation [14, 34, 12]. However, the task of understanding animal faces is more challenging than that for the human face due to the lack of sufficient data with annotations. In the literature on full-body understanding, joint learning of interspecies data has been studied as a remedy by transferring the knowledge of a well-annotated domain such as humans to other species [33, 26, 27]. However, the existing approaches are focused on full-body data, and therefore their performance on facial data is limited.

In this work, we are interested in the problem of learning the representation of interspecies facial semantics, which is a common representation shared across interspecies facial data. Learning such a representation allows us to extend face understanding beyond the human face, as it enables discovering facial semantics of other species without enough annotations by transferring knowledge from well-annotated human data. As many animal species share similar face geometry, the representation can also benefit from learning multiple species jointly. Since sharing photos of animals through social networks has become more popular, learning the interspecies face representation can be used for
various applications. For instance, one can search animal photos using a photo of a human face as a query to find animal faces with a similar pose. An animal face animation can be synthesized using that of a human as a template, which can be useful for media production.

Figure 1: (a) Dense Interspecies Face Embedding (DIFE) is a representation that contains common characteristics of faces of various species embedded to pixels in the feature space. (b) DIFE allows us to find visual correspondence across multiple face domains, so we can transfer facial keypoints between animals or generate different animal faces with the same posture.

To achieve our goal, we propose to learn Dense Interspecies Face Embedding (DIFE), as shown in Fig. 1(a). Our approach is inspired by CSE [27], which was proposed to learn a continuous surface embedding (CSE) of the full body shared across interspecies data. While the interspecies embedding in [27] is promising, it is difficult to use for faces, as the embedding in the facial region is too coarse to discriminate facial landmarks accurately. To overcome this problem, we propose a multi-teacher knowledge distillation method for learning DIFE. Our key idea is to enforce a student network to learn a fine-grained interspecies facial embedding from two teacher networks: CSE and StyleGAN2 [19]. Since CSE has a useful representation of interspecies data, we use it as a teacher to have the student network learn a common embedding space shared by multiple species. For a fine-grained face embedding, we simultaneously exploit StyleGAN2 as the secondary teacher. StyleGAN2 is originally trained to generate high-quality face images [19]. Recent studies have demonstrated that the features of StyleGAN2 carry abundant information on facial attributes such as postures and expressions [13, 36, 37]. Thus, the goal of our knowledge distillation is to transfer such information to the student network for learning fine-grained facial semantics for multiple species.

To further improve the performance of DIFE, we propose a data augmentation method to synthesize pseudo-paired images, and the semantic matching loss to enhance the semantic correspondence between images. For the data augmentation, we synthesize images whose face poses are similar to those of the input images using StyleGAN2. This can be achieved in our framework as DIFE is tied to the latent features of multiple StyleGAN2 networks in different domains. Thus, the DIFE of the input image can be used as a face geometry condition for the latent space exploration of StyleGAN2 to generate images for both intra- and inter-domain pairs. In addition, inspired by [39], a semantic matching loss is proposed to use the pseudo-paired data to enforce DIFE to learn pixel-level fine-grained matching.

We demonstrate the effectiveness of our method on the cross-domain keypoint transfer task using interspecies data such as people, dogs, cats, and wild animals in MAFL [50] and AP-10K [45]. We evaluate our method on both sparse and dense keypoint transfer. In addition, we apply our method to an image editing task of overlaying a virtual object on various animal faces. Our contributions can be summarized as follows:

- We present a cross-domain face understanding study that considers not only human faces but also the faces of animals.
Our approach makes it possible to avoid tedious and expensive animal data collection through domain adaptation, setting the stage for a variety of cross-domain applications.
- We present a novel multi-teacher knowledge distillation paradigm that extracts and combines the information from models with different architectures and data domains. Our framework is designed to learn a continuous face embedding across interspecies data from CSE and a fine-grained face embedding from StyleGAN2. Combining the knowledge from the two, our method learns the dense interspecies face embedding (DIFE).
- We further introduce a method for synthesizing paired data for learning semantic matching using the latent space exploration of StyleGAN2. With this novel approach, we can refine the representation of our face embedding for semantic correspondence between various instances and species.

2 Related Works

2.1 Domain-specific Face Understanding

We first review works that aim to identify the visual correspondence between various instances and scenes within one data category, human or animal. For faces, detecting facial landmarks such as the eyes and the nose has been studied extensively to find correspondences between subjects. Initial works [38, 24, 46, 49, 43, 5] find facial landmarks using facial landmark annotations in a fully supervised way. Regarding animal faces, Yang et al. [44] detected sheep facial landmarks based on 6 annotated keypoints. Although most works show impressive results under extreme conditions, the lack of annotated animal data limits their application compared to human faces. Recently, there has been active research [40, 48, 15, 39, 16, 7, 17] on finding landmarks from a collection of images in an unsupervised way, which is applicable to various domains, including the human body, face, etc. This stream of works relies on the fact that landmarks are spatially equivariant to transformations. Some works learn landmarks by comparing warped images with the original ones [40, 48], and others cast the problem as conditional image generation using video frames depicting the same instances in different poses [15, 16]. The most related work to our method is DVE [39] in that they consider not only equivariance between the same instances but also intra-category variations. All of the aforementioned methods can detect landmarks within the same object category, but the landmarks are not transferable between different object categories (e.g., humans and cats).

2.2 Interspecies Understanding

There are studies that aim to discover the relationship between animal species whose structures are related but geometrically diverse. To the best of our knowledge, [32] is the only work that conducted semantic matching between animal faces. An animal facial landmark detector is trained by transferring the knowledge from a pre-trained human landmark detector, and the network is fine-tuned using horse and sheep data with 5-keypoint annotations. In comparison, we propose an unsupervised approach for interspecies facial semantic matching with higher keypoint granularity. Closely relevant to our work is the stream of works in [33, 26, 27], as they find the relationship of body poses between animals. Sanakoyeu et al. [33] transfer the knowledge of the human body to the chimpanzee using DensePose [10]. CSE [26] utilizes manually annotated 3D meshes of each category and a continuous surface embedding to relate several animal categories flexibly.
In the follow-up work [27], they utilize the cycle-consistency between meshes and images to further improve the performance. Although they show impressive results on the complete body, they have difficulty in achieving fine-level sharpness on faces, making it impossible to discern different parts of faces. While our work also takes advantage of CSE [27] to detect interspecies correspondences, we improve the accuracy of facial correspondence with the help of the latent features of StyleGAN2 [19] without additional annotations.

2.3 StyleGAN2 Utilization

Our work is also related to StyleGAN utilization works in that we use features of a pre-trained StyleGAN2 [19] to obtain fine-grained facial features. StyleGANs [18, 19] show highly photo-realistic results on the unconditional image generation problem with the power of well-disentangled latent spaces. There have been abundant works that exploit a pre-trained StyleGAN for various tasks, including image manipulation [41, 13, 37, 35], 3D reconstruction [30], image segmentation [22, 2], and semantic matching [31]. Because StyleGAN2 features contain rich face geometry information, many works try to extract that information with knowledge distillation [2, 29, 41]. Labels4Free [2] and Segmentation in Style [29] utilize the features of StyleGAN2 to train segmentation models in an unsupervised manner. In StyleGAN2 distillation [41], images and labels are generated with the StyleGAN2 generator and a face attribute classifier, and an image manipulation model is trained on the synthesized data. However, to the best of our knowledge, no previous study has used StyleGAN models trained on different data domains at the same time, whereas our work learns the relationship between separate StyleGAN2 models that have different data domains.

Figure 2: (a) Our encoder learns the cross-domain association and the face geometry from the responses and the intermediate features of CSE and multiple StyleGAN2s. The converters for each domain are necessary to generate the domain-specific embeddings that are compatible with the intermediate features of the StyleGAN2s. (b) Pseudo-paired data of different domains can be synthesized with the domain-specific embedding and the latent space exploration of StyleGAN2. The semantic matching loss is utilized to overcome the shape difference between faces of different species.

3 Method

Let $x^d \in \mathbb{R}^{H \times W \times 3}$ denote an image of a face in a domain $d \in \{0, 1, \dots, D-1\}$, where $H$ and $W$ are the height and the width of the image and $D$ denotes the number of domains. The goal of our task is to obtain a dense interspecies face embedding $E \in \mathbb{R}^{H \times W \times n}$ of the image $x^d$ through an encoder $\Phi$, which is described as $E = \Phi(x^d)$. Here, $n$ is the embedding dimension of the dense interspecies face embedding. To train the encoder $\Phi$ in an unsupervised manner, we propose two training approaches: the multi-teacher knowledge distillation, and the semantic matching through pseudo-paired data synthesis. Since all images and labels are synthesized by the teacher models, whose training data are not suitable for our task, the training can be considered unsupervised.
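To make the notation concrete, below is a minimal PyTorch-style sketch of the encoder $\Phi$ and the per-domain converters $\Psi^d$. The placeholder backbone, the module sizes, the embedding dimension $n = 64$, and the image resolution are illustrative assumptions, not the actual implementation (the paper's encoder is an hourglass network, see Sec. 4.1).

```python
import torch
import torch.nn as nn

class DIFEEncoder(nn.Module):
    """Phi: face image x^d of shape (B, 3, H, W) -> DIFE E of shape (B, n, H, W).
    A tiny stand-in CNN; the paper uses an hourglass network as the backbone."""
    def __init__(self, n=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, n, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)


class DomainConverter(nn.Module):
    """Psi^d: DIFE (B, n, H, W) -> domain-specific embedding D^d (B, n_d, H, W),
    with n_d matching the channel width of the chosen StyleGAN2 feature layer.
    The paper describes three consecutive 7x7 convolutional layers."""
    def __init__(self, n=64, n_d=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n, n_d, 7, padding=3), nn.ReLU(),
            nn.Conv2d(n_d, n_d, 7, padding=3), nn.ReLU(),
            nn.Conv2d(n_d, n_d, 7, padding=3),
        )

    def forward(self, e):
        return self.net(e)


# One shared encoder, one converter per domain d (hypothetical domain names).
encoder = DIFEEncoder(n=64)
converters = nn.ModuleDict({d: DomainConverter(64, 512) for d in ["human", "dog", "cat"]})

x = torch.randn(1, 3, 64, 64)   # a face image x^d (illustrative resolution)
E = encoder(x)                   # DIFE, shape (1, 64, 64, 64)
D_dog = converters["dog"](E)     # domain-specific embedding D^d, shape (1, 512, 64, 64)
```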
In the following, we describe our dense interspecies face embedding (DIFE) and its training in detail.

3.1 Multi-Teacher Knowledge Distillation

Fig. 2(a) illustrates the overview of the multi-teacher knowledge distillation on multiple domains consisting of humans, dogs, and cats. Desirable properties of the interspecies face embedding are twofold. First, the representation is shared by various face images of different domains; faces in multiple domains should have similar embeddings if their geometry is similar. Second, the embedding contains fine-grained details of facial semantics, so that different parts of a face are distinguishable.

To learn such properties, we first train DIFE by distilling knowledge from CSE [27]. As the representation of CSE provides high-quality co-embeddings of diverse animal species, we exploit it to learn a common space for cross-domain faces. Specifically, we optimize a knowledge distillation loss that minimizes the distance between the DIFE and the CSE embedding of the same input. Formally, the loss is described as

$$
L^{d}_{K1} = \sum \left\| C^{d} - \Phi(x^{d}) \right\|^{2}, \qquad (1)
$$

where $C^{d}$ is the CSE embedding of $x^{d}$. Note that $x^{d}$ is an image synthesized by StyleGAN2, denoted by the dotted arrow in Fig. 2(a). Although the above loss enforces our embedding to learn a shared space for multiple domains of faces, the representation of CSE is coarse in the region of the face since it is originally learned to distinguish different parts of a full body. To further learn fine-grained details of faces, we add another knowledge distillation loss using the features of StyleGAN2. Specifically, we exploit the semantic information of faces extracted from a pre-trained StyleGAN2 to enhance the representation of DIFE. Using synthetic images generated by the StyleGAN2 as input, we minimize the distance between the embedding of the input and its corresponding StyleGAN2 feature map, which is formulated as

$$
L^{d}_{K2} = \sum \left\| F^{d} - D^{d} \right\|^{2}, \qquad (2)
$$

where $F^{d}$ is the StyleGAN2 feature map, and $D^{d}$ is the domain-specific embedding of $x^{d}$ for the domain $d$. Since the dimension of DIFE is different from that of StyleGAN2, and each StyleGAN2 feature lives in a different domain, we convert DIFE to a domain-specific embedding using a domain converter $\Psi^{d}$ as follows: $D^{d} = \Psi^{d}(E^{d})$. When learning from multiple domains, we optimize $L_{K1} = \sum_{d}^{D} L^{d}_{K1}$ and $L_{K2} = \sum_{d}^{D} L^{d}_{K2}$ over all of the domains.

3.2 Semantic Matching

Pseudo-paired data synthesis. Fig. 2(b) shows how to synthesize pseudo-paired data $x^{d'}$, which has identical face geometry to the original image $x^{d}$. The paired domain $d'$ can be different from the original domain $d$. From a given human face image $x^{d}$ generated by StyleGAN2, we first extract DIFE using the encoder $\Phi$. Then, with the domain converter $\Psi^{d'}$, we transform the interspecies embedding into a domain-specific embedding $D^{d'}$; in Fig. 2(b), we convert DIFE into the dog-specific embedding. Note that the domain-specific embedding lies in a vector space similar to that of the StyleGAN2 features, since the domain converter is trained by the StyleGAN2 knowledge distillation. Finally, we synthesize a pseudo-paired image $x^{d'}$ by feeding the domain-specific embedding $D^{d'}$ to StyleGAN2. Rather than directly feeding the embedding, we introduce a latent space exploration method, which generates more realistic images by adjusting the latent code of the domain-specific embedding toward the natural image manifold.
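As a concrete illustration of the two distillation terms in Eqs. (1) and (2) of Sec. 3.1, the following is a minimal PyTorch-style sketch; the frozen teachers `cse_teacher` and `stylegan_feat` are placeholder callables assumed to return per-pixel feature maps, and the bilinear resizing used to align spatial resolutions is our assumption rather than a documented detail.

```python
import torch
import torch.nn.functional as F

def distillation_losses(x, domain, encoder, converters, cse_teacher, stylegan_feat):
    """Sketch of Eqs. (1) and (2) for one StyleGAN2-synthesized image x of `domain`."""
    E = encoder(x)                # DIFE Phi(x^d), (B, n, H, W)
    D = converters[domain](E)     # domain-specific embedding D^d, (B, n_d, H, W)

    with torch.no_grad():         # teachers are frozen
        C = cse_teacher(x)        # CSE embedding C^d (assumed to share E's channel width)
        Fd = stylegan_feat(x)     # StyleGAN2 feature map F^d for domain d

    # Align spatial sizes before the squared-L2 comparisons (an implementation assumption).
    C = F.interpolate(C, size=E.shape[-2:], mode="bilinear", align_corners=False)
    Fd = F.interpolate(Fd, size=D.shape[-2:], mode="bilinear", align_corners=False)

    loss_k1 = ((E - C) ** 2).sum(dim=1).mean()    # Eq. (1): distillation from CSE
    loss_k2 = ((D - Fd) ** 2).sum(dim=1).mean()   # Eq. (2): distillation from StyleGAN2
    return loss_k1, loss_k2
```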
In the latent space exploration, given a domain-specific embedding $D^{d'}$, the goal is to find a latent code $w^{d'} \in \mathcal{W}^{d'}$ that satisfies the following two conditions. Here, $\mathcal{W}^{d'}$ is the latent space of the StyleGAN2 trained on the data domain $d'$. First, the pseudo-paired data $x^{d'}$ generated from the found latent code $w^{d'}$ should have the same face geometry as the original image $x^{d}$. This is achieved by finding a $w^{d'}$ with a small difference between the StyleGAN2 feature $F^{d'}(w^{d'})$ generated from $w^{d'}$ and the domain-specific embedding $D^{d'}$. Second, the pseudo-paired data needs to be realistic and natural. To this end, following [1], we take the mean latent code $\bar{w}$ and enforce our latent code to be similar to $\bar{w}$, where $\bar{w}$ is calculated by averaging 4096 sampled latent codes. With this regularization, the generated image looks natural, as the latent code is located on the realistic image manifold. Finally, the latent code $w^{d'}$ is found by optimizing the following equation:

$$
w^{d'} = \arg\min_{w \in \mathcal{W}^{d'}} \left[ \left\| F^{d'}(w) - D^{d'} \right\|^{2} + \left\| w - \bar{w} \right\|^{2} \right]. \qquad (3)
$$

Here, the feature map $F^{d'}(w)$ is extracted from StyleGAN2 by feeding the latent code $w$ being optimized. After the optimization, the pseudo-paired data $x^{d'}$ is synthesized by StyleGAN2 with $w^{d'}$.

Semantic matching loss. The purpose of the semantic matching loss is to boost the matching performance of DIFE at the pixel level with the synthesized pseudo-paired data. The pixel location $u \in \Omega$ on the 2D Cartesian coordinate system $\Omega = \{0, 1, \dots, H-1\} \times \{0, 1, \dots, W-1\}$ is used to explain the loss. In principle, the pixel embedding $E^{d}_{u}$ at location $u$ in the DIFE of the original image has the same semantic meaning as the corresponding pixel embedding $E^{d'}_{u}$ of the pseudo-paired image. However, the shape difference between domains makes it infeasible to exactly match the pixel locations. For example, the eyes of the original image $x^{d}$ and the eyes of the pseudo-paired data $x^{d'}$ in Fig. 2(b) do not perfectly align. To alleviate this issue, we introduce a soft distance measure similar to [39]. Intuitively, we compare the embedding of a pixel on the original image to the weighted embedding of a small region on the pseudo-paired image, as depicted by the contour in Fig. 2(b). We find the region to compare based on the similarity $\sigma(E^{d}, E^{d'}, u_i, u_j)$ between the pixel embeddings:

$$
\sigma(E^{d}, E^{d'}, u_i, u_j) = \frac{E^{d}_{u_i} \cdot E^{d'}_{u_j}}{\left\| E^{d}_{u_i} \right\| \left\| E^{d'}_{u_j} \right\|}, \qquad (4)
$$

where $u_i$ and $u_j$ are the pixel coordinates of $E^{d}$ and $E^{d'}$, respectively. Finally, the semantic matching loss $L^{d,d'}_{M}$ penalizes the distance between each location $u_i$ and its similarity-weighted soft correspondence on the pseudo-paired image:

$$
L^{d,d'}_{M} = \sum_{u_i \in \Omega} \Big\| \, u_i - \sum_{u_j \in \Omega} \sigma(E^{d}, E^{d'}, u_i, u_j)\, u_j \, \Big\|. \qquad (5)
$$

When learning from multiple domains, we optimize the intra-domain semantic matching loss $L_{M1} = \sum_{d}^{D} L^{d,d}_{M}$ and the inter-domain semantic matching loss $L_{M2} = \sum_{d}^{D} \sum_{d' \neq d}^{D} L^{d,d'}_{M}$ for all permutations of the domains.

3.3 Training Objective

We formulate our final training objective with hyper-parameters as:

$$
L = \lambda_1 L_{K1} + \lambda_2 L_{K2} + \lambda_3 L_{M1} + \lambda_4 L_{M2}. \qquad (6)
$$

The matching losses $L_{M1}$ and $L_{M2}$ cannot be applied in the early stage of training, since the domain-specific embedding is still different from the features of the StyleGAN2s. Therefore, we first impose $L_{K1}$ and $L_{K2}$ on the encoder and domain converters, and after this initialization we start to generate pseudo-paired data and utilize the semantic matching losses $L_{M1}$ and $L_{M2}$.

4 Experiments

4.1 Interspecies Keypoint Transfer

Given an image in the source domain with keypoints, the goal of interspecies keypoint transfer is to find the corresponding locations of the keypoints on another image in the target domain.
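A minimal sketch of how such a transfer can be carried out with DIFE is given below. It assumes that each annotated source pixel is mapped to the target pixel whose embedding is most similar under the cosine measure of Eq. (4); this nearest-neighbor rule is our illustrative reading of the pixel-level correspondence search described next, not necessarily the exact procedure used for the reported numbers.

```python
import torch
import torch.nn.functional as F

def transfer_keypoints(E_src, E_tgt, kps_src):
    """Transfer keypoints by matching DIFE pixel embeddings.
    E_src, E_tgt: DIFE maps of shape (n, H, W) for the source and target images.
    kps_src: (K, 2) integer keypoint locations (y, x) on the source image.
    Returns a (K, 2) tensor of locations on the target image."""
    n, H, W = E_tgt.shape
    tgt = F.normalize(E_tgt.reshape(n, -1), dim=0)   # unit-norm embedding per target pixel
    kps_tgt = []
    for y, x in kps_src.tolist():
        q = F.normalize(E_src[:, y, x], dim=0)        # query embedding of the source keypoint
        sim = q @ tgt                                 # cosine similarity to every target pixel
        idx = int(sim.argmax())                       # nearest neighbor in embedding space
        kps_tgt.append((idx // W, idx % W))
    return torch.tensor(kps_tgt)

# Usage (hypothetical): E_src = encoder(x_human)[0]; E_tgt = encoder(x_dog)[0]
# transferred = transfer_keypoints(E_src, E_tgt, torch.tensor([[30, 20], [30, 44]]))
```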
We mainly evaluate our method on this task to demonstrate that our embedding is not only a high-quality representation of cross-domain faces, but also beneficial for knowledge transfer to unlabeled domains. The task is basically achieved by finding the pixel-level correspondence between the embeddings of the source and the target images. The performance is quantitatively evaluated by calculating the error between the positions of the transferred keypoints and their corresponding ground truth. We further describe the mathematical formulation of the evaluation in detail in the supplementary material.

Datasets. At training time, DIFE is trained with synthetic images and their target labels generated by the pre-trained CSE and StyleGAN2 networks. We use four datasets for evaluation: MAFL [50], AP-10K [45], WFLW [42], and AnimalWeb [20]. MAFL is a dataset of human faces composed of 20k images; we use its test set excluding the images shared with the CelebA training set (400 images). The AP-10K dataset is composed of 10k images of various animal species, split into three disjoint subsets, i.e., train, validation, and test sets, with a ratio of 7:1:2 per animal species. As the images contain the full bodies of animals, we crop the face regions by computing bounding boxes from the facial landmarks in the dataset. For the evaluation, we make pairs of two different species by randomly selecting images from each domain. As a result, we collect 100 pairs of humans and dogs, 89 pairs of humans and cats, 47 pairs of humans and wild animals, and 89 pairs of dogs and cats. The same preprocessing is applied to WFLW and AnimalWeb to evaluate our method on denser keypoint annotations, including the mouth.

Table 1: Quantitative results of interspecies keypoint transfer on MAFL and AP-10K. The pixel error is obtained by calculating the Euclidean distance between the ground truth and the transferred keypoints. * indicates that the model was trained by the authors of this paper, as the pre-trained weights are not provided. † indicates that the data domains for training and evaluation do not match.

| Method | Human+Dog | Human+Cat | Dog+Cat | Human+Dog+Cat | Human+Wild |
|---|---|---|---|---|---|
| CSE | 19.00 | 18.15 | 8.00 | 15.20 | 17.69 |
| DVE* | 17.78 | 17.40 | 6.75 | 14.34 | 15.58 |
| CATs† | 14.78 | 17.63 | 8.47 | 13.67 | 14.05 |
| Ours (DIFE) | 11.73 | 11.00 | 6.51 | 13.16 | 10.37 |

Figure 3: Qualitative results of interspecies keypoint transfer on MAFL and AP-10K.

Implementation details. We pre-train CSE with DensePose-COCO [23] and DensePose-LVIS [11], which are datasets of full-body keypoints for humans and animals, respectively. StyleGAN2 is pre-trained with FFHQ [19] for humans and AFHQ [9] for animals. The DIFE encoder uses an hourglass network [28] for a fair comparison with DVE [39], and the domain converter consists of three consecutive 7×7 convolutional layers. We use the Adam optimizer [21] with a learning rate of $10^{-3}$, a batch size of 12, and a maximum of 1M training steps with early stopping. The lambda values for the loss functions are fixed to $\lambda_1 = 10^{-2}$, $\lambda_2 = 100$, $\lambda_3 = 10^{-2}$, and $\lambda_4 = 10^{-2}$ in all experiments. All experiments are carried out on a single NVIDIA Titan XP.

Baselines. We compare our method with three baselines: CSE [27], DVE [39], and CATs [8]. Even though CSE is originally learned for full-body embeddings of interspecies data, we compare against it to demonstrate the effectiveness of our method on face images.
DVE is an unsupervised learning approach for learning pixel-level facial embeddings, designed for human faces in the original work. For the comparison, we train the method with interspecies data: both the AP-10K training set and images synthesized by StyleGAN2 were tried, and the latter was selected for its better performance. We implemented the baselines using the original source code available on their websites with the default hyperparameters. CATs is a state-of-the-art method for finding visual semantic correspondence. We use the pre-trained model from the original paper trained on SPair-71k [25], which contains annotations including the landmarks of dogs, cats, and other animals.

Quantitative results. Table 1 presents the quantitative results of CSE, DVE, CATs, and DIFE on humans, dogs, cats, and wild animals, showing that our method achieves better performance in all data domains.

Table 2: Interspecies keypoint transfer (pixel error) on WFLW [42] and AnimalWeb [20].

| Method | Human+Dog | Human+Cat | Dog+Cat | Human+Wild |
|---|---|---|---|---|
| CSE | 14.43 | 14.95 | 16.85 | 16.08 |
| DVE | 14.16 | 13.28 | 13.40 | 15.74 |
| CATs | 20.84 | 18.84 | 19.35 | 22.20 |
| Ours | 12.01 | 12.84 | 11.70 | 14.03 |

As can be seen in the table, transferring between humans and other animals is more challenging due to the larger shape differences between human and animal faces. Compared to the other baselines, DIFE makes a significant improvement with the help of the semantic matching loss. The errors of all methods are lower for Dog+Cat, with DIFE still performing better than the other baselines. We have also tested a challenging case of keypoint transfer between three domains by training the models at once with humans, dogs, and cats, in which DIFE also outperforms the other baselines. For the Human+Wild case, we utilized the model trained with Human+Cat, because it was difficult to train with humans and wild animals due to the extreme shape and style variance. Since the wild animal domain includes animals that resemble cats, such as tigers, lions, and cheetahs, the model trained with humans and cats performed reasonably well, showing that our method is applicable to out-of-domain cases. Our method also outperforms CATs, which indicates the applicability of our method to discovering landmarks in unlabeled animal domains. Table 2 shows the quantitative results of interspecies keypoint transfer on WFLW [42] and AnimalWeb [20] with 9 landmarks, including the corners of the eyes and mouth. Our method shows the best performance compared to the previous methods on every domain pair.

Qualitative results. Fig. 3 shows qualitative results of the keypoint transfer. As expected, CSE has difficulty in finely distinguishing different parts of faces, as it has a limited face prior for animals due to the sparse annotation of DensePose-LVIS. While DVE tends to recognize the eyes and noses to some extent, it does not properly overcome the shape difference, showing some mismatches. DIFE, with its ability to understand both cross-domain association and face embedding, can find accurate visual correspondences on interspecies faces. More visualizations of the interspecies keypoint transfer, including the results on WFLW and AnimalWeb, can be found in the appendix. Fig. 4 visualizes the embeddings of different methods. CSE shows similar values across domains, but less distinction between different parts of a face.
On the other hand, StyleGAN2 features exhibit distinctiveness between different parts of a face, but corresponding parts in different domains show a large difference in values, making it difficult to align the domains. DIFE makes the best of both worlds, showing distinctive values within a face while having similar values for corresponding face parts across different domains.

Figure 4: Visualization of embeddings.

Figure 5: Visualization of features of StyleGAN2 (coarse to fine layers).

Table 3: Ablation study on the losses $L_{K1}$, $L_{K2}$, $L_{M1}$, and $L_{M2}$ (pixel error for different combinations of the losses: 19.33, 13.59, 11.73, 17.45, –, 13.03, 12.22).

Table 4: Quantitative results of choosing different layers of StyleGAN2 for the knowledge distillation.

| Layer | Dimension $n_d$ | Pixel Error |
|---|---|---|
| Layer 5 | 512 | 13.80 |
| Layer 7 | 512 | 11.73 |
| Layer 9 | 256 | 14.78 |

Figure 6: The result of image manipulation.

Figure 7: The face parsing results based on DIFE.

4.2 Ablation Study

The effect of loss functions. Table 3 shows an ablation study to verify the role of each loss. Pixel errors are reduced by adding each loss, and the absence of any loss leads to a performance drop. As we cannot synthesize pseudo-paired data without the knowledge distillation of StyleGAN2, the result for the absence of $L_{K2}$ is not provided. The absence of $L_{K1}$ leads to a large performance drop of DIFE, and we find that the knowledge distillation from CSE is essential in the early stages of learning. The semantic matching loss also helps to increase the performance by learning finer visual correspondences across the data domains.

Which layer in StyleGAN2 is good for the knowledge distillation? Table 4 shows the results of selecting different layers of StyleGAN2 for acquiring the features. Fig. 5 shows that the features from higher layers are more fine-grained but domain-specific, which confuses the domain converter. Also, because the dimension of the domain-specific embedding follows the dimension $n_d$ of the StyleGAN2 features, selecting the ninth layer limits the representation power of the domain embedding. On the other hand, the features from lower layers are easier for the knowledge distillation, with the abundant representation power of a large $n_d$, but the information about the face is limited because of their coarseness. With this ablation study, we selected the seventh layer of StyleGAN2 to learn the face prior of each data domain.

4.3 Applications

Image manipulation. Fig. 6 presents image manipulation examples based on DIFE. Putting objects like glasses and masks on a face requires the positions of face parts, and the results show that DIFE finds the exact locations to place the objects in the images, even on animal faces.

Interspecies face parsing. Fig. 7 shows the results of interspecies face parsing based on DIFE. Following Segmentation in Style [29], we use k-means clustering on DIFE for unsupervised face parsing. The eyes, nose, mouth, and hairy parts are discovered with this simple method, which shows that DIFE carries proper semantic information for interspecies faces.

Dense keypoint transfer. Although DIFE extracts embeddings from all pixels, we cannot accurately evaluate all visual correspondences due to the lack of annotations. However, we can examine denser visual correspondences qualitatively by transferring generated grid keypoints. In Fig. 8, the results of transferring dense keypoints from a human source to another human, dog, cat, and wild animal targets can be found.
Figure 8: The result of dense keypoint transfer.

Figure 9: Examples of pseudo-paired data.

Pseudo-paired data synthesis. Fig. 9 shows examples of pseudo-paired data in various data domains and postures. Given original images, we perform pseudo-paired data synthesis in both the intra-domain and inter-domain settings. We can see that the generated paired images have similar poses to the original ones, while having minor texture variations.

5 Conclusion

In this paper, we introduced a study for extracting common features from the faces of several species, including humans, as a dense embedding. In this process, we presented a mechanism for extracting and learning the desired information from separate teacher models, even though the teacher models have different architectures and data domains. Furthermore, we proposed a new data augmentation technique based on the latent space exploration of StyleGAN2, together with the semantic matching loss, to overcome the extreme shape difference between species. Various experiments and applications were presented for evaluation, and our approach was shown to be superior to the other baselines. As for future research directions, we need to deal with more than three animal domains efficiently, which is not addressed in this paper due to the absence of pre-trained StyleGAN2 models. Research on generalization methods for out-of-domain data is also needed to address situations where data cannot be obtained or the animal species is unknown. Our method is dependent on the performance of the pre-trained CSE and StyleGAN2. Lastly, there are typical face understanding failure cases such as occlusion or dark illumination. As this is the first work to tackle the difficult problem of interspecies face embedding, more follow-up studies are necessary to overcome the current limitations. Nevertheless, we believe that this work has set the stage for a new research topic of cross-domain study for interspecies face understanding.

6 Broader Impact

This study aims to understand interspecies faces, which can potentially be extended to various applications. For example, our method can assist in finding lost pets or checking the health status of farm animals by analyzing their facial expressions. Identifying wild animals for ecological experiments is another expected application of our method. Moreover, the proposed method can potentially be extended to automatic motion capture, which can be used in film production or content creation for AR/VR. We note that studies using DIFE could potentially violate personal privacy or animal rights. Studies extending DIFE should carefully consider these potential misuses and bring positive impacts to society.

7 Acknowledgments

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00124, Development of Artificial Intelligence Technology for Self-Improving Competency-Aware Learning Capabilities, and No. 2020-0-01361, Artificial Intelligence Graduate School Program (Yonsei University)).

References

[1] Abdal, R., Qin, Y., Wonka, P.: Image2stylegan++: How to edit the embedded images? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8296–8305 (2020)
[2] Abdal, R., Zhu, P., Mitra, N.J., Wonka, P.: Labels4free: Unsupervised segmentation using stylegan.
In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13970–13979 (2021)
[3] Abdal, R., Zhu, P., Mitra, N.J., Wonka, P.: Styleflow: Attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows. ACM Transactions on Graphics (TOG) 40(3), 1–21 (2021)
[4] Barsoum, E., Zhang, C., Canton Ferrer, C., Zhang, Z.: Training deep networks for facial expression recognition with crowd-sourced label distribution. In: ACM International Conference on Multimodal Interaction (ICMI) (2016)
[5] Bulat, A., Tzimiropoulos, G.: How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2017)
[6] Cao, Q., Shen, L., Xie, W., Parkhi, O.M., Zisserman, A.: Vggface2: A dataset for recognising faces across pose and age. In: IEEE International Conference on Automatic Face & Gesture Recognition (F&G). pp. 67–74. IEEE (2018)
[7] Cheng, Z., Su, J.C., Maji, S.: On equivariant and invariant learning of object landmark representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9897–9906 (2021)
[8] Cho, S., Hong, S., Jeon, S., Lee, Y., Sohn, K., Kim, S.: Cats: Cost aggregation transformers for visual correspondence. Advances in Neural Information Processing Systems (NIPS) 34 (2021)
[9] Choi, Y., Uh, Y., Yoo, J., Ha, J.W.: Stargan v2: Diverse image synthesis for multiple domains. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
[10] Güler, R.A., Neverova, N., Kokkinos, I.: Densepose: Dense human pose estimation in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7297–7306 (2018)
[11] Gupta, A., Dollar, P., Girshick, R.: Lvis: A dataset for large vocabulary instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
[12] Hansen, M.F., Smith, M.L., Smith, L.N., Salter, M.G., Baxter, E.M., Farish, M., Grieve, B.: Towards on-farm pig face recognition using convolutional neural networks. Computers in Industry 98, 145–152 (2018)
[13] Härkönen, E., Hertzmann, A., Lehtinen, J., Paris, S.: Ganspace: Discovering interpretable gan controls. Advances in Neural Information Processing Systems (NIPS) 33, 9841–9850 (2020)
[14] Hummel, H.I., Pessanha, F., Salah, A.A., van Loon, T.J., Veltkamp, R.C.: Automatic pain detection on horse and donkey faces. In: 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020). pp. 793–800. IEEE (2020)
[15] Jakab, T., Gupta, A., Bilen, H., Vedaldi, A.: Unsupervised learning of object landmarks through conditional image generation. Advances in Neural Information Processing Systems (NIPS) 31 (2018)
[16] Jakab, T., Gupta, A., Bilen, H., Vedaldi, A.: Self-supervised learning of interpretable keypoints from unlabelled videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8787–8797 (2020)
[17] Jiang, J., Ji, Y., Wang, X., Liu, Y., Wang, J., Long, M.: Regressive domain adaptation for unsupervised keypoint detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6780–6789 (2021)
[18] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4401–4410 (2019)
[19] Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of stylegan. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8110–8119 (2020)
[20] Khan, M.H., McDonagh, J., Khan, S., Shahabuddin, M., Arora, A., Khan, F.S., Shao, L., Tzimiropoulos, G.: Animalweb: A large-scale hierarchical dataset of annotated animal faces. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6939–6948 (2020)
[21] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. International Conference for Learning Representations (ICLR) (2015)
[22] Li, D., Yang, J., Kreis, K., Torralba, A., Fidler, S.: Semantic segmentation with generative models: Semi-supervised learning and strong out-of-domain generalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
[23] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Proceedings of European Conference on Computer Vision (ECCV) (2014)
[24] Lv, J., Shao, X., Xing, J., Cheng, C., Zhou, X.: A deep regression architecture with two-stage reinitialization for high performance facial landmark detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3317–3326 (2017)
[25] Min, J., Lee, J., Ponce, J., Cho, M.: Spair-71k: A large-scale benchmark for semantic correspondence. arXiv preprint arXiv:1908.10543 (2019)
[26] Neverova, N., Novotny, D., Khalidov, V., Szafraniec, M., Labatut, P., Vedaldi, A.: Continuous surface embeddings (2020)
[27] Neverova, N., Sanakoyeu, A., Novotny, D., Labatut, P., Vedaldi, A.: Discovering relationships between object categories via universal canonical maps (2021)
[28] Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Proceedings of European Conference on Computer Vision (ECCV). pp. 483–499. Springer (2016)
[29] Pakhomov, D., Hira, S., Wagle, N., Green, K.E., Navab, N.: Segmentation in style: Unsupervised semantic image segmentation with stylegan and clip. arXiv preprint arXiv:2107.12518 (2021)
[30] Pan, X., Dai, B., Liu, Z., Loy, C.C., Luo, P.: Do 2d gans know 3d shape? unsupervised 3d shape reconstruction from 2d image gans. In: International Conference for Learning Representations (ICLR) (2021)
[31] Peebles, W., Zhu, J.Y., Zhang, R., Torralba, A., Efros, A., Shechtman, E.: Gan-supervised dense visual alignment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
[32] Rashid, M., Gu, X., Jae Lee, Y.: Interspecies knowledge transfer for facial keypoint detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6894–6903 (2017)
[33] Sanakoyeu, A., Khalidov, V., McCarthy, M.S., Vedaldi, A., Neverova, N.: Transferring dense pose to proximal animal classes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5233–5242 (2020)
[34] Schneider, S., Taylor, G.W., Kremer, S.C.: Similarity learning networks for animal individual re-identification - beyond the capabilities of a human observer. In: IEEE International Conference on Automatic Face & Gesture Recognition (F&G). pp.
44–52 (2020)
[35] Shen, Y., Yang, C., Tang, X., Zhou, B.: Interfacegan: Interpreting the disentangled face representation learned by gans. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2020)
[36] Shen, Y., Zhou, B.: Closed-form factorization of latent semantics in gans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1532–1540 (2021)
[37] Shoshan, A., Bhonker, N., Kviatkovsky, I., Medioni, G.: Gan-control: Explicitly controllable gans. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 14083–14093 (2021)
[38] Sun, Y., Wang, X., Tang, X.: Deep convolutional network cascade for facial point detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3476–3483 (2013)
[39] Thewlis, J., Albanie, S., Bilen, H., Vedaldi, A.: Unsupervised learning of landmarks by descriptor vector exchange. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 6361–6371 (2019)
[40] Thewlis, J., Bilen, H., Vedaldi, A.: Unsupervised learning of object landmarks by factorized spatial embeddings. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 5916–5925 (2017)
[41] Viazovetskyi, Y., Ivashkin, V., Kashin, E.: Stylegan2 distillation for feed-forward image manipulation. In: Proceedings of European Conference on Computer Vision (ECCV). pp. 170–186. Springer (2020)
[42] Wu, W., Qian, C., Yang, S., Wang, Q., Cai, Y., Zhou, Q.: Look at boundary: A boundary-aware face alignment algorithm. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2129–2138 (2018)
[43] Xiao, S., Feng, J., Xing, J., Lai, H., Yan, S., Kassim, A.: Robust facial landmark detection via recurrent attentive-refinement networks. In: Proceedings of European Conference on Computer Vision (ECCV). pp. 57–72. Springer (2016)
[44] Yang, H., Zhang, R., Robinson, P.: Human and sheep facial landmarks localisation by triplet interpolated features. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 1–8. IEEE (2016)
[45] Yu, H., Xu, Y., Zhang, J., Zhao, W., Guan, Z., Tao, D.: Ap-10k: A benchmark for animal pose estimation in the wild. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NIPS Datasets and Benchmarks) (2021)
[46] Zhang, J., Shan, S., Kan, M., Chen, X.: Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment. In: Proceedings of European Conference on Computer Vision (ECCV). pp. 1–16. Springer (2014)
[47] Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23(10), 1499–1503 (2016)
[48] Zhang, Y., Guo, Y., Jin, Y., Luo, Y., He, Z., Lee, H.: Unsupervised discovery of object landmarks as structural representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2694–2703 (2018)
[49] Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Facial landmark detection by deep multi-task learning. In: Proceedings of European Conference on Computer Vision (ECCV). pp. 94–108. Springer (2014)
[50] Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Learning deep representation for face alignment with auxiliary attributes.
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2015)
[51] Zhang, Z., Song, Y., Qi, H.: Age progression/regression by conditional adversarial autoencoder. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5810–5818 (2017)