
Automated Synthetic-to-Real Generalization

Wuyang Chen¹, Zhiding Yu², Zhangyang Wang¹, Anima Anandkumar²,³

Abstract

Models trained on synthetic images often face degraded generalization to real data. As a convention, these models are often initialized with ImageNet pretrained representations. Yet the role of ImageNet knowledge is seldom discussed despite common practices that leverage this knowledge to maintain the generalization ability. An example is the careful hand-tuning of early stopping and layer-wise learning rates, which is shown to improve synthetic-to-real generalization but is also laborious and heuristic. In this work, we explicitly encourage the synthetically trained model to maintain similar representations with the ImageNet pretrained model, and propose a learning-to-optimize (L2O) strategy to automate the selection of layer-wise learning rates. We demonstrate that the proposed framework can significantly improve the synthetic-to-real generalization performance without seeing and training on real data, while also benefiting downstream tasks such as domain adaptation. Code is available at: https://github.com/NVlabs/ASG.

1. Introduction

Training a deep convolutional neural network (DCNN) can require large amounts of labeled data in computer vision tasks such as segmentation (Ros et al., 2016; Richter et al., 2016; 2017), depth/flow estimation (Dosovitskiy et al., 2015; Mayer et al., 2016; Gaidon et al., 2016), object detection (Johnson-Roberson et al., 2016), visual navigation (Savva et al., 2019), and grasping (Coumans & Bai, 2016). When there is label scarcity, a popular approach is to resort to training with synthetic images, where full supervision can be obtained at a low cost. This finds applications in label-scarce domains such as robotics and autonomous driving where simulation can play an important role.

¹Texas A&M University ²NVIDIA ³California Institute of Tech. Correspondence to: Wuyang Chen <wuyang.chen@tamu.edu>, Zhiding Yu <zhidingy@nvidia.com>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

[Figure 1: plot over synthetic training epochs (x-axis: Epoch (Synthetic Training), 0 to 25). Curves: Small LR for all layers; Large LR for all layers; Train FC only (backbone fixed); Small LR for backbone, large LR for FC; IBN-Net.]

Figure 1. Both heuristic solutions (early stopping, small learning rates, etc.) and recent works (e.g. IBN-Net (Pan et al., 2018)) suffer from poor generalization in synthetic-to-real transfer learning because of the huge appearance gap between the source and the target domain. Here, we studied different learning rates (LR) or optimization strategies for the backbone and the last fully-connected classification layer (FC). All settings start with an ImageNet-pretrained backbone and a randomly initialized classification layer. Please see section 3.2 for experiment details.

However, there are many challenges in training with synthetic images. Models trained on synthetic images often suffer from degraded generalization on the real domain. Such a domain gap is usually caused by limitations in rendering quality, including unrealistic texture, appearance, illumination, and scene layout. As a result, networks are prone to overfitting to the synthetic domain, with learned representations that differ from those obtained on real images. To this end, domain generalization methods (Li et al., 2017; Pan et al., 2018; Yue et al.) have been proposed to overcome the above domain gaps and improve model generalization on real target domains.

Synthetic-to-real transfer learning involves training a model only on synthetic images (source domain) without seeing any real ones, and targets the generalization performance on unseen real images (target domain). Recent synthetic-to-real generalization algorithms often start with an ImageNet-pretrained model. To achieve the best generalization performance, it is a common practice to fine-tune the pretrained model on synthetic images for only a few epochs (i.e. early stopping) with a small learning rate. Figure 1 illustrates the evaluation dynamics of several popular heuristic solutions on the VisDA-17 dataset (Peng et al., 2017). One can clearly see the high performance in early epochs, and the improvements of fine-tuning with a small learning rate (or even a fixed backbone) over training with a large one (red dashed line). Similar behavior exists in recent works (e.g. IBN-Net (Pan et al., 2018)). This observation provides an important clue: all these heuristics try to retain the ImageNet domain knowledge during the synthetic-to-real transfer learning. It explains why the heuristic solutions in Figure 1 work: they allow the classifier to quickly adjust from ImageNet to the task defined by the synthetic images, while preventing the ImageNet-pretrained representations of natural images from being washed out due to catastrophic forgetting.

Unfortunately, existing solutions (e.g. IBN-Net) still face degraded generalization and are highly dependent on manual selections of training epochs and schedules (learning rates). Motivated by this open issue, we propose an Automated Synthetic-to-real Generalization (ASG) framework to improve synthetic-to-real transfer learning. This method is automated in two aspects: (1) It stably improves the generalization during transfer learning, avoiding the difficulty of choosing epochs to stop. (2) It automates the complicated tuning of layer-wise learning rates towards better generalization. The core of our work is the intuition that a good synthetically-trained model should share similar representations with ImageNet models, and we leverage this intuition as a proxy guidance to search layer-wise training schedules through learning-to-optimize (L2O).

Summary of Contributions:

We examine the behaviors of various training heuristics, in order to study the role of the ImageNet domain knowledge in synthetic-to-real generalization, which is not thoroughly discussed by the literature to the best of our knowledge.

We provide a novel perspective to address synthetic-to-real generalization, by formulating it as a lifelong learning problem. We enforce the representation similarity between synthetically trained models and the ImageNet-pretrained model, and treat their similarity as a proxy guidance of generalization performance. An overall design is illustrated in Figure 2.

We demonstrate that proxy guidance not only dramatically improves the generalization performance, but can also be easily integrated by existing transfer learning frameworks as a simple drop-in module, without requiring any additional training beyond synthetic images. Experiments also prove the cross-task generalizability of our proxy guidance, which magnifies the strength of synthetic-to-real transfer learning.

We design a reinforcement learning based learning-to-optimize (RL-L2O) approach to make the synthetic-to-real generalization practically more convenient, by automating the complicated heuristic designs with layer-wise learning rates. We demonstrate that our RL-L2O method outperforms hand-crafted decisions and learns an explainable learning rate strategy.

Figure 2. We formulate the synthetic-to-real transfer learning as a lifelong learning problem: training on synthetic images (new task) while still memorizing ImageNet classification (old task), which acts as our proxy guidance during the transfer learning.

2. Automated Syn-to-Real Generalization

In our work, we propose an automated framework to address the synthetic-to-real transfer learning, dubbed Automated Synthetic-to-real Generalization (ASG). We assume an ImageNet-pretrained model as our starting point. Our target is to maximize the performance of the model on a target domain which consists of unseen real images, by utilizing only synthetic images from the source domain.

2.1. Syn-to-Real Generalization with Proxy Guidance

Access to a model pretrained on ImageNet (Deng et al., 2009) implicitly provides the domain knowledge of real images. As we are transferring a model trained on synthetic data to unseen real images, retaining the ImageNet domain knowledge is potentially beneficial to the generalization. Motivated by this, we force the model to memorize how to capture the representation learned from ImageNet while training on synthetic images, to maintain both the domain knowledge on real images and task-specific information provided by the synthetic data.

We start with an ImageNet-pretrained model $M$, and formulate our transfer learning as a lifelong learning problem: training on synthetic images as the new task while still memorizing the old ImageNet classification task. While updating the model $M$ with synthetic images, we also keep a copy of the original ImageNet-pretrained model $M_o$, which is frozen during the training. In addition to the cross-entropy loss $\mathcal{L}_{XE}$ calculated on the synthetic dataset, we also forward the synthetic images through $M_o$ and minimize the KL divergence $\mathcal{L}_{KL}$ between the outputs of $M_o$ and $M$. Formally, we leverage the minimization of $\mathcal{L}_{KL}$ as a proxy guidance during our transfer learning process:


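$$\theta_s^*, \theta_n^* = \arg\min_{\theta_s, \theta_n} \mathcal{L} \tag{1}$$

$$\mathcal{L} = \mathcal{L}_{XE} + \lambda \mathcal{L}_{KL} \tag{2}$$

$$\mathcal{L}_{XE} = -\frac{1}{N_B} \sum_{i=1}^{N_B} y_i \log\big(M(x_i, \theta_s, \theta_n)\big) \tag{3}$$

$$\mathcal{L}_{KL} = -\frac{1}{N_B} \sum_{i=1}^{N_B} M_o(x_i, \theta_{s,o}, \theta_o) \log\big(M(x_i, \theta_s, \theta_o)\big) \tag{4}$$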

Here, λ is a balancing factor that controls how much ImageNet domain knowledge the model should retain. $\theta_n$ denotes the parameters for the synthetic-to-real transfer learning loss $\mathcal{L}_{XE}$ (i.e. the classifier layers for the new task), and $\theta_o$ denotes the parameters of the ImageNet classifier, which outputs the predicted probabilities on the ImageNet domain. $\theta_s$ denotes the parameters of the feature extractor (a.k.a. backbone) updated for the new task, and $\theta_{s,o}$ denotes the parameters of the feature extractor that is frozen with ImageNet-pretrained weights. $\theta_s$ and $\theta_{s,o}$ share the same structure. $N_B$ is the current batch size, and $x_i$ and $y_i$ are the sample and ground truth from the new task in the current batch. This synthetic-to-real transfer learning with proxy guidance is illustrated in Figure 2. The new task and the old ImageNet task are jointly optimized during the training.
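As a rough illustration of Eq. 2 to Eq. 4, the following sketch computes the combined loss for a toy batch in plain Python. It is not the released implementation: the function names, the list-based inputs, and the small `eps` guard are assumptions made here for readability, and the logits would in practice come from the trained model $M$ and the frozen copy $M_o$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(target_probs, pred_probs, eps=1e-12):
    """-sum_k target_k * log(pred_k); eps (an assumption) guards against log(0)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))

def proxy_guided_loss(new_logits, labels, student_old_logits,
                      teacher_old_logits, lam=0.1):
    """Return (L, L_XE, L_KL), each averaged over the batch (Eq. 2-4).

    new_logits:         new-task head outputs of M (theta_s, theta_n)
    student_old_logits: ImageNet head outputs of M (theta_s, theta_o)
    teacher_old_logits: ImageNet head outputs of frozen M_o (theta_{s,o}, theta_o)
    """
    batch_size = len(labels)
    l_xe = 0.0
    l_kl = 0.0
    for logits, label, student, teacher in zip(
            new_logits, labels, student_old_logits, teacher_old_logits):
        one_hot = [1.0 if k == label else 0.0 for k in range(len(logits))]
        l_xe += soft_cross_entropy(one_hot, softmax(logits))
        l_kl += soft_cross_entropy(softmax(teacher), softmax(student))
    l_xe /= batch_size
    l_kl /= batch_size
    return l_xe + lam * l_kl, l_xe, l_kl
```

The proxy term is written here in the cross-entropy form of Eq. 4; it differs from the KL divergence only by the entropy of the frozen model's output, which does not depend on the parameters being trained.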

Cross-task proxy guidance: It is important to note that the new task is not limited to image classification. For some models in semantic segmentation (e.g. ResNet-based FCN (Long et al., 2015a)), a pixel-wise $\mathcal{L}_{XE}$ provides a much denser supervision than the image-wise $\mathcal{L}_{KL}$ in Eq. 4. To spatially balance $\mathcal{L}_{XE}$ and $\mathcal{L}_{KL}$, we also make $\mathcal{L}_{KL}$ denser by applying it on cropped feature map patches:

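$$\mathcal{L}_{KL}^{dense} = -\frac{1}{N_B} \sum_{i=1}^{N_B} \sum_{j=1}^{N} M_o(x_{i,j}, \theta_{s,o}, \theta_o) \log\big(M(x_{i,j}, \theta_s, \theta_o)\big) \tag{5}$$

Here, $x_{i,j}$ ($j = 1, \ldots, N$) are cropped patches from $x_i$. Later in section 3.4 we will demonstrate that this formulation also works well for cross-task training.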
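To make the patch-wise proxy loss of Eq. 5 concrete, here is a minimal sketch in plain Python. The non-overlapping cropping scheme in `crop_boxes`, the nested-list input layout, and the choice to average over images only (not over patches) are all assumptions for illustration; the text above does not specify how the patches are cropped.

```python
import math

def crop_boxes(height, width, patch):
    """Hypothetical cropping scheme: non-overlapping square patches,
    returned as (top, left, bottom, right) boxes on a feature map."""
    return [(top, left, top + patch, left + patch)
            for top in range(0, height - patch + 1, patch)
            for left in range(0, width - patch + 1, patch)]

def soft_cross_entropy(teacher_probs, student_probs, eps=1e-12):
    """-sum_k teacher_k * log(student_k); eps guards against log(0)."""
    return -sum(p * math.log(q + eps)
                for p, q in zip(teacher_probs, student_probs))

def dense_proxy_loss(teacher_patch_probs, student_patch_probs):
    """Sum the per-patch loss over patches j and average over images i.

    Both arguments are indexed as [image][patch] -> probability vector,
    i.e. the ImageNet-head outputs of M_o and M on each cropped patch.
    """
    num_images = len(teacher_patch_probs)
    total = 0.0
    for teacher_img, student_img in zip(teacher_patch_probs, student_patch_probs):
        for teacher_p, student_p in zip(teacher_img, student_img):
            total += soft_cross_entropy(teacher_p, student_p)
    return total / num_images
```

For example, `crop_boxes(4, 4, 2)` yields four 2 x 2 boxes, so each image contributes four patch terms to the sum.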

2.2. Automate LR Selection via Learning-to-Optimize

As observed in Figure 1, different convolution blocks contribute differently to the generalizability. This leads to a question: do different layers in a deep network require different training strategies to reach optimal synthetic-to-real generalization performance during the transfer learning?

To avoid manually tuning the hyperparameters, we propose a reinforcement learning based learning-to-optimize (RL-L2O) framework to automatically adjust the learning rates for layers. In the RL-L2O framework, we aim to learn a parameterized policy π to dynamically control the learning rates given the training statistics of our model $M$ during transfer learning.

Generally, the goal of the reinforcement learning algorithm is to learn a policy π that maximizes the total expected reward r over time. More precisely,

$$\pi^* = \arg\max_{\pi} \mathbb{E}_{s_0, a_0, s_1, \ldots, s_T} \Big[ \sum_{t} r_t \Big] \tag{6}$$

where the expectation is taken over the sequence of states (or observations) and actions. In short, an action $a_t$ produced by π will update the learning rates for $M$ in the RL-L2O framework. A state $s_t$ contains optimization-related statistics of the model $M$ during the transfer learning, and the reward $r_t$ measures how well the optimization performs.

Design of Optimization Coordinates: One challenge in applying reinforcement learning in our setting is that we want to be able to control the training schedules of a deep network of up to a hundred layers (ResNet-101), each of them requiring an action from our policy. As layers may have strong correlations during the optimization (Ghiasi et al., 2018), the policy may fall into sub-optimal solutions in this large-scale action space. To avoid this difficulty and simplify our policy training, we leverage the underlying structures in current deep networks. Specifically, layers in $M$ with similar input resolution will be grouped into a block, named an optimization coordinate. Taking the ResNet family as an example, we group layers into a new coordinate whenever the feature map resolution is reduced. This grouping strategy keeps the action space of the policy small and speeds up the L2O training.
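The grouping rule described above (start a new coordinate whenever the feature map resolution drops) can be sketched as follows. The layer names and resolutions in the example are hypothetical placeholders, not the actual ResNet layout.

```python
def group_into_coordinates(layers):
    """Group consecutive layers into optimization coordinates.

    layers: list of (layer_name, feature_map_resolution) in forward order.
    A new coordinate starts whenever the resolution is reduced.
    """
    coordinates = []
    previous_resolution = None
    for name, resolution in layers:
        if previous_resolution is None or resolution < previous_resolution:
            coordinates.append([])  # resolution dropped: open a new coordinate
        coordinates[-1].append(name)
        previous_resolution = resolution
    return coordinates

# Hypothetical toy network: names and resolutions are illustrative only.
toy_layers = [("stem", 112), ("block_a", 56), ("block_b", 56), ("block_c", 28)]
toy_coordinates = group_into_coordinates(toy_layers)
```

With this toy input the policy would need one action per coordinate (three here) rather than one per layer.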

Design of Action Space: Intuitively, our policy could directly output the learning rate for each coordinate. However, the model $M$ could be very sensitive to the learning rate (as observed in Figure 1), and the learning rate usually resides in a small value range (e.g. $10^{-4}$ to $10^{-3}$). Directly predicting the value of the learning rate could be very unstable. Instead, we propose a learning rate scaling factor as the action. We first provide the policy with a base learning rate $\eta_{base}$. In the following steps, π outputs discrete coordinate-wise learning rate scale factors as its actions $a_t = [a_{1,t}, \ldots, a_{C,t}]$, where $C$ is the number of optimization coordinates in $M$. We formulate $a_t$ as categorical actions, where each learning rate scale factor $a_{c,t} \in \{0, 0.1, 0.2, \ldots, 0.9, 1\}$. The learning rate for each coordinate is set to be $\eta_{c,t} = a_{c,t} \cdot \eta_{base}$, and we leverage the gradients and momentum calculated by stochastic gradient descent (SGD) (Rumelhart et al., 1986) to update the parameters in $M$.
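The mapping from categorical actions to per-coordinate learning rates can be sketched as below. The function names are illustrative, and the update shown is plain SGD without momentum or weight decay, which is a simplification of the optimizer described in the text.

```python
# The eleven categorical scale factors 0, 0.1, ..., 1.0.
SCALE_FACTORS = [i / 10 for i in range(11)]

def coordinate_learning_rates(action_indices, base_lr):
    """Map one categorical action index per coordinate to a learning rate:
    eta_c = a_c * eta_base."""
    return [SCALE_FACTORS[index] * base_lr for index in action_indices]

def sgd_step(params, grads, lr):
    """Simplified SGD update for one coordinate (no momentum, an assumption)."""
    return [p - lr * g for p, g in zip(params, grads)]

# Example: three coordinates, with the first one frozen (scale factor 0).
example_rates = coordinate_learning_rates([0, 5, 10], base_lr=1e-4)
```

An action index of 0 freezes a coordinate entirely, while index 10 trains it at the full base learning rate.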

[Figure 3: workflow diagram. Labels: Block 1, Block 2; state $s_t$ ($\mathcal{L}_{XE,t}$, $\mathcal{L}_{KL,t}$, $\bar{\theta}_t$, $\mathrm{std}(\theta_t)$); $\eta_{base}$; SGD; Actions: LR scale factors.]

Figure 3. Workflow of the proposed L2O framework. $a_t = [a_{1,t}, \ldots, a_{C,t}]^T$ is the learning rate scale factor for the coordinates, and η indicates the learning rate. The product operator in the diagram denotes a dot product.

[Figure 4: policy network diagram. Labels include the previous actions, $\mathcal{L}_{XE,t}$, $\mathcal{L}_{KL,t}$, $\bar{\theta}_t$, $\mathrm{std}(\theta_t)$, and a final linear layer (Linear 2) producing the next actions.]

Figure 4. Architecture of the policy network.

Design of Observation Space and Reward: At each step, the state (observation) $s_t$ for π includes: the current $\mathcal{L}_{XE,t}$ (Eq. 3) and $\mathcal{L}_{KL,t}$ (Eq. 4, Eq. 5), the training progress of $M$ (i.e. $t/T$, where $T$ equals the total number of training steps, i.e. total epochs × iterations per epoch), the mean and standard deviation of the weights of the classifier ($\bar{\theta}_{n,t}$ and $\mathrm{std}(\theta_{n,t})$), and finally the scale factors from the last step $a_{t-1}$. The policy learning is guided by the reward $r_t = \mathcal{L}_{t-1} - \mathcal{L}_t$.
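The observation and reward described here can be assembled as in the following sketch. The ordering of the state entries, the function names, and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics

def build_state(l_xe, l_kl, step, total_steps, classifier_weights, prev_actions):
    """Assemble the observation: both losses, training progress t/T,
    mean and std of the classifier weights, and the previous scale factors."""
    return [
        l_xe,
        l_kl,
        step / total_steps,
        statistics.mean(classifier_weights),
        statistics.pstdev(classifier_weights),  # population std (an assumption)
        *prev_actions,
    ]

def reward(previous_loss, current_loss):
    """Reward is the decrease of the total loss between consecutive steps."""
    return previous_loss - current_loss

# Toy example with two classifier weights and two coordinates.
example_state = build_state(1.0, 0.5, step=5, total_steps=10,
                            classifier_weights=[1.0, 3.0],
                            prev_actions=[0.1, 1.0])
```

The state length is 5 + C for C coordinates, and the reward is positive whenever the combined loss decreases.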

Policy Training: We update our LSTM policy π via the REINFORCE algorithm (Williams, 1992) to minimize:

$$-\sum_{t=1}^{U} r_t \log\big(p_\pi(a_t \mid s_t)\big), \tag{7}$$

where $U$ is the unroll length for the LSTM. Algorithm 1 illustrates the procedure of our RL-L2O framework.

Once we have obtained the learned policy π, we freeze it and apply it to the synthetic-to-real transfer learning of $M$ together with SGD, as illustrated in Figure 3.
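The REINFORCE objective of Eq. 7 can be evaluated from the per-step records that Algorithm 1 keeps in its storage. The sketch below assumes each record is a pair of the probability of the chosen action and the reward received; treating the probability as a single scalar per step (for example, the joint probability over all coordinates) is an assumption, and the helper `policy_update_steps` is a hypothetical name.

```python
import math

def reinforce_loss(storage):
    """Eq. 7 over one unroll: -sum_t r_t * log p(a_t | s_t).

    storage: list of (probability_of_chosen_action, reward) pairs.
    """
    return -sum(r * math.log(p) for p, r in storage)

def policy_update_steps(total_steps, unroll_length):
    """Steps t at which Algorithm 1 updates the policy: (t + 1) % U == 0."""
    return [t for t in range(total_steps) if (t + 1) % unroll_length == 0]

# Toy unroll: a rewarded action and a penalized one.
example_loss = reinforce_loss([(0.5, 1.0), (0.25, -2.0)])
```

Minimizing this quantity raises the probability of actions that were followed by a positive reward and lowers it for actions followed by a negative one.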

3. Experiments

3.1. Datasets

VisDA-17 (Peng et al., 2017). We perform ablation studies on the VisDA-17 image classification benchmark. The VisDA-17 dataset provides three subsets (domains), each with the same 12 object categories. Among them, the training set (source domain) is collected from synthetic renderings of 3D models under different angles and lighting conditions, whereas the validation set (target domain) contains real images cropped from the Microsoft COCO dataset (Lin et al., 2014).

GTA5 (Richter et al., 2016) is a vehicle-egocentric image dataset collected in a computer game with pixel-wise semantic labels. It contains 24,966 images with a resolution of 1052 × 1914. There are 19 classes that are compatible with the Cityscapes dataset.

Cityscapes (Cordts et al., 2016) contains urban street images taken from a vehicle in several European cities. There are 5,000 images with pixel-wise annotations. The images have a resolution of 1024 × 2048 and are labeled into 19 semantic categories.

Algorithm 1: RL-L2O: policy (π) learning to control group-wise learning rates.

Input: base learning rate $\eta_{base}$, parameters $\theta_{n,0}$, $\theta_{s,0}$, hidden state $h_0 = 0$, policy π, unroll length $U$, total training steps $T$.
Calculate $\mathcal{L}_0$, $\mathcal{L}_{XE,0}$, $\mathcal{L}_{KL,0}$ for $\theta_{n,0}$, $\theta_{s,0}$
Initialize storage
for $t = 0, \ldots, T - 1$ do
    $prob(a_{t+1}), a_{t+1}, h_{t+1} = \pi(\mathcal{L}_{XE,t}, \mathcal{L}_{KL,t}, t/T, \bar{\theta}_{n,t}, \mathrm{std}(\theta_{n,t}), a_t, h_t)$
    $(\theta_{n,t+1}, \theta_{s,t+1}) = \mathrm{SGD}(\nabla \mathcal{L}_t, a_{t+1}, \eta_{base}, \theta_{n,t}, \theta_{s,t})$
    Calculate $\mathcal{L}_{t+1}$, $\mathcal{L}_{XE,t+1}$, $\mathcal{L}_{KL,t+1}$ for $\theta_{n,t+1}$, $\theta_{s,t+1}$
    $r_{t+1} = \mathcal{L}_t - \mathcal{L}_{t+1}$
    storage.append($prob(a_{t+1})$, $r_{t+1}$)
    if $(t + 1) \bmod U = 0$ then
        π = REINFORCE(π, storage)
        Initialize storage
return final learned policy π.

3.2. Implementation

Image classification: For VisDA-17, we choose ResNet-101 (He et al., 2016) as the backbone, and one fully-connected layer as the classifier. The backbone is pre-trained on ImageNet (Deng et al., 2009), and then fine-tuned on the source domain, with learning rate = $1 \times 10^{-4}$, weight decay = $5 \times 10^{-4}$, momentum = 0.9, and batch size = 32. The model is trained for 30 epochs and λ for $\mathcal{L}_{KL}$ is set to 0.1. In section 3.3, we will additionally study how to choose λ.

Semantic segmentation: We study both FCN with ResNet-50 and FCN with VGG-16 (Long et al., 2015a). Backbones are pre-trained on ImageNet. Our learning rate is $1 \times 10^{-3}$, weight decay is $5 \times 10^{-4}$, momentum is 0.9, and batch size is six. We crop the images into patches of 512 × 512 and train the model with multi-scale augmentation (0.75 to 1.25) and horizontal flipping. The model is trained for 50 epochs, and λ for $\mathcal{L}_{KL}$ is set to 75. Note that λ in segmentation is considerably larger since $\mathcal{L}_{XE}$ is a pixel-wise dense loss.

RL-L2O policy: We set the learning rate for policy training to 0.5. The size of the hidden state vector $h$ is set to 20, and the unroll length $U = 5$. We train π for 50 epochs. For the ResNet family, we follow the convention (He et al., 2016) to group the layers into $C = 7$ coordinates: conv1, bn1, conv2, conv3, conv4, conv5, and the classifier. For VGG-16 (Long et al., 2015a), we also group the layers into $C = 7$ coordinates: conv1, conv2, conv3, conv4, conv5, conv6&7, and the remaining projection upsampling layers.

Proxy guidance: For all backbones we studied (ResNet-50, ResNet-101, and VGG-16), we forward the feature maps extracted by group conv5 into the ImageNet classifier (parameterized by $\theta_o$) to calculate $\mathcal{L}_{KL}$.

3.3. ASG for Image Classiﬁcation

We first perform the ablation studies on the VisDA-17 image classification task¹.

Generalization with Proxy Guidance. To evaluate the effect of our proxy guidance, we apply our $\mathcal{L}_{KL}$ loss to the different learning rate settings we studied in Figure 1. As demonstrated in Figure 5, once we force the model to memorize the ImageNet domain knowledge, we achieve stably increasing and eventually better generalization performance for each setting we explored in Figure 1. The relative ranking still holds among the different learning rate settings, while the degraded generalizability is addressed. Early stopping is no longer needed, as models enjoy improved generalization given sufficient training epochs. This ablation study validates the contribution of retaining the ImageNet domain knowledge during the synthetic-to-real transfer learning. It is also worth noting that our proxy guidance can also be applied to different networks (e.g. the IBN-Net (Pan et al., 2018), green line in Figure 5), which demonstrates the easy integration of our approach as a simple drop-in module with existing synthetic-to-real generalization works, without requiring any additional training beyond synthetic images.

[Figure 5: plot over synthetic training epochs (x-axis: Epoch (Synthetic Training), 0 to 25). Curves: Proxy guidance + large LR for all layers; Proxy guidance + small LR for all layers; Proxy guidance + small LR for backbone, large LR for FC; Proxy guidance + IBN-Net.]

Figure 5. The degraded generalization during the synthetic-to-real transfer learning (studied in Figure 1) can be solved by forcing the model to retain the ImageNet domain knowledge via our proxy guidance². Task: ResNet-101 VisDA-17 classification. λ = 0.1.

¹ There is no previous synthetic-to-real transfer work on the VisDA-17 classification task, only domain adaptation works.

Moreover, a vital conclusion from Figure 5 is that only reporting the (final) performance as a number is far from sufficient for analyzing and comparing synthetic-to-real transfer learning methods. Instead, the curve of the target performance during training better demonstrates how well a model generalizes. Meanwhile, a stably increasing training curve implies that the model is both better leveraging synthetic images and retaining ImageNet domain knowledge, instead of overfitting to synthetic appearance and leaving the domain gap an open issue.

How to choose λ: We also study the effect of different strengths of the proxy guidance loss $\mathcal{L}_{KL}$ by adjusting λ in Equation 2 for a ResNet-101 model trained with a small learning rate for the backbone and a large one for the classification layer (blue line in Figure 5). In Table 1, we adjust λ in a wide range from 0.01 to 1. While we obtain the best generalization accuracy with λ = 0.1, we can see that our proxy guidance is very robust to different strengths of $\mathcal{L}_{KL}$. Therefore, choosing λ is much easier than tuning hyperparameters in heuristic solutions, such as the number of epochs.

Table 1. Ablation of λ for the proxy guidance loss $\mathcal{L}_{KL}$. Model: ResNet-101. Task: VisDA-17 classification.

| λ | 0.01 | 0.05 | 0.1 | 0.5 | 1 |
|---|---|---|---|---|---|
| Accuracy (%) | 58.9 | 59.4 | 60.1 | 58.5 | 59.7 |
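The robustness claim can be checked directly against the numbers in Table 1. The short sketch below (variable names are illustrative) finds the best λ and the gap between the best and worst accuracy across the tested range.

```python
# Accuracy (%) on VisDA-17 for each tested lambda, copied from Table 1.
accuracy_by_lambda = {0.01: 58.9, 0.05: 59.4, 0.1: 60.1, 0.5: 58.5, 1.0: 59.7}

# Lambda with the highest accuracy.
best_lambda = max(accuracy_by_lambda, key=accuracy_by_lambda.get)

# Gap between best and worst accuracy, rounded to one decimal place.
accuracy_spread = round(
    max(accuracy_by_lambda.values()) - min(accuracy_by_lambda.values()), 1)
```

Across a hundredfold change in λ, the accuracy varies by 1.6 percentage points.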

Automated Syn-to-Real Generalization. We next evaluate the performance of our RL-L2O framework. Specifically, we want to make sure the policy learned by our RL-L2O can perform better than both the random policy and the best hand-tuned learning rate policy we explored in Figure 5. A random policy means that the controller will always randomly pick an action as the learning rate scale factor. In all three settings we start from the same base learning rate $\eta_{base} = 1 \times 10^{-4}$. Figure 6 demonstrates that, while the hand-tuned learning rate strategy is better than a random policy, our RL-L2O framework can even outperform the human-designed one (blue line).
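A random-policy baseline of the kind described above can be sketched as follows. Sampling each coordinate's scale factor independently and uniformly is an assumption, as the text only states that actions are picked randomly; the function name is illustrative.

```python
import random

# The eleven categorical scale factors 0, 0.1, ..., 1.0.
SCALE_FACTORS = [i / 10 for i in range(11)]

def random_policy_actions(num_coordinates, rng):
    """Pick one scale factor per coordinate uniformly at random (assumed)."""
    return [rng.choice(SCALE_FACTORS) for _ in range(num_coordinates)]

# Example with C = 7 coordinates and a seeded generator for reproducibility.
rng = random.Random(0)
actions = random_policy_actions(7, rng)
learning_rates = [a * 1e-4 for a in actions]  # eta_c = a_c * eta_base
```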

Additional Ablation Study on VisDA-17. We conduct additional ablation studies on VisDA-17 to further analyze the learning behaviors of ASG. Specifically, as both the proxy guidance and the RL-L2O frameworks are motivated to carefully preserve the ImageNet representations while targeting updates from the new tasks on synthetic data, it is interesting and important to examine the relation between the level of retained ImageNet knowledge and the synthetic-to-real generalization. In our experiment, we compute ImageNet validation accuracy as well as the generalization performance on the VisDA-17 target domain for the classification task.

² We could not utilize the proxy guidance when the backbone is fixed ("Train FC only", blue dashed curve in Figure 1). $\mathcal{L}_{KL}$ is always zero in this case as the group conv5 is not updated.


[Figure 6: plot over synthetic training epochs (x-axis: Epoch (Synthetic Training), 0 to 25). Curves: ASG; Proxy guidance + small LR for backbone, large LR for FC; Proxy guidance + random policy.]

Figure 6. Our RL-L2O framework can outperform both the random policy and a carefully hand-tuned learning rate strategy. All three settings include $\mathcal{L}_{KL}$ with the same λ = 0.1 during training. Model: ResNet-101. Task: VisDA-17 classification.

Table 2 demonstrates two conclusions: 1) Heuristic solutions that retain more ImageNet domain knowledge achieve higher synthetic-to-real generalization (#3 versus #1), i.e., using hand-crafted small learning rates to prevent the ImageNet-pretrained representations of natural images from being washed out due to catastrophic forgetting; 2) By leveraging Proxy Guidance, the generalization performance on VisDA-17 is dramatically improved, while the ImageNet accuracy is also maintained with almost no drop. It is interesting that Proxy Guidance leads to learned model parameters that achieve high accuracy simultaneously on both ImageNet and VisDA-17. In contrast, naively freezing the backbone and only fine-tuning the classifier layer ("Oracle", #5) results in inferior synthetic-to-real generalization despite high ImageNet performance.

Table 2. Our Proxy Guidance improves the synthetic-to-real generalization (VisDA-17) by retaining the ImageNet domain knowledge. Learning rate (LR) settings were studied in Figures 1 and 5. FC: the last fully-connected classification layer. Top-1 accuracies are in percentage (%). Model: ResNet-101.

| # | Model | VisDA-17 | ImageNet |
|---|---|---|---|
| 1 | Large LR for all layers | 28.2 | 0.8 |
| 2 | + our Proxy Guidance | 58.7 (+30.5) | 76.2 (+75.4) |
| 3 | Small LR for backbone and large LR for FC | 49.3 | 33.1 |
| 4 | + our Proxy Guidance | 60.2 (+10.9) | 76.5 (+43.4) |
| 5 | Oracle on ImageNet³ | 53.3 (+4.0) | 77.4 |
| 6 | ROAD (Chen et al., 2018) | 57.1 (+7.8) | 77.4 |
| 7 | Vanilla L2 distance | 56.4 (+7.1) | 49.1 |
| 8 | SI (Zenke et al., 2017) | 57.6 (+8.3) | 53.9 |
| 9 | ASG (ours) | 61.1 | 76.7 |
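The bracketed gains in Table 2 can be reproduced from the raw accuracies. The sketch below assumes that row 2 is measured against row 1 and that rows 4 to 8 are measured against row 3 (the 49.3% baseline), which matches the reported values; the dictionary names are illustrative.

```python
# VisDA-17 top-1 accuracy (%) per row number, copied from Table 2.
visda_accuracy = {1: 28.2, 2: 58.7, 3: 49.3, 4: 60.2, 5: 53.3,
                  6: 57.1, 7: 56.4, 8: 57.6, 9: 61.1}

# Assumed baseline row for each row that reports a gain.
baseline_row = {2: 1, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3}

# Gain in percentage points, rounded to one decimal place.
visda_gain = {row: round(visda_accuracy[row] - visda_accuracy[base], 1)
              for row, base in baseline_row.items()}
```

Proxy guidance under the hand-tuned LR policy (row 4) gains 10.9 points, more than any of the compared lifelong learning methods in rows 5 to 8.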

³ Oracle is obtained by freezing the ResNet-101 backbone while only training the last new fully-connected classification layer on the VisDA-17 source domain (the FC layer for ImageNet remains unchanged). We use the official PyTorch model of ImageNet-pretrained ResNet-101.

In addition, we compare ASG with several other lifelong learning algorithms, including both feature-level ℓ2 regularization (Chen et al., 2018) and weight-level importance-reweighted ℓ2 constraints (Zenke et al., 2017). Rows #5-8 in Table 2 show that although the three compared methods indeed retain ImageNet domain knowledge while improving over the baseline (49.3%), they do not perform as well as the proxy guidance (60.2%) under the same LR policy.

3.4. ASG for Semantic Segmentation

We also conduct experiments to evaluate the generalization performance of ASG on semantic segmentation. In particular, we treat GTA5 as the synthetic source domain and train segmentation models on it. We then treat the Cityscapes validation/test sets as target domains where we directly evaluate the synthetically trained models.

[Figure 7: plot over synthetic training epochs (x-axis: Epoch (Synthetic Training), 0 to 40). Curves: ASG; Proxy guidance + small LR for backbone, large LR for FC; Proxy guidance + random policy; Small LR for backbone, large LR for FC.]

Figure 7. Dynamics of evaluation accuracy with training epochs. Models are trained on GTA5 and directly tested on the Cityscapes validation set. We use FCN-VGG16 as the backbone for segmentation models. In addition, $\mathcal{L}_{KL}$ in all compared methods shares the same parameter λ = 75 during synthetic source training.

Figure 7 shows the dynamics of evaluation accuracy on the Cityscapes validation set. Again, ASG demonstrates significantly improved generalization performance on semantic segmentation over naive synthetic training. In addition, integrating proxy guidance with RL-L2O also consistently outperforms baselines where proxy guidance is integrated with other policy strategies. Note that in this case, both $\theta_o$ and $\mathcal{L}_{KL}$ are oriented to the classification task, while $\theta_n$ and $\mathcal{L}_{XE}$ are designed for segmentation. This showcases the ability of ASG to generalize across different tasks.

In Table 3, we compare our method with prior domain generalization methods for semantic segmentation. One can see that ASG achieves the best performance gain. Among the compared methods, IBN-Net (Pan et al., 2018) improves domain generalization by fine-tuning the mixed IN-BN residual building blocks, while (Yue et al.) transfers the styles from images in ImageNet to synthetic images. It is worth noting that (Yue et al.) requires ImageNet images during training and implicitly leverages ImageNet label information (i.e. "Auxiliary Domains"), which brings potential advantages. In contrast, our method requires minimal extra information without using any additional images or labels, and can therefore be conveniently applied to existing frameworks as a drop-in training strategy.

Table 3. Comparison to prior methods on domain generalization for semantic segmentation (GTA5 → Cityscapes).

| Methods | Model | mIoU % | mIoU gain % |
|---|---|---|---|
| No Adapt | FCN-Res50 | 22.17 | |
| IBN-Net (2018) | FCN-Res50 | 29.64 | 7.47 |
| No Adapt | FCN-Res50 | 32.45 | |
| Yue et al. | FCN-Res50 | 37.42 | 4.97 |
| No Adapt | FCN-Res50 | 23.29 | |
| Ours | FCN-Res50 | 31.89 | 8.60 |
| No Adapt | FCN-VGG16 | 29.81 | |
| Yue et al. | FCN-VGG16 | 36.11 | 6.3 |
| No Adapt | FCN-VGG16 | 19.89 | |
| Ours | FCN-VGG16 | 31.47 | 11.58 |

Policy Behaviors. Figure 8 shows clear and explainable behavior patterns of our policy for FCN-VGG16 on the segmentation task. In FCN-VGG16, groups conv1 to conv5 belong to the ImageNet-pretrained backbone, while conv6&7 and the remaining projection upsampling layers act as the classifier for the dense predictions. The feature map captured by conv5 is forwarded into $\theta_o$ to calculate $\mathcal{L}_{KL}$. As conv5 is close to the calculation of $\mathcal{L}_{KL}$, fixing conv5 (i.e. selecting action = 0, which represents a learning rate scale factor of 0) can effectively minimize $\mathcal{L}_{KL}$ and retain the ImageNet domain knowledge. As the groups from conv5 down to conv1 lie progressively farther from the $\mathcal{L}_{KL}$ supervision, the corresponding selected actions also increase.

On the other hand, to perform dense prediction in semantic segmentation, the extracted feature maps are first forwarded to conv6&7 and then to projection upsampling. A similar trend holds for the classifier part: as projection upsampling is the closest group to $\mathcal{L}_{XE}$, it is assigned the highest learning rate scale factor.

3.5. ASG for Unsupervised Domain Adaptation

The proposed ASG framework not only improves the synthetic-to-real generalization performance, but can also considerably benefit downstream tasks such as unsupervised domain adaptation. Here we present synthetic-to-real domain adaptation results on VisDA-17 (Peng et al., 2017) in Table 4, where the model trained by ASG (which did not use any real target images during training) is leveraged as the source model (i.e., the starting point for the unsupervised domain adaptation training), and the CBST/CRST frameworks are adopted exactly following (Zou et al., 2018; 2019) for fair comparison.

[Figure 8: plot over policy training epochs (x-axis: Epoch (Policy Training), 0 to 50). Curves: conv1; conv2; conv3; conv4; conv5; conv6&7; projection upsampling.]

Figure 8. Action behavior of our RL-L2O framework during the policy training for $M$ = FCN-VGG16 for the GTA5 → Cityscapes segmentation transfer learning. Categorical actions are smoothed for better visualization. Actions of [0, 1, ..., 10] indicate learning rate scale factors [0, 0.1, ..., 1.0].

Starting from a much better initialization (our 61.1% compared with 51.6% in (Zou et al., 2019)), we significantly boost the adaptation performance by over 6% compared with CBST/CRST, achieving 84.6% on VisDA-17. It is important to emphasize that such improvement is obtained without any extra supervision or external knowledge. The only difference lies in smarter synthetic-to-real source training, which ultimately leads to improved adaptation.

Table 4. Synthetic-to-real adaptation on Visda-17. We follow the same settings in (Zou et al., 2019) to set the weights as 0.1 and 0.25 for MRKLD and LRENT respectively, and report the averages and standard deviations (in brackets) of the evaluation results over ﬁve runs. Model: Res Net-101. Tgt Img : whether the method leveraged target real images during training. Top-1 accuracies are in percentage (%).

| Method | Tgt Img | Accuracy |
|---|---|---|
| Source (Saito et al., 2017) | | 52.4 |
| DANN (Ganin et al., 2016) | | 57.4 |
| MCD (Saito et al., 2018) | | 71.9 |
| ADR (Saito et al., 2017) | | 74.8 |
| SimNet-Res152 (Pinheiro, 2018) | | 72.9 |
| GTA-Res152 (Sankaranarayanan et al., 2018) | | 77.1 |
| Source-Res101 (Zou et al., 2019) | | 51.6 |
| CBST (Zou et al., 2018) | | 76.4 (0.9) |
| CRST (MRKLD) (Zou et al., 2019) | | 77.9 (0.5) |
| CRST (MRKLD + LRENT) (Zou et al., 2019) | | 78.1 (0.2) |
| Source-Res101 (ASG) | | 61.1 |
| ASG + CBST | | 82.5 (0.7) |
| ASG + CRST (MRKLD) | | 84.6 (0.4) |
| ASG + CRST (MRKLD + LRENT) | | 84.5 (0.4) |
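To make the size of the improvement concrete, the short sketch below recomputes the absolute gains from the ResNet-101 rows of Table 4. The dictionary layout and the helper name `absolute_gains` are illustrative choices, not part of the released code; only the accuracy values come from the table.

```python
# Top-1 accuracies (%) copied from the ResNet-101 rows of Table 4.
baseline = {
    "Source": 51.6,
    "CBST": 76.4,
    "CRST (MRKLD)": 77.9,
    "CRST (MRKLD + LRENT)": 78.1,
}
with_asg = {
    "Source": 61.1,
    "CBST": 82.5,
    "CRST (MRKLD)": 84.6,
    "CRST (MRKLD + LRENT)": 84.5,
}


def absolute_gains(before, after):
    """Return the per-method accuracy gain in percentage points."""
    return {name: round(after[name] - before[name], 1) for name in before}


gains = absolute_gains(baseline, with_asg)
for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

Every adaptation row gains more than 6 points, and the source-only model gains the most, which is consistent with the claim that the better initialization drives the downstream improvement.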

Figure 9. Generalization results on GTA5 → Cityscapes. Rows correspond to sample images in Cityscapes. From left to right, columns correspond to original images, ground truth, prediction results of the baseline (FCN-VGG16 (Long et al., 2015a)), and predictions by the model trained with our ASG framework. Color legend: road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, ignored, terrain, sky, person, rider, car, truck, bus, train, motorcycle, bike.

Figure 10. t-SNE visualization of feature embeddings of different models on the target domain of VisDA-17. From left to right: source model (Zou et al., 2019), CBST (Zou et al., 2018), CRST (MRKLD+LRENT) (Zou et al., 2019), and ASG + CRST (MRKLD+LRENT). Class legend: aero, bike, bus, car, horse, knife, motor, person, plant, board, train, truck.

Feature visualization. Fig. 10 shows the t-SNE visualization of the feature embeddings extracted by the backbone (ResNet-101) of different models. Compared with both CBST (Zou et al., 2018) and CRST (MRKLD+LRENT) (Zou et al., 2019), feature embeddings obtained by ASG + CRST form purer clusters in terms of semantic labels.

4. Related Work

4.1. Domain Generalization and Adaptation

Domain generalization considers the problem of generalizing a model to an unseen target domain without leveraging any target domain images (Gan et al., 2016; Muandet et al., 2013; Yuan et al., 2020). Muandet et al. (2013) proposed to use the MMD (Maximum Mean Discrepancy) to align the distributions from different domains and train the network with adversarial learning. Li et al. (2017) built separate networks for each source domain and used the shared parameters at test time. Li et al. (2018) improved generalization performance with a meta-learning approach on split training sets. Pan et al. (2018) boosted a CNN's generalization by carefully integrating Instance Normalization and Batch Normalization as building blocks.
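As a reminder of what MMD measures, here is a minimal sketch of the standard biased estimate of squared MMD with a Gaussian (RBF) kernel. It is a generic textbook estimator, not the specific formulation used in any of the cited works; the bandwidth value and the toy samples are assumptions for illustration.

```python
import math


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel between two feature vectors given as lists of floats."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""

    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy 2-D features: one slightly shifted copy and one far-away copy.
source = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
near = [[0.1, 0.0], [1.1, 0.0], [0.1, 1.0]]
far = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]

print(mmd_squared(source, near), mmd_squared(source, far))
```

The estimate is zero when the two samples coincide and grows as the distributions move apart, which is why it can serve as an alignment penalty between domains.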

Unsupervised domain adaptation (UDA) trains a model towards a specific target domain, where the (unlabeled) images from the target domain are available for training. One major idea is to learn domain-invariant embeddings by minimizing the distribution divergence between the source and target domains (Long et al., 2015b; Sun & Saenko, 2016; Tzeng et al., 2014). Hoffman et al. (2017) reduced the domain gap by first translating the source images into the target style with a cycle consistency loss, and then aligning the feature maps of the network across different domains through adversarial training. Other works that leverage image-level translation to bridge the domain gap include domain stylization (?) and DLOW (?). Besides image-level translation, a number of works also perform adversarial learning at the feature or output level for improved domain adaptation performance.

More recently, Zou et al. (2018) proposed an Expectation-Maximization-like UDA framework based on an iterative self-training process, where the loss of the latent variable is minimized. This is achieved by alternately generating pseudo labels on target data and re-training the model with the mixed source and pseudo target labels.
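The alternating procedure can be sketched in a few lines. This is a simplified illustration only: it uses a single fixed confidence threshold, which is an assumption made here for brevity, whereas the cited work is titled class-balanced self-training and selects pseudo labels differently. The function names and the threshold value are hypothetical.

```python
def generate_pseudo_labels(probabilities, threshold=0.9):
    """Return one pseudo label per target sample, or None when confidence is low."""
    labels = []
    for probs in probabilities:
        best_class = max(range(len(probs)), key=lambda k: probs[k])
        labels.append(best_class if probs[best_class] >= threshold else None)
    return labels


def build_training_set(source_data, target_inputs, pseudo_labels):
    """Mix labeled source samples with confidently pseudo-labeled target samples."""
    confident = [(x, y) for x, y in zip(target_inputs, pseudo_labels) if y is not None]
    return list(source_data) + confident


# One round: the current model's softmax outputs on three unlabeled target samples.
target_inputs = ["t0", "t1", "t2"]
target_probs = [[0.95, 0.05], [0.60, 0.40], [0.08, 0.92]]
source_data = [("s0", 0), ("s1", 1)]

pseudo = generate_pseudo_labels(target_probs)
train_set = build_training_set(source_data, target_inputs, pseudo)
print(pseudo, len(train_set))
```

In a full self-training loop, the model would be re-trained on `train_set`, its predictions on the target data refreshed, and the two steps repeated.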

In contrast to existing domain generalization and adaptation methods, we leverage the ImageNet-pretrained model as proxy guidance during synthetic-to-real transfer learning, without any extra adversarial training or modification to the model architecture.

4.2. Lifelong Learning

Lifelong learning (Thrun, 1998) focuses on flexibly appending new tasks to the model's training schedule while maintaining the knowledge captured from previous tasks. Li & Hoiem (2017) leveraged only new-task data to train the network while preserving the original capabilities by minimizing the discrepancy between the outputs of the old network and those of the newly learned one. Lopez-Paz and Ranzato (2017) proposed Gradient Episodic Memory (GEM) to alleviate forgetting while transferring knowledge from previous tasks. Shin et al. (2017) developed a Deep Generative Replay framework, which samples training data from previous tasks when training on a new task. Other works on lifelong learning with related applications include (Zenke et al., 2017; Kirkpatrick et al., 2017; Shafahi et al., 2019), where lifelong learning is shown to avoid catastrophic forgetting and benefit tasks such as incremental task learning, domain adaptation, and adversarial defense. One work particularly related to our synthetic-to-real generalization theme is (Chen et al., 2018), where the authors propose a spatial-aware adaptation scheme and also leverage a distillation loss to avoid overfitting to synthetic data. Our work differs from these prior works by carefully examining the important role played by layer-wise learning rate policies in synthetic-to-real transfer learning, and accordingly proposing a principled solution to automate the policy search.

4.3. Learning to Optimize

Andrychowicz et al. (2016) proposed the first learning-to-optimize framework, where both the optimizee's gradients and loss function values were formulated as input features for a recurrent neural network (RNN) optimizer. Their RNN optimizer adopted coordinate-wise weight sharing to alleviate the dimensionality challenge. Li and Malik (2016) used the gradient history and objective values as observations and step vectors as actions in their reinforcement learning framework. Chen et al. (2017) leveraged an RNN to train a meta-optimizer for black-box functions (e.g., Gaussian process bandits). Wichrowska et al. (2017) introduced a multi-level hierarchical RNN optimizer augmented with additional architectural features to improve generalization across optimization tasks. Cao et al. (2019) and You et al. (2020) further extended learned optimizers to Bayesian swarm optimization and graph network training, respectively. In our work, we leverage the learning-to-optimize approach to control the layer-wise learning rates for training deep CNNs, where the deep CNN (i.e., the optimizee) is transferred from the synthetic source domain to the real target domain, extending the application range of current learning-to-optimize methods.
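To illustrate what controlling layer-wise learning rates with categorical actions looks like, the sketch below maps actions in {0, 1, ..., 10} to scale factors in {0, 0.1, ..., 1.0}, following the caption of Figure 8, and applies them to a shared base learning rate. The layer-group names are taken from the Figure 8 legend; the function names, the example action vector, and the base learning rate are hypothetical and are not drawn from the released code.

```python
# Layer groups as named in the legend of Figure 8 (FCN-VGG16).
LAYER_GROUPS = [
    "conv1", "conv2", "conv3", "conv4",
    "conv5", "conv6&7", "projection", "upsampling",
]
NUM_ACTIONS = 11  # categorical actions 0, 1, ..., 10


def action_to_scale(action):
    """Map a categorical action in [0, 10] to a scale factor in [0.0, 1.0]."""
    if not 0 <= action < NUM_ACTIONS:
        raise ValueError(f"action must be in [0, {NUM_ACTIONS - 1}], got {action}")
    return action / (NUM_ACTIONS - 1)


def layerwise_learning_rates(actions, base_lr):
    """Scale a shared base learning rate by one action per layer group."""
    if len(actions) != len(LAYER_GROUPS):
        raise ValueError("expected one action per layer group")
    return {name: base_lr * action_to_scale(a) for name, a in zip(LAYER_GROUPS, actions)}


# Hypothetical policy output: freeze conv1, train the upsampling head at full rate.
example_actions = [0, 2, 2, 4, 4, 6, 8, 10]
rates = layerwise_learning_rates(example_actions, base_lr=1e-3)
for name, lr in rates.items():
    print(f"{name}: {lr:.1e}")
```

An action of 0 freezes a layer group entirely, so a policy over these actions can express both layer freezing and reduced-rate fine-tuning in one search space.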

5. Conclusion

In this paper, we present an Automated Synthetic-to-real Generalization (ASG) method for the synthetic-to-real transfer learning problem. We carefully analyzed the pitfall in existing generalization approaches where the ImageNet domain knowledge is catastrophically forgotten. By minimizing the discrepancy between the predictions of the ImageNet-pretrained model and those of the model trained for the new task as proxy guidance, generalization performance is dramatically improved throughout the training process. We further include a reinforcement-learning-based learning-to-optimize strategy to automate the selection of layer-wise learning rates toward better generalization. Our experiments demonstrate both the superior generalization performance and the automated learning schedules of our ASG framework.
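One way to picture the proxy guidance is as a divergence between two predicted class distributions for the same input. The sketch below assumes a KL divergence taken from the pretrained model's prediction to the current model's prediction; both the choice of KL and its direction are assumptions made here for illustration, and the function names and toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def proxy_guidance_loss(pretrained_logits, current_logits):
    """Penalize the current model for drifting from the pretrained prediction."""
    return kl_divergence(softmax(pretrained_logits), softmax(current_logits))


# Toy logits over three classes for a single synthetic image.
pretrained = [2.0, 0.5, -1.0]
drifted = [-1.0, 0.5, 2.0]

print(proxy_guidance_loss(pretrained, pretrained))  # no drift
print(proxy_guidance_loss(pretrained, drifted))     # large drift
```

The penalty vanishes when the two models agree and grows as the synthetically trained model drifts away, so adding it to the task loss discourages forgetting of the pretrained representation.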

6. Acknowledgements

Work done during internship at NVIDIA. We appreciate the computing resources provided by the NVIDIA GPU infrastructure. We also thank the four anonymous reviewers for their discussion and suggestions, and Yang Zou for help with the domain adaptation experiments. The research of Z. Wang was partially supported by NSF Award RI-1755701.

References

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and De Freitas, N. Learning to learn by gradient descent by gradient descent. In NeurIPS, 2016.

Cao, Y., Chen, T., Wang, Z., and Shen, Y. Learning to optimize in swarms. In NeurIPS, 2019.

Chen, Y., Hoffman, M. W., Colmenarejo, S. G., Denil, M., Lillicrap, T. P., Botvinick, M., and de Freitas, N. Learning to learn without gradient descent by gradient descent. In ICML, 2017.

Chen, Y., Li, W., and Van Gool, L. Road: Reality oriented adaptation for semantic segmentation of urban scenes. In CVPR, 2018.

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.

Coumans, E. and Bai, Y. Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., and Brox, T. Flownet: Learning optical flow with convolutional networks. In ICCV, 2015.

Gaidon, A., Wang, Q., Cabon, Y., and Vig, E. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.

Gan, C., Yang, T., and Gong, B. Learning attributes equals multi-source domain generalization. In CVPR, 2016.

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. JMLR, 17(1):2096–2030, 2016.

Ghiasi, G., Lin, T.-Y., and Le, Q. V. Dropblock: A regularization method for convolutional networks. In NeurIPS, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.

Hoffman, J., Tzeng, E., Park, T., Zhu, J.-Y., Isola, P., Saenko, K., Efros, A. A., and Darrell, T. Cycada: Cycle-consistent adversarial domain adaptation. arXiv:1711.03213, 2017.

Johnson-Roberson, M., Barto, C., Mehta, R., Sridhar, S. N., Rosaen, K., and Vasudevan, R. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? arXiv:1610.01983, 2016.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. M. Deeper, broader and artier domain generalization. In ICCV, 2017.

Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. M. Learning to generalize: Meta-learning for domain generalization. In AAAI, 2018.

Li, K. and Malik, J. Learning to optimize. arXiv:1606.01885, 2016.

Li, Z. and Hoiem, D. Learning without forgetting. IEEE Trans. PAMI, 40(12):2935–2947, 2017.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In ECCV, 2014.

Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In CVPR, 2015a.

Long, M., Cao, Y., Wang, J., and Jordan, M. I. Learning transferable features with deep adaptation networks. arXiv:1502.02791, 2015b.

Lopez-Paz, D. and Ranzato, M. Gradient episodic memory for continual learning. In NeurIPS, 2017.

Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., and Brox, T. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.

Muandet, K., Balduzzi, D., and Schölkopf, B. Domain generalization via invariant feature representation. In ICML, 2013.

Pan, X., Luo, P., Shi, J., and Tang, X. Two at once: Enhancing learning and generalization capacities via ibn-net. In ECCV, 2018.

Peng, X., Usman, B., Kaushik, N., Hoffman, J., Wang, D., and Saenko, K. VisDA: The visual domain adaptation challenge. arXiv:1710.06924, 2017.

Pinheiro, P. O. Unsupervised domain adaptation with similarity learning. In CVPR, 2018.

Richter, S. R., Vineet, V., Roth, S., and Koltun, V. Playing for data: Ground truth from computer games. In ECCV, 2016.

Richter, S. R., Hayder, Z., and Koltun, V. Playing for benchmarks. In ICCV, 2017.

Ros, G., Sellart, L., Materzynska, J., Vazquez, D., and Lopez, A. M. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Saito, K., Ushiku, Y., Harada, T., and Saenko, K. Adversarial dropout regularization. arXiv:1711.01575, 2017.

Saito, K., Watanabe, K., Ushiku, Y., and Harada, T. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, 2018.

Sankaranarayanan, S., Balaji, Y., Castillo, C. D., and Chellappa, R. Generate to adapt: Aligning domains using generative adversarial networks. In CVPR, 2018.

Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., et al. Habitat: A platform for embodied ai research. In ICCV, 2019.

Shafahi, A., Saadatpanah, P., Zhu, C., Ghiasi, A., Studer, C., Jacobs, D., and Goldstein, T. Adversarially robust transfer learning. arXiv:1905.08232, 2019.

Shin, H., Lee, J. K., Kim, J., and Kim, J. Continual learning with deep generative replay. In NeurIPS, 2017.

Sun, B. and Saenko, K. Deep coral: Correlation alignment for deep domain adaptation. In ECCV, 2016.

Thrun, S. Lifelong learning algorithms. In Learning to learn, pp. 181–209. Springer, 1998.

Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., and Darrell, T. Deep domain confusion: Maximizing for domain invariance. arXiv:1412.3474, 2014.

Wichrowska, O., Maheswaranathan, N., Hoffman, M. W., Colmenarejo, S. G., Denil, M., de Freitas, N., and Sohl-Dickstein, J. Learned optimizers that scale and generalize. In ICML, 2017.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

You, Y., Chen, T., Wang, Z., and Shen, Y. L2-gcn: Layerwise and learned efficient training of graph convolutional networks. In CVPR, 2020.

Yuan, Y., Chen, W., Chen, T., Yang, Y., Ren, Z., Wang, Z., and Hua, G. Calibrated domain-invariant learning for highly generalizable large scale re-identification. In WACV, 2020.

Yue, X., Zhang, Y., Zhao, S., Sangiovanni-Vincentelli, A., Keutzer, K., and Gong, B. Domain randomization and pyramid consistency: Simulation-to-real generalization without accessing target domain data. In ICCV.

Zenke, F., Poole, B., and Ganguli, S. Continual learning through synaptic intelligence. In ICML, 2017.

Zou, Y., Yu, Z., Vijaya Kumar, B., and Wang, J. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In ECCV, 2018.

Zou, Y., Yu, Z., Liu, X., Kumar, B., and Wang, J. Confidence regularized self-training. In ICCV, 2019.