# Parallel-mentoring for Offline Model-based Optimization

Can (Sam) Chen¹,², Christopher Beckham²,³, Zixuan Liu⁵, Xue Liu¹,², Christopher Pal²,³,⁴

¹McGill University, ²MILA - Quebec AI Institute, ³Polytechnique Montreal, ⁴Canada CIFAR AI Chair, ⁵University of Washington

Abstract

We study offline model-based optimization, which aims to maximize a black-box objective function with a static dataset of designs and scores. These designs span a variety of domains, including materials, robots, and DNA sequences. A common approach trains a proxy on the static dataset to approximate the black-box objective function and performs gradient ascent to obtain new designs. However, this often yields poor designs because the proxy is inaccurate for out-of-distribution designs. Recent studies indicate that (a) gradient ascent with a mean ensemble of proxies generally outperforms simple gradient ascent, and (b) a trained proxy provides weak ranking supervision signals for design selection. Motivated by (a) and (b), we propose parallel-mentoring, an effective and novel method that facilitates mentoring among parallel proxies, creating a more robust ensemble that mitigates the out-of-distribution issue. We focus on the three-proxy case, and our method consists of two modules. The first module, voting-based pairwise supervision, operates on three parallel proxies and captures their ranking supervision signals as pairwise comparison labels. These labels are combined through majority voting to generate consensus labels, which incorporate ranking supervision signals from all proxies and enable mutual mentoring. However, label noise can arise because the consensus is not always correct. To alleviate this, we introduce an adaptive soft-labeling module with soft-labels initialized as consensus labels.
Based on bi-level optimization, this module fine-tunes the proxies in the inner level and learns more accurate labels in the outer level to adaptively mentor the proxies, resulting in a more robust ensemble. Experiments validate the effectiveness of our method. Our code is available here.

1 Introduction

Designing new objects or entities to optimize specific properties is a widespread challenge, encompassing domains such as materials, robots, and DNA sequences [1]. Traditional approaches often involve interacting with a black-box function to propose new designs, but this can be expensive or even dangerous in some cases [2-6]. In response, recent work [1] has focused on a more realistic setting known as offline model-based optimization (MBO). In this setting, the objective is to maximize a black-box function using only a static (offline) dataset of designs and scores.

A prevalent approach to this problem is to train a deep neural network (DNN) model, parameterized as $f_\theta(\cdot)$, on the static dataset, with the trained DNN serving as a proxy. The proxy enables gradient ascent on existing designs, generating improved designs by leveraging the gradient information the DNN model provides. However, the trained proxy is susceptible to the out-of-distribution problem: it produces inaccurate predictions when applied to data points that deviate significantly from the training distribution.

Correspondence to can.chen@mila.quebec. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Recent studies have observed that (a) employing a mean ensemble of trained proxies for gradient ascent in offline MBO generally leads to superior designs compared to using a single proxy [7]. This improvement stems from the ability of the ensemble to provide more robust predictions than a single proxy [8-11].

Figure 1: Motivation illustration.
Recent work has also found that (b) a trained proxy offers weak (valuable, albeit potentially unreliable) ranking supervision signals for design selection in various offline MBO contexts, such as evolutionary algorithms [12], reinforcement learning [13], and generative modeling [14]. These signals, which focus on the relative order of designs rather than their absolute scores, are more resilient to noise and inaccuracies. By exchanging these signals among the proxies in the ensemble, we can potentially enhance its robustness. As shown in Figure 1, we have three parallel proxies $f_\theta^A(\cdot)$, $f_\theta^B(\cdot)$ and $f_\theta^C(\cdot)$. For two designs $x_1^n$ and $x_2^n$ within the neighborhood of the current optimization point, proxies $f_\theta^A(\cdot)$ and $f_\theta^B(\cdot)$ agree that the score of $x_1^n$ is larger than that of $x_2^n$, while proxy $f_\theta^C(\cdot)$ disagrees. By the majority voting principle, proxies $f_\theta^A(\cdot)$ and $f_\theta^B(\cdot)$ provide the more reliable ranking, and their voted ranking signal $f^V(x_1^n) > f^V(x_2^n)$ can mentor the proxy $f_\theta^C(\cdot)$, thus enhancing its performance.

To this end, we propose an effective and novel method called parallel-mentoring that facilitates mentoring among parallel proxies to train a more robust ensemble against the out-of-distribution issue. This paper primarily focuses on the three-proxy case, referred to as tri-mentoring, but we also examine the situation with more proxies in Appendix A.1. As depicted in Figure 2, tri-mentoring consists of two modules. Module 1, voting-based pairwise supervision (shown in Figure 2(a)), operates on three parallel proxies $f_\theta^A(\cdot)$, $f_\theta^B(\cdot)$, and $f_\theta^C(\cdot)$ and uses their mean as the final prediction. To ensure consistency with the ranking information used in design selection, this module adopts a pairwise approach to represent the ranking signals of each proxy. Specifically, as illustrated in Figure 2(a), this module generates samples (e.g.,
$x_1^n$, $x_2^n$ and $x_3^n$) in the neighborhood of the current point $x$ and computes pairwise comparison labels $\hat{y}^A$ for all sample pairs, which serve as ranking supervision signals for the proxy $f_\theta^A(\cdot)$. The label $\hat{y}_{ij}^A$ is defined as 1 if $f_\theta^A(x_i) > f_\theta^A(x_j)$ and 0 otherwise, and similar signals are derived for proxies $f_\theta^B(\cdot)$ and $f_\theta^C(\cdot)$. These labels $\hat{y}^A$, $\hat{y}^B$ and $\hat{y}^C$ are combined via majority voting to create consensus labels $\hat{y}^V$, which are more reliable and thus can be used to mentor the proxies. The voted ranking signal $f^V(x_1^n) > f^V(x_2^n)$ in Figure 1 corresponds to the pairwise consensus label $\hat{y}_{12}^V = 1$ in Figure 2(a), and both can mentor the proxy $f_\theta^C(\cdot)$.

Figure 2: Illustration of tri-mentoring.

Module 2, adaptive soft-labeling (shown in Figure 2(b)), mitigates the label noise that may arise because the voting consensus is not always correct. To this end, this module initializes the consensus labels $\hat{y}^V$ from the first module as soft-labels $\hat{y}^S$. It then learns more accurate soft-labels that better represent the ranking supervision signals by leveraging the knowledge in the static dataset. Specifically, assuming accurate soft-labels, any one of the proxies $f_\theta^A(\cdot)$, $f_\theta^B(\cdot)$ or $f_\theta^C(\cdot)$, fine-tuned using them, is expected to perform well on the static dataset, since both the soft-labels (a pairwise perspective) and the static dataset (a pointwise perspective) describe the same ground truth and share underlying similarities. This formulation leads to a bi-level optimization framework with an inner fine-tuning level and an outer soft-labeling level, as shown in Figure 2(b). The inner level fine-tunes the proxy with the soft-labels, which establishes the connection between them. The outer level optimizes the soft-labels to be more accurate by minimizing the loss on the static dataset through the inner-level connection.
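The inner/outer interplay described above can be sketched in miniature. The toy below is a conceptual illustration only, under heavy simplifying assumptions: a one-parameter linear proxy $f_w(x) = wx$, a single pairwise soft-label, a hand-picked static dataset with ground truth $y = x$, and a finite-difference hypergradient in the outer level (the paper's actual inner and outer updates are its Eqs. (8) and (9), which are not reproduced here).

```python
import math

# Conceptual bi-level sketch (illustrative assumptions throughout):
# inner level fine-tunes the proxy weight w on a pairwise BCE loss with a
# soft-label; outer level nudges the soft-label to reduce the proxy's MSE
# on the static dataset, with the hypergradient taken by finite differences.
xs = [1.0, 2.0]                            # two neighborhood samples
D = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]   # static dataset, ground truth y = x

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def inner_finetune(w, y_soft, alpha=0.5):
    """Inner level: one gradient step on -[y log p + (1-y) log(1-p)],
    where p = sigma(f_w(x1) - f_w(x2))."""
    p = sigmoid(w * xs[0] - w * xs[1])
    grad_w = (p - y_soft) * (xs[0] - xs[1])    # d BCE / d w
    return w - alpha * grad_w

def data_loss(w):
    """Outer objective: MSE of the fine-tuned proxy on the static dataset."""
    return sum((w * x - y) ** 2 for x, y in D) / len(D)

# Soft-label starts at the (noisy) consensus value 1, although the true
# ranking here is f(x1) < f(x2), i.e. the correct label is 0.
w, y_soft, beta, eps = 0.5, 1.0, 0.5, 1e-4
for _ in range(50):
    # Outer level: finite-difference hypergradient of data_loss w.r.t. y_soft.
    g = (data_loss(inner_finetune(w, y_soft + eps))
         - data_loss(inner_finetune(w, y_soft - eps))) / (2 * eps)
    y_soft = min(1.0, max(0.0, y_soft - beta * g))   # keep label in [0, 1]

w = inner_finetune(w, y_soft)   # mentor the proxy with the learned label
print(round(y_soft, 2), round(w, 2))
```

In this toy setting the outer level drives the noisy soft-label from 1 down to the correct value 0, and the inner fine-tune then moves the proxy weight toward the ground-truth fit, which is the qualitative behavior the module aims for.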
The optimized labels are then fed back to the first module to adaptively mentor the proxy, ultimately yielding a more robust ensemble. Experiments on design-bench validate the effectiveness of our method. To summarize, our contributions are three-fold:

- We propose parallel-mentoring for offline MBO, which effectively exchanges weak ranking supervision signals among proxies, with a particular focus on the three-proxy case, tri-mentoring.
- Our method consists of two modules: voting-based pairwise supervision and adaptive soft-labeling. The first module generates pairwise consensus labels via majority voting to mentor the proxies.
- To mitigate label noise in the consensus labels, the second module introduces a bi-level optimization framework that adaptively fine-tunes the proxies and the soft-labels, resulting in a more robust ensemble.

2 Preliminaries: Gradient Ascent on Offline Model-based Optimization

Offline model-based optimization (MBO) aims to find the optimal design $x^*$ that maximizes the black-box objective function $f(\cdot)$:

$$x^* = \arg\max_{x} f(x) \,. \tag{1}$$

To achieve this, a static dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ of $N$ points is available, where $x_i$ represents a design and $y_i$ its corresponding score. A common approach to solving this optimization problem is to fit a deep neural network (DNN) model $f_\theta(\cdot)$ with parameters $\theta$ to the static dataset in a supervised manner. The optimal parameters $\theta^*$ are obtained by minimizing the mean squared error between the predictions and the true scores:

$$\theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \left( f_\theta(x_i) - y_i \right)^2 \,. \tag{2}$$

The trained DNN model $f_{\theta^*}(\cdot)$ acts as a proxy to optimize the design via gradient ascent steps:

$$x_{t+1} = x_t + \eta \, \nabla_x f_{\theta^*}(x)\big|_{x = x_t} \,, \quad t \in [0, T-1] \,, \tag{3}$$

where $T$ is the number of steps and $\eta$ is the learning rate; $x_T$ serves as the final design candidate. However, this method faces a challenge: the proxy is vulnerable to out-of-distribution designs.
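As a concrete illustration of Eqs. (2)-(3), the following minimal sketch fits a proxy to a static dataset and runs gradient ascent on the design. The closed-form quadratic proxy, the hidden objective, and all constants are illustrative assumptions; the paper trains a DNN proxy on real design benchmarks.

```python
import numpy as np

# Sketch of Eqs. (2)-(3): fit a proxy to the static dataset D, then do
# gradient ascent on the design. A quadratic proxy fit by least squares
# stands in for the paper's DNN (an assumption for this toy example).
rng = np.random.default_rng(0)

# Static dataset D = {(x_i, y_i)}: scores from a hidden objective f(x) = -(x-2)^2.
X = rng.uniform(-1, 1, size=50)
y = -(X - 2.0) ** 2

# "Training" per Eq. (2): least-squares fit of theta for a*x^2 + b*x + c,
# which minimizes the mean squared error in closed form.
A = np.stack([X**2, X, np.ones_like(X)], axis=1)
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

def proxy_grad(x):
    """Gradient of the fitted proxy f_theta w.r.t. the design x."""
    return 2.0 * theta[0] * x + theta[1]

# Gradient ascent per Eq. (3): x_{t+1} = x_t + eta * grad_x f_theta(x_t).
x_t = X[np.argmax(y)]      # start from the best design in D
eta, T = 0.05, 100
for _ in range(T):
    x_t = x_t + eta * proxy_grad(x_t)

print(round(float(x_t), 2))   # final design candidate x_T; prints 2.0
```

Here the proxy happens to recover the hidden objective exactly, so ascent reaches its maximizer; with a DNN proxy and real data, the iterates can leave the training distribution, which is precisely the failure mode the paper targets.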
When handling designs that substantially differ from the training distribution, the proxy yields inaccurate predictions.

3 Method

In this section, we introduce parallel-mentoring, focusing on the three-proxy scenario, also known as tri-mentoring. The method extends readily to more proxies, as discussed in Appendix A.1. Tri-mentoring consists of two modules. The first module, voting-based pairwise supervision (Section 3.1), manages three proxies in parallel and generates consensus labels via majority voting to mentor the proxies. To mitigate label noise, we introduce a second module, adaptive soft-labeling (Section 3.2), which adaptively fine-tunes proxies and soft-labels using bi-level optimization, improving ensemble robustness. The overall procedure is shown in Algorithm 1.

Algorithm 1 Tri-mentoring for Offline Model-based Optimization

Input: the static dataset $\mathcal{D}$, the number of iterations $T$, the optimizer $\mathrm{OPT}(\cdot)$.
Output: the high-scoring design $x_h$.

1: Initialize $x_0$ as the design with the highest score in $\mathcal{D}$.
2: Train proxies $f_\theta^A(\cdot)$, $f_\theta^B(\cdot)$ and $f_\theta^C(\cdot)$ on $\mathcal{D}$ with different initializations.
3: for $t \leftarrow 0$ to $T-1$ do
   // Voting-based pairwise supervision.
4:   Sample $K$ neighborhood points at $x_t$ as $\mathcal{S}(x_t)$.
5:   Compute pairwise comparison labels $\hat{y}^A$, $\hat{y}^B$ and $\hat{y}^C$ for the three proxies on $\mathcal{S}(x_t)$.
6:   Derive consensus labels: $\hat{y}^V = \mathrm{majority\_voting}(\hat{y}^A, \hat{y}^B, \hat{y}^C)$.
   // Adaptive soft-labeling.
7:   for each proxy in $[f_\theta^A(\cdot), f_\theta^B(\cdot), f_\theta^C(\cdot)]$ do
8:     Initialize soft-labels as consensus labels: $\hat{y}^S = \hat{y}^V$.
9:     Inner level: fine-tune the proxy with Eq. (8).
10:    Outer level: learn more accurate soft-labels $\hat{y}^S$ with Eq. (9).
11:    Mentor the proxy using the optimized soft-labels $\hat{y}^S$ with Eq. (8).
   // Gradient ascent with a mean ensemble.
12:  Form a more robust ensemble: $f_\theta(x) = \frac{1}{3}\left(f_\theta^A(x) + f_\theta^B(x) + f_\theta^C(x)\right)$.
13:  Gradient ascent: $x_{t+1} = x_t + \eta\,\mathrm{OPT}(\nabla_x f_\theta(x_t))$.
14: return $x_h = x_T$

3.1 Voting-based Pairwise Supervision

We train three parallel proxies $f_\theta^A(\cdot)$, $f_\theta^B(\cdot)$ and $f_\theta^C(\cdot)$ on the static dataset with different initializations and use their mean as the final prediction, as suggested in [1, 15]:

$$f_\theta(\cdot) = \frac{1}{3}\left( f_\theta^A(\cdot) + f_\theta^B(\cdot) + f_\theta^C(\cdot) \right) \,. \tag{4}$$

We then apply gradient ascent with $f_\theta(\cdot)$ on existing designs to generate improved designs as per Eq. (3). Although the mean ensemble generally results in superior designs compared to a single proxy [7] due to the ensemble robustness [8, 9], this approach does not fully exploit the potential of the weak (valuable, albeit potentially unreliable) ranking supervision signals within each proxy. Emphasizing the relative order of designs rather than their absolute scores, these signals are more resilient to noise and inaccuracies. Such ranking signals are commonly used in evolutionary algorithms [12], reinforcement learning [13], and generative modeling [14] to select designs, and could further improve the ensemble robustness. We extract the ranking supervision signals from individual proxies in the form of pairwise comparison labels, and then combine these labels via majority voting to generate consensus labels to mentor the proxies. We explain this module in detail below; its implementation is shown in Algorithm 1, Lines 4 to 6.

Pairwise comparison label. We adopt a pairwise approach to represent the ranking supervision signals for every proxy, focusing on relative order to align with the ranking information used in design selection. We sample $K$ points in the neighborhood of the optimization point $x_t$ as $\mathcal{S}(x_t) = \{x_1^n, \ldots, x_K^n\} \sim \mathcal{N}(x_t, \delta^2)$, where $\mathcal{N}(x_t, \delta^2)$ denotes a Gaussian distribution centered at $x_t$ with variance $\delta^2$. For each sample pair $(x_i^n, x_j^n)$ and a proxy (e.g., $f_\theta^A(\cdot)$), we define the pairwise comparison label $\hat{y}_{ij}^A = \mathbb{1}(f_\theta^A(x_i^n) > f_\theta^A(x_j^n))$, where $\mathbb{1}$ is the indicator function. The labels $\hat{y}^A$ over all sample pairs serve as the ranking supervision signals for the proxy $f_\theta^A(\cdot)$.
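The pairwise-label construction can be sketched as follows. The one-dimensional quadratic stand-in for $f_\theta^A$ and the constants ($x_t$, $\delta$, $K$) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Sketch of the pairwise comparison labels for one proxy: draw K samples
# from N(x_t, delta^2) around the current design, score them, and record
# y_ij = 1(f(x_i) > f(x_j)) for every pair i < j.
rng = np.random.default_rng(0)

def proxy_A(x):                    # illustrative stand-in for f^A_theta
    return -(x - 2.0) ** 2

x_t, delta, K = 1.0, 0.1, 4
S = x_t + delta * rng.standard_normal(K)   # neighborhood samples S(x_t)

scores = proxy_A(S)
y_A = {(i, j): int(scores[i] > scores[j])  # indicator 1(f(x_i) > f(x_j))
       for i in range(K) for j in range(i + 1, K)}

print(len(y_A))                    # K*(K-1)/2 = 6 pairwise labels
```

Each proxy produces such a label dictionary over the same samples; the three dictionaries are what majority voting later combines element-wise.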
We repeat this process for all proxies, generating signals $\hat{y}^B$ and $\hat{y}^C$ for proxies $f_\theta^B(\cdot)$ and $f_\theta^C(\cdot)$, respectively.

Majority voting. Given the pairwise comparison labels $\hat{y}^A$, $\hat{y}^B$ and $\hat{y}^C$, we derive the pairwise consensus labels $\hat{y}^V$ via element-wise majority voting:

$$\hat{y}_{ij}^V = \mathrm{majority\_voting}(\hat{y}_{ij}^A, \hat{y}_{ij}^B, \hat{y}_{ij}^C) \,, \tag{5}$$

where $i$ and $j$ index the neighborhood samples. As the consensus labels are generally more reliable, they can be used to mentor the proxies and promote the exchange of ranking supervision signals. Specifically, we can fine-tune the proxy $f_\theta^A(\cdot)$ using the binary cross-entropy loss, where $\sigma(f_\theta^A(x_i^n) - f_\theta^A(x_j^n))$ represents the predicted probability that $f_\theta^A(x_i^n) > f_\theta^A(x_j^n)$, as also used in ChatGPT reward model training [16-18]. The loss function can be computed as:

$$\mathcal{L} = -\frac{1}{\binom{K}{2}} \sum_{i<j} \left[ \hat{y}_{ij}^V \log \sigma\!\left(f_\theta^A(x_i^n) - f_\theta^A(x_j^n)\right) + \left(1 - \hat{y}_{ij}^V\right) \log\!\left(1 - \sigma\!\left(f_\theta^A(x_i^n) - f_\theta^A(x_j^n)\right)\right) \right] \,.$$
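A minimal sketch of the voting and mentoring signal: Eq. (5) applied element-wise to three label dictionaries, followed by the pairwise binary cross-entropy loss against the consensus. The three label sets and the proxy scores are made-up numbers for illustration; in the paper they come from $f_\theta^A$, $f_\theta^B$, $f_\theta^C$ evaluated on the same neighborhood samples.

```python
import math

# Illustrative pairwise labels from three proxies over K = 3 samples.
yA = {(0, 1): 1, (0, 2): 1, (1, 2): 1}   # proxy A's ranking
yB = {(0, 1): 1, (0, 2): 1, (1, 2): 0}   # B disagrees with A on (1, 2)
yC = {(0, 1): 0, (0, 2): 1, (1, 2): 1}   # C disagrees with A on (0, 1)

# Eq. (5): element-wise majority voting -- at least two proxies must agree.
yV = {ij: int(yA[ij] + yB[ij] + yC[ij] >= 2) for ij in yA}
print(yV)   # {(0, 1): 1, (0, 2): 1, (1, 2): 1}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_bce(scores, labels):
    """Binary cross-entropy between sigma(f(x_i) - f(x_j)) and the
    consensus labels -- the mentoring signal for a single proxy."""
    loss = 0.0
    for (i, j), y in labels.items():
        p = sigmoid(scores[i] - scores[j])   # predicted P(f(x_i) > f(x_j))
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / len(labels)

# Proxy C's current scores on the three samples; its loss against the
# consensus labels is what fine-tuning would minimize.
loss_C = pairwise_bce([0.2, 0.6, 0.7], yV)
print(round(loss_C, 3))   # prints 0.877
```

Note that proxy C ranks the samples in the opposite order of the consensus on two pairs, so its loss is high, and fine-tuning on this loss pulls C toward the majority ranking, which is the mentoring effect described above.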