Published as a conference paper at ICLR 2025

ROBUSTNESS REPROGRAMMING FOR REPRESENTATION LEARNING

Zhichao Hou1, Mohamad Ali Torkamani2, Hamid Krim1, Xiaorui Liu1
1North Carolina State University  2Amazon Web Services
zhou4@ncsu.edu, alitor@amazon.com, ahk@ncsu.edu, xliu96@ncsu.edu

ABSTRACT

This work tackles an intriguing and fundamental open challenge in representation learning: Given a well-trained deep learning model, can it be reprogrammed to enhance its robustness against adversarial or noisy input perturbations without altering its parameters? To explore this, we revisit the core feature transformation mechanism in representation learning and propose a novel nonlinear robust pattern matching technique as a robust alternative. Furthermore, we introduce three model reprogramming paradigms to offer flexible control of robustness under different efficiency requirements. Comprehensive experiments and ablation studies across diverse learning models, ranging from basic linear models and MLPs to shallow and modern deep ConvNets, demonstrate the effectiveness of our approaches. This work not only opens a promising and orthogonal direction for improving adversarial defenses in deep learning beyond existing methods but also provides new insights into designing more resilient AI systems with robust statistics. Our implementation is available at https://github.com/chris-hzc/Robustness-Reprogramming.

1 INTRODUCTION

Deep neural networks (DNNs) have made significant impacts across various domains due to their powerful capability of learning representations from high-dimensional data (LeCun et al., 2015; Goodfellow et al., 2016). However, it has been well-documented that DNNs are highly vulnerable to adversarial attacks (Szegedy, 2013; Biggio et al., 2013).
These vulnerabilities are prevalent across various model architectures, attack capacities, attack knowledge, data modalities, and prediction tasks, which hinders the reliable deployment of DNNs in real-world applications due to potential economic, ethical, and societal risks (AI, 2023). In light of the burgeoning development of AI, the robustness and reliability of deep learning models have become increasingly crucial and of particular interest. A host of endeavors attempting to safeguard DNNs have demonstrated promising robustness, including robust training (Madry, 2017; Zhang et al., 2019; Gowal et al., 2021; Li & Liu, 2023), regularization (Cisse et al., 2017; Zheng et al., 2016), and purification techniques (Ho & Vasconcelos, 2022; Nie et al., 2022; Shi et al., 2021; Yoon et al., 2021). However, these methods often suffer from catastrophic pitfalls such as cumbersome training processes or domain-specific heuristics, failing to deliver the desired robustness gains in an efficient and adaptable manner.

Despite numerous advancements in adversarial defenses, an open challenge persists: Is it possible to reprogram a well-trained model to achieve the desired robustness without modifying its parameters? This question is of particular significance in the current era of large-scale models. Reprogramming without training is highly efficient, as the pretraining-and-finetuning paradigm grants access to abundant open-source pre-trained parameters, eliminating the need for additional training. Moreover, reprogramming offers an innovative, complementary, and orthogonal approach to existing defenses, paving the way to reshape the landscape of robust deep learning.

To address this research gap, we first delve deeper into the basic neurons of DNNs to investigate the fundamental mechanism of representation learning. At its core, the linear feature transformation serves as an essential building block that captures particular feature patterns of interest in most deep learning models.
For instance, the Multi-Layer Perceptron (MLP) fundamentally consists of multiple stacked linear mapping layers and activation functions; the convolution operations in Convolutional Neural Networks (CNNs) (He et al., 2016) execute a linear feature mapping over local patches using the convolution kernels; and the attention mechanism in Transformers (Vaswani, 2017) performs linear transformations over the contextualized token vectors. This linear feature mapping functions as Linear Pattern Matching by capturing the patterns that are highly correlated with the model parameters. However, this pattern-matching mechanism is highly sensitive to data perturbations, which explains the breakdown of deep learning models in adversarial environments. To this end, we propose a novel approach, Nonlinear Robust Pattern Matching, which significantly improves robustness while maintaining the feature pattern matching behaviors. Furthermore, we introduce a flexible and efficient strategy, Robustness Reprogramming, which can be deployed under three paradigms to improve robustness, accommodating varying resource constraints and robustness requirements. This innovative framework promises to redefine the landscape of robust deep learning, paving the way for enhanced resilience against adversarial threats.
Our contributions can be summarized as follows: (1) We propose a new perspective on representation learning by formulating Linear Pattern Matching (the fundamental mechanism of feature extraction in deep learning) as an ordinary least-squares problem; (2) Built upon this novel viewpoint, we introduce Nonlinear Robust Pattern Matching as an alternative robust operation and provide theoretical convergence and robustness guarantees for its effectiveness; (3) We develop an innovative and adaptable strategy, Robustness Reprogramming, which includes three progressive paradigms to enhance the resilience of given pre-trained models; and (4) We conduct comprehensive experiments to demonstrate the effectiveness of our proposed approaches across various backbone architectures, using multiple evaluation methods and providing several insightful analyses.

2 RELATED WORKS

Adversarial Attacks. Adversarial attacks can generally be categorized into two types: white-box and black-box attacks. In white-box attacks, the attacker has complete access to the target neural network, including its architecture, parameters, and gradients. Examples of such attacks include gradient-based methods like FGSM (Goodfellow et al., 2014), DeepFool (Moosavi-Dezfooli et al., 2016), PGD (Madry, 2017), and C&W attacks (Carlini & Wagner, 2017). On the other hand, black-box attacks do not have full access to the model's internal information; the attacker can only use the model's input and output responses. Examples of black-box methods include surrogate-model-based methods (Papernot et al., 2017), zeroth-order optimization (Chen et al., 2017), and query-efficient methods (Andriushchenko et al., 2020; Alzantot et al., 2019).
Additionally, AutoAttack (Croce & Hein, 2020b), an ensemble attack that includes two modified versions of the PGD attack, a fast adaptive boundary attack (Croce & Hein, 2020a), and a black-box query-efficient square attack (Andriushchenko et al., 2020), has demonstrated strong performance and is often considered a reliable benchmark for evaluating adversarial robustness.

Adversarial Defenses. Numerous efforts have been made to enhance the robustness of deep learning models, which can broadly be categorized into empirical defenses and certifiable defenses. Empirical defenses focus on increasing robustness through various strategies: robust training methods (Madry, 2017; Zhang et al., 2019; Gowal et al., 2021; Li & Liu, 2023) introduce adversarial perturbations into the training data, while regularization-based approaches (Cisse et al., 2017; Zheng et al., 2016) stabilize models by constraining the Lipschitz constant or spectral norm of the weight matrix. Additionally, detection techniques (Metzen et al., 2017; Feinman et al., 2017; Grosse et al., 2017) aim to defend against attacks by identifying adversarial inputs. Purification-based approaches seek to eliminate adversarial signals before performing downstream tasks (Ho & Vasconcelos, 2022; Nie et al., 2022; Shi et al., 2021; Yoon et al., 2021). Recently, some novel approaches have emerged that improve robustness from the perspectives of ordinary differential equations (Kang et al., 2021; Li et al., 2022; Yan et al., 2019) and generative models (Wang et al., 2023; Nie et al., 2022; Rebuffi et al., 2021). Beyond empirical defenses, certifiable defenses (Cohen et al., 2019; Gowal et al., 2018; Fazlyab et al., 2019) offer theoretical guarantees of robustness within specific regions against any attack. However, many of these methods suffer from significant overfitting issues or depend on domain-specific heuristics, which limits their effectiveness and adaptability in achieving satisfactory robustness.
Additionally, techniques like robust training often entail high computational and training costs, especially when dealing with diverse noisy environments, thereby limiting their scalability and flexibility for broader applications. The contribution of robustness reprogramming in this work is fully orthogonal to existing efforts, and they can be integrated for further enhancement.

3 ROBUST NONLINEAR PATTERN MATCHING

In this section, we begin by exploring the vulnerability of representation learning from the perspective of pattern matching and subsequently introduce a novel robust feature matching in Section 3.1. Following this, we develop a Newton-IRLS algorithm, which is unrolled into robust layers in Section 3.2. Lastly, we present a theoretical robustness analysis of this architecture in Section 3.3.

Notation. Let the input features of one instance be represented as x = (x_1, ..., x_D) ∈ R^D, and the parameter vector as a = (a_1, ..., a_D) ∈ R^D, where D denotes the feature dimension. For simplicity, we describe our method in the case where the output is one-dimensional, i.e., z ∈ R.

3.1 A NEW PERSPECTIVE OF REPRESENTATION LEARNING

Figure 1: Vanilla Linear Pattern Matching (LPM) vs. Nonlinear Robust Pattern Matching (NRPM), contrasting their one-layer optimization models: LPM admits the closed-form optimal solution, while NRPM is solved with the Newton-IRLS algorithm.

Fundamentally, DNNs inherently function as representation learning modules by transforming raw data into progressively more compact embeddings (LeCun et al., 2015; Goodfellow et al., 2016). The linear feature transformation, z = a^T x = Σ_{d=1}^D a_d x_d, is the essential building block of deep learning models for capturing particular feature patterns of interest. Specifically, a certain pattern x can be captured and matched once it is highly correlated with the model parameter a.
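As a toy illustration of this pattern-matching view (our own example, not from the paper), the linear response z = Σ_d a_d x_d is largest when the input aligns with the parameter vector and vanishes for an orthogonal pattern:

```python
# A linear filter a responds strongly to an input that matches its pattern
# and not at all to an orthogonal one: z = sum_d a_d * x_d.
a = [0.5, -0.5, 0.5, -0.5]
match = [0.5, -0.5, 0.5, -0.5]      # pattern aligned with a
mismatch = [0.5, 0.5, -0.5, -0.5]   # pattern orthogonal to a

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

print(dot(a, match))     # 1.0  (strong response: pattern captured)
print(dot(a, mismatch))  # 0.0  (no response)
```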
Despite the impressive capability of the linear operator in enhancing the representation learning of DNNs, vanilla deep learning models have been shown to be highly vulnerable (Szegedy, 2013; Biggio et al., 2013). Existing approaches, including robust training and regularization techniques (Madry, 2017; Cisse et al., 2017; Zheng et al., 2016), aim to improve the robustness of feature transformations by constraining the parameters a with particular properties. However, these methods inevitably alter the feature matching behaviors, often leading to clean performance degradation without necessarily achieving improved robustness. Different from existing works, we aim to introduce a novel perspective by exploring how to design an alternative feature mapping that enhances robustness while maximally preserving feature matching behaviors.

First, we formulate the linear feature pattern matching as the optimal closed-form solution of the following problem:

    min_{z ∈ R} L(z) = Σ_{d=1}^D (z/D − a_d x_d)^2,

where the first-order optimality condition ∂L(z)/∂z = 0 yields the linear transformation z = Σ_{d=1}^D a_d x_d. Since this estimation is highly sensitive to outlying values due to the quadratic penalty, we propose to derive a robust alternative inspired by the Least Absolute Deviation (LAD) estimation:

    min_{z ∈ R} L(z) = Σ_{d=1}^D |z/D − a_d x_d|.    (1)

By replacing the quadratic penalty with a linear alternative on the residual z/D − a_d x_d, the impact of outliers can be significantly mitigated according to robust statistics (Huber & Ronchetti, 2011).

Methodology. The NRPM architecture draws significant inspiration from optimization-induced deep learning architectures (Ma et al., 2021; Fan et al., 2022; Hou et al., b; Liu et al., 2021b;a; Hou et al., a), where robust alternatives are derived from robust optimization formulations.
Specifically, it reinterprets the linear feature pattern matching in the backbone models as the solution to a well-defined optimization objective, and a corresponding optimization algorithm is developed to solve the problem induced by the nonlinear robust pattern matching. This approach highlights a promising direction for designing principled and inherently robust deep learning architectures.

3.2 ALGORITHM DEVELOPMENT AND ANALYSIS

Although the LAD estimator offers robustness implications, the non-smooth objective in Eq. (1) poses a challenge in designing an efficient algorithm that can be integrated into neural network layers. To this end, we leverage the Newton Iterative Reweighted Least Squares (Newton-IRLS) algorithm to address the non-smooth nature of the absolute value operator |·| by optimizing an alternative smoothed objective function U with Newton's method. In this section, we first introduce the localized upper bound U for L in Lemma 3.1, and then derive the Newton-IRLS algorithm to optimize L.

Lemma 3.1. Let L(z) be defined as in Eq. (1), and for any fixed point z_0, let U(z, z_0) be defined as

    U(z, z_0) = Σ_{d=1}^D w_d (a_d x_d − z/D)^2 + (1/2) L(z_0),    (2)

where w_d = 1 / (2 |a_d x_d − z_0/D|). Then, for any z, the following holds: (1) U(z, z_0) ≥ L(z); (2) U(z_0, z_0) = L(z_0).

Proof. Please refer to Appendix B.1.

Statement (1) indicates that U(z, z_0) serves as an upper bound for L(z), while statement (2) shows that U(z, z_0) equals L(z) at the point z_0. With z_0 fixed, the alternative objective U(z, z_0) in Eq. (2) is quadratic and can be efficiently optimized. Therefore, instead of minimizing the non-smooth L(z) directly, the Newton-IRLS algorithm obtains z^{(k+1)} by optimizing the quadratic upper bound U(z, z^{(k)}) with the second-order Newton method:

    z^{(k+1)} = D · (Σ_{d=1}^D w_d^{(k)} a_d x_d) / (Σ_{d=1}^D w_d^{(k)}),    (3)

where w_d^{(k)} = 1 / |a_d x_d − z^{(k)}/D|. Please refer to Appendix B.2 for the detailed derivation.
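Since the LAD objective in Eq. (1) is minimized when z/D equals the median of the products {a_d x_d} (whereas the quadratic objective is minimized at their mean), the Newton-IRLS iterates of Eq. (3) should approach D times that median. A minimal sketch in plain Python; the toy vectors and the eps guard against division by zero are our additions, not part of the paper:

```python
import statistics

def newton_irls(a, x, K=20, eps=1e-8):
    """Newton-IRLS iteration of Eq. (3), initialized at the LPM output z^(0).

    eps guards the division when a residual |a_d x_d - z/D| vanishes
    (our numerical-stability assumption, not in the paper).
    """
    D = len(a)
    prods = [ai * xi for ai, xi in zip(a, x)]
    z = sum(prods)  # z^(0) = z_LPM
    for _ in range(K):
        # w_d^(k) = 1 / |a_d x_d - z^(k)/D|
        w = [1.0 / (abs(p - z / D) + eps) for p in prods]
        # z^(k+1) = D * (sum_d w_d^(k) a_d x_d) / (sum_d w_d^(k))
        z = D * sum(wi * pi for wi, pi in zip(w, prods)) / sum(w)
    return z

a = [1.0, 1.0, 1.0, 1.0, 1.0]
x = [11.0, 1.1, 0.9, 1.0, 1.0]   # one heavily perturbed coordinate
prods = [ai * xi for ai, xi in zip(a, x)]

z_lpm = sum(prods)                             # pulled far up by the outlier
z_lad = len(prods) * statistics.median(prods)  # LAD minimizer of Eq. (1): 5.0
z_irls = newton_irls(a, x, K=50)               # approaches z_lad
```

Each Newton-IRLS step only reweights and re-averages the same products a_d x_d, which is why pre-trained linear parameters remain usable inside the robust operator.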
As a consequence of Lemma 3.1, we can conclude that the iterates {z^{(k)}}_{k=1}^K satisfy the loss descent property of L(z):

    L(z^{(k+1)}) ≤ U(z^{(k+1)}, z^{(k)}) ≤ U(z^{(k)}, z^{(k)}) = L(z^{(k)}).

This implies that Eq. (3) achieves convergence of L by optimizing the localized upper bound U.

Implementation. The proposed nonlinear feature pattern matching is expected to improve robustness against data perturbations in any deep learning model by replacing the vanilla linear feature transformation. In this paper, we illustrate its use cases through MLPs and convolution models. We provide detailed implementation techniques in Appendix A. Moreover, we demonstrate how to leverage this technique for robustness reprogramming for representation learning in Section 4.

3.3 THEORETICAL ROBUSTNESS ANALYSIS

In this section, we conduct a theoretical robustness analysis comparing the vanilla Linear Pattern Matching (LPM) architecture with our Nonlinear Robust Pattern Matching (NRPM) based on the influence function (Law, 1986). For simplicity, we consider a single-step case of our Newton-IRLS algorithm (K = 1). Denote the weighted feature random variable as X and the corresponding empirical distribution as F(X) = (1/D) Σ_{d=1}^D I{X = a_d x_d}. Then we can represent LPM as z_LPM := T_LPM(F) = Σ_{d=1}^D a_d x_d and NRPM as

    z_NRPM := T_NRPM(F) = D · (Σ_{d=1}^D w_d a_d x_d) / (Σ_{d=1}^D w_d),

where w_d = 1 / |a_d x_d − z_LPM/D|. We derive their influence functions in Theorem 3.2 to demonstrate their sensitivity against input perturbations, with a proof presented in Appendix B.3.

Theorem 3.2 (Robustness Analysis via Influence Function). The influence function is defined as the sensitivity of the estimate to a small contamination at x̃:

    IF(x̃; T, F) = lim_{ε→0} (T(F_ε) − T(F)) / ε,

where the contaminated distribution is F_ε = (1 − ε) F + ε δ_x̃, δ_x̃ is the Dirac delta function centered at x̃, and F is the distribution of x.
Then, we have:

    IF(x̃; T_LPM, F) = D (x̃ − z_LPM/D),
    IF(x̃; T_NRPM, F) = D w_x̃ (x̃ − z_NRPM/D) / (Σ_{d=1}^D w_d),

where w_x̃ = 1 / |x̃ − z_LPM/D|.

Theorem 3.2 provides several insights into the robustness of the LPM and NRPM models: For LPM, the influence function is D (x̃ − z_LPM/D), indicating that the sensitivity of LPM depends on the difference between the perturbation x̃ and the average clean estimation z_LPM/D. For NRPM, the influence function is D w_x̃ (x̃ − z_NRPM/D) / Σ_{d=1}^D w_d, where w_x̃ = 1 / |x̃ − z_LPM/D|. Although the robustness of NRPM is also affected by the difference x̃ − z_NRPM/D, the influence can be significantly mitigated by the weight w_x̃, particularly when x̃ deviates from the average estimation z_LPM/D of the clean data. These analyses provide insight into and an explanation for the robust nature of the proposed technique.

4 ROBUSTNESS REPROGRAMMING

In this section, we are ready to introduce the robustness reprogramming techniques based on the nonlinear robust pattern matching (NRPM) derived in Section 3.1. One naive approach is to simply replace the vanilla linear pattern matching (LPM) with NRPM. However, this naive approach does not work well in practice, and we instead propose three robustness reprogramming paradigms to improve robustness, accommodating varying resource constraints and robustness requirements.

Algorithm 1 Hybrid Architecture
Require: {x_d}_{d=1}^D, {a_d}_{d=1}^D, λ.
  Initialize z^{(0)}_NRPM = z_LPM = Σ_{d=1}^D a_d x_d
  for k = 0, 1, ..., K − 1 do
    w_d^{(k)} = 1 / |a_d x_d − z^{(k)}_NRPM/D|, ∀ d ∈ [D]
    z^{(k+1)}_NRPM = D · (Σ_{d=1}^D w_d^{(k)} a_d x_d) / (Σ_{d=1}^D w_d^{(k)})
  end for
  return λ z_LPM + (1 − λ) z^{(K)}_NRPM

While NRPM enhances model robustness, it can lead to a decrease in clean accuracy, particularly in deeper models. This reduction in performance may be due to the increasing estimation error across layers.
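The two influence functions of Theorem 3.2 and the hybrid forward pass of Algorithm 1 can both be sketched in a few lines of plain Python. The toy numbers and the eps guard against division by zero are our own illustration, not from the paper:

```python
def influence_lpm(x_t, prods):
    # IF(x~; T_LPM, F) = D * (x~ - z_LPM / D)
    D = len(prods)
    return D * (x_t - sum(prods) / D)

def influence_nrpm(x_t, prods, eps=1e-8):
    # IF(x~; T_NRPM, F) = D * w_x~ * (x~ - z_NRPM / D) / sum_d w_d
    D = len(prods)
    z_lpm = sum(prods)
    w = [1.0 / (abs(p - z_lpm / D) + eps) for p in prods]
    z_nrpm = D * sum(wi * pi for wi, pi in zip(w, prods)) / sum(w)
    w_t = 1.0 / (abs(x_t - z_lpm / D) + eps)
    return D * w_t * (x_t - z_nrpm / D) / sum(w)

def hybrid(a, x, lam, K=3, eps=1e-8):
    # Algorithm 1: blend the LPM output with K steps of the NRPM iteration.
    D = len(a)
    prods = [ai * xi for ai, xi in zip(a, x)]
    z_lpm = sum(prods)
    z = z_lpm                      # z_NRPM^(0) initialized at z_LPM
    for _ in range(K):
        w = [1.0 / (abs(p - z / D) + eps) for p in prods]
        z = D * sum(wi * pi for wi, pi in zip(w, prods)) / sum(w)
    return lam * z_lpm + (1.0 - lam) * z

prods = [1.02, 1.1, 0.9, 1.05, 0.95]    # clean products a_d * x_d
# An outlying contamination at x~ = 11 has a large influence on LPM but a
# heavily down-weighted influence on NRPM:
print(influence_lpm(11.0, prods))        # ~ 50
print(abs(influence_nrpm(11.0, prods)))  # orders of magnitude smaller
```

With lam = 1.0, `hybrid` reduces exactly to the vanilla linear transformation, which is why Paradigm 1 can reuse pre-trained parameters unchanged.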
To balance the clean-robustness performance trade-off between LPM and NRPM, we develop a hybrid architecture as shown in Algorithm 1, where their balance is controlled by hyperparameters {λ_l}_{l=1}^L and L is the number of layers in the entire model. Based on this hybrid architecture, we propose three robustness reprogramming paradigms as shown in Figure 2:

Figure 2: Three Robustness Reprogramming Paradigms: (1) Paradigm 1 freezes the model parameters and treats {λ_l}_{l=1}^L as fixed hyperparameters; (2) Paradigm 2 freezes the model parameters but allows {λ_l}_{l=1}^L to be learnable; (3) Paradigm 3 enables both the model parameters and {λ_l}_{l=1}^L to be learnable.

Paradigm 1: without fine-tuning, good robustness at zero cost. As deep learning models become increasingly larger, it is critical to fully utilize pre-trained model parameters. Since the robust NRPM only slightly refines the vanilla LPM through an adaptive instance-wise reweighting scheme, the pre-trained LPM parameters still fit well within the NRPM architecture. Additionally, by adjusting the hyperparameters {λ_l}_{l=1}^L with pre-trained parameters, we can strike an ideal balance between natural and robust accuracy. It is worth noting that relying solely on pre-trained parameters with this plug-and-play paradigm significantly reduces computational costs, which is crucial in the era of large-scale deep learning.

Paradigm 2: only fine-tuning {λ_l}_{l=1}^L, strong robustness at slight cost. One drawback of Paradigm 1 is that we specify the same λ for all the layers and need to conduct a brute-force hyperparameter search to obtain the optimal one. However, hyperparameter search is time-consuming and computation-intensive. Moreover, the entire model requires layer-wise {λ_l}_{l=1}^L to balance LPM and NRPM at different layers. To address this, we propose Paradigm 2, which automatically learns the optimal λ with lightweight fine-tuning.
This paradigm only fine-tunes the hyperparameters {λ_l}_{l=1}^L while keeping the model parameters frozen, which is very efficient.

Paradigm 3: overall fine-tuning, superior robustness at acceptable cost. To achieve the best robustness, we can make both the model parameters {a_d}_{d=1}^D and the hyperparameters {λ_l}_{l=1}^L learnable. By refining these parameters based on the pre-trained model, we can prevent clean performance degradation. Moreover, with learnable {λ_l}_{l=1}^L during adversarial training, the model can automatically select the optimal combination to enhance both clean and robust performance. This paradigm is still efficient since it only needs to fine-tune the model lightly for a few training epochs.

5 EXPERIMENT

In this section, we comprehensively evaluate the effectiveness of our proposed Robustness Reprogramming techniques using a wide range of backbone architectures, starting from basic MLPs, progressing to the shallow LeNet model, and extending to the widely used ResNet architecture.

5.1 EXPERIMENTAL SETTING

Datasets. We conduct the experiments on the MNIST (LeCun & Cortes, 2005), SVHN (Netzer et al., 2011), CIFAR10 (Krizhevsky et al., 2009), and ImageNet10 (Russakovsky et al., 2015) datasets.

Backbone architectures. We select backbones ranging from very basic MLPs with 1, 2, or 3 layers, to mid-level architectures like LeNet, and deeper networks such as ResNet10, ResNet18, and ResNet34 (He et al., 2016). In some experiments with ResNets, we chose the narrower version (with the model width reduced by a factor of 8) for computational considerations. Additionally, we also include the popular MLP-Mixer architecture (Tolstikhin et al., 2021).

Evaluation methods. We assess the performance of the models against various attacks under the L∞ norm, including FGSM (Goodfellow et al., 2014), PGD-20 (Madry, 2017), C&W (Carlini & Wagner, 2017), and AutoAttack (Croce & Hein, 2020b).
Among them, AutoAttack is an ensemble attack consisting of three adaptive white-box attacks and one black-box attack, which is considered a reliable evaluation method that avoids a false sense of security. In addition to the empirical robustness evaluation, we also evaluate certified robustness to further demonstrate the robustness of our proposed architecture.

Baselines & hyperparameter setting. For the ResNet backbones, we compare against baselines including PGD-AT (Madry, 2017), TRADES (Zhang et al., 2019), MART (Wang et al., 2019), SAT (Huang et al., 2020), and AWP (Wu et al., 2020). We train the baselines for 200 epochs with batch size 128, weight decay 2e-5, momentum 0.9, and an initial learning rate of 0.1 that is divided by 10 at the 100-th and 150-th epochs. For the MLP and LeNet backbones, we train the vanilla models for 50 epochs. Our robustness reprogramming fine-tunes the pre-trained models for 5 epochs.

5.2 ROBUSTNESS REPROGRAMMING ON MLPS

Table 1: Robustness reprogramming under 3 paradigms on MNIST with a 3-layer MLP as the backbone.
Method / Budget           Natural  0.05   0.1    0.15   0.2    0.25   0.3    [λ1, λ2, λ3]
Normal-train              90.8     31.8   2.6    0.0    0.0    0.0    0.0    N/A
Adv-train                 76.4     66.0   57.6   46.9   35.0   23.0   9.1    N/A
Paradigm 1                90.8     31.8   2.6    0.0    0.0    0.0    0.0    [1.0, 1.0, 1.0]
(without tuning)          90.8     56.6   17.9   8.5    4.6    3.0    2.3    [0.9, 0.9, 0.9]
                          90.4     67.1   30.8   17.4   10.6   6.5    4.5    [0.8, 0.8, 0.8]
                          89.7     73.7   43.5   25.5   16.9   11.7   9.2    [0.7, 0.7, 0.7]
                          88.1     75.3   49.0   31.0   22.0   15.5   12.4   [0.6, 0.6, 0.6]
                          84.1     74.4   50.0   31.9   22.8   18.1   14.3   [0.5, 0.5, 0.5]
                          78.8     70.4   48.3   33.9   24.1   18.4   14.6   [0.4, 0.4, 0.4]
                          69.5     62.6   45.2   31.5   23.1   19.0   15.5   [0.3, 0.3, 0.3]
                          58.5     53.2   38.2   27.6   22.2   16.4   12.9   [0.2, 0.2, 0.2]
                          40.7     38.3   29.7   22.8   16.8   12.9   11.1   [0.1, 0.1, 0.1]
                          18.8     17.6   16.4   14.6   12.4   10.7   9.4    [0.0, 0.0, 0.0]
Paradigm 2 (tuning λ)     81.5     75.3   61.2   44.7   33.7   26.0   20.1   [0.459, 0.033, 0.131]
Paradigm 3 (tuning all)   86.1     81.7   75.8   66.7   58.7   50.1   39.8   [0.925, 0.119, 0.325]

Comparison of robustness reprogramming via various paradigms. To compare the robustness of our reprogramming under the 3 paradigms as well as vanilla normal/adversarial training, we evaluate model performance under the FGSM attack across various budgets with a 3-layer MLP as the backbone model. From the results in Table 1, we make the following observations: In terms of robustness, our Robustness Reprogramming progressively enhances performance across the three paradigms. In Paradigm 1, by adjusting {λ_l}_{l=1}^L, an optimal balance between clean and robust accuracy can be achieved without the need for parameter fine-tuning. Moreover, Paradigm 2 can automatically learn the layer-wise set {λ_l}_{l=1}^L, improving robustness compared to Paradigm 1 (with fixed {λ_l}_{l=1}^L). Furthermore, Paradigm 3, by fine-tuning all the parameters, demonstrates the best performance among all the methods compared. Regarding natural accuracy, we observe from Paradigm 1 that increasing the inclusion of NRPM (i.e., smaller {λ_l}_{l=1}^L) results in a decline in performance.
However, this sacrifice can be mitigated by fine-tuning {λ_l}_{l=1}^L or the entire model, as shown in Paradigms 2 and 3.

Behavior analysis on automated learning of {λ_l}_{l=1}^L. In Paradigm 2, we assert and expect that the learnable {λ_l}_{l=1}^L across layers can achieve optimal performance under a specified noisy environment while maintaining the pre-trained parameters. To further validate this assertion and investigate the behavior of the learned {λ_l}_{l=1}^L, we simulate various noisy environments by introducing adversarial perturbations (FGSM) into the training data at different noise levels ε. We initialize {λ_l}_{l=1}^L across all layers to 0.5. From the results shown in Table 2, we make the following observations:

Table 2: Automated learning of {λ_l}_{l=1}^L under adversarial training with various noise levels.

Budget                Natural  0.05   0.1    0.15   0.2    0.25   0.3    Learned {λ_l}_{l=1}^L
Adv-train (ε = 0.0)   91.3     75.7   45.0   24.9   14.9   10.5   8.0    [0.955, 0.706, 0.722]
Adv-train (ε = 0.05)  91.2     76.8   50.6   27.8   16.6   10.5   8.1    [0.953, 0.624, 0.748]
Adv-train (ε = 0.1)   91.0     82.6   62.5   41.7   26.4   19.8   14.9   [0.936, 0.342, 0.700]
Adv-train (ε = 0.15)  90.6     82.3   66.8   49.4   35.0   25.5   19.9   [0.879, 0.148, 0.599]
Adv-train (ε = 0.2)   90.3     82.5   67.5   49.3   35.8   27.0   21.3   [0.724, 0.076, 0.420]
Adv-train (ε = 0.25)  87.7     81.0   66.0   48.5   36.4   26.6   21.6   [0.572, 0.049, 0.243]
Adv-train (ε = 0.3)   81.5     75.3   61.2   44.7   33.7   26.0   20.1   [0.459, 0.033, 0.131]

The learned {λ_l}_{l=1}^L values are layer-specific, indicating that setting the same λ for every layer in Paradigm 1 is not an optimal strategy. Automated learning of {λ_l}_{l=1}^L enables the discovery of the optimal combination across layers. As the noise level in the training data increases, the learned {λ_l}_{l=1}^L values tend to decrease, causing the hybrid architecture to resemble the more robust NRPM architecture. This suggests that our proposed Paradigm 2 can adaptively adjust {λ_l}_{l=1}^L to accommodate noisy environments.
Ablation studies. To further investigate the effectiveness of our proposed robust architecture, we provide several ablation studies on the backbone size, the attack budget measurement, the additional MLP-Mixer backbone (Tolstikhin et al., 2021), and the number of layers K in Appendix C.2, Appendix C.3, and Appendix C.4, respectively. These experiments demonstrate the consistent advantages of our method across different backbone sizes and attack budget measurements. Additionally, increasing the number of layers K can further enhance robustness, though at the cost of a slight decrease in clean performance.

5.3 ROBUSTNESS REPROGRAMMING ON CONVOLUTION

Robustness reprogramming on convolution. We evaluate the performance of our robustness reprogramming with various weights {λ_l}_{l=1}^L in Figure 3, where we observe that incorporating more NRPM-induced embedding significantly improves robustness while sacrificing a small amount of clean performance. Moreover, by fine-tuning the NRPM model, we can significantly improve robust performance while compensating for the sacrifice in clean accuracy.

Adversarial fine-tuning. Beyond natural training, we also validate the advantage of our NRPM architecture over the vanilla LPM under adversarial training. We utilize the pretrained parameters from the LPM architecture, and track the clean/robust performance across 10 epochs of adversarial fine-tuning for both architectures in Figure 4. The curves demonstrate the consistent improvement of NRPM over LPM across all epochs.

Figure 3: Robustness reprogramming on LeNet. The depth of color represents the size of the budget.

Figure 4: Adversarial fine-tuning on LeNet.

Hidden embedding visualization. We also conduct visualization analyses on the hidden embeddings to obtain better insight into the effectiveness of our proposed NRPM.
First, we quantify the difference between clean embeddings (x or z_i) and attacked embeddings (x′ or z′_i) across all layers in Table 3, and visualize them in Figure 5 and Figure 11. The results in Table 3 show that NRPM-LeNet has smaller embedding differences across layers, indicating that our proposed NRPM architecture indeed mitigates the impact of the adversarial perturbation. Moreover, as demonstrated in the example in Figure 5, the presence of adversarial perturbations can disrupt the hidden embedding patterns, leading to incorrect predictions in the case of vanilla LeNet. In contrast, our NRPM-LeNet appears to lessen the effects of such perturbations and maintains the ground-truth prediction. From the figures, we can also clearly see that the difference between clean and attacked embeddings is much more significant in LPM-LeNet than in NRPM-LeNet.

Table 3: Embedding difference between clean and adversarial data (ε = 0.3) in LeNet (MNIST).

LPM-LeNet     ||·||_1   ||·||_2   ||·||_∞
|x − x′|      93.21     27.54     0.30
|z1 − z′1|    271.20    116.94    1.56
|z2 − z′2|    79.52     62.17     1.89
|z3 − z′3|    18.77     22.56     2.32
|z4 − z′4|    9.84      18.95     3.34

NRPM-LeNet    ||·||_1   ||·||_2   ||·||_∞
|x − x′|      90.09     26.67     0.30
|z1 − z′1|    167.52    58.55     1.19
|z2 − z′2|    19.76     4.27      0.56
|z3 − z′3|    3.63      0.98      0.50
|z4 − z′4|    2.21      0.79      0.57

Figure 5: Visualization of hidden embeddings. The LPM-LeNet is more sensitive to perturbation than the NRPM-LeNet: (a) When comparing z_i (1st row) and z′_i (2nd row), LPM (left) shows a more significant difference than NRPM (right). (b) When comparing the prediction likelihoods, the perturbation misleads LPM from predicting 4 to 8, while NRPM consistently predicts 4 in both clean and noisy scenarios.

Additional experiments. To further demonstrate the effectiveness of our proposed method, we include experiments on two additional datasets, SVHN and ImageNet10, which are provided in Appendix D.2.
All these experiments demonstrate the consistent advantages of our proposed method.

5.4 ROBUSTNESS REPROGRAMMING ON RESNETS

In this section, we evaluate the robustness of robustness reprogramming on ResNets across various attacks and further validate its effectiveness under diverse settings in the ablation studies.

Table 4: Robustness reprogramming on CIFAR10 with a narrow ResNet18 as the backbone.

Budget ε                  Natural  8/255   16/255  32/255  {λ_l}_{l=1}^L
Paradigm 1                67.89    41.46   16.99   3.17    λ = 1.0
(without tuning)          59.23    40.03   23.05   8.71    λ = 0.9
                          40.79    28.32   20.27   13.70   λ = 0.8
                          24.69    18.34   15.31   13.09   λ = 0.7
                          17.84    14.26   12.63   11.66   λ = 0.6
                          15.99    12.51   11.42   11.07   λ = 0.5
                          10.03    10.02   10.0    10.0    λ = 0.4
Paradigm 2 (tuning λ)     69.08    44.09   24.94   12.63   Learnable
Paradigm 3 (tuning all)   71.79    50.89   39.58   30.03   Learnable

Robustness reprogramming on ResNets. We evaluate the robustness of our robustness reprogramming under PGD via the three paradigms in Table 4. From the results, we observe that: (1) In Paradigm 1, adding more NRPM-based embeddings without fine-tuning leads to a notable drop in clean performance on ResNet18, which also limits the robustness improvement. (2) In Paradigm 2, by adjusting only the {λ_l}_{l=1}^L values, we can improve both clean and robust performance, indicating the need for a layer-wise balance across different layers. (3) In Paradigm 3, by tuning both {λ_l}_{l=1}^L and the parameters, we observe that both accuracy and robustness can be further improved.

Adversarial robustness. To compare our robustness reprogramming with existing methods, we select several popular adversarial defenses and report the experimental results with the ResNet18 backbone under various attacks in Table 5. From the results, we observe that our robustness reprogramming exhibits excellent robustness across various attacks.

Table 5: Adversarial robustness on CIFAR10 with ResNet18 as the backbone.
Method              Natural  PGD    FGSM   C&W    AA     DeepFool  SPSA   AVG
PGD-AT              80.90    44.35  58.41  46.72  42.14  14.81     62.92  44.89
TRADES-2.0          82.80    48.32  51.67  40.65  36.40  25.91     64.29  44.54
TRADES-0.2          85.74    32.63  44.26  26.70  19.00  12.98     57.79  32.23
MART                79.03    48.90  60.86  45.92  43.88  25.63     56.55  46.96
SAT                 63.28    43.57  50.13  47.47  39.72  22.34     53.47  42.78
AWP                 81.20    51.60  55.30  48.00  46.90  26.25     61.37  48.24
Consistency         84.37    45.19  53.84  43.75  40.88  21.27     65.91  45.14
DYNAT               82.34    52.25  65.96  52.19  45.10  28.72     67.97  52.03
Paradigm 3 (Ours)   80.43    57.23  70.23  64.07  52.60  36.50     67.56  58.03

Figure 6: Certified robustness via randomized smoothing with various σ levels.

Figure 7: Ablation study on backbone size. The depth of color represents the budget size.

Certified robustness. Additionally, we evaluate the certified robustness using randomized smoothing with various σ levels in Figure 6. The curves presented in the figure demonstrate a significant advantage of our NRPM over the vanilla LPM, further validating the effectiveness of our proposed architecture.

Different backbone sizes & budgets. Here, we conduct ablation studies on the backbone size and budget under AutoAttack in Figure 7 and leave the results under PGD to Appendix E. The results show the evident advantage of our NRPM architecture.

6 CONCLUSION

This paper addresses a fundamental challenge in representation learning: how to reprogram a well-trained model to enhance its robustness without altering its parameters. We begin by revisiting the essential linear pattern matching in representation learning and then introduce an alternative non-linear robust pattern matching mechanism. Additionally, we present a novel and efficient Robustness Reprogramming framework, which can be flexibly applied under three paradigms, making it suitable for practical scenarios. Our theoretical analysis and comprehensive empirical evaluation demonstrate significant and consistent performance improvements.
This research offers a promising and complementary approach to strengthening adversarial defenses in deep learning, significantly contributing to the development of more resilient AI systems.

Ethics statement. This paper proposes robustness reprogramming techniques to enhance the robustness and safety of machine learning models. We do not identify any potential negative concerns.

Reproducibility statement. This paper provides all necessary technical details for reproducibility, including theoretical analysis, algorithm details, experimental settings, pseudo code, implementation, and source code of the proposed techniques.

ACKNOWLEDGMENT

Zhichao Hou and Dr. Xiaorui Liu are supported by the NSF National AI Research Resource Pilot Award, Amazon Research Award, NCSU Data Science Academy Seed Grant Award, and NCSU Faculty Research and Professional Development Award.

REFERENCES

NIST AI. Artificial Intelligence Risk Management Framework (AI RMF 1.0). 2023.

Moustafa Alzantot, Yash Sharma, Supriyo Chakraborty, Huan Zhang, Cho-Jui Hsieh, and Mani B Srivastava. GenAttack: Practical black-box attacks with gradient-free optimization. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1111–1119, 2019.

Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, and Matthias Hein. Square attack: a query-efficient black-box adversarial attack via random search. In European Conference on Computer Vision, pp. 484–501. Springer, 2020.

Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III 13, pp. 387–402. Springer, 2013.

Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks.
In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE, 2017.

Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26, 2017.

Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning, pp. 854–863. PMLR, 2017.

Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310–1320. PMLR, 2019.

Francesco Croce and Matthias Hein. Minimally distorted adversarial examples with a fast adaptive boundary attack. In International Conference on Machine Learning, pp. 2196–2205. PMLR, 2020a.

Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International Conference on Machine Learning, pp. 2206–2216. PMLR, 2020b.

Wenqi Fan, Xiaorui Liu, Wei Jin, Xiangyu Zhao, Jiliang Tang, and Qing Li. Graph trend filtering networks for recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 112–121, 2022.

Mahyar Fazlyab, Alexander Robey, Hamed Hassani, Manfred Morari, and George Pappas. Efficient and accurate estimation of Lipschitz constants for deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.

Reuben Feinman, Ryan R Curtin, Saurabh Shintre, and Andrew B Gardner. Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410, 2017.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT Press, 2016.
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Relja Arandjelovic, Timothy Mann, and Pushmeet Kohli. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715, 2018.

Sven Gowal, Sylvestre-Alvise Rebuffi, Olivia Wiles, Florian Stimberg, Dan Andrei Calian, and Timothy A Mann. Improving robustness using generated data. Advances in Neural Information Processing Systems, 34:4218–4233, 2021.

Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel. On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Chih-Hui Ho and Nuno Vasconcelos. DISCO: Adversarial defense with local implicit functions. Advances in Neural Information Processing Systems, 35:23818–23837, 2022.

Zhichao Hou, Ruiqi Feng, Tyler Derr, and Xiaorui Liu. Robust graph neural networks via unbiased aggregation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, a.

Zhichao Hou, Weizhi Gao, Yuchen Shen, and Xiaorui Liu. ProTransformer: Robustify transformers via plug-and-play paradigm. In ICLR 2024 Workshop on Reliable and Responsible Foundation Models, b.

Lang Huang, Chao Zhang, and Hongyang Zhang. Self-adaptive training: beyond empirical risk minimization. Advances in Neural Information Processing Systems, 33:19365–19376, 2020.

Peter J Huber and Elvezio M Ronchetti. Robust statistics. John Wiley & Sons, 2011.

Qiyu Kang, Yang Song, Qinxu Ding, and Wee Peng Tay.
Stable neural ODE with Lyapunov-stable equilibrium points for defending against adversarial attacks. Advances in Neural Information Processing Systems, 34:14925–14937, 2021.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

John Law. Robust statistics: the approach based on influence functions, 1986.

Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits. 2005. URL https://api.semanticscholar.org/CorpusID:60282629.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

Boqi Li and Weiwei Liu. WAT: improve the worst-class robustness in adversarial training. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 14982–14990, 2023.

Xiyuan Li, Zou Xin, and Weiwei Liu. Defending against adversarial attacks via neural dynamic system. Advances in Neural Information Processing Systems, 35:6372–6383, 2022.

Xiaorui Liu, Jiayuan Ding, Wei Jin, Han Xu, Yao Ma, Zitao Liu, and Jiliang Tang. Graph neural networks with adaptive residual. Advances in Neural Information Processing Systems, 34:9720–9733, 2021a.

Xiaorui Liu, Wei Jin, Yao Ma, Yaxin Li, Hua Liu, Yiqi Wang, Ming Yan, and Jiliang Tang. Elastic graph neural networks. In International Conference on Machine Learning, pp. 6837–6849. PMLR, 2021b.

Yao Ma, Xiaorui Liu, Tong Zhao, Yozen Liu, Jiliang Tang, and Neil Shah. A unified view on graph neural networks as graph signal denoising. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 1202–1211, 2021.

Aleksander Madry. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267, 2017.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard.
DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582, 2016.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Baolin Wu, Andrew Y Ng, et al. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, pp. 4. Granada, 2011.

Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Anima Anandkumar. Diffusion models for adversarial purification. arXiv preprint arXiv:2205.07460, 2022.

Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519, 2017.

Sylvestre-Alvise Rebuffi, Sven Gowal, Dan A Calian, Florian Stimberg, Olivia Wiles, and Timothy Mann. Fixing data augmentation to improve adversarial robustness. arXiv preprint arXiv:2103.01946, 2021.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

Changhao Shi, Chester Holtz, and Gal Mishne. Online adversarial purification based on self-supervision. arXiv preprint arXiv:2101.09387, 2021.

C Szegedy. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. MLP-Mixer: An all-MLP architecture for vision. Advances in Neural Information Processing Systems, 34:24261–24272, 2021.

A Vaswani. Attention is all you need.
Advances in Neural Information Processing Systems, 2017.

Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations, 2019.

Zekai Wang, Tianyu Pang, Chao Du, Min Lin, Weiwei Liu, and Shuicheng Yan. Better diffusion models further improve adversarial training. In International Conference on Machine Learning, pp. 36246–36263. PMLR, 2023.

Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems, 33:2958–2969, 2020.

Hanshu Yan, Jiawei Du, Vincent YF Tan, and Jiashi Feng. On robustness of neural ordinary differential equations. arXiv preprint arXiv:1910.05513, 2019.

Jongmin Yoon, Sung Ju Hwang, and Juho Lee. Adversarial purification with score-based generative models. In International Conference on Machine Learning, pp. 12062–12072. PMLR, 2021.

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pp. 7472–7482. PMLR, 2019.

Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4480–4488, 2016.

A EFFICIENT IMPLEMENTATION OF ROBUST CONVOLUTION/MLPS

In the main paper, we formulate the pixel-wise algorithm for notational simplicity. However, it is not trivial to translate this formulation into practice in convolution and MLP layers. Here, we present the detailed implementation techniques for each case.

A.1 ROBUST MLPS

In MLPs, we denote the input as $X \in \mathbb{R}^{D_1}$, the output embedding as $Z \in \mathbb{R}^{D_2}$, and the model parameters as $A \in \mathbb{R}^{D_1 \times D_2}$.
The pseudo code is presented in Algorithm 2.

Algorithm 2 Robust MLPs with NRPM

    import torch
    import torch.nn.functional as F

    def RobustMLP(X, A, eps=1e-3, K=3):
        # X: input of shape (D1,); A: weights of shape (D1, D2)
        AX = X.unsqueeze(-1) * A            # per-feature terms a_d * x_d
        D = X.shape[0]
        Z = torch.matmul(X, A)              # initialization as LPM estimation
        for _ in range(K):
            DIST = torch.abs(AX - Z / D)    # distance to current estimate
            W = 1 / (DIST + eps)            # IRLS weights
            W = F.normalize(W, p=1, dim=0)  # normalize over input dimension
            Z = D * (W * AX).sum(dim=0)     # reweighted update
        return Z

A.2 ROBUST CONVOLUTION

Unfolding. To apply robust estimation over each patch, we must first unfold the convolution into multiple patches and process them individually. While it is possible to use a for loop to extract patches across the channels, height, and width, this approach is not efficient. Instead, we can leverage PyTorch's torch.nn.functional.unfold function to streamline and accelerate both the unfolding and the computation.

B THEORETICAL PROOF

B.1 PROOF OF LEMMA 3.1

Proof. Since $\sqrt{a} \le \frac{a}{2\sqrt{b}} + \frac{\sqrt{b}}{2}$, where the equality holds when $a = b$, by the replacement $a = (a_d x_d - z/D)^2$ and $b = (a_d x_d - z_0/D)^2$, we have
$$|a_d x_d - z/D| \le \frac{1}{2}\frac{(a_d x_d - z/D)^2}{|a_d x_d - z_0/D|} + \frac{1}{2}|a_d x_d - z_0/D| = w_d (a_d x_d - z/D)^2 + \frac{1}{2}|a_d x_d - z_0/D|,$$
where $w_d = \frac{1}{2|a_d x_d - z_0/D|}$. Summing up the terms on both sides, we obtain
$$\sum_{d=1}^{D} |a_d x_d - z/D| \le \sum_{d=1}^{D} w_d (a_d x_d - z/D)^2 + \frac{1}{2}\sum_{d=1}^{D} |a_d x_d - z_0/D| = U(z, z_0),$$
and the equality holds at $a = b$ (i.e., $z = z_0$):
$$U(z_0, z_0) = L(z_0). \quad (4)$$

B.2 DERIVATION OF IRLS ALGORITHM

$$U'(z^{(k)}, z^{(k)}) = \frac{1}{D}\sum_{d=1}^{D} w_d^{(k)} \cdot 2\left(z^{(k)}/D - a_d x_d\right), \qquad U''(z^{(k)}, z^{(k)}) = \frac{2}{D^2}\sum_{d=1}^{D} w_d^{(k)},$$
so the Newton update gives
$$z^{(k+1)} = z^{(k)} - \frac{U'(z^{(k)}, z^{(k)})}{U''(z^{(k)}, z^{(k)})} = z^{(k)} - \frac{\frac{1}{D}\sum_{d=1}^{D} w_d^{(k)} \cdot 2\left(z^{(k)}/D - a_d x_d\right)}{\frac{2}{D^2}\sum_{d=1}^{D} w_d^{(k)}} = \frac{D\sum_{d=1}^{D} w_d^{(k)} a_d x_d}{\sum_{d=1}^{D} w_d^{(k)}}, \quad (5)$$
where $w_d^{(k)} = \frac{1}{2}\frac{1}{|a_d x_d - z^{(k)}/D|}$. Since the constant in $w_d^{(k)}$ cancels in the update formula, we have $w_d^{(k)} = \frac{1}{|a_d x_d - z^{(k)}/D|}$.

B.3 PROOF OF THEOREM 3.2

Proof. For notational simplicity, we denote $y = \frac{1}{D}T_{LPM}(F)$, $y_\epsilon = \frac{1}{D}T_{LPM}(F_\epsilon)$, $z = \frac{1}{D}T_{NRPM}(F)$, and $z_\epsilon = \frac{1}{D}T_{NRPM}(F_\epsilon)$. Since $\epsilon$ is very small and $D$ is large enough, we can assume $|a_d x_d - y| \approx |a_d x_d - y_\epsilon|$ for simplicity.

Influence function for LPM.
The new weighted average $y_\epsilon$ under contamination becomes
$$y_\epsilon = (1 - \epsilon)y + \epsilon\tilde{x}.$$
Taking the limit as $\epsilon \to 0$, we get
$$\mathrm{IF}(\tilde{x}; T_{LPM}, F) = \lim_{\epsilon \to 0}\frac{D(y_\epsilon - y)}{\epsilon} = \lim_{\epsilon \to 0}\frac{D\epsilon(\tilde{x} - y)}{\epsilon} = D(\tilde{x} - y) = D(\tilde{x} - z_{LPM}/D).$$
This shows that the weighted average $y$ is directly influenced by $\tilde{x}$, making it sensitive to outliers.

Influence function for NRPM. Next, we consider the reweighted estimate $z$ when we introduce a small contamination at $\tilde{x}$, which changes the distribution slightly. The contaminated reweighted estimate $z_\epsilon$ becomes
$$z_\epsilon = \frac{(1-\epsilon)\sum_{d=1}^{D} w_d a_d x_d + \epsilon w_{\tilde{x}}\tilde{x}}{(1-\epsilon)\sum_{d=1}^{D} w_d + \epsilon w_{\tilde{x}}}.$$
We can simplify the expression for $z_\epsilon$ as
$$z_\epsilon = \frac{\sum_{d=1}^{D} w_d a_d x_d + \epsilon\left(w_{\tilde{x}}\tilde{x} - \sum_{d=1}^{D} w_d a_d x_d\right)}{(1-\epsilon)\sum_{d=1}^{D} w_d + \epsilon w_{\tilde{x}}}.$$
First, expand the difference $z_\epsilon - z$:
$$z_\epsilon - z = \frac{\sum_{d=1}^{D} w_d a_d x_d + \epsilon\left(w_{\tilde{x}}\tilde{x} - \sum_{d=1}^{D} w_d a_d x_d\right)}{(1-\epsilon)\sum_{d=1}^{D} w_d + \epsilon w_{\tilde{x}}} - \frac{\sum_{d=1}^{D} w_d a_d x_d}{\sum_{d=1}^{D} w_d}$$
$$= \frac{\epsilon\left(w_{\tilde{x}}\tilde{x} - \sum_{d=1}^{D} w_d a_d x_d\right)\sum_{d=1}^{D} w_d - \epsilon\left(w_{\tilde{x}} - \sum_{d=1}^{D} w_d\right)\sum_{d=1}^{D} w_d a_d x_d}{\left[(1-\epsilon)\sum_{d=1}^{D} w_d + \epsilon w_{\tilde{x}}\right]\sum_{d=1}^{D} w_d}$$
$$= \frac{\epsilon\left(w_{\tilde{x}}\tilde{x}\sum_{d=1}^{D} w_d - w_{\tilde{x}}\sum_{d=1}^{D} w_d a_d x_d\right)}{\left[(1-\epsilon)\sum_{d=1}^{D} w_d + \epsilon w_{\tilde{x}}\right]\sum_{d=1}^{D} w_d}.$$
Finally, divide the difference by $\epsilon$ and take the limit as $\epsilon \to 0$:
$$\mathrm{IF}(\tilde{x}; T_{NRPM}, F) = \lim_{\epsilon \to 0}\frac{D(z_\epsilon - z)}{\epsilon} = D\,\frac{w_{\tilde{x}}\tilde{x}\sum_{d=1}^{D} w_d - w_{\tilde{x}}\sum_{d=1}^{D} w_d a_d x_d}{\left(\sum_{d=1}^{D} w_d\right)^2} = \frac{D w_{\tilde{x}}\left(\tilde{x} - z_{NRPM}/D\right)}{\sum_{d=1}^{D} w_d}.$$
Since $w_{\tilde{x}}$ is small for outliers (large $|\tilde{x} - z_{LPM}/D|$), the influence of $\tilde{x}$ on NRPM is diminished compared to its influence on LPM. Therefore, NRPM is more robust than LPM because the influence of outliers is reduced.

C ADDITIONAL EXPERIMENTAL RESULTS ON MLPS

C.1 MLP-MIXER

MLP-Mixer (Tolstikhin et al., 2021) serves as an alternative model to CNNs in computer vision. Building on the excellent performance of basic MLPs, we further explore the effectiveness of our NRPM with MLP-Mixer, presenting the results in Figure 8 and Table 6.
The results validate the effectiveness of our proposed method on MLP-Mixer with different numbers of blocks across various budgets.

Figure 8: Robustness under PGD on MLP-Mixer. The depth of the color represents the size of the budget.

Table 6: Adversarial robustness of MLP-Mixer on CIFAR10.

Model              Natural  ε = 1/255  ε = 2/255  ε = 3/255  ε = 4/255  ε = 8/255
LPM-MLP-Mixer-2    77.45    38.55      10.39       2.24       0.34       0.00
NRPM-MLP-Mixer-2   76.86    64.27      49.62      37.22      27.99      15.73
LPM-MLP-Mixer-4    78.61    35.99       9.35       1.45       0.19       0.00
NRPM-MLP-Mixer-4   77.51    66.72      51.73      39.43      30.88      18.31
LPM-MLP-Mixer-8    78.98    40.55      12.81       3.27       0.73       0.01
NRPM-MLP-Mixer-8   78.32    66.82      52.73      40.37      32.66      20.61

C.2 THE EFFECT OF BACKBONE SIZE

Table 7: Robustness of MLPs with different layers on MNIST.

Model Size   Arch / Budget   0     0.05  0.1   0.15  0.2   0.25  0.3
1 Layer      LPM             91.1  12.0   7.1   0.5   0.0   0.0   0.0
             NRPM            87.6  14.0  13.6  12.9  12.1  11.5  10.3
2 Layers     LPM             91.6  32.9   2.6   0.2   0.0   0.0   0.0
             NRPM            89.1  37.0  34.2  30.7  25.8  22.7  21.2
3 Layers     LPM             90.8  31.8   2.6   0.0   0.0   0.0   0.0
             NRPM            90.7  47.2  45.1  37.8  30.8  24.9  21.1

We select MLP backbones with different numbers of layers to investigate the performance under different model sizes. We report the performance of linear models with 1 (784-10), 2 (784-64-10), and 3 (784-256-64-10) layers. We present the fine-tuning results for both the LPM model and the NRPM model in Table 7, from which we can make the following observations: (1) Our RosNet shows better robustness across different backbone sizes. (2) Regarding clean performance, linear models with varying layers are equivalent and comparable; however, our RosNet shows some improvement as the number of layers increases. (3) The robustness of our RosNet improves with the addition of more layers. This suggests that our robust layers effectively mitigate the impact of perturbations as they propagate through the network.
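The robust-statistics intuition behind this behavior, that a robust aggregator damps a sparse perturbation while the standard mean propagates it, can be illustrated with a toy NumPy example (purely illustrative; this is not the paper's NRPM layer):

```python
import numpy as np

# clean feature vector and a copy with a few heavily perturbed entries
x = np.ones(100)
x_adv = x.copy()
x_adv[:5] = 50.0  # perturb 5 of 100 coordinates

# the mean aggregation drifts by 2.45, while the median does not move at all
mean_shift = abs(x_adv.mean() - x.mean())          # 2.45
median_shift = abs(np.median(x_adv) - np.median(x))  # 0.0
```

Stacking such robust aggregations layer by layer is what keeps the perturbation from amplifying as it propagates.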
C.3 ATTACK BUDGET MEASUREMENT

Figure 9: Robustness under attacks with different norms: (a) L0-attack, (b) L2-attack, (c) L∞-attack.

In the previous results across backbone sizes, we notice that the NRPM model is not effective enough under the L∞-attack. To verify the effect of the attack budget measurement, we evaluate the robustness under L∞, L2, and L0 attacks, where the budget is measured under the corresponding norm, that is, $\|\tilde{x} - x\|_p \le \text{budget}$ for $p = 0, 2, \infty$. We can make the following observations from the results in Figure 9: (1) Our NRPM architecture consistently improves the robustness over the LPM backbone across various attacks and budgets. (2) Our NRPM model experiences a slight decrease in clean accuracy, suggesting that the median retains less information than the mean. (3) Our NRPM model shows improved performance under the L0 and L2 attacks compared to the L∞ attack. This behavior aligns with the properties of the mean and the median. Specifically, under the L∞ attack, each pixel can be perturbed up to a certain limit, allowing both the mean and the median to be easily altered when all pixels are modified. Conversely, under the L2 and L0 attacks, the number of perturbed pixels is restricted, making the median more resilient to disruption than the mean.

C.4 EFFECT OF DIFFERENT LAYERS K

To investigate the effect of the number of iterations K in the unrolled architecture, we present the results with increasing K on the combined models in Table 8 and Figure 10. As K increases, the hybrid model approaches the NRPM architecture, which makes the model more robust while slightly sacrificing natural accuracy.
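The effect of K can be seen directly in a scalar NumPy sketch of the unrolled IRLS update from Algorithm 2 (illustrative only; `ax` stands for the per-feature terms a_d x_d):

```python
import numpy as np

def nrpm_estimate(ax, K=3, eps=1e-3):
    """Unrolled IRLS estimate: start from the LPM (sum) estimate and
    reweight each term by its inverse distance for K iterations."""
    D = ax.shape[0]
    z = ax.sum()                              # K = 0: plain LPM estimate
    for _ in range(K):
        w = 1.0 / (np.abs(ax - z / D) + eps)  # inverse-distance IRLS weights
        w /= w.sum()                          # normalize the weights
        z = D * (w * ax).sum()                # reweighted update
    return z

ax = np.array([1.0] * 9 + [100.0])  # one grossly corrupted term
# the LPM estimate ax.sum() is 109; a few IRLS steps pull z back toward 10
```

With K = 0 this is exactly the LPM output; increasing K moves the estimate toward the robust, median-like value, matching the trend in Table 8.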
Table 8: Ablation study on the number of layers K (MLP, MNIST).

K / ε      0     0.05  0.1   0.15  0.2   0.25  0.3
λ = 0.9
K = 0      90.8  31.8   2.6   0.0   0.0   0.0   0.0
K = 1      90.8  56.6  17.9   8.5   4.6   3.0   2.3
K = 2      90.7  76.1  44.0  26.8  18.0  13.4   9.3
K = 3      90.7  82.1  56.7  37.4  26.3  19.9  16.3
K = 4      90.6  83.5  61.1  42.7  29.4  22.7  18.0
λ = 0.8
K = 0      90.8  31.8   2.6   0.0   0.0   0.0   0.0
K = 1      90.4  67.1  30.8  17.4  10.6   6.5   4.5
K = 2      90.4  81.1  56.2  36.2  26.0  19.3  15.5
K = 3      90.3  81.4  61.8  42.9  31.8  23.7  19.5
K = 4      90.3  82.5  64.3  45.6  33.1  25.3  22.0
λ = 0.7
K = 0      90.8  31.8   2.6   0.0   0.0   0.0   0.0
K = 1      89.7  73.7  43.5  25.5  16.9  11.7   9.2
K = 2      89.1  80.3  56.4  39.7  28.4  21.4  18.5
K = 3      88.9  80.3  61.8  44.0  33.2  24.4  19.2
K = 4      88.4  79.6  61.3  43.6  33.9  26.3  21.4
λ = 0.6
K = 0      90.8  31.8   2.6   0.0   0.0   0.0   0.0
K = 1      88.1  75.3  49.0  31.0  22.0  15.5  12.4
K = 2      86.2  77.1  56.2  39.3  29.2  22.6  17.1
K = 3      85.5  75.9  55.8  40.2  30.7  23.1  18.8
K = 4      84.0  74.7  56.2  39.6  29.4  22.1  18.7
λ = 0.5
K = 0      90.8  31.8   2.6   0.0   0.0   0.0   0.0
K = 1      84.1  74.4  50.0  31.9  22.8  18.1  14.3
K = 2      80.3  71.0  53.5  37.0  26.9  21.5  16.4
K = 3      79.0  69.6  53.7  38.1  29.2  23.2  19.5
K = 4      77.7  66.6  51.2  37.9  29.7  23.7  19.8

Figure 10: Effect of layers K, for (a) λ = 0.5, (b) λ = 0.6, (c) λ = 0.7, (d) λ = 0.8, (e) λ = 0.9.

D ADDITIONAL EXPERIMENTAL RESULTS ON LENET

D.1 VISUALIZATION OF HIDDEN EMBEDDING

We put several examples of the visualizations of hidden embeddings in Figure 11.

Figure 11: Visualization of hidden embeddings.

D.2 ADDITIONAL DATASETS

Besides MNIST, we also conduct the experiments on SVHN and ImageNet10, and show the results under PGD in Table 9 and Table 10, respectively.

Table 9: Robustness of LeNet under PGD on SVHN.
Model / Budget ε   0/255  1/255  2/255  4/255  8/255  16/255  32/255  64/255
LPM-LeNet          83.8   68.6   49.3   27.1    8.2    3.0     1.9     1.3
NRPM-LeNet         83.2   72.4   54.6   35.4   20.1   14.3    11.8     7.9

Table 10: Robustness of LeNet under PGD on ImageNet10.

Model / Budget ε   0/255  2/255  4/255  8/255
LPM-LeNet          53.72  13.11   7.67   3.89
NRPM-LeNet         53.55  15.06  13.68  12.50

E ADDITIONAL EXPERIMENTAL RESULTS ON RESNETS

E.1 EFFECT OF BACKBONE SIZE

We evaluate the robustness under PGD with different backbone sizes including ResNet10, ResNet18, and ResNet34. The results presented in Figure 12 demonstrate the consistent improvement of our NRPM over the vanilla LPM.

Figure 12: Ablation study on backbone size (PGD).