# Attacks on Online Learners: a Teacher-Student Analysis

Riccardo G. Margiotta, Sebastian Goldt, Guido Sanguinetti
International School for Advanced Studies, Trieste, Italy
R. G. Margiotta (rimargi@sissa.it) is the corresponding author.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Machine learning models are famously vulnerable to adversarial attacks: small ad hoc perturbations of the data that can catastrophically alter the model predictions. While a large literature has studied the case of test-time attacks on pre-trained models, the important case of attacks in an online learning setting has received little attention so far. In this work, we use a control-theoretical perspective to study the scenario where an attacker may perturb data labels to manipulate the learning dynamics of an online learner. We perform a theoretical analysis of the problem in a teacher-student setup, considering different attack strategies, and obtaining analytical results for the steady state of simple linear learners. These results enable us to prove that a discontinuous transition in the learner's accuracy occurs when the attack strength exceeds a critical threshold. We then study empirically attacks on learners with complex architectures using real data, confirming the insights of our theoretical analysis. Our findings show that greedy attacks can be extremely efficient, especially when data stream in small batches.

1 Introduction

Adversarial attacks pose a remarkable threat to modern machine learning (ML), as models based on deep networks have proven highly vulnerable to such attacks. Understanding adversarial attacks is therefore of paramount importance, and much work has been done in this direction; see [1, 2] for reviews. The standard setup of adversarial attacks aims to change the predictions of a trained model by applying a minimal perturbation to its inputs. In the data poisoning scenario, instead, inputs and/or outputs of the training data are corrupted by an attacker whose goal is to force the learner to get as close as possible to a nefarious target model, for example to include a backdoor [3, 4]. Data poisoning has also received considerable attention, which has focused on the offline setting, where models are trained on fixed datasets [5]. An increasing number of real-world applications instead require that machine learning models are continuously updated using streams of data, often relying on transfer-learning techniques. In many cases, the streaming data can be subject to adversarial manipulations. Examples include learning from user-generated data (GPT-based models) and collecting information from the environment, as in e-commerce applications. Federated learning constitutes yet another important example where machine learning models are updated over time, and where malicious nodes in the federation have private access to a subset of the data used for training. In the online setting, the attacker intercepts the data stream and modifies it to move the learner towards the nefarious target model; see Fig. 1-A for a schematic where attacks change the labels. In this streaming scenario, the attacker needs to predict both the future data stream and model states, accounting for the effect of its own perturbations, so as to decide how to poison the current data batch. Finding the best possible attack policy for a given data batch and learner state can be formalized as a stochastic optimal control problem,
with opposing costs given by the magnitude of the perturbations and the distance between the learner and the attacker's target [6]. The ensuing dynamics have surprising features. Take for example Fig. 1-B, where we show the accuracy of an attacked model (a logistic regression classifier classifying MNIST images) as a function of the attack strength (we define this quantity in Sec. 2.2). Even in such a simple model, we observe a strong nonlinear behavior where the accuracy drops dramatically when crossing a critical attack strength. This observation raises several questions about the vulnerability of learning algorithms to online data poisoning. What is the effect of the perturbations on the learning dynamics and long-term behavior of the model under attack? Is there a minimum perturbation strength required for the attacker to achieve a predetermined goal? And how do different attack strategies compare in terms of efficacy? We investigate these questions by analyzing the attacker's control problem from a theoretical standpoint in the teacher-student setup, a popular framework for studying machine learning models in a controlled way [7-10]. We obtain analytical results characterizing the steady state of the attacked model in a linear regression setting, and we show that a simple greedy attack strategy can perform near optimally. This observation is empirically replicated across more complex learners in the teacher-student setup, and the fundamental aspects of our theoretical results are also reflected in real-data experiments.

Main contributions

- We present a theoretical analysis of online data poisoning in the teacher-student setup, providing analytical results for the linear regression case [RESULT 1-5]. In particular, we demonstrate that a phase transition takes place when the batch size approaches infinity, signaled by a discontinuity in the accuracy of the student against the attacker's strength [RESULT 2].
- We provide a quantitative comparison between different attack strategies. Surprisingly, we find that properly calibrated greedy attacks can be as effective as attacks with full knowledge of the incoming data stream. Greedy attacks are also computationally efficient, thus constituting a remarkable threat to online learners.
- We empirically study online data poisoning on real datasets (MNIST, CIFAR10), using architectures of varying complexity including LeNet, ResNet, and VGG. We observe qualitative features that are consistent with our theoretical predictions, providing validation for our findings.

1.1 Further related work

Data poisoning attacks have been the subject of a wide range of studies. Most of this research focused on the offline (or batch) setting, where the attacker interacts with the data only once before training starts [3, 11-14]. Mei et al. notably proposed an optimization approach for offline data poisoning [15]. Wang and Chaudhuri addressed a streaming-data setting, but in the idealized scenario where the attacker has access to the full (future) data stream (the so-called clairvoyant setting), which effectively reduces the online problem to an offline optimization [16]. So far, the case of online data poisoning has received only a few contributions with a genuine online perspective. In [17], Pang and coauthors investigated attacks on online learners with poisoned batches that aim for an immediate disruptive impact rather than manipulating the long-term learning dynamics. Zhang et al.
proposed an optimal control formulation of online data poisoning by following a parallel line of research, that of online teaching [18, 19], focusing then on practical attack strategies that perturb the input [6]. Our optimal control perspective is similar to that of Zhang and co-authors, though our contribution differs from theirs in various aspects. First, we consider a supervised learning setting where attacks involve data labels, and not input data. We note that attacks on the labels have a reduced computational cost compared to high-dimensional adversarial perturbations of the inputs, which makes label poisoning more convenient in a streaming scenario. Second, we provide a theoretical analysis and explicit results for the dynamics of the attacked models, and we consider non-linear predictors in our experiments. We observe that our setup is reminiscent of continual learning, which has also recently been analyzed in a teacher-student setup [20]. However, in online data poisoning the labels are dynamically changed by the attacker to drive the learner towards a desired target. In continual learning there is no entity that actively modifies the labels, and the learner simply switches from one task to another. Finally, we note that standard (static, test-time) adversarial attacks have been studied in a teacher-student framework [21].

Reproducibility. The code and details for implementing our experiments are available here.

Figure 1: Problem setup and key experimental observations. A: The environment (teacher) generates a sequence of clean data with $\mu$th input $x_\mu$ and label $y^t_\mu$. The attacker perturbs the labels towards a nefarious target $y^\star_\mu = \phi^\star(x_\mu)$. At each training step, the predictions $y^s_\mu$ of the learner (student) and the perturbed labels $\tilde{y}_\mu$ are used to compute the loss and update the learner parameters. B: accuracy vs attack strength of a logistic regression classifying MNIST digits 1 and 7 under online label-flipping attacks. C: relative distance $d_\mu$ of the logistic learner vs SGD step $\mu$ during the clean training phase, and for training under attacks of increasing attack strength.

2 Problem setup

2.1 The teacher, the student, and the attacker

The problem setup is depicted in Fig. 1-A. There are three entities: the teacher, the student, and the attacker. The teacher is a model of the environment, which generates a stream of clean input-output data batches. The inputs $x \in \mathbb{R}^D$ have elements sampled from a distribution $P_x$, while clean labels are provided by the scalar teacher function $y^t = \phi^t(x)$. The $\mu$th batch of the stream is $\mathcal{B}^P_\mu = \{(x_\mu, y^t_\mu)_p\}_{p=1}^{P}$, with $P$ the batch size. The attacker intercepts the clean batch $\mathcal{B}^P_\mu$ and perturbs the labels so as to make the student learn the target function $\phi^\star$. The student $\phi^s_\mu$ represents the learning model, and it updates its weights $\theta^s_\mu$ on the stream of perturbed batches via SGD, using each (perturbed) batch only once. We will consider teacher, attacker's target, and student with the same architecture $\phi(x; \theta)$; they only differ by the value of their parameters, respectively $\theta^t$, $\theta^\star$, $\theta^s_\mu$. We will refer to this Teacher-Student-Attacker setup as the TSA problem. If the student trained directly on the input-label pairs provided by the teacher, we would recover the standard teacher-student setup that has been analyzed extensively in the literature [7-10].
The key novelty of this paper is that we consider a setup where each new batch of data is instead poisoned by the attacker with the attempt of deviating the student towards a different target function.

2.2 The attack

The goal of the attacker is for the student to predict a target label $y^\star = \phi^\star(x)$ for each given input $x$. Target label and clean label can coincide for specific inputs, but they will not generally do so. The attacker performs an action $a_\mu$ that produces perturbed labels as
$$\tilde{y}_\mu = y^t(1 - a_\mu) + y^\star a_\mu, \qquad a_\mu \in [a_{\min}, a_{\max}]. \quad (1)$$
Note that $a_\mu = 0$ implies no perturbation, while for $a_\mu = 1$ perturbed and target labels coincide. We allow $a_{\min} < 0$ and $a_{\max} > 1$ to model attackers that under/overshoot. The attacker's action $a_\mu$ thus constitutes the control variable of the TSA problem. We primarily consider the case where the same value of $a_\mu$ is used for all labels in each batch, but can be different for different batches. We also explore settings where this assumption is relaxed, including the case of sample-specific perturbations, and the scenario where only a fraction of the labels in each batch is perturbed. The aim of the attacker is to sway the student towards its target, and so minimize the nefarious cost
$$g^{\mathrm{nef}}_\mu = \frac{1}{2P}\sum_{p=1}^{P}\left(\phi^s_\mu(x_{\mu p}) - \phi^\star(x_{\mu p})\right)^2, \quad (2)$$
using small perturbations. For this reason, it also tries to minimize the perturbation cost
$$g^{\mathrm{per}}_\mu = \frac{1}{2}\tilde{C} a^2_\mu, \qquad \tilde{C} = E(\phi^\star)\,C, \quad (3)$$
with $C$ a parameter expressing the cost of actions, and $E(\phi^\star) = \mathbb{E}_x\!\left[(\phi^\star(x) - \phi^t(x))^2\right]$ the expected squared error of the attacker's target function; $\mathbb{E}_x$ indicates the average over the input distribution. We use the pre-factor $E(\phi^\star)$ in $g^{\mathrm{per}}_\mu$ to set aside the effect of the teacher-target distance on the balance between the two costs. Note that $C$ is inversely proportional to the intuitive notion of attack strength: low $C$ implies a low cost for the action and therefore results in larger perturbations; the x-axis in Fig. 1-B shows the inverse of $C$. The attacker's goal is to find the sequence of actions $a^{\mathrm{opt}}_\mu$ that minimizes the total expected cost:
$$\{a^{\mathrm{opt}}_\mu\} = \operatorname*{argmin}_{\{a_\mu\}} \mathbb{E}_{\mathrm{fut}}\!\left[\lim_{T\to\infty} G(\{a_\mu\}, T)\right], \qquad G = \sum_{\mu=1}^{T}\gamma^\mu\left(g^{\mathrm{per}}_\mu + g^{\mathrm{nef}}_\mu\right), \quad (4)$$
where $\mathbb{E}_{\mathrm{fut}}$ represents the average over all possible realizations of the future data stream, and $\gamma \in (0, 1)$ is the future discounting factor. In relation to $\gamma$, we make the crucial assumption that $G$ is dominated by the steady-state running costs; we use $\gamma = 0.995$ in our experiments.

2.3 SGD dynamics under attack

The perturbed batch $\tilde{\mathcal{B}}^P_\mu = \{(x_\mu, \tilde{y}_\mu)_p\}_{p=1}^{P}$ is used to update the model parameters $\theta^s_\mu$ following the standard (stochastic) gradient descent algorithm with the MSE loss function:
$$\theta^s_{\mu+1} = \theta^s_\mu - \eta\,\nabla_{\theta^s_\mu} L_\mu\!\left(\tilde{\mathcal{B}}^P_\mu\right), \qquad L_\mu = \frac{1}{2P}\sum_{p=1}^{P}\left(\phi^s_\mu(x_{\mu p}) - \tilde{y}_{\mu p}\right)^2, \quad (5)$$
where $\eta$ is the learning rate. We characterize the system's dynamics in terms of the relative teacher-student distance, defined as
$$d_\mu(C, P) = \left[E(\phi^s_\mu)/E(\phi^\star)\right]^{1/2}. \quad (6)$$
Note that $d = 0$ (resp. $d = 1$) for a student $\phi^s$ that perfectly predicts the clean (resp. target) labels, on average. We will split the training process into two phases: first, the student trains on clean data, while the attacker calibrates its policy. Then, after convergence of the student, attacks start and training continues on perturbed data. A typical realization of the dynamics is depicted in Fig. 1-C, which shows $d_\mu$ for a logistic regression classifying MNIST digits during the clean-labels phase, and under attacks of increasing strength (decreasing values of the cost $C$). The sharp transition observed in Fig. 1-B occurs for $d_\mu \approx 0.5$, when the student is equally distant from the teacher and target.
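For concreteness, the snippet below sketches a single poisoned training step for the linear student analyzed in Sec. 3, implementing Eqs. (1), (5), and (6). All variable names and parameter values are our own illustrative choices, not the paper's released code.

```python
import numpy as np

# Minimal sketch of one poisoned SGD step in the linear TSA setup (Eqs. 1, 5, 6).
rng = np.random.default_rng(0)
D, P, eta, a_mu = 10, 5, 0.02, 0.8            # illustrative values

w_t = rng.standard_normal(D)
w_t *= np.sqrt(D) / np.linalg.norm(w_t)        # teacher with |w_t|^2 = D
w_star = -w_t                                  # label-flipping target
w_s = w_t.copy()                               # student after clean training

phi = lambda X, w: X @ w / np.sqrt(D)          # linear model, Eq. (8)

X = rng.standard_normal((P, D))                # clean batch inputs
y_t = phi(X, w_t)                              # clean labels
y_pert = y_t * (1 - a_mu) + phi(X, w_star) * a_mu   # perturbed labels, Eq. (1)

grad = (phi(X, w_s) - y_pert) @ X / (P * np.sqrt(D))  # gradient of the MSE loss, Eq. (5)
w_s = w_s - eta * grad                                # SGD update

# relative teacher-student distance, Eq. (6), for unit-variance i.i.d. inputs
d_mu = np.linalg.norm(w_s - w_t) / np.linalg.norm(w_star - w_t)
print(f"d_mu = {d_mu:.3f}")
```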
2.4 Attack strategies

Finding the optimal action sequence that solves the control problem (4) involves performing the average $\mathbb{E}_{\mathrm{fut}}$, and so it is only possible if the attacker knows the data-generating process. Even then, however, optimal solutions can be found only for simple cases, and realistic settings require approximate solutions. We consider the following approaches:

Constant attacks. The action is observation-independent and remains constant for all SGD steps: $a_\mu = a^c$ for all $\mu$. This simple baseline can be optimized by sampling data streams using past observations, simulating the dynamics via Eq. (5), and obtaining numerically the value of $a^c$ that minimizes the simulated total expected cost.

Greedy attacks. A greedy attacker maximizes the immediate reward. Since the attacker's actions have no instantaneous effect on the learner, the greedy objective has to include the next training step. The greedy action is then given by
$$a^G_\mu(x_\mu, \theta^s_\mu; \gamma) = \operatorname*{argmin}_{a_\mu} \mathbb{E}_{x_{\mu+1}}\!\left[g^{\mathrm{per}}_\mu + \gamma\, g^{\mathrm{nef}}_{\mu+1}\right]. \quad (7)$$
The average $\mathbb{E}_{x_{\mu+1}}$ can be estimated with input data sampled from past observations, and using (5) to obtain the next student state. Similarly to constant attacks, the greedy future weight $\gamma$ can be calibrated by minimizing the simulated total expected cost using past observations.

Reinforcement learning attacks. The attacker can use a parametrized policy function $a(x, \theta^s)$ and optimize it via reinforcement learning (RL). This is a model-free approach, and it is suitable for grey-box settings. Tuning deep policies is computationally costly, however, and it requires careful calibration. We utilize TD3 [22], a powerful RL agent that improves on the deep deterministic policy gradient algorithm [23]. We employ TD3 as implemented in Stable Baselines3 [24].

Clairvoyant attacks. A clairvoyant attacker has full knowledge of the future data stream, so it simply looks for the sequence $\{a_\mu\}$ that minimizes $G(\{a_\mu\}, T \gg 1)$. Although the clairvoyant setting is not realistically plausible, it provides an upper bound for the attacker's performance in our experiments. We use Opty [25] to cast the clairvoyant problem into a standard nonlinear program, which is then solved by an interior-point solver (IPOPT) [26].

The above attack strategies are summarized as pseudocode in Appendix A.

3 Theoretical analysis

We first analyze online label attacks with synthetic data to establish our theoretical results. We will confirm our findings with real-data experiments in Sec. 4.2. Solving the stochastic optimal control problem (4) is very challenging, even for simple architectures $\phi(x, \theta)$ and assuming that $P_x$ is known. We discuss an analytical treatment in two scenarios for the linear TSA problem with unbounded action space, where
$$\phi(x; w) = w^T x / \sqrt{D}. \quad (8)$$

3.1 Large batch size: deterministic limit

In the limit of very large batches, batch averages of the nefarious cost $g^{\mathrm{nef}}_\mu$ and loss $L_\mu$ in Eqs. (2, 5) can be approximated with averages over the input distribution. For standardized and i.i.d. input elements, the resulting equations are
$$g^{\mathrm{nef}}_\mu = \frac{1}{2D}\,|w^s_\mu - w^\star|^2, \qquad \nabla_{w^s} L_\mu = \frac{1}{D}\left[(w^s_\mu - w^t) + (w^t - w^\star)\,a_\mu\right]. \quad (9)$$
In this setting, the attacker faces a deterministic optimal control problem, which can be solved using a Lagrangian-based method. Following this procedure, we obtain explicit solutions for the steady-state student weights $w^s_\infty$ and optimal action $a^{\mathrm{opt}}_\infty$. The relative distance $d^{\mathrm{opt}}_\infty$ between student and teacher then follows from Eq. (6). We find [RESULT 1]
$$w^s_\infty = w^t + d^{\mathrm{opt}}_\infty\,(w^\star - w^t), \qquad d^{\mathrm{opt}}_\infty = a^{\mathrm{opt}}_\infty = (C + 1)^{-1}. \quad (10)$$
The above result is intuitive: the student learns a function that approaches the attacker's target as the cost of actions $C$ tends to zero, and $d^{\mathrm{opt}}_\infty(C = 0) = 1$. Vice versa, for increasing values of $C$, the attacker's optimal action decreases, and the student approaches the teacher function. In the context of a classification task, we can characterize the system dynamics in terms of the accuracy of the student $A_\mu(C, P) = \mathbb{E}_x[S_\mu(x)]$, with $S_\mu(x) = \left|\operatorname{sign}(\phi^s_\mu(x)) + \operatorname{sign}(\phi^t(x))\right|/2$. For label-flipping attacks, where $\phi^\star = -\phi^t$, a direct consequence of Eq. (10) is that the steady-state accuracy of the student exhibits a discontinuous transition at $C = 1$. More precisely, we find [RESULT 2]
$$A_\infty(C) = 1 - H(1 - C), \quad (11)$$
with $H(\cdot)$ the Heaviside step function. This simple result explains the behaviour observed in Fig. 1-B, where the collapse in accuracy occurs for attack strength $C^{-1} \approx 1$. We refer to Appendix B for a detailed derivation of results (10, 11).

3.2 Finite batch size: greedy attacks

When dealing with batches of finite size, the attacker's optimal control problem is stochastic and has no straightforward solution, even for the simple linear TSA problem. To make progress, we consider greedy attacks, which are computationally convenient and analytically tractable. While an explicit solution to (7) is generally hard to find, it is possible for the linear case of Eq. (8).

Figure 2: Greedy and clairvoyant attacks in the linear TSA problem. A: relative distance vs SGD step $\mu$ for greedy attacks on linear models for two different batch sizes $P$. Solid lines display the average $\langle d_\mu\rangle$, shaded areas cover $\pm 1$ std of $d_\mu$. Dashed lines show the theoretical estimate of Eq. (13). B: steady-state average accuracy of the student vs cost of action. The dashed line shows the theoretical prediction of Eq. (11). C: scatterplot of greedy (G) vs clairvoyant (CV) actions used on the same data stream with $P = 1$. D: student-teacher (s-t, green) and student-target (s-$\star$, coral) overlap vs SGD step, for the clairvoyant (dotted line) and greedy (continuous line) attacks of panel C. Parameters: $C = 1$, $a \in [-2, 3]$, $D = 10$, $\eta = 0.02\,D$. Input elements sampled i.i.d. from $P_x = \mathcal{N}(0, 1)$.

We find the greedy action policy
$$a^G_\mu = \frac{1}{\tilde{C} D P}\sum_{p=1}^{P}(w^s_\mu - w^\star)^T x_p\,(w^t - w^\star)^T x_p + O(\eta). \quad (12)$$
Note that $a^G_\mu$ decreases with the cost of actions, and $a^G_\mu \to 0$ for $C \to \infty$. With a policy function at hand, we can easily investigate the steady state of the system by averaging over multiple realizations of the data stream. For input elements sampled i.i.d. from $P_x = \mathcal{N}(0, 1)$, we find the average steady-state weights $\langle w^s_\infty\rangle$ as [RESULT 3]
$$\langle w^s_\infty\rangle = w^t + d^G_\infty\,(w^\star - w^t), \qquad d^G_\infty = \left(\frac{P}{P + 2}\,C + 1\right)^{-1}, \quad (13)$$
where $d^G_\infty$ is the distance (6) of the average steady-state student; we refer to Appendix C for the derivation of the above result. The expression of $d^G_\infty$ includes a surprising batch-size effect: for a fixed value of the cost $C$, greedy attacks become more effective for batches of decreasing size, as the weights get closer to $w^\star$ while the perturbations become smaller (see Eq. (45)). This effect is displayed in Fig. 2-A, which shows $\langle d_\mu\rangle$ vs SGD step for greedy attacks, and for $P = 1, 10$. Note that only the perturbed training phase is shown, so $d_0 = 0$ as the student converges to the teacher function during the preceding clean training phase. The black dashed lines correspond to the steady-state values given by (13), which are in very good agreement with the empirical averages (up to a finite-$\eta$ effect)². We also observe that $d_\mu$ is characterized by fluctuations whose amplitude decreases as the batch size increases.
A consequence of such fluctuations is that the average steady-state accuracy of the student undergoes a smooth transition. The transition becomes sharper as the batch size increases, and in the limit $P \to \infty$ it converges to the prediction (11); see Fig. 2-B, which shows numerical evaluations of $\langle A_\infty(C, P)\rangle$ averaged over multiple data streams. In our experiments, we draw the teacher weight elements as i.i.d. samples from $\mathcal{N}(0, 1)$ and normalize them so that $|w^t|^2 = D$, while $w^\star = -w^t$.

Remark. In order to derive the result of Eq. (13), we have set $\gamma^\star = D/\eta$ so that $d^G_\infty \to d^{\mathrm{opt}}_\infty$ for $P \to \infty$ (see Appendix B.1). This guarantees the steady-state optimality of the greedy action policy in the large-batch limit. Note that $\gamma^\star$ coincides with the timescale of the linear TSA problem, and it lends itself to a nice interpretation: in the greedy setting, where the horizon ends after the second training step, the future is optimally weighted by a factor proportional to the decorrelation time of the weights. Optimality for finite $P$ is not guaranteed, though we observe numerically that the optimal $\gamma$ approximately matches our choice even for $P = 1, 10$. We use again $\gamma^\star = D/\eta$ to obtain the theoretical results of the next two sections.

²We shall remark that the quantities $d^G_\infty$ and $d_\infty = \lim_{\mu\to\infty}\langle d_\mu\rangle$, the average over multiple data streams shown in Fig. 2-A, can, in general, differ, and they only necessarily coincide in the deterministic limit $P \to \infty$.

3.2.1 Sample-specific perturbations

The batch-size effect observed for greedy attacks can be explained as a signature of the batch average appearing in the policy function (12). When dealing with a single data point at a time, the attacker has precise control over the data labels, designing sample-specific perturbations. For batches with multiple samples, instead, perturbations are designed to perform well on average and do not exploit the fluctuations within the data stream. As a result, attacks become less effective, increasing in magnitude and leading to a smaller average steady-state distance between the student and the teacher function. A straightforward solution to this problem consists of applying sample-specific perturbations, thus using a multi-dimensional control $a \in \mathbb{R}^P$, which, however, results in a higher computational cost. This can be achieved by using the policy function (12) independently for each sample, so that the $p$-th element of $a$ is given by
$$a^G_p = \frac{1}{\tilde{C} D}\,(w^s - w^\star)^T x_p\,(w^t - w^\star)^T x_p. \quad (14)$$
In Appendix C.1, we show that this strategy coincides with the optimal greedy solution for multi-dimensional, sample-specific control, up to corrections of order $\eta$. We also show that the resulting average steady state reached by the student, for normal and i.i.d. input data, is [RESULT 4]
$$\langle w^s_\infty\rangle = w^t - d^G_\infty\,(w^t - w^\star), \qquad d^G_\infty = \left(C/3 + 1\right)^{-1}. \quad (15)$$
Remarkably, there is no batch-size dependence in the above solution, and the average steady-state distance coincides with (13) for $P = 1$. This result demonstrates that precise, sample-specific greedy attacks remain effective independently of the batch size of the streaming data. A numerical check of Eqs. (13) and (15) is sketched below.
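The following sketch simulates the linear TSA dynamics under the $O(\eta)$ greedy policies of Eqs. (12) and (14) and compares the long-run distance of the time-averaged student with Eqs. (13) and (15). Parameter values are illustrative, action-range clipping is omitted, and agreement should be read up to finite-$\eta$ effects.

```python
import numpy as np

# Simulate greedy attacks on the linear student and compare with Eqs. (13), (15).
rng = np.random.default_rng(1)
D, eta, C, T, T_avg = 10, 0.05, 1.0, 200_000, 50_000   # illustrative values

w_t = rng.standard_normal(D)
w_t *= np.sqrt(D) / np.linalg.norm(w_t)
w_star = -w_t                                  # label-flipping target
dw = w_t - w_star
C_tilde = C * (dw @ dw) / D                    # rescaled action cost, Eq. (23), sigma = 1

def run(P, sample_specific):
    w_s, w_bar = w_t.copy(), np.zeros(D)
    for step in range(T):
        X = rng.standard_normal((P, D))
        proj = (X @ (w_s - w_star)) * (X @ dw)                    # per-sample products
        a = proj / (C_tilde * D) if sample_specific else proj.mean() / (C_tilde * D)
        y_pert = (X @ w_t) * (1 - a) / np.sqrt(D) + (X @ w_star) * a / np.sqrt(D)
        grad = (X @ w_s / np.sqrt(D) - y_pert) @ X / (P * np.sqrt(D))
        w_s -= eta * grad
        if step >= T - T_avg:
            w_bar += w_s / T_avg                # time-average the steady-state student
    return np.linalg.norm(w_bar - w_t) / np.linalg.norm(dw)

for P in (1, 10):
    print(P, run(P, False), 1 / (P / (P + 2) * C + 1))   # batch-specific vs Eq. (13)
print(10, run(10, True), 1 / (C / 3 + 1))                # sample-specific vs Eq. (15)
```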
3.2.2 Mixing clean and perturbed data

In the previous section, we addressed a case of enhanced control, where batch perturbations are replaced by sample-specific ones. Here, instead, we consider a case of reduced control, where the attacker can apply the same perturbation only to a fraction of the samples in each batch. This scenario can arise when the attacker faces computational limitations or environmental constraints. For example, the streaming frequency may be too high for the attacker to contaminate all samples at each time step, or, in a poisoned federated learning setting, the central server could have access to a source of clean data, beyond the reach of the attackers. Concretely, we assume that the attacker has access to a fraction $\rho$ of the batch samples only. Therefore, training involves $\rho P$ perturbed and $(1 - \rho)P$ clean samples. Perturbations follow the policy function
$$a^G_\mu = \frac{1}{\tilde{C} D \rho P}\sum_{p=1}^{\rho P}(w^s_\mu - w^\star)^T x_{\mu p}\,(w^t - w^\star)^T x_{\mu p} + O(\eta). \quad (16)$$
In Appendix C.2, we find the following average steady-state solution for normal and i.i.d. input data [RESULT 5]:
$$\langle w^s_\infty\rangle = w^t - d^G_\infty\,(w^t - w^\star), \qquad d^G_\infty = \left(\frac{P}{\rho P + 2}\,C + 1\right)^{-1}; \quad (17)$$
see Fig. 7 for a comparison with empirical evaluations. Note that for large batches $d^G_\infty \to (C/\rho + 1)^{-1}$, indicating that the cost of actions effectively increases by a factor $\rho^{-1}$. We remark that the above result is derived under the assumption that the attacker ignores the number of clean samples used during training, so it cannot compensate for the reduced effectiveness of the attacks.

4 Empirical results

4.1 Experiments on synthetic data

Having derived analytical results for the infinite-batch and finite-batch linear TSA problem, we now present empirical evidence as to how different strategies compare on different architectures. As a first question, we investigate the efficacy of greedy attacks for data streaming in batches of small size. By choosing an appropriate future weight $\gamma$, we know the greedy solution is optimal as the batch size tends to infinity. To investigate optimality in the finite-batch setting, we compare greedy actions with clairvoyant actions on the same data stream, optimizing $\gamma$ over a grid of possible values.

Figure 3: Empirical results for the TSA problem with synthetic data (panels A: linear regression, B: sigmoidal perceptron, C: 2-layer NN; strategies C: constant, RL: reinforcement learning, G: greedy, CV: clairvoyant). A, top panel: average steady-state distance vs cost of action $C$ for the linear TSA problem and greedy attacks. The orange area represents the range of solutions of $d^G_\infty$ (13) for $P \in [1, 10]$. Bottom panel: average steady-state running cost of different attack strategies relative to the largest one (constant attacks), for $P = 1$ and $C = 1$. B, C: same quantities as shown in A for the perceptron and NN architectures; RL* indicates an agent that sees the final layer only. Averages were performed over 10 data streams of $10^5$ batches and over the last $10^3$ steps for each stream. Parameters: $D = 10$, $\eta = 0.02\,D$, $a \in [-2, 3]$ (A, B); $M = 100$, $\eta = 0.02\,D\,M$, $a \in [0, 1]$ (C). Input elements sampled i.i.d. from $P_x = \mathcal{N}(0, 1)$.

By definition, clairvoyant attacks have access to all the future information and therefore produce the best action sequence. The scatterplot in Fig. 2-C shows greedy vs clairvoyant actions, respectively $a^G$ and $a^{CV}$, on the same stream. Note that greedy actions are typically larger than their clairvoyant counterparts, except when $a^{CV}$ is very large. This is a distinctive feature of greedy strategies, as they look for local optima. The clairvoyant vision allows the attacker to allocate its resources more wisely, for example by investing and paying extra perturbation costs for higher future rewards. This is not possible for the greedy attacker, as its vision is limited to the following training step.
Substantially, though, the scattered points in Fig. 2-C lie along the diagonal, indicating that greedy attacks are nearly clairvoyant efficient, even for $P = 1$. The match between $a^G$ and $a^{CV}$ corresponds to almost identical greedy and clairvoyant trajectories of the student-teacher and student-target weight overlaps, defined respectively as $O^{st}_\mu = w^{sT}_\mu w^t/D$ and $O^{s\star}_\mu = w^{sT}_\mu w^\star/D$ (see Fig. 2-D).

Next, we investigate the steady state reached by the student under greedy attacks and the performance of the various attack strategies for three different architectures: the linear regression of Eq. (8); the sigmoidal perceptron $\phi(x; w) = \mathrm{Erf}(z/\sqrt{2})$, with $\mathrm{Erf}(\cdot)$ the error function and $z = w^T x/\sqrt{D}$; and the 2-layer neural network (NN) $\phi(x; v, w) = \frac{1}{\sqrt{M}}\sum_{m=1}^{M} v_m\,\mathrm{Erf}(z_m/\sqrt{2})$, $z_m = w_m^T x/\sqrt{D}$, with parameters $w_m \in \mathbb{R}^D$ and $v \in \mathbb{R}^M$. We continue to draw all teacher weights from the standard normal distribution, normalizing them to have unitary self-overlap. For the perceptron we set the target weights as $w^\star = -w^t$, while for the neural network $w^\star_m = w^t_m$ and $v^\star = -v^t$. The top row in Fig. 3 shows $d_\infty = \lim_{\mu\to\infty}\langle d_\mu\rangle$, i.e. the average steady-state distance reached by the student versus the cost of actions $C$ for greedy attacks. The solid lines show the results obtained from simulations³. The yellow shaded area shows $d^G_\infty$ from Eq. (13), comprised between the values for batch size $P = 1$ (upper contour) and $P = 10$ (lower contour). Note that the three cases show very similar behavior of $d_\infty(C, P)$ and are in excellent agreement with the linear TSA prediction.

³Practically, this involves solving numerically the minimization problem in Eq. (7) on a discrete action grid in the range $[a_{\min}, a_{\max}]$. This is done at each SGD step $\mu$, and it requires simulating the transition $\mu \to \mu + 1$ to estimate the average nefarious cost $\mathbb{E}_x[g^{\mathrm{nef}}_{\mu+1}]$, sampling input data from a buffer of past observations; a sketch of this procedure is given at the end of this section.

In order to compare the efficacy of all the attack strategies presented in Sec. 2.4, we consider the running cost, defined as $g_\mu = g^{\mathrm{per}}_\mu + g^{\mathrm{nef}}_\mu$. More precisely, we compute the steady-state average $\langle g_\infty\rangle$ over multiple data streams. The bottom row in Fig. 3 shows this quantity for the corresponding architecture of the top row. The evaluation of the clairvoyant attack is missing for the NN architecture, as this approach becomes unfeasible for complex models due to the high non-convexity of the associated minimization problem. Similarly, reinforcement learning attacks are impractical for large architectures. This is because the observation space of the RL agent, given by the input $x$ and the parameters of the model under attack, becomes too large to handle for neural networks. Therefore, for the NN model we used a TD3 agent that only observes the read-out weights $v$ of the student network; hence the asterisk in RL*. We observe that the constant and clairvoyant strategies have respectively the highest and lowest running costs, as expected. Greedy attacks represent the second-best strategy, with an average running cost just above that of clairvoyant attacks. While we do not explicitly show it, it should also be emphasized that the computational costs of the different strategies vary widely; in particular, while greedy attacks are relatively cheap, RL attacks require longer data streams and larger computational resources, since they ignore the learning dynamics. We refer to Figs. 8, 9 in Appendix D for a comparison between greedy and RL action policies, and for a visual representation of the convergence in performance of the RL algorithm vs number of training episodes.
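As a concrete illustration of the procedure described in footnote 3, the sketch below computes a greedy action by grid search for the sigmoidal perceptron student. The helper names, the learning-rate handling, and the buffer mechanism are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.special import erf

# Grid-search greedy action (Eq. 7) for the perceptron phi(x; w) = Erf(w^T x / sqrt(2 D)).
def phi(X, w, D):
    return erf(X @ w / np.sqrt(2 * D))

def greedy_action(X, w_s, w_t, w_star, buffer_X, D, eta, C_tilde, gamma,
                  grid=np.linspace(-2.0, 3.0, 101)):
    """Return the action on the grid minimizing g_per + gamma * E_x[g_nef(mu+1)]."""
    P = X.shape[0]
    y_t, y_star = phi(X, w_t, D), phi(X, w_star, D)
    best_a, best_cost = 0.0, np.inf
    for a in grid:
        y_pert = y_t * (1 - a) + y_star * a
        pred = phi(X, w_s, D)
        # gradient of the MSE loss for the Erf student
        dphi = np.sqrt(2 / (np.pi * D)) * np.exp(-(X @ w_s) ** 2 / (2 * D))
        grad = ((pred - y_pert) * dphi) @ X / P
        w_next = w_s - eta * grad                      # simulate the transition mu -> mu + 1
        # estimate the expected nefarious cost on a buffer of past clean inputs
        g_nef = 0.5 * np.mean((phi(buffer_X, w_next, D) - phi(buffer_X, w_star, D)) ** 2)
        cost = 0.5 * C_tilde * a ** 2 + gamma * g_nef
        if cost < best_cost:
            best_a, best_cost = a, cost
    return best_a
```

Here the buffer of past clean inputs stands in for the unknown input distribution when estimating $\mathbb{E}_x[g^{\mathrm{nef}}_{\mu+1}]$, and the grid range mirrors the action bounds used in the experiments.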
4.2 Experiments on real data

In this section, we explore the behaviour of online data poisoning in an uncontrolled scenario, where input data elements are not i.i.d. and are correlated with the teacher weights. Specifically, we consider two binary image classification tasks, using classes 1 and 7 from MNIST, and cats and dogs from CIFAR10, while attacks are performed using the greedy algorithm. We explore two different setups. First, we experiment on the MNIST task with simple architectures where all parameters are updated during the perturbed training phase. For this purpose, we consider a logistic regression (LogReg) and a simple convolutional neural network, the LeNet of [27]. Then, we address the case of transfer learning for more complex image-recognition architectures, where only the last layer is trained during the attacks, which constitutes a typical application of large pre-trained models. Specifically, we consider the visual geometry group (VGG11) [28] and residual networks (ResNet18) [29], with weights trained on the ImageNet-1k classification task. For all architectures, we fix the last linear layer to have 10 input and 1 output features, followed by the Erf activation function.

Preprocessing. For LogReg, we reduce the dimensionality of input data to 10 via PCA projection. For LeNet, images are resized from 28x28 to 32x32 pixels, while for VGG11 and ResNet18 images are resized from 32x32 to 224x224 pixels. Features are normalized to have zero mean and unit variance, and small random rotations of at most 10 degrees are applied to each image, thus providing new samples at every step of training.

Learning. The training process follows the TSA paradigm: we train a teacher network on the given classification task, and we then use it to provide (soft) labels to a student network with the same architecture. As before, we consider two learning phases: first, the attacker is silent, and it calibrates $\gamma$. Then attacks start and the student receives perturbed data. For consistency with our teacher-student analysis, we consider the SGD optimizer with MSE loss; see Fig. 10 in Appendix D for a comparison between SGD and Adam-based learning dynamics. To keep the VGG11 and ResNet18 models in a regime where online learning is possible, we use the optimizer with positive weight decay. A minimal sketch of this transfer-learning setup is given below.
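The following PyTorch snippet shows one way to set up such a transfer-learning student. The frozen $512 \to 10$ projection, the optimizer hyperparameters, and the module names are our own assumptions; the paper specifies only a pretrained backbone, a final $10 \to 1$ linear layer followed by Erf, SGD with MSE loss, and a positive weight decay.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical transfer-learning student: frozen ImageNet-pretrained ResNet18
# features, a fixed projection to D = 10 features, and a trainable 10 -> 1
# readout followed by Erf.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()                 # expose the 512-dimensional features
for p in backbone.parameters():
    p.requires_grad = False                 # the backbone is not updated under attack

class ErfReadout(nn.Module):
    def __init__(self, d_feat=512):
        super().__init__()
        self.proj = nn.Linear(d_feat, 10)   # hypothetical fixed projection to D = 10
        self.proj.requires_grad_(False)
        self.out = nn.Linear(10, 1)         # the trainable 10 -> 1 readout described in the text
    def forward(self, h):
        return torch.erf(self.out(self.proj(h)) / 2 ** 0.5).squeeze(-1)

student = nn.Sequential(backbone, ErfReadout())
optimizer = torch.optim.SGD(
    (p for p in student.parameters() if p.requires_grad),
    lr=1e-2, weight_decay=1e-4,             # positive weight decay keeps online learning stable
)
```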
The findings are presented in Fig. 4, where panels A and B represent the outcomes of the full-training and transfer-learning experiments, respectively. Each panel consists of two rows: the top row displays the dependence of the relative student-teacher distance on the cost of action $C$ and batch size $P$. The orange area depicts $d^G_\infty$ from Eq. (13) for $P \in [1, 10]$, which we included for a visual comparison. The bottom row shows the corresponding average steady-state accuracy of the student. In general, all learners exhibit behavior comparable to the synthetic teacher-student setup in terms of the relationship between the average steady-state distance $d_\infty$ and $C$. Additionally, all experiments demonstrate a catastrophic transition in classification accuracy when the value of $C$ surpasses a critical threshold. The impact of the batch size $P$ on $d_\infty$ is also evident across all experiments. The effect of the weight decay used in the transfer-learning setting is also visible: the value of $d_\infty$ remains larger than zero even when $C$ is very large (and attacks are effectively absent). Similarly, for very low values of $C$, the relative distance remains below 1. We note that calibrating $\gamma$ takes a long time for complex models updating all parameters, so we provide single-run evaluations for LeNet.

Figure 4: Empirical results for the TSA problem with real data (A: full training on MNIST; B: transfer learning on CIFAR10). A, top row: average steady-state distance vs cost of action $C$, for greedy attacks on fully trained logistic regression (LogReg) and LeNet architectures. The orange area represents the range of solutions of $d^G_\infty$ (13) for $P \in [1, 10]$. Bottom row: average steady-state accuracy vs $C$ for the architectures of the top row. B: same quantities as shown in A for transfer-learning experiments on VGG11 and ResNet18. Averages were performed over 10 data streams of $10^5$ batches and over the last $10^3$ steps for each data stream. Parameters: $D = 10$, $\eta = 0.02\,D$ (LogReg, VGG11, ResNet18), $\eta = 0.01$ (LeNet), $a \in [0, 1]$.

5 Conclusions

Understanding the robustness of learning algorithms is key to their safe deployment, and the online learning case, which we analyzed here, is of fundamental relevance for any interactive application. In this paper, we explored the case of online attacks that target data labels. Using the teacher-student framework, we derived analytical insights that shed light on the behavior of learning models exposed to such attacks. In particular, we proved that a discontinuous transition in the model's accuracy occurs when the strength of attacks exceeds a critical value, as observed consistently in all of our experiments. Our analysis also revealed that greedy attack strategies can be highly efficient in this context, especially when applying sample-specific perturbations, which are straightforward to implement when data stream in small batches. A key assumption of our investigation is that the attacker tries to manipulate the long-run behavior of the learner. We note that, for complex architectures, this assumption may represent a computational challenge, as the attacker's action induces a slowdown in the dynamics. Moreover, depending on the context, attacks may be subject to time constraints and therefore involve transient dynamics. While we have not covered such a scenario in our analysis, the optimal control approach that we proposed could be recast in a finite-time-horizon setting. We also note that our treatment of online data poisoning did not involve defense mechanisms implemented by the learning model. Designing optimal control attacks that can evade detection by defense strategies represents an important avenue for future research. Further questions are suggested by our results. While the qualitative features of our theoretical predictions are replicated across all real-data experiments that we performed, we observed differences in robustness between different algorithms. For standard, test-time adversarial attacks, model complexity can aggravate vulnerability [30, 31], and whether this happens in the context of online attacks represents an important open question. Another topic of interest is the interaction of the poisoning dynamics with the structure of the data, and with the complexity of the task. Finally, we remark that our analysis involved sequences of i.i.d. data samples: taking into account temporal correlations and covariate shifts within the data stream constitutes yet another intriguing challenge.

Compute. We used a single NVIDIA Quadro RTX 4000 graphics card for all our experiments.
Acknowledgements

SG and GS acknowledge co-funding from Next Generation EU, in the context of the National Recovery and Resilience Plan, Investment PE1 Project FAIR "Future Artificial Intelligence Research". This resource was co-financed by the Next Generation EU [DM 1555 del 11.10.22].

References

1. Huang, L., Joseph, A. D., Nelson, B., Rubinstein, B. I. & Tygar, J. D. Adversarial Machine Learning. Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, 43-58 (2011).
2. Biggio, B. & Roli, F. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 317-331 (2018).
3. Chen, X., Liu, C., Li, B., Lu, K. & Song, D. Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. Preprint, arXiv:1712.05526 (2017).
4. Gu, T., Liu, K., Dolan-Gavitt, B. & Garg, S. BadNets: Evaluating Backdooring Attacks on Deep Neural Networks. IEEE Access, 47230-47244 (2019).
5. Goldblum, M. et al. Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1563-1580 (2023).
6. Zhang, X., Zhu, X. & Lessard, L. Online Data Poisoning Attacks. Proceedings of the 2nd Conference on Learning for Dynamics and Control, 201-210 (2020).
7. Gardner, E. & Derrida, B. Three unfinished works on the optimal storage capacity of networks. Journal of Physics A: Mathematical and General, 1983 (1989).
8. Seung, H. S., Sompolinsky, H. & Tishby, N. Statistical mechanics of learning from examples. Physical Review A (1992).
9. Engel, A. & Van den Broeck, C. Statistical mechanics of learning (2001).
10. Carleo, G. et al. Machine learning and the physical sciences. Reviews of Modern Physics (2019).
11. Biggio, B., Nelson, B. & Laskov, P. Poisoning Attacks against Support Vector Machines. Proceedings of the 29th International Conference on Machine Learning, 1467-1474 (2012).
12. Xiao, H. et al. Support vector machines under adversarial label contamination. Neurocomputing, 53-62 (2015).
13. Muñoz-González, L. et al. Towards Poisoning of Deep Learning Algorithms with Back-gradient Optimization. 10th ACM Workshop on Artificial Intelligence and Security, 27-38 (2017).
14. Ma, Y., Zhang, X., Sun, W. & Zhu, X. Policy Poisoning in Batch Reinforcement Learning and Control. Proceedings of the 33rd International Conference on Neural Information Processing Systems (2019).
15. Mei, S. & Zhu, X. Using Machine Teaching to Identify Optimal Training-Set Attacks on Machine Learners. Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2871-2877 (2015).
16. Wang, Y. & Chaudhuri, K. Data Poisoning Attacks against Online Learning. Preprint, arXiv:1808.08994 (2018).
17. Pang, T., Yang, X., Dong, Y., Su, H. & Zhu, J. Accumulative Poisoning Attacks on Real-time Data. Advances in Neural Information Processing Systems, 2899-2912 (2021).
18. Liu, W. et al. Iterative Machine Teaching. Proceedings of the 34th International Conference on Machine Learning, 2149-2158 (2017).
19. Lessard, L., Zhang, X. & Zhu, X. An Optimal Control Approach to Sequential Machine Teaching. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, 2495-2503 (2019).
20. Lee, S., Goldt, S. & Saxe, A. Continual Learning in the Teacher-Student Setup: Impact of Task Similarity. Proceedings of the 38th International Conference on Machine Learning, 6109-6119 (2021).
21. Yang, Z. et al. Understanding Robustness in Teacher-Student Setting: A New Perspective. Preprint, arXiv:2102.13170 (2021).
22. Fujimoto, S., van Hoof, H. & Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. Preprint, arXiv:1802.09477 (2018).
23. Lillicrap, T. P. et al. Continuous control with deep reinforcement learning. 4th International Conference on Learning Representations, ICLR (2016).
24. Raffin, A. et al. Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research, 1-8 (2021).
25. Moore, J. K. & van den Bogert, A. J. opty: Software for trajectory optimization and parameter identification using direct collocation. Journal of Open Source Software (2018).
26. Wächter, A. & Biegler, L. T. On the Implementation of an Interior-Point Filter Line-Search Algorithm for Large-Scale Nonlinear Programming. Mathematical Programming, 25-57 (2006).
27. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 2278-2324 (1998).
28. Simonyan, K. & Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. 3rd International Conference on Learning Representations, ICLR (eds Bengio, Y. & LeCun, Y.) (2015).
29. He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770-778 (2016).
30. Tsipras, D., Santurkar, S., Engstrom, L., Turner, A. & Madry, A. Robustness may be at odds with accuracy. Preprint, arXiv:1805.12152 (2018).
31. Carbone, G. et al. Robustness of Bayesian neural networks to gradient-based attacks. Advances in Neural Information Processing Systems, 15602-15613 (2020).
32. Bressan, A. & Piccoli, B. Introduction to the mathematical theory of control (2007).

A Attack strategies: algorithms

Algorithm 1 summarizes the attack strategies introduced in Sec. 2.4 (a runnable sketch of the shared loop is given after the listing). For clarity, we present the algorithms for data streaming in batches of size $P = 1$. We recall that the action value $a^c$ of constant attacks and the greedy future weight $\gamma$ can be optimized by sampling future data streams from past observations, and using the SGD update rule (5) to simulate the future dynamics. Similarly, the attacker can sample past observations to calibrate the RL policy $a^{RL}(x, \theta)$.

Algorithm 1: Online label attacks (batch size $P = 1$)

Constant attacks
1: optimize $a^c$
2: repeat
3:   $a_\mu \leftarrow a^c$
4:   $\tilde{y}_\mu \leftarrow y^t_\mu + (y^\star_\mu - y^t_\mu)\,a_\mu$
5:   $y^s_\mu \leftarrow \phi^s(x_\mu; \theta^s_\mu)$
6:   $\theta^s_{\mu+1} \leftarrow \theta^s_\mu - \eta\,\nabla_{\theta^s_\mu}L(y^s_\mu, \tilde{y}_\mu)$
7: until end of stream

Greedy attacks
1: optimize $\gamma$
2: repeat
3:   observe clean batch $(x, y^t)_\mu$, state $\theta^s_\mu$
4:   $a_\mu \leftarrow \operatorname{argmin}_a \left(g^{\mathrm{per}}_\mu + \gamma\,g^{\mathrm{nef}}_{\mu+1}\right)$
5:   $\tilde{y}_\mu \leftarrow y^t_\mu + (y^\star_\mu - y^t_\mu)\,a_\mu$
6:   $y^s_\mu \leftarrow \phi^s(x_\mu; \theta^s_\mu)$
7:   $\theta^s_{\mu+1} \leftarrow \theta^s_\mu - \eta\,\nabla_{\theta^s_\mu}L(y^s_\mu, \tilde{y}_\mu)$
8: until end of stream

Reinforcement learning attacks
1: calibrate policy $a^{RL}(x, \theta)$
2: repeat
3:   observe clean batch $(x, y^t)_\mu$, state $\theta^s_\mu$
4:   $a_\mu \leftarrow a^{RL}(x_\mu, \theta^s_\mu)$
5:   $\tilde{y}_\mu \leftarrow y^t_\mu + (y^\star_\mu - y^t_\mu)\,a_\mu$
6:   $y^s_\mu \leftarrow \phi^s(x_\mu; \theta^s_\mu)$
7:   $\theta^s_{\mu+1} \leftarrow \theta^s_\mu - \eta\,\nabla_{\theta^s_\mu}L(y^s_\mu, \tilde{y}_\mu)$
8: until end of stream

Clairvoyant attacks
1: observe state $\theta^s_1$, stream $S = \{(x, y^t)_\mu\}_{\mu=1}^{T}$
2: find $\{a_\mu\}_{\mu=1}^{T} = \operatorname{argmin} G(S, \theta^s_1)$
3: repeat
4:   $\tilde{y}_\mu \leftarrow y^t_\mu + (y^\star_\mu - y^t_\mu)\,a_\mu$
5:   $y^s_\mu \leftarrow \phi^s(x_\mu; \theta^s_\mu)$
6:   $\theta^s_{\mu+1} \leftarrow \theta^s_\mu - \eta\,\nabla_{\theta^s_\mu}L(y^s_\mu, \tilde{y}_\mu)$
7: until end of stream
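The four variants above share the same training loop and differ only in how $a_\mu$ is chosen. The following minimal Python skeleton (an illustrative sketch, not the released implementation) makes that shared structure explicit for $P = 1$; the constant-attack example at the bottom plugs in the linear student of Eq. (8).

```python
import numpy as np

# Shared loop of Algorithm 1 (P = 1): the strategies differ only in `action_policy`.
def run_attack(stream, theta_s, phi, grad_phi, phi_t, phi_star, action_policy, eta):
    """stream yields clean inputs x; clean labels come from the teacher phi_t."""
    for x in stream:
        y_t, y_star = phi_t(x), phi_star(x)
        a = action_policy(x, theta_s)                     # constant / greedy / RL / clairvoyant
        y_pert = y_t + (y_star - y_t) * a                 # Eq. (1)
        y_s = phi(x, theta_s)
        theta_s = theta_s - eta * (y_s - y_pert) * grad_phi(x, theta_s)  # SGD on the MSE loss
    return theta_s

# Example: constant attacks on the linear student of Eq. (8), illustrative values.
D, eta, a_c = 10, 0.05, 0.5
rng = np.random.default_rng(2)
w_t = rng.standard_normal(D)
phi = lambda x, w: x @ w / np.sqrt(D)
grad_phi = lambda x, w: x / np.sqrt(D)
stream = (rng.standard_normal(D) for _ in range(10_000))
w_s = run_attack(stream, w_t.copy(), phi, grad_phi,
                 lambda x: phi(x, w_t), lambda x: -phi(x, w_t),   # teacher and flipped target
                 lambda x, w: a_c, eta)
```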
B Linear TSA problem: optimal solution

In the limit of very large batches, batch averages of the nefarious cost $g^{\mathrm{nef}}_\mu$ and loss $L_\mu$ in Eqs. (2, 5) can be approximated with averages over the input distribution. For a general architecture $\phi$, and indicating with $t$, $s$, and $\star$ the teacher, student, and attacker's target, respectively, we can write
$$g^{\mathrm{nef}}_\mu = \frac{1}{2}\left(I_\mu(s, s) - 2 I_\mu(s, \star) + I_\mu(\star, \star)\right), \quad (19)$$
$$\nabla_{\theta^s} L_\mu = \frac{1}{2} I'_\mu(s, s) - I'_\mu(s, t)\,(1 - a_\mu) - I'_\mu(s, \star)\,a_\mu, \quad (20)$$
where $I(a, b) = \mathbb{E}_x\!\left[\phi^a(x)\,\phi^b(x)\right]$ for two models, $a$ and $b$, with the same architecture but different parameters $\theta^a$, $\theta^b$, and $I' = \nabla_{\theta^s} I$. In this case, the optimal control problem (4) is deterministic, and we can write it in continuous time ($\eta \to 0$) as
$$\{a^{\mathrm{opt}}_\mu\} = \operatorname*{argmin}_{\{a_\mu\}} \int_0^\infty \mathrm{d}\mu\,\gamma^\mu\left(g^{\mathrm{per}}_\mu + g^{\mathrm{nef}}_\mu\right), \quad (21)$$
where $g^{\mathrm{per}}_\mu = \tilde{C} a^2_\mu/2$. Note that $\mu$ is now a continuous time index, and the above minimization problem considers all functions $a_\mu \in [a_{\min}, a_{\max}]$ for $\mu \in [0, \infty)$. In the following, we will assume that the action space is unbounded, i.e. $a_\mu \in (-\infty, \infty)$, and that the elements of $x$ are i.i.d. samples with zero mean and variance $\sigma^2$. In the linear TSA problem, where $\phi(x; w) = w^T x/\sqrt{D}$, the above Eqs. (19, 20) simplify to
$$g^{\mathrm{nef}}_\mu = \frac{\sigma^2}{2D}\,|\Delta w^{s\star}_\mu|^2, \qquad \nabla_{w^s} L_\mu = \frac{\sigma^2}{D}\left(\Delta w^{st}_\mu + \Delta w^{t\star} a_\mu\right), \quad (22)$$
where $\Delta w^{ab} = w^a - w^b$, and $|\cdot|^2$ is the squared 2-norm. We also find
$$\tilde{C} = \frac{\sigma^2}{D}\,|\Delta w^{t\star}|^2\,C. \quad (23)$$
We shall recall that in the TSA problem each batch is used only once during training. This assumption guarantees that inputs $x_\mu$ and weights $w^s_\mu$ are uncorrelated, a necessary condition to derive the above expressions. We solve this problem with a Lagrangian-based approach: the Lagrangian $\mathcal{L}$ is obtained by imposing the constraint $\dot{w}^s_\mu = -\nabla_{w^s} L_\mu$ at all times, with the co-state variables $\lambda_\mu \in \mathbb{R}^D$:
$$\mathcal{L} = \int_0^\infty \mathrm{d}\mu\left[\gamma^\mu\left(g^{\mathrm{per}}_\mu + g^{\mathrm{nef}}_\mu\right) - \lambda_\mu^T\left(\dot{w}^s_\mu + \sigma^2\left(\Delta w^{st}_\mu + \Delta w^{t\star} a_\mu\right)/D\right)\right]. \quad (24)$$
Pontryagin's necessary conditions for optimality, $\partial_{a_\mu}\mathcal{L} = 0$, $[\partial_{w^s_\mu} - \partial_\mu\,\partial_{\dot{w}^s_\mu}]\mathcal{L} = 0$, and $\partial_{\lambda_\mu}\mathcal{L} = 0$, give the following set of equations coupling $a_\mu$, $w^s_\mu$, and $\lambda_\mu$:
$$\gamma^\mu\,\tilde{C} a_\mu - \frac{\sigma^2}{D}\,\lambda_\mu^T\,\Delta w^{t\star} = 0, \qquad \gamma^\mu\,\frac{\sigma^2}{D}\,\Delta w^{s\star}_\mu + \dot{\lambda}_\mu - \frac{\sigma^2}{D}\,\lambda_\mu = 0, \qquad \dot{w}^s_\mu + \frac{\sigma^2}{D}\left(\Delta w^{st}_\mu + \Delta w^{t\star} a_\mu\right) = 0. \quad (25)$$
The above system of ODEs can be solved by specifying the initial condition $w^s_0$ and using the transversality condition $\lim_{\mu\to\infty}\lambda_\mu = 0$. At steady state, we obtain the following equation for $w^s_\infty$:
$$\Delta w^{st}_\infty + \frac{\sigma^2}{\tilde{C} D}\left(\Delta w^{s\star T}_\infty \Delta w^{t\star}\right)\Delta w^{t\star} = 0, \quad (26)$$
which admits only one (optimal) solution. We find it as a linear combination of $w^t$ and $w^\star$:
$$w^s_\infty = w^t - a^{\mathrm{opt}}_\infty\,\Delta w^{t\star}, \qquad a^{\mathrm{opt}}_\infty = (C + 1)^{-1}. \quad (27)$$
From the definition of $d_\mu$ (6), it follows that
$$d^{\mathrm{opt}}_\infty = a^{\mathrm{opt}}_\infty. \quad (28)$$
As a final remark, we observe that the same procedure can in principle be applied to TSA problems with non-linear architectures, as long as an explicit expression for $I(a, b)$ is available. However, one may find multiple steady-state solutions, and obtaining guarantees for optimality is often hard [32].

B.1 Optimal greedy solution

For the linear TSA problem, and in the limit of large batches, the discrete-time equation (7) for greedy actions simplifies to
$$a^G_\mu = \operatorname*{argmin}_{a_\mu}\left[\frac{1}{2}\tilde{C} a^2_\mu + \gamma\,\frac{\sigma^2}{2D}\left|w^s_{\mu+1}(a_\mu) - w^\star\right|^2\right]. \quad (29)$$
The expression of $w^s_{\mu+1}(a_\mu)$ is given by the update rule, which can be conveniently written as
$$w^s_{\mu+1} = v_1 + v_2\,a_\mu, \quad (30)$$
where $v_1 = w^s_\mu - \eta\sigma^2\,\Delta w^{st}_\mu/D$ and $v_2 = -\eta\sigma^2\,\Delta w^{t\star}/D$. The solution of (29) gives the greedy policy function
$$a^G_\mu = \frac{w^{\star T} v_2 - v_1^T v_2}{\tilde{C} D/(\gamma\sigma^2) + |v_2|^2}. \quad (31)$$
The right-hand side of the above equation depends on the future weighting factor $\gamma$. Requiring that the (greedy) update rule $w^s_{\mu+1} = v_1 + v_2\,a^G_\mu$ gives the optimal steady-state solution of Eq. (27), and using the explicit expression of $\tilde{C}$ (23), one finds the optimal weight
$$\gamma^\star = D/(\sigma^2\eta). \quad (32)$$
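As a quick sanity check of this fixed point, the SymPy snippet below projects the steady-state condition (26) onto the direction $w^t - w^\star$ and verifies that $a^{\mathrm{opt}}_\infty = (C+1)^{-1}$ makes the residual vanish. The scalar projection is our own simplification of the vector equation.

```python
import sympy as sp

# Check that w^s_inf = w^t - a*Dw, with a = 1/(C+1) and Dw = w^t - w^star,
# solves the projected steady-state condition (26). q stands for |Dw|^2.
C, sigma, D, q = sp.symbols('C sigma D q', positive=True)
a = 1 / (C + 1)                                   # candidate optimal action, Eq. (27)
C_tilde = sigma**2 * q * C / D                    # rescaled cost, Eq. (23)

dw_st_proj = -a * q                               # (w^s_inf - w^t) . Dw
dw_sstar_proj = (1 - a) * q                       # (w^s_inf - w^star) . Dw
residual = dw_st_proj + sigma**2 * dw_sstar_proj * q / (C_tilde * D)
print(sp.simplify(residual))                      # -> 0
```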
B.2 Accuracy vs cost of action

In Sec. 3.1, we addressed the case of attacks in a classification context, with class labels given by the sign of $\phi^t(x)$, and where the attacker aims to swap the labels, so $\phi^\star = -\phi^t$. A pointwise estimate of the accuracy is given by
$$S_\mu(x) = \frac{1}{2}\left|\operatorname{sign}(\phi^s_\mu(x)) + \operatorname{sign}(\phi^t(x))\right|, \quad (33)$$
and the system dynamics can be characterized by the accuracy of the student, defined as $A_\mu(C, P) = \mathbb{E}_x[S_\mu(x)]$, with $\mathbb{E}_x$ the average over the input distribution. For the linear TSA problem, we have $w^\star = -w^t$ and, consistently with the definition of $d_\mu$ (6),
$$w^s_\mu = w^t + d_\mu\,(w^\star - w^t) = (1 - 2 d_\mu)\,w^t. \quad (34)$$
It follows that $S_\mu(x) = 1$ for $d_\mu < 0.5$, and $S_\mu(x) = 0$ otherwise. In the deterministic limit of large batches, the optimal steady-state solution (27) gives
$$S_\infty(x) = \frac{1}{2}\left|\operatorname{sign}\!\left((C - 1)\,w^{tT}x\right) + \operatorname{sign}\!\left(w^{tT}x\right)\right|. \quad (35)$$
We then have $S_\infty = 0$ for $C < 1$, and $S_\infty = 1$ otherwise, regardless of the value of $x$, giving
$$A_\infty(C) = 1 - H(1 - C), \quad (36)$$
with $H(\cdot)$ the Heaviside step function.

C Stochastic control with greedy attacks

When the batch size $P$ is finite, the attacker's optimal control problem is stochastic and has no straightforward solution, even for the simplest case of the linear TSA problem. It is however possible to find the average steady-state solution for greedy attacks, as we show in this Appendix. We derived the greedy policy in B.1 for $P \to \infty$, and for i.i.d. input elements with zero mean and variance $\sigma^2$. For a general $P$, the same expression (31) holds, with $v_1$ and $v_2$ now given by
$$v_1 = w^s_\mu - \frac{\eta}{DP}\sum_{p=1}^{P}\left(\Delta w^{stT}_\mu x_{\mu p}\right)x_{\mu p}, \qquad v_2 = -\frac{\eta}{DP}\sum_{p=1}^{P}\left(\Delta w^{t\star T} x_{\mu p}\right)x_{\mu p}, \quad (37)$$
where $\Delta w^{ab} = w^a - w^b$. The first-order expansion of $a^G_\mu$ in powers of $\eta$ reads
$$a^G_\mu = \frac{1}{\tilde{C} D P}\sum_{p=1}^{P}\left(\Delta w^{s\star T}_\mu x_{\mu p}\right)\left(\Delta w^{t\star T} x_{\mu p}\right) + O(\eta), \quad (38)$$
where we have used $\gamma^\star = D/(\sigma^2\eta)$, which guarantees that in the limit $P \to \infty$ we recover the optimal solution (27). We can use the above action policy in the SGD update rule (30) and take the average over data streams to get the steady-state equation
$$\Delta w^{st}_{\infty, l} = -\frac{1}{\tilde{C} D P^2}\left[P\,T_{1l} + P(P - 1)\,T_{\neq l}\right], \quad (39)$$
where we have imposed $\langle w^s_\mu\rangle = \langle w^s_{\mu+1}\rangle = \langle w^s_\infty\rangle$ for $\mu \to \infty$, and where
$$T_{1l} = \mathbb{E}_x\!\left[\left(\Delta w^{s\star T} x\right)\left(\Delta w^{t\star T} x\right)\left(\Delta w^{t\star T} x\right)x_l\right], \quad (40)$$
$$T_{\neq l} = \mathbb{E}_{x \neq x'}\!\left[\left(\Delta w^{s\star T} x\right)\left(\Delta w^{t\star T} x\right)\left(\Delta w^{t\star T} x'\right)x'_l\right]. \quad (41)$$
Note that for $P = 1$ and $P \to \infty$ the right-hand side of (39) only involves $T_{1l}$ and $T_{\neq l}$, respectively (hence the labels). So far, we have assumed that the input elements are i.i.d. samples from $P_x$ with first two moments $m_1 = 0$ and $m_2 = \sigma^2$. We also recall that $x_\mu$ and $w^s_\mu$ are uncorrelated, as each batch is used only once in the TSA problem.

Figure 5: Average steady-state greedy action. $\langle a^G_\infty\rangle$ vs cost of actions $C$ for the linear TSA problem with $x \sim \mathcal{N}(0, 1)$. The orange area represents the range of solutions (45) for $P \in [1, 10]$. The red and blue lines show empirical evaluations for $P = 1$ and $P = 10$, respectively. Parameters: $D = 10$, $\eta = 10^{-4}\,D$, $a \in [-2, 3]$. Averages performed over 10 data streams of $2\times 10^4$ batches, and over the last $10^3$ steps in each stream.

We now further assume that the input elements have third and fourth moments $m_3 = 0$ and $m_4 < \infty$. It is then immediate to verify that $T_{1l} = T_{1l}(m_2, m_4)$ and $T_{\neq l} = T_{\neq l}(m_2)$, showing that finite batch-size effects depend on the kurtosis of the input distribution. Specifically, we find
$$T_{1l} = (m_4 - 3 m_2^2)\left(\Delta w^{t\star}_l\right)^2\Delta w^{s\star}_l + m_2^2\,|\Delta w^{t\star}|^2\,\Delta w^{s\star}_l + 2 m_2^2\left(\Delta w^{t\star T}\Delta w^{s\star}\right)\Delta w^{t\star}_l, \qquad T_{\neq l} = m_2^2\left(\Delta w^{t\star T}\Delta w^{s\star}\right)\Delta w^{t\star}_l. \quad (42)$$
For $P_x = \mathcal{N}(0, 1)$, the first term in $T_{1l}$ disappears, and the above expressions simplify to
$$T_{1l} = |\Delta w^{t\star}|^2\,\Delta w^{s\star}_l + 2\left(\Delta w^{t\star T}\Delta w^{s\star}\right)\Delta w^{t\star}_l, \qquad T_{\neq l} = \left(\Delta w^{t\star T}\Delta w^{s\star}\right)\Delta w^{t\star}_l. \quad (43)$$
Substituting in Eq. (39) the expressions of $T_{1l}$, $T_{\neq l}$, and $\tilde{C}$ from (23), and looking for a solution of $\langle w^s_\infty\rangle$ as a linear combination of $w^t$ and $w^\star$, we find
$$\langle w^s_\infty\rangle = w^t - d^G_\infty\,\Delta w^{t\star}, \qquad d^G_\infty = \left(\frac{P}{P + 2}\,C + 1\right)^{-1}. \quad (44)$$
Note that $d^G_\infty$ is the relative distance (6) of the student with parameters $\langle w^s_\infty\rangle$. From this result, we can easily find the mean steady-state greedy action by averaging (38) over data streams, giving
$$\langle a^G_\infty\rangle = \frac{f(P)}{f(P)\,C + 1}, \qquad f(P) = P/(P + 2). \quad (45)$$
Fig. 5 shows a comparison of the above result with empirical evaluations, which are in excellent agreement. Together, (44) and (45) demonstrate that greedy attacks become more efficient as the batch size decreases, reducing the distance between the student and the target while using smaller perturbations. We remark that this result is derived using the expression of $\gamma^\star$ from the $P \to \infty$ analysis, as in Eq. (32). In our experiments, this value is near-optimal also for finite $P$ for linear learners, while for non-linear architectures it provides a good first guess during the calibration phase.
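The Gaussian fourth-moment identities (43) can be checked numerically by Monte Carlo sampling. The snippet below does so for $T_{1l}$ (the check for $T_{\neq l}$ is analogous), with randomly drawn weight differences as an illustrative stand-in for $\Delta w^{s\star}$ and $\Delta w^{t\star}$.

```python
import numpy as np

# Monte Carlo check of the first identity in Eq. (43) for standard Gaussian inputs.
rng = np.random.default_rng(3)
D, n = 10, 500_000
Dw_s, Dw_t = rng.standard_normal(D), rng.standard_normal(D)   # stand-ins for Dw^{s*}, Dw^{t*}

X = rng.standard_normal((n, D))
s, t = X @ Dw_s, X @ Dw_t
T1_mc = ((s * t * t)[:, None] * X).mean(axis=0)               # E[(Dw_s.x)(Dw_t.x)^2 x]
T1_th = (Dw_t @ Dw_t) * Dw_s + 2 * (Dw_t @ Dw_s) * Dw_t       # closed form, Eq. (43)
print(np.max(np.abs(T1_mc - T1_th)))                          # small sampling error
```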
C.1 Sample-specific perturbations

In this section, we address the case where the attacker has finer control over the data labels and can apply sample-specific perturbations, rather than batch-specific ones. The control variable $a$ is $P$-dimensional, with $P$ the size of the streaming batches, and each label $y^t_p$ is perturbed as
$$\tilde{y}_p = y^t_p(1 - a_p) + y^\star_p\,a_p, \quad (46)$$
while the perturbation cost is $g^{\mathrm{per}}_\mu = \tilde{C}|a_\mu|^2/(2P)$. We can write the SGD update rule as
$$w^s_{\mu+1} = \frac{1}{P}\sum_{p=1}^{P}\left(v_{1p} + v_{2p}\,a_{\mu p}\right), \quad (47)$$
$$v_{1p} = w^s_\mu - \eta\left(\Delta w^{stT}_\mu x_{\mu p}\right)x_{\mu p}/D, \qquad v_{2p} = -\eta\left(\Delta w^{t\star T} x_{\mu p}\right)x_{\mu p}/D. \quad (48)$$

Figure 6: Relative distance with sample-specific attacks. Student-teacher distance vs SGD step $\mu$ for the linear TSA problem with $x \sim \mathcal{N}(0, 1)$. The light-red and blue lines show a single realization of the dynamics for batch-specific control ($P = 1$ and $P = 10$). The dark-red line shows the same quantity for sample-specific, multi-dimensional control with $P = 10$. Parameters: $C = 1$, $D = 10$, $\eta = 10^{-2}\,D$, $a \in [-2, 3]$.

We use again the notation $\Delta w^{ab} = w^a - w^b$. Note that (47) reduces to (30) when the control variable is one-dimensional and batch-specific, i.e. $a_{\mu p} = a_\mu$ for all $p$. The minimization problem (7) for greedy actions now consists of $P$ coupled equations, one for each control variable $a_{\mu p}$. For i.i.d. input elements with zero mean and variance $\sigma^2$, we find
$$a^G_{\mu p} = \operatorname*{argmin}_{a_{\mu p}}\left[\frac{\tilde{C}}{2P}|a_\mu|^2 + \gamma\,\frac{\sigma^2}{2D}\left|w^s_{\mu+1}(a_\mu) - w^\star\right|^2\right], \quad (49)$$
and the first-order conditions lead to
$$\tilde{C} a_{\mu p} + \frac{\gamma\sigma^2}{DP}\sum_{j=1}^{P}\left(v_{1j} + a_{\mu j}\,v_{2j} - w^\star\right)^T v_{2p} = 0. \quad (50)$$
A straightforward and explicit solution can be found by considering only the first-order terms in $\eta$, effectively decoupling the above system of equations. With $\gamma^\star = D/(\sigma^2\eta)$, we obtain
$$a^G_{\mu p} \approx \frac{1}{\tilde{C} D}\left(\Delta w^{s\star T}_\mu x_{\mu p}\right)\left(\Delta w^{t\star T} x_{\mu p}\right), \quad (51)$$
which coincides with the $P = 1$ solution for batch-specific greedy attacks (38). We can now use this policy in the update rule (47) to investigate the steady state of the system. Once more, we average across data streams with i.i.d. data sampled from $P_x = \mathcal{N}(0, 1)$ and impose that the average student weights $\langle w^s_\mu\rangle$ reach a steady state for $\mu \to \infty$. This leads to the following $P$-independent solution
$$\langle w^s_\infty\rangle = w^t - d^G_\infty\,\Delta w^{t\star}, \qquad d^G_\infty = (C/3 + 1)^{-1}. \quad (52)$$
Fig. 6 shows empirical evaluations confirming this result. Finally, averaging the policy function (51) across data streams and using the above result for $\langle w^s_\infty\rangle$, we find
$$\langle a^G_p\rangle \approx \frac{1/3}{C/3 + 1}. \quad (53)$$

C.2 Mixing clean and perturbed data

The assumption that the attacker can perturb every data sample in each batch may not always be satisfied. For example, in a federated learning scenario, the central server coordinating the training may have access to a trusted source of clean data that is inaccessible to the attacker. Moreover, the attacker may face a limited computational budget and only be able to process a fraction of samples in the data stream at each timestep.
Here, we consider the case where only a fraction $\rho$ of the $P$ samples in each batch is perturbed by the attacker. Moreover, we assume that the attacker ignores the number of clean samples used for training. Precisely, the attacker considers the update rule
$$w^s_{\mu+1} = w^s_\mu - \eta\,\nabla_{w^s_\mu} L_\mu\!\left(\tilde{\mathcal{B}}^{\rho P}_\mu\right) = w^s_\mu - \frac{\eta}{D\rho P}\sum_{p=1}^{\rho P}\left[\left(\Delta w^{stT}_\mu x_{\mu p}\right)x_{\mu p} + a_\mu\left(\Delta w^{t\star T} x_{\mu p}\right)x_{\mu p}\right], \quad (54)$$
with $\rho P$ perturbed samples, while training follows
$$w^s_{\mu+1} = w^s_\mu - \eta\rho\,\nabla_{w^s_\mu} L_\mu\!\left(\tilde{\mathcal{B}}^{\rho P}_\mu\right) - \eta(1 - \rho)\,\nabla_{w^s_\mu} L_\mu\!\left(\mathcal{B}^{(1-\rho)P}_\mu\right) = w^s_\mu - \frac{\eta}{DP}\left[\sum_{p=1}^{P}\left(\Delta w^{stT}_\mu x_{\mu p}\right)x_{\mu p} + a_\mu\sum_{p=1}^{\rho P}\left(\Delta w^{t\star T} x_{\mu p}\right)x_{\mu p}\right], \quad (55)$$
thus involving $\rho P$ perturbed samples and $(1 - \rho)P$ clean samples.

Figure 7: Mixing clean and poisoned samples. Average steady-state distance vs fraction $\rho$ of poisoned samples, under greedy attacks. The dashed line shows the estimate of Eq. (58), and the red dots display the associated empirical evaluations (linear regression, synthetic data). Blue crosses show empirical results obtained for a logistic regression training on MNIST digits. Parameters: $C = 1$, $P = 100$, $D = 10$, $\eta = 0.1$, $a \in [-2, 3]$. Averages performed over 10 data streams of $10^4$ batches, and over the last $10^3$ steps in each stream.

We recall that $\Delta w^{ab} = w^a - w^b$, and that the stream has input data drawn i.i.d. from $P_x$. Following (54), the greedy policy (38) in this case reads
$$a^G_\mu = \frac{1}{\tilde{C} D \rho P}\sum_{p=1}^{\rho P}\left(\Delta w^{s\star T}_\mu x_{\mu p}\right)\left(\Delta w^{t\star T} x_{\mu p}\right) + O(\eta). \quad (56)$$
We can now substitute the above expression into (55) to investigate the steady state of the system. As before, we can obtain a steady-state condition for $\langle w^s_\infty\rangle$ by averaging across data streams and imposing $\langle w^s_\mu\rangle = \langle w^s_{\mu+1}\rangle = \langle w^s_\infty\rangle$ for $\mu \to \infty$. We find
$$\Delta w^{st}_{\infty, l} = -\frac{1}{\tilde{C} D \rho P^2}\left[\rho P\,T_{1l} + \rho P(\rho P - 1)\,T_{\neq l}\right], \quad (57)$$
with $T_{1l}$ and $T_{\neq l}$ as in (43) for $P_x = \mathcal{N}(0, 1)$. It is then immediately verified that
$$\langle w^s_\infty\rangle = w^t - d^G_\infty\,\Delta w^{t\star}, \qquad d^G_\infty = \left(\frac{P}{\rho P + 2}\,C + 1\right)^{-1}. \quad (58)$$
Note that for large batches ($\rho P \gg 2$) the above result reduces to $d^G_\infty \approx (C/\rho + 1)^{-1}$, which has a simple and intuitive interpretation: when perturbations contaminate only a fraction $\rho$ of the batch samples, the cost of actions effectively increases by a factor $\rho^{-1}$. Fig. 7 shows the average steady-state distance reached by the student model as a function of $\rho$, demonstrating that empirical evaluations using synthetic data perfectly match the theoretical estimate of Eq. (58). We also performed real-data experiments using MNIST images (as in Sec. 4.2), which again exhibit a good qualitative agreement with our theoretical result.

D Supplementary figures

Figure 8: Greedy and RL policies. A: greedy action policy given by Eq. (31), with $v_1$ and $v_2$ as in (37) with $P = 1$, for the linear TSA problem with one-dimensional input. The x-axis shows values of the input, while $\rho$ represents the student state, expressed as $w^s = w^\star(1 - \rho) + w^t\rho$. B: best TD3 policy function found for this problem. Parameters: $C = 1$, $\eta = 0.02$, $a \in [0, 1]$, $w^t = 1$, $w^\star = -1$.

Figure 9: Convergence of RL agents (A: linear regression, B: sigmoidal perceptron, C: 2-layer NN). A, top panel: average steady-state running cost of the TD3 agent vs training episodes, for attacks on the linear regression model. Training episodes have a length of $4\times 10^4$ SGD steps, and policy updates are performed every 100 steps. The dotted line shows the average steady-state running cost of the greedy algorithm. The blue dot indicates the best performance achieved by the TD3 agent. Bottom panel: example of actions performed by the greedy attacker (dotted line) and by the best TD3 agent (blue line) on the same data stream.
B, C: same quantities as shown in A, for attacks on the sigmoidal perceptron and the 2-layer NN. For the latter case, the TD3 agent observed the read-out weight layer only. The best TD3 agents were employed for the comparison shown in Fig. 3. Parameters: $P = 1$, $C = 1$, $D = 10$, $\eta = 0.02\,D$, $a \in [-2, 3]$.

Figure 10: Comparison of Adam and SGD dynamics. A: average relative distance vs training step $\mu$ for a logistic regression model classifying MNIST digits 1 and 7, under label-inversion greedy attacks, for $C = 1$. Solid and dashed lines correspond to training via Adam and SGD, respectively. Average performed over 100 data streams. B: average steady-state relative distance as a function of the cost of actions $C$ and batch size $P$, obtained for the experiment described in panel A, and using the Adam optimizer. The orange shaded area shows the SGD-based theoretical estimate (Eq. (13)) for $P$ in the range $[1, 10]$. Averages performed over 10 data streams of $10^4$ batches, and over the last $10^3$ steps in each stream. Parameters: $D = 10$, $\eta = 0.1$ (SGD), $\eta = 0.01$ (Adam), $a \in [0, 1]$.