# Reciprocal Learning

Julian Rodemann, Department of Statistics, LMU Munich, j.rodemann@lmu.de
Christoph Jansen, Computing & Communications, Lancaster University Leipzig, c.jansen@lancaster.ac.uk
Georg Schollmeyer, Department of Statistics, LMU Munich, g.schollmeyer@lmu.de

We demonstrate that numerous machine learning algorithms are specific instances of one single paradigm: reciprocal learning. These instances range from active learning and multi-armed bandits to self-training. We show that all these algorithms not only learn parameters from data but also vice versa: they iteratively alter the training data in a way that depends on the current model fit. We introduce reciprocal learning as a generalization of these algorithms using the language of decision theory. This allows us to study under what conditions they converge. The key is to guarantee that reciprocal learning contracts such that the Banach fixed-point theorem applies. In this way, we find that reciprocal learning converges at linear rates to an approximately optimal model under some assumptions on the loss function, if its predictions are probabilistic and the sample adaption is both non-greedy and either randomized or regularized. We interpret these findings and provide corollaries that relate them to active learning, self-training, and bandits.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

## 1 Introduction

The era of data abundance is drawing to a close. While GPT-3 [9] still had to make do with 300 billion tokens, Llama 3 [102] was trained on 15 trillion. With the stock of high-quality data growing at a much smaller rate [67], adequate training data might run out within this decade [58, 107]. Generally, and beyond language models, machine learning is threatened by degrading data quality and quantity [60]. Apparently, learning ever more parameters from ever more data is not the exclusive route to success. Models also have to learn from which data to learn. This has sparked a lot of interest in sample efficiency [70, 92, 111, 7, 46, 27, 105], subsampling [47, 101, 71], coresets [62, 76, 86], data subset selection [55, 113, 13, 82], and data pruning [32, 118, 57, 5] in recent years.

Instead of proposing yet another method along these lines, we demonstrate that a broad spectrum of well-established machine learning algorithms already exhibits a reciprocal relationship between data and parameters. That is, parameters are not only learned from data, but data is also iteratively chosen based on currently optimal parameters with the aim of increasing sample efficiency. For instance, consider self-training algorithms in semi-supervised learning [106, 12, 103, 87], see section 2.1. They iteratively add pseudo-labeled variants of unlabeled data to the labeled training data. The pseudo-labels are predicted by the current model, and thus depend on the parameters learned by the model from the labeled data in the first place. Other examples comprise active learning [93], Bayesian optimization [63, 64, 98], superset learning [35, 34, 85, 36], and multi-armed bandits [3, 81], see Appendix A for details.

In this paper, we develop a unifying framework, called reciprocal learning, that allows for a principled analysis of all these methods. After an initial model fit to the training data, reciprocal learning algorithms alter the latter in a way that depends on the fit. This dependence can have various facets, ranging from predicting labels (self-training) and taking actions (bandits) to querying an oracle (active learning), all based on the current model fit.
It can entail both adding and removing data. Figure 1 illustrates this oscillating procedure and compares it to a well-known illustration of classical machine learning.

Figure 1: (A) Classical machine learning fits a model from the model space (restricted by the red curve) to a realized sample from the sample space (blue-grey); figure replicated from "The Elements of Statistical Learning" [31, Figure 7.2]. (B) In reciprocal learning, the realized sample is no longer static, but changes in response to the model fit. The grey ellipse indicates the restriction of the sample space in $t = 2$ through the realization in $t = 1$. The sample in $t$ thus depends on the model in $t-1$ and the sample in $t-1$.

A pressing question naturally arises: Given the additional degrees of freedom these algorithms enjoy through data selection, can they at all reach a stable point; that is, can they converge? Convergence is well understood for classical empirical risk minimization (ERM), where it refers to the sequence of model parameters. In reciprocal learning, however, we need to extend the notion of convergence to the sequence of both parameters and data. It is self-evident that convergence is a desirable property of any learning algorithm. Only if an algorithm converges to a unique solution can we identify a unique model and use it for deployment or assessment on test data. Practically speaking, convergence of a training procedure means that subsequent iterations will no longer change the model, giving rise to stopping criteria. Generally speaking, convergence is a prerequisite for any further theoretical or empirical assessment of such methods. In the literature on reciprocal learning algorithms like active learning, self-training, or multi-armed bandits, there is no consensus on when to stop them. And while a myriad of stopping criteria exist [109, 121, 48, 110, 28, 22, 11, 79, 96], only some come with generalization error bounds [41, 40, 88], but none, to the best of our knowledge, comes with rigorous guarantees of convergence.

We address this research gap by proving convergence of all these methods under a set of sufficient conditions. Our strategy will be to take a decision-theoretic perspective on both the parameter and the data selection problem. While the former is well studied, in particular its ubiquitous solution strategy ERM, little attention is commonly paid to a formal treatment of the other side of the coin: data selection. We particularly study the hidden interaction between parameter and data selection. We identify a sample adaption function $f$ which maps from the sample and the empirical risk minimizer in iteration $t$ to the sample in $t+1$. Bounding the change of the sample in $t+1$ by the change in the sample and the change in the model in $t$ (i.e., $f$ being Lipschitz-continuous with a sufficiently small constant) will turn out to guarantee convergence of reciprocal learning algorithms. In response to this key finding, we study which algorithms fulfill this restriction on $f$. We prove that the sample adaption is sufficiently Lipschitz for reciprocal learning to converge if (1) it is non-greedy, i.e., it adds and removes data, (2) predictions are probabilistic, and (3) the selection of data is either randomized or regularized. Conclusively, we transfer these results to common practices in active learning, self-training, and bandits, showing which algorithms converge and which do not.
## 2 Reciprocal learning

Machine learning deals with two pivotal objects: data and parameters. Typically, parameters are learned from data through ERM. In various branches of machine learning, however, the relationship between data and parameters is in fact reciprocal, as argued above. In what follows, we show that this reciprocity corresponds to two interdependent decision problems and explicitly study how learned parameters affect the subsequent training data. We emphasize that our analysis focuses on reciprocity between parameters and training data only. The population and test data thereof are assumed to be fixed, i.e., our inference goal is static. Specifically, we call a machine learning algorithm reciprocal if it performs iterative ERM on training data that depends on the previous ERM, see definition 1. This dependence can be induced by any kind of data collection, removal, or generation that is affected by the model fit. In particular, it can be stochastic (think of Thompson sampling in multi-armed bandits) as well as deterministic in nature (think of maximizing a confidence score in self-training).

Definition 1 (Reciprocal Learning, informal). An algorithm that iteratively outputs $\theta_t = \arg\min_\theta \mathbb{E}_{(Y,X) \sim \mathbb{P}_t}\,\ell(Y, X, \theta)$ shall be called a reciprocal learning algorithm if $\mathbb{P}_t = f(\theta_{t-1}, \mathbb{P}_{t-1}, n_{t-1})$, where $\mathbb{P}_t \in \mathcal{P}$ are empirical distributions of $Y, X$ of size $n_t$ in iteration $t \in \{1, \dots, T\}$, from a space of probability distributions $\mathcal{P}$. Let $\ell(Y, X, \theta) = \ell(Y, p(X, \theta))$ denote a loss function with $p(X, \theta)$ a prediction function that maps to the image of $Y$. Further denote by $Y, X$ random variables describing the training data, and by $\theta_t \in \Theta$ the parameter vector of the model in $t$.

In principle, the above definition needs no restriction on the nestedness between data in $t$ and $t-1$. In practice, however, most algorithms iteratively either only add training data or both add and remove instances, see the extensive list of examples in appendix A. That is, data in $t$ is either a superset of data in $t-1$ or a distinct set. We will address these two cases in the remainder of the paper, referring to the former as greedy (only adding data) and to the latter as non-greedy (adding and removing data). For classification problems, i.e., a discrete image of $Y$, we typically have $p(X, \theta) = \sigma(g(X, \theta))$ with $\sigma: \mathbb{R} \to [0, 1]$ a sigmoid function and $g: \mathcal{X} \times \Theta \to \mathbb{R}$. For regression problems, we simply have $p: \mathcal{X} \times \Theta \to \mathbb{R}$. The notation $\mathbb{P}_t = f(\theta_{t-1}, \mathbb{P}_{t-1}, n_{t-1})$ shall be understood as a mere indication of the distribution's dependence on ERM in the previous iteration. We will be more specific soon.

### 2.1 An illustrating running example: self-training

In appendix A, we demonstrate at length that several well-established machine learning procedures turn out to be special cases of reciprocal learning as specified in definition 1 and, more formally, in definitions 6 and 7 below. Here, we seek to illustrate the principles of reciprocal learning by the simple running example of self-training in a semi-supervised learning (SSL) setup. The aim of SSL is to learn a predictive classification function $\hat{y}(x, \theta)$, parameterized by $\theta$, utilizing both labeled and unlabeled data. Self-training is a popular algorithm class within SSL. Algorithms of that class start by fitting a model on labeled data by ERM and then exploit this model to predict labels for the unlabeled data.
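To make definition 1 and this running example concrete, the following minimal Python sketch runs a few reciprocal iterations end to end; it anticipates the selection and addition step described next. The function names and the choice of L2-regularized logistic regression for the ERM step are our own illustrative assumptions, not part of the definition.

```python
import numpy as np

def fit_erm(X, y, lam=1.0, lr=0.1, steps=500):
    """ERM step: gradient descent on the L2-regularized logistic loss."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))           # probabilistic (soft) predictions
        theta -= lr * (X.T @ (p - y) / len(y) + lam * theta)
    return theta

def adapt_sample(theta, X, y, pool):
    """Sample adaption: pseudo-label the pool point the current model is most confident
    about and move it into the training sample (self-training flavor)."""
    p = 1.0 / (1.0 + np.exp(-pool @ theta))
    i = int(np.argmax(np.abs(p - 0.5)))                # confidence score as criterion c(x, theta)
    X_new = np.vstack([X, pool[i]])
    y_new = np.append(y, float(p[i] > 0.5))            # hard pseudo-label, for simplicity
    return X_new, y_new, np.delete(pool, i, axis=0)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 3)), rng.integers(0, 2, 20).astype(float)
pool = rng.normal(size=(200, 3))                       # unlabeled data
for t in range(10):                                    # reciprocal iterations
    theta = fit_erm(X, y)                              # parameters from data ...
    X, y, pool = adapt_sample(theta, X, y, pool)       # ... and data from parameters
```

Each pass first solves ERM on the current sample and then lets the fitted model decide which pseudo-labeled point enters the sample, which is exactly the dependence $\mathbb{P}_t = f(\theta_{t-1}, \mathbb{P}_{t-1}, n_{t-1})$.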
In a second step, some instances of the unlabeled data are selected (according to a "confidence score", a measure of predictive uncertainty, see [2, 49, 80, 83, 53, 84, 18] for examples) to be added to the training data together with the predicted labels. In other words, self-training algorithms label unlabeled data themselves and ultimately learn from these pseudo-labels by iteratively adding pseudo-labeled variants of unlabeled data to the labeled training data. The pseudo-labels are predicted by the current model, and thus depend on the parameters learned by the model from the labeled data in the first place. This latter dependence constitutes the sample adaption function in definition 1. The sample of labeled and pseudo-labeled data in $t$ depends on the sample and the model (through its predicted pseudo-labels) in $t-1$. For a more comprehensive and formal introduction of self-training, we refer the curious reader to appendix A.1.

### 2.2 A decision-theoretic perspective

On a high level, reciprocal learning can be viewed as sequential decision-making. First, a parameter $\theta_t$ is fitted through ERM, which corresponds to solving a decision problem characterized by the triple $(\Theta, \mathcal{A}, \mathcal{L})$ with $\Theta$ the unknown set of states of nature, the action space $\mathcal{A} = \Theta$ of potential parameter fits (estimates), and a loss function $\mathcal{L}: \mathcal{A} \times \Theta \to \mathbb{R}$, analogous to classical statistical decision theory [6]. Secondly, features $x_t \in \mathcal{X}$ are chosen and data points $(x_t, y_t)$ are added to or removed from the training data, inducing a new empirical distribution $\mathbb{P}_{t+1}$, where $y_t$ is predicted (self-training), queried (active learning), or observed (bandits). These features $x_t$ are found by solving another decision problem $(\Theta, \mathcal{A}, \mathcal{L}_{\theta_t})$, where crucially the loss function $\mathcal{L}_{\theta_t}$ depends on the previous decision problem's solution $\theta_t$. This time, the action space corresponds to the feature space $\mathcal{A} = \mathcal{X}$.

Illustration 1. Think of reciprocal learning as a sequential decision-making problem:
$t = 1$: $\theta_1$ solves decision problem $(\Theta, \Theta, \mathcal{L})$; $a_1$ solves decision problem $(\Theta, \mathcal{A}, \mathcal{L}_{\theta_1})$
$t = 2$: $\theta_2$ solves decision problem $(\Theta, \Theta, \mathcal{L}_{a_1(\theta_1)})$; $a_2$ solves decision problem $(\Theta, \mathcal{A}, \mathcal{L}_{\theta_2})$

Loosely speaking, the data is judged in light of the parameters here. Excitingly, such an approach is symmetrical to any type of classical machine learning, where parameters are judged in light of the data. This twist in perspective will later pave the way for another type of regularization: that of data, not of parameters. As the decision problem $(\Theta, \Theta, \mathcal{L})$ is well known through its solution strategy ERM, see definition 1, we want to be more specific about $(\Theta, \mathcal{A}, \mathcal{L}_{\theta_t})$ with $\mathcal{A} = \mathcal{X}$. In particular, we need a solution strategy for all of the loss functions in the family $\mathcal{L}_\theta: \mathcal{X} \times \Theta \to \mathbb{R};\ (x, \theta) \mapsto \mathcal{L}_{\theta_t}(\theta, x)$ in iteration $t$. Definition 2 does the job. The family $\mathcal{L}_\theta$ describes all potential loss functions in the data selection problem arising from respective solutions of the parameter selection problem.¹ Redefining this family of functions as a single one $\mathcal{L}: \mathcal{X} \times \Theta \times \Theta \to \mathbb{R}$ makes it clear that we can retrieve the decision criterion $c: \mathcal{X} \times \Theta \to \mathbb{R}$ from it, which is a generalization of classical decision criteria $c: \mathcal{X} \to \mathbb{R}$ retrieved from classical losses $\mathcal{L}: \mathcal{X} \times \Theta \to \mathbb{R}$, see [6] for instance.

Definition 2 (Data Selection). Let $c: \mathcal{X} \times \Theta \to \mathbb{R}$ be a criterion for the decision problem $(\Theta, \mathcal{A}, \mathcal{L}_{\theta_t})$ with bounded $\mathcal{A} = \mathcal{X}$ of selecting features to be added to the sample in iteration $t$.
Define $\bar{c}: \mathcal{X} \times \Theta \to [0, 1];\ (x, \theta_t) \mapsto \frac{\exp(c(x, \theta_t))}{\int_{\mathcal{X}} \exp(c(x', \theta_t))\, \mathrm{d}\mu(x')}$ as the standardized version thereof, with $\mu$ the Lebesgue measure on $\mathcal{X}$. For a model $\theta_t$ in iteration $t$, it assigns to each feature vector $x$ a value between 0 and 1 that can be used as drawing probabilities. Drawing $x \in \mathcal{X}$ according to $\bar{c}(x, \theta_t)$ shall be called stochastic data selection $x_s(\theta_t)$.² The function $x_d: \Theta \to \mathcal{X};\ \theta_t \mapsto \arg\max_{x \in \mathcal{X}} c(x, \theta_t)$ shall be called deterministic data selection function.

Figure 2: Data regularization is symmetrical to classical regularization, see illustration in "The Elements of Statistical Learning" [31, Figure 7.2].

The data selection function can be understood as the workhorse of reciprocal learning: it describes the non-trivial part of the sample adaption function $f$, see definition 1. For any model $\theta_t$ in $t$, a data selection function chooses a feature vector to be added to the training data in $t+1$, based on a criterion $c$. This happens either stochastically through $x_s$ by drawing from $\mathcal{X}$ according to $\bar{c}$ or deterministically through $x_d$. Examples for $c$ comprise confidence measures in self-training, acquisition functions in active learning, or policies in multi-armed bandits. For an example of stochastic data selection, consider $c$ to be the classical Bayes criterion [6] in $(\Theta, \mathcal{A}, \mathcal{L}_{\theta_t})$. In this case, drawing from $\mathcal{X}$ as prescribed by $x_s$ corresponds to well-known Thompson sampling [12, 90]. As already hinted at, we will need some regularization (definition 3) of the data selection. Intuitively, the regularization term smoothes out the criterion $c(\cdot, \theta)$. In other words, the higher the constant $\frac{1}{L_s}$, the less the selection of data is affected by small changes of $\theta$ for given $\mathcal{R}(\cdot)$. This is completely symmetrical to classical statistical regularization in ERM, where the regularization term smoothes out the effect of the data on parameter selection, see also figure 2.

Definition 3 (Data Regularization). Consider $c: \mathcal{X} \times \Theta \to \mathbb{R}$ a criterion for the decision problem $(\Theta, \mathcal{A}, \mathcal{L}_{\theta_t})$ with $\bar{c}$ as in definition 2. Define the following regularized (deterministic) data selection function:
$$x_{d,\mathcal{R}}: \Theta \to \mathcal{X};\quad \theta \mapsto \operatorname*{arg\,max}_{\boldsymbol{x} \in \mathcal{X}} \left\{ c(\boldsymbol{x}, \theta) + \tfrac{1}{L_s} \mathcal{R}(\boldsymbol{x}) \right\},$$
where $\mathcal{R}(\cdot)$ is a $\kappa$-strongly convex regularizer. In complete analogy to definition 2, we can define a stochastic regularized data selection function $x_{s,\mathcal{R}}(\theta)$ by drawing $x \in \mathcal{X}$ according to a normalized version of $c(\boldsymbol{x}, \theta) + \tfrac{1}{L_s} \mathcal{R}(\boldsymbol{x})$.

¹ Like most other sequential decision-making problems, solving these decision problems by extensive search computationally explodes, both in the normal and the extensive form [37, 38]. Reciprocal learning thus corresponds to a one-step look-ahead approximation. As such, it is a method that aims at subtree solutions in the extensive form. Reciprocal learning can be understood as a compromise between the sub-optimal greedy strategy and infeasible extensive search.
² An alternative to stochastic data selection via direct randomization of actions is discussed in appendix D.

We will denote a generic data selection function as $\mathring{x} \in \{x_d, x_s, x_{d,\mathcal{R}}, x_{s,\mathcal{R}}\}$ in what follows.
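Definition 2 normalizes the criterion over the whole (bounded) feature space; in practice one typically works with a finite candidate pool, which is the simplification adopted in the following sketch. The margin-based criterion and all function names are our own illustrative choices.

```python
import numpy as np

def criterion(pool, theta):
    """An example criterion c(x, theta): high for points the model is certain about."""
    p = 1.0 / (1.0 + np.exp(-pool @ theta))
    return np.abs(p - 0.5)

def stochastic_selection(pool, theta, rng):
    """x_s(theta): draw a feature vector with probability proportional to exp(c(x, theta))."""
    c = criterion(pool, theta)
    w = np.exp(c - c.max())                  # numerically stabilized softmax weights
    return pool[rng.choice(len(pool), p=w / w.sum())]

def deterministic_selection(pool, theta):
    """x_d(theta): the arg-max of the criterion."""
    return pool[int(np.argmax(criterion(pool, theta)))]

rng = np.random.default_rng(1)
pool, theta = rng.normal(size=(50, 3)), rng.normal(size=3)
x_stochastic = stochastic_selection(pool, theta, rng)    # randomized selection (condition 3 style)
x_deterministic = deterministic_selection(pool, theta)   # deterministic arg-max selection
```

With a Bayes criterion in place of the margin, the stochastic variant is exactly the Thompson-sampling-style draw mentioned above.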
For the non-greedy variant of reciprocal learning, where data is both added and removed, we need to define data removal as well. A straightforward strategy is to randomly remove data points with uniform removal probabilities. The following function $\check{x}$ describes the effect of this procedure in expectation.

Definition 4 (Data Removal Function). Given an empirical distribution $\mathbb{P}(Y, X)$ of a sample, the function $\check{x}: \mathcal{P} \to \mathcal{X};\ \mathbb{P}(Y, X) \mapsto \int X \, \mathrm{d}\mathbb{P}(X)$ shall be called data removal function.

### 2.3 Formal definition and desirable properties

In order to study reciprocal learning in a meaningful way, we need to be a bit more specific about how $\mathbb{P}_t$ depends on empirical risk minimization in $t-1$, and specifically on $\theta_{t-1}$. The following definition 5 of the sample adaption function allows for this. It will be the pivotal object in this work. The function describes, in a general way and for any $t$, how empirical distributions of training data in $t$ are affected by the model, the empirical distribution of training data, and its size in $t-1$, respectively.

Definition 5 (Sample Adaption). Denote by $\Theta$ a parameter space, by $\mathcal{P}$ a space of probability distributions of $X$ and $Y$, and by $\mathbb{N}$ the natural numbers. The function $f: \Theta \times \mathcal{P} \times \mathbb{N} \to \mathcal{P}$ shall be called the greedy and the function $f_n: \Theta \times \mathcal{P} \to \mathcal{P}$ the non-greedy sample adaption function.

A greedy sample adaption function outputs a distribution $\mathbb{P}'(Y, X) \in \mathcal{P}$ in the iteration after $\theta \in \Theta$ solved ERM on a sample of size $n \in \mathbb{N}$ described by $\mathbb{P}(Y, X) \in \mathcal{P}$, which led to an enhancement of the training data that changed $\mathbb{P}(Y, X)$ to $\mathbb{P}'(Y, X)$. It will come in different flavors for different types of algorithms, see examples in section A. Generally, we have $f(\theta, \mathbb{P}(Y, X), n) = \mathbb{P}'(Y, X)$, with $\mathbb{P}'(Y, X)$ being induced by

$$\mathbb{P}'(Y = 1, X = x) = \iint \frac{\mathbb{1}\big(x = \mathring{x}(\theta)\big)\, \mathring{y}\big(\mathring{x}(\theta), \theta\big) + n\, \mathbb{P}(Y = 1, X = x)}{n + 1} \, \mathrm{d}P_{\mathring{y}} \, \mathrm{d}P_{\mathring{x}} \tag{1}$$

in case of $\mathcal{Y} = \{0, 1\}$, where $\mathring{y}: \mathcal{X} \times \Theta \to \{0, 1\}$ is any function that assigns a label $y$, potentially based on the model $\theta$, to the selected $x$, and $\mathring{x}$ is any function that selects features $x$ given a model $\theta$, for example $x_d$, $x_s$, $x_{d,\mathcal{R}}$, or $x_{s,\mathcal{R}}$ as defined above. They give rise to $P_{\mathring{y}}$ and $P_{\mathring{x}}$, respectively. We can be so specific about the sample adaption function due to $\mathbb{P}(Y = 1, X = x) = \mathbb{P}(X = x) - \mathbb{P}(Y = 0, X = x)$ in binary classification problems. We can analogously define the non-greedy variant $f_n(\theta, \mathbb{P}(Y, X))$, where one instance is removed by $\check{x}$ and one instance is added by $\mathring{x}$ per iteration. To this end, define $\mathbb{P}'(Y = 1, X = x)$ by replacing the integrand in equation (1) by

$$\frac{\mathbb{1}\big(x = \mathring{x}(\theta)\big)\, \mathring{y}\big(\mathring{x}(\theta), \theta\big) + n_0\, \mathbb{P}(Y = 1, X = x) - \mathbb{1}\big(x = \check{x}(\mathbb{P}(Y, X))\big)\, \mathring{y}\big(\check{x}(\mathbb{P}(Y, X)), \theta\big)}{n_0}, \tag{2}$$

where $n_0$ is the size of the initial training data set. Notably, we observe that both sample adaption functions entail a reflexive effect of the model on subsequent data akin to performative prediction [29], see section 5 for a discussion. We can now define reciprocal learning (definition 1) more formally given the sample adaption function as follows, both in greedy and non-greedy flavors.

Definition 6 (Greedy Reciprocal Learning). With $\Theta$, $\mathcal{P}$, $X$, $Y$, and $\mathbb{N}$ as above, we define $R: \Theta \times \mathcal{P} \times \mathbb{N} \to \Theta \times \mathcal{P} \times \mathbb{N};\ (\theta, \mathbb{P}(Y, X), n) \mapsto (\theta', \mathbb{P}'(Y, X), n')$ as reciprocal learning, where $\theta' = \arg\min_\theta \mathbb{E}_{(Y,X) \sim \mathbb{P}'(Y,X)}\, \ell(Y, X, \theta)$ and $\mathbb{P}'(Y, X) = f(\theta, \mathbb{P}(Y, X), n)$ as well as $n' = n + 1$, with $f$ a sample adaption function, see definition 5. Note the equivalence to the informal recursive definition 1 with $f(\theta_{t-1}, \mathbb{P}(Y, X)_{t-1}, n_{t-1}) = \mathbb{P}(Y, X)_t$.

Definition 7 (Non-Greedy Reciprocal Learning). With $\Theta$, $\mathcal{P}$, $X$, and $Y$ as above, we define $R_n: \Theta \times \mathcal{P} \to \Theta \times \mathcal{P};\ (\theta, \mathbb{P}(Y, X)) \mapsto (\theta', \mathbb{P}'(Y, X))$ as reciprocal learning, where $\mathbb{P}'(Y, X) = f_n(\theta, \mathbb{P}(Y, X))$ and $\theta' = \arg\min_\theta \mathbb{E}_{(Y,X) \sim \mathbb{P}'(Y,X)}\, \ell(Y, X, \theta)$, with $f_n$ a non-greedy sample adaption function, see definition 5.
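The following sketch mirrors the greedy and non-greedy updates of definition 5 on an explicit finite sample: the empirical distribution is represented by the points themselves, the selected point carries a (possibly soft) label, and removal happens uniformly at random as in definition 4. All names are hypothetical placeholders for the abstract functions above.

```python
import numpy as np

def greedy_adaption(X, y, x_new, y_new):
    """f: append the selected, (pseudo-)labeled point; the sample grows from n to n+1,
    so each old point's empirical mass shrinks from 1/n to 1/(n+1), as in Eq. (1)."""
    return np.vstack([X, x_new]), np.append(y, y_new)

def nongreedy_adaption(X, y, x_new, y_new, rng):
    """f_n: additionally remove one point uniformly at random, keeping the size at n_0."""
    i = rng.integers(len(y))                 # uniform removal probabilities (definition 4)
    X, y = np.delete(X, i, axis=0), np.delete(y, i)
    return np.vstack([X, x_new]), np.append(y, y_new)

rng = np.random.default_rng(2)
X, y = rng.normal(size=(10, 3)), rng.integers(0, 2, 10).astype(float)
x_new, y_new = rng.normal(size=3), 0.8       # a selected point with a soft pseudo-label
X_g, y_g = greedy_adaption(X, y, x_new, y_new)            # sample size 11
X_n, y_n = nongreedy_adaption(X, y, x_new, y_new, rng)    # sample size stays 10
```

Keeping the sample size fixed is what will later allow a fixed point in $\Theta \times \mathcal{P}$, whereas the greedy variant keeps growing $n$.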
We introduce two desirable properties of reciprocal learning. First, we define convergence as a state in which the model stops changing in response to newly added data. This kind of stability allows us to stop the process in good faith: hypothetical subsequent iterations would not have changed the model. Definition 8 offers a straightforward way of formalizing this, implying standard Cauchy convergence.

Definition 8 (Convergence of Reciprocal Learning). Let $g: \mathbb{N} \to \mathbb{R}$ be a strictly monotone decreasing function and $R$ ($R_n$) any (non-greedy) reciprocal learning algorithm (definitions 6 and 7) outputting $R_t$ ($R_{n,t}$) in iteration $t$. Then $\varrho \in \{R, R_n\}$ is said to converge if $\|\varrho_k - \varrho_j\| \leq g(t)$ for all $k, j \geq t$ and $\lim_{t \to \infty} g(t) = 0$, where $\|\cdot\|$ is a norm on the codomains of $R$ and $R_n$, respectively. In this case, define $\varrho_c \in \{R_c, R_{n,c}\}$ as the limit of this convergent sequence $\varrho$.

Contrary to classical ERM, convergence of reciprocal learning implies stability of both data and parameters. Technically, it refers to all components of the functions $R$ and $R_n$, respectively, see definition 8. It guarantees that $\theta_{t-1}$ solves ERM on the sample induced by it in $t$. However, this does not say much about its optimality in general. What if the algorithm had outputted a different $\theta_{t-1}$ in the first place? The empirical risk could have been lower on the sample in $t$ induced by it. The following definition describes such a look-ahead optimality. It can be interpreted as the optimal data-parameter combination.

Definition 9 (Optimal Data-Parameter Combination). Consider (non-greedy) reciprocal learning $R$ ($R_n$), see definitions 6 and 7. Define $R^*$ and $R^*_n$ as optimal data-parameter combinations in reciprocal learning if $R^*_n = (\theta^*_n, \mathbb{P}^*_n) = \arg\min_{\theta, \mathbb{P}} \mathbb{E}_{(Y,X) \sim f_n(\theta, \mathbb{P})}\, \ell(Y, X, \theta)$ and $R^* = (\theta^*, \mathbb{P}^*, n^*) = \arg\min_{\theta, \mathbb{P}, n} \mathbb{E}_{(Y,X) \sim f(\theta, \mathbb{P}, n)}\, \ell(Y, X, \theta)$, respectively.

An optimal $\theta^*$ (or $\theta^*_n$, analogously) not only solves ERM on the sample it induces, but is also the best ERM solution among all possible $\theta$ ($\theta_n$) that could have led to optimality on the respectively induced sample. In other words, $\theta^*$ ($\theta^*_n$) is found by minimizing the empirical risk with respect to the whole $R$ ($R_n$). That is, it is found by minimizing the empirical risk with respect to $\theta$ given a sample (characterized by $\mathbb{P}$ and $n$) and steering this very sample through $\theta$ simultaneously, given only the initial sample. Technically, optimality (definition 9) is a bivariate arg-min condition on $\mathbb{E}_{(Y,X) \sim f_n(\theta, \mathbb{P})}\, \ell(Y, X, \theta)$ and $\mathbb{E}_{(Y,X) \sim f(\theta, \mathbb{P}, n)}\, \ell(Y, X, \theta)$, respectively. In contrast, convergence (definition 8) translates to a fixed-point condition on the arg min viewed as a function $\Theta \times \mathcal{P} \times \mathbb{N} \to \Theta \times \mathcal{P} \times \mathbb{N}$ in case of $R$ and as $\Theta \times \mathcal{P} \to \Theta \times \mathcal{P}$ in case of $R_n$, see section 3.

### 2.4 Self-training is an instance of reciprocal learning

Let us get back to our running example of self-training. It is easy to see that self-training is a special case of reciprocal learning with the sample adaption function $f_{SSL}: \Theta \times \mathcal{P} \times \mathbb{N} \to \mathcal{P};\ (\theta, \mathbb{P}(Y, X), n) \mapsto \mathbb{P}'(Y, X)$ defined through $\mathbb{P}'(Y, X)$ being induced by

$$\mathbb{P}'(Y = 1, X = x) = \iint \frac{\mathbb{1}\big(x = x_d(\theta)\big)\, \hat{y}\big(x_d(\theta), \theta\big) + n\, \mathbb{P}(Y = 1, X = x)}{n + 1} \, \mathrm{d}P_{Y|X} \, \mathrm{d}P_X \tag{3}$$

where $x_d(\theta)$ (definition 2) selects data with the highest confidence score [2, 49, 80, 83, 53, 84, 18], see section 2.1, according to the model $\theta$, and gives rise to $P_X$. The prediction function $\hat{y}: \mathcal{X} \times \Theta \to \{0, 1\}$ returns the predicted pseudo-label of the selected $x_d(\theta)$ based on the learned model $\theta$ and gives rise to $P_{Y|X}$.
Moreover, we still assume binary target variables, i.e., the image of $Y$ is $\{0, 1\}$, real-valued features $X$, and only consider cases where the sample changes through the addition of one instance per iteration.³ The averaging with respect to $P_X$ and $P_{Y|X}$ accounts for the fact that we allow stochastic inclusion of $X$ in the sample through randomized actions and for probabilistic predictions of $Y \mid X$, respectively. For now, however, it suffices to think of the special case of degenerate distributions $P_X$ and $P_{Y|X}$ putting point mass 1 on data with hard labels in the sample and 0 elsewhere.⁴ Through averaging with respect to $P_{Y|X}$ we can describe the joint distribution of hard labels $(y_1, x_1), \dots, (y_n, x_n)$ and predicted soft labels $\tilde{y} = \hat{p}(Y = 1 \mid x, \theta) \in [0, 1]$ of $(\tilde{y}_{n+1}, x_{n+1}), \dots, (\tilde{y}_{n+t}, x_{n+t})$. Summing up, both deterministic data selection and non-probabilistic (i.e., hard-label) predictions are well-defined special cases of the above, with $P_{Y|X}$ and $P_X$ collapsing to trivial Dirac measures, respectively.

³ In case more than one instance is added per iteration, the sample adaption function can be defined as a composite function of the used sample adaption functions.
⁴ In this case, $\mathbb{P}'(Y = 1, X = x) = \frac{\mathbb{1}(x = x_d(\theta))\, \hat{y}(x_d(\theta), \theta) + n\, \mathbb{P}(Y = 1, X = x)}{n + 1}$.

## 3 Convergence of reciprocal learning: Lipschitz is all you need

After having generalized several widely adopted machine learning algorithms to reciprocal learning, we will study their convergence (definition 8) and optimality (definition 9). Our general aim is to identify sufficient conditions for any reciprocal learning algorithm to converge and then show that such a convergent solution is sufficiently close to the optimal one. This will not only allow us to assess convergence and optimality of examples 1 through 3 (self-training, active learning, multi-armed bandits, see appendix A) but of any other reciprocal learning algorithm. Besides further existing examples not detailed in this paper, like superset learning [35] or Bayesian optimization [63], we are especially aiming at potential future, yet-to-be-proposed algorithms. Against this background, our conditions for convergence and optimality can be understood as design principles. Before turning to these concrete conditions on reciprocal learning algorithms, we need some general assumptions on the loss function for the remainder of the paper. Assumptions 1 and 2 can be considered quite mild and are fulfilled by a broad class of loss functions, see [95, Chapter 12] or [14]. For instance, the L2-regularized (ridge) logistic loss has Lipschitz-continuous gradients both with respect to features and parameters. For a discussion of assumption 3, we refer to appendix E.2.

Assumption 1 (Continuous Differentiability in Features). A loss function $\ell(Y, X, \theta)$ is said to be continuously differentiable with respect to features if the gradient $\nabla_X \ell(Y, X, \theta)$ exists and is $\alpha$-Lipschitz continuous in $\theta$, $x$, and $y$ with respect to the L2-norm on domain and codomain.

Assumption 2 (Continuous Differentiability in Parameters). A loss function $\ell(Y, X, \theta)$ is continuously differentiable with respect to parameters if the gradient $\nabla_\theta \ell(Y, X, \theta)$ exists and is $\beta$-Lipschitz continuous in $\theta$, $x$, and $y$ with respect to the L2-norm on domain and codomain.

Assumption 3 (Strong Convexity).
Loss $\ell(Y, X, \theta)$ is said to be $\gamma$-strongly convex if
$$\ell(y, x, \theta) \geq \ell(y, x, \theta') + \nabla_\theta \ell(y, x, \theta')^\top (\theta - \theta') + \frac{\gamma}{2} \|\theta - \theta'\|_2^2$$
for all $\theta, \theta', y, x$. Observe convexity for $\gamma = 0$.

Let us now turn to specific and more constructive conditions on reciprocal learning's workhorse, the data selection problem $(\Theta, \mathcal{X}, \mathcal{L}_\theta)$. At the heart of these conditions lies a common goal: we want to establish some continuity in how the data changes from $t-1$ to $t$ in response to $\theta_{t-1}$ and $\mathbb{P}_{t-1}$. It is self-evident that without any such continuity, convergence seems out of reach. As it will turn out, bounding the change of the data in $t$ by the change of what happens in $t-1$ will be sufficient for convergence, see figure 3. We thus need the sample adaption function (definition 5) to be Lipschitz-continuous. Theorem 1 will deliver this for subsets of conditions 1 through 5 in case of binary classification problems. The reason for the latter restriction is that we need an explicit definition of $f$ to constructively prove its Lipschitz-continuity.

Condition 1 (Data Regularization). Data selection is regularized as per definition 3.

Condition 2 (Soft Label Prediction). The prediction function $\hat{y}: \mathcal{X} \times \Theta \to \{0, 1\}$ on bounded $\mathcal{X}$ gives rise to a non-degenerate distribution of $Y \mid X$ for any $\theta$, such that we can consider soft label predictions $p: \mathcal{X} \times \Theta \to [0, 1]$ with $p(x, \theta) = \sigma(g(x, \theta))$, where $\sigma: \mathbb{R} \to [0, 1]$ is a sigmoid function. Further, assume that the loss is jointly smooth in these predictions. That is, $\nabla_p \ell(y, p(x, \theta))$ exists and is Lipschitz-continuous in $x$ and $\theta$.

Condition 3 (Stochastic Data Selection). Data is selected stochastically according to $x_s$ by drawing from the normalized criterion $\frac{\exp(c(x, \theta_t))}{\int_{\mathcal{X}} \exp(c(x', \theta_t))\, \mathrm{d}\mu(x')}$, see definition 2.

Condition 4 (Continuous Selection Criterion). It holds for the decision criterion $c: \mathcal{X} \times \Theta \to \mathbb{R}$ in the decision problem $(\Theta, \mathcal{A}, \mathcal{L}_{\theta_t})$ of selecting features to be added to the sample that $\nabla_x c(x, \theta)$ and $\nabla_\theta c(x, \theta)$ are bounded from above.

Condition 5 (Linear Selection Criterion). The decision criterion $c: \mathcal{X} \times \Theta \to \mathbb{R}$ in $(\Theta, \mathcal{A}, \mathcal{L}_{\theta_t})$ is linear in $x$ and Lipschitz-continuous in $\theta$ with a Lipschitz constant $L_c$ that is independent of $x$.

We can interpret $p$ as $P_\theta(Y \mid X = x)$ in condition 2, see also definition 1. In other words, soft labels in the form of probability distributions are available. Adding observations with soft labels to the data can be implemented either through randomization, i.e., by adding $x$ with label 1 with probability $p$ and vice versa, or through weighted retraining. Note that condition 4 implies condition 5 through the characterization of Lipschitz-continuity by bounded gradients. We need two implications of these conditions to establish Lipschitz-continuity of the sample adaption in reciprocal learning. First, it can be shown that regularized data selection (condition 1) is Lipschitz-continuous in the model, see lemma 1. Second, the soft label prediction function (condition 2) is Lipschitz in both data and model if the data selection, in turn, is Lipschitz-continuous in the model, see lemma 2.

Lemma 1 (Regularized Data Selection is Lipschitz). Regularized data selection $x_{d,\mathcal{R}}: \Theta \to \mathcal{X};\ \theta \mapsto \operatorname*{arg\,max}_{\boldsymbol{x} \in \mathcal{X}} \left\{ c(\boldsymbol{x}, \theta) + \tfrac{1}{L_s} \mathcal{R}(\boldsymbol{x}) \right\}$ with $\kappa$-strongly convex regularizer, see definition 3 and condition 1, is $\frac{L_s L_c}{\kappa}$-Lipschitz continuous if $c$ is linear in $x$ (condition 5) and Lipschitz-continuous in $\theta$ with a Lipschitz constant $L_c$ that is independent of $x$.
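A rough numerical illustration of the message of lemma 1: with a regularized selection, a small change in $\theta$ moves the selected feature vector only a little. Definition 3 leaves the regularizer abstract; in this sketch we use a quadratic, $\kappa$-strongly convex penalty and subtract it, so that the selection objective is strongly concave and its arg-max is unique. This sign convention, the linear criterion, and all constants are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def regularized_selection(theta, grid, x0, L_s=1.0, kappa=1.0):
    """x_{d,R}(theta): arg-max over a discretized feature space of a criterion that is
    linear in x (condition 5) plus a strongly concave quadratic penalty toward x0."""
    objective = grid @ theta - (kappa / (2 * L_s)) * np.sum((grid - x0) ** 2, axis=1)
    return grid[int(np.argmax(objective))]

rng = np.random.default_rng(3)
grid = rng.uniform(-1, 1, size=(2000, 3))     # bounded feature space, discretized
x0 = np.zeros(3)
theta = rng.normal(size=3)
for eps in [1e-1, 1e-2, 1e-3]:
    theta_pert = theta + eps * rng.normal(size=3)
    dx = np.linalg.norm(regularized_selection(theta, grid, x0)
                        - regularized_selection(theta_pert, grid, x0))
    print(f"||d_theta|| ~ {eps:.0e} -> ||d_x|| = {dx:.3f}")  # shrinks with eps, up to grid resolution
```

For this quadratic penalty the unconstrained maximizer is $x_0 + \frac{L_s}{\kappa}\theta$, i.e., it moves by at most $\frac{L_s}{\kappa}\|\Delta\theta\|$, in line with the constant $\frac{L_s L_c}{\kappa}$ of lemma 1.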
Lemma 2 (Soft Label Prediction is Lipschitz). The soft label prediction function (condition 2) $p: \mathcal{X} \times \Theta \to [0, 1]$, evaluated as $p(\mathring{x}(\theta), \theta) = \iint \hat{y}(\mathring{x}(\theta), \theta) \, \mathrm{d}P_{\mathring{y}} \, \mathrm{d}P_{\mathring{x}}$, is Lipschitz-continuous in both $x \in \mathcal{X}$ and $\theta \in \Theta$ and in $(x, \theta) \in \mathcal{X} \times \Theta$ if $\int \mathring{x}(\theta) \, \mathrm{d}P_{\mathring{x}}$ is Lipschitz-continuous.

Proofs of all results in this paper can be found in appendix F. With the help of lemmas 1 and 2, we are now able to state two key results. They tell us under which conditions the sample adaption functions in both greedy and non-greedy reciprocal learning are Lipschitz-continuous, which will turn out to be sufficient for convergence.

Theorem 1 (Regularization Makes Sample Adaption Lipschitz-Continuous). If predictions are soft (condition 2) and the data selection is regularized (conditions 1 and 5), both greedy and non-greedy sample adaption functions $f$ and $f_n$ (see definition 5) in reciprocal learning with $\mathcal{Y} = \{0, 1\}$ are Lipschitz-continuous with respect to the L2-norm on $\Theta$ and $\mathbb{N}$, and the Wasserstein-1 distance on $\mathcal{P}$.

Theorem 2 (Randomization Makes Sample Adaption Lipschitz-Continuous). If predictions are soft (condition 2) and the data selection is randomized (conditions 3 and 4), greedy and non-greedy sample adaption functions are Lipschitz-continuous in the sense of theorem 1.

The general idea for both proofs is to show Lipschitz-continuity component-wise and then infer that $f$ and $f_n$ are Lipschitz with the supremum of all component-wise Lipschitz constants. We can now leverage these theorems to state our main result. It tells us (via theorems 1 and 2 and conditions 1-5) which types of reciprocal learning algorithms converge. Recall that convergence (definition 8) in reciprocal learning implies a convergent model and a convergent data set.

Theorem 3 (Convergence of Non-Greedy Reciprocal Learning). If the non-greedy sample adaption $f_n$ is Lipschitz-continuous with $L \leq (1 + \beta/\gamma)^{-1}$, the iterates $R_{n,t} = (\theta_t, \mathbb{P}_t)$ of non-greedy reciprocal learning $R_n$ (definition 7) converge to $R_{n,c} = (\theta_c, \mathbb{P}_c)$ at a linear rate.

Figure 3: Reciprocal learning converges if the change in sample (purple) is bounded by the change in model (yellow) and previous sample.

The proof idea is as follows. We relate the Lipschitz-continuity of $f_n$ to the Lipschitz-continuity of $R_n$ via the dual characterization of the Wasserstein metric [43]. If $f_n$ is Lipschitz with $L \leq (1 + \beta/\gamma)^{-1}$, we further show that $R_n$ is a bivariate contraction. The Banach fixed-point theorem [4, 72, 17] then directly delivers uniqueness and existence of $(\theta_c, \mathbb{P}_c)$ as convergent fixed point, which means that $\theta_c = \arg\min_\theta \mathbb{E}_{(Y,X) \sim f_n(\theta_c, \mathbb{P}_c)}\, \ell(Y, X, \theta)$. A complete proof can be found in appendix F.5. Building on earlier work on performatively optimal predictions [73], we can further relate this convergent training solution to the global solution of reciprocal learning, i.e., the optimal data-parameter fit, see definition 9. The following theorem 4 states that our convergent solution is close to the optimal one. It tells us that we did not enforce a trivial or even degenerate form of convergence (e.g., a constant $\theta_t$) by regularization and randomization. Theorem 4 only refers to the convergent parameter solution, not to the data. Note that the parameter solution is the crucial part of reciprocal learning for later deployment and assessment on test data.

Theorem 4 (Optimality of Convergent Solution).
If non-greedy reciprocal learning converges in the sense of theorem 3, it holds that
$$\|\theta_c - \theta^*\|_2 \leq \frac{2 L_\ell L}{\gamma}$$
for $\theta_c$ and $\theta^*$ from the convergent data-parameter tuple $R_{n,c} = (\theta_c, \mathbb{P}_c)$ and the optimal one $R^*_n = (\theta^*, \mathbb{P}^*)$, if the loss is $L_\ell$-Lipschitz in $X$ and $Y$.

While theorems 1 and 2 guarantee that both the greedy $f$ and the non-greedy $f_n$ are Lipschitz, theorem 3 and thus also theorem 4 only hold for non-greedy reciprocal learning. The question immediately comes to mind whether we can say anything about the asymptotic behavior of the greedy variant, too. The following theorem 5 gives an affirmative answer. Intuitively, there is no fixed point in $\Theta \times \mathcal{P} \times \mathbb{N}$ if data is constantly being added and not removed, such that $n \to \infty$.

Theorem 5. Greedy reciprocal learning does not converge in the sense of definition 8.

We conclude this section with another negative result. It states that theorem 3 is tight with respect to theorems 1 and 2. Summing things up, Lipschitz-continuity is all you need for non-greedy reciprocal learning to converge.

Theorem 6. If the sample adaption $f_n$ is not Lipschitz, non-greedy reciprocal learning can diverge.

## 4 Which reciprocal learning algorithms converge?

We briefly relate the above results to specific algorithms in active learning, bandits, and our running example of self-training. Assume a binary target variable and $L \leq (1 + \beta/\gamma)^{-1}$ with $\gamma$ and $\beta$ in the sense of assumptions 1-3 throughout. First observe that any greedy (definition 6) algorithm that only adds data without removal does not converge in the sense of definition 8 with respect to $\theta$, $\mathbb{P}$, and $n$, see theorem 5. This provides a strong case for non-greedy self-training algorithms, often referred to as amending strategies [103] or self-training with editing [52] and noise filters [104], that add and remove data, as opposed to greedy ones like incremental or batch-wise self-training [103], see example 1, that only add pseudo-labeled data without removing any data. For a detailed explanation and comparison of the two, please refer to appendix A.1.1.

Corollary 1 (Self-Training). Amending self-training algorithms converge in the sense of definition 8 if predicted pseudo-labels are soft (condition 2) and data selection is regularized (condition 1) or randomized (condition 3).

Furthermore, we shed some light on the debate [97, 120] in the literature on multi-armed bandits about whether to use deterministic strategies like upper confidence bound [10, 100, 42] or stochastic ones like epsilon-greedy search [45, 54] or Thompson sampling [89, 90], see example 3. Note, however, that this insight relates to in-sample convergence only, see definition 8.

Corollary 2 (Bandits). Non-greedy multi-armed bandits with Thompson sampling and epsilon-greedy strategies converge in the sense of definition 8 under additional condition 2, while bandits with upper confidence bound (UCB) are not guaranteed to converge.

What is more, condition 2 allows us to distinguish between active learning from weak and strong oracles, see [59, 93, 78] for literature surveys. The former posits the availability of probabilistic or noisy oracle feedback through soft labels [59, 114, 117]; the latter assumes the oracle to have access to an undeniable ground truth via hard labels [26, 19, 20].

Corollary 3 (Active Learning). Active learning from a strong oracle (i.e., providing hard labels) is not guaranteed to converge, while active learning from a weak oracle (soft labels) converges in the sense of definition 8 under additional condition 1 or 3.
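To get a rough feel for the contraction condition $L \leq (1 + \beta/\gamma)^{-1}$ used throughout this section, one can plug in standard constants for the L2-regularized (ridge) logistic loss from section 3: with penalty $\frac{\lambda}{2}\|\theta\|_2^2$ and bounded features, the loss is $\gamma$-strongly convex with $\gamma = \lambda$ and has $\beta$-Lipschitz gradients in $\theta$ with $\beta \leq \lambda + \max_x \|x\|_2^2 / 4$. These are textbook bounds, not constants taken from the paper; the sketch only shows how the admissible responsiveness of the sample adaption scales with the regularization strength $\lambda$.

```python
def contraction_threshold(lam, max_feature_norm=1.0):
    """Admissible Lipschitz constant (1 + beta/gamma)^(-1) for the ridge logistic loss."""
    gamma = lam                                  # strong convexity from the L2 penalty
    beta = lam + max_feature_norm ** 2 / 4.0     # Lipschitz constant of the theta-gradient
    return 1.0 / (1.0 + beta / gamma)

for lam in [0.01, 0.1, 1.0, 10.0]:
    print(f"lambda = {lam:5.2f} -> admissible L <= {contraction_threshold(lam):.3f}")
```

Stronger parameter regularization (larger $\lambda$) raises the threshold, i.e., it tolerates a more responsive sample adaption, which is one way to read the recommendation in section 6 to regularize both parameters and data.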
## 5 Related work

Convergence of active learning, self-training, and other special cases of reciprocal learning has been touched upon in the context of stopping criteria [109, 121, 48, 110, 28, 22, 83, 11, 79, 96]. We refer to section 1 for a discussion and relate reciprocal learning to other fields in what follows.

Continual Learning: While reciprocity through, e.g., gradient-based data selection is a known phenomenon in continual learning [1, 112, 15], the inference goal is not static as in reciprocal learning. Continual learning rather aims at excelling at new tasks (that is, new populations), while reciprocal learning can simply be seen as a greedy approximation of extended ERM, see section 2.

Online Learning: In online learning and online convex optimization, the task is to predict $y$ by $\hat{y}$ from iteratively receiving $x$. After each prediction, the true $y$ and the corresponding loss $\ell(y, \hat{y})$ are observed, see [94] for an introduction and appendix B.2 for an illustration. Reciprocal learning can thus be considered a special online learning framework. Typically, online learning assumes incoming data to be randomly drawn or even selected by an adversarial player, while it is selected by the algorithm itself in reciprocal learning. The majority of the online learning literature is concerned with how to update a model in light of new data, while we focus on how data is selected based on the current model fit. Loosely speaking, online learning deals with only one side of the coin explicitly, while we take a reciprocal point of view: we study both how to learn parameters from data and how to select data in light of fitted parameters.

Coresets: The aim of coreset construction is to find subsamples that lead to parameter fits close to the originally learned parameters [62, 76, 69, 24, 68, 33]. It can be seen as a post hoc approach, while reciprocal learning algorithms directly learn a parameter-efficient sample on the go.

Performative Prediction: The sample adaption functions in reciprocal learning are reminiscent of performative prediction, where today's predictions change tomorrow's population [74, 29, 61], and, more generally, of the reflexivity problem in the social sciences [99, 66]. We identify analogous reflexive effects on the sample level in reciprocal learning via the sample adaption function $f$ (or $f_n$), see section 2. Contrary to performative prediction, however, $f$ ($f_n$) describes an in-sample (performative prediction: population) reflexive effect of both data and parameters (performative prediction: only parameters). Moreover, reciprocal learning describes specific and implementable algorithms, which allows for an explicit study of these reflexive effects. While we rely on similar techniques as in [74, 61], namely Lipschitz-continuity and the Wasserstein distance, our work is thus conceptually different. For an illustration of these differences, see appendix B.1.

Safe Active Learning: Safe active learning explores the feature space by optimizing an acquisition criterion under a safety constraint [122, 51, 91]. While this can be viewed as regularization akin to the one we propose in definition 3, both aim and structure are different: we want to enforce Lipschitz-continuity explicitly via a penalty term in data selection; safe active learning optimizes a selection criterion without penalty terms under constraints that are motivated by domain knowledge.

## 6 Discussion

Summary: We have embedded a wide range of established machine learning algorithms into a unifying framework, called reciprocal learning.
This gave rise to a rigorous analysis of (1) under which conditions and (2) how fast these algorithms converge to an approximately optimal model. We further applied these results to common practices in self-training, active learning, and bandits.

Limitations: While our results guarantee the convergence of reciprocal learning algorithms, the opposite does generally not hold. That is, if our conditions are violated, we cannot rule out the possibility of (potentially weaker notions of) convergence. Furthermore, our analysis requires assumptions on the loss function, as detailed in section 3 and appendix E. In particular, it needs to be $\gamma$-strongly convex and have $\beta$-Lipschitz gradients, such that $L \leq (1 + \beta/\gamma)^{-1}$ with $L$ the Lipschitz constant of the sample adaption. This limits our results' applicability. From another perspective, however, this is a feature rather than a bug, since the described restrictions can serve as design principles for self-training, active learning, or bandit algorithms that shall converge, see below.

Future Work: This article identifies sufficient conditions for convergence of reciprocal learning. These restrictions pave the way for a theory-informed design of novel algorithms. In particular, our results emphasize the importance of regularization of both parameters and data for convergence. While the former is needed to control $\gamma$ and $\beta$, see appendix E.2 for the example of Tikhonov regularization, the latter guarantees Lipschitz-continuity of the sample adaption through theorem 1. Parameter regularization is well studied and has been heavily applied. We conjecture that the concept of data regularization might bear similar practical potential. Another line of future research would be to address the question whether reciprocal learning algorithms are stable with respect to slight changes in the initial training data. In this sense, [8, 30] might serve as a bridge to future research.

## Acknowledgements

We sincerely thank Thomas Augustin, James Bailie, and Lea Höhler for helpful comments on earlier versions of this manuscript. We also thank all four anonymous reviewers for their assessment of our paper. Moreover, we are indebted to several participants of the 2024 Workshop on Machine Learning under Weakly Structured Information in Munich for critically assessing preliminary ideas and conjectures regarding reciprocal learning presented at the workshop. Julian Rodemann acknowledges support by the Federal Statistical Office of Germany within the co-operation project "Machine Learning in Official Statistics", the Bavarian Academy of Sciences (BAS) through the Bavarian Institute for Digital Transformation (bidt), and the LMU mentoring program of the Faculty of Mathematics, Informatics, and Statistics.

## References

[1] Rahaf Aljundi et al. "Gradient based sample selection for online continual learning". In: Advances in Neural Information Processing Systems. Ed. by H. Wallach et al. Vol. 32. Curran Associates, Inc., 2019.
[2] Eric Arazo et al. "Pseudo-labeling and confirmation bias in deep semi-supervised learning". In: 2020 International Joint Conference on Neural Networks. IEEE. 2020, pp. 1-8.
[3] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem". In: Machine Learning 47 (2002), pp. 235-256.
[4] Stefan Banach. "Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales". In: Fundamenta Mathematicae 3.1 (1922), pp. 133-181.
[5] Emanuel Ben-Baruch et al. "Distilling the Knowledge in Data Pruning".
In: ar Xiv preprint ar Xiv:2403.07854 (2024). [6] James O. Berger. Statistical decision theory and Bayesian analysis. 2nd. Springer, 1985. [7] Surojit Biswas et al. Low-N protein engineering with data-efficient deep learning . In: Nature methods 18.4 (2021), pp. 389 396. [8] Olivier Bousquet and AndrΓ© Elisseeff. Stability and generalization . In: The Journal of Machine Learning Research 2 (2002), pp. 499 526. [9] Tom Brown et al. Language models are few-shot learners . In: Advances in neural information processing systems 33 (2020), pp. 1877 1901. [10] Alexandra Carpentier et al. Upper-confidence-bound algorithms for active learning in multiarmed bandits . In: International Conference on Algorithmic Learning Theory. Springer. 2011, pp. 189 203. [11] Deepayan Chakrabarti et al. Mortal multi-armed bandits . In: Advances in neural information processing systems 21 (2008). [12] Olivier Chapelle, Bernhard SchΓΆlkopf, and Alexander Zien. Semi-supervised learning. Adaptive computation and machine learning series. MIT Press, 2006. [13] Anshuman Chhabra et al. What data benefits my classifier? Enhancing model performance and interpretability through influence-based data selection . In: International Conference on Learning Representations. 2024. [14] Geoffrey Chinot, Guillaume LecuΓ©, and Matthieu Lerasle. Robust statistical learning with Lipschitz and convex loss functions . In: Probability Theory and related fields 176.3 (2020), pp. 897 940. [15] Aristotelis Chrysakis and Marie-Francine Moens. Online Continual Learning from Imbalanced Data . In: Proceedings of the 37th International Conference on Machine Learning. Ed. by Hal DaumΓ© III and Aarti Singh. Vol. 119. Proceedings of Machine Learning Research. PMLR, 13 18 Jul 2020, pp. 1952 1961. [16] David A Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning . In: Machine Learning 15.2 (1994), pp. 201 221. [17] Patrick L Combettes and Jean-Christophe Pesquet. Fixed point strategies in data science . In: IEEE Transactions on Signal Processing 69 (2021), pp. 3878 3905. [18] Stefan Dietrich, Julian Rodemann, and Christoph Jansen. Semi-Supervised Learning guided by the Generalized Bayes Rule under Soft Revision . In: ar Xiv preprint ar Xiv:2405.15294 (2024). [19] Shi Dong. Multi class SVM algorithm with active learning for network traffic classification . In: Expert Systems with Applications 176 (2021), p. 114885. [20] Liat Ein Dor et al. Active learning for BERT: an empirical study . In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP). 2020, pp. 7949 7962. [21] Dheeru Dua and Casey Graff. UCI Machine Learning Repository. http://archive.ics. uci.edu/ml. 2017. [22] Eyal Even-Dar et al. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. In: Journal of machine learning research 7.6 (2006). [23] Gabriele Farina et al. Stable-predictive optimistic counterfactual regret minimization . In: International conference on machine learning. PMLR. 2019, pp. 1853 1862. [24] Susanne Frick, Amer Krivosija, and Alexander Munteanu. Scalable Learning of Item Response Theory Models . In: Proceedings of The 27th International Conference on Artificial Intelligence and Statistics. Ed. by Sanjoy Dasgupta, Stephan Mandt, and Yingzhen Li. Vol. 238. Proceedings of Machine Learning Research. PMLR, Feb. 2024, pp. 1234 1242. [25] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data . 
In: International conference on machine learning. PMLR. 2017, pp. 1183 1192. [26] Mingfei Gao et al. Consistency-based semi-supervised active learning: Towards minimizing labeling cost . In: Computer Vision ECCV 2020: 16th European Conference, Glasgow, UK, August 23 28, 2020, Proceedings, Part X 16. Springer. 2020, pp. 510 526. [27] Laura Graves, Vineel Nagisetty, and Vijay Ganesh. Amnesiac machine learning . In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. 13. 2021, pp. 11516 11524. [28] Edita Grolman et al. How and when to stop the co-training process . In: Expert Systems with Applications 187 (2022), p. 115841. [29] Moritz Hardt and Celestine Mendler-DΓΌnner. Performative Prediction: Past and Future . In: ar Xiv preprint ar Xiv:2310.16608 (2023). [30] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent . In: Proceedings of The 33rd International Conference on Machine Learning. Ed. by Maria Florina Balcan and Kilian Q. Weinberger. Vol. 48. Proceedings of Machine Learning Research. New York, New York, USA: PMLR, 20 22 Jun 2016, pp. 1225 1234. [31] Trevor Hastie et al. The elements of statistical learning: data mining, inference, and prediction. Vol. 2. Springer, 2009. [32] Shuo He et al. Candidate Label Set Pruning: A Data-centric Perspective for Deep Partiallabel Learning . In: The Twelfth International Conference on Learning Representations. 2023. [33] Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable Bayesian logistic regression . In: Advances in neural information processing systems 29 (2016). [34] Eyke HΓΌllermeier. Learning from imprecise and fuzzy observations: Data disambiguation through generalized loss minimization . In: International Journal of Approximate Reasoning 55 (2014), pp. 1519 1534. [35] Eyke HΓΌllermeier and Weiwei Cheng. Superset learning based on generalized loss minimization . In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer. 2015, pp. 260 275. [36] Eyke HΓΌllermeier, SΓ©bastien Destercke, and Ines Couso. Learning from imprecise data: adjustments of optimistic and pessimistic variants . In: International Conference on Scalable Uncertainty Management. Springer, 2019, pp. 266 279. [37] Nathan Huntley and Matthias Troffaes. Subtree perfectness, backward induction, and normalextensive form equivalence for single agent sequential decision making under arbitrary choice functions . In: ar Xiv preprint ar Xiv:1109.3607 (2011). [38] Nathan Huntley and Matthias CM Troffaes. Normal form backward induction for decision trees with coherent lower previsions . In: Annals of Operations Research 195.1 (2012), pp. 111 134. [39] Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge university press, 2015. [40] Hideaki Ishibashi and Hideitsu Hino. Stopping criterion for active learning based on deterministic generalization bounds . In: International Conference on Artificial Intelligence and Statistics. PMLR. 2020, pp. 386 397. [41] Hideaki Ishibashi and Hideitsu Hino. Stopping criterion for active learning based on error stability . In: ar Xiv preprint ar Xiv:2104.01836 (2021). [42] Anand Kalvit and Assaf Zeevi. A closer look at the worst-case behavior of multi-armed bandit algorithms . In: Advances in Neural Information Processing Systems 34 (2021), pp. 8807 8819. [43] Leonid Vasilevich Kantorovich and SG Rubinshtein. On a space of totally additive functions . 
In: Vestnik of the St. Petersburg University: Mathematics 13.7 (1958), pp. 52 59. [44] William Karush. Minima of functions of several variables with inequalities as side constraints . In: M. Sc. Dissertation. Dept. of Mathematics, Univ. of Chicago (1939). [45] Volodymyr Kuleshov and Doina Precup. Algorithms for multi-armed bandit problems . In: ar Xiv preprint ar Xiv:1402.6028 (2014). [46] Yongchan Kwon et al. Datainf: Efficiently estimating data influence in lora-tuned llms and diffusion models . In: ar Xiv preprint ar Xiv:2310.00902 (2023). [47] Hunter Lang, Aravindan Vijayaraghavan, and David Sontag. Training subset selection for weak supervision . In: Advances in Neural Information Processing Systems 35 (2022), pp. 16023 16036. [48] Florian Laws and Hinrich SchΓΌtze. Stopping criteria for active learning of named entity recognition . In: Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008). 2008, pp. 465 472. [49] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks . In: Workshop on challenges in representation learning, International Conference on Machine Learning. Vol. 3. 2013, p. 896. [50] David D Lewis and William A Gale. A sequential algorithm for training text classifiers . In: SIGIR 94: Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval. Springer. 1994, pp. 3 12. [51] Cen-You Li, Barbara Rakitsch, and Christoph Zimmer. Safe active learning for multi-output gaussian processes . In: International Conference on Artificial Intelligence and Statistics. PMLR. 2022, pp. 4512 4551. [52] Ming Li and Zhi-Hua Zhou. SETRED: Self-training with editing . In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer. 2005, pp. 611 621. [53] Shuangshuang Li et al. Pseudo-label selection for deep semi-supervised learning . In: 2020 IEEE International Conference on Progress in Informatics and Computing (PIC). IEEE. 2020, pp. 1 5. [54] Tian Lin, Jian Li, and Wei Chen. Stochastic online greedy learning with semi-bandit feedbacks . In: Advances in Neural Information Processing Systems 28 (2015). [55] Wei Liu et al. What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning . In: ar Xiv preprint ar Xiv:2312.15685 (2023). [56] Debmalya Mandal, Stelios Triantafyllou, and Goran Radanovic. Performative reinforcement learning . In: International Conference on Machine Learning. PMLR. 2023, pp. 23642 23680. [57] Max Marion et al. When less is more: Investigating data pruning for pretraining llms at scale . In: ar Xiv preprint ar Xiv:2309.04564 (2023). [58] Nestor Maslej et al. Artificial intelligence index report 2023 . In: ar Xiv preprint ar Xiv:2310.03715 (2023). [59] BjΓΆrn Mattsson. Active learning of neural network from weak and strong oracles . In: (2017). [60] Lara Mauri and Ernesto Damiani. Estimating degradation of machine learning data assets . In: ACM Journal of Data and Information Quality (JDIQ) 14.2 (2021), pp. 1 15. [61] John P Miller, Juan C Perdomo, and Tijana Zrnic. Outside the echo chamber: Optimizing the performative risk . In: International Conference on Machine Learning. PMLR. 2021, pp. 7710 7720. [62] Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models . In: International Conference on Machine Learning. PMLR. 2020, pp. 6950 6960. [63] Jonas MoΛ‡ckus. 
On Bayesian methods for seeking the extremum . In: Optimization Techniques IFIP Technical Conference: Novosibirsk, July 1 7, 1974. Springer. 1975, pp. 400 404. [64] Jonas MoΛ‡ckus, Vytautas Tiesis, and Antanas Zilinskas. The application of Bayesian methods for seeking the extremum . In: Towards global optimization 2.117-129 (1978), p. 2. [65] Robert Munro Monarch. Human-in-the-Loop Machine Learning: Active learning and annotation for human-centered AI. Simon and Schuster, 2021. [66] O Morgenstern. Wirtschaftsprognose: Eine Untersuchung ihrer Voraussetzungen und MΓΆglichkeiten, Wien 1928, cited after: G. Betz (2004), Empirische und aprioristische Grenzen von Wirtschaftsprognosen: Oskar Morgenstern nach 70 Jahren . In: Wissenschaftstheorie in Γ–konomie und Wirtschaftsinformatik, Deutscher UniversitΓ€ts-Verlag, Wiesbaden (1928), pp. 171 190. [67] Niklas Muennighoff et al. Scaling data-constrained language models . In: Advances in Neural Information Processing Systems 36 (2024). [68] Alexander Munteanu and Chris Schwiegelshohn. Coresets-Methods and History: A Theoreticians Design Pattern for Approximation and Streaming Algorithms . In: KI - KΓΌnstliche Intelligenz 32.1 (Feb. 2018), pp. 37 53. ISSN: 1610-1987. [69] Alexander Munteanu et al. On Coresets for Logistic Regression . In: Advances in Neural Information Processing Systems. Ed. by S. Bengio et al. Vol. 31. Curran Associates, Inc., 2018. [70] Ofir Nachum et al. Data-efficient hierarchical reinforcement learning . In: Advances in neural information processing systems 31 (2018). [71] Malte Nalenz, Julian Rodemann, and Thomas Augustin. Learning de-biased regression trees and forests from complex samples . In: Machine Learning (2024), pp. 1 20. [72] Vittorino Pata et al. Fixed point theorems and applications. Vol. 116. Springer, 2019. [73] Juan Perdomo. Performative Prediction: Theory and Practice . Ph D thesis. UC Berkeley, 2023. [74] Juan Perdomo et al. Performative prediction . In: International Conference on Machine Learning. PMLR. 2020, pp. 7599 7609. [75] Nitin Namdeo Pise and Parag Kulkarni. A survey of semi-supervised learning methods . In: 2008 International conference on computational intelligence and security. Vol. 2. IEEE. 2008, pp. 30 34. [76] Omead Pooladzandi, David Davini, and Baharan Mirzasoleiman. Adaptive second order coresets for data-efficient machine learning . In: International Conference on Machine Learning. PMLR. 2022, pp. 17848 17869. [77] Daniel Reker. Practical considerations for active machine learning in drug discovery . In: Drug Discovery Today: Technologies 32 (2019), pp. 73 79. [78] Pengzhen Ren et al. A survey of deep active learning . In: ACM computing surveys (CSUR) 54.9 (2021), pp. 1 40. [79] Paul Reverdy, Vaibhav Srivastava, and Naomi Ehrich Leonard. Satisficing in multi-armed bandit problems . In: IEEE Transactions on Automatic Control 62.8 (2016), pp. 3788 3803. [80] Mamshad Nayeem Rizve et al. In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning . In: International Conference on Learning Representations, 2020. 2020. [81] Herbert Robbins. Some aspects of the sequential design of experiments . In: Bulletin of the American Mathematical Society 58.5 (1952), pp. 527 535. [82] Julian Rodemann. Bayesian Data Selection . In: ar Xiv preprint ar Xiv:2406.12560 (2024). 5th Workshop on Data-Centric Machine Learning Research (DMLR) at ICML 2024. [83] Julian Rodemann et al. Approximately Bayes-optimal pseudo-label selection . 
In: Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI). Vol. 216. Proceedings of Machine Learning Research. PMLR, 2023, pp. 1762-1773. [84] Julian Rodemann et al. In all likelihoods: Robust selection of pseudo-labeled data. In: International Symposium on Imprecise Probability: Theories and Applications. PMLR. 2023, pp. 412-425. [85] Julian Rodemann et al. Levelwise Data Disambiguation by Cautious Superset Classification. In: International Conference on Scalable Uncertainty Management. Springer. 2022, pp. 263-276. [86] Julian Rodemann et al. Not All Data Are Created Equal: Lessons From Sampling Theory For Adaptive Machine Learning. In: International Conference on Statistics and Data Science (ICSDS) by the Institute of Mathematical Statistics (IMS). 2022. [87] Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. Semi-Supervised Self-Training of Object Detection Models. In: 2005 Seventh IEEE Workshops on Applications of Computer Vision (WACV/MOTION '05) - Volume 1. Vol. 1. IEEE. 2005, pp. 29-36. [88] Jean-Francis Roy, Mario Marchand, and François Laviolette. A column generation bound minimization approach with PAC-Bayesian generalization guarantees. In: Artificial Intelligence and Statistics. PMLR. 2016, pp. 1241-1249. [89] Daniel J Russo and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling. In: Journal of Machine Learning Research 17.68 (2016), pp. 1-30. [90] Daniel J Russo et al. A tutorial on Thompson sampling. In: Foundations and Trends in Machine Learning 11.1 (2018), pp. 1-96. [91] Jens Schreiter et al. Safe exploration for active learning with Gaussian processes. In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2015, Porto, Portugal, September 7-11, 2015, Proceedings, Part III 15. Springer. 2015, pp. 133-149. [92] Max Schwarzer et al. Pretraining representations for data-efficient reinforcement learning. In: Advances in Neural Information Processing Systems 34 (2021), pp. 12686-12699. [93] Burr Settles. Active Learning Literature Survey. Tech. rep. Computer Sciences Technical Report 1648. University of Wisconsin-Madison, 2010. [94] Shai Shalev-Shwartz et al. Online learning and online convex optimization. In: Foundations and Trends in Machine Learning 4.2 (2012), pp. 107-194. [95] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014. [96] Jaehyeok Shin, Aaditya Ramdas, and Alessandro Rinaldo. On Conditional Versus Marginal Bias in Multi-Armed Bandits. In: Proceedings of the 37th International Conference on Machine Learning. Ed. by Hal Daumé III and Aarti Singh. Vol. 119. Proceedings of Machine Learning Research. PMLR, 13-18 Jul 2020, pp. 8852-8861. [97] Aleksandrs Slivkins et al. Introduction to multi-armed bandits. In: Foundations and Trends in Machine Learning 12.1-2 (2019), pp. 1-286. [98] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems 25 (2012). [99] George Soros. The alchemy of finance. John Wiley & Sons, 2015. [100] Niranjan Srinivas et al. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. In: IEEE Transactions on Information Theory 58.5 (2012), pp. 3250-3265. [101] Bernadette J Stolz. Outlier-robust subsampling techniques for persistent homology. In: Journal of Machine Learning Research 24.90 (2023), pp. 1-35.
[102] Hugo Touvron et al. Llama: Open and efficient foundation language models. In: arXiv preprint arXiv:2302.13971 (2023). [103] Isaac Triguero, Salvador García, and Francisco Herrera. Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. In: Knowledge and Information Systems 42.2 (2015), pp. 245-284. [104] Isaac Triguero et al. On the characterization of noise filters for self-training semi-supervised in nearest neighbor classification. In: Neurocomputing 132 (2014), pp. 30-41. [105] Vishaal Udandarao et al. No "zero-shot" without exponential data: Pretraining concept frequency determines multimodal model performance. In: arXiv preprint arXiv:2404.04125 (2024). [106] Jesper E Van Engelen and Holger H Hoos. A survey on semi-supervised learning. In: Machine Learning 109.2 (2020), pp. 373-440. [107] Pablo Villalobos et al. Will we run out of data? An analysis of the limits of scaling datasets in machine learning. In: arXiv preprint arXiv:2211.04325 (2022). [108] Cédric Villani et al. Optimal transport: old and new. Vol. 338. Springer, 2009. [109] Andreas Vlachos. A stopping criterion for active learning. In: Computer Speech & Language 22.3 (2008), pp. 295-312. [110] Wenquan Wang, Wenbin Cai, and Ya Zhang. Stability-based stopping criterion for active learning. In: 2014 IEEE International Conference on Data Mining. IEEE. 2014, pp. 1019-1024. [111] Ximei Wang et al. Self-tuning for data-efficient deep learning. In: International Conference on Machine Learning. PMLR. 2021, pp. 10738-10748. [112] Felix Wiewel and Bin Yang. Entropy-based Sample Selection for Online Continual Learning. In: 2020 28th European Signal Processing Conference (EUSIPCO). 2021, pp. 1477-1481. [113] Zimo Yin et al. Embrace sustainable AI: Dynamic data subset selection for image classification. In: Pattern Recognition (2024), p. 110392. [114] Taraneh Younesian et al. QActor: Active learning on noisy labels. In: Asian Conference on Machine Learning. PMLR. 2021, pp. 548-563. [115] Kelly Zhang, Lucas Janson, and Susan Murphy. Statistical inference with M-estimators on adaptively collected data. In: Advances in Neural Information Processing Systems 34 (2021), pp. 7460-7471. [116] Lijun Zhang, Tie-Yan Liu, and Zhi-Hua Zhou. Adaptive regret of convex and smooth functions. In: International Conference on Machine Learning. PMLR. 2019, pp. 7414-7423. [117] Wentao Zhang et al. Information gain propagation: a new way to graph active learning with soft labels. In: arXiv preprint arXiv:2203.01093 (2022). [118] Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset Condensation with Gradient Matching. In: International Conference on Learning Representations. 2020. [119] Peng Zhao et al. Dynamic regret of convex and smooth functions. In: Advances in Neural Information Processing Systems 33 (2020), pp. 12510-12520. [120] Li Zhou. A survey on contextual multi-armed bandits. In: arXiv preprint arXiv:1508.03326 (2015). [121] Jingbo Zhu et al. Confidence-based stopping criteria for active learning for data annotation. In: ACM Transactions on Speech and Language Processing (TSLP) 6.3 (2010), pp. 1-24. [122] Christoph Zimmer, Mona Meister, and Duy Nguyen-Tuong. Safe active learning for time-series modeling with Gaussian processes. In: Advances in Neural Information Processing Systems 31 (2018). A Familiar examples of reciprocal learning We will demonstrate that well-established machine learning procedures are special cases of reciprocal learning.
We start by illustrating reciprocal learning by self-training in semi-supervised learning (SSL), see section 2.1, and then turn to active learning and multi-armed bandits. A.1 Self-Training For ease of exposition, we will start by focusing on binary target variables, i.e., the image of Y is {0, 1}, with real-valued features X. Moreover, we will only consider cases where the sample changes through the addition of one instance per iteration.5 Leaning on [106, 12, 103], we describe SSL as follows. Consider labeled data D = {(x_i, y_i)}_{i=1}^n ∈ (X × Y)^n (4) and unlabeled data x ∈ X. The aim of SSL is to learn a predictive classification function ŷ(x, θ) parameterized by θ utilizing both labeled and unlabeled data. According to [75] and [106], SSL can be broadly categorized into self-training and co-training. We will focus on the former. Self-training involves fitting a model on D by ERM and then exploiting this model to predict labels for X. In a second step, some instances {x_i}_{i=n+1}^m ⊆ X are selected to be added to the training data together with the predicted label, typically the ones with the highest confidence according to some criterion, see [2, 49, 80, 83, 53, 84, 18] for examples. Example 1 (Self-Training). Self-training is an instance of reciprocal learning with the sample adaption function (see definition 5) f_SSL: Θ × P × N → P; (θ, P(Y, X), n) ↦ P′(Y, X) with P′(Y, X) induced by

P′(Y = 1, X = x) = ∫∫ [ 1(x = x_t(θ)) ŷ(x_t(θ), θ) + n P(Y = 1, X = x) ] / (n + 1) dP_{Y|X}(y) dP_X(x),

where x_t(θ) (definition 2) selects data with the highest confidence score, see [2, 80, 53], according to the model θ, and gives rise to P_X. The prediction function ŷ: X × Θ → {0, 1} returns the predicted label of the selected x_t(θ) based on the learned model θ and gives rise to P_{Y|X}. The averaging with respect to P_X and P_{Y|X} accounts for the fact that we allow stochastic inclusion of X in the sample through randomized actions and for probabilistic predictions of Y | X, respectively. For now, however, it suffices to think of the special case of degenerate distributions P_X and P_{Y|X} putting point mass 1 on data with hard labels in the sample and 0 elsewhere.6 Through averaging with respect to P_{Y|X} we can describe the joint distribution of hard labels (y_1, x_1), ..., (y_n, x_n) and predicted soft labels ỹ = p̂(Y = 1 | x, θ) ∈ [0, 1] of (ỹ_{n+1}, x_{n+1}), ..., (ỹ_{n+t}, x_{n+t}). Summing up, both deterministic data selection and non-probabilistic (i.e., hard-label) predictions are well-defined special cases of the above, with P_{Y|X} and P_X collapsing to trivial Dirac measures, respectively. A.1.1 Implications of convergence results Our corollaries in section 4 shed some light on the convergence of self-training methods in a semi-supervised learning regime. Recall from above that the aim of these methods is to learn a predictive classification function ŷ(x, θ) parameterized by θ utilizing both labeled data D = {(x_i, y_i)}_{i=1}^n ∈ (X × Y)^n and unlabeled data U = {(x_i, Y)}_{i=n+1}^m ∈ (X × 2^Y)^{m−n} from the same data generation process. Self-training involves fitting a model identified with parameters θ on D by ERM and then exploiting this model to predict labels for U.
In incremental self-training, some instances from U are selected to be added to the training data (together with the predicted label) according to some regularized data selection criterion c_r(x, θ) = c(x, θ) + (1/L_s) R(x), see example 1 and the pseudo code below. Amending self-training does the same, but additionally removes instances, see below; a minimal code sketch of both variants follows after the pseudo code. 5 In case more than one instance is added per iteration, the sample adaption function can be defined as a composite function of the used sample adaption functions. 6 In this case, P′(Y = 1, X = x) = [ 1(x = x_t(θ)) ŷ(x_t(θ), θ) + n P(Y = 1, X = x) ] / (n + 1). The key insight from our analysis is that the sequence of θ converges at a linear rate in case of amending self-training and regularized data selection.

Algorithm 1: Incremental Self-Training in Semi-Supervised Learning
Data: Labeled data D, unlabeled data U
Result: Updated D, fitted model θ
while stopping criterion not met do
    fit model θ on labeled data D
    for i ∈ {1, ..., |U|} do
        compute c_r(x_i, θ)
    end
    obtain i* = argmax_i c_r(x_i, θ)
    predict ŷ_{i*} = ŷ(x_{i*}, θ)
    update D ← D ∪ {(x_{i*}, ŷ_{i*})}
    update U ← U \ {(x_{i*}, Y)}
end

Algorithm 2: Amending Self-Training in Semi-Supervised Learning
Data: Labeled data D, unlabeled data U
Result: Updated D, fitted model θ
while stopping criterion not met do
    fit model θ on labeled data D
    for i ∈ {1, ..., |U|} do
        compute c_r(x_i, θ)
    end
    obtain i* = argmax_i c_r(x_i, θ)
    predict ŷ_{i*} = ŷ(x_{i*}, θ)
    for j ∈ {1, ..., |U|} do
        compute c(x_j, θ)
    end
    obtain j* = argmin_j c(x_j, θ)
    update D ← D ∪ {(x_{i*}, ŷ_{i*})} \ {(x_{j*}, y_{j*})}
    update U ← U \ {(x_{i*}, Y)}
end
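To make the two pseudo-code listings concrete, here is a minimal Python sketch of both variants for a generic probabilistic classifier. It is only an illustration under simplifying assumptions: confidence (maximum predicted class probability) is a placeholder selection criterion, reg is a placeholder data regularizer, and the removal step of the amending variant simply drops the labeled point the current model is least confident about, rather than re-scoring candidates exactly as in Algorithm 2.

import numpy as np
from sklearn.linear_model import LogisticRegression

def confidence(model, X):
    # placeholder selection criterion c(x, theta): maximum predicted class probability
    return model.predict_proba(X).max(axis=1)

def self_train(X_lab, y_lab, X_unlab, steps=10, amending=False, reg=None, L_s=1.0):
    """Sketch of incremental (amending=False) and amending (amending=True) self-training.
    Assumes both classes are present in the initial labeled data."""
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(steps):
        if len(X_unlab) == 0:
            break
        model = LogisticRegression().fit(X_lab, y_lab)           # fit theta on D
        c = confidence(model, X_unlab)                            # c(x_i, theta) for all x_i in U
        if reg is not None:                                       # regularized criterion c_r = c + R/L_s
            c = c + reg(X_unlab) / L_s
        i_star = int(np.argmax(c))                                # most confident unlabeled point
        y_hat = model.predict(X_unlab[i_star:i_star + 1])         # pseudo-label
        X_lab = np.vstack([X_lab, X_unlab[i_star:i_star + 1]])    # D <- D + {(x*, y_hat)}
        y_lab = np.append(y_lab, y_hat)
        X_unlab = np.delete(X_unlab, i_star, axis=0)              # U <- U \ {x*}
        if amending and len(X_lab) > 1:
            # simplified removal step: drop the labeled point with lowest criterion value
            j_star = int(np.argmin(confidence(model, X_lab)))
            X_lab = np.delete(X_lab, j_star, axis=0)
            y_lab = np.delete(y_lab, j_star)
    return LogisticRegression().fit(X_lab, y_lab), X_lab, y_lab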
A.2 Active learning Active learning is a machine learning paradigm where the learning algorithm iteratively asks an oracle to provide true labels for training data [50, 16, 93]. The goal is to improve the sample efficiency of the learning process by asking queries that are expected to provide the most information. Let X be the input space and Y the set of possible labels. Consider training data D = {(x_i, y_i)}_{i=1}^n ∈ (X × Y)^n (5) as above. The active learning cycle is as follows. First, train a model on the currently labeled dataset D. Next, select the most informative sample x* ∈ X based on an acquisition function (criterion) such as uncertainty, representativeness, or expected model change, and obtain the label y* for the selected instance x* from an oracle (e.g., a human expert). Finally, update the training data D ← D ∪ {(x*, y*)} and refit the model. This cycle is repeated until a stopping criterion is met (e.g., a performance threshold). Example 2 (Active Learning). Active learning is an instance of reciprocal learning with the following sample adaption function (see definition 5): f_AL: Θ × P × N → P; (θ, P(Y, X), n) ↦ P′(Y, X) with P′(Y, X) induced by

P′(Y = 1, X = x) = ∫∫ [ 1(x = x_s(θ)) q_y(x_s(θ)) + n P(Y = 1, X = x) ] / (n + 1) dP_{Y|X}(y) dP_X(x),

where x_s(θ) is a data selection function, see definition 2. Its induced distribution on X is P_X. The query function q_y: X → [0, 1] returns the true class (probability) for the selected x_s(θ) and gives rise to P_{Y|X}. In contrast to self-training (example 1), the queried labels q_y(x_s(θ)) do not directly depend on the model θ, only indirectly through x_s(θ). Again, both deterministic data selection through x_s(θ) and non-probabilistic (i.e., hard-label) queries through q_y: X → {0, 1} are well-defined special cases of the above, with P_{Y|X} and P_X collapsing to trivial Dirac measures, respectively. As far as we can survey the active learning literature, hard-label queries [77, 25, 65] are more common than probabilistic or soft queries [114]. A.3 Multi-armed bandits The multi-armed bandit problem is one of the most general setups for evaluating decision-making strategies when facing uncertain outcomes. It is named after the analogy of a gambler at a row of slot machines, where each machine provides a different, unknown reward distribution. The gambler must develop a strategy to maximize their rewards over a series of spins, balancing the exploration of machines to learn more about their rewards against exploiting known information to maximize returns. Typically, a contextual bandit algorithm comprises contexts {X_t}_{t=1}^T, actions {A_t}_{t=1}^T, and primary outcomes {Y_t}_{t=1}^T, again with binary image of Y_t for simplicity, denoted by Y = {0, 1}. We assume that rewards are a deterministic function of the primary outcomes, i.e., R_t = f(Y_t) for some known function f. Following [115], we use potential outcome notation [39] and let {Y(a) : a ∈ A} denote the potential outcomes of the primary outcome and let Y_t := Y(A_t) be the observed outcome. Define these quantities analogously for X(a) and call

H_t := {X_{t'}, A_{t'}, Y_{t'}}_{t'=1}^{t} (6)

the history for t ≥ 1 and H_0 := ∅ as in [115]. The fixed (and time-independent) potential joint distributions of X_t and Y_t shall be denoted by

{X_t, Y_t(a) : a ∈ A} ∼ P(X, Y) ∈ P for t ∈ {1, ..., T}. (7)

Further assume that we learn a model of P(X, Y, A) or P(Y | X, A), which can be parameterized by θ_t ∈ Θ. Note that for a specific reward function and A = X, active learning could be formulated as a multi-armed bandit problem. The following embedding into reciprocal learning, however, is much more general. It only requires that the probability of playing an action, P(A_t | H_{t−1}), is informed by our model θ, i.e.,

P(A_t | H_{t−1}) = P(A_t | θ(H_{t−1})), (8)

which is a very mild assumption given that the latter is the whole point of θ in multi-armed contextual bandit problems. Example 3 (Multi-Armed Bandits). Multi-armed bandits are instances of reciprocal learning with the following sample adaption function (see definition 5): f_MAB: Θ × P × N → P; (θ, P(Y, X), n) ↦ P′(Y, X) with P′(Y, X) induced by

P′(Y = 1, X = x) = ∫ [ 1(x = x(a(θ))) Y(a(θ)) + n P(Y = 1, X = x) ] / (n + 1) P(A_t | θ(H_{t−1})) da,

where a(θ): Θ → A is an action selection function7, also referred to as policy function in the bandit literature, that induces the well-known action selection probabilities P(A | θ(H_{t−1})), often called policies and denoted by π := {π_t}_{t≥1}. Further note that the indicator function takes an argument that depends on θ only through a, contrary to active and semi-supervised learning. Several strategies exist to solve multi-armed bandit problems, including upper confidence bound (deterministic), epsilon-greedy (stochastic), and the already mentioned Thompson sampling (stochastic). Deterministic strategies like upper confidence bound can be embedded into the above general stochastic formulation through degenerate policies P(A | θ(H_{t−1})) putting point mass 1 on the deterministically optimal action.
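As an illustration of equation (8), the following minimal Python sketch implements an epsilon-greedy contextual bandit in which the action selection probabilities depend on the history only through fitted per-action parameters θ. The least-squares reward model, the epsilon-greedy policy, and the simulated contexts and outcomes are placeholder choices for illustration, not part of the formal setup above.

import numpy as np

rng = np.random.default_rng(0)

def fit_theta(history, n_actions, d):
    """Least-squares estimate of per-action reward weights from the history H_t."""
    theta = np.zeros((n_actions, d))
    for a in range(n_actions):
        rows = [(x, y) for (x, act, y) in history if act == a]
        if rows:
            X = np.array([x for x, _ in rows])
            y = np.array([y for _, y in rows])
            theta[a] = np.linalg.lstsq(X, y, rcond=None)[0]
    return theta

def action_probs(theta, x, eps=0.1):
    """Epsilon-greedy policy: P(A_t | theta(H_{t-1})) depends on the history only via theta."""
    n_actions = theta.shape[0]
    p = np.full(n_actions, eps / n_actions)
    p[np.argmax(theta @ x)] += 1.0 - eps
    return p

# one simulated run of the bandit loop (contexts and outcomes are placeholders)
n_actions, d, history = 3, 5, []
for t in range(100):
    x = rng.normal(size=d)                                 # context X_t
    theta = fit_theta(history, n_actions, d)               # model theta(H_{t-1})
    a = rng.choice(n_actions, p=action_probs(theta, x))    # randomized action selection
    y = float(rng.random() < 0.5)                          # placeholder primary outcome Y_t = Y(a)
    history.append((x, a, y))                              # H_t <- H_{t-1} + {(X_t, A_t, Y_t)}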
7 It is usually directly defined in terms of action selection probabilities P(A_t | H_{t−1}), see [115] for instance. B Additional Illustrations B.1 Difference between reciprocal learning and performative prediction (a) Reciprocal Learning (b) Performative Prediction Figure 4: (A) Reciprocal learning fits a model from the model space (restricted by red curve) to a realized sample from the sample space (blue-grey) that depends on the previous model fit, see Figure 1b. (B) In performative prediction, the population, not the sample, changes in response to the model fit. In other words, reciprocal learning algorithms have a static inference goal, while performative prediction is concerned with moving targets. B.2 Reciprocal learning compared to general online learning (a) One-Shot Learning (b) Reciprocal Learning (c) Online Learning Figure 5: (A) Classical one-shot machine learning fits a model from the model space (restricted by red curve) to a realized sample from the sample space (blue-grey), see [31, Figure 7.2]. (B) Reciprocal learning fits a model from the model space (restricted by red curve) to a realized sample from the sample space (blue-grey) that depends on the previous model fit, see Figure 1b. (C) In the general online learning setup, there is no interaction between the sample in t and the model in t − 1. C Illustrative experiments on data regularization We run two simple experiments to illustrate the effect of data regularization (definition 3) on the stability of parameters θ_t in reciprocal learning by the example of self-training in semi-supervised learning, see sections 2.1, A.1 and example 1. Code to reproduce the findings can be found at https://github.com/rodemann/simulations-self-training-reciprocal-learning. Specifically, we deploy incremental self-training with soft labels on a real-world dataset (banknote data) with 90%, 80%, and 70% unlabeled data, see figures 6 and 7. The task is to predict the authenticity of a banknote based on labeled and unlabeled data. We use a generalized additive model and multiple selection criteria from the literature ([83, 35, 80, 103]), one of which is regularized according to section 3. We want to compare the stability of the parameter vector θ_t of self-training with regularized data selection to self-training with unregularized data selection criteria. Specifically, we are interested in comparing the regularized Bayesian selection criterion (gold) to its unregularized counterpart (red). The goal is not to study convergence under all conditions 1 to 5, but merely to illustrate the stabilizing effect of the novel concept of data regularization (condition 1) on the sequence of learned parameters. To do so, we compute the L2-norm of the parameter vector θ_t at each iteration t, see figures 6a, 6b and 6c. It becomes evident that self-training with regularized data selection is more stable than with unregularized data selection. Note that this setup analyzes variation (or rather, the absence thereof) within an experiment. We also assess the variation of θ_t between experiments. In order to do so, we restart the experiment 40 times, average the L2-norm of θ_t over these 40 restarts, and compute 95%-confidence intervals to assess the variation between experiments, see figure 7. We observe that the regularized selection criterion has much higher variation between experiments than its unregularized counterpart. Interestingly, the lower in-experiment variation due to data regularization seems to come at the cost of higher between-experiment variation.

Figure 6: Self-training with soft labels and varying selection criteria c(x, θ), one of which (Bayes-crit-reg) is regularized, on banknote data [21] with 90% (a), 80% (b), and 70% (c) unlabeled data; the y-axis shows the L2-norm of θ_t at iteration t. Iterations vary between (a), (b), and (c) due to the varying size of the unlabeled data. Model: Generalized additive regression. Data source: public UCI Machine Learning Repository [21]. References for the other selection criteria: Bayes-crit: Rodemann, J., et al., "Approximately Bayes-optimal pseudo-label selection" [83]. Likelihood: Hüllermeier, E., Cheng, W., "Superset learning based on generalized loss minimization" [35]. Predictive Var: Rizve, M. N., et al., "In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning" [80]. Probability Score: Triguero, I., García, S., Herrera, F. (2015), "Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study" [103]. For details, see https://github.com/rodemann/simulations-self-training-reciprocal-learning.

Figure 7: Self-training with soft labels and varying selection criteria c(x, θ), one of which (Bayes-crit-reg) is regularized, on banknote data [21] with 90% (a), 80% (b), and 70% (c) unlabeled data; panels show the distribution of L2-norms of θ_t over 40 restarts, i.e., the y-axis shows the L2-norm of θ_t averaged over the 40 restarts. Shaded area indicates 95%-confidence region.
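As a pointer to how the curves and bands in figures 6 and 7 can be computed, the following Python sketch aggregates per-iteration parameter norms over restarts. The array of norms is filled with synthetic placeholder values here, and the normal-approximation confidence band is an assumption for illustration, not necessarily the exact construction used for figure 7.

import numpy as np

# norms[r, t] = ||theta_t||_2 in restart r of the self-training experiment
# (filled with synthetic placeholder values; in the real experiment these come
#  from refitting the model after every pseudo-label addition)
rng = np.random.default_rng(1)
n_restarts, n_iter = 40, 120
norms = np.cumsum(rng.normal(0.0, 0.05, size=(n_restarts, n_iter)), axis=1) + 2.0

mean = norms.mean(axis=0)                                  # average L2-norm over the 40 restarts
se = norms.std(axis=0, ddof=1) / np.sqrt(n_restarts)
ci_lower, ci_upper = mean - 1.96 * se, mean + 1.96 * se    # 95% confidence band per iteration

within_run_variation = norms.std(axis=1).mean()    # stability within an experiment (figure 6)
between_run_variation = norms.std(axis=0).mean()   # stability between experiments (figure 7)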
D Alternative stochastic data selection Definition 10 (Data Selection, alternative). Let c: X × Θ → R be a criterion for the decision problem (Θ, A, L_{θ_t}) of selecting features to be added to the sample in iteration t. Let D(X) denote a suitable set of probability measures on the measurable space (X, σ(X)). Define

c̄: D(X) × Θ → R, (λ, θ_t) ↦ E_λ( c(·, θ_t) ).

Each λ ∈ D(X) is interpreted as a randomized feature selection strategy. The function c̄ evaluates randomized feature selection strategies based on the expectation of the criterion c under the randomization weights. E Discussion of assumptions on loss E.1 General discussion of assumptions 1, 2, and 3 Assumptions 2 and 3 address the loss function ℓ in standard ERM, i.e., the first decision problem, as detailed in section 2. These are general assumptions often needed in a wide array of repeated (empirical) risk minimization setups, see [56, 116, 119] and particularly [74] as well as [95] for an overview. Assumption 1 will be needed in both decision problems of data selection and parameter selection, but is still fairly general and mild. E.2 Discussion of assumption 3: Strong convexity of loss function Assumption 3 is typically required for fast convergence of repeated ERM solution strategies. Here, it is needed for reciprocal learning to converge at all. Thus, it can be considered a stronger assumption than assumptions 1 and 2. It is easy to see that common loss functions like the linear loss ℓ(y, x, θ) = θxy are convex, but not strongly convex. The same even holds for the logistic loss ℓ(y, x, θ) = log(1 + exp(θxy)). To see this, consider its second partial derivative ∂²_θ ℓ(y, x, θ). It is

∂²_θ ℓ(y, x, θ) = y² x² exp(θyx) / (1 + exp(θyx))².

It becomes evident that lim_{x→∞} lim_{y→∞} ∂²_θ ℓ(y, x, θ) = 0 for any fixed θ ≠ 0. Hence, there is no K > 0 that can bound ∂²_θ ℓ(y, x, θ) from below. However, a Tikhonov-regularized version thereof,

ℓ_r(y, x, θ) = log(1 + exp(θxy)) + (γ/2) ||θ||²,

is γ-strongly convex, which follows from analogous reasoning. This sheds some light on the nature of our sufficient conditions for convergence, see also section 6. In fact, we require regularization of parameters to obtain strong convexity of the loss function and regularization of data to obtain Lipschitz-continuity of the sample adaptions, see theorem 1. This paves the way for theory-informed design of regularized reciprocal learning algorithms that are guaranteed to converge.
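As a quick numerical illustration of this point (a sketch with arbitrarily chosen scalar inputs, not part of the formal argument), the curvature of the plain logistic loss in θ can be made arbitrarily small, while adding the Tikhonov term keeps it bounded below by γ:

import numpy as np

def d2_logistic(theta, x, y):
    # second derivative in theta of log(1 + exp(theta * x * y)) for scalar inputs
    z = theta * x * y
    return (x * y) ** 2 * np.exp(z) / (1.0 + np.exp(z)) ** 2

gamma = 0.1
theta, y = -0.5, 1.0
for x in [1.0, 10.0, 100.0]:
    plain = d2_logistic(theta, x, y)    # approaches 0 as |x*y| grows: no lower bound K > 0
    regularized = plain + gamma         # curvature of l_r = l + (gamma/2)*||theta||^2
    print(f"x={x:6.1f}  plain={plain:.3e}  regularized >= gamma: {regularized >= gamma}")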
F.1 Proof of Lemma 1 Proof. Recall definition 2 of c: X × Θ → R as a criterion for the decision problem (Θ, A, L_{θ_t}). We will prove the Lipschitz-continuity of regularized data selection

x_{t,R}: Θ → X; θ ↦ argmax_{x ∈ X} { c(x, θ) + (1/L_s) R(x) }, (9)

where R(·) is a κ-strongly convex regularizer. The variational inequality for the optimality of argmax_{x ∈ X} { c(x, θ) + (1/L_s) R(x) } =: s(θ) implies

( ∇_x c(s(θ), θ) + (1/L_s) ∇_x R(s(θ)) )ᵀ ( s(θ') − s(θ) ) ≤ 0. (10)

Symmetrically for s(θ'),

( ∇_x c(s(θ'), θ') + (1/L_s) ∇_x R(s(θ')) )ᵀ ( s(θ) − s(θ') ) ≤ 0. (11)

Summing the above two inequalities yields

( ∇_x c(s(θ), θ) + (1/L_s) ∇_x R(s(θ)) )ᵀ ( s(θ') − s(θ) ) + ( ∇_x c(s(θ'), θ') + (1/L_s) ∇_x R(s(θ')) )ᵀ ( s(θ) − s(θ') ) ≤ 0.

Rearranging terms,

(1/L_s) ( ∇_x R(s(θ)) − ∇_x R(s(θ')) )ᵀ ( s(θ) − s(θ') ) ≤ ( ∇_x c(s(θ'), θ') − ∇_x c(s(θ), θ) )ᵀ ( s(θ) − s(θ') ).

It is a known fact that for any κ-strongly convex function R it holds that ( ∇R(x) − ∇R(y) )ᵀ ( x − y ) ≥ κ ||x − y||² for all x, y. This allows lower-bounding the left-hand side by (κ/L_s) ||s(θ) − s(θ')||². If c is linear in x, we have c(x, θ) = xᵀ g(θ) for some appropriate function g. Thus, ∇_x c(x, θ) = g(θ) and ||∇_x c(s(θ'), θ') − ∇_x c(s(θ), θ)|| ≤ L_c ||θ' − θ||, and we can also upper bound the right-hand side using the generalized Cauchy-Schwarz inequality, see also [23, Appendix A.1]:

(κ/L_s) ||s(θ) − s(θ')||² ≤ L_c ||s(θ) − s(θ')|| ||θ − θ'||. (14)

Equivalently,

||s(θ) − s(θ')|| ≤ (L_s L_c / κ) ||θ − θ'||, (15)

which was to be shown.

F.2 Proof of Lemma 2 Proof. By Fubini,

∫∫ ŷ(x̃(θ), θ) p dy dP_x̃ = ∫ ( ∫ ŷ(x̃(θ), θ) dP_x̃ ) p dy.

For brevity, set x̃(θ) := ∫ x̃(θ) dP_x̃. To prove that p(x̃(θ), θ) is Lipschitz-continuous, we will proceed as follows. First, we show that p(x̃(θ), ·) is Lipschitz-continuous. Second, we show that p(·, θ) is Lipschitz-continuous. Third, we show that the Lipschitz-continuity of p(x̃(θ), θ) follows from the first and second result. 1. To show that p(x̃(θ), ·) is Lipschitz-continuous with Lipschitz constant L_x̃ L_p, first observe that this holds if x̃(θ) and p(·, θ) are Lipschitz-continuous with Lipschitz constants L_x̃ and L_p, respectively, since

||p(x̃(θ), ·) − p(x̃(θ'), ·)||_2 ≤ L_p ||x̃(θ) − x̃(θ')||_2 ≤ L_p L_x̃ ||θ − θ'||_2. (16)

The first premise of the above statement holds per assumption. Let us now show that the second premise for the above statement holds.
To show that p(·, θ) is Lipschitz-continuous in x, first recall condition 2, by which we have that p(x, θ) = σ(g(x, θ)) with σ: R → [0, 1] a sigmoid function. Further recall that the prediction function of a classifier p(x, θ) is implicitly given by a loss function ℓ(y, p(x, θ)) as per definition 1. By assumption 1 we inter alia have that ∇_x ℓ(y, p(x, θ)) is Lipschitz-continuous in x for all y and θ. By the chain rule,

∇_x ℓ(y, p(x, θ)) = ∂_p ℓ(y, p(x, θ)) ∇_x p(x, θ). (17)

Now note that we also have by condition 2 that ∂_p ℓ(y, p(x, θ)) is Lipschitz-continuous in x. The Lipschitz-continuity of ∇_x ℓ(y, p(x, θ)) in x per assumption 1 thus implies the Lipschitz-continuity of ∇_x p(x, θ) in x, because the first is a product of the second and another function that is Lipschitz-continuous in x. Recall that X is bounded (definition 2 and condition 2). It is a known fact that any Lipschitz-continuous function is bounded on a bounded domain. We thus conclude that ∇_x p(x, θ) is bounded on the whole domain X. Any differentiable function is Lipschitz-continuous if and only if its gradient is bounded, see [95], for instance. We can thus conclude that p(x, θ) is Lipschitz-continuous in x for all y ∈ Y and θ ∈ Θ. 2. Let us now show that p(x, ·) is Lipschitz-continuous in θ, too. By assumption 2, we have that ∇_θ ℓ(Y, p(x, θ)) is Lipschitz-continuous in θ. With reasoning analogous to 1., it follows that p(x, θ) is Lipschitz-continuous in θ for all y ∈ Y and all x ∈ X. 3. It remains to be proven that the Lipschitz-continuity of p(x̃(θ), θ): X × Θ → [0, 1] follows from those of p(x, ·): Θ → [0, 1] and p(·, θ): X → [0, 1]. To do so, denote by L_x the Lipschitz constant of p(x, ·) and by L_θ the Lipschitz constant of p(·, θ). First note for all θ, θ' ∈ Θ and x, x' ∈ X:

||p(θ, x) − p(θ', x')||_2 = ||p(θ, x) − p(θ, x') + p(θ, x') − p(θ', x')||_2. (18)

By the triangle inequality, we get

||p(θ, x) − p(θ', x')||_2 ≤ ||p(θ, x) − p(θ, x')||_2 + ||p(θ, x') − p(θ', x')||_2. (19)

Exploiting the Lipschitz-continuity of p(x, ·) and p(·, θ) allows us to upper bound this expression by

L_θ ||x − x'||_2 + L_x ||θ − θ'||_2, (20)

which eventually delivers

||p(θ, x) − p(θ', x')||_2 ≤ sup{L_θ, L_x} ( ||x − x'||_2 + ||θ − θ'||_2 ). (21)

We conclude that p(x, θ) is Lipschitz-continuous with Lipschitz constant sup{L_θ, L_x} if p(x, ·) and p(·, θ) are Lipschitz-continuous with Lipschitz constants L_x and L_θ, respectively. The assertion follows from 1., 2., and 3. F.3 Proof of Theorem 1 Proof. We will first prove the Lipschitz-continuity of the greedy sample adaption f: Θ × P × N → P. The strategy of the proof is as follows. We first show that 1. f(θ, ·, ·), 2. f(·, P(Y, X), ·), and 3. f(·, ·, n) are Lipschitz-continuous with Lipschitz constants L_θ, L_P, and L_n, respectively. We then show in 4. that the Lipschitz-continuity of f(θ, P(Y, X), n) follows with Lipschitz constant L = max{L_θ, L_P, L_n}. 1. We prove the Lipschitz-continuity of f(θ, ·, ·) given conditions 1, 2, and 5. To show that f(θ, P(Y, X), n) is Lipschitz-continuous in θ, it is sufficient to show that

f(x̃(θ), θ) = ∫∫ 1(x = x̃(θ)) ŷ(x̃(θ), θ) p dy dP_x̃(x) (22)

is Lipschitz-continuous in θ. First note that by conditions 1 and 5 we can directly infer that x̃(θ) is Lipschitz-continuous through lemma 1.
In the remainder of the proof, the strategy is as follows. We first show that f(x̃, ·) is Lipschitz-continuous in x̃ and then demonstrate the Lipschitz-continuity of f(x̃(θ), θ) in θ. With reasoning analogous to argument 3. in the proof of lemma 2, it then follows that f(x̃(θ), θ) is Lipschitz-continuous on Θ × Θ. (a) We start by showing that the function

f(x̃, ·) = ∫∫ 1(x = x̃) ŷ(x̃, θ) p dy dP_x̃(x) (23)

is Lipschitz-continuous in x̃. Apply Fubini to get

f(x̃, ·) = ∫ ( ∫ 1(x = x̃) ŷ(x̃, θ) dP_x̃(x) ) p dy. (24)

Condition 3 implies that features are drawn according to

X × Θ → [0, 1], (x, θ) ↦ exp(c(x, θ)) / ∫ exp(c(x', θ)) dμ(x'),

see definition 2. That is, ∫ 1(x = x̃) ŷ(x̃, θ) dP_x̃(x) = c̃(x̃, θ) ŷ(x̃(θ), θ), where c̃ denotes the softmax-normalized criterion. Per condition 5, c(x, θ) is linear in x and thus Lipschitz-continuous in x. Further note that the mapping c ↦ c̃ is a softmax function, which is continuously differentiable and thus Lipschitz-continuous. We thus conclude with argument 1. in the proof of lemma 2, see equation 16, that c̃(x, θ) is Lipschitz-continuous in x, since it is a composition of two Lipschitz-continuous functions. Now recall that by conditions 1 and 5 we can infer that x̃(θ) is Lipschitz-continuous through lemma 1. This and condition 2 imply that we can apply lemma 2, which delivers that

f(x̃) = ∫ c̃(x̃, θ) ŷ(x̃(θ), θ) p dy = c̃(x̃, θ) p(x̃, θ) (25)

and that p(x̃, θ) is Lipschitz-continuous in both arguments. Now note that both c̃: X × Θ → [0, 1] and p: X × Θ → [0, 1] are bounded from above by 1. We thus conclude by the triangle inequality that f(x̃) is Lipschitz-continuous in x̃. Explicitly,

||f(x̃) − f(x̃')||_2 = ||c̃(x̃, θ) p(x̃, θ) − c̃(x̃', θ) p(x̃', θ)||_2
≤ ||c̃(x̃, θ) p(x̃, θ) − c̃(x̃, θ) p(x̃', θ)||_2 + ||c̃(x̃, θ) p(x̃', θ) − c̃(x̃', θ) p(x̃', θ)||_2
= ||c̃(x̃, θ) ( p(x̃, θ) − p(x̃', θ) )||_2 + ||p(x̃', θ) ( c̃(x̃, θ) − c̃(x̃', θ) )||_2
≤ ||p(x̃, θ) − p(x̃', θ)||_2 + ||c̃(x̃, θ) − c̃(x̃', θ)||_2
≤ L_p ||x̃ − x̃'||_2 + L_c̃ ||x̃ − x̃'||_2 = (L_p + L_c̃) ||x̃ − x̃'||_2, (26)

where L_p and L_c̃ denote the Lipschitz constants of p and c̃, respectively. (b) To show the Lipschitz-continuity of f(x̃(θ), θ) in θ, first note that the Lipschitz-continuity of f(x̃(θ), ·) in θ directly follows from the facts that 1) x̃(θ) is Lipschitz-continuous, 2) f is Lipschitz-continuous in x̃, and 3) any composition of Lipschitz-continuous functions is Lipschitz-continuous. We have proven 1) and 2) right above. For a proof of 3), see equation 16 in the proof of lemma 2. What remains to be shown is the Lipschitz-continuity of f(·, θ), which translates to the Lipschitz-continuity of

∫∫ 1(x = x̃) ŷ(x̃, θ) p dy dP_x̃(x) (27)

in θ, which in turn translates to the Lipschitz-continuity of the inner integral. Condition 2 and lemma 2 deliver

∫ ŷ(x̃, θ) p dy = ∫ ŷ(x̃(θ), θ) p dy = p(x̃, θ) (28)

with p(x̃, θ) being Lipschitz-continuous in θ. (c) It remains to be shown that f(x̃(θ), θ) is Lipschitz-continuous on Θ × Θ. This follows from reasoning analogous to the proof of lemma 2, resulting in

||f(x̃(θ), θ) − f(x̃(θ'), θ')||_2 ≤ sup{L_θ, L_x̃(θ)} ( ||θ − θ'||_2 + ||x̃(θ) − x̃(θ')||_2 ), (29)

with L_θ and L_x̃(θ) being the Lipschitz constants in θ and x̃(θ), respectively. This concludes the proof that f(θ, P(Y, X), n) is Lipschitz-continuous in θ. 2.
To see that f(·, P(Y, X), ·) is Lipschitz-continuous with respect to the Wasserstein-1-distance on both domain P and codomain P, recall the definition of the Wasserstein-p-distances [108]. Let (X, d) be a metric space, and let p ∈ [1, ∞). For any two probability measures μ, ν on X, the Wasserstein distance of order p between μ and ν is defined by

W_p(μ, ν) = ( inf_{π ∈ Π(μ,ν)} ∫_{X×X} d(x, y)^p dπ(x, y) )^{1/p}. (30)

Π(μ, ν) denotes the set of all joint measures on X × X with marginals μ and ν. For p = 1 and empirical distributions P(Z) and P'(Z') this translates to

W_1( P(Z), P'(Z') ) = min_π { Σ_{i,j} π_{i,j} d(z_i, z'_j) : π_{i,j} ≥ 0, Σ_i π_{i,j} = β_j, Σ_j π_{i,j} = α_i },

with π_{i,j} a joint measure on Z and Z' and α_i and β_j the corresponding marginal measures of Z and Z', respectively. It becomes evident that any marginal change in f(θ, P(Y, X), n) caused by a change in P(Y, X) is essentially n/(n+1):

∂ f(θ, P(Y, X), n) / ∂ P(Y, X) = n / (n + 1), (32)

which analytically follows from

f(θ, P(Y, X), n) = ∫∫ [ 1(x = x̃(θ)) ŷ(x̃(θ), θ) + n P(Y = 1, X = x) ] / (n + 1) p dy dP_x̃(x)
= ∫∫ [ 1(x = x̃(θ)) ŷ(x̃(θ), θ) / (n + 1) ] p dy dP_x̃(x) + n P(Y = 1, X = x) / (n + 1).

The partial derivative in equation 32 is trivially upper-bounded by 1. It is a known fact that differentiable functions are Lipschitz-continuous if and only if the gradient is upper bounded, see [95, page 161], for instance. 3. Choose n, n' ∈ N with n ≠ n' arbitrarily. It is self-evident that for fixed x

| [ 1(x = x̃(θ)) ŷ(x̃(θ), θ) + n P(Y = 1, X = x) ] / (n + 1) − [ 1(x = x̃(θ)) ŷ(x̃(θ), θ) + n' P(Y = 1, X = x) ] / (n' + 1) | ≤ 1. (34)

And thus also for any x

| f(θ, P(Y, X), n) − f(θ, P(Y, X), n') | ≤ 1. (35)

Since supp(Z) = supp(Z') with Z ∼ f(θ, P(Y, X), n) and Z' ∼ f(θ, P(Y, X), n'), we have

W_1( f(θ, P(Y, X), n), f(θ, P(Y, X), n') ) ≤ 1 (36)

as well as |n − n'| ≥ 1, from which the assertion that f(θ, P(Y, X), n) is Lipschitz-continuous in n directly follows. 4. With reasoning analogous to 3. in the proof of lemma 2, we have that the Lipschitz-continuity of f(θ, P(Y, X), n) follows from the Lipschitz-continuity of 1. f(θ, ·, ·), 2. f(·, P(Y, X), ·), and 3. f(·, ·, n). That is,

W_1( f(θ, P, n), f(θ', P', n') ) ≤ max{L_θ, L_P, L_n} ( ||θ − θ'||_2 + W_1(P, P') + ||n − n'||_2 ), (37)

from which the assertion follows with p = 1 for the p-norm. What remains to be proven is the Lipschitz-continuity of the non-greedy sample adaption function f_n: Θ × P → P with f_n(θ, P(Y, X)) = P'(Y, X) induced by

P'(Y = 1, X = x) = ∫∫ [ 1(x = x̃(θ)) ŷ(x̃(θ), θ) + n_0 P(Y = 1, X = x) − 1(x = x̃'(P(Y, X))) ŷ(x̃'(P(Y, X)), θ) ] / n_0 dP_ŷ(y) dP_x̃(x).

The reasoning is completely analogous to the greedy sample adaption function f, see 4. above. In particular, we can directly transfer the proof of 1., f(θ, ·, ·) being Lipschitz-continuous. What remains to be shown is that the Lipschitz-continuity in P ∈ P also holds for f_n. This translates to showing that x̃' is Lipschitz-continuous in P, since we have the Lipschitz-continuity of n_0 P(X, Y) with analogous reasoning as in 2. above. However, note that the Lipschitz constant is not necessarily the same, since the partial derivative of f_n also includes the indirect effect through x̃' and ŷ. To see that x̃' is Lipschitz-continuous, note that for two arbitrary P, P' ∈ P we have that

||x̃'(P) − x̃'(P')||_2 = || ∫ X dP − ∫ X dP' ||_2 = || ∫ (x − x') dρ(x, x') ||_2 ≤ ∫ |x − x'| dρ(x, x'), (38)

where ρ(x, x') is any joint probability measure on X × X.
Now recall that the Wasserstein-1-distance is defined as the infimum of ∫ |x − x'| dρ(x, x') with respect to ρ(x, x'). We conclude that

||x̃'(P) − x̃'(P')||_2 ≤ L_x̃' W_1(P, P') (39)

with L_x̃' a constant. The Lipschitz-continuity of f_n then follows from the Lipschitz-continuity of f_n in θ and the Lipschitz-continuity in P with the argument in 4. F.4 Proof of Theorem 2 Proof. The structure of the proof is analogous to the proof of theorem 1. We show that 1. f(θ, ·, ·), 2. f(·, P(Y, X), ·), and 3. f(·, ·, n) are Lipschitz-continuous with Lipschitz constants L_θ, L_P, and L_n, respectively. We then show in 4. that the Lipschitz-continuity of f(θ, P(Y, X), n) follows with Lipschitz constant L = max{L_θ, L_P, L_n}. Since none of conditions 1 through 5 were required to show 2., 3., and 4. in the proof of theorem 1, we only need to show that 1. also holds under conditions 2, 3, and 4. To show that f(θ, P(Y, X), n) is Lipschitz-continuous in θ, it is sufficient to show that

f(x̃(θ), θ) = ∫∫ 1(x = x̃(θ)) ŷ(x̃(θ), θ) p dy dP_x̃(x) (40)

is Lipschitz-continuous in θ. In the remainder of the proof, the strategy is as follows. We first show that f(x̃, ·) is Lipschitz-continuous in x̃ and then demonstrate the Lipschitz-continuity of f(x̃(θ), θ) in θ. With reasoning analogous to argument 3. in the proof of lemma 2, it then follows that f(x̃(θ), θ) is Lipschitz-continuous on Θ × Θ. We start by showing that the function

f(x̃, ·) = ∫∫ 1(x = x̃) ŷ(x̃, θ) p dy dP_x̃(x) (41)

is Lipschitz-continuous in x̃. Apply Fubini to get

f(x̃, ·) = ∫ ( ∫ 1(x = x̃) ŷ(x̃, θ) dP_x̃(x) ) p dy. (42)

Condition 3 implies that features are drawn according to

X × Θ → [0, 1], (x, θ) ↦ exp(c(x, θ)) / ∫ exp(c(x', θ)) dμ(x'), (43)

see definition 2. That is, ∫ 1(x = x̃) ŷ(x̃, θ) dP_x̃(x) = c̃(x̃, θ) ŷ(x̃(θ), θ), where c̃ denotes the softmax-normalized criterion. Per condition 4, c(x, θ) has bounded gradients with respect to x, which implies that c(x, θ) is Lipschitz-continuous in x. Further note that the mapping c ↦ c̃ is a softmax function, which is continuously differentiable and thus Lipschitz-continuous. We thus conclude with argument 1. in the proof of lemma 2, see equation 16, that c̃(x, θ) is Lipschitz-continuous in x, since it is a composition of two Lipschitz-continuous functions. We now need to verify that ∫ x̃(θ) dP_x̃ is Lipschitz such that we can apply lemma 2. By condition 3 we have that ∫ x̃(θ) dP_x̃ = ∫ x c̃(x, θ) dμ(x), which is Lipschitz-continuous in θ per condition 4. Condition 2 and lemma 2 then directly deliver that

f(x̃) = ∫ c̃(x̃, θ) ŷ(x̃(θ), θ) p dy = c̃(x̃, θ) p(x̃, θ) (44)

with p(x̃, θ) Lipschitz-continuous in both arguments. Now note that both c̃: X × Θ → [0, 1] and p: X × Θ → [0, 1] are bounded from above by 1. We thus conclude by the triangle inequality that f(x̃) is Lipschitz-continuous in x̃, analogous to the proof of theorem 1. Explicitly,

||f(x̃) − f(x̃')||_2 = ||c̃(x̃, θ) p(x̃, θ) − c̃(x̃', θ) p(x̃', θ)||_2
≤ ||c̃(x̃, θ) p(x̃, θ) − c̃(x̃, θ) p(x̃', θ)||_2 + ||c̃(x̃, θ) p(x̃', θ) − c̃(x̃', θ) p(x̃', θ)||_2
= ||c̃(x̃, θ) ( p(x̃, θ) − p(x̃', θ) )||_2 + ||p(x̃', θ) ( c̃(x̃, θ) − c̃(x̃', θ) )||_2
≤ ||p(x̃, θ) − p(x̃', θ)||_2 + ||c̃(x̃, θ) − c̃(x̃', θ)||_2
≤ L_p ||x̃ − x̃'||_2 + L_c̃ ||x̃ − x̃'||_2 = (L_p + L_c̃) ||x̃ − x̃'||_2,

where L_p and L_c̃ denote the Lipschitz constants of p and c̃, respectively.
To show the Lipschitz-continuity of f(x̃(θ), θ) in θ, first note that the Lipschitz-continuity of f(x̃(θ), ·) in θ directly follows from the facts that 1) x̃(θ) is Lipschitz-continuous, 2) f is Lipschitz-continuous in x̃, and 3) any composition of Lipschitz-continuous functions is Lipschitz-continuous. We have proven 1) and 2) right above. For a proof of 3), see equation 16 in the proof of lemma 2. What remains to be shown is the Lipschitz-continuity of f(·, θ), which translates to the Lipschitz-continuity of

∫∫ 1(x = x̃) ŷ(x̃, θ) p dy dP_x̃(x) (46)

in θ, which in turn translates to the Lipschitz-continuity of the inner integral. Condition 2 and lemma 2 (which we can apply, since ∫ x̃(θ) dP_x̃ is Lipschitz, see above) deliver

∫ ŷ(x̃, θ) p dy = ∫ ŷ(x̃(θ), θ) p dy = p(x̃, θ) (47)

with p(x̃, θ) being Lipschitz-continuous in θ. It remains to be shown that f(x̃(θ), θ) is Lipschitz-continuous on Θ × Θ. This follows from reasoning analogous to the proof of lemma 2, resulting in

||f(x̃(θ), θ) − f(x̃(θ'), θ')||_2 ≤ sup{L_θ, L_x̃(θ)} ( ||θ − θ'||_2 + ||x̃(θ) − x̃(θ')||_2 ), (48)

with L_θ and L_x̃(θ) being the Lipschitz constants in θ and x̃(θ), respectively. This concludes the proof that f(θ, P(Y, X), n) is Lipschitz-continuous in θ. The Lipschitz-continuity of f_n(θ, P(Y, X)) directly follows, since the Lipschitz-continuity of x̃'(P) has been shown in the proof of theorem 1. F.5 Proof of Theorem 3 Proof. Choose (θ, P), (θ', P') ∈ Θ × P arbitrarily. Set F(η) := E_{(Y,X)∼f_n(θ,P)} ℓ(Y, X, η) and F'(η) := E_{(Y,X)∼f_n(θ',P')} ℓ(Y, X, η). As integrals over γ-strongly convex functions, both F and F' are γ-strongly convex themselves. Let R_1 := R_1(θ, P) and R'_1 := R_1(θ', P') be the first components of R_n(θ, P) and R_n(θ', P'), respectively. Since, by construction, we know that R_1 is the unique minimizer of F and that R'_1 is the unique minimizer of F', we can conclude that:

F(R_1) − F(R'_1) ≥ (R_1 − R'_1)ᵀ ∇F(R'_1) + (γ/2) ||R_1 − R'_1||²_2 (49)
F(R'_1) − F(R_1) ≥ (γ/2) ||R_1 − R'_1||²_2 (50)

Adding the above inequalities yields

γ ||R_1 − R'_1||²_2 ≤ − (R_1 − R'_1)ᵀ ∇F(R'_1). (51)

Now, consider the function T(x, y) := (R_1 − R'_1)ᵀ ∇_θ ℓ(y, x, R'_1). Due to the Cauchy-Schwarz inequality, we have that

|T(x, y) − T(x', y')| ≤ ||R_1 − R'_1||_2 ||∇_θ ℓ(y, x, R'_1) − ∇_θ ℓ(y', x', R'_1)||_2. (52)

As ℓ is β-jointly smooth, we have that

||∇_θ ℓ(y, x, R'_1) − ∇_θ ℓ(y', x', R'_1)||_2 ≤ β ||(x, y) − (x', y')||_2. (53)

Thus, together, the latter two inequalities imply

|T(x, y) − T(x', y')| ≤ ||R_1 − R'_1||_2 β ||(x, y) − (x', y')||_2, (54)

showing that T is ||R_1 − R'_1||_2 β-Lipschitz. This implies that

T̃ := ( ||R_1 − R'_1||_2 β )^{−1} T (55)

is 1-Lipschitz. As, due to theorem 1, the non-greedy sample adaption function f_n is Lipschitz with respect to W_1 and the p-norm for some constant L, we can use the dual characterization of the Wasserstein metric, i.e., the Kantorovich-Rubinstein lemma [43], to obtain

| E_{(Y,X)∼f_n(θ,P)} ( T̃(Y, X) ) − E_{(Y,X)∼f_n(θ',P')} ( T̃(Y, X) ) | ≤ L ( ||θ − θ'||_2 + W_1(P, P') ). (56)

We compute:

E_{(Y,X)∼f_n(θ,P)} ( T̃(Y, X) ) = E_{(Y,X)∼f_n(θ,P)} ( T(Y, X) ) / ( ||R_1 − R'_1||_2 β ) = (R_1 − R'_1)ᵀ ∇F(R'_1) / ( ||R_1 − R'_1||_2 β ). (57)

Here, we used that (R_1 − R'_1)ᵀ is a constant with respect to the measure f_n(θ, P) and that the order of integration and differentiation can be exchanged according to Lebesgue's dominated convergence theorem.
Analogously, we obtain

E_{(Y,X)∼f_n(θ',P')} ( T̃(Y, X) ) = E_{(Y,X)∼f_n(θ',P')} ( T(Y, X) ) / ( ||R_1 − R'_1||_2 β ) = (R_1 − R'_1)ᵀ ∇F'(R'_1) / ( ||R_1 − R'_1||_2 β ). (58)

Together, this yields

| (R_1 − R'_1)ᵀ ∇F(R'_1) − (R_1 − R'_1)ᵀ ∇F'(R'_1) | ≤ L β ||R_1 − R'_1||_2 ( ||θ − θ'||_2 + W_1(P, P') ). (59)

As R'_1 is the unique minimizer of F', we conclude that the second product on the left-hand side of this inequality is larger than or equal to 0. Thus, the inequality reduces to

− (R_1 − R'_1)ᵀ ∇F(R'_1) ≤ L β ||R_1 − R'_1||_2 ( ||θ − θ'||_2 + W_1(P, P') ), (60)

which, together with equation (51), yields

γ ||R_1 − R'_1||²_2 ≤ L β ||R_1 − R'_1||_2 ( ||θ − θ'||_2 + W_1(P, P') ), (61)

which proves after some rearranging that R_1: Θ × P → Θ is Lipschitz-continuous with Lipschitz constant Lβ/γ. Precisely, we get

||R_1 − R'_1||_2 ≤ (Lβ/γ) ( ||θ − θ'||_2 + W_1(P, P') ). (62)

We further have per theorems 1 and 2 that f_n: Θ × P → P is Lipschitz-continuous with constant L, as used above. It can easily be verified (see definitions 7 and 5) that the second component R_2 := R_2(θ, P) of R(θ, P) equates f_n. Thus,

||R_2 − R'_2|| ≤ L ( ||θ − θ'||_2 + W_1(P, P') ) (63)

with R'_2 := R_2(θ', P') and arbitrary θ, θ' ∈ Θ and arbitrary P, P' ∈ P. We can conclude that R_n is Lipschitz-continuous with Lipschitz constant L(1 + β/γ) and the sum-metric on Θ × P by adding the two Lipschitz inequalities, yielding

||R_1 − R'_1||_2 + ||R_2 − R'_2||_2 ≤ (Lβ/γ) ( ||θ − θ'||_2 + W_1(P, P') ) + L ( ||θ − θ'||_2 + W_1(P, P') ), (64)

||R_n − R'_n|| ≤ L (1 + β/γ) ( ||θ − θ'||_2 + W_1(P, P') ). (65)

It remains to be shown that, assuming L < (1 + β/γ)^{−1}, the sequence (R_n(θ_t, P_t))_{t ∈ N} converges to a fixed point at a linear rate. The existence and uniqueness of a fixed point follows from Banach's fixed-point theorem [4], since equation 65 guarantees that the map R_n is a contraction for L < (1 + β/γ)^{−1} on a complete metric space. So, let (θ_c, P_c) denote such a fixed point. Observe that it holds per equation 65 for all t ∈ N that

||(θ_t, P_t) − (θ_c, P_c)|| ≤ L (1 + β/γ) ||(θ_{t−1}, P_{t−1}) − (θ_c, P_c)||. (66)

Repeatedly applying this yields

||(θ_t, P_t) − (θ_c, P_c)|| ≤ L^t (1 + β/γ)^t ||(θ_0, P_0) − (θ_c, P_c)||. (67)

Setting the expression on the right-hand side to be at most Δ gives

||(θ_t, P_t) − (θ_c, P_c)|| ≤ L^t (1 + β/γ)^t ||(θ_0, P_0) − (θ_c, P_c)|| ≤ Δ. (68)

Taking logarithms and rearranging gives

log ||(θ_t, P_t) − (θ_c, P_c)|| ≤ t log( L (1 + β/γ) ) + log ||(θ_0, P_0) − (θ_c, P_c)|| ≤ log Δ. (69)

Rearranging for t and exploiting that L(1 + β/γ) < 1 yields

t ≥ log( ||(θ_0, P_0) − (θ_c, P_c)|| / Δ ) ( − log( L (1 + β/γ) ) )^{−1}. (70)

Since ||(θ_0, P_0) − (θ_c, P_c)|| and L(1 + β/γ) are fixed quantities, we have that

||(θ_t, P_t) − (θ_c, P_c)|| ≤ Δ (71)

if t ≥ log( ||(θ_0, P_0) − (θ_c, P_c)|| / Δ ) ( − log( L (1 + β/γ) ) )^{−1}, which proves the linear rate of convergence. Note that the proof was mainly an application of the Banach fixed-point theorem [4] and as such similar to the proof of [74, theorem 3.5]. The key differences to [74, theorem 3.5] are: (1) We need to prove the Lipschitz-continuity of the sample adaption function f first, see theorem 1, which is non-trivial for several instances of reciprocal learning. (2) [74, theorem 3.5] considers simple repeated risk minimization, i.e., a mapping Θ → Θ, while reciprocal learning is R: Θ × P × N → Θ × P × N or R_n: Θ × P → Θ × P (definitions 6 and 7). F.6 Proof of Theorem 4 Proof. Recall that R*_n = (θ*, P*) = argmin_{θ,P} E_{(Y,X)∼f_n(θ,P)} ℓ(Y, X, θ) per definition 9 and θ_c = argmin_θ E_{(Y,X)∼f_n(θ_c,P_c)} ℓ(Y, X, θ) per definition 8.
First assume that θ_c ≠ θ*, since otherwise the statement would be trivial due to L > 0, L_ℓ > 0, γ > 0 per assumptions. We will now prove the statement ||θ_c − θ*|| ≤ 2L_ℓL/γ by contradiction. Thus, assume that ||θ_c − θ*|| > 2L_ℓL/γ. First observe that E_{(Y,X)∼f_n(θ_c,P_c)} ℓ(Y, X, θ) is (L L_ℓ)-Lipschitz in θ for fixed P_c and θ_c, since the loss is L_ℓ-Lipschitz and f_n is L-Lipschitz in θ, since it is L_θ-Lipschitz in θ (see 1. in the proof of theorem 1) and L = max{L_θ, L_P, L_n}. That is,

| E_{(Y,X)∼f_n(θ_c,P_c)} ℓ(Y, X, θ_c) − E_{(Y,X)∼f_n(θ_c,P_c)} ℓ(Y, X, θ*) | ≤ L_ℓ L ||θ* − θ_c||. (72)

Further note that

E_{(Y,X)∼f_n(θ_c,P_c)} ℓ(Y, X, θ*) − E_{(Y,X)∼f_n(θ_c,P_c)} ℓ(Y, X, θ_c) ≥ (γ/2) ||θ* − θ_c||², (73)

which holds because we can state due to assumption 3 (strong convexity) that

E_{(Y,X)∼f_n(θ_c,P_c)} [ ℓ(Y, X, θ*) − ℓ(Y, X, θ_c) ] ≥ E_{(Y,X)∼f_n(θ_c,P_c)} [ ∇_θ ℓ(Y, X, θ_c)ᵀ (θ* − θ_c) ] + (γ/2) ||θ* − θ_c||²

by taking expectations on the inequality stated in assumption 3 (strong convexity). By classical first-order optimality conditions [44] we know that the first term on the right-hand side is larger than or equal to 0, from which equation 73 directly follows, see also [74, Theorem 4.3]. If now ||θ_c − θ*|| > 2L_ℓL/γ, or equivalently (γ/2) ||θ_c − θ*||² > L_ℓ L ||θ_c − θ*|| as we assumed, we have per equations 72 and 73 that

E_{(Y,X)∼f_n(θ_c,P_c)} ℓ(Y, X, θ_c) − E_{(Y,X)∼f_n(θ_c,P_c)} ℓ(Y, X, θ*) < E_{(Y,X)∼f_n(θ_c,P_c)} ℓ(Y, X, θ*) − E_{(Y,X)∼f_n(θ_c,P_c)} ℓ(Y, X, θ_c), (75)

which would imply

E_{(Y,X)∼f_n(θ_c,P_c)} ℓ(Y, X, θ*) > E_{(Y,X)∼f_n(θ_c,P_c)} ℓ(Y, X, θ_c), (76)

which contradicts definition 9 and the non-negativity of the loss function. F.7 Proof of Theorem 5 Proof. By the Cauchy criterion for the sequence R_t, t ∈ N, with respect to the sum or product norm on Θ × P × N. According to the Cauchy criterion, the sequence R_t diverges if there is an ε such that for all t ∈ N there exist m, n ≥ t with d_p(R_m − R_n) = d_p( (θ_m, P_m, n_m) − (θ_n, P_n, n_n) ) > ε, with d_p the sum norm and m ≠ n; m, n ∈ N. This holds for ε ∈ (0, 1), since ||n_m − n_n|| ≥ 1. F.8 Proof of Theorem 6 Proof. By counterexample. Assume f_n: Θ × P → P is not Lipschitz-continuous, i.e., no L < ∞ exists such that W_1( f_n(θ, P), f_n(θ', P') ) ≤ L ||( ||θ − θ'||_2, W_1(P, P') )||_p. Let again R_{n,1} := R_{n,1}(θ, P) and R'_{n,1} := R_{n,1}(θ', P') be the first components of R(θ, P) and R(θ', P'), respectively. Further assume that R_{n,1} = C + θL, C ∈ R. It becomes evident that any fixed point θ_c must satisfy θ_c = C/(1 − L). If L → ∞, this fixed point does not exist. NeurIPS Paper Checklist 1. Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Answer: [Yes] Justification: Abstract and section 1 (introduction) of the paper accurately reflect the paper's contribution and scope, covering all results presented in the paper. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: The limitations of the work are explicitly discussed in sections 1 and 6 (subsection "Limitations") of the main paper as well as mentioned when stating the main results in sections 3 and 4. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: The theoretical results of the paper are stated in Theorems 1, 2, 3, 4, 5, and 6 as well as in Lemmata 1 and 2. For each of these theorems and lemmata, full sets of assumptions and conditions are provided. Moreover, complete proofs for all these statements are provided in the paper's appendix. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: The paper does include two experiments in section C. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: See section C. Guidelines: The answer NA means that the paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer.
Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: See section C.

Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in the appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: See section C.

Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed-form formula, call to a library function, bootstrap, etc.). The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates). If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
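As a purely illustrative aside (not part of the paper, and with hypothetical numbers), the following minimal sketch shows the distinction the guidelines draw between the standard deviation across runs and the standard error of the reported mean, together with a bootstrap confidence interval as a distribution-free alternative:

    # Hypothetical sketch: reporting error bars over repeated runs.
    # Distinguishes the standard deviation (spread across runs) from the
    # standard error of the mean (uncertainty of the reported average),
    # and adds a bootstrap 95% CI that avoids the Normality assumption.
    import numpy as np

    rng = np.random.default_rng(0)
    scores = rng.normal(loc=0.85, scale=0.03, size=10)  # e.g., test accuracies of 10 runs

    mean = scores.mean()
    sd = scores.std(ddof=1)            # 1-sigma standard deviation across runs
    sem = sd / np.sqrt(len(scores))    # standard error of the mean

    # Bootstrap: resample runs with replacement and recompute the mean
    boot_means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(10_000)
    ])
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

    print(f"mean = {mean:.3f}, SD = {sd:.3f}, SEM = {sem:.3f}")
    print(f"bootstrap 95% CI of the mean: [{ci_low:.3f}, {ci_high:.3f}]")

Whichever quantity is reported (SD or SEM), and whichever factor of variability it captures, should be stated explicitly in the text, as the guidelines above request.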
8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

Answer: [Yes]

Justification: The research conducted in the paper conforms, in every respect, with the NeurIPS Code of Ethics.

Guidelines: The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [Yes]

Justification: We do not foresee direct negative societal impact from the current work. Positive societal impacts, in the form of making decisions based on reciprocal learning algorithms such as active learning or self-training more reliable and trustworthy, are discussed in sections 1, 2, and 6 of the paper.

Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: The paper poses no such risks.

Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification: The paper does not release new assets.

Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.