Learning from Streaming Data when Users Choose

Jinyan Su 1  Sarah Dean 1

1 Department of Computer Science, Cornell University. Correspondence to: Jinyan Su, Sarah Dean.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

In digital markets composed of many competing services, each user chooses between multiple service providers according to their preferences, and the chosen service makes use of the user's data to incrementally improve its model. The service providers' models influence which service the user will choose at the next time step, and the user's choice, in return, influences the model update, leading to a feedback loop. In this paper, we formalize the above dynamics and develop a simple and efficient decentralized algorithm to locally minimize the overall user loss. Theoretically, we show that our algorithm asymptotically converges to stationary points of the overall loss almost surely. We also experimentally demonstrate the utility of our algorithm with real-world data.

1. Introduction

Online services, ranging from social media and music streaming to chatbots and search engines, collect user data in real time to make small adjustments to the models they use to serve and personalize content. Such digital platforms must contend with the demand for instant action, handle continuous data streams, and update their models in an incremental manner. For example, music streaming services continuously process new feedback from user interaction data to refine and enhance their personalized playlist and recommendation models (Prey, 2016; Eriksson et al., 2019; Anderson, 2013; Morris, 2015; Webster, 2023). Search engines analyze user queries and click-through rates to personalize future search suggestions (Yoganarasimhan, 2020; Bi et al., 2021; Ustinovskiy & Serdyukov, 2013). Chatbots learn from user interactions to provide more accurate and context-aware responses over time (Clarizia et al., 2019; Shumanov & Johnson, 2021; Ma et al., 2021).

Moreover, due to the data-driven nature of digital platforms, interesting dynamics emerge among users and service providers: on the one hand, users choose amongst providers based on the quality of their services; on the other hand, providers use the user data to improve and update their services, affecting future user choices (Ginart et al., 2021; Kwon et al., 2022; Dean et al., 2024; Jagadeesan et al., 2023a). For example, a user chooses amongst different music streaming platforms based on how well they meet the user's needs. Data from the user's interactions with the platform (such as the music the user searches for, saves, or skips) can be used to update its recommendation model in order to better predict users' listening habits and create personalized playlists. The newly updated model affects how well the platform will meet a new user's needs, impacting future user choices.

In this paper, we study the dynamics of such interactions between users who choose and services which update their models. Our focus is on streaming data, meaning that data points arrive sequentially; uncoordinated services, meaning that data is not shared; and imperfectly rational users, who do not always select the best performing service.
In particular, we study a range of imperfect user behaviors to account for the fact that while users prefer better performing services, they might make mistakes or have limited information. The degree of imperfection is characterized by the parameter ζ: with probability ζ, users choose amongst service providers uniformly at random, while with probability 1 − ζ, they choose the best performing model (i.e. the one with the lowest loss). This setting is challenging due to the fact that services have only limited information: they observe only the data points of users who choose them. Indeed, each service must contend with data sampled from a highly non-stationary distribution. Despite these challenges, we propose a simple decentralized algorithm, Multi-learner Streaming Gradient Descent (MSGD), and show that it converges to fixed points with desirable properties. Our analysis rests on the observation that user sub-populations are naturally induced by the selection between models. Of course, these sub-populations are highly non-stationary and evolve in feedback with model updates.

Prior work shows that the coupled evolution of users and models gives rise to nonlinear dynamics with multiple equilibria. Ginart et al. (2021) and Dean et al. (2024) demonstrate, empirically and theoretically, that services will specialize when they repeatedly retrain their models on distributions induced by user selection dynamics. However, this prior work does not adequately handle the realistic setting in which user data is sampled from the population and arrives in a streaming manner. When services only observe data from a single user at each time step, they have only partial information about the loss, so model updates cannot guarantee monotonic improvements in performance. Our key insight is to connect streaming data and user choice to induced sub-populations. This allows us to analyze our algorithm with tools originally developed in the context of stochastic gradient descent.

In summary, our contributions are: (1) We formalize the dynamics of streaming users choosing amongst multiple service providers using the notion of induced sub-populations. By considering imperfectly rational users and streaming data, our setting is general and practical. (2) We provide Multi-learner Streaming Gradient Descent (MSGD), an intuitive and efficient algorithm in which service providers simply perform a step of gradient descent with the single user loss when they are chosen. (3) We theoretically prove that the proposed algorithm converges to the local optima of the overall loss function, which quantifies the social welfare of users. We empirically support our result with experiments on real data. (Code can be found at https://github.com/sdean-group/MSGD.)

2. Related Work

We discuss three strands of related work. First, learning from non-stationary distributions has roots in concept drift, which studies the problem of learning when the target distribution drifts over time (Bartlett, 1992; Bartlett et al., 2000; Kuh et al., 1990; Gama et al., 2014). For arbitrary sources of shift, it is difficult to create a unified objective (Gama et al., 2014; Webb et al., 2016). Performative prediction (Perdomo et al., 2020) simplifies the problem by assuming the distribution is induced by the deployed model. Most closely related to our setting is multi-player performative prediction (Li et al., 2022; Narang et al., 2023; Piliouras & Yu, 2023).
However, the shifts considered in this literature do not adequately model the partition distributions induced by (boundedly) rational user choice. Second, learning when users choose has been studied from several perspectives, including opting out (Hashimoto et al., 2018; Zhang et al., 2019), data consent (Kwon et al., 2022; James et al., 2023), competition between strategic learners (Ben-Porat & Tennenholtz, 2017; 2019; Jagadeesan et al., 2023a;b), and strategic users (Shekhtman & Dean, 2024). Most closely related to our work are papers by Ginart et al. (2021); Dean et al. (2024); Bose et al. (2023), who characterize the specialization that results from the combination of user choice and model retraining. These works form a conceptual foundation for our paper, but they do not provide insight into the streaming data setting that we study.

Third and lastly, learning from streaming samples has been studied extensively, with many algorithms based on gradient descent (Yang et al., 2021; Wood et al., 2021). Of particular relevance to our analysis is work on stochastic gradient descent for nonconvex functions (Arous et al., 2021; Li & Orabona, 2019; Cutkosky & Orabona, 2019), distributed stochastic gradient descent (Swenson et al., 2022; Cutkosky & Busa-Fekete, 2018), and in particular works connecting this perspective to clustering algorithms (So et al., 2022; Tang & Monteleoni, 2017; Cohen-Addad et al., 2021; Liberty et al., 2016; Tang & Monteleoni, 2016), which share a similar structure to the specialization that results from MSGD.

3. Problem setting

In this section, we formalize the interaction dynamics between users and service providers.

Notation. Let P be the data distribution on the user data space X ⊆ R^d. Let x ∈ R^d be a data point of d dimensions. We write x ∼ P when data is drawn from distribution P. Without loss of generality, we assume that P has a density. Given a set S ⊆ X with positive probability mass P(S) > 0, we denote the distribution obtained by restricting P onto S as P|_S. Denote the tuple of the k service providers' models by Θ = (θ_1, …, θ_k). For the sake of presentation, we slightly abuse notation and refer to the model parameter θ_i as service provider i's model. Given a loss function ℓ, the loss of model θ for a user with data x is ℓ(x, θ). The loss measures the performance of model θ for user x, with low loss corresponding to good performance. For example, user data x = (z, y) may contain both features z and a label y, and θ may parameterize a model which predicts the label from the features, e.g. θ^⊤ z. For a regression problem with squared loss, we would have ℓ((z, y), θ) = (θ^⊤ z − y)². In a classification setting, we could similarly define ℓ as the logistic loss.

3.1. User-Service Interaction Dynamics

Streaming Data from Users. Data comes from users sequentially: at each time step t, a user x_t ∼ P selects among service providers according to their preferences, and commits their data to the selected provider. Our notion of data and user is rather general: at one end of the spectrum, all data could come from a single individual, who distributes specific tasks among different service providers. At the other end, each data point could come from a different individual within a larger population. User preferences for the different models can be evaluated by the loss functions (ℓ(x_t, θ_1^t), …, ℓ(x_t, θ_k^t)).
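To make this concrete, the following minimal sketch (our illustration, not the paper's released code; the dimensions and randomly drawn models are hypothetical) evaluates the vector of losses that an arriving user sees under the squared loss described above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3                                      # feature dimension, number of services
Theta = [rng.normal(size=d) for _ in range(k)]   # models theta_1, ..., theta_k

def squared_loss(x, theta):
    """l((z, y), theta) = (theta^T z - y)^2 for a single user."""
    z, y = x
    return (theta @ z - y) ** 2

x_t = (rng.normal(size=d), 1.0)                  # one arriving user x_t ~ P
losses = [squared_loss(x_t, theta) for theta in Theta]   # (l(x_t, theta_j^t))_j
```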
If users were perfectly rational and had perfect information, they would always choose the model with the lowest loss. However, considering that humans may not have full information when making decisions and that they sometimes make mistakes, we study a more general setting where users may have only bounded rationality. Instead of always choosing the model with the lowest loss, we allow users to make mistakes: with probability ζ, they choose randomly among all the models, while with probability 1 − ζ, they choose the model that suits them best. This randomness captures the fact that, unlike algorithms and computers, humans are not always stable when making decisions. They may choose randomly because they don't know how to make a choice (due to limited information) or because they don't bother to do the optimization (they just don't care much about which service provider to choose). We refer to this mode of user behavior as a "no preference" user, describing a user "who has trouble thinking straight or taking care for the future but who at the same time is actuated by a concern with being fair to other people" (Posner, 1997). When users select models according to their preference, we call them a perfectly rational user, who "chooses the best means to the chooser's ends" (Posner, 1997). We remark that there are other user behavior models in the literature, such as the Boltzmann-rational model (Ziebart et al., 2010; Luce, 2005; 1977), where users choose model i with probability proportional to e^{−α ℓ(x, θ_i)}, and α controls the user's rationality. Though not in the scope of our present results, we provide further discussion of this behavior model in Section 4.3, as it could be of interest for future work.

Model Updates by Services. At each time step t, once the user makes a choice, only the chosen service provider i receives the data x_t. In general, services have no information about the user population P other than the data points of users who selected them. This differs from the usual streaming setting, where models see every sampled data point and user choice plays no role. It is also distinct from the setting in which models receive more than a single sample and can estimate the full distribution of users choosing them. Unlike the usual streaming setting, services cannot repeatedly sample from the same distribution because users' choices change in feedback. Indeed, services only observe a single sample from time-varying distributions, which is not enough to estimate the distribution. As a result of these challenges, it is natural to consider a streaming algorithm that immediately and incrementally updates models based on each observed data point. We introduce such an algorithm in Section 4. Once the selected service updates its model based on the data it receives, the same user-service dynamics repeat at the next time step t + 1. The new data point x_{t+1} arrives and is assigned to a service provider based on user choice over the new models Θ^{t+1} = (θ_1^{t+1}, …, θ_k^{t+1}).

3.2. Learning Objective

Given the interaction between model quality and user choice, what is the right learning objective? In the traditional setting, machine learning aims to minimize the expected loss over the entire population: E_{x∼P}[ℓ(x, θ)]. However, this fails to account for the fact that users tend to choose the best performing model. Motivated by the goal of providing users with the highest quality services, we aim to minimize the average loss experienced by users, which we refer to as the overall loss function.
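In code, the user behaviors described in Section 3.1 amount to a few lines. The sketch below (our illustration, not from the released code) implements both the bounded-rational rule analyzed in this paper and the Boltzmann-rational rule discussed further in Section 4.3; the overall loss function defined next averages the loss over exactly this kind of randomized choice.

```python
import numpy as np

def bounded_rational_choice(losses, zeta, rng):
    """W.p. zeta pick uniformly (no-preference user); otherwise argmin."""
    if rng.random() < zeta:
        return int(rng.integers(len(losses)))
    return int(np.argmin(losses))

def boltzmann_choice(losses, alpha, rng):
    """P(i) proportional to exp(-alpha * l(x, theta_i)); alpha = 0 is uniform,
    and alpha -> infinity recovers the perfectly rational argmin user."""
    logits = -alpha * np.asarray(losses, dtype=float)
    logits -= logits.max()                 # subtract max for numerical stability
    p = np.exp(logits)
    return int(rng.choice(len(losses), p=p / p.sum()))
```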
This objective can be understood as minimizing the loss on the distribution induced by the parameters, similar to the notion of performative optimality introduced by Perdomo et al. (2020). For ease of presenting the overall loss function, we now introduce the following notation related to the subpopulations induced by Θ: X(Θ) = (X_1(Θ), X_2(Θ), …, X_k(Θ)) is the data partitioning of X induced by Θ, where X_i(Θ) is the set {x : i ∈ arg min_{j∈[k]} ℓ(x, θ_j)}. Let a(Θ) = (a_1(Θ), …, a_k(Θ)) be the proportions of the population P contained in each X_i(Θ). Naturally, we have ∑_{i=1}^k a_i(Θ) = 1. Finally, D(Θ) = (D_1(Θ), …, D_k(Θ)) is the collection of distributions within each partition, where each D_i(Θ) is obtained by restricting P onto the partition X_i(Θ), i.e., D_i(Θ) = P|_{X_i(Θ)}. Note that for fixed Θ, X(Θ), a(Θ) and D(Θ) are all fixed. To make the models identifiable, and to ensure the above partition and distribution notations are well defined, we additionally assume that the service providers do not have exactly the same model.

Assumption 1. θ_i ≠ θ_j for all i, j ∈ [k] with i ≠ j.

This assumption says that the service providers have different parameters, which is realistic considering that the service providers don't share parameters. Using this notation of induced distributions, we can formally define an objective function that prioritizes user experience. We will start by defining the learning objective under rational user behavior and then extend it to the bounded rationality setting.

Perfect Rationality. For perfectly rational users and a fixed Θ, model i is faced with data sampled as x ∼ D_i(Θ). Namely, x is sampled from a distribution supported on X_i(Θ), where all x ∈ X_i(Θ) prefer model θ_i over all the other models θ_j for j ∈ [k] with j ≠ i. The set X_i(Θ) can be understood as a subpopulation naturally induced by the models Θ and user preference; that is, users choosing the same service provider make up a single subpopulation. The expected loss of service provider i over the induced subpopulation can be written as E_{x∼D_i(Θ)}[ℓ(x, θ_i)]. In order to write the overall loss function, we need to consider all models. Since a_i(Θ) is the portion of the subpopulation X_i(Θ) within the total population P, the overall learning objective under perfect rationality can be written as

∑_{i=1}^k a_i(Θ) E_{x∼D_i(Θ)}[ℓ(x, θ_i)].  (1)

This expression can also be understood as the expected loss over users x sampled from the population P and models θ sampled according to rational user choice. From this perspective, the expression in (1) is the result of applying the tower property of expectation and conditioning on the model choice being i. Now we extend this loss function to bounded rationality.

Bounded Rationality. A perfectly rational user would deterministically choose the service θ_i that minimizes their loss ℓ(x, θ). However, due to the complexity and uncertainty of real-world decision making, such as limited knowledge, resources, and time, users don't always act to maximize their utility (Selten, 1990; Jones, 1999; Gigerenzer & Selten, 2002). As a result, they might just randomly choose a service provider due to limited time or imperfect information. We refer to this as a user with bounded rationality. We introduce the parameter ζ to characterize this user behavior: with probability 1 − ζ, users rationally choose the model θ_i which minimizes their loss, and with probability ζ, they choose uniformly at random amongst all the models Θ = (θ_1, …, θ_k). Conditioned on this latter event, the probability that the user selects model i is 1/k for all i ∈ [k].
Conditioned on a perfectly rational choice, the probability that the user selects model i is zero unless x ∈ X_i(Θ). Combining these two events, the expected loss for users sampled from population P and models sampled according to the user choice is

f(Θ) = E_{x∼P}[ ∑_{i=1}^k ( (1 − ζ) 1_i(x, Θ) + ζ/k ) ℓ(x, θ_i) ],  (2)

where we define 1_i(x, Θ) = 1{x ∈ X_i(Θ)}. This expected loss defines the learning objective in the bounded rationality setting. Compared to perfect rationality, the learning objective under bounded rationality accounts for the fact that users may not always select the best service for them.

4. Multi-learner Streaming Gradient Descent

In this section, we investigate important properties of the learning objective and then present an algorithm for the multi-learner setting which works for both perfectly and boundedly rational users.

4.1. Properties of Learning Objective

We begin by deriving some important properties of the learning objective f(Θ), laying the groundwork for the subsequent sections. Formal proofs of these properties are provided in Appendix B.

Property 1: Decomposability. With some simple algebraic manipulation of f(Θ), we can decompose f(Θ) as f(Θ) = (1 − ζ) f^PR(Θ) + ζ f^NP(Θ), where f^NP(Θ) = (1/k) ∑_{i=1}^k E_{x∼P}[ℓ(x, θ_i)] represents the loss function when users have no preference over the service providers. When ζ = 0, the learning objective reduces to the perfect rationality objective f^PR(Θ); when ζ = 1, it reduces to the no-preference objective f^NP(Θ).

Property 2: Boundedness. f(Θ) is lower bounded by f^PR(Θ) and upper bounded by f^NP(Θ), i.e., f^PR(Θ) ≤ f(Θ) ≤ f^NP(Θ).

We now make a set of common assumptions on the loss as a function of the parameters, ℓ(x, ·).

Assumption 2. Assume that for all x ∈ X, the loss function ℓ(x, ·) is non-negative, convex, differentiable, L-Lipschitz, and β-smooth.

Property 3: Non-convexity. When ℓ(x, θ) is convex in θ, f^NP(Θ) is convex in Θ; but since f^PR(Θ) is generally non-convex, f(Θ) is also non-convex when ζ < 1.

In Figure 1, we give a simple example in which the loss f^PR(Θ) is non-convex. In this example, we let P be the uniform distribution over the interval [0, 1], the number of services k = 2, and the loss function ℓ(x, θ) = (x − θ)². Full details are provided in Appendix B.2.

Figure 1. Example of f^PR(Θ) being non-convex.

Algorithm 1 Multi-learner Streaming Gradient Descent (MSGD)
Input: rationality parameter ζ; loss function ℓ(·,·) ≥ 0; initial models Θ^0 = (θ_1^0, …, θ_k^0); learning rates {η^t}_{t=1}^{T+1}.
for t = 0, 1, 2, …, T do
  Sample data point x ∼ P.
  User side (user selects a service provider): w.p. 1 − ζ the user selects the best model i = arg min_{j∈[k]} ℓ(x, θ_j^t); otherwise, w.p. ζ, the user selects i from {1, …, k} uniformly at random.
  Learner side (selected learner updates its model): the selected model i receives the data and performs a gradient step θ_i^{t+1} = θ_i^t − η^{t+1} ∇ℓ(x, θ_i^t), while all other models remain the same, θ_j^{t+1} = θ_j^t for all j ≠ i.
end for
Return Θ^T

Since f(Θ) is generally non-convex, we aim to design algorithms whose outputs converge to local optima, i.e. the stationary points of the learning objective f(Θ).

Definition 4.1. (Stationary Points) The stationary points of a differentiable function f(Θ) are {Θ : ∇f(Θ) = 0}.

Remark 4.2. While stationary points are local minima of the f(Θ) objective, they are global minima for a decoupled objective where the effect of Θ on the partition is not accounted for.
As we will show in the proof of Lemma 4.3, the stationary points contain the points where each model is at the global optimum for the induced distribution of users that it sees.

4.2. Multi-Learner Streaming Gradient Descent

We propose an intuitive and simple algorithm, Multi-Learner Streaming Gradient Descent (MSGD), presented in Algorithm 1. We write the algorithm to reflect the setting (Section 3.1) from both the user side and the learner side. On the user side, a user comes to the digital platform at each time step. The user selects a model amongst the service providers according to their preference and rationality. The user shares their data only with the selected service. We emphasize that these steps reflect user behavior, which is not under the control of service providers or algorithm designers. On the learner side, one service will receive data; this service computes the gradient of the loss for this single user and then performs a gradient descent step. Because the services are not coordinated, the parameters of the models that were not selected remain the same. We denote the model parameter tuple at time t as Θ^t = (θ_1^t, …, θ_k^t). At each time step, the selected model i updates with a gradient step θ_i^{t+1} ← θ_i^t − η^{t+1} ∇ℓ(x, θ_i^t). In the next time step t + 1, a new user will arrive and decide which model to select based on the updated model parameters Θ^{t+1}.

MSGD is practical due to three main advantages. First, it is computationally affordable, memory efficient, and privacy-conscious. At each time step, when a data point arrives, the service only has to perform a lightweight gradient descent step, which enables services to adapt quickly to incoming information without using extensive computational resources. Moreover, no extra storage is needed to retain past user data, which may also address privacy concerns. Second, MSGD is amenable to the partial information setting: services do not need to know anything other than the data they receive. Third, MSGD handles non-stationary user distributions: user preferences update along with model parameters. Despite the overall user population being constant, the subpopulation that will choose a specific model evolves over time. To enhance user experience, service providers optimize over the population that chooses them, i.e., D_i(Θ), which is not a static distribution but a function of Θ.

Although the gradient update on the learner side of Algorithm 1 is intuitively simple, it is not straightforward to see whether Algorithm 1 will perform well with respect to the overall objective f(Θ). One challenge arises from the fact that f(Θ) is non-convex. Even focusing on convergence to local optima (Definition 4.1) leaves several additional challenges. First, notice that instead of updating all the learners at the same time with batched data, in MSGD, learners are updated asynchronously, one at a time depending on user choice, fostering the potential for competition amongst learners. Indeed, an update to model i can affect the distribution of users selecting any model j ≠ i, meaning that each model must contend with a highly non-stationary distribution. Moreover, since the update uses the gradient of a single data point, rather than the gradient of the objective f(Θ), we cannot guarantee that the update will decrease f(Θ) (which would imply convergence to a local optimum). It might be natural to consider something like the learner's expected loss E_{x∼D_i(Θ)}[ℓ(x, θ_i)], which would measure the local performance of each model.
However, due to the streaming nature of the update, it is still not possible to show that this learner loss will decrease. In other words, the streaming update step in Algorithm 1 is not risk reducing, a property that Dean et al. (2024) leveraged to prove convergence in the full information setting. In Section 5, we theoretically prove the convergence of Algorithm 1 to stationary points of the learning objective. First, in the following subsection, we connect the gradient of a single user loss ℓ(x, θ) with the gradient of the learning objective f(Θ).

4.3. Gradient of learning objective f(Θ)

The following lemma computes the gradient of the objective function f(Θ). This lemma facilitates the analysis of MSGD.

Lemma 4.3. For the learning objective f(Θ) defined in Eq. (2), the gradient with respect to θ_i is:

∇_{θ_i} f(Θ) = (1 − ζ) a_i(Θ) E_{x∼D_i(Θ)}[∇_{θ_i} ℓ(x, θ_i)] + (ζ/k) E_{x∼P}[∇_{θ_i} ℓ(x, θ_i)].

Proof Sketch. Recall the decomposition f(Θ) = (1 − ζ) f^PR(Θ) + ζ f^NP(Θ). The gradient of f^NP(Θ) is easy to compute, since by linearity of the gradient and Lipschitzness of the loss we may write ∇_{θ_i} f^NP(Θ) = (1/k) E_{x∼P}[∇_{θ_i} ℓ(x, θ_i)]. It is more difficult to compute the gradient of f^PR(Θ) because the distribution D_i(Θ), and thus the domain of integration, is also a function of Θ. In our proof, we calculate ∇_{θ_i} f^PR(Θ) through its directional derivative D_v f^PR(Θ) = lim_{γ→0} (1/γ)(f^PR(Θ + γv) − f^PR(Θ)). Similar to So et al. (2022), we decouple the difference f^PR(Θ + γv) − f^PR(Θ) into two parts: one has a fixed integration domain that is independent of Θ, and the other is an integral over a set of vanishing measure, which is 0 when we take the limit. In the end, it turns out that the derivative of f^PR(Θ) can be computed by treating its domain of integration as fixed. Thus, we can move the derivative inside the integral and get

∇_{θ_i} f^PR(Θ) = a_i(Θ) E_{x∼D_i(Θ)}[∇_{θ_i} ℓ(x, θ_i)].  (3)

The complete proof can be found in Appendix C.2.

This lemma shows that the gradient of the objective with respect to model i depends only on the gradient of the loss with respect to model i's parameters. This decomposability allows for decentralized algorithms, like the one proposed in Algorithm 1. We now turn to formalizing the guarantees of this algorithm in terms of convergence.

Remark 4.4. Though it is interesting to consider the Boltzmann-rational model (Ziebart et al., 2010; Luce, 2005; 1977), this form of user choice poses additional difficulties for the development of decentralized algorithms. Under this behavior model, the gradient ∇_{θ_i} f(Θ) w.r.t. model i unavoidably depends on the losses of the other service providers. This poses a challenge in our setting: though service providers are all interested in providing an accurate service, they are not coordinated and cannot share information with each other. This challenge may be of interest for future work.

5. Asymptotic Convergence Analysis

Define the filtration (F_t)_{t=0}^∞ associated with MSGD. In particular, let F_0 = σ(Θ^0) be the σ-algebra generated by Θ^0. Let the σ-algebra F_t = σ(x^t, η^t, i^t, F_{t−1}) contain all the information up to iteration t, where i^t indicates the random user choice. We now show that MSGD will converge to stationary points of the learning objective. Formally, convergence is defined as follows.

Definition 5.1. (Convergence) We say a sequence {Θ^t}_{t=0}^∞ converges to a set 𝒯 if there exists T ∈ ℕ such that for all t > T, Θ^t ∈ 𝒯.

We will prove that MSGD converges under the following additional assumptions on the step size, the user population, and the loss function.
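Before stating them, we record a short calculation (our addition, not in the original text) that makes explicit why the single-sample MSGD update is an unbiased gradient step on f. Conditioning on the two choice events, model i is selected together with a user x ∼ D_i(Θ^t) with probability (1 − ζ) a_i(Θ^t), and together with a user x ∼ P with probability ζ/k, so by Lemma 4.3,

```latex
\mathbb{E}\left[\mathbf{1}\{i \text{ selected}\}\,\nabla_{\theta_i}\ell(x^{t+1}, \theta_i^t) \,\middle|\, \mathcal{F}_t\right]
= (1-\zeta)\, a_i(\Theta^t)\, \mathbb{E}_{x \sim D_i(\Theta^t)}\!\left[\nabla_{\theta_i}\ell(x, \theta_i^t)\right]
+ \frac{\zeta}{k}\, \mathbb{E}_{x \sim P}\!\left[\nabla_{\theta_i}\ell(x, \theta_i^t)\right]
= \nabla_{\theta_i} f(\Theta^t).
```

That is, in expectation each model descends its own coordinate of ∇f(Θ^t); this is the sense in which the update is "parallel to the gradient of the objective in expectation," which is used below to show that the noise sequence is a martingale difference.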
These assumptions are standard in the literature on stochastic (non)convex optimization (Ge et al., 2015; Bertsekas & Tsitsiklis, 2000; Wang & Srebro, 2019).

Assumption 3. The step sizes {η^{t+1}}_{t=0}^∞ satisfy the condition ∑_{t=0}^∞ (η^{t+1})² < ∞.

Assumption 4. We assume the set {Θ : ∇f(Θ) = 0} is compact.

The above compactness assumption is rather common (Leluc & Portier, 2020; So et al., 2022); it allows us to prove that {Θ^t}_{t=0}^∞ converges to {Θ : ∇f(Θ) = 0} by showing that ∇f(Θ^t) converges to 0.

Assumption 5. Assume the underlying data distribution P has a continuous density function p with bounded support, namely, Pr_{x∼P}(‖x‖ > R) = 0 for some R > 0.

Assumption 6. For any θ ≠ θ′, there exists a d_0 such that for all small d < d_0, the Lebesgue measure of the set S_d = {x : |ℓ(x, θ) − ℓ(x, θ′)| < d} is bounded by d, i.e., Vol(S_d) ≤ d.

This assumption states that the loss function is informative enough for most users to distinguish between different services (so that they can make a choice), and that a sufficiently small perturbation of one of the models won't dramatically change user preferences. We provide further intuition and examples in Appendix F.

Under these assumptions, we have the following theorem showing that the learning objective converges. Moreover, under the additional condition that η_t decreases at a rate of 1/t, the objective converges to a local optimum and the iterates {Θ^t}_{t=1}^T converge to stationary points. The following theorem makes this formal.

Theorem 5.2. (Convergence of Algorithm 1) Denote the iterates from Algorithm 1 and their overall loss by {Θ^t}_{t=0}^T and {f(Θ^t)}_{t=0}^T respectively. Under Assumptions 1, 2, 3, there is an ℝ-valued random variable f* such that f(Θ^t) converges to f* almost surely. Additionally, under Assumptions 4, 5, 6 and setting η^t = η_c/t, where η_c is a constant, the iterates {Θ^t}_{t=1}^T converge to the set of stationary points of f(Θ), i.e., {Θ : ∇f(Θ) = 0}, almost surely.

To prove Theorem 5.2, we first argue that the objective converges. Then, we use this fact to show that the parameters converge to a stationary point. The argument proceeds in the following two subsections respectively.

5.1. Convergence of f(Θ)

We first provide an outline of the proof of convergence of the learning objective. To start, we show an analytic upper bound on the value of f(Θ) at time t + 1 compared to time t. This bound relies on the smoothness of ℓ(x, ·). It writes this difference explicitly in terms of the gradient of the objective function f(Θ), as computed in Lemma 4.3, and the gradient of a single loss ℓ(x, θ_i), as used for the gradient update in the MSGD algorithm. Formally, we have the following lemma.

Lemma 5.3. Let f(Θ) be our learning objective proposed in Eq. (2) and let ζ̄ = 1 − ζ denote the probability that the best model is selected (while w.p. ζ a random model is selected). Let i be the model selected at time t. Then the following inequality holds under Assumption 2: f(Θ^{t+1}) − f(Θ^t) ≤ −A^{t+1} + B^{t+1} − C^{t+1}, where

A^{t+1} = ζ̄ η^{t+1} ‖∇_{θ_i} f^PR(Θ^t)‖² / a_i(Θ^t) + ζ η^{t+1} ‖∇_{θ_i} f^NP(Θ^t)‖²,

B^{t+1} = (β/2) ∑_{i=1}^k (η_i^{t+1})² ‖∇ℓ(x^{t+1}, θ_i^t)‖²,

C^{t+1} = ζ̄ η^{t+1} ⟨∇_{θ_i} f^PR(Θ^t), ∇ℓ(x^{t+1}, θ_i^t) − ∇_{θ_i} f^PR(Θ^t)⟩ + ζ η^{t+1} ⟨∇_{θ_i} f^NP(Θ^t), ∇ℓ(x^{t+1}, θ_i^t) − ∇_{θ_i} f^NP(Θ^t)⟩.

This lemma quantifies the decrease (or lack thereof) in the objective function value from one time step to the next. We therefore turn our focus to the sequences (A^t)_{t=0}^∞, (B^t)_{t=0}^∞, (C^t)_{t=0}^∞. The sequence (C^t)_{t=0}^∞ is an F_t martingale difference sequence because, as we show in Eq.
(3) in the proof of Lemma 4.3, the single loss used in the MSGD update is parallel to the gradient of the objective f(Θ) in expectation. Also notice that (B^t)_{t=1}^∞ has a convergent sum when ∑_{t=0}^∞ (η^{t+1})² < ∞ (Assumption 3). Let (M^t)_{t=1}^∞ be defined by

M^{t+1} = f(Θ^{t+1}) + ∑_{τ=t+2}^∞ B^τ.

Then we have that M^{t+1} − M^t ≤ −A^{t+1} − C^{t+1}. We show that M^t is an F_t supermartingale, which converges almost surely by the martingale convergence theorem. From this, the convergence of f(Θ) follows from the summability of (B^t)_{t=0}^∞. The complete proof can be found in Appendix C.1.

5.2. Convergence of iterates Θ

Next, we show that the model parameters Θ also converge, and in particular that they converge to the stationary points of the overall loss. The convergence of Θ follows from the convergence of f(Θ), under some additional conditions on the step size. To simplify notation, we denote the model update at each round as θ_i^{t+1} = θ_i^t − η_i^{t+1} ∇ℓ(x^{t+1}, θ_i^t), where η_i^{t+1} ≠ 0 only when model i was selected by the user. We define the following event E_i(τ, T, r, s):

E_i(τ, T, r, s) = { ∑_{t=τ}^{T−1} η^{t+1} < r and ∑_{t=τ}^{T−1} η_i^{t+1} > s }.

This event occurs when, over a particular time interval, the accumulated step size η^t is bounded above by a constant, while the accumulated step sizes of model i are bounded below. With this definition in hand, we have the following lemma.

Lemma 5.4. Under Assumptions 1-6, Θ^t converges to the stationary points of f(Θ) almost surely if the following condition on η^t holds: for any ϵ, there is an r_0 > 0 such that if r ∈ (0, r_0), then there exists a mapping T : ℕ → ℕ, a time step t_0 ∈ ℕ, and constants s, c > 0 such that, for any τ > t_0:

Pr( E_i(τ, T(τ), r, s) | F_τ, ‖∇_{θ_i} f(Θ^τ)‖ > ϵ ) > c.

The first part of the event E_i(τ, T, r, s), which bounds the accumulated step size by r, ensures that the gradient remains large, ‖∇_{θ_i} f(Θ^t)‖ > ϵ/2, for t ∈ [τ, T(τ)). (Notice that the total displacement between Θ^τ and Θ^{T(τ)} can be controlled by bounding the accumulated learning rate.) The second part, ∑_{t=τ}^{T(τ)−1} η_i^{t+1} > s, ensures that we accumulate enough learning for the models i ∈ [k] with large gradients, so that f(Θ) decreases by at least a constant amount on this interval. To finish proving Theorem 5.2, it remains to set the step size in Algorithm 1 to η^t = η_c/t and show that this satisfies the condition proposed in Lemma 5.4. The complete proof of Theorem 5.2 is given in Appendix C.1.

6. Experiments

We experimentally illustrate the performance of MSGD using real-world data. These experiments confirm our theoretical results and illustrate interesting phenomena that occur in the multi-learner setting. Code for reproducing these results can be found at https://github.com/sdean-group/MSGD.

6.1. Experimental Settings

We illustrate the performance of MSGD in two distinct settings with different datasets and loss functions.

Figure 2. Convergence of the objective function f(Θ) under MSGD or Full Information with k = 3 services in the movie recommendation (left) and census data (right) tasks.

Movie Recommendation with Squared Loss. Our first experimental setting is based on the widely used movie recommendation dataset MovieLens10M (Harper & Konstan, 2015), which contains 10 million ratings across 10k movies by 70k viewers. We preprocess the dataset with a procedure similar to Bose et al. (2023): we filter the total
ratings and only keep the top 200 most rated movies, and use the built-in matrix factorization function from the Python toolkit Surprise (Hug, 2020) to obtain d = 5 dimensional user embeddings. This preprocessing procedure results in a population of 69,474 users, each with a five-dimensional feature vector which we denote by z. We then consider a regression problem, where each model aims to predict the user's ratings r of the d_r = 200 movies. Denote by x = (z, r) each user's data and by Ω_x the set of movies which the user has rated. Let m = |Ω_x| denote the number of movies user x has rated. For each user, we define the following squared loss, where θ ∈ ℝ^{d × d_r}:

ℓ(x, θ) = (1/m) ∑_{i∈Ω_x} (θ_i^⊤ z − r_i)².

Census Data with Logistic Loss. Our second setting is based on census data made available by folktables (Ding et al., 2021). We consider the ACSEmployment task, where the goal is to predict whether individuals are employed. The population is defined by the 2018 census data from Alabama, filtered to individuals between ages 16 and 90, resulting in 47,777 users in total. After splitting off 0.2 of them for testing, we have 38,221 data points to sample from. The data contains d = 16 features describing demographic characteristics like age, education, marital status, etc. Denote each user's data as x = (z, y), where z ∈ ℝ^d and y ∈ {0, 1}. We scale the features z so that they have zero mean and unit variance. We use the logistic regression loss for this task,

ℓ(x, θ) = −y log(σ(θ^⊤ z)) − (1 − y) log(1 − σ(θ^⊤ z)),

where σ denotes the sigmoid function. The model predicts 1 if θ^⊤ z > 0 and 0 otherwise.

6.2. Results

We investigate the behavior of MSGD (Algorithm 1) in the movie recommendation and employment prediction settings. At each time step, we sample a user x at random from the data described above. We assign this user to one of the k services according to bounded rationality with parameter ζ. Then the selected service updates its parameters with the gradient of the loss on the user's data with step size η_t = 1/t. We evaluate the models on a held-out test set. We compare MSGD with a Full Information algorithm in which user data is shared among services, and at each time step all the models perform a gradient descent step with this user data, regardless of the user's selection.

Figure 3. Convergence of the iterates Θ under MSGD or Full Information with k = 3 services in the movie recommendation (left) and census data (right) tasks. For MSGD, we show results for ζ = 0, 0.2, 0.5, 0.8, 1 respectively.

Convergence of Loss Function. In Figure 2, we show the convergence of the learning objective f(Θ) for settings with users with different levels of rationality. We use a log-log scale, and thus the linear trend in Figure 2 signifies convergence at a polynomial rate in t. Compared with the Full Information setting, we find that MSGD performs better when users select services with high rationality, i.e., when ζ is small. This is because the full information updates do not allow models to specialize to their own user sub-populations. However, when ζ becomes large, MSGD attains higher overall loss than the full information setting. This is due to the fact that data sharing allows models to learn faster, since they receive comparably more data. We conclude that, even with limited data, as long as users select services with sufficient rationality, MSGD can still achieve higher social welfare than when data is shared between all the services (i.e., Full Information).
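The experimental protocol just described reduces to a few lines of code. The sketch below is our minimal rendering (not the released implementation; sample_user, loss, and grad_loss are assumed user-supplied callables) of both the MSGD loop of Algorithm 1 and the Full Information baseline:

```python
import numpy as np

def msgd(sample_user, loss, grad_loss, Theta0, zeta, T, eta_c=1.0, seed=0):
    """Algorithm 1: only the service chosen by the user takes a gradient step."""
    rng = np.random.default_rng(seed)
    Theta = [theta.astype(float).copy() for theta in Theta0]
    k = len(Theta)
    for t in range(T):
        x = sample_user()                                  # x ~ P
        if rng.random() < zeta:                            # no-preference user
            i = int(rng.integers(k))
        else:                                              # rational user: argmin loss
            i = int(np.argmin([loss(x, theta) for theta in Theta]))
        Theta[i] -= (eta_c / (t + 1)) * grad_loss(x, Theta[i])  # eta_t = eta_c / t
    return Theta

def full_information(sample_user, grad_loss, Theta0, T, eta_c=1.0):
    """Baseline: every service updates on every user's data."""
    Theta = [theta.astype(float).copy() for theta in Theta0]
    for t in range(T):
        x = sample_user()
        Theta = [theta - (eta_c / (t + 1)) * grad_loss(x, theta) for theta in Theta]
    return Theta
```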
Convergence of Iterates. In Figure 3, we plot the accumulated parameter update distance ∑_{t=1}^T ‖Θ^t − Θ^{t−1}‖ to show the convergence of the iterates. Since Θ updates an average of k = 3 times more often in the Full Information setting than in MSGD, the accumulated error is k times larger. For a fair comparison, we divide the accumulated error of the Full Information line by the number of services. We find that, even though the Full Information setting receives data from a static distribution, it still converges more slowly compared to MSGD when users select with high rationality.

Figure 4. Accuracy of MSGD or Full Information on the model-specific subpopulation D_i(Θ) (left) and whole population P (right) for the ACSEmployment task on census data with perfectly rational users (ζ = 0). For MSGD, we illustrate results for different total numbers of services k = 2, 4, 6.

Accuracy over Subpopulation vs. Whole Population. So far, we have seen that MSGD is advantageous particularly when users are highly rational, despite the fact that models have access to less data and act in an uncoordinated manner. We explore this fact by comparing the induced subpopulation performance of model i, i.e. the loss on D_i(Θ), to the whole-population performance, i.e. the loss on P, for the ACSEmployment task on census data. In Figure 4, we plot the averaged accuracy for k = 2, 4, 6 using census data. Because all services update with the same data in Full Information, the only difference is their initialization, and since they converge to the minimizer of the loss on P, changing k has a negligible effect. Compared to Full Information, MSGD achieves higher accuracy on the induced subpopulations, and this accuracy increases as k increases, as illustrated in the left panel of Figure 4. However, when evaluated on the whole population P, MSGD performs worse than Full Information, and we even observe accuracy decreasing with more training steps.

In Figure 5, we illustrate the accuracy over the whole population and the induced subpopulations as the number of services increases, with ζ = 0 and ζ = 0.1 respectively. Notice that when k increases, services update more slowly; to ensure services have already converged when calculating the accuracy, for each k we compute the average accuracy after T = 2000k total time steps and plot the average over 3 trials. We observe that as the total number of services k increases, the accuracy over the induced subpopulations increases at the cost of decreasing the accuracy over the whole population. In practice, the number of service providers depends on an uncoordinated market of services, and choosing k would be like a social planner or regulator intervening on the market to balance the trade-off between accuracy and specialization.

Figure 5. Accuracy of MSGD or Full Information on the census data with fairly rational users and different total numbers of services k. The plot displays the mean and standard deviation over three trials.

In the appendix, we investigate the performance of MSGD in additional settings: MSGD under Boltzmann-rational users and minibatch MSGD. These additional experiments show that despite a lack of theoretical understanding, MSGD also converges for Boltzmann-rational users.
They also show that MSGD is able to perform better when minibatching is possible and it is not necessary to operate in a purely streaming setting. More details can be found in Appendix E.

7. Discussion

In this paper, we consider a setting in which streaming users choose between multiple services and commit their data to the model of their choosing. We design a simple, efficient, and intuitive gradient descent algorithm that does not require any coordination between services. We prove that it guarantees convergence to the local optima of the overall objective function, and we empirically explore its performance in two different real data settings. There remain several interesting directions for future work. One thread comes from considering alternative user behavior models. Though we experimentally show that MSGD also converges under the Boltzmann-rational model, we leave the theoretical analysis as future work. Another direction is to consider alternative learning objectives, such as overall population accuracy or the market share of each service. This perspective would motivate greater coordination or explicit competition between the learning-based services, rather than the simple decentralized updates that we study.

Acknowledgements

This work was partly funded by NSF CCF 2312774 and NSF OAC-2311521, and a LinkedIn Research Award.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, most of which we do not specifically highlight here. We do remark that MSGD results in specialized models: services optimize for accuracy on the sub-population of users who choose them. There is a tension between such specialization, which enables more accurate models, and global accuracy, which ensures universal performance over an entire population. It is important for real-world deployments to consider this trade-off carefully in light of fairness, bias, accessibility, etc.

References

Anderson, T. Popular music in a digital music economy: Problems and practices for an emerging service industry. Routledge, 2013.

Arous, G. B., Gheissari, R., and Jagannath, A. Online stochastic gradient descent on non-convex losses from high-dimensional inference. The Journal of Machine Learning Research, 22(1):4788–4838, 2021.

Bartlett, P. L. Learning with a slowly changing distribution. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 243–252, 1992.

Bartlett, P. L., Ben-David, S., and Kulkarni, S. R. Learning changing concepts by exploiting the structure of change. Machine Learning, 41:153–174, 2000.

Ben-Porat, O. and Tennenholtz, M. Best response regression. Advances in Neural Information Processing Systems, 30, 2017.

Ben-Porat, O. and Tennenholtz, M. Regression equilibrium. In Proceedings of the 2019 ACM Conference on Economics and Computation, pp. 173–191, 2019.

Bertsekas, D. P. and Tsitsiklis, J. N. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627–642, 2000.

Bi, K., Metrikov, P., Li, C., and Byun, B. Leveraging user behavior history for personalized email search. In Proceedings of the Web Conference 2021, pp. 2858–2868, 2021.

Bose, A., Curmei, M., Jiang, D. L., Morgenstern, J., Dean, S., Ratliff, L. J., and Fazel, M. Initializing services in interactive ML systems for diverse users. arXiv preprint arXiv:2312.11846, 2023.

Clarizia, F., Colace, F., De Santo, M., Lombardi, M., Pascale, F., and Santaniello, D.
A context-aware chatbot for tourist destinations. In 2019 15th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), pp. 348–354. IEEE, 2019.

Cohen-Addad, V., Guedj, B., Kanade, V., and Rom, G. Online k-means clustering. In International Conference on Artificial Intelligence and Statistics, pp. 1126–1134. PMLR, 2021.

Cutkosky, A. and Busa-Fekete, R. Distributed stochastic optimization via adaptive SGD. Advances in Neural Information Processing Systems, 31, 2018.

Cutkosky, A. and Orabona, F. Momentum-based variance reduction in non-convex SGD. Advances in Neural Information Processing Systems, 32, 2019.

Dean, S., Curmei, M., Ratliff, L., Morgenstern, J., and Fazel, M. Emergent specialization from participation dynamics and multi-learner retraining. In International Conference on Artificial Intelligence and Statistics, pp. 343–351. PMLR, 2024.

Ding, F., Hardt, M., Miller, J., and Schmidt, L. Retiring adult: New datasets for fair machine learning. Advances in Neural Information Processing Systems, 34:6478–6490, 2021.

Durrett, R. Probability: Theory and Examples, volume 49. Cambridge University Press, 2019.

Eriksson, M., Fleischer, R., Johansson, A., Snickars, P., and Vonderau, P. Spotify Teardown: Inside the Black Box of Streaming Music. MIT Press, 2019.

Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., and Bouchachia, A. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4):1–37, 2014.

Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pp. 797–842. PMLR, 2015.

Gigerenzer, G. and Selten, R. Bounded Rationality: The Adaptive Toolbox. MIT Press, 2002.

Ginart, T., Zhang, E., Kwon, Y., and Zou, J. Competing AI: How does competition feedback affect machine learning? In International Conference on Artificial Intelligence and Statistics, pp. 1693–1701. PMLR, 2021.

Harper, F. M. and Konstan, J. A. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1–19, 2015.

Hashimoto, T., Srivastava, M., Namkoong, H., and Liang, P. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pp. 1929–1938. PMLR, 2018.

Hug, N. Surprise: A Python library for recommender systems. Journal of Open Source Software, 5(52):2174, 2020.

Jagadeesan, M., Jordan, M. I., and Haghtalab, N. Competition, alignment, and equilibria in digital marketplaces. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 5689–5696, 2023a.

Jagadeesan, M., Jordan, M. I., Steinhardt, J., and Haghtalab, N. Improved Bayes risk can yield reduced social welfare under competition. arXiv preprint arXiv:2306.14670, 2023b.

James, H., Nagpal, C., Heller, K. A., and Ustun, B. Participatory personalization in classification. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Jones, B. D. Bounded rationality. Annual Review of Political Science, 2(1):297–321, 1999.

Kuh, A., Petsche, T., and Rivest, R. Learning time-varying concepts. Advances in Neural Information Processing Systems, 3, 1990.

Kwon, Y., Ginart, A., and Zou, J. Competition over data: how does data purchase affect users? arXiv preprint arXiv:2201.10774, 2022.

Leluc, R. and Portier, F. Asymptotic analysis of conditioned stochastic gradient descent. arXiv preprint arXiv:2006.02745, 2020.

Li, Q., Yau, C.-Y., and Wai, H.-T.
Multi-agent performative prediction with greedy deployment and consensus seeking agents. Advances in Neural Information Processing Systems, 35:38449–38460, 2022.

Li, X. and Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 983–992. PMLR, 2019.

Liberty, E., Sriharsha, R., and Sviridenko, M. An algorithm for online k-means clustering. In 2016 Proceedings of the Eighteenth Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 81–89. SIAM, 2016.

Luce, R. D. The choice axiom after twenty years. Journal of Mathematical Psychology, 15(3):215–233, 1977.

Luce, R. D. Individual Choice Behavior: A Theoretical Analysis. Courier Corporation, 2005.

Ma, Z., Dou, Z., Zhu, Y., Zhong, H., and Wen, J.-R. One chatbot per person: Creating personalized chatbots based on implicit user profiles. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 555–564, 2021.

Morris, J. W. Selling Digital Music, Formatting Culture. University of California Press, 2015.

Narang, A., Faulkner, E., Drusvyatskiy, D., Fazel, M., and Ratliff, L. J. Multiplayer performative prediction: Learning in decision-dependent games. Journal of Machine Learning Research, 24(202):1–56, 2023.

Perdomo, J., Zrnic, T., Mendler-Dünner, C., and Hardt, M. Performative prediction. In International Conference on Machine Learning, pp. 7599–7609. PMLR, 2020.

Piliouras, G. and Yu, F.-Y. Multi-agent performative prediction: From global stability and optimality to chaos. In Proceedings of the 24th ACM Conference on Economics and Computation, pp. 1047–1074, 2023.

Posner, R. A. Rational choice, behavioral economics, and the law. Stanford Law Review, 50:1551, 1997.

Prey, R. Musica analytica: The datafication of listening. Networked Music Cultures: Contemporary Approaches, Emerging Issues, pp. 31–48, 2016.

Selten, R. Bounded rationality. Journal of Institutional and Theoretical Economics (JITE)/Zeitschrift für die gesamte Staatswissenschaft, 146(4):649–658, 1990.

Shekhtman, E. and Dean, S. Strategic usage in a multi-learner setting. In International Conference on Artificial Intelligence and Statistics. PMLR, 2024.

Shumanov, M. and Johnson, L. Making conversations with chatbots more personalized. Computers in Human Behavior, 117:106627, 2021.

So, G., Mahajan, G., and Dasgupta, S. Convergence of online k-means. In International Conference on Artificial Intelligence and Statistics, pp. 8534–8569. PMLR, 2022.

Swenson, B., Murray, R., Poor, H. V., and Kar, S. Distributed stochastic gradient descent: Nonconvexity, nonsmoothness, and convergence to local minima. The Journal of Machine Learning Research, 23(1):14751–14812, 2022.

Tang, C. and Monteleoni, C. On Lloyd's algorithm: new theoretical insights for clustering in practice. In Artificial Intelligence and Statistics, pp. 1280–1289. PMLR, 2016.

Tang, C. and Monteleoni, C. Convergence rate of stochastic k-means. In Artificial Intelligence and Statistics, pp. 1495–1503. PMLR, 2017.

Ustinovskiy, Y. and Serdyukov, P. Personalization of web-search using short-term browsing context. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 1979–1988, 2013.

Wang, W. and Srebro, N. Stochastic nonconvex optimization with large minibatches. In Algorithmic Learning Theory, pp. 857–882. PMLR, 2019.

Webb, G. I., Hyde, R., Cao, H., Nguyen, H.
L., and Petitjean, F. Characterizing concept drift. Data Mining and Knowledge Discovery, 30(4):964–994, 2016.

Webster, J. The promise of personalisation: Exploring how music streaming platforms are shaping the performance of class identities and distinction. New Media & Society, 25(8):2140–2162, 2023.

Wood, K., Bianchin, G., and Dall'Anese, E. Online projected gradient descent for stochastic optimization with decision-dependent distributions. IEEE Control Systems Letters, 6:1646–1651, 2021.

Yang, Z., Lei, Y., Wang, P., Yang, T., and Ying, Y. Simple stochastic and online gradient descent algorithms for pairwise learning. Advances in Neural Information Processing Systems, 34:20160–20171, 2021.

Yoganarasimhan, H. Search personalization using machine learning. Management Science, 66(3):1045–1070, 2020.

Zhang, X., Khaliligarekani, M., Tekin, C., et al. Group retention when using machine learning in sequential decision making: the interplay between user dynamics and fairness. Advances in Neural Information Processing Systems, 32, 2019.

Ziebart, B. D., Bagnell, J. A., and Dey, A. K. Modeling interaction via the principle of maximum causal entropy. 2010.

A. Notations, Assumptions and Definitions

We summarize the notation used in the paper and the proofs in Table 1. When it is clear from the context, we shall omit the Θ in parentheses and use the simpler notation X = (X_1, X_2, …, X_k), a = (a_1, …, a_k) and D = (D_1, …, D_k).

A.1. Notations

| Notation | Explanation | Value |
| --- | --- | --- |
| ℓ(·,·) | Personalized loss function | ℓ(·,·) ≥ 0 |
| Θ^t | Collection of model parameters of all the services at time step t | Θ^t = (θ_1^t, …, θ_k^t) ∈ ℝ^{k×d} |
| θ_i^t | Model parameter of service provider i at time step t | θ_i^t ∈ ℝ^d |
| P | Underlying data distribution | P ∈ 𝒫(X) |
| x^{t+1} | Data arriving at time step t | x^{t+1} ∼ P |
| ζ | Probability that users pick service providers randomly | ζ ∈ [0, 1] |
| X(Θ) = (X_1(Θ), …, X_k(Θ)) | Data subpopulation partitioning induced by Θ | ∪_{i∈[k]} X_i = X |
| a(Θ) = (a_1(Θ), …, a_k(Θ)) | Subpopulation proportions induced by Θ | ∑_{i=1}^k a_i = 1 |
| D(Θ) = (D_1(Θ), …, D_k(Θ)) | Subpopulation distributions induced by Θ | D_i(Θ) = P|_{X_i(Θ)} |

Table 1. Notation summary.

In the proofs, our notation is consistent with Algorithm 2, which is a more verbose version of MSGD.

Algorithm 2 Detailed version of MSGD
Input: rationality parameter ζ; loss function ℓ(·,·) ≥ 0; initial model parameters Θ = (θ_1, …, θ_k); learning rate parameters {η^t}_{t=1}^∞ for the model that has an actual update at each round.
for t = 0, 1, 2, …, T do
  Receive data point x^{t+1} ∼ P.
  User side (user selects a service provider): the user rationally picks the best model i ∈ [k], where i = arg min_{i∈[k]} ℓ(x, θ_i), w.p. 1 − ζ; otherwise the user randomly picks some i ∈ [k] w.p. ζ.
  Learner side (selected learner updates its model): let η_i^{t+1} = η^{t+1} if model i is selected, else η_i^{t+1} = 0. Models update through θ_i^{t+1} = θ_i^t − η_i^{t+1} ∇ℓ(x^{t+1}, θ_i^t).
end for
Return Θ^{T+1} = (θ_1^{T+1}, …, θ_k^{T+1})

We also use the notion of a filtration, which is given below.

Definition A.1. Let (F_t)_{t=0}^∞ be a filtration associated with Algorithm 2. In particular, let F_0 = σ(Θ^0) be the σ-algebra generated by Θ^0, and let x^t and η^t be the tuples x^t = (x_1^t, …, x_k^t) and η^t = (η_1^t, …, η_k^t). Let the σ-algebra F_t = σ(x^t, η^t, F_{t−1}) contain all random events up to iteration t. Note that this does not explicitly contain i^t, the model chosen at time t, as this information is included in the definition of η^t. In addition, assume (1) if a_i(Θ^t) = 0, then η_i^{t+1} = 0 almost surely; (2) η^{t+1} and x^{t+1} are conditionally independent given F_t; (3) 0 ≤ η_i^{t+1} ≤ 1.
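One consequence of this convention is used repeatedly in the proofs below, and we state it here explicitly for readability (it appears inside the proof of Theorem 5.2): since model i is selected with probability (1 − ζ) a_i(Θ^t) + ζ/k,

```latex
\mathbb{E}\left[\eta_i^{t+1} \,\middle|\, \mathcal{F}_t\right]
= \eta^{t+1}\,\Pr\left(i \text{ selected} \,\middle|\, \mathcal{F}_t\right)
= \left[(1-\zeta)\, a_i(\Theta^t) + \frac{\zeta}{k}\right] \eta^{t+1}.
```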
A.2. Assumptions and Definitions

In this paper, we make some rather general assumptions on the loss function ℓ(x, θ) (in Assumption 2), such as L-Lipschitzness (Definition A.2) and β-smoothness (Definition A.3), whose definitions are given below.

Definition A.2. (L-Lipschitz) Given the loss function ℓ(·,·) : X × C → ℝ, it is L-Lipschitz if for all x ∈ X and θ, θ′ ∈ C, we have |ℓ(x, θ) − ℓ(x, θ′)| ≤ L ‖θ − θ′‖.

Definition A.3. (β-smooth) Given the loss function ℓ(·,·) : X × C → ℝ, we say it is β-smooth if the gradient is β-Lipschitz, namely, for all x ∈ X and θ, θ′ ∈ C,

‖∇ℓ(x, θ) − ∇ℓ(x, θ′)‖ ≤ β ‖θ − θ′‖.

Note that the above definition is equivalent to

ℓ(x, θ′) ≤ ℓ(x, θ) + ∇_θ ℓ(x, θ)^⊤ (θ′ − θ) + (β/2) ‖θ′ − θ‖².  (4)

In the proofs, we also use martingales.

Definition A.4. (Martingale Sequences) Define a filtration as an increasing sequence of σ-fields F_0 ⊆ F_1 ⊆ ⋯ ⊆ F_n on some probability space. Let (X_i) be a sequence of random variables such that X_i is measurable w.r.t. F_i; then (X_i) is a martingale w.r.t. (F_i) if E[X_i | F_{i−1}] = X_{i−1} for all i.

Definition A.5. (Martingale Difference Sequences) We call (X_i) a martingale difference sequence w.r.t. (F_i) if E[X_i | F_{i−1}] = 0 for all i.

A.3. Background Results

We also need the following theorem and lemmas for our proof.

Theorem A.6. (Martingale convergence theorem (Durrett, 2019)) Let (M^t)_{t=0}^∞ be a (sub)martingale with sup_{t∈ℕ} E[M_+^t] < ∞, where M_+^t = max{0, M^t}. Then as t → ∞, M^t converges a.s. to a limit M with E[|M|] < ∞.

Lemma A.7. (Second Borel-Cantelli Lemma (Durrett, 2019)) Let (Ω, F, P) be a probability space, let (F_t)_{t≥0} be a filtration with F_0 = {∅, Ω}, and let (B_t)_{t≥0} be a sequence of events with B_t ∈ F_t. Then the event {B_t occurs infinitely often} is the same as the event {∑_{t=0}^∞ P(B_t | F_{t−1}) = ∞}.

Lemma A.8. (Azuma-Hoeffding Inequality) For a martingale difference sequence of random variables {Y_t}_{t=1}^n w.r.t. a filtration {F_t}_{t=1}^n, if a_t ≤ Y_t ≤ b_t for some constants a_t, b_t, t = 1, …, n, then

Pr( ∑_{t=1}^n Y_t > s ) ≤ exp( −2s² / ∑_{t=1}^n (b_t − a_t)² ).

B. Properties of Learning Objective

B.1. Decomposition of f(Θ)

The learning objective f(Θ) can be decomposed as follows:

f(Θ) = E_{x∼P}[ ∑_{i=1}^k ( (1 − ζ) 1_i(x, Θ) + ζ/k ) ℓ(x, θ_i) ]
= ∑_{i=1}^k ( (1 − ζ) E_{x∼P}[1_i(x, Θ) ℓ(x, θ_i)] + (ζ/k) E_{x∼P}[ℓ(x, θ_i)] )
= (1 − ζ) ∑_{i=1}^k a_i(Θ) E_{x∼D_i(Θ)}[ℓ(x, θ_i)] + (ζ/k) ∑_{i=1}^k E_{x∼P}[ℓ(x, θ_i)]
= (1 − ζ) f^PR(Θ) + ζ f^NP(Θ),

where f^PR(Θ) = ∑_{i=1}^k a_i E_{x∼D_i}[ℓ(x, θ_i)] is the learning objective when all users have perfect rationality, while f^NP(Θ) = (1/k) ∑_{i=1}^k E_{x∼P}[ℓ(x, θ_i)] is the learning objective when users have no preference and choose randomly over all the models. Thus, f(Θ) can be decomposed as a linear combination of f^PR and f^NP, controlled by the parameter ζ, which captures the users' rationality.

B.2. Example of f^PR(Θ) being Non-Convex

In Figure 1, we give a simple illustration of f^PR(Θ) being non-convex. Here, we show it mathematically. Let x ∼ U(0, 1), where U(0, 1) is the uniform distribution over [0, 1], and let the loss function be ℓ(x, θ) = (x − θ)². Let Θ = (θ_1, θ_2) with θ_1, θ_2 ∈ ℝ and θ_1 < θ_2. Then

f^PR(Θ) = E_{x∼U(0,1)}[min{(x − θ_1)², (x − θ_2)²}]
= ∫_0^{(θ_1+θ_2)/2} (x − θ_1)² dx + ∫_{(θ_1+θ_2)/2}^1 (x − θ_2)² dx
= (θ_1 + θ_2)²(θ_1 − θ_2)/4 + θ_2² − θ_2 + 1/3.

The Hessian of f^PR(Θ) is

∇²f^PR(Θ) = [ (3θ_1 + θ_2)/2 , (θ_1 − θ_2)/2 ; (θ_1 − θ_2)/2 , 2 − (θ_1 + 3θ_2)/2 ].

Without too many constraints on Θ, the Hessian is generally not positive semi-definite, and thus the function is non-convex.

B.3. Boundedness of f(Θ)

Here we prove that f(Θ) is upper bounded by f^NP(Θ) and lower bounded by f^PR(Θ) using Proposition D.1. Specifically, let

F(Θ; Θ′) = ∑_{i=1}^k a_i(Θ′) E_{x∼D_i(Θ′)}[ℓ(x, θ_i)],

which is a family of functions parameterized by Θ′. Then we have f^PR(Θ) ≤ F(Θ; Θ′) for all Θ′ (Proposition D.1).
B.3. Boundedness of $f(\Theta)$

Here we prove that $f(\Theta)$ is upper bounded by $f^{NP}(\Theta)$ and lower bounded by $f^{PR}(\Theta)$, using Proposition D.1. Specifically, let
$$F(\Theta; \Theta') = \sum_{i=1}^k a_i(\Theta')\,\mathbb{E}_{x\sim\mathcal{D}_i(\Theta')}[\ell(x,\theta_i)],$$
which is a family of functions parameterized by $\Theta'$. Then we have $f^{PR}(\Theta) \le F(\Theta;\Theta')$ for all $\Theta'$ (Proposition D.1).

Let $\Theta^{-i}$ denote a parameter tuple under which every user selects model $i$; namely,
$$a_j(\Theta^{-i}) = \begin{cases}1, & \text{if } j = i\\ 0, & \text{if } j \ne i\end{cases}, \qquad \mathcal{D}_j(\Theta^{-i}) = \mathcal{P} \text{ if } j = i \text{ (and immaterial otherwise, since } a_j(\Theta^{-i}) = 0\text{)},$$
so that $F(\Theta; \Theta^{-i}) = \mathbb{E}_{x\sim\mathcal{P}}[\ell(x,\theta_i)]$. Then $f^{PR}(\Theta) \le \frac{1}{k}\sum_{i=1}^k F(\Theta;\Theta^{-i}) = f^{NP}(\Theta)$. Thus,
$$f^{PR}(\Theta) \le (1-\zeta)\,f^{PR}(\Theta) + \zeta\,f^{NP}(\Theta) \le f^{NP}(\Theta), \quad \text{i.e.,} \quad f^{PR}(\Theta) \le f(\Theta) \le f^{NP}(\Theta).$$

C. Omitted Proofs

C.1. Proof of Theorem 5.2

Proof. (1) Convergence of the objective function $f(\Theta)$. Recall that Lemma 5.3 gives the following inequality for the updates of $f(\Theta)$:
$$f(\Theta^{t+1}) \le f(\Theta^t) - A^{t+1} + B^{t+1} - C^{t+1}. \quad (6)$$
Let $(\mathcal{F}_t)_{t=0}^{\infty}$ be the filtration given in Definition A.1, and let $(M^t)_{t=1}^{\infty}$ be defined by
$$M^{t+1} = f(\Theta^{t+1}) - \sum_{\tau=1}^{t+1} B^{\tau}.$$
Then Eq. (6) becomes $M^{t+1} \le M^t - A^{t+1} - C^{t+1}$. Taking the expectation conditioned on $\mathcal{F}_t$:
$$\mathbb{E}[M^{t+1}\mid\mathcal{F}_t] \le \mathbb{E}[M^t\mid\mathcal{F}_t] - \mathbb{E}[A^{t+1}\mid\mathcal{F}_t] - \mathbb{E}[C^{t+1}\mid\mathcal{F}_t] \le \mathbb{E}[M^t\mid\mathcal{F}_t] - \mathbb{E}[C^{t+1}\mid\mathcal{F}_t] \quad (7)$$
$$= \mathbb{E}[M^t\mid\mathcal{F}_t] = M^t, \quad (8)$$
where Eq. (7) is due to $(A^t)_{t=1}^{\infty}$ being non-negative, while Eq. (8) is because $(C^t)_{t=0}^{\infty}$ is a martingale difference sequence (Lemma D.3). Thus, $M^t$ is an $\mathcal{F}_t$-supermartingale, and equivalently $-M^t$ is an $\mathcal{F}_t$-submartingale. Moreover, since $(B^t)_{t=1}^{\infty}$ is non-negative,
$$-M^t = -f(\Theta^t) + \sum_{\tau=1}^{t} B^{\tau} \le \sum_{\tau=1}^{\infty} B^{\tau} < \infty,$$
where we also used the fact that $f(\Theta^t) \ge 0$ and $\sum_{\tau=1}^{\infty} B^{\tau} < \infty$ (Lemma D.3). By the martingale convergence theorem (Theorem A.6), $(-M^t)_{t=1}^{\infty}$ converges almost surely. Hence $f(\Theta^t) = M^t + \sum_{\tau\le t} B^{\tau}$ converges almost surely.

(2) Convergence of the iterates $\Theta$. Based on Lemma 5.4, what is left to do is to verify that the learning rate in Algorithm 1 satisfies the conditions of Lemma 5.4. To this end, we take $t_0$ and $s$ such that $s < \frac{\epsilon\eta_c r}{4L}$ and $t_0 \ge \frac{2\ln 6}{s^2}\,\eta_c^2\,(e^r - 1)$.

To begin with, we show that when $\|\nabla_{\theta_i} f(\Theta^\tau)\| \in [\epsilon,\epsilon_0)$, the selected step sizes accumulate with positive probability. Define $\mu = \sum_{\tau\le t<T(\tau)} \mathbb{E}[\eta^{t+1}_i\mid\mathcal{F}_t]$ and note that $\eta^{t+1}_i - \mathbb{E}[\eta^{t+1}_i\mid\mathcal{F}_t]$ is a martingale difference sequence with increments bounded in magnitude by
$$\Big|(1-\zeta)\,a^t_i + \frac{\zeta}{k} - 1\Big|\,\eta^{t+1} \le \eta^{t+1} = \frac{\eta_c}{t+1},$$
where we have used that $\mathbb{E}[\eta^{t+1}_i\mid\mathcal{F}_t] = \big((1-\zeta)\,a^t_i + \frac{\zeta}{k}\big)\,\eta^{t+1}$. Now, we apply the Azuma-Hoeffding inequality (Lemma A.8), which implies that
$$\Pr\Big(\sum_{\tau\le t<T(\tau)}\eta^{t+1}_i \le s\,\Big|\,\mathcal{F}_\tau,\ \|\nabla_{\theta_i}f(\Theta^\tau)\|\in[\epsilon,\epsilon_0)\Big) \le \exp\Big(-\frac{(\mu-s)^2}{2\sum_{\tau\le t<T(\tau)}(\eta^{t+1})^2}\Big) \le \frac{1}{6},$$
where the last step uses $\mu > 2s$ (proven below, so that $(\mu-s)^2 > s^2$) together with
$$\sum_{\tau\le t<T(\tau)}(\eta^{t+1})^2 = \eta_c^2\sum_{\tau\le t<T(\tau)}\frac{1}{(t+1)^2} \overset{(i)}{\le} \frac{\eta_c^2\,(e^r-1)}{\tau+1} \overset{(ii)}{\le} \frac{s^2}{2\ln 6};$$
here $(i)$ is because $T(\tau) - \tau \le (e^r-1)(\tau+1)$ from Corollary D.6, while $(ii)$ is due to $\tau > t_0 \ge \frac{2\ln 6}{s^2}\eta_c^2(e^r-1)$.

In the following, we claim and prove that, conditioned on $\mathcal{F}_\tau$ and $\|\nabla_{\theta_i}f(\Theta^\tau)\| \in [\epsilon,\epsilon_0)$, we have $\mu > 2s$. Because
$$\nabla_{\theta_i}f(\Theta^\tau) = (1-\zeta)\,a^\tau_i(\Theta)\,\mathbb{E}_{x\sim\mathcal{D}_i(\Theta)}[\nabla_{\theta_i}\ell(x,\theta_i)] + \frac{\zeta}{k}\,\mathbb{E}_{x\sim\mathcal{P}}[\nabla_{\theta_i}\ell(x,\theta_i)],$$
we have $\|\nabla_{\theta_i}f(\Theta^\tau)\| \le \big((1-\zeta)a^\tau_i + \frac{\zeta}{k}\big)L$. Then, since $\|\nabla_{\theta_i}f(\Theta^\tau)\| \ge \epsilon$, we must have $(1-\zeta)a^\tau_i + \frac{\zeta}{k} \ge \frac{\epsilon}{L}$. Since $a_i(\Theta)$ is locally $L_a$-sensitive (Lemma D.7), for $\tau \le t \le T(\tau)$ we have
$$|a_i(\Theta^t) - a_i(\Theta^\tau)| \le L_a\,\|\Theta^\tau - \Theta^t\| \le L_a\,L\sum_{\tau<s\le t}\eta^{s} \le L_a\,L\,\eta_c\,r$$
by Lemma D.4 and the definition of $T(\tau)$, so that for $r$ small enough, $(1-\zeta)a^t_i + \frac{\zeta}{k} \ge \frac{\epsilon}{2L}$ throughout the interval. Hence, for all sufficiently large $\tau > t_0$,
$$\mu = \sum_{\tau\le t<T(\tau)}\Big((1-\zeta)a^t_i + \frac{\zeta}{k}\Big)\,\eta^{t+1} \ge \frac{\epsilon\,\eta_c}{2L}\sum_{\tau\le t<T(\tau)}\frac{1}{t+1} \ge \frac{\epsilon\,\eta_c}{2L}\Big(r - \frac{1}{\tau+1}\Big) > 2s,$$
using $s < \frac{\epsilon\eta_c r}{4L}$. Combining the two displays, with probability at least $\frac{5}{6} > 0$,
$$\sum_{\tau\le t<T(\tau)}\eta^{t+1}_i \ge \mu - s > s \quad \text{conditioned on } \mathcal{F}_\tau,\ \|\nabla_{\theta_i}f(\Theta^\tau)\|\in[\epsilon,\epsilon_0),$$
which verifies the condition of Lemma 5.4. Together with part (1), this establishes the claimed convergence of the iterates. $\square$
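To illustrate part (1) numerically, the following sketch runs MSGD on a synthetic bimodal population and tracks a Monte Carlo estimate of $f(\Theta^t) = (1-\zeta)f^{PR}(\Theta^t) + \zeta f^{NP}(\Theta^t)$. The mixture, the constants, and the squared loss are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
k, zeta, eta_c, T = 2, 0.2, 0.5, 20_000
theta = rng.normal(size=k)
# Hypothetical population: an even mixture of two user types.
pop = np.concatenate([rng.normal(-2, 0.5, 50_000), rng.normal(2, 0.5, 50_000)])

def f_overall(theta):
    """Monte Carlo estimate of f = (1 - zeta) E[min_i l] + (zeta / k) sum_i E[l]."""
    losses = (pop[:, None] - theta[None, :]) ** 2
    return (1 - zeta) * losses.min(axis=1).mean() + zeta * losses.mean()

for t in range(T):
    x = pop[rng.integers(pop.size)]            # streaming data point x ~ P
    if rng.random() < zeta:                    # imperfectly rational user
        i = int(rng.integers(k))
    else:
        i = int(np.argmin((x - theta) ** 2))
    theta[i] -= (eta_c / (t + 1)) * 2.0 * (theta[i] - x)   # MSGD step
    if t % 5_000 == 0:
        print(t, round(float(f_overall(theta)), 4))
```

In runs like this one, the estimate of $f(\Theta^t)$ settles down as the step size $\eta_c/(t+1)$ decays, consistent with the almost sure convergence shown above.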
C.2. Proof of Lemma 4.3

Proof. As shown in Appendix B.1, the objective function $f(\Theta)$ can be decomposed as $f(\Theta) = (1-\zeta)f^{PR}(\Theta) + \zeta f^{NP}(\Theta)$. The derivative of $f^{NP}(\Theta)$ can be easily computed as $\nabla_{\theta_i}f^{NP}(\Theta) = \frac{1}{k}\mathbb{E}_{x\sim\mathcal{P}}[\nabla\ell(x,\theta_i)]$. Thus, we mainly focus on the derivative of $f^{PR}(\Theta)$, i.e., the learning objective when $\zeta = 0$. In the following, we take a closer look at $f^{PR}$ and then use a technique similar to that of So et al. (2022), taking directional derivatives.

First, note that $f^{PR}(\Theta)$, although expressed in terms of $a(\Theta)$ and $\mathcal{D}(\Theta)$ in Eq. (1), can also be written in terms of indicator functions on $\mathcal{X}(\Theta)$:
$$f^{PR}(\Theta) = \sum_{i=1}^k a_i\,\mathbb{E}_{x\sim\mathcal{D}_i}[\ell(x,\theta_i)] = \mathbb{E}_{x\sim\mathcal{P}}\big[\min_{i\in[k]}\ell(x,\theta_i)\,\big|\,\Theta\big] = \mathbb{E}_{x\sim\mathcal{P}}\Big[\sum_{i=1}^k \ell(x,\theta_i)\,\mathbb{1}_{\mathcal{X}_i}(x)\,\Big|\,\Theta\Big], \quad (10)$$
where $\mathbb{1}_{\mathcal{X}_i}(x)$ is the indicator function, equal to $1$ if $x\in\mathcal{X}_i(\Theta)$ and $0$ otherwise.

Fix $\Theta\in\mathbb{R}^{k\times d}$, let $\gamma > 0$ be a scalar, and let $v\in\mathbb{R}^{k\times d}$ with $\|v\| = 1$. Denote $\tilde\Theta = \Theta + \gamma v$ and the subpopulation partition induced by $\tilde\Theta$ as $\mathcal{X}(\tilde\Theta) = (\mathcal{X}_1(\tilde\Theta), \dots, \mathcal{X}_k(\tilde\Theta)) = (\tilde{\mathcal{X}}_1, \dots, \tilde{\mathcal{X}}_k)$. Following Eq. (10), $f^{PR}(\Theta) = \mathbb{E}_{x\sim\mathcal{P}}[\sum_{i=1}^k \ell(x,\theta_i)\,\mathbb{1}_{\mathcal{X}_i}(x)\mid\Theta]$, while $f^{PR}(\tilde\Theta)$ can be decoupled as
$$f^{PR}(\tilde\Theta) = \mathbb{E}\Big[\sum_{i=1}^k \ell(x,\tilde\theta_i)\,\mathbb{1}_{\tilde{\mathcal{X}}_i}(x)\,\Big|\,\tilde\Theta\Big] = \mathbb{E}\Big[\sum_{i=1}^k \ell(x,\tilde\theta_i)\,\mathbb{1}_{\mathcal{X}_i}(x) + \sum_{i=1}^k \ell(x,\tilde\theta_i)\,\mathbb{1}_{\tilde{\mathcal{X}}_i\setminus\mathcal{X}_i}(x) - \sum_{i=1}^k \ell(x,\tilde\theta_i)\,\mathbb{1}_{\mathcal{X}_i\setminus\tilde{\mathcal{X}}_i}(x)\,\Big|\,\Theta,\tilde\Theta\Big],$$
where we used the fact that $\tilde{\mathcal{X}}_i = [\mathcal{X}_i \cup (\tilde{\mathcal{X}}_i\setminus\mathcal{X}_i)]\setminus(\mathcal{X}_i\setminus\tilde{\mathcal{X}}_i)$.

Now we compute the directional derivative $D_v f^{PR}(\Theta)$ along direction $v$:
$$D_v f^{PR}(\Theta) = \lim_{\gamma\to 0}\frac{1}{\gamma}\big(f^{PR}(\Theta+\gamma v) - f^{PR}(\Theta)\big) = \lim_{\gamma\to 0}\frac{1}{\gamma}\big(f^{PR}(\tilde\Theta) - f^{PR}(\Theta)\big)$$
$$= \lim_{\gamma\to 0}\frac{1}{\gamma}\Big(\mathbb{E}\Big[\sum_{i=1}^k \ell(x,\tilde\theta_i)\,\mathbb{1}_{\mathcal{X}_i}(x)\,\Big|\,\Theta,\tilde\Theta\Big] - \mathbb{E}\Big[\sum_{i=1}^k \ell(x,\theta_i)\,\mathbb{1}_{\mathcal{X}_i}(x)\,\Big|\,\Theta\Big]\Big) + \lim_{\gamma\to 0}\frac{1}{\gamma}\,\mathbb{E}\Big[\sum_{i=1}^k \ell(x,\tilde\theta_i)\,\mathbb{1}_{\tilde{\mathcal{X}}_i\setminus\mathcal{X}_i}(x) - \sum_{i=1}^k \ell(x,\tilde\theta_i)\,\mathbb{1}_{\mathcal{X}_i\setminus\tilde{\mathcal{X}}_i}(x)\,\Big|\,\Theta,\tilde\Theta\Big]. \quad (11)$$

We look at the first two terms and the last two terms of Eq. (11) separately. We show that the first two terms form the directional derivative of a surrogate function that is easy to compute, and that the last two terms are $0$. To simplify the notation, we introduce a family of surrogate objectives parameterized by $\Theta'$:
$$F(\Theta;\Theta') = \sum_{i=1}^k a_i(\Theta')\,\mathbb{E}_{x\sim\mathcal{D}_i(\Theta')}[\ell(x,\theta_i)] = \mathbb{E}\Big[\sum_{i=1}^k \ell(x,\theta_i)\,\mathbb{1}_{\mathcal{X}_i(\Theta')}(x)\,\Big|\,\Theta,\Theta'\Big]. \quad (12)$$
Compared to $f^{PR}$, $F(\Theta;\Theta')$ simply fixes the subpopulations over which the expectations are taken, making them independent of $\Theta$. Once the subpopulations are fixed, the derivative is easy to compute. Note that $\{F(\cdot\,;\Theta') : \Theta'\in\mathbb{R}^{k\times d}\}$ forms a family of convex, $L$-Lipschitz and $\beta$-smooth functions, since each is a weighted sum of the convex, $L$-Lipschitz and $\beta$-smooth function $\ell(\cdot,\cdot)$. Moreover, taking the parameter $\Theta'$ to be $\Theta$, we have $F(\Theta;\Theta) = f^{PR}(\Theta)$.

Then, the first two terms of Eq. (11) can be written as
$$\lim_{\gamma\to 0}\frac{1}{\gamma}\Big(\mathbb{E}\Big[\sum_{i=1}^k \ell(x,\tilde\theta_i)\,\mathbb{1}_{\mathcal{X}_i}(x)\,\Big|\,\Theta,\tilde\Theta\Big] - \mathbb{E}\Big[\sum_{i=1}^k \ell(x,\theta_i)\,\mathbb{1}_{\mathcal{X}_i}(x)\,\Big|\,\Theta\Big]\Big) = \lim_{\gamma\to 0}\frac{1}{\gamma}\,\mathbb{E}\Big[\sum_{i=1}^k\big(\ell(x,\tilde\theta_i)-\ell(x,\theta_i)\big)\,\mathbb{1}_{\mathcal{X}_i}(x)\,\Big|\,\Theta,\tilde\Theta\Big]$$
$$= \lim_{\gamma\to 0}\frac{1}{\gamma}\big(F(\tilde\Theta;\Theta) - F(\Theta;\Theta)\big) = \lim_{\gamma\to 0}\frac{1}{\gamma}\big(F(\Theta+\gamma v;\Theta) - F(\Theta;\Theta)\big) = D_v F(\Theta;\Theta), \quad (13)$$
where the directional derivative $D_v F(\Theta;\Theta)$ is taken only over the first argument (i.e., the partition $\mathcal{X}_i$ is fixed).

Now let us look at the remaining two terms in Eq. (11). Note that for any point $x\in\mathcal{X}_i\setminus\tilde{\mathcal{X}}_i$ there must exist some $j\in[k]$, $j\ne i$, such that $x\in\tilde{\mathcal{X}}_j$: these are users that prefer model $i$ most compared to the other models, but prefer another model (some $j\in[k]$) under the new parameter tuple $\tilde\Theta$. Thus $\sum_{i=1}^k \ell(x,\tilde\theta_i)\,\mathbb{1}_{\mathcal{X}_i\setminus\tilde{\mathcal{X}}_i}(x) = \sum_{i,j\in[k],\,i\ne j}\ell(x,\tilde\theta_i)\,\mathbb{1}_{\mathcal{X}_i\to\tilde{\mathcal{X}}_j}(x)$, where
$$\mathbb{1}_{\mathcal{X}_i\to\tilde{\mathcal{X}}_j}(x) = \begin{cases}1, & \text{if } x\in(\mathcal{X}_i\setminus\tilde{\mathcal{X}}_i)\cap(\tilde{\mathcal{X}}_j\setminus\mathcal{X}_j)\\ 0, & \text{otherwise},\end{cases}$$
indicating users that prefer model $i$ under parameter tuple $\Theta$ but would choose model $j$ under the parameter tuple $\tilde\Theta$. Similarly, for any point $x\in\tilde{\mathcal{X}}_i\setminus\mathcal{X}_i$ there must exist some $j\in[k]$, $j\ne i$, such that $x\in\mathcal{X}_j$; thus $\sum_{i=1}^k\ell(x,\tilde\theta_i)\,\mathbb{1}_{\tilde{\mathcal{X}}_i\setminus\mathcal{X}_i}(x) = \sum_{i,j\in[k],\,i\ne j}\ell(x,\tilde\theta_i)\,\mathbb{1}_{\mathcal{X}_j\to\tilde{\mathcal{X}}_i}(x)$, indicating users attracted from other services $j\ne i$ to $i$ due to the parameter update from $\Theta$ to $\tilde\Theta$. Therefore, the remaining two terms in Eq. (11) can be rewritten as
$$\lim_{\gamma\to 0}\frac{1}{\gamma}\,\mathbb{E}\Big[\sum_{i=1}^k \ell(x,\tilde\theta_i)\,\mathbb{1}_{\tilde{\mathcal{X}}_i\setminus\mathcal{X}_i}(x) - \sum_{i=1}^k \ell(x,\tilde\theta_i)\,\mathbb{1}_{\mathcal{X}_i\setminus\tilde{\mathcal{X}}_i}(x)\,\Big|\,\Theta,\tilde\Theta\Big] = \lim_{\gamma\to 0}\frac{1}{\gamma}\,\mathbb{E}\Big[\sum_{i,j\in[k],\,i\ne j}\big(\ell(x,\tilde\theta_j) - \ell(x,\tilde\theta_i)\big)\,\mathbb{1}_{\mathcal{X}_i\to\tilde{\mathcal{X}}_j}(x)\,\Big|\,\Theta,\tilde\Theta\Big].$$
For any $x$ with $\mathbb{1}_{\mathcal{X}_i\to\tilde{\mathcal{X}}_j}(x) = 1$, i.e., $x\in(\mathcal{X}_i\setminus\tilde{\mathcal{X}}_i)\cap(\tilde{\mathcal{X}}_j\setminus\mathcal{X}_j)$, we have by definition $\ell(x,\theta_i) \le \ell(x,\theta_j)$ (under parameter $\Theta$, user $x$ prefers model $i$) and $\ell(x,\tilde\theta_j) \le \ell(x,\tilde\theta_i)$ (under parameter $\tilde\Theta$, user $x$ prefers model $j$). Hence
$$|\ell(x,\tilde\theta_i) - \ell(x,\tilde\theta_j)| = \ell(x,\tilde\theta_i) - \ell(x,\tilde\theta_j) \qquad (\text{since } \ell(x,\tilde\theta_j) \le \ell(x,\tilde\theta_i))$$
$$= \ell(x,\theta_i+\gamma v_i) - \ell(x,\theta_j+\gamma v_j) = \ell(x,\theta_i) + \gamma\,v_i^{\top}\nabla\ell(x,\theta_i) - \ell(x,\theta_j) - \gamma\,v_j^{\top}\nabla\ell(x,\theta_j) + O(\gamma^2) \quad (\text{Taylor expansion})$$
$$\le \gamma\big(v_i^{\top}\nabla\ell(x,\theta_i) - v_j^{\top}\nabla\ell(x,\theta_j)\big) + O(\gamma^2) \qquad (\text{since } \ell(x,\theta_i) \le \ell(x,\theta_j)).$$
It follows that
$$\lim_{\gamma\to 0}\frac{1}{\gamma}\,\Big|\mathbb{E}\Big[\sum_{i\ne j}\big(\ell(x,\tilde\theta_j)-\ell(x,\tilde\theta_i)\big)\,\mathbb{1}_{\mathcal{X}_i\to\tilde{\mathcal{X}}_j}(x)\,\Big|\,\Theta,\tilde\Theta\Big]\Big| \le \lim_{\gamma\to 0}\frac{1}{\gamma}\,\mathbb{E}\Big[\sum_{i\ne j}\big(\gamma\,(v_i^{\top}\nabla\ell(x,\theta_i) - v_j^{\top}\nabla\ell(x,\theta_j)) + O(\gamma^2)\big)\,\mathbb{1}_{\mathcal{X}_i\to\tilde{\mathcal{X}}_j}(x)\,\Big|\,\Theta,\tilde\Theta\Big]$$
$$= \lim_{\gamma\to 0}\mathbb{E}\Big[\sum_{i\ne j}\big(v_i^{\top}\nabla\ell(x,\theta_i) - v_j^{\top}\nabla\ell(x,\theta_j)\big)\,\mathbb{1}_{\mathcal{X}_i\to\tilde{\mathcal{X}}_j}(x)\,\Big|\,\Theta,\tilde\Theta\Big] = 0, \quad (14)$$
since $(\mathcal{X}_i\setminus\tilde{\mathcal{X}}_i)\cap(\tilde{\mathcal{X}}_j\setminus\mathcal{X}_j)$ decreases to a measure-zero set as $\gamma\to 0$.

Combining Eq. (12), Eq. (13) and Eq. (14), we have $D_v f^{PR}(\Theta) = D_v F(\Theta;\Theta)$, which implies that $\nabla_\Theta f^{PR}(\Theta) = \nabla_\Theta F(\Theta;\Theta')\big|_{\Theta'=\Theta}$. Note that $\nabla_{\theta_i}F(\Theta;\Theta') = a_i(\Theta')\,\mathbb{E}_{x\sim\mathcal{D}_i(\Theta')}[\nabla_{\theta_i}\ell(x,\theta_i)]$ when taking the derivative of $F(\Theta;\Theta')$ w.r.t. $\Theta$. We thus have
$$\nabla_{\theta_i}f^{PR}(\Theta) = a_i(\Theta)\,\mathbb{E}_{x\sim\mathcal{D}_i(\Theta)}[\nabla_{\theta_i}\ell(x,\theta_i)]. \qquad\square$$
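As an informal numerical check of the gradient formula just derived, the snippet below compares a central finite difference of $f^{PR}$ against $a_1(\Theta)\,\mathbb{E}_{x\sim\mathcal{D}_1(\Theta)}[\nabla\ell(x,\theta_1)]$ in the uniform-data, squared-loss setting of Appendix B.2; the evaluation point and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=1_000_000)   # x ~ U(0, 1)

def f_pr(t1, t2):
    return np.minimum((x - t1) ** 2, (x - t2) ** 2).mean()

t1, t2, eps = 0.2, 0.7, 1e-4
# Finite-difference estimate of the partial derivative of f_PR in theta_1.
fd = (f_pr(t1 + eps, t2) - f_pr(t1 - eps, t2)) / (2 * eps)
# Lemma 4.3 formula: a_1(Theta) * E_{x ~ D_1(Theta)}[grad l(x, theta_1)].
mask = (x - t1) ** 2 < (x - t2) ** 2         # x in X_1(Theta)
formula = mask.mean() * (2 * (t1 - x[mask])).mean()
print(fd, formula)                            # both approximately -0.0225
```

Both quantities come out near $-0.0225$ here, matching the closed-form value $a_1 \cdot 2(\theta_1 - \mathbb{E}_{\mathcal{D}_1}[x]) = 0.45 \cdot 2 \cdot (0.2 - 0.225)$.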
C.3. Proof of Lemma 5.3

Proof. To simplify the notation, write the model update at each round as $\theta^{t+1}_i = \theta^t_i - \eta^{t+1}_i\nabla\ell(x^{t+1},\theta^t_i)$, where $\eta^{t+1}_i \ne 0$ only when model $i$ was selected by the user, i.e.,
$$\eta^{t+1}_i = \begin{cases}\eta^{t+1}, & \text{if model } i \text{ is selected at time } t\\ 0, & \text{otherwise}.\end{cases}$$
By the $\beta$-smoothness of $f^{PR}$ (Lemma D.2),
$$f^{PR}(\Theta^{t+1}) \le f^{PR}(\Theta^t) + \langle\nabla f^{PR}(\Theta^t),\,\Theta^{t+1}-\Theta^t\rangle + \frac{\beta}{2}\|\Theta^{t+1}-\Theta^t\|^2 = f^{PR}(\Theta^t) + \sum_{i=1}^k\langle\nabla_{\theta_i}f^{PR}(\Theta^t),\,\theta^{t+1}_i-\theta^t_i\rangle + \frac{\beta}{2}\sum_{i=1}^k\|\theta^{t+1}_i-\theta^t_i\|^2$$
$$= f^{PR}(\Theta^t) - \sum_{i=1}^k\eta^{t+1}_i\Big\langle\nabla_{\theta_i}f^{PR}(\Theta^t),\ \frac{\nabla_{\theta_i}f^{PR}(\Theta^t)}{a_i(\Theta^t)} + \nabla\ell(x^{t+1},\theta^t_i) - \frac{\nabla_{\theta_i}f^{PR}(\Theta^t)}{a_i(\Theta^t)}\Big\rangle + \frac{\beta}{2}\sum_{i=1}^k(\eta^{t+1}_i)^2\|\nabla\ell(x^{t+1},\theta^t_i)\|^2$$
$$= f^{PR}(\Theta^t) - \underbrace{\sum_{i=1}^k\eta^{t+1}_i\,\frac{\|\nabla_{\theta_i}f^{PR}(\Theta^t)\|^2}{a_i(\Theta^t)}}_{A^{t+1}_{PR}} + \underbrace{\frac{\beta}{2}\sum_{i=1}^k(\eta^{t+1}_i)^2\|\nabla\ell(x^{t+1},\theta^t_i)\|^2}_{B^{t+1}_{PR}} - \underbrace{\sum_{i=1}^k\eta^{t+1}_i\Big\langle\nabla_{\theta_i}f^{PR}(\Theta^t),\ \nabla\ell(x^{t+1},\theta^t_i)-\frac{\nabla_{\theta_i}f^{PR}(\Theta^t)}{a_i(\Theta^t)}\Big\rangle}_{C^{t+1}_{PR}}.$$
Similarly, for $f^{NP}(\Theta)$, using $\mathbb{E}_{x\sim\mathcal{P}}[\nabla\ell(x,\theta_i)] = k\,\nabla_{\theta_i}f^{NP}(\Theta)$, we have
$$f^{NP}(\Theta^{t+1}) \le f^{NP}(\Theta^t) - \underbrace{k\sum_{i=1}^k\eta^{t+1}_i\,\|\nabla_{\theta_i}f^{NP}(\Theta^t)\|^2}_{A^{t+1}_{NP}} + \underbrace{\frac{\beta}{2}\sum_{i=1}^k(\eta^{t+1}_i)^2\|\nabla\ell(x^{t+1},\theta^t_i)\|^2}_{B^{t+1}_{NP}} - \underbrace{\sum_{i=1}^k\eta^{t+1}_i\big\langle\nabla_{\theta_i}f^{NP}(\Theta^t),\ \nabla\ell(x^{t+1},\theta^t_i)-k\,\nabla_{\theta_i}f^{NP}(\Theta^t)\big\rangle}_{C^{t+1}_{NP}}.$$
Adding them together, we have
$$f(\Theta^{t+1}) = (1-\zeta)f^{PR}(\Theta^{t+1}) + \zeta f^{NP}(\Theta^{t+1}) \le (1-\zeta)\big(f^{PR}(\Theta^t) - A^{t+1}_{PR} + B^{t+1}_{PR} - C^{t+1}_{PR}\big) + \zeta\big(f^{NP}(\Theta^t) - A^{t+1}_{NP} + B^{t+1}_{NP} - C^{t+1}_{NP}\big)$$
$$= f(\Theta^t) - (1-\zeta)A^{t+1}_{PR} - \zeta A^{t+1}_{NP} + (1-\zeta)B^{t+1}_{PR} + \zeta B^{t+1}_{NP} - (1-\zeta)C^{t+1}_{PR} - \zeta C^{t+1}_{NP}.$$
Letting $i$ denote the model chosen at time $t$, define $A^{t+1}$, $B^{t+1}$ and $C^{t+1}$ as
$$A^{t+1} = (1-\zeta)\,\eta^{t+1}\,\frac{\|\nabla_{\theta_i}f^{PR}(\Theta^t)\|^2}{a_i(\Theta^t)} + \zeta\,\eta^{t+1}\,k\,\|\nabla_{\theta_i}f^{NP}(\Theta^t)\|^2,\qquad B^{t+1} = \frac{\beta}{2}(\eta^{t+1})^2\|\nabla\ell(x^{t+1},\theta^t_i)\|^2,$$
$$C^{t+1} = (1-\zeta)\,\eta^{t+1}\Big\langle\nabla_{\theta_i}f^{PR}(\Theta^t),\ \nabla\ell(x^{t+1},\theta^t_i)-\frac{\nabla_{\theta_i}f^{PR}(\Theta^t)}{a_i(\Theta^t)}\Big\rangle + \zeta\,\eta^{t+1}\big\langle\nabla_{\theta_i}f^{NP}(\Theta^t),\ \nabla\ell(x^{t+1},\theta^t_i)-k\,\nabla_{\theta_i}f^{NP}(\Theta^t)\big\rangle.$$
Then, we have
$$f(\Theta^{t+1}) \le f(\Theta^t) - A^{t+1} + B^{t+1} - C^{t+1}. \quad (15) \qquad\square$$
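The one-step inequality (15) can also be checked numerically in the running uniform-data example. The sketch below estimates $a_i$, $\nabla_{\theta_i}f^{PR}$, $\nabla_{\theta_i}f^{NP}$ and $f$ by Monte Carlo, forms $A$, $B$, $C$ as reconstructed above (with $\beta = 2$ for the squared loss), and compares the two sides for a single random draw; all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, size=1_000_000)    # population sample, x ~ U(0, 1)
zeta, eta, beta, k = 0.2, 0.1, 2.0, 2        # beta = 2 for squared loss
theta = np.array([0.2, 0.7])

def grads(th):
    """Monte Carlo estimates of a_i, grad f_PR, grad f_NP per model."""
    part = np.argmin((x[:, None] - th[None, :]) ** 2, axis=1)
    a = np.array([(part == i).mean() for i in range(k)])
    g_pr = np.array([a[i] * (2 * (th[i] - x[part == i])).mean() for i in range(k)])
    g_np = np.array([(2 * (th[i] - x)).mean() / k for i in range(k)])
    return a, g_pr, g_np

def f(th):
    losses = (x[:, None] - th[None, :]) ** 2
    return (1 - zeta) * losses.min(axis=1).mean() + zeta * losses.mean()

x_new = rng.uniform()
i = int(rng.integers(k)) if rng.random() < zeta else int(np.argmin((x_new - theta) ** 2))
g_loss = 2 * (theta[i] - x_new)              # grad of loss at the new point
a, g_pr, g_np = grads(theta)
A = (1 - zeta) * eta * g_pr[i] ** 2 / a[i] + zeta * eta * k * g_np[i] ** 2
B = 0.5 * beta * eta ** 2 * g_loss ** 2
C = ((1 - zeta) * eta * g_pr[i] * (g_loss - g_pr[i] / a[i])
     + zeta * eta * g_np[i] * (g_loss - k * g_np[i]))
theta_next = theta.copy()
theta_next[i] -= eta * g_loss
print(f(theta_next), f(theta) - A + B - C)   # LHS <= RHS, as Eq. (15) requires
```

In this draw the left value sits below the right, up to Monte Carlo error, as the descent inequality requires.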
C.4. Proof of Lemma 5.4

Proof. Since the set of stationary points $\{\nabla f(\Theta) = 0\}$ is compact (Assumption 4) and $f$ is continuous, there exists $\epsilon_0$ such that $\{\|\nabla f\| \le \epsilon_0\}$ is also compact. For any $\epsilon\in(0,\epsilon_0)$, note that $\{\|\nabla_{\theta_i}f\| \le \frac{\epsilon}{2}\}$ and $\{\|\nabla_{\theta_i}f\| \in [\epsilon,\epsilon_0]\}$ are two closed, disjoint, compact subsets of $\{\|\nabla_{\theta_i}f\| \le \epsilon_0\}$ and can therefore be separated by some distance $R_0 > 0$. Without loss of generality, we assume that $r_0$ satisfies $r_0 \le \frac{R_0}{L}$. Fix $r\in(0,r_0)$. Given $(\epsilon, r)$, let $(T, t_0, s, c)$ be chosen so that
$$\Pr\Big(\sum_{\tau\le t<T(\tau)}\eta^{t+1}_i \ge s\,\Big|\,\mathcal{F}_\tau,\ \|\nabla_{\theta_i}f(\Theta^\tau)\|\in[\epsilon,\epsilon_0)\Big) > c \quad (16)$$
holds for any $\tau > t_0$. Fix a $\tau > t_0$ and denote the events
$$O_1 = \Big\{\sum_{\tau\le t<T(\tau)}\eta^{t+1}_i \ge s\Big\},\quad O_2 = \Big\{\Big|\sum_{\tau<t\le T(\tau)}(B^{t}-C^{t})\Big| \le \frac{s\epsilon^2}{8}-\delta\Big\},\quad O_3 = \big\{\|\nabla_{\theta_i}f(\Theta^\tau)\| > \epsilon\big\},\quad O_4 = \big\{\|\nabla_{\theta_i}f(\Theta^\tau)\|\in[\epsilon,\epsilon_0)\big\}.$$
Condition (16), together with the almost sure convergence of the noise series (Lemma D.3), implies $\Pr(O_1\cap O_2\mid\mathcal{F}_\tau, O_4) > c$ for all sufficiently large $\tau$; and since $O_4\subseteq O_3$, also $\Pr(O_1\cap O_2\cap O_3\mid\mathcal{F}_\tau, O_4) = \Pr(O_1\cap O_2\mid\mathcal{F}_\tau, O_4) > c$. Note that $O_3$ implies $\|\nabla_{\theta_i}f(\Theta^t)\| > \frac{\epsilon}{2}$ for all $\tau\le t\le T(\tau)$: by Lemma D.4, $\|\Theta^t-\Theta^\tau\| \le L\sum_{\tau<s\le t}\eta^s \le Lr \le R_0$, so the iterates cannot reach the set $\{\|\nabla_{\theta_i}f\| \le \frac{\epsilon}{2}\}$ within the interval. For any $\delta < \frac{s\epsilon^2}{8}$, let $O_5 = \{f(\Theta^{T(\tau)}) < f(\Theta^\tau) - \delta\}$.

In the following, we prove that $O_5$ occurs if $O_1\cap O_2\cap O_3$ occurs; i.e., the occurrence of $O_1\cap O_2\cap O_3$ implies that $f(\Theta^t)$ decreases by at least a constant amount on the same interval. Recall from Lemma 5.3 that
$$f(\Theta^t) \le f(\Theta^{t-1}) - A^t + B^t - C^t \le f(\Theta^{t-2}) - A^t - A^{t-1} + (B^t-C^t) + (B^{t-1}-C^{t-1}) \le \cdots \le f(\Theta^\tau) - \sum_{\tau<t'\le t}A^{t'} + \sum_{\tau<t'\le t}(B^{t'}-C^{t'}).$$
Since $\sum_t B^t < \infty$ and $\sum_t C^t$ converges almost surely (Lemma D.3), for any $\delta > 0$ there exists an $\mathbb{N}$-valued random variable $M_\delta$ such that the tail sums are controlled, and hence $O_2$ holds, for all $\tau \ge M_\delta$. On $O_1\cap O_2\cap O_3$, by Cauchy-Schwarz, $A^{t'} \ge \eta^{t'}_i\|\nabla_{\theta_i}f(\Theta^{t'-1})\|^2 \ge \eta^{t'}_i\,\frac{\epsilon^2}{4}$, so
$$f(\Theta^{T(\tau)}) \le f(\Theta^\tau) - \frac{\epsilon^2}{4}\sum_{\tau<t'\le T(\tau)}\eta^{t'}_i + \Big(\frac{s\epsilon^2}{8}-\delta\Big) \le f(\Theta^\tau) - \frac{s\epsilon^2}{4} + \frac{s\epsilon^2}{8} - \delta < f(\Theta^\tau) - \delta.$$
Hence $\Pr(O_5\mid\mathcal{F}_\tau, O_4) > c$. In other words, if $\|\nabla_{\theta_i}f(\Theta^\tau)\|$ is large at iteration $\tau$, then with positive probability $f(\Theta^t)$ decreases by at least $\delta$ from time step $\tau$ to $T(\tau)$, i.e.,
$$\Pr\big(f(\Theta^{T(\tau)}) < f(\Theta^\tau) - \delta\,\big|\,\mathcal{F}_\tau,\ \|\nabla_{\theta_i}f(\Theta^\tau)\|\in[\epsilon,\epsilon_0)\big) > c. \quad (18)$$
Since $f(\Theta^t)$ converges almost surely, this decrease of $\delta$ can only happen finitely many times, and by the Borel-Cantelli lemma (Lemma A.7), the event $O_4 = \{\|\nabla_{\theta_i}f(\Theta^\tau)\|\in[\epsilon,\epsilon_0)\}$ must also occur only finitely often. We prove this by contradiction: assume that $O_4$ happens infinitely often. Then we can define the infinite sequence of stopping times $\tau_0 = \max\{t_0, M_\delta\}$ and $\tau_{j+1} = \inf\{t \ge T(\tau_j) : \|\nabla_{\theta_i}f(\Theta^t)\|\in[\epsilon,\epsilon_0)\}$. Then by the second Borel-Cantelli lemma (Lemma A.7), $f(\Theta^t)$ decreases by the constant amount $\delta$ infinitely often, contradicting the convergence of $f(\Theta^t)$.

To complete the proof, we also have to show that the iterates do not remain in, or keep returning to, the set $\{\|\nabla_{\theta_i}f\| \ge \epsilon_0\}$. Consider the following two cases.

Case 1: the iterates never leave the set $\{\|\nabla_{\theta_i}f\| \ge \epsilon_0\}$. Suppose there exists $\bar T$ such that $\|\nabla_{\theta_i}f(\Theta^\tau)\| \ge \epsilon_0$ for all $\tau > \bar T$. Then, by the verified condition of Lemma 5.4,
$$\Pr\Big(\sum_{\tau\le t<T(\tau)}\eta^{t+1}_i \ge s\,\Big|\,\mathcal{F}_\tau\Big) > c \quad \text{for all } \tau > \max\{t_0, \bar T\}.$$
By the second Borel-Cantelli lemma (Lemma A.7), there are almost surely infinitely many intervals $\tau\le t<T(\tau)$ on which $\sum_t \eta^{t+1}_i \ge s$, so the total sum is infinite almost surely, i.e., $\sum_{t=0}^\infty \eta^t_i = \infty$. Repeating the decrease argument above with gradient norm at least $\epsilon_0$, $f$ then decreases by an amount proportional to $s\epsilon_0^2$ (up to the summable noise) on infinitely many disjoint intervals, so $\liminf_{t\to\infty} f(\Theta^t) = -\infty$, contradicting $f \ge 0$.

Case 2: the iterates exit and re-enter the set $\{\|\nabla_{\theta_i}f\| \ge \epsilon_0\}$ infinitely often. Re-entering requires passing through the band $\{\|\nabla_{\theta_i}f\|\in[\epsilon,\epsilon_0)\}$, since consecutive iterates are close together (Lemma D.4) and $\nabla f$ is continuous; as shown above, this can happen only finitely often, again a contradiction.

Therefore, for any $\epsilon > 0$, we almost surely have $\|\nabla_{\theta_i}f(\Theta^t)\| > \epsilon$ only finitely often. $\square$

D. Supporting Lemmas

Proposition D.1. Let
$$f^{PR}(\Theta) = \sum_{i=1}^k a_i(\Theta)\,\mathbb{E}_{x\sim\mathcal{D}_i(\Theta)}[\ell(x,\theta_i)]$$
be the perfect rationality objective in Eq. (1), and let
$$F(\Theta;\Theta') = \sum_{i=1}^k a_i(\Theta')\,\mathbb{E}_{x\sim\mathcal{D}_i(\Theta')}[\ell(x,\theta_i)]$$
be the surrogate function introduced in Eq. (12). Then, for all $\Theta, \Theta'\in\mathbb{R}^{k\times d}$, we have $f^{PR}(\Theta) \le F(\Theta;\Theta')$.

Proof. By the definition of $\mathcal{X}(\Theta)$, given $\Theta$, $\mathcal{X}(\Theta)$ is the partition that minimizes $f^{PR}$; namely, for any $\Theta'\ne\Theta$, we have
$$\sum_{i=1}^k a_i(\Theta)\,\mathbb{E}_{x\sim\mathcal{D}_i(\Theta)}[\ell(x,\theta_i)] \le \sum_{i=1}^k a_i(\Theta')\,\mathbb{E}_{x\sim\mathcal{D}_i(\Theta')}[\ell(x,\theta_i)].$$
To see this, note that for any data point $x\sim\mathcal{P}$, the LHS always assigns $x$ the best model among $(\theta_1,\dots,\theta_k)$, so the LHS is equivalent to $\mathbb{E}_{x\sim\mathcal{P}}[\min_{i\in[k]}\ell(x,\theta_i)\mid\Theta]$, while for the RHS we have
$$\text{RHS} = \mathbb{E}_{x\sim\mathcal{P}}\Big[\min_{i\in[k]}\ell(x,\theta_i)\,\mathbb{1}_{\mathcal{X}\cap\mathcal{X}'}(x)\,\Big|\,\Theta,\Theta'\Big] + \mathbb{E}_{x\sim\mathcal{P}}\Big[\sum_{i,j\in[k],\,i\ne j}\ell(x,\theta_j)\,\mathbb{1}_{\mathcal{X}_i\cap\mathcal{X}'_j}(x)\,\Big|\,\Theta,\Theta'\Big]$$
$$\ge \mathbb{E}_{x\sim\mathcal{P}}\Big[\min_{i\in[k]}\ell(x,\theta_i)\,\mathbb{1}_{\mathcal{X}\cap\mathcal{X}'}(x)\,\Big|\,\Theta,\Theta'\Big] + \mathbb{E}_{x\sim\mathcal{P}}\Big[\sum_{i,j\in[k],\,i\ne j}\ell(x,\theta_i)\,\mathbb{1}_{\mathcal{X}_i\cap\mathcal{X}'_j}(x)\,\Big|\,\Theta,\Theta'\Big] = \mathbb{E}_{x\sim\mathcal{P}}\big[\min_{i\in[k]}\ell(x,\theta_i)\,\big|\,\Theta\big] = \text{LHS},$$
where
$$\mathbb{1}_{\mathcal{X}\cap\mathcal{X}'}(x) = \begin{cases}1, & \text{if } x\in\bigcup_{i\in[k]}(\mathcal{X}_i(\Theta)\cap\mathcal{X}_i(\Theta'))\\ 0, & \text{otherwise}\end{cases},\qquad \mathbb{1}_{\mathcal{X}_i\cap\mathcal{X}'_j}(x) = \begin{cases}1, & \text{if } x\in\mathcal{X}_i(\Theta)\cap\mathcal{X}_j(\Theta')\\ 0, & \text{otherwise}.\end{cases}$$
$\mathbb{1}_{\mathcal{X}\cap\mathcal{X}'}$ indicates users in the set $\bigcup_{i\in[k]}(\mathcal{X}_i(\Theta)\cap\mathcal{X}_i(\Theta'))$, i.e., users that choose the model with the smallest loss, while $\mathbb{1}_{\mathcal{X}_i\cap\mathcal{X}'_j}$ is the indicator for the set $\mathcal{X}_i(\Theta)\cap\mathcal{X}_j(\Theta')$, i.e., users that should choose model $i$ (which has the smallest loss for them) but incorrectly choose some other model $j$. Therefore, we have $f^{PR}(\Theta) = F(\Theta;\Theta) \le F(\Theta;\Theta')$. $\square$
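Proposition D.1 is easy to illustrate numerically: fixing the partition by any $\Theta'$ can only increase the expected loss relative to the optimal partition. The sketch below does so for the running uniform-data example; the parameter tuples are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=500_000)       # x ~ U(0, 1)

def losses(theta):
    return (x[:, None] - np.asarray(theta)[None, :]) ** 2

def F(theta, theta_prime):
    """Surrogate F(Theta; Theta'): expectation over the partition fixed by Theta'."""
    part = losses(theta_prime).argmin(axis=1)  # X_i(Theta')
    L = losses(theta)
    return L[np.arange(x.size), part].mean()

theta = (0.3, 0.8)
print(F(theta, theta))                         # equals f_PR(theta)
for tp in [(0.1, 0.9), (0.5, 0.6), (0.8, 0.3)]:
    print(F(theta, tp) >= F(theta, theta))     # True: fixing a partition can't help
```

The comparison holds pointwise (each sample's loss under a fixed assignment is at least the minimum loss), so the printed inequalities hold exactly, not just on average.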
Lemma D.2. $f^{NP}$ is $\beta$-smooth. Moreover, for all $\Theta, \Theta^+\in\mathbb{R}^{k\times d}$, we also have
$$f^{PR}(\Theta^+) \le f^{PR}(\Theta) + \langle\nabla f^{PR}(\Theta),\,\Theta^+-\Theta\rangle + \frac{\beta}{2}\|\Theta^+-\Theta\|^2;$$
namely, $f^{PR}$ is also $\beta$-smooth.

Proof. Since $f^{NP}$ is a sum of $\beta$-smooth functions, $f^{NP}$ is also $\beta$-smooth. Now we prove that $f^{PR}$ is $\beta$-smooth. To show this, we need Proposition D.1, stating that $f^{PR}(\Theta)$ is upper bounded by the surrogate functions $\{F(\cdot\,;\Theta') : \Theta'\in\mathbb{R}^{k\times d}\}$ introduced in Eq. (12). Then, for any $\Theta, \Theta^+\in\mathbb{R}^{k\times d}$, we have
$$f^{PR}(\Theta^+) \le F(\Theta^+;\Theta) \le F(\Theta;\Theta) + \nabla_\Theta F(\Theta;\Theta)^{\top}(\Theta^+-\Theta) + \frac{\beta}{2}\|\Theta^+-\Theta\|^2 \quad (21)$$
$$= f^{PR}(\Theta) + \nabla f^{PR}(\Theta)^{\top}(\Theta^+-\Theta) + \frac{\beta}{2}\|\Theta^+-\Theta\|^2, \quad (22)$$
where Eq. (21) uses the fact that $F(\cdot\,;\Theta')$ is $\beta$-smooth for every $\Theta'\in\mathbb{R}^{k\times d}$, and Eq. (22) follows because $f^{PR}(\Theta) = F(\Theta;\Theta)$ and $\nabla f^{PR}(\Theta) = \nabla_\Theta F(\Theta;\Theta)$. Thus, $f^{PR}(\Theta)$ is $\beta$-smooth. $\square$

Lemma D.3. Let $(\mathcal{F}_t)_{t=0}^\infty$ be the filtration given by Definition A.1. Let $B^{t+1} = \mathbb{E}_s[B^{t+1}_s] = (1-\zeta)B^{t+1}_{PR} + \zeta B^{t+1}_{NP}$ and $C^{t+1} = \mathbb{E}_s[C^{t+1}_s] = (1-\zeta)C^{t+1}_{PR} + \zeta C^{t+1}_{NP}$, where
$$B^{t+1}_s = \begin{cases}B^{t+1}_{PR} & \text{w.p. } 1-\zeta\\ B^{t+1}_{NP} & \text{w.p. } \zeta\end{cases},\qquad C^{t+1}_s = \begin{cases}C^{t+1}_{PR} & \text{w.p. } 1-\zeta\\ C^{t+1}_{NP} & \text{w.p. } \zeta\end{cases},$$
with
$$B^{t+1}_{PR} = B^{t+1}_{NP} = \frac{\beta}{2}\sum_{i=1}^k(\eta^{t+1}_i)^2\|\nabla\ell(x^{t+1},\theta^t_i)\|^2,$$
$$C^{t+1}_{PR} = \sum_{i=1}^k\eta^{t+1}_i\Big\langle\nabla_{\theta_i}f^{PR}(\Theta^t),\ \nabla\ell(x^{t+1},\theta^t_i)-\frac{\nabla_{\theta_i}f^{PR}(\Theta^t)}{a_i(\Theta^t)}\Big\rangle,\qquad C^{t+1}_{NP} = \sum_{i=1}^k\eta^{t+1}_i\big\langle\nabla_{\theta_i}f^{NP}(\Theta^t),\ \nabla\ell(x^{t+1},\theta^t_i)-k\,\nabla_{\theta_i}f^{NP}(\Theta^t)\big\rangle,$$
and where the expectation $\mathbb{E}_s$ is taken over users' random selection over services. Let $N^{t+1} = B^{t+1} - C^{t+1}$. Then:
1. $(C^t)_{t=0}^\infty$ is an $\mathcal{F}_t$-martingale difference sequence, and $\mathbb{E}[C^{t+1}\mid\mathcal{F}_t] = 0$.
2. Suppose that $\sum_{i\in[k]}\sum_{t=1}^\infty(\eta^t_i)^2 < \infty$ a.s. Then the series $\sum_{t=1}^\infty B^t < \infty$ converges almost surely.
3. The series $\sum_{t=1}^\infty N^t = \sum_{t=1}^\infty B^t - \sum_{t=1}^\infty C^t$ converges almost surely.

Proof. (1) Proof of $(C^t)_{t=1}^\infty$ being a martingale difference sequence. Take the expectation of $C^{t+1}_s$ conditioned on $\mathcal{F}_t$:
$$\mathbb{E}[C^{t+1}_s\mid\mathcal{F}_t] = (1-\zeta)\underbrace{\sum_{i=1}^k\Big\langle\nabla_{\theta_i}f^{PR}(\Theta^t),\ \mathbb{E}[\nabla\ell(x^{t+1},\theta^t_i)\,\eta^{t+1}_i\mid\mathcal{F}_t] - \frac{\nabla_{\theta_i}f^{PR}(\Theta^t)}{a_i(\Theta^t)}\,\mathbb{E}[\eta^{t+1}_i\mid\mathcal{F}_t]\Big\rangle}_{\text{user chooses with perfect rationality}} + \zeta\underbrace{\sum_{i=1}^k\big\langle\nabla_{\theta_i}f^{NP}(\Theta^t),\ \mathbb{E}[\nabla\ell(x^{t+1},\theta^t_i)\,\eta^{t+1}_i\mid\mathcal{F}_t] - k\,\nabla_{\theta_i}f^{NP}(\Theta^t)\,\mathbb{E}[\eta^{t+1}_i\mid\mathcal{F}_t]\big\rangle}_{\text{user chooses with no preference}}.$$
Conditioned on the user choosing with perfect rationality, we have $\mathbb{E}[\eta^{t+1}_i\mid\mathcal{F}_t] = a^t_i\,\eta^{t+1}$, and thus
$$\mathbb{E}[\nabla\ell(x^{t+1},\theta^t_i)\,\eta^{t+1}_i\mid\mathcal{F}_t] = a_i(\Theta^t)\,\mathbb{E}_{x\sim\mathcal{D}_i(\Theta^t)}[\nabla_{\theta_i}\ell(x,\theta_i)]\,\eta^{t+1} = \nabla_{\theta_i}f^{PR}(\Theta^t)\,\eta^{t+1} = \frac{\nabla_{\theta_i}f^{PR}(\Theta^t)}{a_i(\Theta^t)}\,\mathbb{E}[\eta^{t+1}_i\mid\mathcal{F}_t],$$
so the first term vanishes. Conditioned on the user choosing randomly among the models, we have $\mathbb{E}[\eta^{t+1}_i\mid\mathcal{F}_t] = \frac{1}{k}\,\eta^{t+1}$, which leads to
$$k\,\nabla_{\theta_i}f^{NP}(\Theta^t)\,\mathbb{E}[\eta^{t+1}_i\mid\mathcal{F}_t] = \frac{\eta^{t+1}}{k}\,\mathbb{E}_{x\sim\mathcal{P}}[\nabla\ell(x,\theta_i)] = \mathbb{E}[\nabla\ell(x^{t+1},\theta^t_i)\,\eta^{t+1}_i\mid\mathcal{F}_t], \quad (23)$$
thus we also get
$$k\,\nabla_{\theta_i}f^{NP}(\Theta^t)\,\mathbb{E}[\eta^{t+1}_i\mid\mathcal{F}_t] = \mathbb{E}[\nabla\ell(x^{t+1},\theta^t_i)\,\eta^{t+1}_i\mid\mathcal{F}_t], \quad (24)$$
so the second term vanishes as well. In the end, we have $\mathbb{E}[C^{t+1}_s\mid\mathcal{F}_t] = 0$ in both cases, and thus $\mathbb{E}[C^{t+1}\mid\mathcal{F}_t] = 0$.

(2) Proof that $\sum_{t=1}^\infty B^t < \infty$ converges. Given $\sum_{i\in[k]}\sum_{t=1}^\infty(\eta^t_i)^2 < \infty$, we have
$$\sum_{t=1}^\infty B^t_{PR} = \frac{\beta}{2}\sum_{t=1}^\infty\sum_{i=1}^k(\eta^t_i)^2\|\nabla\ell(x^t,\theta^{t-1}_i)\|^2 \le \frac{\beta L^2}{2}\sum_{i=1}^k\sum_{t=1}^\infty(\eta^t_i)^2 < \infty.$$
Similarly, $\sum_{t=1}^\infty B^t_{NP} < \infty$. Thus, $\sum_{t=1}^\infty B^t$ converges almost surely.

(3) Proof that $\sum_{t=1}^\infty N^t$ converges. Let $H^t = \sum_{\tau=1}^t C^\tau$ and let $H^t_+ = \max\{0, H^t\}$. Since the terms of a martingale difference sequence are orthogonal, we have for all $t\in\mathbb{N}$:
$$\mathbb{E}[H^t_+] \le \sqrt{\mathbb{E}[(H^t)^2]} = \sqrt{\sum_{\tau=1}^t\mathbb{E}[(C^\tau)^2]}. \quad (25)$$
Moreover,
$$\Big|\Big\langle\nabla_{\theta_i}f^{PR}(\Theta^t),\ \nabla\ell(x^{t+1},\theta^t_i)-\frac{\nabla_{\theta_i}f^{PR}(\Theta^t)}{a_i(\Theta^t)}\Big\rangle\Big| \le \|\nabla_{\theta_i}f^{PR}(\Theta^t)\|\,\|\nabla\ell(x^{t+1},\theta^t_i)\| + \frac{\|\nabla_{\theta_i}f^{PR}(\Theta^t)\|^2}{a_i(\Theta^t)} \le 2L^2,$$
using $\|\nabla_{\theta_i}f^{PR}\| \le a_i L$. Similarly, we can also get $|\langle\nabla_{\theta_i}f^{NP}(\Theta^t),\,\nabla\ell(x^{t+1},\theta^t_i)-k\nabla_{\theta_i}f^{NP}(\Theta^t)\rangle| \le 2L^2$. We thus have
$$\sum_{t=1}^\infty\mathbb{E}[(C^t)^2] \le \sum_{t=1}^\infty\mathbb{E}\Big[\Big(\sum_{i=1}^k\eta^t_i\cdot 2L^2\Big)^2\Big] = 4L^4\sum_{t=1}^\infty\mathbb{E}\Big[\sum_{i=1}^k(\eta^t_i)^2\Big] < \infty,$$
where the equality uses that at most one $\eta^t_i$ is nonzero at each round. Combining this with Eq. (25), it implies that $\sup_{t\in\mathbb{N}}\mathbb{E}[H^t_+] < \infty$. By the martingale convergence theorem (Theorem A.6), $H^t$ converges as $t\to\infty$, and thus $\sum_{t=1}^\infty C^t$ converges. Moreover, due to the convergence of the series $\sum_{t=1}^\infty B^t < \infty$, the series $\sum_{t=1}^\infty N^t = \sum_{t=1}^\infty B^t - \sum_{t=1}^\infty C^t$ converges almost surely. $\square$
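The summability hypothesis in item 2 holds for the $\eta^{t+1} = \eta_c/(t+1)$ schedule used in the convergence analysis, since $\eta^t_i \le \eta^t$. A quick numerical illustration is below; the value of $\eta_c$ is arbitrary.

```python
import numpy as np

eta_c = 0.5
t = np.arange(1, 1_000_001)
eta = eta_c / (t + 1)                 # step size eta^{t+1} = eta_c / (t + 1)
print(eta.sum())                      # partial sums grow like eta_c * ln t (diverge)
print((eta ** 2).sum(),               # converges ...
      eta_c ** 2 * (np.pi ** 2 / 6 - 1))  # ... toward its limit eta_c^2 (pi^2/6 - 1)
```

The divergence of $\sum_t \eta^t$ is what lets the iterates keep moving (Case 1 in Appendix C.4), while $\sum_t(\eta^t)^2 < \infty$ is what makes the noise series $B^t$ and $C^t$ summable.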
Lemma D.4. For all $t < t'$, the displacement between $\Theta^t$ and $\Theta^{t'}$ satisfies
$$\|\Theta^{t'} - \Theta^t\| \le L\sum_{t<\tau\le t'}\eta^\tau.$$
Proof. The displacement between $\Theta^t$ and $\Theta^{t'}$ can be bounded as
$$\|\Theta^{t'} - \Theta^t\| \le \sum_{t<\tau\le t'}\|\Theta^\tau - \Theta^{\tau-1}\| \quad (\text{Minkowski's inequality})$$
$$= \sum_{t<\tau\le t'}\big\|\eta^\tau_{i_\tau}\nabla\ell(x^\tau,\theta^{\tau-1}_{i_\tau})\big\| \le \sum_{t<\tau\le t'}\eta^\tau\,\big\|\nabla\ell(x^\tau,\theta^{\tau-1}_{i_\tau})\big\| \le L\sum_{t<\tau\le t'}\eta^\tau \quad (L\text{-Lipschitzness of }\ell),$$
where $i_\tau$ denotes the model selected at round $\tau$; only the selected model moves at each round. $\square$

Lemma D.5. Let $1 < \tau < \tau'$ with $\tau, \tau'\in\mathbb{N}$. Then
$$\ln\frac{\tau'+1}{\tau+1} \le \sum_{\tau\le t<\tau'}\frac{1}{t+1} \le \ln\frac{\tau'}{\tau}.$$
Proof. Simply note that
$$\ln\frac{\tau'+1}{\tau+1} = \int_\tau^{\tau'}\frac{1}{x+1}\,dx \le \sum_{\tau\le t<\tau'}\frac{1}{t+1} \le \int_\tau^{\tau'}\frac{1}{x}\,dx = \ln\frac{\tau'}{\tau}. \qquad\square$$

Corollary D.6. Let $r > 0$ and $\alpha = e^r - 1$, and set $T := T_r$. Then
$$\alpha\tau \le T(\tau) - \tau \le \alpha(\tau+1). \quad (27)$$
Proof. Replacing $\tau'$ by $T(\tau)$ in Lemma D.5 and using the definition of $T_r$, we get
$$\ln\frac{T(\tau)+1}{\tau+1} \le \sum_{\tau\le t<T(\tau)}\frac{1}{t+1} < r \le \sum_{\tau\le t\le T(\tau)}\frac{1}{t+1} \le \ln\frac{T(\tau)+1}{\tau}.$$
Exponentiating both outer inequalities and rearranging yields Eq. (27). $\square$
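As an informal check of Corollary D.6, the snippet below computes $T(\tau)$ as the first index at which the partial sums $\sum_{\tau\le t\le T}\frac{1}{t+1}$ reach $r$; this is an assumed reconstruction of $T_r$ consistent with the proof above (the exact definition appears in the main text).

```python
import math

def T(tau, r):
    """Assumed reconstruction of T_r(tau): first T >= tau with
    sum_{tau <= t <= T} 1/(t+1) >= r."""
    total, t = 0.0, tau
    while True:
        total += 1.0 / (t + 1)
        if total >= r:
            return t
        t += 1

r = 1.0
alpha = math.exp(r) - 1
for tau in (10, 100, 1000, 10000):
    gap = T(tau, r) - tau
    print(tau, gap, round(alpha * tau, 1), round(alpha * (tau + 1), 1))
```

In these runs the gap $T(\tau)-\tau$ tracks the interval $[\alpha\tau,\ \alpha(\tau+1)]$ to within one unit; the residual off-by-one depends on exactly how the cutoff $T_r$ rounds, which Eq. (27) absorbs.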