# iterated_learning_in_dynamic_social_networks__e1692791.pdf Journal of Machine Learning Research 20 (2019) 1-28 Submitted 8/18; Revised 12/18; Published 1/19 Iterated Learning in Dynamic Social Networks Bernard Chazelle CHAZELLE@CS.PRINCETON.EDU Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ 08544 Chu Wang CHUWANG@AMAZON.COM Amazon Inc. 500 Boren Avenue, Seattle, WA 98109 Editor: Mehryar Mohri A classic finding by (Kalish et al., 2007) shows that no language can be learned iteratively by rational agents in a self-sustained manner. In other words, if A teaches a foreign language to B, who then teaches what she learned to C, and so on, the language will quickly get lost and agents will wind up teaching their own common native language. If so, how can linguistic novelty ever be sustained? We address this apparent paradox by considering the case of iterated learning in a social network: we show that by varying the lengths of the learning sessions over time or by keeping the networks dynamic, it is possible for iterated learning to endure forever with arbitrarily small loss. 1. Introduction People typically form opinions by updating their current beliefs and reasons in response to new signals from other sources (friends, colleagues, social media, newspapers, etc.) (Tahbaz-Salehi et al., 2009; Acemoglu and Ozdaglar, 2011; Golub and Jackson, 2010). Suppose there were an information source that made a noisy version of the truth available to agents connected through a social network. Under which conditions would the agents reach consensus about their beliefs? What would ensure truthful consensus (meaning that the consensus coincided with the truth)? How long would it take for the process to converge? Addressing these questions requires agreeing on a formal model of distributed learning. Fully rational agents update their beliefs by assuming a prior and using Bayes rule to integrate all past information available to them (Acemoglu et al., 2011; Mueller-Frank, 2013; Lobel and Sadler, 2015; Mossel et al., 2011; Banerjee, 1992; Bala and Goyal, 1998). Full rationality is intractable in practice (Molavi et al., 2015; Rahimian and Jadbabaie, 2016a), so much effort has been devoted to developing computationally effective mechanisms, including non- (or partially) Bayesian methods (Jadbabaie et al., 2012; Molavi et al., 2015; Golub and Jackson, 2010, 2012; Jadbabaie et al., 2013). Much of this line of work can be traced back to the seminal work of (De Groot, 1974) on linear opinion pooling. As the simplest example of iterated learning in a social network, consider a system consisting of one teacher and one learner. The teacher samples data from a distribution and sends it to the learner; the learner updates her belief via Bayes rule repeatedly in order to learn that distribution. This system is equivalent to the usual Bayesian inference scenario. Under mild assumptions, the learner will eventually learn the ground truth asymptotically (Gelman et al., 2013). . Chu Wang did this work prior to joining Amazon c 2019 Bernard Chazelle and Chu Wang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/18-539.html. CHAZELLE AND WANG When the network structure comes into play, the dynamics of the learning process becomes more complicated. If the social network forms a chain X, Y, Z, . . . such that agents teach each other in sequence: X teaches Y, who then teaches Z, and so on, by a classic result of Griffiths and Kalish (Griffiths and Kalish, 2005), the information from the source will vanish after a finite number of iterations. At that point, the agents, assumed to be rational, will be teaching each other nothing they don t already know: iterated learning is not self-sustaining (Beppu and Griffiths, 2009; Griffiths and Kalish, 2007, 2005; Rafferty et al., 2009; Kirby et al., 2014; Griffiths et al., 2008; Perfors and Navarro, 2011; Rafferty et al., 2014; Smith, 2009; Kalish et al., 2007). Such findings are hard to validate empirically but variants of it are within the reach of experimental psychology (Kalish et al., 2007). Similar laboratory experiments with human subjects have confirmed the unstainability of iterated learning (Kalish et al., 2007; Beppu and Griffiths, 2009; Tamariz and Kirby, 2015; Bartlett; Griffiths et al., 2008). Similar results of unstainability are found in computational linguistics where, instead of agents sending information, the scenario is about language evolution through generations (Rafferty et al., 2009). If agents interact in a more complicated and dynamic network, more possibilities of the learning dynamics will emerge. Dynamic networks are common occurrences in opinion dynamics (Hegselmann and Krause, 2002; Mohajer and Touri, 2013; Chazelle and Wang, 2016; Chazelle, 2015), but, to our knowledge, somewhat new in the context of social learning. Following the Bayesian-Without Recall (BWR) model proposed by Rahimian and Jadbabaie in (Rahimian and Jadbabaie, 2016a), we assume the agents to be memoryless and rational: this means that they use Bayesian updates based on current beliefs and signals with no other information from the past, see also (Rahimian et al., 2015a,b; Rahimian and Jadbabaie, 2016b). The BWR model seeks to capture the benefits of rational behavior while keeping both the computation and the information stored to a minimum (Rahimian and Jadbabaie, 2016b). We focus in this paper on sustainable learning: the conditions to ensure arbitrarily small information loss in truthful consensus (formal definition in the following sections). For chained learning, we show how keeping the length of the training sessions (number of samples transferred) growing slightly allows iterated learning to be sustained in perpetuity. This resolves the paradox raised from language evolution models (Kalish et al., 2007). We further analyze the case when the learner has the ability to reach back to her early ancestors for fresher data instead of listening to her direct ancestor, and show how this hopped learning mechanism further helps prevent information decay. For the case when the learning network changes over time, we show that under the assumption that each agent hears a noisy signal from the truth at a frequency bounded away from zero, the system reaches truthful consensus almost surely with a convergence rate polynomial in expectation. The relation between the convergence rate and the graph structure is also revealed with a seemingly counter-intuitive finding that agents in a better connected network learn more slowly. We first introduce the iterated learning model framework with the notation, definitions, and basic properties in Section 2. Then we exam the scenarios of chained learning, hopped learning, and networked learning in Section 3, Section 4, and Section 5, respectively. We show our main results in each section, followed by details of the proofs and discussions. 2. Models, Preliminaries, and Notation In this section, we formally define the problem mathematically. Assume there are n agents denoted by their indices 1, 2, . . . , n. At time t = 0, 1, . . ., the belief of agent i is a probability distribution ITERATED LEARNING IN DYNAMIC SOCIAL NETWORKS over a state space Θ, which is denoted by µt,i. The interactions between agents are modeled by an infinite sequence (Gt)t 0, where each Gt is a directed graph over the node set {1, . . . , n}. An edge pointing from i to j in Gt indicates that i receives data from j at time t. Intuitively, the direction of the edge has the same meaning as the listen-to activity. Typically, the sequence of graphs is specified ahead of time or is chosen randomly: the only condition that matters is that it should be independent of the randomness used in the data generating and learning process; specifically, taking expectations and variances of the random variables that govern the dynamics will assume a fixed graph sequence (possibly random). The adjacency matrix of Gt is denoted by At: it is an n n 0/1 matrix. 2.1. The existence of an information source When an information source exists whose belief is fixed, we label it agent 0 and refer to it as the truth. In such a case, the graph Gt is over the node set {0, 1, . . . , n}. Because agent 0 (if it exists) holds the truth, no edge out of it points to another node. The adjacency matrix then becomes an (n + 1) (n + 1) matrix whose first row is (1, 0, . . . , 0), with a self-loop at agent 0 for simplicity. 2.2. Data generation At time t 0, each agent i > 0 samples a state θt,i R consistent with her own belief: θt,i µt,i. A noisy measurement at,i = θt,i + εt,i is then sent to each agent j such that (At)ji = 1. All the noise terms εt,i are sampled iid from N(0, σ2). An equivalent formulation is to say that the likelihood function l(a|θ) is drawn from N(θ, σ2). In our setting, agent i sends the same data to all of her neighbors; this is done for notational convenience and the same results would still hold if we were to resample independently for each neighbor. Except for the omission of explicit utilities and actions, our setting is easily identified as a variant of the BWR model of (Rahimian and Jadbabaie, 2016a). 2.3. Beliefs update A single-step update for agent i > 0 consists of setting µt+1,i as the posterior P[µt+1,i|d] P[d|µt,i]P[µt,i], where d is the data from the neighbors of i received at time t. For the case when the beliefs are Gaussian, we get the classical update rules from Bayesian inference by plugging in the corresponding normal distribution (Box and Tiao, 2011). Updated beliefs remain Gaussian so we can use the notation µt,i N(xt,i, τ 1 t,i ), where τt,i denotes the precision (inverse variance) σ 2 t,i . Writing τ = σ 2 and letting dt,i denote the outdegree of i in Gt, for any i > 0 and t 0, we have ( xt+1,i = (τt,ixt,i + τat,1 + + τat,dt,i)/(τt,i + dt,iτ); τt+1,i = τt,i + dt,iτ, (1) where at,1, . . . , at,dt,i are the signals received by agent i from its neighbors at time t. 2.4. The influence of the graph sequence The graph sequence Gt plays a crucial role in the dynamics of the system. A simple starting example is the constant graph of two agents with one directed edge. Such a graph sequence defines the case where one learner repeatedly gets samples from the information source. If the information source is regarded as the ground truth, this system is equivalent to the usual Bayesian inference CHAZELLE AND WANG scenario. Instead of exhausting all possible graph sequences, in this paper, we focus on three representative types, namely the chained learning, hopped learning, and networked learning models. The fundamental problem we would like to solve is whether the system is able to converge to the truth; in other words, whether the truthful information is able to propagate uncorruptedly across the entire system. 3. Chained Learning Following (Beppu and Griffiths, 2009; Griffiths and Kalish, 2007, 2005; Rafferty et al., 2009; Kirby et al., 2014; Griffiths et al., 2008; Perfors and Navarro, 2011; Rafferty et al., 2014; Smith, 2009; Kalish et al., 2007), we begin with chained iterated learning: a learner s state of belief is modeled by a distribution over a hypothesis space H, which is itself equipped with a likelihood function: P[d|h] indicates the probability of generating data d D given the hypothesis h H. A learner s state of belief may change after a learning process, and we naturally call her belief before and after the learning prior and posterior. Notice that the hypothesis space H is the same as the state space Θ, but we will use H to emphasize the application background and avoid confusion. The initial hypothesis hinit generates m1 items iid for the first learner. These items provide the training data d1 = (d1,1, . . . , d1,m1) with which the first learner Bayes-updates its prior. Its posterior is given by setting t = 1 in this formula: P[h|dt] = P[dt|h] P[h]/P[dt], with P[dt] = X h H P[dt|h] P[h]. (2) From that point on, each successive learner updates its prior from their predecessor. For any t > 1, learner t receives mt items sampled by the posterior of agent t 1 to form the training set dt. To do that, she picks a random hypothesis h from H with probability P[h|dt 1] (the posterior of learner t 1) and then samples mt items iid from h to form dt Dmt. The posterior P[h|dt] is derived according to (2). Note that learner t has no direct access to the posterior of learner t 1 but only to data drawn from a hypothesis sampled from the posterior. Our formulation assumes a discrete space H but extends to continuous settings, as we show in 3.5. In the case of linguistic transmission, each hypothesis h H is a knob whose setting is given by a number between 0 and 1, specifically the prior probability P[h]. All learners share the same prior. Picking some h from that prior specifies a language (also denoted h for convenience). In this case, a language is defined as a probability distribution over D, interpreted here as a set of sentences. In this way, the prior can be viewed as a mixture over H: by abuse of terminology, we call it a mixed hypothesis, which we distinguish from a pure hypothesis of the form h H (corresponding to a single-point distribution). Access to language h is achieved by random sampling: the sentence d D is picked with probability P[d|h]. Iterated learning proceeds as follows. After selecting language h with probability P[h|dt 1], learner t collects mt independent samples from h. Thus, given a tuple dt = (d1, . . . , dmt) of sentences from D, the likelihood P[dt|h] is equal to Q 1 k mt P[dk|h]. The learner is now ready to Bayes-update its prior. Of course, the first one (t = 1) samples directly from the language hinit chosen for iterated learning. The notation is boldfaced to indicate that hinit may be a mixed hypothesis or, in other words, a distribution over hypotheses. ITERATED LEARNING IN DYNAMIC SOCIAL NETWORKS Suppose that D = {d1, . . . , ds} and H = {h1, . . . , hn} are both finite. After observing the data generated by the posterior of learner t 1, if learner t winds up choosing hi then, by Bayesian updating, the probability P |t ij that its posterior picks hj is given by: P |t ij = X d Dmt P[hj|d] P[d|hi] = X P[d|hi] P[d|hj] P[hj] Pn k=1 P[d|hk] P[hk] . (3) To our knowledge, the entire literature on the topic assumes a common, fixed sample size for all the learners: mt = m. Equation (3) can be then interpreted as marginalizing a Gibbs sampler over the data space, which creates a Markov chain over the hypothesis space H: if ht denotes the row vector formed by the n probabilities P[hk | dt], then ht = ht 1P t, where h0 = hinit. Assuming ergodicity (in this case, a fairly inconsequential technical assumption), the chain can be shown to converge to a unique stationary distribution h. It can be easily checked that it coincides with the prior: h = (P[h1], . . . , P[hn]) (Griffiths and Kalish, 2005; Norris, 1998); see (Rafferty et al., 2009, 2014) for an analysis of the mixing time in specific linguistic scenarios. This convergence reveals the long-term unsustainability of iterated learning. We show how diversifying the sample sizes mt, hence making the Markov chain time-inhomogeneous, can overcome this weakness. In particular, we prove that it is sufficient for mt to increase logarithmically with respect to t in order to achieve sustainability. 3.1. Self-Sustainability We show how to make iterated learning self-sustaining in the presence of a finite hypothesis space H = {h1, . . . , hn}. This involves specifying a sequence of training session lengths m1, m2, . . . so that the posterior of any learner ends up differing from hinit by an arbitrarily small amount. Formally, given any δ, ε 0, we say that iterated learning is (δ, ε)-self-sustaining if, with probability at least 1 ε, a random h H picked from any learner s posterior distribution differs from hinit in total variation by at most δ. We recall a few facts: the hypothesis h denotes a language modeled as a probability distribution over D; the total variation distance is half the ℓ1-norm; and the posterior of learner t after the t-th iteration is defined by marginalizing P[h|dt] over all samples dt drawn from a random h picked from the posterior of learner t 1 (or hinit if t = 1). As a shorthand, we speak of ε-self-sustainability to refer to the case δ = 0. The parameters δ and ε allow us to distinguish between two metrics: the distance between two languages over D and the distance between two mixtures over H. The two notions could differ widely. For example, if all of H corresponds to languages very close to hinit, to achieve (δ, ε)-selfsustainability might be easy for a tiny δ > 0 but hopelessly difficult for δ = 0. The complexity of iterated learning depends on the geometry of the languages formed by the pure hypotheses. This is best captured by introducing a metric that, though more specialized than the total variation (it works only on the simplex of probability vectors), brings all sorts of technical benefits: the root-sine distance between two probability distributions u = (u1, . . . , us) and v = (v1, . . . , vs) over D is defined as d RS(u, v) = uivj ujvi 2 = v u u t1 s X uivi 2 . (4) CHAZELLE AND WANG Note that the root-sine distance will be used to measure similarities between two likelihoods, and we will continue the analysis of sustainability defined based on the total variation distance. It would be surprising if this distance had not been used before, but we could not find a reference. We prove that it is indeed a metric in the Appendix and also explain its name. We show that it is related to the Hellinger, Bhattacharyya and total variation distances, d H, d B, d TV by the following relations: 2 ln(1 d2 RS) ; 3.2. The results We focus on the pure case hinit H, and later briefly discuss how to generalize the method to mixed hypotheses. Using the shorthand dij for d RS(P[ |hi], P[ |hj]), we define di := minj:j =i dij. Let p = (p1, p2, . . . , pn) be the prior distribution over H, where pi := P[hi]. We can obviously assume that each pi is positive and that all the pure hypotheses are distinct, hence di > 0. The two theorems below assume that hinit = h1. Theorem 1. For any positive ε < 1, the following sample size sequence makes iterated learning ε-self-sustaining: for some C > 0 independent of t, ε, d1. The factor 4 can be reduced to 21+o(1) if we adjust the constant C. It is to be expected that the lengths of the training sessions should grow to infinity as p1 tends to zero, as the vanishing prior makes it increasingly difficult for the posteriors to attach to h1. The session lengths are sensitive to the minimum distance between the languages specified by H and the target language h1. Settling for (δ, ε)-self-sustainability allows us to remove this dependency. Theorem 2. For any positive δ, ε < 1, the following sample size sequence makes iterated learning (δ, ε)-self-sustaining: for some C > 0 independent of t, δ, ε. ITERATED LEARNING IN DYNAMIC SOCIAL NETWORKS 3.3. The proofs To establish Theorem 1, we estimate the probability P that each leaner ends up picking h1. Recall that ht is the posterior distribution of learner t, by the Markovian property of the system, P = P[h0 = h1] Y t 0 P[ht+1 = h1|ht = h1] = Y t 1 P |t 11. (6) Since the matrix P |t is the transition matrix of a Markov chain, we proceed by bounding its offdiagonal elements P |t ij for i = j. We have P[d|hi] P[d|hj]pj P[d|hi]pi + P[d|hj]pj = pj pj P[d|hi] P[d|hj] pi pj P[d|hi] + P[d|hj] P[d|hi] P[d|hj] = 1 P[d|hi] P[d|hj] mt P[d|hi] P[d|hj] 2 1 , where the two equalities are simple deformations, the first inequality is achieved by dropping some non-negative terms from the definition of P |t ij in (3), the second inequality is obtained via Young s inequality, and the last inequality is from Taylor expansion of the natural logarithm function at 1. By definition of the root-sine distance, we have 2 d2 ijmt (i = j). (7) Setting i = 1 in (7) and summing over 2 j n, it follows by Cauchy-Schwarz that j=2 P |t 1j 1 2 d2 1 mt . (8) Combining (6) and (8) yields 2 d2 1 mt ! 2 d2 1 mt. (9) Given 0 < ε < 1, we constrain the sequence (mt) to satisfy: 2 d2 1mt < ε 4p1 n(1 p1) . (10) For example, we can pick the sequence d2 1 ln n(1 p1)t4 CHAZELLE AND WANG which completes the proof. A closer look at the calculation shows that the factor t4 can be reduced to Cαt2+α for any small α > 0 and a suitable constant Cα > 0, which makes the dependency on t arbitrarily close to (2/d2 1) ln t. To prove Theorem 2, we set a target distance ρ := δ/(n 2s) and find a subset A H such that (i) d1j ρn for j A and (ii) dij ρ for i A and j A. To see why such a subset must exist, consider spheres centered at hinit = h1 of radius kρ, for k = 1, . . . , n + 1 (with respect to d RS). These define n + 1 disjoint (open) regions and, by the pigeonhole principle, at least one of them must be empty. We set A to include all the points in the regions preceding the empty one; note that h1 A. The claim follows from the triangular inequality. We begin with a straightforward generalization of (8): for any i A, X j A P |t ij 1 2 ρ2 mt , (11) where p A := mini A pi. Now let P be the probability that ht A for each t, then (6) and (9) are generalized to j / A P |t ij 2 ρ2 mt. (12) ρ2 ln n(1 p A)t4 ensures that P > 1 ε. The root-sine distance between the languages denoted by h1 and any h A is at most ρn, so that, by (5), the total variation distance is bounded by 2sρn = δ, which concludes the proof of Theorem 2. So far, we have analyzed only the pure case hinit H. The idea of the training is to prevent the prior to drag the posterior mixture all across H. It should be clear that a similar result obtains if hinit H is concentrated on a subset A of H. The proof follows the path charted in Theorem 2 and need not be repeated here. It is crucial to note, however, that this result is to be understood in a coarse-graining sense: iterated learning cannot ensure that the original weights in the mixture hinit are retained but only that A contributes most of the mass in the posteriors. To retain the weights would require changing the stationary distribution to conform with hinit, as the process unfolds, something that straightforward Bayesian learning seems unable to do. Learning pure hypotheses bypasses that difficulty. 3.4. Applications We briefly discuss a direct application of our results to a well-known model of language acquisition via iterated learning and we mention some natural extensions of the techniques. Language evolution. Rafferty et al. show how iterated learning fails rapidly in a simple model of language evolution (Rafferty et al., 2009). Given n hypotheses, iterated learning with fixed-length training sessions ceases to learn anything new after only O(log n log log n) rounds. Our previous theorems show how to turn this around and achieve self-sustainality. In our notational system, their model is defined on a hypothesis space H = {h1, . . . , hn}, where n = 2k and hi denotes the ITERATED LEARNING IN DYNAMIC SOCIAL NETWORKS language whose sentences are words in {0, 1, ?}k with exactly m question marks and 0, 1 matching the binary decomposition of i 1 outside the question marks. For example, if k = 4 and m = 2, then h3 denotes the language { 00??, 0?1?, ?01?, 0??0, ?0?0, ??10 }. We can assume that m is much smaller than k. Each language has the same length k m and the total number of sentences is s = k m 2k m. The prior is given by P[hi] = pi = 1/n. Given a hypothesis hi, P[d|hi] = 1/ k m if d has m question marks and match the bits of i 1 elsewhere; else it is 0 (and d, h are called incompatible). Given h H, P[d] = P h H P[d|h]P[h] = 2m k/ k m ; P[h|d] = P[d|h]P[h]/P[d] = 2 m (or 0 if d, h are incompatible). We easily check that d2 1 = 1 Ps i=1 aibi 2 1 m 2; hence, by Theorem 1, session lengths mt no larger than O(log t ε) are sufficient to maintain ε-self-sustainability. Meanings and utterances. In the use of iterated learning for studying language evolution (Griffiths and Kalish, 2005; Perfors and Navarro, 2011), it is common to model the data d as a joint distribution (x, y) over a product space X mt Ymt. The idea is to distinguish between meanings x and utterances y. In this setting, P[d|h] = P[y|x, h]µ(x), where µ(x) is the probability of generating x. The transition matrix of the Markov chain thus becomes P |t ij = X y Ymt P[hj|x, y] P[y|x, hi]µ(x) P[y|x, hi] P[y|x, hj] P[hj] Pm k=1 P[y|x, hk] P[hk] µ(x) . (14) Since the output y now depends on both the hypothesis and the input data, we redefine dij as the root-sine distance between the two distributions P[y|x, hi]µ(x) and P[y|x, hj]µ(x): d ij 2 := 1 P[y|x, hi]P[y|x, hj] µ(x) and we define d i := minj:j =i d ij. Given any i = j, P[y|x, hi] P[y|x, hj] pj P[y|x, hi] pi + P[y|x, hj] pj µ(x) P[y|x, hi] P[y|x, hj] µ(x) P[y|hi] P[y|hj] µ(x) mt P[y|x, hi] P[y|x, hj] µ(x) 2 1 . CHAZELLE AND WANG This gives us this new version of inequality (7), which we can use as the basis for a repeat of the argument of the previous section: 2 d 2 ijmt (i = j). (16) 3.5. Gaussian chained learning When iterated learning operates over a hypothesis space H parametrized continuously, say, in R, the minimum root-sine distance usually vanishes and the previous arguments run into singularities and collapse. A new approach is needed. To make our discussion concrete, we assume that the prior distribution of each learner is a Gaussian P[h] N( µ, σ2) and that the likelihood of producing data d given hypothesis h is also normal: P[d|h] = N(h, σ2). The likelihood can also be understood as a noisy measurement of h: d = h + φ, where the noise φ N(0, σ2). We assume that the data received by the first learner comes from N(µ0, σ2 0). This is the simplest instance of a continuous setting in which the root-sine distance argument fails. We discuss it in some detail, considering both chained learning and its generalizations; and then we use the results to treat the case of iterated Bayesian linear regression. During its training session, the t-th learner receives data dt = (dt,1, . . . , dt,mt) from its predecessor: it is obtained by first picking a random hypothesis h from the posterior of learner t 1 and then collecting mt independent random samples from N(h, σ2). For the case t = 1, we can treat the original teacher as learner 0 with its posterior equal to N(µ0, σ2 0). Learner t Bayes-updates its posterior as follows: P[h|dt] P[dt|h]P[h] exp 1 i=1 (dt,i h)2 exp 1 2 σ2 (h µ)2 , which is still Gaussian, with mean and variance denoted by µt and σ2 t , respectively. Carrying out the usual square completion gives up these update rules: for t > 0, µt = 1 τ+mtτ ( τ µ + τ(dt,1 + dt,2 + + dt,mt)) τt = τ + mtτ, (17) where we define the precisions τ = 1/σ2, τ = 1/ σ2, and τt = 1/σ2 t . We say that iterated learning is ε-self-sustaining if |E µt µ0| ε and σ2 t + var µt remains bounded for all t. If σ2 t + var µt 0 as t , we say that iterated learning is strongly ε-self-sustaining. We consider successively the case of chained iterated learning and the more challenging hopping scenario in which a new learner picks a random teacher from the past (instead of the previous one). In chained iterated learning, the data dt,i is a noisy message drawn from the posterior of the (t 1)-th learner; hence dt,i N(µt 1, σ2 t 1 + σ2). In view of (17), µt is itself Gaussian. By taking the expectation and variance of equation (17), we find the following recursive relations for E µt and var µt: for t > 0, E µt = 1 τ+mtτ τ µ + mtτ E µt 1 ; var µt = mtτ 2 ( τ+mtτ)2 var µt 1 + σ2 t 1 + σ2 . (18) ITERATED LEARNING IN DYNAMIC SOCIAL NETWORKS If we define βt := mtτ/( τ + mtτ), then (18) becomes E µt = βt E µt 1 + (1 βt) µ. If mt = m is a constant, then so is βt, and the recursive relation (18) becomes E µt µ = βt 1(µ0 µ), which shows that E µt converges to µ exponentially fast. As in the discrete case, iterated learning is not self-sustainable with constant-length training sessions. By letting mt increase as t1+o(1) order, however, we can achieve self-sustainability: Theorem 3 For any 0 < ε < 1, the following sample size sequence makes chained iterated learning strongly ε-self-sustaining: mt = |µ0 µ| for an arbitrarily small constant c > 0. Proof We observe that E µt is a convex combination of µ and E µs (s < t); specifically, s=1 βsµ0 + 1 Because P s>0(1/s)1+c < 1 + R 1 x 1 c dx = 1 + 1/c, we have 1 τ msτ + τ 1 s1+c > 1 ε |µ0 µ| . This shows that |E µt µ0| = 1 s=1 βs | µ µ0| ε. By (17), σ2 t = 1/τt < 1/mtτ 0. Since σ2 t 1 σ2 for t > 1, it follows from (18) that var µt (var µt 1 + σ2 + σ2)/mt for t > 1, and var µ1 (σ2 0 + σ2)/m1. Writing Mt := mtmt 1 . . . m1, we have Mtvar µt Mt 1var µt 1 + Mt 1(σ2 + σ2) t Mt 1(σ2 0 + σ2 + σ2), and thus var µt (σ2 0 + σ2 + σ2)t/mt 0 since mt = Ω(t1+c). 3.6. Iterated Bayesian Linear Regression The iterated version of Bayesian linear regression has been the subject of extensive study in the field of psychology (Kalish et al., 2007; Beppu and Griffiths, 2009; Tamariz and Kirby, 2015; Bartlett; Griffiths et al., 2008). The work has involved experimentation with human subjects but little in CHAZELLE AND WANG the way of theoretical analysis. This section is a first step toward filling this gap. The task at hand is to estimate a hypothesis h H := Rd given a noisy measurements on the hyperplane y = h T x, where x Rd. In the Bayesian setting, we assume a Gaussian prior on the hypothesis space: P[h] N( µ, σ2Id). The data is given by (x, y), where x N(0, Id) and y = h T x + φ, for φ N(0, σ2) (with x, φ independent). Since we typically make several measurements, we write this (likelihood) relation in matrix form: y = Xh + φ, where y Rm (with m the number of measurements); φ N(0, σ2Im); and X is an m-by-d matrix each of whose rows denotes a random vector x N(0, Id). This means that the matrix X is random (a fact of key importance in our discussion below). We have: 2σ2 φ 2 2 (noise) 2 σ2 h µ 2 2 (prior) P[y|X, h] exp 1 2σ2 y Xh 2 2 (likelihood) In iterated Bayesian linear regression, the t-th learner receives her data from learner t 1. Here, learner 0 is treated just like any other agent, except that his prior P[h] N(µ0, σ2Id) is the distribution to be learned iteratively. Since sampling from the prior is independent of X, Bayesian updating gives the posterior N(µt, Σt), where P[h|X, y] = P[h] P[y|X, h]/P[y|X] exp n 1 2 σ2 h µ 2 2 1 2σ2 y Xh 2 2 o . Completing the square in the usual fashion shows that the posterior of learner t is given by: Σt = σ 2Id + σ 2XT t Xt 1 ; µt = Σt σ 2 µ + σ 2XT t yt , (20) where (Xt, yt) is the data gathered by learner t from her predecessor: specifically, yt = Xth + φt, where h is collected from the (t 1)-th learner by sampling his posterior distribution N(µt 1, Σt 1). Theorem 4 Given any small enough δ, ε > 0, the following sample size sequence for iterated Bayesian linear regression ensures that Eµt µ0 2 δ with probability greater than 1 ε: mt = Dc µ0 µ 2 2 t1+c + Dc d log t + 1 for an arbitrarily small c > 0 and a constant Dc that depends only on c. Proof We proceed in two steps: first, we show that to keep Eµt arbitrarily close to µ0 for all t hinges on spectral properties of certain random matrices; second, we call on known facts about the singular values of random Gaussian matrices to translate the spectral condition into a high-probability event. The proof unfolds as a series of simple relations, which we state first and then demonstrate. The first one follows directly from (20): Eµt = (Id + Mt) 1 ( µ + Mt Eµt 1) , where Mt := σ 2 XT t Xt. (21) ITERATED LEARNING IN DYNAMIC SOCIAL NETWORKS Note that (21) is a randomized recursive relation since the data points X1, X2, . . . are themselves random. We note that all the matrices whose inverses are taken are positive definite, hence nonsingular. To move on to our second relation, we define the matrix Qt := (Id + Mt) 1Mt(Id + Mt 1) 1Mt 1 (Id + M1) 1M1, for t > 0, with Q0 = Id, and prove by induction that E µt = Qtµ0 + (Id Qt) µ. (22) The base case is obvious so we assume that t > 0: by (21), E µt = (Id + Mt) 1( µ + Mt E µt 1) = (Id + Mt) 1( µ + Mt Qt 1µ0 + Mt(Id Qt 1) µ) = (Id + Mt) 1Mt Qt 1µ0 + (Id + Mt) 1(Id + Mt(Id Qt 1)) µ = Qt µ0 + (Id (Id + Mt) 1Mt Qt 1) µ, which proves (22). Our next goal is to bound the information decay E µt µ0 2. To do that, we investigate the spectral norm of the matrix Id Qt, which leads to our third relation. We prove by induction that, for t > 0, s=1 As 2, (23) where As := (Id + Ms) 1. For t = 1, Q1 = (Id + M1) 1M1 = Id (Id + M1) 1 and the claim follows. If t > 1, then Id Qt 2 = (Id Qt 1) + (Qt 1 Qt) 2 Id Qt 1 2 + Qt Qt 1 2 s=1 As 2 + Ψ 2, where Ψ := (At Mt Id)Qt 1. Since At(Id + Mt) = Id, we have Ψ = At Qt 1. Each matrix Ms is positive semidefinite, so the eigenvalues of (Id + Ms) 1Ms are of the form λ/(1 + λ), where λ 0. This shows that all the eigenvalues of Qs are between 0 and 1; therefore Qs 2 1. The eigenvalues of Id At Mt are the same as those of At; hence, by submultiplicativity, Ψ 2 At 2 Qt 1 2 At 2, which establishes (23). We are now ready to express the information decay in spectral terms. Pick an arbitrarily small constant c > 0 and assume that As 2 δ µ µ0 2 By (22), E µt µ0 = (Id Qt)( µ µ0); therefore, by (23), E µt µ0 2 µ µ0 2 s=1 As 2 δc 1 + c 1 x 1 c dx = δ, CHAZELLE AND WANG The relation says that, on average, the means of any of the agents posteriors can be brought as close to the original mean to be learned as we want. We can turn this into a high-probability event by using some basic random matrix theory. Recall that E µt is itself a random variable whose stochasticity comes from the matrices Xs, which are all drawn from Gaussians. Because Ms is positive semidefinite, As 2 M 1 s 2 (σ/ σ)2 λmin(XT t Xt) σ/ σ which gives us a relation between the spectral norm of (Is + Ms) 1 and the smallest singular value σ1(Xt) of an mt-by-d matrix Xt whose elements are drawn iid from N(0, 1). The asymptotic behavior of σ1(Xt) for large values of mt has been extensively studied within the field of random matrix theory (Davidson and Szarek, 2001; Edelman, 1988; Rudelson and Vershynin, 2009). Following Theorem II.13 in (Davidson & Szarek (Davidson and Szarek, 2001)), for any γt > 0, P[σ1(Xt) < mt d γt] e γ2 t /2. We use C below as a generic constant large enough to satisfy the inequalities where it appears. Setting γt = C p log((t + 1)/ε) ensures that P t>0 e γ2 t /2 < ε, hence that σ1(Xt) < mt d γt holds for all t with probability less than ε. With our setting of mt, this means that, for all t > 0, P h σ1(Xt) mt i > 1 ε. (27) Assuming the event in (27), it follows from (26) and our setting of mt that hence (24) holds for Dc large enough. By (25, 27), this proves that, with probability greater than 1 ε, E µt µ0 2 δ for all t > 0, which completes the proof. 4. Hopped Learning In this section, we consider the hopped learning scenario in which learner t hops back to pick a teacher from {0, 1, . . . , t 1} at random, and then takes mt samples from her posterior. To model the data generating process, we continue to adopt the Gaussian setting from Sections 3.5 and 3.6. Note that the graph sequence Gt becomes random under the constraint that a learner can only get data from the learners before her. Since multiple samples can be sent from a teacher to a learner, we use dt,s,i to denote the i-th sample generated by agent s at time t. Note that though a learner only receives data from one teacher, without loss of generality, we assume all her predecessors generate samples but she only listens to one of them. The recursive relation for µt becomes i=1 dt,s,i + (1 βt) µ, (28) where, given t, the random variable χt,s is 1 for a value of s picked at random between 0 and s 1, and is zero elsewhere; recall that βt := mtτ/( τ + mtτ). Hopped iterated learning provides access ITERATED LEARNING IN DYNAMIC SOCIAL NETWORKS to earlier data, so one would expect the lengths of the training sessions to grow more slowly than in chained learning. The change is indeed quite dramatic: Theorem 5 For any positive ε < |µ0 µ|, the following sample size sequence makes hopped iterating learning ε-self-sustaining: mt = Bc |µ0 µ| 2 (1 + log t)1+c, for an arbitrarily small c > 0 and a constant Bc that depends only on c. Proof By taking expectation on both sides of (28), for any t > 0, s=0 E µs + (1 βt) µ, We define γ1 = β1 and, for t > 1, γt := (1 + β1) 1 + β2 We verify easily that E µt = γtµ0 + (1 γt) µ, for t > 0; therefore, the first part in establishing ε-self-sustainability consists of proving that 1 γt 1 ε |µ0 µ| , (29) which will show that |E µt µ0| ε. Note that Now define αs = ε Bc|µ0 µ|s(1 + log s)1+c . for s > 0. We pick a constant Bc large enough so that αs is small enough to carry out first-order Taylor approximations around 1 + αs. We find that 1 1 1 + msτ/ τ 1 1 (s + 1)msτ/ τ (1 αs) 1 + 1 e 2 Pt 1 s=1 αs = βte 2 Pt 1 s=1 αs 1 ε |µ0 µ|, which establishes (29). Our derivation relies on the fact that βt 1 ε Bc|µ0 µ|(1 + log t)1+c 1 ε 2|µ0 µ| CHAZELLE AND WANG 1 s(1 + log s)1+c 1 + 1 (log e)1+c 1 x(ln x)1+c dx = O 1 hence, e 2 Pt 1 s=1 αs e O(ε/(c Bc|µ0 µ|)) 1 ε 2|µ0 µ|. Having shown that |E µt µ0| ε for all t, it now suffices to prove that σ2 t + var µt remains bounded. We note that τt > mtτ , hence σ2 t = 1/τt 0, so the remainder of the proof needs to establish that the variance of µt stays bounded. Writing Dt,s := dt,s,1 + + dt,s,mt, we have var Dt,s = mtvar dt,s,1 = mt(σ2 s + σ2 + var µs); hence E D2 t,s = var Dt,s + (E Dt,s)2 = mt(σ2 s + σ2 + var µs) + m2 t (E µs)2. In (28), the variables χt,s and Dt,s are independent, for 0 s t 1; furthermore, E χt,s = E χ2 t,s = 1/t, and E χt,s1χt,s2 = 0 if s1 = s2; therefore, var [χt,s Dt,s] = E χ2 t,s E D2 t,s (E χt,s)2(E Dt,s)2 = E D2 t,s t (E Dt,s)2 σ2 s + σ2 + var µs + mt(E µs)2 mt 2 (E µs)2 (30) and, for s1 = s2, cov [χt,s1Dt,s1, χt,s2Dt,s2] = E [χt,s1χt,s2Dt,s1, Dt,s2] E [χt,s1Dt,s1]E [χt,s2Dt,s2] = E [χt,s1χt,s2]E [Dt,s1Dt,s2] E χt,s1E Dt,s1E χt,s2E Dt,s2 t2 E Dt,s1E Dt,s2 = mt 2 E µs1E µs2. Then, by taking the variance on both sides of (28), we have var µt = βt s=0 χt,s Dt,s s=0 var [χt,s Dt,s] + X 0 s1 =s2 t 1 cov [χt,s1Dt,s1, χt,s2Dt,s2] σ2 s + σ2 + var µs + mt(E µs)2 mt σ2 s + σ2 + var µs + mt(E µs)2 . Notice that σ2 s 0 and (E µs)2 is bounded since |E µt µ0| ε. We conclude that σ2 t + var µt remains bounded for all t. ITERATED LEARNING IN DYNAMIC SOCIAL NETWORKS 5. Networked Learning In this section, we study the information transfer and iterated learning with general graph sequence Gt. We assume that the initial belief µ0,i of agent i is Gaussian: µ0,i N(x0,i, σ2 0,i). Without loss of generality, the truth is assumed to be a constant (single-point distribution: µt,0 = 0; σt,0 = 0 for all t) and the standard deviation is the same for all other agents, i.e., σ0,i = σ0 > 0 for i > 0. Because agent 0 holds the truth, no edge points out of it. The adjacency matrix of Gt is denoted by At: it is an (n + 1) (n + 1) matrix whose first row is (1, 0, . . . , 0). Note that n is the number of learners and should not be confused with the number of hypotheses in Section 3. 5.1. The dynamics in matrix form Let Dt and Pt denote the (n + 1)-by-(n + 1) diagonal matrices diag(ηt,i) and (τ0/τ)I + Pt 1 k=0 Dk, respectively, where ηt,i is the out-degree of agent i at time t, I is the identity matrix and the sum is 0 for t = 0. It follows from (1) that µt,i N(xt,i, (τPt) 1 ii ) for i > 0. Regrouping the means in vector form, xt := (xt,0, . . . , xt,n)T , where xt,0 = 0 and x0,1, . . . , x0,n are given as inputs, we have xt+1 = (Pt + Dt) 1 (Ptxt + At (xt + ut + εt)) , (32) where ut is such that ut,0 N(0, 0) and, for i > 0, ut,i N(0, (τ(Pt)ii) 1); and εt is such that εt,0 N(0, 0) and, for i > 0, εt,i N(0, 1/τ). We refer to the vectors xt and yt := E xt as the mean process and the expected mean process, respectively. Taking expectations on both sides of (32) with respect to the random vectors ut and εt yields the update rule for the expected mean process: y0 = x0 and, for t > 0, yt+1 = (Pt + Dt) 1 (Pt + At) yt. (33) A key observation is that (Pt + Dt) 1 (Pt + At) is a stochastic matrix, so the expected mean process yt forms a diffusive influence system (Chazelle, 2015): the vector evolves by taking convex combinations of its own coordinates. What makes the analysis different from standard multiagent agreement systems is that the weights vary over time. In fact, some weights typically tend to 0, which violates one of the cardinal assumptions used in the analysis of averaging systems (Chazelle, 2015; Moreau, 2005). This leads us to the use of arguments, such as fourth-order moment bounds, that are not commonly encountered in this area. 5.2. The results The belief vector µt is Gaussian with mean xt and covariance matrix Σt formed by zeroing out the top-left element of (τPt) 1. We say that the system reaches truthful consensus if both the mean process xt and the covariance matrix tend to zero as t goes to infinity. This indicates that all the agents beliefs share a common mean equal to the truth and the error bars vanish over time. In view of (1), the covariance matrix indeed tends to 0 as long as the degrees are nonzero infinitely often, a trivial condition. To establish truthful consensus, therefore, boils down to studying the mean process xt. We do this in two parts: first, we show that the expected mean process converges to the truth; then we prove that fluctuations around it eventually vanish almost surely.1 1. The Kullback-Leibler divergence (Jadbabaie et al., 2012) is not suitable here because the estimator is Gaussian, hence continuous, whereas the truth is a single-point distribution. CHAZELLE AND WANG Truth-hearing assumption: Given any interval of length κ:= 1/γ , every agent i > 0 has an edge (i, 0) in Gt for at least one value of t in that interval. Theorem 6 Under the truth-hearing assumption, the system reaches truthful consensus with a convergence rate bounded by O(t γ/2η), where η is the maximum outdegree over all the networks. We prove the theorem in the next two sections. It will follow directly from Lemmas 7 and 8 below. The convergence rate can be improved to the order of t (1 ε)γ/η, for arbitrarily small ε > 0. The inverse dependency on γ is not surprising: the more access to the truth the stronger the attraction to it. On the other hand, it might seem counterintuitive that a larger outdegree should slow down convergence. This illustrates the risk of groupthink. It pays to follow the crowds when the crowds are right. When they are not, however, this distracts from the lonely voice that happens to be right. How essential is the truth-hearing assumption? We show that it is necessary. Simply having access to the truth infinitely often is not enough to achieve truthful consensus. 5.3. The proofs In this subsection, we demonstrate technical details for the proof of the results. We begin with some repeated used inequalities. 5.3.1. USEFUL MATRIX INEQUALITIES We highlight certain matrix inequalities to be used throughout. We use the standard element-wise notation R S to indicate that Rij Sij for all i, j. The infinity norm R = maxi P j |rij| is submultiplicative: RS R S , for any matching rectangular matrices. On the other hand, the max-norm R max := maxi,j |rij| is not, but it is transpose-invariant and also satisfies: RS max R S max. It follows that RSRT max R SRT max = R RST max R 2 ST max = R 2 S max. (34) 5.3.2. THE EXPECTED MEAN PROCESS DYNAMICS We analyze the convergence of the mean process in expectation. The expected mean yt = E xt evolves through an averaging process entirely determined by the initial value y0 = (0, x0,1, . . . , x0,n)T and the graph sequence Gt. Intuitively, if an agent communicates repeatedly with a holder of the truth, the weight of the latter should accumulate and increasingy influence the belief of the agent in question. Our goal in this section is to prove the following result: Lemma 7 Under the truth-hearing assumption, the expected mean process yt converges to the truth asymptotically. If, at each step, no agent receives information from more than η agents, then the convergence rate is bounded by Ct γ/2η, where C is a constant that depends on x0, γ, η, σ0/σ. ITERATED LEARNING IN DYNAMIC SOCIAL NETWORKS Proof We define Bt as the matrix formed by removing the first row and the first column from the stochastic P 1 t+1 (Pt + At). If we write yt as (0, zt) then, by (33), = 1 0 αt Bt where αt,i = (P 1 t+1)ii if there is an edge (i, 0) at time t and αt,i = 0 otherwise. This further simplifies to zt+1 = Btzt. (36) Let 1 be the all-one column vector of length n. Since P 1 t+1 (Pt + At) is stochastic, αt + Bt1 = 1 (37) In matrix terms, the truth-hearing assumption means that, for any t 0, αt + αt+1 + + αt+κ 1 Q 1 t+κ1, (38) where Qt is the matrix derived from Pt by removing the first row and the last column; the inequality relies on the fact that Pt is monotonically nondecreasing. For any t > s 0, we define the product matrix Bt:s defined as Bt:s := Bt 1Bt 2 . . . Bs, (39) with Bt:t = I. By (36), for any t > s 0, zt = Bt:s zs. (40) To bound the infinity norm of Bt:0, we observe that, for any 0 l < κ 1, the i-th diagonal element of Bs+κ:s+l+1 is lower-bounded by j=l+1 (Bs+j)ii = (Ps+j + As+j)ii (Ps+j+1)ii (41) (Ps+j)ii (Ps+j+1)ii = (Ps+l+1)ii (Ps+κ)ii (Ps)ii (Ps+κ)ii . The inequalities follow from the nonnegativity of the entries and the monotonicity of (Pt)ii. Note that (41) also holds for l = κ 1 since (Bs+κ:s+κ)ii = 1. Since P 1 t+1 (Pt + At) is stochastic, the row-sum of Bt does not exceed 1; therefore, by premultiplying Bs+1, Bs+2, . . . on both sides of (37), we obtain: l=0 Bs+κ:s+l+1αs+l. (42) Noting that Bt = Bt1 for any t, as Bt is non-negative, we combine (38), (41), and (42) together to derive: Bs+κ:s 1 min i>0 (Ps)ii (Ps+κ)2 ii . (43) CHAZELLE AND WANG Let η := maxt 0 max1 i n ηt,i denote the maximum outdegree in all the networks, and define δ = min{τ0/τ, 1}. For any i > 0 and s κ, κ (Ps)ii ηs + τ0 hence, max i (Ps+κ)ii η(s + κ) + τ0 It follows that (Ps+κ)ii (Ps)ii (Ps+κ)ii = Pκ 1 l=0 ηs+l,i (Ps+κ)ii ηκ2δ 1 s + κ . (46) Thus, we have min i>0 (Ps)ii (Ps+κ)ii = 1 max i>0 (Ps+κ)ii (Ps)ii s + κ . (47) We can replace the upper bound of (43) by 1 1 maxi>0(Ps+κ)ii min i>0 (Ps)ii (Ps+κ)2 ii , which, together with (45) and (47) gives us Bs+κ:s 1 1 η(s + κ) + τ0/τ 1 1 2ηκ(m + 2). (48) The latter inequality holds as long as s = mκ > 0 and It follows that, for m0 m , B(m0+m)κ:m0κ 1 1 2ηκ(m0 + j) The matrices Bt are sub-stochastic so that Bt z Bt z z . By (40), for any t (m0 + m)κ, zt = Bt:(m0+m)κB(m0+m)κ:m0κ zm0, ITERATED LEARNING IN DYNAMIC SOCIAL NETWORKS so that, by using standard bounds for the harmonic series, ln(k + 1) < 1 + 1 k 1 + ln k, we find that zt B(m0+m)κ:m0κ zm0 B(m0+m)κ:m0κ z0 Ct 1/(2ηκ), where C > 0 depends on z0, κ, η, τ0/τ. We note that the convergence rate can be improved to the order of t (1 ε)γ/η, for arbitrarily small ε > 0, by working a little harder with (48). 5.3.3. THE MEAN PROCESS DYNAMICS Recall that µt,i N(xt,i, τ 1 t,i ), where τt,i denotes the precision σ 2 t,i . A key observation about the updating rule in (1) is that the precision τt,i is entirely determined by the graph sequence Gt and is independent of the actual dynamics. Adding to this the connectivity property implied by the truth-hearing assumption, we find immediately that τt,i for any agent i. This ensures that the covariance matrix Σt tends to 0 as t goes to infinity, which satisfies the second criterion for truthful consensus. The first criterion requires that the mean process xt should converge to the truth 0. Take the vector xt yt and remove the first coordinate (xt yt)0 to form the vector t Rn. Under the truth-hearing assumption, we have seen that yt 0 (Lemma 7), so it suffices to prove the following: Lemma 8 Under the truth-hearing assumption, the deviation t vanishes almost surely. Proof We use a fourth-moment argument. The justification for the high order is technical: it is necessary to make a certain deviation power series converge. By (32), xt is a linear combination of independent Gaussian random vectors us and εs for 0 s t 1, and thus xt itself is a Gaussian random vector. Therefore t is also Gaussian and its mean is zero. From Markov s inequality, for any c > 0, X t 0 P[| t,i| c] X E 4 t,i c4 . (50) If we are able to show the right hand side of (50) is finite for any c > 0, then, by the Borel-Cantelli lemma, with probability one, the event | t,i| c occurs only a finite number of times, and so t,i goes to zero almost surely. Therefore, we only need to analyze the order of the fourth moment E 4 t,i. By subtracting (33) from (32), we have: t+1 = Bt t + Mtvt, (51) where vt := ut + εt and Mt := P 1 t+1At; actually, for dimensions to match, we remove the top coordinate of vt and the first row and first column of Mt (see previous section for definition of Bt). Transforming the previous identity into a telescoping sum, it follows from 0 = x0 y0 = 0 and the definition Bt:s = Bt 1Bt 2 . . . Bs that s=0 Bt:s+1Msvs = s=0 Rt,svs, (52) CHAZELLE AND WANG where Rt,s := Bt:s+1Ms. We denote by C1, C2, . . . suitably large constants (possibly depending on κ, η, n, τ, τ0). By (44), Ms C1/(s + 1) and, by (49), for sufficiently large s, Bt:s+1 C2(s + 1)β(t + 1) β, where β = 1/2ηκ < 1. Combining the above inequalities, we obtain the following estimate of Rt,s as Rt,s C3(s + 1) 1+β(t + 1) β. (53) In the remainder of the proof, the power of a vector is understood element-wise. We use the fact that vs and vs are independent if s = s and that the expectation of an odd power of an unbiased Gaussian is always zero. By Cauchy-Schwarz and Jensen s inequalities, s=0 E(Rt,svs)4 + X 0 s =s 0. The key to the proof is that, by (33), as agent 1 repeatedly links to agent 2, she is pulled arbitrarily close to it. Indeed, the transition rule gives us yt+1,1 = (Pt)11 (Pt+1)11 yt,1 + 1 (Pt+1)11 yt,2, where (Pt+1)11 = (Pt)11 + 1, which implies that yt,1 can be brought arbitrarily close to yt,2 while the latter does not move: this follows from the fact that any product of the form Qtb t>ta t t+1 tends to 0 as tb grows.2 Thus a suitably increasing sequence of sk, tk ensures the two conditions (57). The beliefs of the two agents do not converge to the truth even though they link to the truth agent infinitely often. Acknowledgments We wish to thank the anonymous referees for their useful comments and suggestions. Some results of this work were previously published in the conferences, Innovations in Theoretical Computer Science (ITCS 2017), and American Control Conference (ACC 2017) by the same authors. Chu Wang did this work prior to joining Amazon. The research of Bernard Chazelle was sponsored by the Army Research Office and the Defense Advanced Research Projects Agency and was accomplished under Grant Number W911NF-17-1-0078. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office, the Defense Advanced Research Projects Agency, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. 2. We note that the construction shares a family resemblance with one used by Moreau (Moreau, 2005) to show the non-consensual dynamics of certain multiagent averaging systems. The difference here is that the weights of the averaging change at each step by increasing the agent s self-confidence. CHAZELLE AND WANG Daron Acemoglu and Asuman Ozdaglar. Opinion dynamics and learning in social networks. Dynamic Games and Applications, 1(1):3 49, 2011. Daron Acemoglu, Munther A Dahleh, Ilan Lobel, and Asuman Ozdaglar. Bayesian learning in social networks. The Review of Economic Studies, 78(4):1201 1236, 2011. Venkatesh Bala and Sanjeev Goyal. Learning from neighbours. The review of economic studies, 65 (3):595 621, 1998. Abhijit V Banerjee. A simple model of herd behavior. The quarterly journal of economics, 107(3): 797 817, 1992. FC Bartlett. Remembering: a study in experimental and social psychology (1932). Aaron Beppu and Thomas L Griffiths. Iterated learning and the cultural ratchet. In Proceedings of the 31st annual conference of the cognitive science society, pages 2089 2094, 2009. A Bhattachayya. On a measure of divergence between two statistical population defined by their population distributions. Bulletin Calcutta Mathematical Society, 35:99 109, 1943. George E.P. Box and George C. Tiao. Bayesian inference in statistical analysis, volume 40. John Wiley & Sons, 2011. Bernard Chazelle. Diffusive influence systems. SIAM Journal on Computing, 44(5):1403 1442, 2015. Bernard Chazelle and Chu Wang. Inertial Hegselmann-Krause systems. In Proceedings of the IEEE American Control Conference (ACC), pages 1936 1941, 2016. Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. Real-time tracking of non-rigid objects using mean shift. In Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, volume 2, pages 142 149. IEEE, 2000. Kenneth R Davidson and Stanislaw J Szarek. Local operator theory, random matrices and banach spaces. Handbook of the geometry of Banach spaces, 1(317-366):131, 2001. Morris H. De Groot. Reaching a consensus. Journal of the American Statistical Association, 69(345): 118 121, 1974. Alan Edelman. Eigenvalues and condition numbers of random matrices. SIAM Journal on Matrix Analysis and Applications, 9(4):543 560, 1988. Andrew Gelman, Hal S Stern, John B Carlin, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian data analysis. Chapman and Hall/CRC, 2013. Benjamin Golub and Matthew O. Jackson. Naive learning in social networks and the wisdom of crowds. American Economic Journal: Microeconomics, 2(1):112 149, 2010. Benjamin Golub and Matthew O. Jackson. How homophily affects the speed of learning and best response dynamics. 2012. ITERATED LEARNING IN DYNAMIC SOCIAL NETWORKS Thomas L. Griffiths and Michael L. Kalish. A bayesian view of language evolution by iterated learning. In Proceedings of the 27th annual conference of the cognitive science society, pages 827 832, 2005. Thomas L. Griffiths and Michael L. Kalish. Language evolution by iterated learning with bayesian agents. Cognitive Science, 31(3):441 480, 2007. Thomas L Griffiths, Michael L Kalish, and Stephan Lewandowsky. Theoretical and empirical evidence for the impact of inductive biases on cultural evolution. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 363(1509):3503 3514, 2008. Michiel Hazewinkel. Encyclopaedia of Mathematics. Springer Science & Business Media, 2013. Rainer Hegselmann and Ulrich Krause. Opinion dynamics and bounded confidence models, analysis, and simulation. Journal of Artificial Societies and Social Simulation, 5(3), 2002. Ali Jadbabaie, Pooya Molavi, Alvaro Sandroni, and Alireza Tahbaz-Salehi. Non-bayesian social learning. Games and Economic Behavior, 76(1):210 225, 2012. Ali Jadbabaie, Pooya Molavi, and Alireza Tahbaz-Salehi. Information heterogeneity and the speed of learning in social networks. Columbia Business School Research Paper, (13-28), 2013. Michael L Kalish, Thomas L Griffiths, and Stephan Lewandowsky. Iterated learning: Intergenerational knowledge transmission reveals inductive biases. Psychonomic Bulletin & Review, 14(2): 288 294, 2007. Simon Kirby, Tom Griffiths, and Kenny Smith. Iterated learning and the evolution of language. Current opinion in neurobiology, 28:108 114, 2014. Ilan Lobel and Evan Sadler. Preferences, homophily, and social learning. Operations Research, 64 (3):564 584, 2015. Soheil Mohajer and Behrouz Touri. On convergence rate of scalar hegselmann-krause dynamics. In American Control Conference (ACC), 2013, pages 206 210. IEEE, 2013. Pooya Molavi, Alireza Tahbaz-Salehi, and Ali Jadbabaie. Foundations of non-bayesian social learning. Columbia Business School Research Paper, 2015. Luc Moreau. Stability of multiagent systems with time-dependent communication links. IEEE Transactions on Automatic Control, 50:169 182, 2005. Elchanan Mossel, Allan Sly, and Omer Tamuz. From agreement to asymptotic learning. Arxiv preprint ar Xiv, 1105, 2011. Manuel Mueller-Frank. A general framework for rational learning in social networks. Theoretical Economics, 8(1):1 40, 2013. James R Norris. Markov chains. Cambridge university press, 1998. Amy Perfors and Daniel Navarro. Language evolution is shaped by the structure of the world: An iterated learning analysis. Cognitive Science Society, 2011. CHAZELLE AND WANG Anna N Rafferty, Thomas L Griffiths, and Dan Klein. Convergence bounds for language evolution by iterated learning. In Proceedings of the Thirty-First Annual Conference of the Cognitive Science Society, 2009. Anna N Rafferty, Thomas L Griffiths, and Dan Klein. Analyzing the rate at which languages lose the influence of a common ancestor. Cognitive science, 38(7):1406 1431, 2014. Mohammad Amin Rahimian and Ali Jadbabaie. Learning without recall from actions of neighbors. In 2016 American Control Conference (ACC), pages 1060 1065. IEEE, 2016a. Mohammad Amin Rahimian and Ali Jadbabaie. Naive social learning in ising networks. In 2016 American Control Conference (ACC), pages 1088 1093. IEEE, 2016b. Mohammad Amin Rahimian, Shahin Shahrampour, and Ali Jadbabaie. Learning without recall by random walks on directed graphs. In 2015 54th IEEE Conference on Decision and Control (CDC), pages 5538 5543. IEEE, 2015a. Mohammad Amin Rahimian et al. Learning without recall: A case for log-linear learning. IFACPapers On Line, 48(22):46 51, 2015b. Mark Rudelson and Roman Vershynin. Smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics, 62(12):1707 1739, 2009. Kenny Smith. Iterated learning in populations of bayesian agents. In Proceedings of the 31st annual conference of the cognitive science society, pages 697 702, 2009. Alireza Tahbaz-Salehi, Alvaro Sandroni, and Ali Jadbabaie. Learning under social influence. In Decision and Control, 2009 held jointly with the 2009 28th Chinese Control Conference. CDC/CCC 2009. Proceedings of the 48th IEEE Conference on, pages 1513 1519. IEEE, 2009. Mónica Tamariz and Simon Kirby. Culture: copying, compression, and conventionality. Cognitive science, 39(1):171 183, 2015. ITERATED LEARNING IN DYNAMIC SOCIAL NETWORKS Appendix A. Analysis of the Root-Sine-Distance In this appendix, we provide detailed discussion of the root-sine-distance. We will first show that it is a metric, followed by the proof of its equivalency to Euclidean distance and the discussion of its relation to other similar metrics. The notation in the appendix is local and should not be confused with the one used in the main body of this paper. The root-sine-distance is a metric. The two forms of the function d RS in (4) make it clear that 0 d RS(a, b) 1 and d RS(a, b) = 0 if and only if a and b are identical. We easily check that d RS makes the simplex S of distributions over D into a metric space. Indeed, d RS( , ) is obviously symmetric, and d RS(a, b) = 0 implies that a = b. To check the triangular inequality, notice that d RS(a, b) = v u u t1 s X aibi 2 = sin a, b is the angle between the unit vectors a and b, using the notation v = ( v1, . . . , vs). To prove that d RS(a, b) + d RS(b, c) d RS(a, c) for any a, b, c S, we denote by α, β, γ the corresponding angles in that order, ie, α = a, b , etc. The coordinates in a, b, c are nonnegative; therefore 0 α, β, γ π/2. These form the three angles at the origin of a tetrahedron with a vertex at the origin; therefore, by the triangular inequality in spherical geometry, α + β γ. If α + β π 2 , then sin α + sin β sin α cos β + cos α sin β = sin(α + β) sin γ. On the other hand, if α+β > π/2, then sin α+sin β = 2 sin α+β 4 = 1 sin γ, which establishes the triangular inequality. Relation to the Euclidean distance. Shrinking the simplex S by a tiny amount, we define Sε := {a S : ε ai 1 ε} and note that d E(a, b) := a b 2 = bi)2( ai + p It follows that, for a, b Sε, 1 2d E(a, b) d E( a, b ) 1 2 ε d E(a, b). (59) On the other hand, a 2 = b 2 = 1, so the vectors a and b form an isosceles triangle; hence b ) = 2 sin 1 b = d RS(a, b) cos 1 d RS(a, b) d E( a, 2 d RS(a, b). Together with (59) this shows that, for any a, b Sε, 2 d E(a, b) d RS(a, b) 1 2 ε d E(a, b), (60) which shows that the Euclidean distance and the metric d RS are equivalent in Sε. CHAZELLE AND WANG Relation to other distances. The metric d RS is related to the Hellinger and Bhattacharyya distances. Writing C(a, b) = Ps i=1 aibi (Comaniciu et al., 2000), then d RS(a, b) = p 1 C(a, b)2. The Hellinger distance is defined as d H(a, b) = p 1 C(a, b) (Hazewinkel, 2013), while the Bhattacharyya distance is defined as d B(a, b) = ln C(a, b) (Bhattachayya, 1943). The total variation distance d TV is half the ℓ1-norm; therefore d TV (a, b) 1 2 s d E(a, b). Combining these observations with (60) establishes (5): 2 ln(1 d2 RS) ;