# Sequential Cooperative Bayesian Inference

Junqi Wang, Pei Wang, Patrick Shafto

## Abstract

Cooperation is often implicitly assumed when learning from other agents. Cooperation implies that the agent selecting the data and the agent learning from the data have the same goal: that the learner infer the intended hypothesis. Recent models in human and machine learning have demonstrated the possibility of cooperation. We seek foundational theoretical results for cooperative inference by Bayesian agents through sequential data. We develop novel approaches for analyzing the consistency, rate of convergence, and stability of Sequential Cooperative Bayesian Inference (SCBI). Our analyses of effectiveness, sample efficiency, and robustness show that cooperation is not only possible in specific instances but theoretically well-founded in general. We discuss implications for human-human and human-machine cooperation.

## 1. Introduction

Learning often occurs sequentially, as opposed to in batch, and from data provided by other agents, as opposed to from a fixed random sampling process. The canonical example of sequential learning from an agent occurs in educational contexts, where the other agent is a teacher whose goal is to help the learner. However, instances appear across a wide range of contexts including informal learning, language, and robotics. In contrast with typical contexts considered in machine learning, it is reasonable to expect the cooperative agent to adapt their sampling process after each trial, consistent with the goal of helping the learner learn more quickly. It is also reasonable to expect that learners, in dealing with such cooperative agents, would know the other agent intends to cooperate and incorporate that knowledge when updating their beliefs. In this paper, we analyze basic statistical properties of such sequential cooperative inference.
¹CoDaS Lab, Department of Math & CS, Rutgers University at Newark, New Jersey, USA. Correspondence to: Junqi Wang. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Large behavioral and computational literatures highlight the importance of cooperation for learning. Across the behavioral sciences, cooperative information sharing is believed to be a core feature of human cognition. Education, where a teacher selects examples for a learner, is perhaps the most obvious case. Other examples appear in linguistic pragmatics (Frank & Goodman, 2012), in speech directed to infants (Eaves Jr et al., 2016), and in children's learning from demonstrations (Bonawitz et al., 2011). Indeed, theorists have posited that the ability to select data for, and learn cooperatively from, others explains humans' ability to learn quickly in childhood and to accumulate knowledge over generations (Tomasello, 1999; Csibra & Gergely, 2009). Across computational literatures, cooperative information sharing is also believed to be central to human-machine interaction. Examples include pedagogic-pragmatic value alignment in robotics (Fisac et al., 2017), cooperative inverse reinforcement learning (Hadfield-Menell et al., 2016), machine teaching (Zhu, 2013) and Bayesian teaching (Eaves Jr et al., 2016) in machine learning, and teaching dimension in learning theory (Zilles et al., 2008; Doliwa et al., 2014). Indeed, rather than building in knowledge or training on massive amounts of data, cooperative learning from humans is a strong candidate for advancing machine learning theory and improving human-machine teaming more generally. While behavioral and computational research makes clear the importance of cooperation for learning, we lack mathematical results that would establish its statistical soundness.
In the development of probability theory, proofs of consistency and rate of convergence were celebrated results that put Bayesian inference on strong mathematical footing (Doob, 1949). Moreover, establishing stability with respect to mis-specification ensured that theoretical results could apply despite small differences between the model and reality (Kadane et al., 1978; Berger et al., 1994). Proofs of consistency, convergence, and stability ensured that intuitions regarding probabilistic inference were formalized in ways that satisfy basic desiderata. Our goal is to provide a comparable foundation for Sequential Cooperative Bayesian Inference as statistical inference, for understanding the strengths, limitations, and behavior of cooperating agents. Grounded strongly in machine learning (Murphy, 2012; Ghahramani, 2015) and human learning (Tenenbaum et al., 2011), we adopt a probabilistic approach. We approach consistency, convergence, and stability using a combination of new analytical and empirical methods. The result is a model-agnostic understanding of whether, and under what conditions, sequential cooperative interactions result in effective and efficient learning.

Notation is introduced at the end of this section. Section 2 introduces the model of Sequential Cooperative Bayesian Inference (SCBI), with Bayesian inference (BI) as the comparison. Section 3 presents a new analysis approach, which we apply to the consistency of SCBI. Section 4 presents empirical results on the sample efficiency of SCBI versus BI, showing that SCBI converges considerably faster. Section 5 presents empirical results testing the robustness of SCBI to perturbations. Section 6 introduces an application of SCBI in a grid-world model. Section 7 describes our contributions in the context of related work, and Section 8 discusses implications for machine learning and human learning.

**Preliminaries.**
Throughout this paper, for a vector $\theta$, we denote its $i$-th entry by $\theta_i$ or $\theta(i)$. Similarly, for a matrix $M$, we denote the $i$-th row by $M_{(i,\cdot)}$, the $j$-th column by $M_{(\cdot,j)}$, and the entry in the $i$-th row and $j$-th column by $M_{(i,j)}$ or simply $M_{ij}$. Further, let $r, c$ be the column vectors of row and column marginals (sums along rows/columns) of $M$. Let $e_n$, or simply $e$, be the vector of ones. The symbol $\mathrm{Nvec}(\theta, s)$ denotes the normalization of a non-negative vector $\theta$ to total sum $s$, i.e., $\mathrm{Nvec}(\theta, s) = \frac{s}{\sum_i \theta_i}\,\theta$, with $s = 1$ if absent. Similarly, normalizations of matrices are denoted by $\mathrm{Ncol}(M, \theta)$, with "col" indicating column normalization (for row normalization, write "row" instead), where $\theta$ is the vector of sums to which the matrix is normalized. The set of probability distributions on a finite set $X$ is denoted by $\mathcal{P}(X)$; we do not distinguish it from the simplex $\Delta^{|X|-1}$. The language of statistical models and estimators follows the notation of (Miescke & Liese, 2008).

## 2. The Construction

Figure 1. Two agents and their knowledge before starting.

In this paper, we consider cooperative communication models with two agents, which we call a teacher and a learner. Let $\mathcal{H} = \{1, \dots, m\}$ be the set of $m$ hypotheses, i.e., concepts to teach. The shared goal is for the learner to infer the correct hypothesis $h \in \mathcal{H}$, which is known only by the teacher at the beginning. To facilitate learning, the teacher passes one element of a finite data set $\mathcal{D} = \{1, \dots, n\}$ per round, sequentially. Each agent has knowledge about the relation between $\mathcal{H}$ and $\mathcal{D}$ in terms of a positive matrix whose normalization can be treated as the likelihood matrix in a Bayesian sense. Let $T, L \in \mathrm{Mat}_{n\times m}(\mathbb{R}^+)$ be the matrices for the teacher and learner, respectively. In order to construct a Bayesian theory, the learner has an initial prior $\theta_0$ on $\mathcal{H}$, which, along with the posteriors $\theta_k$ $(k \geq 1)$, is an element of $\mathcal{P}(\mathcal{H}) = \Delta^{m-1} := \{\theta \in \mathbb{R}^m : \sum_{i=1}^m \theta(i) = 1\}$.
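As a concrete illustration of the $\mathrm{Nvec}$ and $\mathrm{Ncol}$ notation, the following is a minimal Python sketch; the function names are ours, the paper only defines the notation.

```python
# Minimal sketch of the Nvec / Ncol normalizations defined above.

def nvec(theta, s=1.0):
    """Nvec(theta, s): scale a non-negative vector so its entries sum to s."""
    total = sum(theta)
    return [s * t / total for t in theta]

def ncol(M, sums):
    """Ncol(M, sums): scale column j of M so that it sums to sums[j]."""
    m = len(M[0])
    col_totals = [sum(row[j] for row in M) for j in range(m)]
    return [[row[j] * sums[j] / col_totals[j] for j in range(m)] for row in M]

print(nvec([2, 3, 5]))  # → [0.2, 0.3, 0.5]
```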
Privately, the teacher knows the true hypothesis $h \in \mathcal{H}$ to teach. To measure how well a posterior $\theta_k$ performs, we may view $h$ as a distribution on $\mathcal{H}$, namely $\hat\theta = \delta_h \in \mathcal{P}(\mathcal{H})$, and calculate the $L^1$-distance $\|\theta_k - \hat\theta\|_1$ on $\mathcal{P}(\mathcal{H}) = \Delta^{m-1} \subset \mathbb{R}^m$. We assume that $\mathcal{H}$, $\mathcal{D}$, $T$, $L$ and $\theta_0$ satisfy:

(i) There are no fewer data than hypotheses ($n \geq m$).
(ii) The hypotheses are distinguishable, i.e., there is no $\lambda \in \mathbb{R}$ such that $T_{(\cdot,i)} = \lambda T_{(\cdot,j)}$ for any $i \neq j$, and likewise for $L$.
(iii) $T$ is a scaled matrix of $L$, i.e., there exist invertible diagonal matrices $E_1$ and $E_2$ such that $T = E_1 L E_2$. (Both agents are aware of this assumption, though possibly neither knows the other's matrix.)
(iv) $\theta_0$ is known by the teacher.

Our model is constructed and studied under these assumptions (Sec. 3 and Sec. 4). We also study stability under violations of (iii) and (iv), where we assume that $T$ and the teacher's knowledge $\theta_0^T$ of $\theta_0$ differ slightly from (some scaled matrix of) $L$ and from $\theta_0$ (Sec. 5). Assumption (iii) is a relaxation of the assumption of Bayesian inference that $T = L = M$ is the likelihood matrix. Practically, we may achieve (iii) by adding to the common ground a shared matrix $M$ (e.g., a joint distribution on $\mathcal{D}$ and $\mathcal{H}$) and scaling it to $T$ and $L$. We may obtain $M$ by taking the same ground model or using the same statistical data (e.g., historical statistical records). In fact, under (iii), whether $M$ is accessible to the agents does not affect the process of inference. In SCBI (described later in this section), thanks to the property that a matrix and its scalings behave identically under Sinkhorn scaling (Hershkowitz et al., 1988), the pre-processing of $T$ and of $L$ leads to the same results under (iii) and (iv). Thus assumption (iii) is equivalent to:

(iii′) $T = L = M$ where $M$ is a column-stochastic matrix.

We assume (iii′) holds until we discuss stability. In our setup, the teacher teaches in sequence. At each round the teacher chooses a data point from $\mathcal{D}$ by sampling according to a distribution.
The learner learns by maintaining a posterior distribution on $\mathcal{H}$ through Bayesian inference with likelihood matrices that are not necessarily fixed.

Formally, the teacher's job is to select a sequence of data $(d_k)_{k\in\mathbb{N}}$ by first constructing a sequence of random variables $(D_k)_{k\in\mathbb{N}}$, then sampling each $d_k$ as a realization of $D_k$. Each $d_k$ is given to the learner at round $k$. The learner's job is to produce a sequence of posteriors $(\theta_k)_{k\in\mathbb{N}}$ in $\mathcal{P}(\mathcal{H})$. To calculate $\theta_k$, the learner can use the matrix $L$, the initial prior $\theta_0$ (which is common knowledge), and the sequence of data $(d_i)_{i \leq k}$ visible at round $k$. The learner finds each posterior via a function $S_k((d_i)_{i\leq k}; L, \theta_0)$ for $k > 0$ (we may omit $L$ and/or $\theta_0$ when there is no ambiguity). We further define $S_0(\,\cdot\,; L, \theta_0) = \theta_0$. Since $(d_k)_{k\in\mathbb{N}}$ is generated by a sequence of random variables $(D_k)_{k\in\mathbb{N}}$, the function $S_k$ can be treated as a function taking $(D_i)_{i\leq k}$ as inputs and producing a random variable $\Theta_k$ as output. We denote the distribution of $\Theta_k$ by $\mu_k \in \mathcal{P}(\mathcal{P}(\mathcal{H})) = \mathcal{P}(\Delta^{m-1})$. The $S_k$'s, as functions of random variables, are called estimators.

As a special case of the above framework, Bayesian inference on sequential data is a well-studied model. However, there is no cooperation in Bayesian inference, since the teaching distribution and learning likelihood are constant over time (the teacher side is typically left implicit). To introduce cooperation following cooperative inference (Yang et al., 2018), we propose Sequential Cooperative Bayesian Inference (SCBI), a sequential version of cooperative inference.

### 2.1. Sequential Cooperative Bayesian Inference

Sequential Cooperative Bayesian Inference (SCBI) assumes that the two agents, a teacher and a learner, cooperate to facilitate learning.
Prior research has formalized this cooperation (in a single-round game) as a system of two interrelated equations, in which the teacher's choice of data depends on the learner's inference, and the learner's inference depends on reasoning about the teacher's choice. This prior research on Cooperative Inference has focused on batch selection of data (Yang et al., 2018; Wang et al., 2019a) and has been shown to be formally equivalent to Sinkhorn scaling (Wang et al., 2019b). Following this principle, we propose a new sequential setting in which the teacher chooses data sequentially, and both agents update the likelihood at each round to optimize learning.

**Cooperative Inference.** Let $P_{L_0}(h)$ be the learner's prior for hypothesis $h \in \mathcal{H}$ and $P_{T_0}(d)$ be the teacher's prior for selecting data $d \in \mathcal{D}$. Let $P_T(d|h)$ be the teacher's likelihood of selecting $d$ to convey $h$, and $P_L(h|d)$ be the learner's posterior for $h$ given $d$. Cooperative inference is then the system of two equations below, with $P_L(d)$ and $P_T(h)$ the normalizing constants:

$$P_L(h|d) = \frac{P_T(d|h)\, P_{L_0}(h)}{P_L(d)}, \qquad P_T(d|h) = \frac{P_L(h|d)\, P_{T_0}(d)}{P_T(h)}. \tag{1}$$

**Algorithm 1** SCBI, without assumption (iii′)

*Teacher's part:*
Input: $T \in \mathrm{Mat}_{n\times m}(\mathbb{R}^+)$, $\theta_0$, $h \in \mathcal{H}$ ($\hat\theta = \delta_h$)
Output: share $(d_1, d_2, \dots)$ with the learner
for all $i \geq 1$ do
  sample $d_i \sim \mathrm{Nvec}\big(T^{n\theta_{i-1}}_{(\cdot,h)}, 1\big)$
  $\theta_i \leftarrow T^{n\theta_{i-1}}_{(d_i,\cdot)}$ (estimate of the learner's posterior)
end for

*Learner's part:*
Input: $L \in \mathrm{Mat}_{n\times m}(\mathbb{R}^+)$, $\theta_0$, $(d_1, d_2, \dots)$
Output: posteriors $(\theta_0, \theta_1, \theta_2, \dots)$
for all $i \geq 1$ do
  $\theta_i \leftarrow L^{n\theta_{i-1}}_{(d_i,\cdot)}$
end for

Note: $T^{n\theta_{i-1}}$ and $L^{n\theta_{i-1}}$ are the $M^{c_{k-1}}$ in the text.

It is shown in (Wang et al., 2019a;b) that Eq. (1) can be solved by Sinkhorn scaling, where the $(r, c)$-Sinkhorn scaling of a matrix $M$ is simply the iterated alternation of row normalization of $M$ with respect to $r$ and column normalization of $M$ with respect to $c$. The limit of such iterations exists if the sums of the elements of $r$ and of $c$ agree (Schneider, 1989).

**Sequential Cooperation.**
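The $(r, c)$-Sinkhorn scaling just described can be sketched in a few lines of Python; this is a naive fixed-iteration version with variable names of our own choosing, not the implementation used in the paper's experiments.

```python
# Naive (r, c)-Sinkhorn scaling: alternate row normalization to sums r
# and column normalization to sums c.  A fixed iteration count stands in
# for a convergence test; sum(r) == sum(c) guarantees the limit exists.

def sinkhorn(M, r, c, iters=500):
    M = [row[:] for row in M]          # work on a copy
    for _ in range(iters):
        for i, row in enumerate(M):    # rows -> sums r
            s = sum(row)
            M[i] = [x * r[i] / s for x in row]
        for j in range(len(M[0])):     # columns -> sums c
            s = sum(row[j] for row in M)
            for row in M:
                row[j] *= c[j] / s
    return M

# Doubly stochastic scaling of a positive 2x2 matrix:
A = sinkhorn([[1.0, 2.0], [3.0, 4.0]], r=[1.0, 1.0], c=[1.0, 1.0])
```

Sinkhorn scaling preserves cross-ratios, so the limit here is the unique doubly stochastic matrix of the form $E_1 M E_2$.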
SCBI allows multiple rounds of teaching and requires each choice of data to be generated by cooperative inference, with the learner updating their beliefs between rounds. In each round, based on the data taught so far and the learner's initial prior on $\mathcal{H}$ (common knowledge), the teacher and learner update their common likelihood matrix according to cooperative inference (using Sinkhorn scaling); data selection and inference then proceed with the updated likelihood matrix.

Precisely, starting from the learner's prior $S_0 = \theta_0 \in \Delta^{m-1}$, let the data taught up to round $k$ be $(d_1, \dots, d_{k-1})$ and the posterior of the learner after round $k-1$ be $\theta_{k-1} = S_{k-1}(d_1, \dots, d_{k-1}; \theta_0) \in \mathcal{P}(\mathcal{H})$, which is in fact predictable by both agents (obvious for $k = 1$, and inductively correct for $k > 1$ by the argument below). To teach, the teacher calculates the Sinkhorn scaling of $M$ with uniform row sums $r_{k-1} = e_n = (1, 1, \dots, 1)$ and column sums $c_{k-1} = n\theta_{k-1}$ (so that the sum of $r_{k-1}$ equals that of $c_{k-1}$, which guarantees the existence of the Sinkhorn limit), denoted $M^{c_{k-1}}$. The teacher's data selection is proportional to the columns of $M^{c_{k-1}}$. Thus, let $M_k$ be the column normalization of $M^{c_{k-1}}$ by $e_m$, i.e., $M_k = \mathrm{Ncol}(M^{c_{k-1}}, e_m)$. The teacher then defines $D_k$ using the distribution $(M_k)_{(\cdot,h)}$ on $\mathcal{D}$, samples $d_k \sim D_k$, and passes $d_k$ to the learner.

On the learner's side, the learner obtains the likelihood matrix $M_k$ in the same way as above and applies ordinary Bayesian inference to the datum $d_k$ passed from the teacher. First, the learner takes the prior to be the posterior of the last round, $\theta_{k-1} = \frac{1}{n} c_{k-1}$, then multiplies it entrywise by the likelihood of selecting $d_k$, namely the row of $M_k$ corresponding to $d_k$, which yields $\tilde\theta_k = (M_k)_{(d_k,\cdot)}\,\mathrm{diag}(\theta_{k-1})$. The posterior $\theta_k$ is then obtained by normalizing $\tilde\theta_k$. Inductively, in the next round, the learner starts with $\theta_k$ and $c_k = n\theta_k$.
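Putting the round structure above together, one round of SCBI can be sketched as follows. This is a hedged illustration in pure Python, using a naive fixed-iteration Sinkhorn and an example matrix of our own; it uses the simplified posterior update in which the learner's new posterior is row $d_k$ of $M^{c_{k-1}}$.

```python
import random

def sinkhorn(M, r, c, iters=500):
    """Naive (r, c)-Sinkhorn scaling (see Sec. 2.1)."""
    M = [row[:] for row in M]
    for _ in range(iters):
        for i, row in enumerate(M):
            s = sum(row)
            M[i] = [x * r[i] / s for x in row]
        for j in range(len(M[0])):
            s = sum(row[j] for row in M)
            for row in M:
                row[j] *= c[j] / s
    return M

def scbi_round(M, theta, h, rng=random):
    """One SCBI round: the teacher samples d_k in proportion to column h
    of M^{c_{k-1}}; the learner's new posterior is row d_k of M^{c_{k-1}}."""
    n = len(M)
    scaled = sinkhorn(M, r=[1.0] * n, c=[n * t for t in theta])
    weights = [row[h] for row in scaled]           # ∝ teaching distribution
    d = rng.choices(range(n), weights=weights)[0]  # teacher's sample d_k
    post = scaled[d]
    total = sum(post)                              # rows are ≈ normalized
    return d, [x / total for x in post]

M = [[0.6, 0.1], [0.3, 0.3], [0.1, 0.6]]  # toy column-stochastic matrix (ours)
d, theta1 = scbi_round(M, [0.5, 0.5], h=0)
```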
The learner's calculation in round $k$ can be simulated by the teacher, so the teacher can predict $\theta_k$; this inductively establishes the assumption made in the previous paragraph that the teacher knows $\theta_{k-1}$.

The calculation can be simplified. Since the vector $c_{k-1}$ is proportional to the prior, and $M_k = \mathrm{Ncol}(M^{c_{k-1}}, e_m) = M^{c_{k-1}}\,(\mathrm{diag}(n\theta_{k-1}))^{-1}$, we get $\tilde\theta_k = M^{c_{k-1}}_{(d_k,\cdot)}(\mathrm{diag}(n\theta_{k-1}))^{-1}\,\mathrm{diag}(\theta_{k-1}) = \frac{1}{n} M^{c_{k-1}}_{(d_k,\cdot)}$. Furthermore, since $M^{c_{k-1}}$ is row-normalized to $e_n$, each of its rows is a probability distribution on $\mathcal{H}$. Thus $S_k(d_1, \dots, d_k) = \theta_k = n\tilde\theta_k = M^{c_{k-1}}_{(d_k,\cdot)}$.² The simplified version of the SCBI algorithm is given in Algorithm 1.

### 2.2. Bayesian Inference: the Control

To test the performance of SCBI, we recall classical Bayesian inference (BI). In BI, a fixed likelihood matrix $M$ is used throughout the communication process. Bayes' rule requires $M$ to be the conditional distribution on the set of data given each hypothesis, so $M = T = L$ is column-stochastic as in assumption (iii′). For the teacher, given $h \in \mathcal{H}$, the teaching distribution is the column vector $P_h = M_{(\cdot,h)} \in \mathcal{P}(\mathcal{D})$. This defines the random variables $D_k$: the teacher selects data by i.i.d. sampling according to $P_h$, so the random variables $(D_k)_{k\geq 1}$ are identically distributed. The learner first chooses a prior $\theta_0 \in \mathcal{P}(\mathcal{H})$ ($\theta_0 = S_0$ is part of the model, usually the uniform distribution), then repeatedly uses Bayes' rule with likelihood $M$ to update the posterior. Given taught datum $d$, the map from prior $\theta$ to posterior is denoted $B_d(\theta) = \mathrm{Nvec}(M_{(d,\cdot)}\,\mathrm{diag}(\theta), 1)$. Thus the learner's estimate over $\mathcal{H}$ given sequential data $(d_1, \dots, d_k)$ can be written recursively as $S_0 = \theta_0$ and $S_k(d_1, \dots, d_k) = B_{d_k}(S_{k-1}(d_1, \dots, d_{k-1}))$; by induction, $S_k(d_1, \dots, d_k) = (B_{d_k} \circ B_{d_{k-1}} \circ \cdots \circ B_{d_1})(S_0)$.
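The BI control is simply iterated Bayes updates with a fixed column-stochastic $M$. A minimal sketch, with an example matrix of our own:

```python
import random

def bayes_update(M, d, theta):
    """B_d(theta): multiply the prior entrywise by likelihood row M[d]
    and renormalize (Nvec)."""
    post = [M[d][i] * theta[i] for i in range(len(theta))]
    s = sum(post)
    return [p / s for p in post]

# Column-stochastic likelihood: columns are P(d | h).
M = [[0.8, 0.3],
     [0.2, 0.7]]
theta = [0.5, 0.5]
h = 0
for _ in range(20):  # teacher samples i.i.d. from column h
    d = random.choices([0, 1], weights=[M[0][h], M[1][h]])[0]
    theta = bayes_update(M, d, theta)
# theta is now (with high probability) concentrated near hypothesis h.
```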
## 3. Consistency

We investigate the effectiveness of the estimators in both BI and SCBI by testing their consistency: fixing the true hypothesis $h \in \mathcal{H}$, given $(D_k)$, $(S_k)$ and $\theta_0$, we examine the convergence (in the $L^1$-distance on $\mathcal{P}(\mathcal{H})$) of the posterior sequence $(\Theta_k) = (S_k(D_1, \dots, D_k))$ as a sequence of random variables, and check whether the limit is $\hat\theta$ as a constant random variable.

² See Supplementary Material for detailed examples.

### 3.1. BI and KL Divergence

The consistency of BI has been well studied since Bernstein, von Mises, and Doob (Doob, 1949). In this section, we state it for our situation and derive a formula for the rate of convergence, as a baseline for the cooperative theory. Derivations and proofs can be found in the Supplementary Material.

**Theorem 3.1** ((Miescke & Liese, 2008, Theorem 7.115)). In BI, the sequence of posteriors $(S_k)$ is strongly consistent at $\hat\theta = \delta_h$ for each $h \in \mathcal{H}$, for an arbitrary choice of interior point $\theta_0 \in (\mathcal{P}(\mathcal{H}))^\circ$ (i.e., $\theta_0(h) > 0$ for all $h \in \mathcal{H}$) as prior.

**Remark 1.** For a fixed true distribution $\hat\theta$, strong consistency of $(S_k)_{k\in\mathbb{N}}$ means: the sequence of posteriors $\Theta_k$ given by the estimators $S_k$, as a sequence of random variables, converges to $\hat\theta$ (as a constant random variable) almost surely with respect to the random variables $(D_k)_{k\in\mathbb{N}}$ from which the teacher samples. If the convergence is in probability, the sequence of estimators is said to be consistent.

**Remark 2.** Theorem 3.1 also assumes that the hypotheses are distinguishable (Section 2). In a general theory of statistical models, $\hat\theta$ is not necessarily $\delta_h$ for some $h \in \mathcal{H}$. However, in BI it is critical that $\hat\theta = \delta_h$, since BI with a general $\hat\theta \in \mathcal{P}(\mathcal{H})$ is almost never consistent or strongly consistent. Consistency independent of the choice of prior $\theta_0$ in the interior of $\mathcal{P}(\mathcal{H})$ guarantees that BI is always effective.

**Rate of Convergence.** Beyond effectiveness, we quantify the efficiency of BI in terms of the asymptotic rate of convergence.

**Theorem 3.2.** In BI, with $\hat\theta = \delta_h$ for some $h \in \mathcal{H}$, let $\Theta_k(h)(D_1, \dots, D_k) := S_k(h\,|\,D_1, \dots, D_k)$ be the $h$-component of the posterior given $D_1, \dots, D_k$ as random variables valued in $\mathcal{D}$. Then

$$\frac{1}{k}\,\log\frac{\Theta_k(h)}{1 - \Theta_k(h)}$$

converges almost surely to the constant $\min_{h' \neq h} \mathrm{KL}\big(M_{(\cdot,h)}, M_{(\cdot,h')}\big)$.

**Remark 3.** We call $\min_{h' \neq h} \mathrm{KL}(M_{(\cdot,h)}, M_{(\cdot,h')})$ the asymptotic rate of convergence (RoC) of BI, denoted $R_b(M; h)$.

### 3.2. SCBI as a Markov Chain

In the proof of Theorem 3.1, the pivotal property is that the variables $D_1, D_2, \dots$ commute in the posterior (they can occur in any order without affecting it), thanks to the commutativity of multiplication. In SCBI, however, commutativity does not hold, since the likelihood matrix depends on previous outcomes; thus the method used in the BI analysis no longer works. Because the likelihood matrix $M_k = M^{c_{k-1}}$ depends only on the preceding state, the process is in fact Markov, and we may analyze the model as a Markov chain on the continuous state space $\mathcal{P}(\mathcal{H})$.

To describe this process, let $\mathcal{P}(\mathcal{H}) = \Delta^{m-1}$ be the space of states, let $h \in \mathcal{H}$ be the true hypothesis to teach ($\hat\theta = \delta_h$), and let the learner's prior be $S_0 = \theta_0 \in \mathcal{P}(\mathcal{H})$; that is, the distribution of the learner's initial state is $\mu_0 = \delta_{\theta_0} \in \mathcal{P}(\mathcal{P}(\mathcal{H}))$.

**The operator $\Psi$.** In each round of the Markov chain, the transition operator maps the prior, as a probability distribution on the state space $\mathcal{P}(\mathcal{H}) = \Delta^{m-1}$, to the posterior as another, i.e., $\Psi(h) : \mathcal{P}(\mathcal{P}(\mathcal{H})) \to \mathcal{P}(\mathcal{P}(\mathcal{H}))$. To simplify the formal definition of $\Psi(h)$, we first define some maps. For any $d \in \mathcal{D}$, let $T_d : \Delta^{m-1} \to \Delta^{m-1}$ be the map taking the learner's prior to the posterior when data $d$ is chosen by the teacher; that is, $T_d$ sends each normalized vector $\theta$ to $T_d(\theta) = M^{n\theta}_{(d,\cdot)}$ according to SCBI. Each $T_d$ is a bijection by the uniqueness of Sinkhorn scaling limits of $M$ (Hershkowitz et al., 1988). Further, $T_d$ is continuous on $\Delta^{m-1}$ and smooth in its interior (Wang et al., 2019b).
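The rate $R_b$ of Remark 3 can be computed directly from the columns of $M$; a sketch, with a toy matrix of our own:

```python
import math

def kl(p, q):
    """KL divergence between two finite distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rate_bi(M, h):
    """R_b(M; h) = min over h' != h of KL(M_(.,h), M_(.,h'))."""
    n, m = len(M), len(M[0])
    col = lambda j: [M[i][j] for i in range(n)]
    return min(kl(col(h), col(j)) for j in range(m) if j != h)

M = [[0.8, 0.3],
     [0.2, 0.7]]
print(rate_bi(M, 0) > 0)  # → True  (the columns are distinguishable)
```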
Continuity and smoothness of $T_d$ make it natural to induce a push-forward $T_{d*} : \mathcal{P}(\Delta^{m-1}) \to \mathcal{P}(\Delta^{m-1})$ on Borel measures. Explicitly, $(T_{d*}(\mu))(E) = \mu(T_d^{-1}(E))$ for each Borel measure $\mu \in \mathcal{P}(\Delta^{m-1})$ and each Borel measurable set $E \subseteq \Delta^{m-1}$. Let $\tau : \mathcal{P}(\mathcal{H}) \to \mathcal{P}(\mathcal{D})$ be the map by which the teacher adjusts the sampling distribution based on the learner's prior; that is, given a learner's prior $\theta \in \Delta^{m-1}$, by the definition of SCBI the teacher's distribution is adjusted to $\tau(\theta) = \frac{M^{n\theta}_{(\cdot,h)}}{n\theta(h)} = (n\theta(h))^{-1}\big(T_1(\theta)(h), \dots, T_n(\theta)(h)\big)$. The $d$-th component of $\tau$ is denoted $\tau_d$. We can use $\tau$ only when $\mu_0 = \delta_{\theta_0}$, in which case the teacher can trace the learner's state. Now we can define $\Psi(h)$ formally.

**Definition 3.3.** Given a fixed hypothesis $h \in \mathcal{H}$ (equivalently, $\delta_h \in \mathcal{P}(\mathcal{H})$), the operator $\Psi(h) : \mathcal{P}(\Delta^{m-1}) \to \mathcal{P}(\Delta^{m-1})$ taking a prior, as a Borel measure $\mu$, to the posterior distribution $\Psi(h)(\mu)$ after one round of SCBI is given, for any Borel measurable set $E \subseteq \Delta^{m-1}$, by

$$(\Psi(h)(\mu))(E) := \sum_{d \in \mathcal{D}} \int_E \tau_d\big(T_d^{-1}(\theta)\big)\, d\big(T_{d*}(\mu)\big)(\theta). \tag{2}$$

In our case, we start with a distribution $\delta_\theta$, where $\theta \in \mathcal{P}(\mathcal{H})$ is the learner's prior on the set of hypotheses. In each round of inference there are $n$ different possibilities according to the data taught. Thus after any finite number of rounds $k$, the distribution of the posterior is a sum of at most $n^k$ atoms (in fact, one can prove that $n^k$ is exact). In the following discussion we therefore assume that $\mu$ is atomic. The action of $\Psi$ on an atomic distribution is determined by its action on an atom:

$$\Psi(h)(\delta_\theta) = \sum_{i \in \mathcal{D}} \frac{M^{n\theta}_{(i,h)}}{n\theta(h)}\; \delta_{M^{n\theta}_{(i,\cdot)}}. \tag{3}$$

Moreover, since the SCBI behavior depends only on the prior (with $M$ and $h$ fixed) as a random variable, the same operator $\Psi(h)$ applies to every round of SCBI. We can therefore conclude:

**Proposition 3.4.** Given $h \in \mathcal{H}$, let $\hat\theta = \delta_h$. The sequence of estimators $(S_k)_{k\in\mathbb{N}}$ in SCBI forms a time-homogeneous Markov chain on the state space $\mathcal{P}(\mathcal{H})$ with transition operator $\Psi(h)$ characterized by Eq. (2) and Eq. (3).
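Eq. (3) can be checked numerically: the atom weights $M^{n\theta}_{(i,h)}/(n\theta(h))$ over $i \in \mathcal{D}$ sum to 1, because column $h$ of $M^{n\theta}$ sums to $n\theta(h)$. A sketch using a naive fixed-iteration Sinkhorn and a toy matrix of our own:

```python
def sinkhorn(M, r, c, iters=500):
    """Naive (r, c)-Sinkhorn scaling (see Sec. 2.1)."""
    M = [row[:] for row in M]
    for _ in range(iters):
        for i, row in enumerate(M):
            s = sum(row)
            M[i] = [x * r[i] / s for x in row]
        for j in range(len(M[0])):
            s = sum(row[j] for row in M)
            for row in M:
                row[j] *= c[j] / s
    return M

M = [[0.6, 0.1], [0.3, 0.3], [0.1, 0.6]]  # toy column-stochastic matrix
theta, h, n = [0.4, 0.6], 0, 3
scaled = sinkhorn(M, r=[1.0] * n, c=[n * t for t in theta])

# Atoms of Psi(h)(delta_theta): branch i has weight scaled[i][h]/(n*theta[h]),
# and the corresponding posterior atom is row i of the scaled matrix.
atoms = [(scaled[i][h] / (n * theta[h]), scaled[i]) for i in range(n)]
print(abs(sum(w for w, _ in atoms) - 1.0) < 1e-9)  # → True
```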
Since SCBI is a time-homogeneous Markov process, we can further establish consistency.

**Theorem 3.5** (Consistency). In SCBI, let $M$ be a positive matrix. If the teacher teaches one hypothesis $h$ (i.e., $\hat\theta = \delta_h \in \mathcal{P}(\mathcal{H})$), and the prior distribution $\mu_0 \in \mathcal{P}(\Delta^{m-1})$ satisfies $\mu_0 = \delta_{\theta_0}$ with $\theta_0(h) > 0$, then the estimator sequence $(S_k)$ is consistent for each $h \in \mathcal{H}$; i.e., the posterior random variables $(\Theta_k)_{k\in\mathbb{N}}$ converge to the constant random variable $\hat\theta$ in probability.

**Remark 4.** The assumption $\theta_0(h) > 0$ in Theorem 3.5 is necessary in any type of Bayesian inference: if a hypothesis is excluded from the prior, Bayes' rule can never recover it in the posterior. In practice, the prior distribution is usually chosen to be $\mu_0 = \delta_u$ with $u = \frac{1}{m}(1, \dots, 1) \in \Delta^{m-1}$ the uniform distribution in $\mathcal{P}(\mathcal{H})$.

**Rate of Convergence.** Thanks to consistency, we can calculate the asymptotic rate of convergence for SCBI.

**Theorem 3.6.** With matrix $M$, hypothesis $h \in \mathcal{H}$, and prior $\mu_0 = \delta_{\theta_0} \in \mathcal{P}(\Delta^{m-1})$ as in Theorem 3.5, let $\theta_k$ denote a sample value of the posterior $\Theta_k$ after $k$ rounds of SCBI. Then

$$\lim_{k\to\infty} \frac{1}{k}\,\log\frac{\theta_k(h)}{1 - \theta_k(h)} = R_s(M; h), \tag{4}$$

where $R_s(M; h) := \min_{h' \neq h} \mathrm{KL}\big(M'_{(\cdot,h)}, M'_{(\cdot,h')}\big)$ with $M' = \mathrm{Ncol}\big(\mathrm{diag}(M_{(\cdot,h)})^{-1} M\big)$. We thus call $R_s(M; h)$ the asymptotic rate of convergence (RoC) of SCBI.

## 4. Sample Efficiency

In this section, we present empirical results comparing the sample efficiency of SCBI and BI.

### 4.1. Asymptotic RoC Comparison

We first compare the asymptotic rates of convergence ($R_b$ for BI and $R_s$ for SCBI; see Theorems 3.2 and 3.6). The matrix $M$ is sampled through $m$ i.i.d. uniform distributions on $\Delta^{n-1}$, one per column. For each column-normalized matrix $M$, we compute two quantities to compare BI with SCBI: the probability

$$P := \Pr\Big[\tfrac{1}{|\mathcal{H}|}\textstyle\sum_{h\in\mathcal{H}} R_s(M; h) \geq \tfrac{1}{|\mathcal{H}|}\sum_{h\in\mathcal{H}} R_b(M; h)\Big]$$

and the expected value of the averaged difference

$$E := \mathbb{E}\Big[\tfrac{1}{|\mathcal{H}|}\textstyle\sum_{h\in\mathcal{H}} R_s(M; h) - \tfrac{1}{|\mathcal{H}|}\sum_{h\in\mathcal{H}} R_b(M; h)\Big].$$

**Two-column Cases.**
Consider the case where $M$ has shape $n \times 2$ with the two columns sampled from $\Delta^{n-1}$ uniformly and independently. We simulated $n = 2, 3, \dots, 50$ with a size-$10^{10}$ Monte Carlo method for each $n$ to calculate $P$ and $E$. The results are shown in Fig. 2(A)(B). The calculation of $E$ can be reduced to a numerical integral over $(\Delta^{n-1})^2$ involving $\ln \sum_{i=1}^n x_i y_i$ and the constant $\ln\frac{n}{n-1}$; see the Supplementary Material.³ Since $P$ approaches 1 as the rank grows, we plot $\ln(1 - P)$ to show the increase in detail.⁴

**More Columns of a Fixed Row Size.** To examine more general cases, we simulated $P$ and $E$ by Monte Carlo on matrices with 10 rows and varying numbers of columns; see Fig. 2(C)(D). We sampled $10^8$ matrices $M$ of shape $10 \times m$ for each $2 \leq m \leq 10$. The empirical results show that $E$ decreases slowly as $m$ grows, but $P$ still increases logistically.

**Square Matrices.** Fig. 2(E)(F) shows the square cases with sizes from 2 to 50, simulated by size-$10^8$ Monte Carlo. The empirical $P$ is the mean of $N$ (the sample size) i.i.d. variables valued 0 or 1, so the standard deviation of a single variable is smaller than 1. By the Central Limit Theorem, the standard deviation satisfies $\sigma(P) < N^{-1/2}$ (the precision threshold). We therefore draw the line $y = N^{-1/2}$ in each log-figure, though it lies within the view area in only one figure.

In all simulated cases, we observe $E > 0$ and $P > 0.5$, indicating that SCBI converges faster than BI in most cases and on average. We also observe that SCBI behaves even better as the matrix size grows, especially when the teacher has more choices of data (i.e., more rows).

### 4.2. Distribution of Inference Results

The promise of cooperation is that one may infer hypotheses from small amounts of data. Hence, we compare SCBI with BI after small, fixed numbers of rounds. To demonstrate, we sample matrices of shape $20 \times 20$ whose columns are distributed uniformly in $\Delta^{19}$; equivalently, they are column normalizations of uniformly sampled matrices whose entries sum to one.
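For reference, the two rates compared in Sec. 4.1 are straightforward to compute. A sketch of $R_s$ from Theorem 3.6, with a toy $2 \times 2$ matrix of our own (note that after the row rescaling, column $h$ of $M'$ is uniform):

```python
import math

def kl(p, q):
    """KL divergence between two finite distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rate_scbi(M, h):
    """R_s(M; h): rescale row i by 1 / M[i][h], column-normalize,
    then take the same min-KL as for R_b."""
    n, m = len(M), len(M[0])
    W = [[M[i][j] / M[i][h] for j in range(m)] for i in range(n)]
    cols = [[W[i][j] for i in range(n)] for j in range(m)]
    cols = [[x / sum(c) for x in c] for c in cols]
    return min(kl(cols[h], cols[j]) for j in range(m) if j != h)

M = [[0.8, 0.3],
     [0.2, 0.7]]
print(rate_scbi(M, 0) > 0)  # → True
```

Averaging $R_s - R_b$ over hypotheses and over sampled matrices reproduces the quantities $E$ and $P$ in miniature (the paper's runs use up to $10^{10}$ samples).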
Assume the correct hypothesis to teach is $h \in \mathcal{H}$. We first simulate 5 rounds of inference exactly, exploring all possible ways the teacher may teach, then calculate the expectation and standard deviation of $\theta(h)$. With 300 matrices sampled as above, Fig. 3 shows this comparison between BI and SCBI.

³ Details can be found in the Supplementary Material.
⁴ We conjecture an empirical formula $\ln(1 - P) \approx -\frac{1}{2}\ln\big(x(x+1)/(x-1.5)\big) - 0.1x - 0.3$; see the Supplementary Material.

Similarly, we extend the number of rounds to 30 using Monte Carlo, since exact calculation by exhausting all possible teaching paths becomes infeasible. Sampling 500 matrices independently, we simulate the teacher teaching 2000 sequences of 30 rounds for each matrix; the statistics are also shown in Fig. 3. From Fig. 3, we observe that SCBI has better expectation and less variance in the short run. In conclusion, the experiments indicate that SCBI is more efficient both asymptotically and in the short run.

## 5. Stability

In this section, we study the robustness of SCBI when the initial conditions of the teacher and learner differ. This can happen when agents do not have full access to their partner's exact state.

**Theory.** In this section, we no longer assume (iii). Let $T$ and $L$ be the matrices of the teacher and learner (not necessarily satisfying (iii)). Let $\theta_0^T$ and $\theta_0^L$ be elements of $\mathcal{P}(\mathcal{H})$ representing the priors on hypotheses that the teacher uses in estimating the inference and that the learner uses in the actual inference, i.e., $\mu_0^T = \delta_{\theta_0^T}$ and $\mu_0^L = \delta_{\theta_0^L}$. During the inference, let $\mu_k^T$ and $\mu_k^L$ be the distributions of the posteriors of the teacher and the learner after round $k$, and denote the corresponding random variables by $\theta_k^T$ and $\theta_k^L$, for all positive $k$ and for $k = \infty$, where $\infty$ denotes the limit in probability. Let $D$ be a random variable on $\mathcal{D}$. We define an operator $\Psi_D^L : \mathcal{P}(\mathcal{P}(\mathcal{H})) \to \mathcal{P}(\mathcal{P}(\mathcal{H}))$ similar to the $\Psi$ of Section 3: with $T_d(\theta) = L^{n\theta}_{(d,\cdot)}$,

$$d\big(\Psi_D^L(\mu)\big)(\theta) := \sum_{d \in \mathcal{D}} \Pr(D = d)\; d\big(T_{d*}\mu\big)(\theta).$$
**Proposition 5.1.** Let $(D_i)_{i\geq 1}$ be a sequence of i.i.d. $\mathcal{D}$-valued random variables following the uniform distribution. Let $\mu_0 \in \mathcal{P}(\mathcal{P}(\mathcal{H}))$ be a prior distribution on $\mathcal{P}(\mathcal{H})$, and $\mu_{k+1} = \Psi_{D_{k+1}}^L(\mu_k)$. Then $\mu_k$ converges, in probability, to $\sum_{i \in \mathcal{H}} a_i\, \delta_{\delta_i}$ where $a_i = \mathbb{E}_{\mu_0}[\theta(i)]$.

**Remark 5.** This proposition helps accelerate the simulation: one may terminate the teaching process when $\theta_k^T$ is sufficiently close to $\delta_h$, since Prop. 5.1 guarantees that the expectation of the learner's posterior on the true hypothesis $h$ at that time is close enough to the eventual probability of reaching $\delta_h$, i.e., $\mathbb{E}\theta_\infty^L(h) \approx \mathbb{E}\theta_k^L(h)$.

**Definition 5.2.** We call $\mathbb{E}\theta_\infty^L(h) := \lim_{k\to\infty} \mathbb{E}_{\mu_k}(\theta(h))$ the successful rate of the inference given $T$, $L$, $\theta_0^T$ and $\theta_0^L$. By the setup in Section 2, the failure probability $1 - \mathbb{E}\theta_\infty^L(h)$ equals $\frac{1}{2}\|\mathbb{E}\theta_\infty^L - \delta_h\|_1$, half the $L^1$-distance on $\mathcal{P}(\mathcal{H})$.

**Simulations with Perturbation on Priors.** We simulated the square cases of ranks 3 and 4. We sample 5 matrices ($M_1$ to $M_5$) of size $3 \times 3$, whose columns are distributed uniformly on $\mathcal{P}(\{d_1, d_2, d_3\}) = \Delta^2$, and 5 priors ($\theta_1$ to $\theta_5$) in $\mathcal{P}(\mathcal{H})$, used as $\theta_0^T$. Similarly, we sample 3 matrices ($M'_1$, $M'_2$, $M'_3$) of size $4 \times 4$, and 3 priors ($\theta'_1$, $\theta'_2$, $\theta'_3$) from $\Delta^3$ in the same way. In both cases, we take $h = 1 \in \mathcal{H}$ to be the true hypothesis to teach. Our simulation uses a Monte Carlo method with $10^4$ teaching sequences (for each plotted point), then applies Proposition 5.1 to calculate the successful rate of inference.

Figure 2. Comparison of RoC between BI and SCBI. (A), (C), (E): the comparison on $P$ in blue and on $E$ in red. (B), (D), (F): plots of $\ln(1 - P)$. (A), (B): two-column case, number of rows from 2 to 50; Monte Carlo of $10^{10}$ samples per point. (C), (D): 10-row case, number of columns from 2 to 10; Monte Carlo of size $10^8$. (E), (F): square case, number of rows from 2 to 50; Monte Carlo of size $10^8$. The horizontal line in (F) is the theoretical precision threshold from the central limit theorem. For $n > 17$, Monte Carlo gives $P = 1$ ($R_s > R_b$ for all samples). Except in (C), where $E$ decays slowly as the column number grows, both $E$ and $P$ increase with size in all other cases; moreover, $P$ grows to 1 logistically in all situations.

Figure 3. Comparison between BI and SCBI on $20 \times 20$ matrices. Top: 300 points (matrices), exact values at round 5. Bottom: 500 points at round 30, using Monte Carlo of size 2000. Left: comparison of expectations of the learner's posterior on $h$. Right: comparison of the standard deviations. The orange line is the diagonal.

For the $3 \times 3$ matrices, we perturb $\theta_0^L$ in two ways: (1) take $\theta_0^L$ around $\theta_0^T$, distributed evenly on concentric circles, giving 630 points for each $\theta_0^T$; in this area, 84 points lie on 6 given directions ($60°$ apart; see the Supplementary Material for figures); and (2) sample $\theta_0^L$ evenly in the whole simplex $\mathcal{P}(\mathcal{H}) = \Delta^2$ (300 points for each $\theta_0^T$). For the $4 \times 4$ matrices, we perturb $\theta_0^L$ in two ways: (1) along 15 randomly chosen directions in $\Delta^3$, taking 21 evenly spaced points per direction, and (2) sampling 300 points evenly in $\Delta^3$. Figure 4 shows samples of the resulting plots (for figures covering the entire simulation, see the Supplementary Material). From the figures we observe: (1) the left plots indicate that the learner's expected posterior on $h$ is roughly linear in perturbations along a line; (2) the right plots indicate that the learner's expected posterior on $h$ is closely bounded by a multiple of the learner's prior on the true $h$. This leads to the following conjecture:

**Conjecture 5.3.** Given $L = T = M$ and $\theta_0^T$, let $h$ be the true hypothesis to teach. For any $\epsilon > 0$, let $\theta_0^L$ be a learner's prior within distance $\epsilon$ of $\theta_0^T$. Then the successful rate after sufficiently many rounds is greater than $1 - C\epsilon$, where $C = 1 - \theta_0^T(h)$.

**Simulations with Perturbation on Matrices.** We now investigate the robustness of SCBI to differences between the agents' matrices. Let $T$ and $L$ be stochastic, with $L$ a perturbation of $T$.
The simulations are performed on the matrices $M_1$ to $M_5$ mentioned above with a fixed common prior $\theta_1$.

Figure 4. From left to right: (A) rank 3, $M_3$ and $\theta_1$, with $\theta^L_0$ perturbed along six directions. (B) rank 3, $M_3$ and $\theta_1$, with $\theta^L_0$ sampled uniformly in $\Delta_2$. (C) rank 4, $M'_1$ and $\theta'_1$, along 15 different directions. (D) rank 4, $M'_1$ and $\theta'_1$, with $\theta^L_0$ sampled uniformly in $\Delta_3$.

Assume all matrices mentioned are column-normalized (this does not affect SCBI, since the cross-ratios and marginal conditions determine the Sinkhorn scaling results). We call the column determined by the true hypothesis $h$ (the first column in our simulation) the target column ("tr. h" in Fig. 5), the column which $R_s$ uses (the argmin column) the relevant column ("rel. h"), and the remaining column the irrelevant column ("irr. h"). Let $T$ be given, and let $L$ be obtained from $T$ by perturbing along the relevant / irrelevant column. Without loss of generality, we assume that only one column of the learner's matrix $L$ is perturbed at a time, as other perturbations may be treated as compositions of such. For each $T$ and each column $h'$, we apply 330 perturbations on concentric circles around $T$ (the disc), 90 perturbations preserving the normalized KL divergence ($\mathrm{KL}(e/n, \mathrm{Nvec}(L_{(\cdot,h')}/L_{(\cdot,1)}, 1))$, used in $R_s$) from the target column, and 50 linear interpolations with the target column. Each point in Fig. 5 is estimated by a Monte Carlo method of size $10^4$, using Proposition 5.1. From the graphs, we can see that the successful rate varies continuously under perturbations: slowly in one direction (the yellow strip crossing the center) and rapidly in the perpendicular direction (the color changes to blue rapidly).

6. Grid World: an Application

Consider a $3 \times 5$ grid world with two possible terminal goals, A and B, and a starting position S as shown below. Let the reward at the terminal position $h_t$ be $R$.
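The parenthetical claim above, that column normalization does not change the Sinkhorn scaling result, can be checked directly: scaling depends only on the matrix's diagonal-equivalence class, and column normalization is multiplication by a diagonal matrix on the right. A minimal sketch with a hand-rolled `sinkhorn` routine (our own illustrative implementation, not the authors' code):

```python
import numpy as np

def sinkhorn(M, row_marginals, col_marginals, iters=500):
    """Alternately rescale rows and columns of a positive matrix M until
    its row and column sums match the prescribed marginals."""
    S = M.astype(float).copy()
    for _ in range(iters):
        S *= (row_marginals / S.sum(axis=1))[:, None]   # fix row sums
        S *= (col_marginals / S.sum(axis=0))[None, :]   # fix column sums
    return S

rng = np.random.default_rng(1)
M = rng.random((3, 3)) + 0.1          # a positive 3x3 matrix
r = np.full(3, 1/3)                    # uniform row marginals
c = np.full(3, 1/3)                    # column marginals (e.g. a current posterior)

S1 = sinkhorn(M, r, c)
S2 = sinkhorn(M / M.sum(axis=0), r, c)  # column-normalized input
assert np.allclose(S1, S2, atol=1e-8)   # same scaling result either way
```

Since Sinkhorn's theorem gives a unique scaled matrix for a positive input with consistent marginals, both calls converge to the same limit.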
Assuming no step costs, the value of a grid at distance $k$ from $h_t$ is then $R\gamma^k$ (in the RL sense), where $\gamma < 1$ is the discount factor. Suppose the structure of the world is accessible to both agents, whereas the true location of the goal $h_t$ is known only to the teacher. The teacher performs a sequence of actions to teach $h_t$ to a learner. At each round, there are three available actions: left, up and right. After observing the teacher's actions, the learner updates their belief about $h_t$ accordingly. We now compare the behaviours of BI and SCBI agents in this grid world. In terms of the previous notation, the hypothesis set $H = \{A, B\}$ and the data set $D = \{\text{left}, \text{up}, \text{right}\}$. Let the learner's prior over $H$ be $\theta_0 = (0.5, 0.5)$ and the true hypothesis $h_t$ be A. Then, at each blue grid, the agents' (unnormalized) initial matrix, with rows indexed by left, up and right, is
$$M = \begin{pmatrix} \gamma^{k-1} & \gamma^{k+1} \\ \gamma^{k-1} & \gamma^{k-1} \\ \gamma^{k+1} & \gamma^{k-1} \end{pmatrix}.$$
Assume both the BI teacher and the SCBI teacher start at grid S. Based on $M$, the BI teacher would choose equally between left and up, whereas the SCBI teacher is more likely to choose left, since the teacher's likelihood matrix
$$T = \begin{pmatrix} 2/(3+3\gamma^2) & 2\gamma^2/(3+3\gamma^2) \\ 1/3 & 1/3 \\ 2\gamma^2/(3+3\gamma^2) & 2/(3+3\gamma^2) \end{pmatrix},$$
obtained from the Sinkhorn scaling of $M$, assigns higher probability to left. Hence, compared to the BI teacher, who aims only for the final goal, the SCBI teacher tends to cooperate with the learner by selecting less ambiguous moves towards the goal. This point is aligned with the core idea of many existing models of cooperation in cognitive development (Jara-Ettinger et al., 2016; Bridgers et al., in press), pragmatic reasoning (Frank & Goodman, 2012; Goodman & Stuhlmüller, 2013) and robotics (Ho et al., 2016; Fisac et al., 2017). Moreover, even given the same teaching data, the SCBI learner is more likely to infer $h_t$ than the BI learner. For instance, given the teacher's trajectory {left, up}, the left plot in Fig. 6 shows the SCBI and BI learners' posteriors on the true hypothesis $h_t$.
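The matrix $T$ above can be reproduced numerically as the Sinkhorn scaling of $M$ with uniform row sums ($2/3$ each) and unit column sums; these marginals are our inference from the row and column sums of the displayed $T$, and the `sinkhorn` routine below is our own minimal sketch rather than the authors' code:

```python
import numpy as np

def sinkhorn(M, row_marginals, col_marginals, iters=500):
    """Alternate row/column rescaling of a positive matrix to given marginals."""
    S = M.astype(float).copy()
    for _ in range(iters):
        S *= (row_marginals / S.sum(axis=1))[:, None]
        S *= (col_marginals / S.sum(axis=0))[None, :]
    return S

gamma, k = 0.9, 3   # illustrative discount factor and distance to the goal
# Rows: left, up, right; columns: goals A, B (common factor R dropped).
M = np.array([[gamma**(k-1), gamma**(k+1)],
              [gamma**(k-1), gamma**(k-1)],
              [gamma**(k+1), gamma**(k-1)]])

T = sinkhorn(M, np.full(3, 2/3), np.ones(2))

# Matches the closed form stated in the text:
assert np.isclose(T[0, 0], 2 / (3 + 3 * gamma**2), atol=1e-8)
assert np.isclose(T[1, 0], 1/3, atol=1e-8)
# When teaching goal A, the SCBI teacher favors left over up over right:
assert T[0, 0] > T[1, 0] > T[2, 0]
```

By contrast, column-normalizing $M$ directly (the BI teacher) gives left and up equal probability for goal A, as stated in the text.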
Hence, compared to the BI learner, who reads the teacher's actions literally, the SCBI learner interprets the teacher's data cooperatively by updating beliefs sequentially after each round. Regarding stability, consider the case where the learner's discount factor is either greater or less (with equal probability) than the teacher's by 0.1. The right plot in Fig. 6 illustrates the expected difference between the learner's posterior on $h_t$ after observing a teacher's trajectory of length 2 and the teacher's estimate of it. As discussed in Sec. 4.1 and shown in Fig. 2, as the board gets wider and the number of possible goals grows (i.e., the number of hypotheses increases), the gap between the posteriors of the SCBI and BI learners will increase, whereas the expected difference between the agents for the same magnitude of perturbation will decrease. Thus, this example illustrates the consistency, sample efficiency, and stability of SCBI versus BI.

Figure 5. Perturbations on $M_3$ along a column, and their zoomed-in versions (with a different color scale). The crosses show the positions of the three normalized columns of $T = M_3$; the locations of the dots represent the perturbed column of $T$ (unperturbed columns are represented by the crosses away from the center of the disc), whereas their colors depict the successful rate of the inference. The left two figures are perturbations on the irrelevant column; the right two figures are perturbations on the relevant column.

Figure 6. The top plot demonstrates that both BI and SCBI converge to the true hypothesis, with SCBI having higher sample efficiency. The bottom plot shows that both BI and SCBI agents are robust to perturbations, with SCBI relatively less stable.

7.
Related Work

Literatures on Bayesian teaching (Eaves & Shafto, 2016; Eaves Jr et al., 2016), rational speech act theory (Frank & Goodman, 2012; Goodman & Stuhlmüller, 2013), and machine teaching (Zhu, 2015; 2013) consider the problem of selecting examples that improve a learner's chances of inferring a concept. These literatures differ from ours in that they consider the single-step, rather than the sequential, problem; they do not formalize learners who reason about the teacher's selection process; and they propose models without a mathematical analysis. The literature on pedagogical reasoning in human learning (Shafto & Goodman, 2008; Shafto et al., 2012; 2014) and cooperative inference (Yang et al., 2018; Wang et al., 2019a;b) in machine learning formalizes full recursive reasoning from the perspectives of both the teacher and the learner. These works consider only the problem of a single interaction between the teacher and the learner. The literature on curriculum learning considers sequential interactions between a teacher and a learner in which the teacher presents data in an ordered sequence (Bengio et al., 2009), and traces back to various literatures on human and animal learning (Skinner, 1958; Elman, 1993). Curriculum learning involves one of a number of methods for optimizing the sequence of data presented to the learner, most commonly starting with easier or simpler examples first and gradually moving toward more complex or less typical ones. Curriculum learning considers only problems where the teacher optimizes the sequence of examples and the learner does not reason about the teaching.

8. Conclusions

Cooperation is central to learning in humans and machines. We set out to provide a mathematical foundation for Sequential Cooperative Bayesian Inference (SCBI). We presented new analytic results demonstrating the consistency and asymptotic rate of convergence of SCBI.
Empirically, we demonstrated the sample efficiency of SCBI and its stability to perturbations as compared to Bayesian inference, and illustrated these with a simple reinforcement learning problem. We therefore provide strong evidence that SCBI satisfies basic desiderata. Future work will aim to provide mathematical proofs of the empirically observed efficiency and stability.

Acknowledgements

This material is based on research sponsored by the Air Force Research Laboratory and DARPA under agreement number FA8750-17-2-0146 and the Army Research Office and DARPA under agreement HR00112020039. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. This work was also supported by DoD grant 72531RTREP and NSF MRI 1828528 to PS.

References

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48. ACM, 2009.

Berger, J. O., Moreno, E., Pericchi, L. R., Bayarri, M. J., Bernardo, J. M., Cano, J. A., De la Horra, J., Martín, J., Ríos-Insúa, D., Betró, B., et al. An overview of robust Bayesian analysis. Test, 3(1):5–124, 1994.

Bonawitz, E., Shafto, P., Gweon, H., Goodman, N. D., Spelke, E., and Schulz, L. The double-edged sword of pedagogy: Instruction limits spontaneous exploration and discovery. Cognition, 120(3):322–330, 2011.

Bridgers, S., Jara-Ettinger, J., and Gweon, H. Young children consider the expected utility of others' learning to decide what to teach. Nature Human Behaviour, in press.

Csibra, G. and Gergely, G. Natural pedagogy. Trends in Cognitive Sciences, 13(4):148–153, 2009.

Doliwa, T., Fan, G., Simon, H. U., and Zilles, S. Recursive teaching dimension, VC-dimension and sample compression. Journal of Machine Learning Research, 15(1):3107–3131, 2014.

Doob, J. L. Application of the theory of martingales.
Le calcul des probabilités et ses applications, pp. 23–27, 1949.

Eaves, B. S. and Shafto, P. Parameterizing developmental changes in epistemic trust. Psychonomic Bulletin & Review, pp. 1–30, 2016.

Eaves Jr, B. S., Feldman, N. H., Griffiths, T. L., and Shafto, P. Infant-directed speech is consistent with teaching. Psychological Review, 123(6):758, 2016.

Elman, J. L. Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99, 1993.

Fisac, J. F., Gates, M. A., Hamrick, J. B., Liu, C., Hadfield-Menell, D., Palaniappan, M., Malik, D., Sastry, S. S., Griffiths, T. L., and Dragan, A. D. Pragmatic-pedagogic value alignment. arXiv preprint arXiv:1707.06354, 2017.

Frank, M. C. and Goodman, N. D. Predicting pragmatic reasoning in language games. Science, 336(6084):998–998, 2012.

Ghahramani, Z. Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452, 2015.

Goodman, N. D. and Stuhlmüller, A. Knowledge and implicature: Modeling language understanding as social cognition. Topics in Cognitive Science, 5(1):173–184, 2013.

Hadfield-Menell, D., Russell, S. J., Abbeel, P., and Dragan, A. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3909–3917, 2016.

Hershkowitz, D., Rothblum, U. G., and Schneider, H. Classifications of nonnegative matrices using diagonal equivalence. SIAM Journal on Matrix Analysis and Applications, 9(4):455–460, 1988.

Ho, M. K., Littman, M., MacGlashan, J., Cushman, F., and Austerweil, J. L. Showing versus doing: Teaching by demonstration. In Advances in Neural Information Processing Systems, pp. 3027–3035, 2016.

Jara-Ettinger, J., Gweon, H., Schulz, L. E., and Tenenbaum, J. B. The naïve utility calculus: Computational principles underlying commonsense psychology. Trends in Cognitive Sciences, 20(8):589–604, 2016.

Kadane, J. B., Chuang, D. T., et al. Stable decision problems.
The Annals of Statistics, 6(5):1095–1110, 1978.

Miescke, K.-J. and Liese, F. Statistical Decision Theory: Estimation, Testing, and Selection. Springer, 2008. doi: 10.1007/978-0-387-73194-0.

Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

Schneider, M. H. Matrix scaling, entropy minimization, and conjugate duality. I. Existence conditions. Linear Algebra and its Applications, 114:785–813, 1989.

Shafto, P. and Goodman, N. Teaching games: Statistical sampling assumptions for learning in pedagogical situations. In Proceedings of the 30th Annual Conference of the Cognitive Science Society, pp. 1632–1637. Cognitive Science Society, Austin, TX, 2008.

Shafto, P., Goodman, N. D., and Frank, M. C. Learning from others: The consequences of psychological reasoning for human learning. Perspectives on Psychological Science, 7(4):341–351, 2012.

Shafto, P., Goodman, N. D., and Griffiths, T. L. A rational account of pedagogical reasoning: Teaching by, and learning from, examples. Cognitive Psychology, 71:55–89, 2014.

Skinner, B. F. Teaching machines. Science, 128(3330):969–977, 1958.

Tenenbaum, J. B., Kemp, C., Griffiths, T. L., and Goodman, N. D. How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279–1285, 2011.

Tomasello, M. The Cultural Origins of Human Cognition. Harvard University Press, Cambridge, MA, 1999.

Wang, P., Paranamana, P., and Shafto, P. Generalizing the theory of cooperative inference. In AISTATS, 2019a.

Wang, P., Wang, J., Paranamana, P., and Shafto, P. A mathematical theory of cooperative communication, 2019b.

Yang, S. C., Yu, Y., Givchi, A., Wang, P., Vong, W. K., and Shafto, P. Optimal cooperative inference. In AISTATS, volume 84 of Proceedings of Machine Learning Research, pp. 376–385. PMLR, 2018.

Zhu, X. Machine teaching for Bayesian learners in the exponential family. In Advances in Neural Information Processing Systems, pp. 1905–1913, 2013.

Zhu, X.
Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI, pp. 4083–4087, 2015.

Zilles, S., Lange, S., Holte, R., and Zinkevich, M. Teaching dimensions based on cooperative learning. In COLT, pp. 135–146, 2008.