# Streaming Inference for Infinite Feature Models

Rylan Schaeffer ¹ ², Yilun Du ³, Gabrielle Kaili-May Liu ², Ila Rani Fiete ² ⁴

*Equal contribution. ¹Computer Science, Stanford University. ²Brain and Cognitive Sciences, MIT. ³Electrical Engineering and Computer Science, MIT. ⁴McGovern Institute for Brain Research, MIT. Correspondence to: Rylan Schaeffer.

*Proceedings of the 39th International Conference on Machine Learning*, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

## Abstract

Unsupervised learning from a continuous stream of data is arguably one of the most common and most challenging problems facing intelligent agents. One class of unsupervised models, collectively termed feature models, attempts unsupervised discovery of latent features underlying the data and includes common models such as PCA, ICA, and NMF. However, if the data arrive in a continuous stream, determining the number of features is a significant challenge, and the number may grow with time. In this work, we make feature models significantly more applicable to streaming data by imbuing them with the ability to create new features, online, in a probabilistic and principled manner. To achieve this, we derive a novel recursive form of the Indian Buffet Process, which we term the Recursive IBP (R-IBP). We demonstrate that R-IBP can be used as a prior for feature models to efficiently infer a posterior over an unbounded number of latent features, with quasilinear average time complexity and logarithmic average space complexity. We compare R-IBP to existing sampling and variational baselines in two feature models (Linear Gaussian and Factor Analysis) and demonstrate on synthetic and real data that R-IBP achieves comparable or better performance in significantly less time.

*Figure 1. Motivation for Infinite Feature Models. As more data are observed, a feature model (here: PCA) requires increasingly more features to explain the data (Omniglot handwritten characters) (left) or else becomes increasingly unable to do so (right).*

## 1. Introduction

Feature models are a broad class of unsupervised probabilistic models that aim to decompose data into an unknown number of unknown features under certain assumptions; the class includes principal component analysis, factor analysis, independent component analysis, non-negative matrix factorization, matching pursuit, and more. A fundamental problem in feature modeling, analogous to the problem in mixture modeling of choosing the number of clusters, is choosing the number of features. Users typically employ one of two approaches: either (1) prespecifying a fixed number of features, or (2) retroactively choosing a number of features after seeing all the data, based on some criterion (e.g., selecting the number of principal components necessary to explain 95% of the variance). In a streaming setting, however, where data are received over time, neither approach suffices. For instance, representing handwritten characters with a fixed number of principal components becomes inadequate as more characters are encountered (Fig. 1). Thus the number of features should flexibly adapt to the data in the streaming context.
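The 95%-of-variance criterion in approach (2) is easy to state concretely. As a minimal sketch (our own NumPy illustration on synthetic data, not code from the paper; the function name is hypothetical), one computes the PCA spectrum of the full dataset and takes the smallest number of components whose cumulative explained-variance ratio reaches the threshold:

```python
import numpy as np

def num_components_for_variance(X, threshold=0.95):
    """Smallest number of principal components explaining `threshold`
    of the total variance of X (rows = observations)."""
    Xc = X - X.mean(axis=0)                    # center the data
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values give the PCA spectrum
    explained = (s ** 2) / np.sum(s ** 2)      # variance ratio per component
    # First index where the cumulative ratio reaches the threshold, plus one.
    return int(np.searchsorted(np.cumsum(explained), threshold) + 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 50))  # low-rank synthetic data
print(num_components_for_variance(X))
```

The catch, and the paper's motivation, is that this criterion requires having seen all the data: in a streaming setting the spectrum keeps changing as new observations arrive, so any retroactively chosen component count becomes stale.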
Such flexibility is a goal not only because the streaming setting for feature models is important in its own right, but also because feature models are a pervasive approach in neuroscience and cognitive science to explain how intelligent agents model the world as they move through it (Olshausen & Field, 1997; Hyvärinen, 2010; Pehlevan et al., 2015). Intelligent agents, from mice to humans to mobile devices, must deal with streaming data, since these agents operate with limited memory that renders storage of and computation on all previously seen data prohibitive. This raises the question of how to perform efficient streaming inference for infinite feature models, a question we answer here. Following the approach of Schaeffer et al. (2021) to efficient streaming inference for infinite mixture models, we first show that the Indian Buffet Process (Griffiths & Ghahramani, 2005), a stochastic process frequently used for Bayesian nonparametric feature models, can be rewritten in a novel form designed for streaming inference with expected quasilinear time and expected logarithmic space complexity. We then demonstrate on both synthetic and real (tabular & non-tabular) data that R-IBP matches or exceeds the performance of five streaming and non-streaming baseline inference algorithms in less time.

## 2. Background

### 2.1. Generative Model

We consider observing a sequence of $N$ $D$-dimensional variables $o_{1:N}$ (with $o_n \in \mathbb{R}^D$) based on a sequence of $N$ $K$-dimensional binary latent variables $z_{1:N}$, with $z_n \in \{0, 1\}^K$, $K$ unknown, and the subscript $1{:}N$ denoting the sequence $(1, 2, \ldots, N)$. Each $z_{nk}$ in the $(N \times K)$-dimensional latent variable matrix $Z$ denotes the presence or absence of the $k$th feature in the $n$th observation. Each feature is some unknown vector $A_k \in \mathbb{R}^D$ drawn i.i.d. from some distribution $p(A)$. Because the number of latent features $K$ is unknown, the Indian Buffet Process (IBP) serves as a flexible prior over the latent indicators:

$$
z_{1:N} \sim \mathrm{IBP}(\alpha, \beta), \qquad A_k \overset{\text{i.i.d.}}{\sim} p(A), \qquad o_n \mid z_n, \{A_k\} \sim p(o \mid z_n, \{A_k\}) \tag{1}
$$

This encompasses many feature models, including Principal Component Analysis, Factor Analysis, Independent Component Analysis, and Non-Negative Matrix Factorization.

### 2.2. Indian Buffet Process

The Indian Buffet Process (IBP) (Griffiths & Ghahramani, 2011) is a two-parameter¹ ($\alpha > 0$, $\beta > 0$) stochastic process that defines a discrete distribution over binary matrices with finitely many rows (observations) and an unbounded number of columns (features). The name IBP arises from imagining customers (rows/observations) arriving sequentially at a buffet that has an infinite number of dishes (columns/features) and selecting which dishes to eat: the $n$th customer selects an integer number of new dishes $\lambda_n \sim \mathrm{Poisson}\big(\alpha \beta / (\beta + n - 1)\big)$ and then selects previous dishes with probability proportional to the number of previous customers who selected those dishes. Denoting the total number of dishes after the first $n$ customers $\Lambda_n = \sum_{n'=1}^{n} \lambda_{n'}$, the IBP defines a conditional distribution for the $n$th row and $k$th column's binary variable $z_{nk}$:

$$
p(z_{nk} = 1 \mid z_{<n, k}) = \frac{\sum_{n' < n} z_{n'k}}{\beta + n - 1}
$$

¹ The IBP originally had a single parameter (Griffiths & Ghahramani, 2005) but was extended to two (Ghahramani et al., 2007) and later three (Teh & Görür, 2009). Our paper applies equally to all, but since our focus is on efficient streaming inference and not on particular properties of an IBP variant, we chose the two-parameter IBP to balance expositional simplicity against model flexibility.
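To make the buffet metaphor concrete, the following sketch samples a binary matrix $Z \sim \mathrm{IBP}(\alpha, \beta)$ by simulating the customers-and-dishes process exactly as described above (a minimal NumPy illustration of the standard sequential scheme, not the paper's R-IBP code; function and variable names are our own):

```python
import numpy as np

def sample_ibp(N, alpha=2.0, beta=1.0, rng=None):
    """Sample a binary matrix Z ~ IBP(alpha, beta) with N rows.

    Customer n takes existing dish k with probability m_k / (beta + n - 1),
    where m_k counts previous customers who took dish k, then samples
    Poisson(alpha * beta / (beta + n - 1)) brand-new dishes.
    """
    rng = rng or np.random.default_rng()
    dish_counts = []                 # m_k for each dish sampled so far
    rows = []
    for n in range(1, N + 1):
        # Revisit existing dishes in proportion to their popularity.
        row = [rng.random() < m / (beta + n - 1) for m in dish_counts]
        # Sample a Poisson number of new dishes, all taken by this customer.
        n_new = rng.poisson(alpha * beta / (beta + n - 1))
        row += [True] * n_new
        dish_counts = [m + z for m, z in zip(dish_counts, row)] + [1] * n_new
        rows.append(row)
    K = len(dish_counts)             # total number of dishes, Lambda_N
    Z = np.zeros((N, K), dtype=int)
    for n, row in enumerate(rows):
        Z[n, :len(row)] = row        # later customers may have more columns
    return Z

Z = sample_ibp(N=10, rng=np.random.default_rng(0))
print(Z.shape, Z.sum(axis=0))        # dish popularity decays with column index
```

Because each customer's expected number of new dishes is $\alpha\beta/(\beta + n - 1)$, the total number of columns grows only logarithmically in $N$ in expectation, which is the property underlying the logarithmic average space complexity quoted in the abstract.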
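To complete the generative model of Eq. (1), a sampled $Z$ can be paired with a concrete likelihood. Below is a hedged sketch of a Linear Gaussian feature model (one of the two feature models the paper evaluates), assuming Gaussian feature and noise distributions with hypothetical variance parameters `sigma_A` and `sigma_o`; it reuses `numpy` and the `Z` drawn by `sample_ibp` above:

```python
def sample_linear_gaussian(Z, D, sigma_A=1.0, sigma_o=0.1, rng=None):
    """Given Z ~ IBP, draw A_k ~ N(0, sigma_A^2 I) i.i.d. and
    o_n ~ N(z_n A, sigma_o^2 I), i.e. Eq. (1) with a Gaussian likelihood."""
    rng = rng or np.random.default_rng()
    N, K = Z.shape
    A = rng.normal(0.0, sigma_A, size=(K, D))          # features A_k in R^D
    O = Z @ A + rng.normal(0.0, sigma_o, size=(N, D))  # observations o_1:N
    return O, A

O, A = sample_linear_gaussian(Z, D=5, rng=np.random.default_rng(1))
```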