# a_birthdeath_process_for_feature_allocation__36f9d085.pdf

A Birth-Death Process for Feature Allocation

Konstantina Palla 1 David Knowles 2 Zoubin Ghahramani 3 4

We propose a Bayesian nonparametric prior over feature allocations for sequential data, the birthdeath feature allocation process (BDFP). The BDFP models the evolution of the feature allocation of a set of N objects across a covariate (e.g. time) by creating and deleting features. A BDFP is exchangeable, projective, stationary and reversible, and its equilibrium distribution is given by the Indian buffet process (IBP). We show that the Beta process on an extended space is the de Finetti mixing distribution underlying the BDFP. Finally, we present the ﬁnite approximation of the BDFP, the Beta Event Process (BEP), that permits simpliﬁed inference. The utility of the BDFP as a prior is demonstrated on real world dynamic genomics and social network data.

1. Introduction - Problem Statement

We are interested in time series settings where we observe data {Yt Y : t = 1, . . . , L}. We consider problems where the observations are explained by a latent structure which assigns objects to features and this feature allocation changes over time. For instance, consider the topics covered by a number of newspapers over time; some topics die while new ones are born . The topic coverage of each paper is its latent feature allocation which could be modelled with an Indian buffet process (Grifﬁths & Ghahramani, 2011, IBP). While static feature allocation models are well studied, these are not able to handle the time series nature of many datasets. We propose a process that extends the IBP by allowing the feature allocation to evolve over the covariate as a result of birth and death of features.

1University of Oxford, Oxford, UK 2Stanford University, California, USA 3University of Cambridge, Cambridge, UK 4Uber AI Labs, SF, California, USA. Correspondence to: Konstantina Palla <konstantina.palla@gmail.com>.

Proceedings of the 34 th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

2. Related Work

We target problems where the data depends on a covariate, such as time or space, and is explained by a latent structure, in particular a (multi-membership) clustering of the data points. The observations are result of the underlying partitioning and its evolution over the covariate. Typical models fall in two main categories: clustering and feature allocation. The former allow each data point to belong to one and only one class (cluster), while the latter let each data point belong to multiple groups (features). Bayesian nonparametric approaches are primarily based on the Chinese restaurant process (CRP, Aldous, 1983) or the Indian buffet process (IBP, Grifﬁths & Ghahramani, 2005) corresponding to the two categories. In particular, a sample from a CRP is an assignment of data points to disjoint classes (a clustering), while a sample from an IBP is an allocation of the data points to (possibly) overlapping classes (a feature allocation). Dependent nonparametric processes extend distributions over partitions to distributions over collections of partitions indexed by locations in some covariate space, such as R+ (e.g. continuous time), Z (e.g. discrete time), or Rd (e.g. geographical location). Teh et al. (2013) deﬁne such a process based on the duality between Kingman s coalescent (Kingman, 1982) and the Dirichlet diffusion tree (Neal, 2003). In the resulting Fragmentation-Coagulation process (FCP) a partitioning of the data points evolves over the covariate undergoing fragmentation and coagulation events while maintaining CRP marginals. More recently, Palla et al. (2013) derived a dependent partition-valued process (DPVP) on an arbitrary covariate space which, like the FCP, is exchangeable and has CRP distributed marginals. In the setting of feature allocations, Williamson et al. (2010) propose a nonparametric process, the dependent IBP (d IBP), with IBP distributed marginals and in which the feature allocations are coupled over the covariate space using a Gaussian process (GP, Rasmussen & Williams, 2006). In a similar vein, Van Gael et al. (i FHMM, 2009) deﬁne the Markov Indian Buffet process (m IBP), a probability distribution over a potentially inﬁnite number of binary Markov chains evolving in discrete time. They use the m IBP to extend the factorial hidden Markov model (FHMM, Ghahramani & Jordan, 1997) to the inﬁnite FHMM (i FHMM).

In this paper, we address the problem of dependence for

A Birth-Death Process for Feature Allocation

binary latent feature models. We propose a process that extends the IBP by allowing features to be born and die at times learnt by the model, while maintaining the essential mathematical properties of the IBP. The process is a Markov Jump process (MJP) where the events are the birth or the death of a feature. The idea is closely related to the FCP where the events are either a fragmentation of a cluster or a coagulation of two clusters. The partitions at each location in the FCP are marginally a sample from a Chinese restaurant process, while the feature allocations in the BDFP are marginally samples from an IBP. Compared to the d IBP, both processes model feature allocations evolving over the covariate. However, while in the d IBP the assignment of data points to a feature might change over the covariate, in our process, it remains the same until the feature dies. In the case of the i FHMM, the authors model the dependence of a feature allocation on a discrete time variable as opposed to our process where continuous covariate space is assumed. Moreover, in the i FHMM, the marginal distribution of a feature allocation is analogous but not equal to an IBP. We call the proposed process the birth-death feature allocation process (BDFP). The BDFP is exchangeable, projective, stationary and reversible, and its equilibrium distribution is given by the Indian buffet process.

3. Feature Allocations and the Indian Buffet Process

Consider a dataset with N data points indexed by integers [N] := {1, 2, . . . , N} (allowing N ). Each datapoint n is associated with a binary vector Zn of length K that deﬁnes its feature allocation; Znk = 1 if datapoint n has feature k and Znk = 0 otherwise. The potential total number of features K may be inﬁnite. The binary matrix Z[N] = [ZT 1 , ZT 2 , . . . , ZT N]T speciﬁes a random feature allocation of [N], while ZN denotes the space of all feature allocations of [N], i.e. Z[N] ZN. We deﬁne mk as the number of datapoints that possess feature k, K+ = P2N 1 h=1 Kh as the number of features for which mk > 0 and Kh as the multiplicity of feature h, that is the number of times the same binary column h appears in Z[N]. Under the IBP (Grifﬁths & Ghahramani, 2011), the probability of a matrix Z[N] is

g([Z[N]]; α) = αK+ QH h=1 Kh! exp( αHN)

(N mk)!(mk 1)

where α > 0 is the concentration parameter, HN = PN j=1 1 j is the Nth harmonic number and H 2N 1 is the number of distinct nonzero features in the allocation.

Thibaux & Jordan (2007) showed one can construct the Indian buffet process from a Beta-Bernoulli process using the

following two stage sampling process for n = 1, . . . , N:

B|c, µ0 BP(c, µ0) Zn|B Be P(B) (2)

where B = P k=1 ωkδθk and Z = P k=1 fkδθk. First a draw B is sampled from the Beta process BP(cµ0) (Hjort, 1990) with µ0 as the base distribution. B is a set of pairs (ωk, θk) sampled from a Poisson process on the product space [0, 1] Θ with L evy intensity ν(dω, dθ) = cω 1(1 ω)c 1dωµ0(dθ). Then, B is used as the atomic hazard measure for a Bernoulli process Be P(B). Each Zn is a draw from the Bernoulli process and constitutes a collection of atoms of unit mass on Θ. Then, Zn is a binary vector containing the {fk} k=1 values resulting from tossing a countably inﬁnite sequence of (conditionally independent) coins with success probabilities ωk, i.e. fk|ωk Bernoulli(ωk). This construction allows the use of de Finetti s theorem (de Finetti, 1931) that lets the joint distribution of the rows to be written as

P(Z1, . . . , ZN) = Z h N Y

n=1 P(Zn|B) i d P(B) (3)

where B is the random measure that renders the variables Zn conditionally independent. Equation (3) shows the exchangeability of the rows of Zn, since they can be described as a mixture of Bernoulli processes.

4. Birth-Death Process for Feature Allocation

We consider a continuous-time Markov process (Z(t))t 0 in which each Z(t) is a random feature allocation taking values in the discrete space ZN. The state space is countably inﬁnite; it is determined by all the possible feature allocations deﬁned by N datapoints and K features, where K . The Markov process (Z(t)) evolves over time jumping to different states (feature allocations). Let {t1, . . . , t J R : J N} denote the times when the chain jumps such that tj = inf{τ tj 1 : Z(τ) = Z(tj 1)} and Z(tj) ZN. These jumps are a result of a birth or a death of a feature. The process (Z(t)) can only jump to neighbouring states, i.e. if the chain is currently at state Z(tj) = s, then at time tj+1 it transitions to Z(tj+1) = s

where a new feature is created or an existing feature is deleted after a birth or a death event respectively. Let Zs N ZN be the discrete space of neighboring states to state s. The process is time homogeneous with transition probabilities P(Z(t + y) = s |Z(y) = s) = P(Z(t) = s |Z(0) = s) = pss (t) for all t, y, where s, s ZN. At time tj+1 the process jumps to the next state Z(tj+1) = s

with rate determined by the current state Z(t) = s and the corresponding event, i.e birth or death. More speciﬁcally,

Birth: Suppose s ZN is a feature allocation with Ks nonzero features and s Zs N is another feature

A Birth-Death Process for Feature Allocation

allocation that differs from s in having one additional feature of size |a| so that K s = Ks + 1. We choose the transition rate from s to s as

qss = R(|a| 1)!(N |a|)!

where R > 0 is a parameter governing the birth rate. The new feature a is a binary column of length N. There are N |a| binary formulations for this fea-

ture and 2N 1 = PN n=1 N n for all possible feature births and thus, the total birth rate from s is PN n=1

R (n 1)!(N n)!

N! = R PN n=1 1 n = R HN

where HN = PN n=1 1/n is the N-th harmonic number and n = |a| .

Death: The rate of transitioning from s to s is

where D = R

a is a parameter governing the death rate and r is the multiplicity of the feature in s that dies. The multiplicity r is the combinatorial factor that accounts for all the possible ways of obtaining the same equivalence class as deﬁned in Grifﬁths & Ghahramani (2011) . There are Ks features (including repetitions of the same feature) in s that might die , thus the total death rate from s is RKs

The total rate of transition out of state s ZN is the sum of the total birth and death rates, qs = RHN + RKs

α . We call (Z(t))t>=0 a birth-death feature

allocation process with birth rate R and death rate R

α and write BDFP(α, R).

Theorem 1. The Markov process (Z(t))t 0 is irreducible and has stationary distribution IBP(α). Furthermore, it is reversible.

Proof. A continuous time Markov chain is irreducible if it is possible to eventually get from every state to every other state with positive probability. It is reversible if detailed balance holds, i.e. there is a probability distribution π on ZN such that πsqss = πs qs s for all s, s ZN. Then π is also the invariant (equilibrium) distribution of the Markov chain. The chain in BDFP is irreducible, because for any T > 0 and any two distinct feature allocations γ, ρ ZN, there is a positive probability that if it starts at γ ZN, it will end at ρ ZN. Reversibility and the equilibrium distribution can be demonstrated by detailed balance. Suppose γ, ρ are feature allocations such that γ, ρ ZN and ρ differs from γ in that it has one additional feature a of size |a|. The number of (nonzero) features in ρ is Kρ = Kγ +1.

g(γ; α)qγρ = αKγ

ΠHγ h=1Kh! exp ( αHN)

(N mk)!(mk 1)!

N! R(|a| 1)!(N |a|)!

m Kγ +1=|α| = αKγ+1

αΠHγ h=1Kh! exp ( αHN)

(N mk)!(mk 1)!

rαΠHγ h=1Kh! exp ( αHN)

(N mk)!(mk 1)!

rα=Kα = αKρ

ΠHρ h=1Kh! exp ( αHN)

(N mk)!(mk 1)!

= g(ρ; α)qργ (6)

where g(γ; α) is the probability of a feature allocation γ under the IBP as deﬁned in Equation (1), qγρ is the transition rate from state γ to state ρ, Hγ, Hρ are the number of distinct features in states γ and ρ respectively and ra is the multiplicity (the times the feature is present at the current feature allocation) of feature a that dies. Detailed balance holds, and as such the process is reversible and the equilibrium distribution is IBP[N](α).

Assume that (z(t)) is a realization of the BDFP (Z(t)) over the ﬁnite interval [0, T], T > 0 and we write (z(t))0 t T . With probability one the sample path (z(t))0 t T will only contain a ﬁnite number of jump events, each of which is either a birth or a death event. We write B and Q to denote the set of the features created or turned off by birth or death events respectively.

Proposition 1. Writing q(t) = qz(t) to denote the total transition rate out of state z(t), the probability of a realization (z(t)) under the law of the BDFP is:

R|B|+|Q| αA |B| |Q|

QA |B | h=1 Kh! exp ( αHN) exp Z T

0 q(t)dt . . .

b B {z(t=0)}

(|b| 1)!(N |b|)!

where A = K0 + |B| = KT + |Q|, A = H0 + |B | = HT + |Q |. B , Q are the sets of features with zero multiplicity at their creation time or with multiplicity of one at their death time respectively, and {z(t)} denotes the set of features at time t.

4.1. Dependent Beta Process Construction

The BDFP process can be constructed using a nonhomogenous Poisson process Π. Consider the L evy measure ν(dωdxdtbdtω) on a product space [0, 1] X R [0, ). A sample corresponds to set of points Π = {ωk, xk, tk b, tk ω}k where the range of k is countably inﬁnite. Each atom corresponds to a feature and is associated with a weight ωk [0, 1], a location xk, a birth time tk b R and a life-span tk ω [0, ) (Figure 1). The L evy measure is of the form ν(dωdxdtbdtω) = ρ(dω)µ(dxdtbdtω) and

A Birth-Death Process for Feature Allocation

Figure 1. Cartoon for the dependent Beta process construction of the BDFP: a realisation of a Poisson process Π over the product space [0, 1] X R [0, ) is drawn. The tω dimension over the space [0, ) is omitted in the axis representation. However, each tω corresponding to each point (feature) is drawn as a blue line of length tω starting at the associated birth time point.

corresponds to a Beta process on the combined space Θ = X R [0, ) with ρ(dω) = αω 1(1 ω)α 1 and base measure µ(dθ) = µ(dxdtbdtw). Setting g(dtb) = dtb and β(dtω) = D exp Dtω dtω, the base measure is µ(dθ) = µ0(dx)g(dtb)β(dtω) = µ0(dx)dtb D exp Dtω dtω, where D is the death rate. The constant measure g(dtb) over the real line R is inﬁnite but σ-ﬁnite, that is the total measure g(R) = , but there is a measurable partition (Ek) of R with each g(Ek) < . Since ν(dωdθ) integrates to inﬁnity but satisﬁes R

Θ(1 |ω|)ν(dωdθ) < , a countably inﬁnite number of i.i.d. random points {(ωk, θk)} k=1 are obtained from the Poisson process and P k=1 ωk is ﬁnite with probability one. A Beta process is a completely random measure (Kingman, 1967) and, as such, a sample can be expressed as B = P k=1 ωkδθk|α, µ BP(αµ), where the atoms θk = {xk, tk b, tk ω} Θ and weights ωk [0, 1].

Having drawn a sample B we can construct the feature allocations over an index space R as follows:

k=1 ωkδθk |α, µ BP(αµ)

k=1 bnkδθ |B Be P(ωk)

Znk(t) = Snk I(tk b < t < tk b + tk ω) (8)

with bnk|ωk Bernoulli(ωk) and n = 1, . . . , N. The binary matrix S of dimension N K, is a feature potential matrix. Each binary element Snk indicates whether object n possesses feature fk. S is a global variable and doesn t depend on time t. At any time t, the feature allocation matrix Z(t) is a deterministic function of the current features present at t, that is {fk : tk b < t < tk b + tk w, k = 1, . . . , }

and the feature potential matrix S, i.e. Znk(t) = 1 iff Snk = 1 and tk b < t < tk b + tk ω.

The resulting feature allocation process (zn(t))T is equivalent to the following: every time a new feature fk is created, each object n joins with probability ωk, i.e. znk(tk b)|ωk Bernoulli(ωk). If znk(tk b) = 1, object n will possess feature fk until tk b + tk ω. Repeat this process for all objects.

Proposition 2. The BDFP is exchangeable and the Beta process BP(αµ) on X R [0, ) describes its underlying mixing measure.

Proof. Consider a sequence of variables (zn(t))T with n = 1, 2, . . . , N such that each (zn(t))T is the feature allocation evolution of object n over the index space T. These variables are not independent since each (zn(t))T depends on the Z|[n 1](t) = (z1:(n 1)(t))T. However, given a sample from the B BP(αµ) described in Section 4.1, each variable (zn(t))T becomes conditionally independent and the following holds

P((z1(t))T, (z2(t))T, . . . , (z N(t))T) = Z N Y

n=1 P((zn(t))T|B)φ(d B)

where φ = BP(αµ).

Equation (9) is the de Finetti representation of the BDFP and as such the BDFP is exchangeable and the BP on Θ = X R [0, ) is its underlying mixing measure. Restricting our focus on each index t, the overall Beta process BP(αµ) on X R [0, ) results in a set of dependent random measures over X, one Bt for each t T, such that each Bt is marginally a Beta process. Consider a ﬁxed time point t T and the space [0, 1] X (the red vertical plane in Figure 1). The point process on this plane (where blue lines intersect the plane) corresponds to features alive at time t, i.e. t [tb, tb + tω]. The L evy measure on this plane, is calculated by projecting the overall L evy measure onto the plane,

νt(dωdx) = Z

t tω ν(dωdxdtbdtω)

= αω 1(1 ω)α 1 µ0(dx)

where νt is a measure over [0, 1] X for a speciﬁc t T. More speciﬁcally, it is the L evy measure of a Beta process on X with ρ(dω) = αω 1(1 ω)α 1 and base measure µt(dx) = µ0(dx)

D . Thus we have that marginally Bt|α, µt BP(αµt), t T. (11)

The restricted and projected measure at any index t T deﬁnes a Beta process. Two draws, Bt and Bs, with t, s T, will be dependent with the amount of dependence decreasing as |s t| increases.

Proposition 3. The dependent Beta process construction presented has IBP marginals at any t.

A Birth-Death Process for Feature Allocation

Proof. At any t T, Bt|α, µt BP(αµt). It is straightforward to see that, marginally, the feature allocation matrix Zt obtained using the generative process in Equation (8) is equivalent to Zt|Bt Be P(Bt) and therefore Zt IBP(α), t T.

Corollary 1. At any t T, the feature allocation matrix Zt can be generated by the following generative model as K :

ωk|α Beta R

, Znk|ωk Bernoulli(ωk) (12)

for k = 1, . . . , K and n = 1, . . . , N

The proof of the corollary in included in the supplementary material. Note that the above is true only marginally, i.e. at time t T and it doesn t generste dependence structure between Zt s.

We underline the dependence of Zs and Zt when |s t| 0, s, t T. The closer s, t are, the more the atoms (features) Bs and Bt share. If we independently sampled Zs|Bs Be P(Bs) and Zt|Bt Be P(Bt) then Zs, Zt would be dependent, but not equal, even as |s t| 0. However, in the BDFP the presence of the same features results in the same (not just similar) allocation as |s t| 0. In both cases, the marginal distribution of the feature allocation matrix at any t T is Zt|Bt Be P(Bt) and Zt|α IBP(α). The BDFP results in a continuous evolu-

tion of the Z(t) over T: formally Zt d Zs as t s.

This construction of the BDFP resembles the spatial normalised Gamma process (SNΓP) by (Rao & Teh, 2009). The main difference lies in the marginal distribution; the SNΓP admits DP marginals as opposed to the Beta process marginals of the dependent Beta process as shown in Equation (11).

Proposition 4. The feature allocation process described by Equation (8) with B BP(αµ), has the same birth and death rates as the BDF process.

5. Finite Model

For the BDFP, the inference simpliﬁes considerably if we consider a ﬁnite approximation which gives the countably inﬁnite model in the limit. Consider the space S = [0, 1] X [0, T] [0, ), where we restrict the space of tb to be [0, T] instead of the whole real line R. This accounts for typical applications of the model where we observe data at distinct times over a ﬁnite time range. Consider the L evy measure ν(dωdxdtbdtω) on the space S. Then, under the dependent Beta process representation (see section 4.1), the expected number of atoms present in S is R

S ν(dωdxdtbdtω) = R 1 0 ρ(dω) R

X µ0(dx) R T 0 g(dtb) R 0 β(dtω) = KT, where

tfk b tfk d

FEATURES PRESENT

Figure 2. Cartoon for the Beta event construction of the BDFP: A Poisson(KT) number of features are uniformly distributed across the time range [0, T] (blue lines). Each feature is assigned a weight sampled from Beta( R

K , 1). The leftmost point of each line corresponds to the time of birth of that feature, while the length of each line indicates the life span of each feature sampled from Exponential(D). To sample feature allocations from the process, we consider random time points across time, e.g. t, t

and draw imaginary red lines. The feature allocation matrix at t involves the features that are crossing the red line at t. The membership of the objects n = 1, . . . , N to those features is deﬁned by the values of the corresponding elemnts in the potential matrix S.

K since R 1 0 ρ(dω) = . By considering ﬁnite K we allow inference on a ﬁnite model which approximates the inﬁnite case with increasing ﬁdelity as K .

The process is depicted in Figure 2 and the inﬁnite case can be derived as the limit K of the following:

Consider a time range [0, T] and a set of features F, such that |F| Poisson(KT). Assign to each feature fk F, k = 1, . . . |F| a weight ω, such that ωk Beta R

K , 1 and Ω= [ω1, ω2 . . . ω|F|]. Associate each feature fk F, k = 1, . . . |F| with a birth time tk b uniformly sampled in [0, T]; tk b U(0, T) and tb = [t1 b . . . t|F| b ]. For each fk F, sample its life span tk w Exponential(D), where D is the death rate. Deﬁne the time of death tk d as tk d = tk b + tk w and tw = [t1 w . . . t|F| w ].

We call the sequence of the above steps Beta Event Process (BEP). Putting everything together, generate a sample B = {F, Ω, tb, tw} BEP(α, R, K, T) as follows:

|F| Poisson(KT)

K , 1 , tk b U(0, T), tk ω Exponential(D)

for k = 1, . . . , |F|. Having drawn a sample B from the BEP, we can construct the feature allocations over time as follows

Snk|ωk Bernoulli(ωk)

Znk(t) = Snk I(tk b < t < tk b + tk ω) (14)

A Birth-Death Process for Feature Allocation

where n = 1, . . . , N. The feature potential matrix (as deﬁned in section 4.1) has now N |F| dimensions. Moreover, each Z(t) for t T is a matrix of dimensions N F (t) and F (t) |F|. Figure 3(a) show the graphical model for the BEP.

Figure 3. Graphical representation of the BEP for a time point t and for (a) a linear-Gaussian likelihood and (b) a sigmoid likelihood. The time series Z and Y are represented as single nodes indexed by the time location t. The birth and life span times of the total KT features are depicted using vector notation tb and tw. The black (Zt) and grey (Yt) nodes indicate deterministic and observed parameters respectively.

Proposition 5. In the ﬁnite model, the expected number of features present at any t T is E[Nf] = K D and for D = R

α we have E[Nf] = Kα

Hyperpriors. We put gamma priors on α and R.

Likelihood models. We consider two different likelihood models: linear-Gaussian for real data and logistic for binary network data.

For the linear-Gaussian likelihood model, consider a sequence of observations {Yt Y : t = 1, . . . , L} generated as

Yt = Zt A + ϵt (15)

where Yt is a N M observation matrix at each time t = 1, . . . , L, A is a factor loading matrix of dimension |F| M shared across time and ϵt N 0, σ2 ϵ is Gaussian white noise. We choose a Gaussian prior over A, i.e Afm N(0, 1).

In the case of dynamic binary network data we extend the latent feature relational model (LFRM) proposed by (Miller et al., 2009). Let Yt be the N N binary matrix that contains links, i.e. ytij = Yt(i, j) = 1 iff we observe a link from entity i to entity j at time t. We assume that the matrices Yt are symmetric and ignore diagonal elements (self-links). The probability of a link from one entity to another is determined by the combined effect of all pairwise feature interactions. Let Wt be a |F| |F| real-valued weight matrix where Wt(k, k ) is the weight that affects the probability of there being a link from entity i to entity j if entity i has feature k on, i.e. Ztik = Zt(i, k) = 1 and entity j has feature k on, i.e. Ztjk = Zt(j, k ) = 1. The links are independent conditioned on Zt and Wt, and only the features that are on for the entities i and j at time t inﬂuence the probability of a link between those entities at that time (see Figure 3(b)). Formally,

P(ytij = 1|Zt, Wt) = σ X

kl Ztik Ztjl Wtkl + s (16)

for k, l = 1, . . . , |F|, where s is a bias term and σ(x) = (1 + e x) 1 is the sigmoid function. For completeness, we assume the priors wt(k, l) N µw, σ2 w and s N µs, σ2 s .

5.1. Inference

As with many other Bayesian models, exact inference is intractable so we employ Markov Chain Monte Carlo (MCMC) for posterior inference over the latent variables of the ﬁnite model. A detailed description is provided in the supplementary material.

6. Experiments

We experimentally evaluate the BEP model on real-world genomics and social network data. To evaluate the model ﬁt, we compared the BEP model to independent models at each time point.

6.1. Circadian Rhythm Dataset

Here we used a subset of the gene expression data from Piechota et al. (2010), including N = 500 genes in D = 4 different conditions (exposure to different drugs) over L = 24 time intervals. The measurements indicate how active a gene is at different times. We created 7 train-test splits holding out 20% of the data, and ran 700 MCMC iterations. We see that in terms of predictive performance the BEP outperforms independent IBP models (Table 1). The genes belonging to each factor show enrichment for different known biological pathways (Figure 4). Of particular note are the tryptophan metabolism genes enriched in factor 2, given tryptophan s suspected effects on drowsiness;

A Birth-Death Process for Feature Allocation

the vasopressin regulated water reabsorption, given this hormone s known circadian regulation (Earnest & Sladek, 1986; Yamaguchi et al., 2013); and the regulation of insulin producing beta cells, another hormone with circadian variation (Shi et al., 2013).

Table 1. Circadian dataset results using 20% held out data, a truncation level of K = 10, |F| = 24, 700 iterations and a burnin of 500. Results are the average over 7 MCMC chains.

BEP INDEPENDENT IBP

TRAIN ERROR 0.0917 0.0368 0.0983 0.0012 TEST ERROR 0.0948 0.0343 1.3380 0.5155 TRAIN LOG LIKELIHOOD 6.508 0.7715 6.6871 0.0217 TEST LOG LIKELIHOOD 1.5661 0.1583 8.6861 4.0670

feature index 5 10 15 20

Kegg Tryptophan Metabolism

Kegg Ribosome Kegg Ppar Signaling Pathway Kegg Vascular Smooth Muscle Contraction Kegg Vasopressin Regulated Water Reabsorption Reactome Formation Of A Pool Of Free 40s Subunits Reactome Formation Of The Ternary Complex And Subsequently The 43s Complex

Reactome Influenza Viral Rna Transcription And Replication

Reactome Muscle Contraction Reactome Nuclear Receptor Transcription Pathway

Reactome Regulation Of Beta Cell Development Reactome Regulation Of Gene Expression In Beta Cells Regulation Of Lipid Metabolism By Peroxisome Proliferator Activated Receptor Alpha

Reactome Translation Initiation Complex Formation

Reactome Viral Mrna Translation Reactome Smooth Muscle Contraction

Figure 4. Circadian dataset: many of the features uncovered show enrichment in known biological pathways from Reactome and KEGG. Values here are log10 p from a hypergeometric test for enrichment of the genes in each factor against the 500 background genes.

6.2. Ch IP-seq Epigenetic Marks

For this experiment we used Ch IP-seq (chromatin immunoprecipitation sequencing) data downloaded from the ENCODE project (Consortium, 2007), representing histone modiﬁcations and transcription factor binding in human neural crest cell lines (see (Park, 2009) for a nice review).

The observations involve counts associated with N = 14 (human) cell lines and D = 10 proteins. The counts indicate what proteins, with what chemical modiﬁcations, are bound to DNA along the genome. The measurements are stored in N D matrix of counts Yt: for each cell line, how many reads for each of the 10 proteins mapped to bin t (100 base pair (bp) region of the genome). t = 1, . . . , 500 bins were considered at the start of chromosome 1 (50K bp in total). In Figure 5(a) each subﬁgure corresponds to one of the 10 proteins and in each subﬁgure the counts for the N = 14 cell lines are plotted over the genome section of length 50Kbp. Before inference, the raw counts were square-root transformed (a standard variance stabilizing transform for Poisson data) to make the Gaussian likelihood appropriate. We ran 7 different held-out tests, holding out a different 20% of the data each time. Results, using 700 MCMC iterations, are presented in Table 2. The

BEP outperforms the independent IBP model in both test likelihood and error with a statistically signiﬁcant difference. The independent IBP appears to have better results in train error and likelihood, again suggesting overﬁtting. Comparing the plots of the true measurements to the learnt ones by the BEP and independent IBP model in Figure 5 we see that both models successfully reproduce the data but the BEP reconstructions provide a cleaned up picture of the meaningful signal.

The features found by the model in the different genome locations correspond to different states associated with the speciﬁc genome location. Genes and regulatory DNA elements such as enhancers, silencers and insulators are embedded in genomes. These genomic elements on the DNA have footprints for the transacting proteins involved in transcription, either for the positioning or regulation of the transcriptional machinery. For instance, promoters are regions of DNA which recruit proteins required to initiate transcription of a particular gene and located near the transcription start sites. Enhancers are regions of DNA that can be bound by proteins which activate transcription of a distal gene. So a cell line, at speciﬁc genome location (recall that here each location corresponds to 100 base pairs), will have underlying feature membership (some promoters and some enhancer for example) that determines whether particular protein are found there using Ch IP-seq.

Genomic annotations, from Chrom HMM (Ernst et al., 2011), are shown in Figure 8 in the supplementary document for the region we model. Different levels of the marks in these different regions are much easier to see in the reconstructed signal using BEP in Figure 5(b).

Table 2. Quantitative results for the Ch IP-seq dataset . 20% held out data, a truncation level of K = 3, |F| = 21, 700 iterations and a burnin of 500. Results are the average over 7 held out sets.

BEP INDEPENDENT IBP

TRAIN ERROR 0.4459 0.0229 0.032 0.0089 TEST ERROR 0.4574 0.018 0.7746 0.013 TRAIN LOG LIKELIHOOD 12.4979 0.1439 0.5916 0.0979 TEST LOG LIKELIHOOD 3.1666 0.0318 175.7968 4.49 7. van de Bunt s Dataset

In van de Bunt et al. (1999), 32 university freshman students in a given discipline at a Dutch university were surveyed at seven time points about who in their class they considered as friends. Initially, i.e. t1, most of the students were unknown to each other. The ﬁrst four time points are three weeks apart, whereas the last three time points are six weeks apart as showin in Figure 11 in the supplementary matrial. We symmetrise the matrix by assuming friendship if either individual reported it. We test the performance of BEP using the sigmoid likelihood model as in Equation

A Birth-Death Process for Feature Allocation

0 100 200 300 400 500 0

H3K27ac human

0 100 200 300 400 500 0

H3K27me3 human

0 100 200 300 400 500 0

H3K36me3 human

0 100 200 300 400 500 0

H3K4me1 human

0 100 200 300 400 500 0

H3K4me2 human

0 100 200 300 400 500 0

H3K4me3 human

0 100 200 300 400 500 0

H3K79me2 human

0 100 200 300 400 500 0

H3K9ac human

0 100 200 300 400 500 0

H3K9me3 human

0 100 200 300 400 500 0

H4K20me1 human

0 100 200 300 400 500 10

H3K27ac human

0 100 200 300 400 500 0

H3K27me3 human

0 100 200 300 400 500 0

H3K36me3 human

0 100 200 300 400 500 0

H3K4me1 human

0 100 200 300 400 500 0

H3K4me2 human

0 100 200 300 400 500 20

H3K4me3 human

0 100 200 300 400 500 0

H3K79me2 human

0 100 200 300 400 500 0

H3K9ac human

0 100 200 300 400 500 2

H3K9me3 human

0 100 200 300 400 500 0

H4K20me1 human

0 100 200 300 400 500 20

H3K27ac human

0 100 200 300 400 500 10

H3K27me3 human

0 100 200 300 400 500 10

H3K36me3 human

0 100 200 300 400 500 20

H3K4me1 human

0 100 200 300 400 500 50

H3K4me2 human

0 100 200 300 400 500 50

H3K4me3 human

0 100 200 300 400 500 20

H3K79me2 human

0 100 200 300 400 500 50

H3K9ac human

0 100 200 300 400 500 10

H3K9me3 human

0 100 200 300 400 500 10

H4K20me1 human

Figure 5. Ch IP-seq data: The observed (a) and reconstructed observations (b), (c). The BEP reconstructions smooth out the noise making the meaning signal much easier to visualize. In both models, the noise signal was removed from the reconstructions.

(16) by holding out 10% of all links across all time points. We ran each model for 1000 MCMC iterations. The results are shown in Table 3. The independent network LFR models outperform BEP in the train setting and the test error while BEP outperforms in the test likelihood. However, here the results are comparable. Looking at Figure 6, both models provide the same picture of the allocation. It is possible the stationary assumption hurts the BEP: in the VDB dataset the number of links almost exclusively increases over time.

Table 3. van de Bunt s dataset results using 10% held out data, a truncation level of K = 4, |F| = 20, 1000 iterations and a burnin of 200. Results are the average over 7 MCMC chains.

BEP INDEPENDENT LFRM

TRAIN ERROR 1.7009 0.0850 1.3413 0.1147 TEST ERROR 1.9107 0.1321 1.7891 0.1131 TRAIN LOG LIKELIHOOD 1044.4943 41.6363 839.4544 56.9877 TEST LOG LIKELIHOOD 345.7038 49.9882 438.5848 74.6396

2 4 6 8 10 12 14 16 18 20

2 4 6 8 10 12 14 16 18 20

2 4 6 8 10 12 14 16 18 20

2 4 6 8 10 12 14 16 18 20

2 4 6 8 10 12 14 16 18 20

2 4 6 8 10 12 14 16 18 20

2 4 6 8 10 12 14 16 18 20

0.5 1 1.5 2 2.5

1 2 3 4 5 6 7 8 9 10

0.5 1 1.5 2 2.5 3 3.5

1 2 3 4 5 6 7 8

1 2 3 4 5 6

0.5 1 1.5 2 2.5 3 3.5

0.5 1 1.5 2 2.5 3 3.5 4 4.5

Figure 6. Inferred feature allocation matrices for the seven time points (from left to right) in the van De Bunt friendship dataset. First two rows: Feature allocation matrices inferred by BEP. Last two rows: Feature allocation matrices inferred by independent LFRM.

8. Discussion

Many modern machine learning and statistics tasks involve multidimensional data positioned along some linear covariate: we have shown functional genomics data where the covariate is position in the genome, and network data where links change over time. To model such data we need priors that utilize the dependencies through time, while handling high dimensionality. The BDFP is an expressive new Bayesian non-parametric prior that fulﬁlls these criteria. It outputs time-evolving feature allocations, which can then be effectively used to model high-dimensional time-series data. Since the number of latent features is unbounded, like other Bayesian non-parametric methods, the model can adapt its complexity to the data. While the combinatorial BDFP may seem like a complex object to handle computationally, our theoretical results showing that the de Finetti measure underlying the BDFP is a speciﬁc beta process, which can be well approximated by a ﬁnite K model, the BEP. Our experimental results, compared to independent feature allocations, provides evidence that effectively modeling dependency in the feature allocation through the birth-death mechanism is appropriate for a wide range of statistical applications. Moreover, the BEP provides an interpretable structure using parameters not found, to the best of our knowledge, in existing models, i.e. birth and death rate of features. We are interested in scaling inference under the BEP to larger datasets, for example using (stochastic) variational inference methods that have been successful for the IBP (Doshi et al., 2009).

Acknowledgements Konstantina s research leading to these results has received funding from the European Research Council under the European Union s Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617411.

A Birth-Death Process for Feature Allocation

Aldous, D J. Exchangeability and related topics. In Ecole d Ete de Probabilities de Saint-Flour, volume XIII, pp. 1 198. Springer, 1983.

Consortium, The ENCODE Project. Identiﬁcation and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447(7146):799 816, 06 2007.

de Finetti, B. Funzione Caratteristica Di un Fenomeno Aleatorio, pp. 251 299. 6. Memorie. Academia Nazionale del Linceo, 1931.

Doshi, F., Miller, K. T., Van Gael, J., and Teh, Y. W. Variational inference for the Indian buffet process. In Proceedings of the International Conference on Artiﬁcial Intelligence and Statistics, volume 12, 2009.

Earnest, David J and Sladek, Celia D. Circadian rhythms of vasopressin release from individual rat suprachiasmatic explants in vitro. Brain research, 382(1):129 133, 1986.

Ernst, Jason, Kheradpour, Pouya, Mikkelsen, Tarjei S, Shoresh, Noam, Ward, Lucas D, Epstein, Charles B, Zhang, Xiaolan, Wang, Li, Issner, Robbyn, Coyne, Michael, et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature, 473(7345):43 49, 2011.

Ghahramani, Zoubin and Jordan, Michael. Factorial hidden markov models. Machine Learning, 29(2-3):245 273, 1997.

Grifﬁths, Thomas L. and Ghahramani, Zoubin. Inﬁnite latent feature models and the indian buffet process. In In NIPS, pp. 475 482. MIT Press, 2005.

Grifﬁths, Thomas L. and Ghahramani, Zoubin. The indian buffet process: An introduction and review. Journal of Machine Learning Research, 12:1185 1224, July 2011.

Hjort, N. L. Nonparametric Bayes estimators based on Beta processes in models for life history data. Annals of Statistics, 18: 1259 1294, 1990.

Kingman, J.F.C. The coalescent. Stochastic Processes and their Applications, 13(3):235 248, 1982.

Kingman, John F. C. Completely Random Measures. Paciﬁc Journal of Mathematics, 21(1):59 78, 1967.

Miller, Kurt, Grifﬁths, Thomas, and Jordan, Michael. Nonparametric latent feature models for link prediction. In Advances in Neural Information Processing Systems 22, 2009.

Neal, Radford M. Density Modeling and Clustering Using Dirichlet Diffusion Trees. In Bayesian Statistics 7, pp. 619 629, 2003.

Palla, Konstantina, Knowles, David A., and Ghahramani, Zoubin. A dependent partition-valued process for multitask clustering and time evolving network modelling, 2013.

Park, Peter J. Chip seq: advantages and challenges of a maturing technology. Nature Reviews Genetics, 10(10):669 680, 2009.

Piechota, Marcin, Korostynski, Michal, Solecki, Wojciech, Gieryk, Agnieszka, Slezak, Michal, Bilecki, Wiktor, Ziolkowska, Barbara, Kostrzewa, Elzbieta, Cymerman, Iwona, Swiech, Lukasz, et al. The dissection of transcriptional modules regulated by various drugs of abuse in the mouse striatum. Genome Biology, 11(5):R48, 2010.

Rao, Vinayak and Teh, Yee Whye. Spatial normalized gamma processes. 2009.

Rasmussen, Carl Edward and Williams, Christopher K I. Gaussian processes for machine learning. MIT Press, 2006.

Shi, Shu-qun, Ansari, Tasneem S, Mc Guinness, Owen P, Wasserman, David H, and Johnson, Carl Hirschie. Circadian disruption leads to insulin resistance and obesity. Current Biology, 23(5):372 381, 2013.

Teh, Y. W., Elliott, L. T., and Blundell, C. Bayesian nonparametric modelling of genetic variations using fragmentationcoagulation processes. Submitted, 2013.

Thibaux, Romain and Jordan, Michael I. Hierarchical beta processes and the indian buffet process. In Proceedings of the Eleventh International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), volume 2, pp. 564 571, 2007.

van de Bunt, Gerhard G, Van Duijn, Marijtje AJ, and Snijders, Tom AB. Friendship networks through time: An actor-oriented dynamic statistical network model. Computational & Mathematical Organization Theory, 5:167 192, 1999.

Van Gael, J., Teh, Y. W., and Ghahramani, Z. The inﬁnite factorial hidden Markov model. In Advances in Neural Information Processing Systems, volume 21, 2009.

Williamson, Sinead, Orbanz, Peter, and Ghahramani, Zoubin. Dependent Indian buffet processes. In Proceedings of the Thirteenth International Workshop on Artiﬁcial Intelligence and Statistics, AISTATS, 2010.

Yamaguchi, Yoshiaki, Suzuki, Toru, Mizoro, Yasutaka, Kori, Hiroshi, Okada, Kazuki, Chen, Yulin, Fustin, Jean-Michel, Yamazaki, Fumiyoshi, Mizuguchi, Naoki, Zhang, Jing, et al. Mice genetically deﬁcient in vasopressin v1a and v1b receptors are resistant to jet lag. Science, 342(6154):85 90, 2013.