# Topological Mixture Estimation

Steve Huntsman (BAE Systems FAST Labs, Arlington, VA, USA). Correspondence to: Steve Huntsman.

*Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).*

**Abstract.** We introduce topological mixture estimation, a completely nonparametric and computationally efficient solution to the problem of estimating a one-dimensional mixture with generic unimodal components. We repeatedly perturb the unimodal decomposition of Baryshnikov and Ghrist to produce a topologically and information-theoretically optimal unimodal mixture. We also detail a smoothing process that optimally exploits topological persistence of the unimodal category in a natural way when working directly with sample data. Finally, we illustrate these techniques through examples.

## 1. Introduction

### 1.1. Background

Density functions that represent sample data are often multimodal, i.e., they exhibit more than one maximum. Typically this behavior indicates that the underlying data deserves a more detailed representation as a mixture of densities with individually simpler structure. The usual specification of a component density is quite restrictive, with log-concave the most general case considered in the literature, and Gaussian the overwhelmingly typical case. It is also necessary to determine the number of mixture components a priori, and much art is devoted to this.

In this paper we detail how to efficiently determine a topologically and information-theoretically optimal mixture of generic unimodal component densities directly from a one-dimensional input density and without any auxiliary information whatsoever. The topological criterion is a natural qualitative alternative to more traditional quantitative model selection criteria (e.g., information criteria); it is computed at the outset and subsequently preserved, while the information-theoretical criterion optimally separates component densities. We further show how to optimally smooth the mixture when the input density itself is being estimated. Topological persistence (which operationally amounts to the assignment of significance to topological features that persist as a function of scale) is the essential ingredient in both the model selection and the smoothing.

### 1.2. Formal Motivation

To give some formal motivation, let $\mathcal{D}(\mathbb{R}^d)$ denote a suitable space of continuous probability densities (henceforth merely called densities) on $\mathbb{R}^d$. A *mixture* on $\mathbb{R}^d$ with $M$ components is a pair $(\pi, p) \in \Delta_M \times \mathcal{D}(\mathbb{R}^d)^M$, where $\Delta_M := \{\pi \in (0,1]^M : \sum_m \pi_m = 1\}$; we write $|(\pi,p)| := M$, and note that $\pi$ cannot have any components equal to zero. The corresponding *mixture density* is $\langle \pi, p \rangle := \sum_{m=1}^M \pi_m p_m$. The Jensen-Shannon divergence of $(\pi,p)$ is (Briët & Harremoës, 2009)

$$J(\pi, p) := H(\langle \pi, p \rangle) - \langle \pi, H(p) \rangle \tag{1}$$

where $H(p)_m := H(p_m)$ and $H(f) := -\int f \log f \, dx$ is the entropy of $f$.

Now $J(\pi,p)$ is the mutual information between the random variables $\Xi \sim \pi$ and $X \sim p_\Xi$ (i.e., $X$ is drawn from the component indexed by $\Xi$). Since mutual information is always nonnegative, the same is true of $J$. The concavity of $H$ gives the same result, i.e., $H(\langle \pi, p \rangle) \ge \langle \pi, H(p) \rangle$. If $M := |(\pi,p)| > 1$, $\hat\pi := (\pi_1, \dots, \pi_{M-2}, \pi_{M-1} + \pi_M)$, and $\hat p := \left(p_1, \dots, p_{M-2}, \frac{\pi_{M-1} p_{M-1} + \pi_M p_M}{\pi_{M-1} + \pi_M}\right)$, then it is easy to show that $J(\hat\pi, \hat p) \le J(\pi, p)$.
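As a concrete illustration of (1), the following minimal sketch (ours, not taken from the paper's MATLAB implementation) evaluates $J(\pi,p)$ for component densities discretized on a uniform grid; the function names are our own.

```python
import numpy as np

def entropy(f, dx):
    """Entropy H(f) = -int f log f dx of a density sampled on a uniform grid."""
    f = np.asarray(f, dtype=float)
    mask = f > 0
    return -np.sum(f[mask] * np.log(f[mask])) * dx

def jsd(pi, p, dx):
    """Jensen-Shannon divergence (1): J(pi, p) = H(<pi, p>) - <pi, H(p)>.

    pi: M mixture weights summing to 1; p: (M, K) array whose rows are the
    component densities on a uniform grid with spacing dx.
    """
    pi, p = np.asarray(pi, dtype=float), np.asarray(p, dtype=float)
    mix = pi @ p  # the mixture density <pi, p>
    return entropy(mix, dx) - pi @ np.array([entropy(pm, dx) for pm in p])

# Two well-separated components: J approaches H(pi) = log 2 ~ 0.693,
# consistent with J being the mutual information between Xi and X.
x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]
gauss = lambda mu: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)
print(jsd([0.5, 0.5], [gauss(-3), gauss(3)], dx))
```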
We say that a density $f \in \mathcal{D}(\mathbb{R}^d)$ is *unimodal* if $f^{-1}([y,\infty))$ is either empty or contractible (i.e., topologically equivalent to a point in the sense of homotopy) for all $y$. For $d = 1$, this simply means that any nonempty sets $f^{-1}([y,\infty))$ are intervals, which agrees with intuition. We call a mixture $(\pi,p)$ unimodal iff each of the component densities $p_m$ is unimodal. The *unimodal category* $\mathrm{ucat}(f)$ is the smallest number of components of any unimodal mixture $(\pi,p)$ that satisfies $\langle \pi, p \rangle = f$. Figure 1 shows that the unimodal category can be much less than the number of maxima. In the event that $\langle \pi, p \rangle = f$ and $|(\pi,p)| = \mathrm{ucat}(f)$, we write $(\pi,p) \models f$: the symbol $\models$ is read "models." The unimodal category is a topological invariant that generalizes and relates to other classical invariants (Baryshnikov & Ghrist, 2011; Ghrist, 2014).

*Figure 1. Upper panels: a unimodal decomposition obtained using the sweep algorithm from (Baryshnikov & Ghrist, 2011). Lower panels: the result of (3). The de/reblurring approach of §5.2 gives smoother decompositions on estimated densities: see Figures 6-9.*

The preceding constructions naturally lead us to consider the unimodal Jensen-Shannon divergence

$$J^*(f) := \sup_{(\pi,p) \models f} J(\pi, p) \tag{2}$$

as a simultaneous measure of both the topological and information-theoretical complexity of $f$, and

$$(\pi^*, p^*) := \arg\max_{(\pi,p) \models f} J(\pi, p) \tag{3}$$

as a topologically and information-theoretically optimal *topological mixture estimate* (TME). The natural questions are whether such an estimate exists (is the supremum attained?), whether it is unique, and, if so, how to perform TME in practice. In this paper we address these questions for the case $d = 1$, and we demonstrate the utility of TME in examples (see Figures 6-10).

After reviewing related work in §2, we cover the basic algorithm of TME in §3. The proof therein that Algorithm 1 computes (3) is nearly trivial with Lemmas 1 and 2 in hand: these are respectively in appendices A and B. Next, in §4 we review the related technique of topological density estimation (TDE) before showing in §5 how blurring and deblurring mixture estimates can usefully couple TDE and TME. Finally, in §6 we produce examples of TME in action before making some closing remarks in §7.
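Figure 1's upper panels come from the sweep algorithm of (Baryshnikov & Ghrist, 2011), which attains the unimodal category for $d = 1$. The following sketch is our reading of that algorithm for a density sampled on a grid (the function name and tolerance are ours): each pass peels off one unimodal component that rides the remainder up to its first local maximum and then descends as slowly as the remainder allows.

```python
import numpy as np

def sweep_decomposition(f, tol=1e-12):
    """Greedy left-to-right sweep: repeatedly peel a unimodal component off f.

    f: nonnegative samples of a piecewise-constant density on a grid.
    Returns unnormalized unimodal components summing pointwise to f; a
    mixture (pi, p) follows by normalizing each component by its mass.
    """
    r = np.array(f, dtype=float)
    components = []
    while r.max() > tol:
        c = np.zeros_like(r)
        k = 0
        while k + 1 < len(r) and r[k + 1] >= r[k]:  # ride the remainder up
            c[k] = r[k]
            k += 1
        c[k] = r[k]                                 # first local maximum
        for j in range(k + 1, len(r)):              # then descend as slowly
            c[j] = min(c[j - 1], r[j])              # as the remainder allows
        r -= c
        components.append(c)
    return components

print(len(sweep_decomposition([1, 3, 2, 3, 1])))  # 2 == ucat of this profile
```

Under this reading, `len(sweep_decomposition(f))` serves as a numerical $\mathrm{ucat}(f)$, which we reuse in later sketches.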
## 2. Related Work

While density estimation enables various clustering techniques (Li et al., 2007; Azzalini & Menardi, 2014; Xu & Tian, 2015), mixture estimation is altogether more powerful than clustering: e.g., it is possible to have mixture components that significantly and meaningfully overlap. For example, a cluster with a bimodal density will usually be considered as arising from two unimodal mixture components that are individually of interest. In this light, and in view of its totally nonparametric nature, our approach can be seen as particularly powerful, especially when coupled with TDE and deblurring/reblurring (see §4 and §5).

Still, even for clustering (even in one dimension, where an optimal solution to $k$-means can be computed efficiently (Wang & Song, 2011; Nielsen & Nock, 2014; Grønlund et al., 2017)), determining the number of clusters in data (Feng & Hamerly, 2007; Mirkin, 2011) is as much an art as a science. All of the techniques we are aware of either require some ad hoc determination to be made, require auxiliary information (e.g., (Tibshirani et al., 2001)), or are parametric in at least a limited sense (e.g., (Sugar & James, 2003)).

While a parametric approach allows likelihoods and thus various information criteria (Burnham & Anderson, 2003) or their ilk to be computed for automatically determining the number of clusters, this comes at the cost of a strong modeling assumption, and criteria values themselves are difficult to compare meaningfully (Melnykov & Maitra, 2010). These shortcomings, including determining the number of mixture components, carry over to the more difficult problem of mixture estimation (McLachlan & Peel, 2004; Melnykov & Maitra, 2010; McLachlan & Rathnayake, 2014). As an example, an ad hoc and empirically derived unimodal mixture estimation technique that requires one of a few common functional forms for the mixture components has recently been employed in (Mints & Hekker, 2017).

Univariate model-based mixtures of skew distributions admit EM-type algorithms and can outperform Gaussian mixture models (Lin et al., 2007; Basso et al., 2010). Though these generalize to the multivariate case quite effectively (see, e.g., (Lee & McLachlan, 2016)), the EM-type algorithms are generically vulnerable to becoming trapped in local minima without good initial parameter values, and they require some model selection criterion to determine the number of mixture components, though the parameter learning and model selection steps can be integrated as in (Figueiredo & Jain, 2002). A Bayesian nonparametric mixture model that incorporates many, but not arbitrary, unimodal distributions is considered in (Rodríguez & Walker, 2014).

Principled work has been done on estimating mixtures of log-concave distributions (Walther, 2009), and (Chan et al., 2013) describes how densities of discrete unimodal mixtures can be estimated. However, actually estimating generic unimodal mixtures themselves appears to be unaddressed in the literature, even in one dimension. Indeed, even estimating individual modes and their associated uncertainties or significances has only been addressed recently (Genovese et al., 2016; Mukhopadhyay, 2017).

## 3. The Basic Algorithm

Given $f$, the sweep algorithm of (Baryshnikov & Ghrist, 2011) yields $(\pi, p) \models f$. We will repeatedly perturb $(\pi, p)$ to obtain (3) using Lemmas 1 and 2, which are respectively in appendices A and B. Lemma 1 states that $J$ is convex under perturbations of $(\pi, p)$ that preserve $\langle \pi, p \rangle$. Lemma 2 is a characterization of perturbations of two components of a piecewise affine and continuous (or piecewise constant) mixture that preserve the predicate $(\pi, p) \models f$, i.e., that preserve unimodality (as in Figure 2) and the mixture density. Together, these results entail Theorem 1, which establishes that greedy unimodality- and density-preserving local perturbations of pairs of mixture components converge to (3).

```
Algorithm 1 Topological Mixture Estimation (TME)
  Input: function data {x_k, f(x_k)}
  Initialize (π, p) ⊨ f as in (Baryshnikov & Ghrist, 2011)
  repeat
    for each evaluation point and pair of components do
      Greedily perturb (π, p) to (π′, p′) using Lemma 2
    end for
    Update (π, p) = arg max J(π′, p′)
  until (π, p) and/or J(π, p) converge
  Output: (π, p)
```

*Figure 2. A unimodal sequence with unimodality-saturating local perturbations $\varepsilon^\pm$ from Lemma 2 in appendix B indicated above and below.*
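Concretely, Lemma 2 (stated in appendix B) caps how much a single value of a discretized unimodal sequence may be lowered or raised before nonnegativity or unimodality fails; the sketch below (our helper, assuming $y_0 = 0 = y_N$ as in the lemma) computes the saturation bounds (17) and (18). In Algorithm 1, a trial perturbation in effect shifts mass at one evaluation point between two weighted components, so the admissible shift is capped by the donor's decrease bound and the recipient's increase bound.

```python
import numpy as np

def perturbation_bounds(y, r):
    """Saturation bounds of Lemma 2 at index r for a nonnegative unimodal
    sequence y with y[0] = y[-1] = 0 and 1 <= r <= len(y) - 2.

    Returns (eps_minus, eps_plus): the largest decrease (17) and increase
    (18) of y[r] that preserve nonnegativity and unimodality.
    """
    y = np.asarray(y, dtype=float)
    plateau = np.flatnonzero(y == y.max())   # contiguous for unimodal y
    lo, hi = plateau[0], plateau[-1]
    eps_minus = y[r] - min(y[r - 1], y[r + 1])         # (17)
    if lo - 1 <= r <= hi + 1:                          # (18): at/next to the mode
        eps_plus = np.inf
    else:                                              # (18): elsewhere
        eps_plus = max(y[r - 1], y[r + 1]) - y[r]
    return eps_minus, eps_plus

print(perturbation_bounds([0, 1, 3, 2, 0], 1))  # (1.0, inf): next to the peak
```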
**Theorem 1.** *Let $-\infty = x_{-1} < x_0 < \cdots < x_N < x_{N+1} = \infty$ and $f$ be piecewise constant (or affine) over each $[x_k, x_{k+1}]$. Then Algorithm 1 efficiently computes (3).*

*Proof.* By Lemma 1 (see appendix A), greedily and locally perturbing the mixture $(\pi, p) \models f$ according to Lemma 2 (see appendix B), then updating the mixture according to the perturbation which optimizes $J$, gives the desired result in $O(MN)$ iterations. This result is unique by convexity. Each iteration requires $O(M^2 N)$ trial perturbations, each of which in turn requires $O(MN)$ arithmetic operations to evaluate $J$. ∎

## 4. Topological Density Estimation

The obvious situation of practical interest for TME is that a density has been obtained from a preliminary estimation process involving some sample data. There is a natural approach to this preliminary estimation process, called topological density estimation (TDE) (Huntsman, 2017), that dovetails with TME.

```
Algorithm 2 Topological Density Estimation (TDE)
  Input: {X_j}
  for each proposed bandwidth h do
    Compute u_X(h) using (5)
  end for
  Compute m̂_X using (6)
  Output: ĥ_X using (7)
```

We recall the basic idea here (for pseudocode, see Algorithm 2). Given a kernel $K$ and sample data $X_j$ for $1 \le j \le n$, and for each proposed bandwidth $h$, compute the kernel density estimate (Silverman, 1986; Chen, 2017)

$$\hat f_{h;X} := \frac{1}{n} \sum_{j=1}^n K_{X_j, h} \tag{4}$$

where $K_{\mu,\sigma}(x) := \frac{1}{\sigma} K\!\left(\frac{x - \mu}{\sigma}\right)$. Next, compute

$$u_X(h) := \mathrm{ucat}(\hat f_{h;X}) \tag{5}$$

and estimate the unimodal category of the PDF that $X$ is sampled from via

$$\hat m_X := \arg\max_m \mu(u_X^{-1}(m)) \tag{6}$$

where $\mu$ denotes an appropriate measure (nominally counting measure or the pushforward of Lebesgue measure under the transformation $h \mapsto 1/h$). (6) gives the most prevalent and topologically persistent (Ghrist, 2014; Oudot, 2015) value of the unimodal category, i.e., this is a topologically robust estimate of the number of components required to produce the PDF that $X$ is sampled from as a mixture. While any element of $u_X^{-1}(\hat m_X)$ is a bandwidth consistent with the estimate (6), considerations of robustness lead us to typically make the more detailed nominal specification

$$\hat h_X := \mathrm{median}_\mu(u_X^{-1}(\hat m_X)). \tag{7}$$

### 4.2. Performance

TDE turns out to be very computationally efficient relative to the traditional technique of cross-validation (CV). On highly multimodal densities, TDE is competitive or at least reasonably performant relative to CV and other nonparametric density estimation approaches with respect to traditional statistical evaluation criteria. Moreover, TDE outperforms other approaches when qualitative criteria such as the number of local maxima and the unimodal category itself are considered (see Figures 3-5). In practice, such qualitative criteria are generally of paramount importance. For example, precisely estimating the shape of a density is generally less important than determining if it has multiple modes.

As an illustration, consider $\mu(j,m) := \frac{j}{m+1}$, $\sigma(k,m) := 2^{-(k+2)}(m+1)^{-2}$, and the family of distributions

$$f_{km} := \frac{1}{m} \sum_{j=1}^m K_{\mu(j,m),\, \sigma(k,m)} \tag{8}$$

for $1 \le k \le 3$ and $1 \le m \le 10$, where here $K$ is the standard Gaussian density: see Figure 3. Exhaustive details relating to the evaluation of TDE on this family and other densities are in the software package and test suite (BAE Systems, 2017): here, we merely show performance data for (8) in Figures 4 and 5.

*Figure 3. The densities $f_{km}$ in (8) for $1 \le k \le 3$ and $1 \le m \le 6$ over $[-0.5, 1.5]$. Rows are indexed by $k$; columns by $m$. The upper left panel shows $f_{11}$ and the lower right panel shows $f_{36}$.*

*Figure 4. Pseudotransparency plots of performance measures relating to the unimodal category for the family (8) depicted in Figure 3, with $n = 500$ and using a Gaussian kernel. From left to right: empirical distributions of estimated ucat for TDE (blue) and for CV (red), and the empirical probability that the estimate of ucat is correct. From top to bottom, $k = 1, \dots, 3$; each panel has $m = 1, \dots, 10$ along the horizontal axis.*
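Assembling (4)-(7) with $\mu$ taken to be counting measure on the trial bandwidths, Algorithm 2 fits in a few lines. The sketch below is ours rather than the reference implementation (BAE Systems, 2017); it reuses `sweep_decomposition` from the sketch in §1.2 as a numerical $\mathrm{ucat}$.

```python
import numpy as np

def gaussian_kde(samples, h, grid):
    """Kernel density estimate (4) with the standard Gaussian kernel."""
    z = (grid[:, None] - np.asarray(samples)[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

def tde(samples, bandwidths, grid):
    """Algorithm 2 with mu = counting measure on the trial bandwidths.

    Returns the persistent unimodal category (6) and the median (7) of the
    bandwidths attaining it; u_X(h) of (5) is computed via the sweep sketch.
    """
    bandwidths = np.asarray(bandwidths, dtype=float)
    u = np.array([len(sweep_decomposition(gaussian_kde(samples, h, grid)))
                  for h in bandwidths])                      # (5)
    cats, counts = np.unique(u, return_counts=True)
    m_hat = int(cats[counts.argmax()])                       # (6)
    h_hat = float(np.median(bandwidths[u == m_hat]))         # (7)
    return m_hat, h_hat
```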
*Figure 5. As in Figure 4, but for the number of local maxima.*

TDE has the very useful feature (shared by essentially no high-performing density estimation technique other than CV) that it requires no free parameters or assumptions. Indeed, TDE can be used to evaluate its own suitability: for unimodal distributions, it is often not an ideal choice, but it is good at detecting this situation in the first place. Furthermore, TDE is very efficient computationally.

In situations of practical interest, it is tempting to couple TDE and TME in the obvious way: i.e., perform them sequentially and independently. This yields a completely nonparametric estimate of a mixture from sample data alone. However, there is a much better way to couple these techniques, as we shall see in the sequel.

## 5. Blurring and Deblurring

### 5.1. Blurring

Recall that a log-concave function is unimodal, and moreover that a function is log-concave iff its convolutions with unimodal functions are identically unimodal (Ibragimov, 1956; Keilson & Gerber, 1971; Bertin et al., 2013). This observation naturally leads to the following question: if $(\pi, p) \models f$, how good an approximation to the $\delta$ distribution must a log-concave density $g$ be in order to have $(\pi, p * g) \models f * g$? In particular, suppose that $g$ is a Gaussian density: what bandwidth must it have? An answer to this question of how much blurring a minimal unimodal mixture model can sustain defines a topological scale (viz., the persistence of the unimodal category under blurring) that we proceed to illustrate in light of TDE.

In this paragraph we assume that $K$ is the standard Gaussian density, so that $K_{\mu,h} * K_{\mu',h'} = K_{\mu+\mu',(h^2+h'^2)^{1/2}}$ and $\hat f_{h;X} * K_{0,h'} = \hat f_{(h^2+h'^2)^{1/2};X}$. Write $\hat h_X$ for the bandwidth obtained via TDE, whether via the nominal specification (7) or any other: by construction we have that $\inf u_X^{-1}(\hat m_X) \le \hat h_X \le \sup u_X^{-1}(\hat m_X)$. Now if $(\pi, p) \models \hat f_{\hat h_X;X}$, then $\hat m_X = u_X(\hat h_X) = |(\pi,p)|$. In order to have $(\pi, p * K_{0,h'}) \models f * K_{0,h'}$, it must be that $\hat m_X = u_X(\hat h_X) = |(\pi,p)| = |(\pi, p * K_{0,h'})| = u_X((\hat h_X^2 + h'^2)^{1/2})$, i.e.,

$$h' \le \left[\left(\sup u_X^{-1}(\hat m_X)\right)^2 - \hat h_X^2\right]^{1/2}. \tag{9}$$

In particular, we have the weaker inequality involving a purely topological scale:

$$h' \le \left[\left(\sup u_X^{-1}(\hat m_X)\right)^2 - \left(\inf u_X^{-1}(\hat m_X)\right)^2\right]^{1/2}. \tag{10}$$

The preceding considerations generalize straightforwardly if we define $u_f(h) := \mathrm{ucat}(f * K_{0,h})$, where once again $K$ is a generic kernel. This generalizes (5) so long as we associate sample data with a uniform average of $\delta$ distributions. Under reasonable conditions, we can write $u_f(0) = \mathrm{ucat}(f)$, and it is easy to see that the analogous bound is

$$h' \le \sup u_f^{-1}(u_f(0)). \tag{11}$$

Of course, (11) merely restates the triviality that the blurred mixture ceases to be minimal precisely when the number of mixture components exceeds the unimodal category of the mixture density. Meanwhile, the special case furnished by TDE with the standard Gaussian kernel affords sufficient structure for a slightly less trivial statement.
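In the Gaussian/TDE setting, (9) and (10) are immediate to evaluate from the per-bandwidth categories computed inside the `tde` sketch above; the helper below (our naming) does so.

```python
import numpy as np

def blur_bounds(bandwidths, u, m_hat, h_hat):
    """Necessary bounds on extra Gaussian blur after TDE.

    Per (9), a minimal mixture at bandwidth h_hat can only stay minimal
    under convolution with K_{0,h'} if (h_hat^2 + h'^2)^(1/2) remains in
    the band u^{-1}(m_hat); (10) is the weaker, purely topological bound
    obtained by replacing h_hat with the band's infimum.
    """
    band = np.asarray(bandwidths, dtype=float)[np.asarray(u) == m_hat]
    h_sup, h_inf = band.max(), band.min()
    bound_9 = np.sqrt(max(h_sup ** 2 - h_hat ** 2, 0.0))   # (9)
    bound_10 = np.sqrt(h_sup ** 2 - h_inf ** 2)            # (10)
    return bound_9, bound_10
```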
### 5.2. Deblurring/reblurring

The considerations of §5.1 suggest how to couple TDE and TME in a much more effective way than performing them sequentially and independently. The idea is to use a Gaussian kernel and, instead of (7), pick the bandwidth

$$\hat h_X^{(-)} := \inf u_X^{-1}(\hat m_X) \tag{12}$$

and then perform TME; finally, convolve the results with $K_{0,h'}$, where

$$h' := \left[\hat h_X^2 - \left(\hat h_X^{(-)}\right)^2\right]^{1/2}. \tag{13}$$

This preserves the result of TDE while giving a smoother, less artificial, and more practically useful mixture estimate than the information-theoretically optimal result.

```
Algorithm 3 Reblurred Topological Mixture Estimation
  Input: {X_j}
  for each proposed bandwidth h do
    Compute u_X(h) using (5)
  end for
  Compute m̂_X using (6)
  Compute ĥ_X^(−) using (12)
  Compute (π, p) from f̂_{ĥ_X^(−);X} using Algorithm 1
  Update p = p ∗ K_{0,h′} with h′ as in (13)
  Output: (π, p)
```

Of course, a similar tactic can be performed directly on a density $f$ by considering its Fourier deconvolution $\mathcal{F}^{-1}(\mathcal{F}f/\mathcal{F}K_{0,h'})$, where $\mathcal{F}$ denotes the Fourier transform and $h'$ is as in (11): however, any a priori justification for such a tactic is necessarily context-dependent in general, and our experience suggests that its implementation would be delicate and/or prone to aliasing. Nevertheless, this would be particularly desirable in the context of heavy-tailed distributions, where kernel density estimation requires much larger sample sizes in order to achieve acceptable results. In this context it would also be worth considering the use of a symmetric stable density (Uchaikin & Zolotarev, 1999; Nolan, 2018) (e.g., a Cauchy density) as a kernel, with the aim of recapturing the essence of (13).

## 6. Examples

We present two phenomenologically illustrative examples. First, in Figures 6 and 7 we consider the $n = 272$ waiting times between eruptions of the Old Faithful geyser from the data set in (Härdle, 2012). Then, in Figures 8 and 9 we consider the $n = 2107$ Sloan Digital Sky Survey $g - r$ color indices accessed from the VizieR database (Ochsenbein et al., 2000) at http://cdsarc.u-strasbg.fr/viz-bin/Cat?J/ApJ/700/523 and discussed in (An et al., 2009); the latter example is replicated in (BAE Systems, 2018).

*Figure 6. TME applied to $n = 272$ waiting times between eruptions of the Old Faithful geyser. Panels show area plots of unimodal decompositions obtained by (top) the sweep algorithm on a TDE with bandwidth given by (7); (second from top) the result of (3) on the same; (second from bottom) the deblurred result of (3) on a TDE with bandwidth given by (12); (bottom) the result of reblurring by convolving the deblurred mixture with a Gaussian kernel with bandwidth given by (13). Note that three of the four mixture estimates have the same density, but that the deblurred density is different (we highlight this with a \* annotation).*

*Figure 7. Line plots of the same decompositions as Figure 6.*

*Figure 8. TME applied to $n = 2107$ $g - r$ color indices from (An et al., 2009). Panels are otherwise as in Figure 6.*

As suggested, several phenomena are readily apparent from these examples. First, mixtures obtained via the sweep algorithm are manifestly parity-dependent, i.e., the direction of sweeping matters; second, mixtures obtained via TME alone exhibit artificial anti-overlapping behavior; third, deblurring followed by reblurring preserves unimodality, the overall density, a topologically persistent invariant (viz., the unimodal category), and the spirit of information-theoretical optimality, while producing an obviously better behaved mixture; fourth and finally, the various techniques involved here can significantly shift classification/decision boundaries based on the dominance of various mixture components.
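The reblurring that produces the bottom panels of Figures 6-9 amounts to (13) plus one Gaussian convolution per component; a minimal sketch (ours, with the TME step of Algorithm 1 assumed to have already produced the deblurred components) follows.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def reblur(p, grid, h_hat, h_minus):
    """Final step of Algorithm 3: convolve each deblurred component with
    K_{0,h'}, where h' is given by (13). Since K_{0,a} * K_{0,b} =
    K_{0,(a^2+b^2)^(1/2)}, the mixture density returns to the TDE
    bandwidth h_hat; the weights pi are unchanged.

    p: (M, K) array of components sampled on a uniform grid.
    """
    h_prime = np.sqrt(h_hat ** 2 - h_minus ** 2)   # (13)
    dx = grid[1] - grid[0]
    return gaussian_filter1d(np.atleast_2d(np.asarray(p, dtype=float)),
                             sigma=h_prime / dx, axis=-1, mode="constant")
```

Since a Gaussian is log-concave, each reblurred component remains unimodal (Ibragimov, 1956), so both the number of components and the mixture density at bandwidth $\hat h_X$ are preserved.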
*Figure 9. TME applied to $n = 2107$ $g - r$ color indices from (An et al., 2009). Panels are otherwise as in Figure 7.*

While the data in Figures 6 and 7 is at least qualitatively approximated by a two-component Gaussian mixture, it is clear that a three-component Gaussian mixture cannot capture the highly oscillatory behavior of the density in Figures 8 and 9. Indeed, this example illustrates how such oscillatory behavior can actually arise from a unimodal mixture with many fewer components than might naively appear to be required.

*Figure 10. Structural analyses of the color index data from Figure 9. Top panel: density estimates from CV, the default LPMode algorithm of (Mukhopadhyay, 2017) (kindly provided by its author), and TDE. Second panel from top: the local maxima of the density estimates above. Third through fifth panels from top: the result of (3) on CV, LPMode, and TDE, respectively, with maxima of components indicated. Bottom panel: the reblurred mixture from the bottom panel of Figure 9, augmented with component maxima.*

Figure 10 shows that there are strong and independent grounds to conclude that the color index data of Figure 9 is produced by a unimodal mixture of three components, with componentwise modes as suggested by TME, and furthermore that CV undersmooths this data. For each density estimate shown, each componentwise maximum of the corresponding mixture (3) is virtually identical to one of the local maxima of the density estimate itself: this is a consequence of the anti-overlapping tendency described above. In particular, the fourth panel of Figure 10 illustrates that it is possible and potentially advantageous to use TME as an alternative mode identification technique in the LPMode algorithm of (Mukhopadhyay, 2017). Furthermore, while we have not implemented a reliable Fourier deconvolution/reblurring algorithm of the sort hinted at in §5.2, the fifth and sixth panels of Figure 10 suggest that this is not particularly important for the narrowly defined task of mode finding/bump hunting.

While the $O(M^4 N^3)$ arithmetic operations of TME as implemented in Algorithm 1 might seem uncomfortably high, in practice $M$ is generally quite small, and it is reasonable to enforce $N = 10^2$, as in fact we do in the MATLAB implementation (BAE Systems, 2018) and in §6. Furthermore, avoiding redundant perturbations and incrementally computing $J$ would dramatically reduce the computational complexity. Still, even without the benefit of any such refinements, the color index example in §6 runs in less than 40 seconds. Note also that TDE and de/reblurring as respectively implemented in Algorithms 2 and 3 are relatively computationally inexpensive. The former case is helped by resampling data to on the order of $10^3$ quantiles, which has no material effect on the output in cases where kernels are appropriate to use. That said, using a fast generalized Gauss transform (Spivak et al., 2010) would dramatically accelerate TDE.

Even though TME is presently limited to one dimension, it is still very useful due to the preponderance of one-dimensional problems. For instance, TME is a good candidate to improve on some of the best practical unsupervised image thresholding techniques (Kapur et al., 1985; Kittler & Illingworth, 1986; Sezgin & Sankur, 2004), and has prospects for enhancing data/sensor fusion, information-theoretical analysis of time series (by using sliding time windows to define samples), and many other tasks.
Furthermore, the one-dimensional framework can be used with random projections of high-dimensional data in a way that is likely to yield improvements for model selection (Feng & Hamerly, 2007) (cf. (Kalai et al., 2012)) and anomaly detection (Pevný, 2016). We plan to explore these topics in future work, with the associated intent of gauging the art of the possible with respect to determining unimodal decompositions in dimension > 1.

That said, the extension of TME to dimension > 1 will require effort and mathematical tools well beyond those used in this paper. One reason is that computing a unimodal decomposition is algorithmically undecidable in high dimensions, a property inherited from the problem of determining contractibility of simplicial complexes (Tancer, 2016) and the geometric realization theorem (Edelsbrunner & Harer, 2010) applied to level and upper excursion sets. Therefore, extending the constructions of this paper will require some modification of the notion of unimodal category in dimension > 1, approximations, and/or heuristics. For example, restricting to convex versus contractible upper excursion sets is probably desirable on intuitive as well as computational grounds. However, even in two dimensions the corresponding problem of constructing minimal convex partitions of polygons is NP-hard. On the other hand, the case without interior holes is efficiently solvable (O'Rourke et al., 2017), and there is a quasi-polynomial time approximation scheme for the general case (Bandyapadhyay et al., 2015). The heuristic of (Liu et al., 2010) seems to be a good starting point for exploring relevant tradeoffs.

An appropriate tactic for directly leveraging the one-dimensional framework en route to higher dimensions appears to be tomography in the spirit of the topological Radon transform (Ghrist, 2014). Besides working with random projections in this vein, another sensible approach (suggested to the author by Robert Ghrist) is to foliate (Lawson Jr., 1974) the domain of a sample (in practice this would just mean taking a family of parallel lines) and perform TME on data in tubular neighborhoods of nearby leaves of the foliation, then assemble the results, essentially by interpolating. Here topological tools such as sheaves and Morse theory seem inevitably to be required in order to do things in a globally coherent way. A suitable member of the class of metrics on mixtures introduced in (Liu & Huang, 2000) that provides data relevant for assembly as a byproduct will likely also be necessary.

Finally, we note that there are prospects for recursively coupling TME and TDE: the idea here is to pull mixture components back to weighted subsamples, then re-run TDE (or, in the unimodal case, CV) on these individually. The resulting variable-bandwidth mixture estimator would give a multiresolution description of data that would be likely to yield further improvements in many applications.

## A. Convexity

The following lemma shows that $J$ is convex as we gradually shift part of one mixture component to another.

**Lemma 1.** Let $|(\pi, p)| = 3$ and define

$$\pi_{12,t} := \pi_1 + (1-t)\pi_2; \quad \pi_{23,t} := t\pi_2 + \pi_3;$$
$$p_{12,t} := \frac{\pi_1 p_1 + (1-t)\pi_2 p_2}{\pi_{12,t}}; \quad p_{23,t} := \frac{t\pi_2 p_2 + \pi_3 p_3}{\pi_{23,t}},$$

so that $\langle (\pi_{12,t}, \pi_{23,t}), (p_{12,t}, p_{23,t}) \rangle = \langle \pi, p \rangle$. The function $g_{\pi,p} : [0,1] \to [0,\infty)$ defined by

$$g_{\pi,p}(t) := J\left((\pi_{12,t}, \pi_{23,t}), (p_{12,t}, p_{23,t})\right) \tag{14}$$

satisfies

$$g_{\pi,p}(t) \le t\, g_{\pi,p}(1) + (1-t)\, g_{\pi,p}(0). \tag{15}$$

*Proof.* We have that $g_{\pi,p}(t) = H(\langle \pi, p \rangle) - \langle (\pi_{12,t}, \pi_{23,t}), (H(p_{12,t}), H(p_{23,t})) \rangle$. Furthermore, if we write $\pi_{12} := \pi_1 + \pi_2$, $\pi_{23} := \pi_2 + \pi_3$, $p_{12} := \frac{\pi_1 p_1 + \pi_2 p_2}{\pi_{12}}$, and $p_{23} := \frac{\pi_2 p_2 + \pi_3 p_3}{\pi_{23}}$, then

$$\pi_{12,t} = t\pi_1 + (1-t)\pi_{12}; \quad \pi_{23,t} = t\pi_{23} + (1-t)\pi_3;$$
$$p_{12,t} = \frac{t\pi_1 p_1 + (1-t)\pi_{12} p_{12}}{\pi_{12,t}}; \quad p_{23,t} = \frac{t\pi_{23} p_{23} + (1-t)\pi_3 p_3}{\pi_{23,t}}.$$

It is well known that $H$ is a concave functional: from this it follows that

$$H(p_{12,t}) \ge \frac{t\pi_1}{\pi_{12,t}} H(p_1) + \frac{(1-t)\pi_{12}}{\pi_{12,t}} H(p_{12}); \quad H(p_{23,t}) \ge \frac{t\pi_{23}}{\pi_{23,t}} H(p_{23}) + \frac{(1-t)\pi_3}{\pi_{23,t}} H(p_3).$$

Therefore

$$g_{\pi,p}(t) = H(\langle \pi, p \rangle) - \langle (\pi_{12,t}, \pi_{23,t}), (H(p_{12,t}), H(p_{23,t})) \rangle$$
$$\le H(\langle \pi, p \rangle) - t\pi_1 H(p_1) - (1-t)\pi_{12} H(p_{12}) - t\pi_{23} H(p_{23}) - (1-t)\pi_3 H(p_3)$$
$$= t\left[H(\langle \pi, p \rangle) - \langle (\pi_1, \pi_{23}), (H(p_1), H(p_{23})) \rangle\right] + (1-t)\left[H(\langle \pi, p \rangle) - \langle (\pi_{12}, \pi_3), (H(p_{12}), H(p_3)) \rangle\right]$$
$$= t\, g_{\pi,p}(1) + (1-t)\, g_{\pi,p}(0),$$

as claimed. ∎
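A throwaway numerical check of (15) (ours, using the discretized entropy from the sketch in §1.2):

```python
import numpy as np

def H(f, dx):
    """Entropy of a density sampled on a uniform grid."""
    m = f > 0
    return -np.sum(f[m] * np.log(f[m])) * dx

x = np.linspace(-6, 6, 2001)
dx = x[1] - x[0]
gauss = lambda mu, s: np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))
pi = np.array([0.3, 0.4, 0.3])
p = np.stack([gauss(-2, 0.7), gauss(0, 0.7), gauss(2, 0.7)])

def g(t):
    """g_{pi,p}(t) of (14): J of the two-component coarsening of Lemma 1."""
    w = np.array([pi[0] + (1 - t) * pi[1], t * pi[1] + pi[2]])
    q = np.stack([(pi[0] * p[0] + (1 - t) * pi[1] * p[1]) / w[0],
                  (t * pi[1] * p[1] + pi[2] * p[2]) / w[1]])
    return H(w @ q, dx) - w @ np.array([H(qm, dx) for qm in q])

ts = np.linspace(0, 1, 11)
vals = np.array([g(t) for t in ts])
chord = ts * vals[-1] + (1 - ts) * vals[0]
assert np.all(vals <= chord + 1e-9)  # (15) holds numerically
```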
## B. Preserving Unimodality

Suppose that $(\pi, p)$ is a unimodal mixture on $\mathbb{R}$ with $|(\pi, p)| > 1$. We would like to determine how we can perturb two components of this mixture so that the result is still unimodal and yields the same density. In the event that the mixture is piecewise affine and continuous (or piecewise constant), the space of permissible perturbations can be characterized by the following lemma.

**Lemma 2.** For $0 \le k \le N$, let $y_k \in [0,\infty)$ be such that $y_0 = 0 = y_N$ and there are integers $\ell, u$ satisfying $0 < \ell \le u < N$ with

$$y_0 \le \cdots \le y_{\ell-1} < y_\ell = \cdots = y_u > y_{u+1} \ge \cdots \ge y_N. \tag{16}$$

(That is, $y_1, \dots, y_{N-1}$ is a nonnegative, nontrivial unimodal sequence.) Then for $1 \le r \le N-1$ and $\varepsilon_r^- \ge 0$, $y_k - \delta_{kr}\varepsilon_r^-$ is nonnegative and unimodal iff

$$\varepsilon_r^- \le y_r - \min\{y_{r-1}, y_{r+1}\}. \tag{17}$$

Similarly, for $\varepsilon_r^+ \ge 0$, $y_k + \delta_{kr}\varepsilon_r^+$ is nonnegative and unimodal iff

$$\varepsilon_r^+ \le \begin{cases} \infty & \text{if } \ell - 1 \le r \le u + 1, \\ \max\{y_{r-1}, y_{r+1}\} - y_r & \text{otherwise.} \end{cases} \tag{18}$$

*Proof (sketch).* We first sketch ((17), $\Leftarrow$). Nonnegativity follows from $0 \le \min\{y_{r-1}, y_{r+1}\} \le y_r - \varepsilon_r^-$. Unimodality follows from a series of trivial checks for the cases $1 \le r < \ell - 1$, $r = \ell - 1$, $r = \ell$, and $\ell < r < u$: the remaining cases $r = u$, $r = u + 1$, and $u + 1 < r \le N - 1$ follow from symmetry. For example, in the case $1 \le r < \ell - 1$, we only need to show that $y_{r-1} \le y_r - \varepsilon_r^- \le y_{r+1}$. A sketch of ((17), $\Rightarrow$) amounts to using the same cases and symmetry argument to perform equally trivial checks. For example, in the case $1 \le r < \ell - 1$, we have $\varepsilon_r^- \le y_r - y_{r-1} = y_r - \min\{y_{r-1}, y_{r+1}\}$.

The proof of (18) is mostly similar to that of (17): the key difference here is that any argument adjacent to or at a point where the maximum is attained can have its value increased arbitrarily without affecting unimodality (or nonnegativity). The example in Figure 2 is probably more illuminating than filling in the details of the proof sketch above. ∎

## Acknowledgements

The author thanks Adelchi Azzalini, Robert Ghrist, Subhadeep Mukhopadhyay, and John Nolan for their helpful and patient discussions, and reviewers for their comments and advice. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA or AFRL.

## References

An, D., Pinsonneault, M. H., Masseron, T., Delahaye, F., Johnson, J. A., Terndrup, D. M., Beers, T. C., Ivans, I. I., and Ivezić, Ž. Galactic globular and open clusters in the Sloan Digital Sky Survey. II. Test of theoretical stellar isochrones. The Astrophysical Journal, 700(1):523, 2009.

Azzalini, A. and Menardi, G. Clustering via nonparametric density estimation: the R package pdfCluster. Journal of Statistical Software, 57(11):1-26, 2014.

BAE Systems. Topological density estimation. https://git.io/vyC4l, 2017.

BAE Systems. Topological mixture estimation. https://git.io/vAJck, 2018.
Bandyapadhyay, S., Bhowmick, S., and Varadarajan, K. Approximation schemes for partitioning: convex decomposition and surface approximation. In Proceedings of SODA, 2015.

Baryshnikov, Y. and Ghrist, R. Unimodal category and topological statistics. In Proceedings of NOLTA, 2011.

Basso, R. M., Lachos, V. H., Cabral, C. R. B., and Ghosh, P. Robust mixture modeling based on scale mixtures of skew-normal distributions. Computational Statistics & Data Analysis, 54(12):2926-2941, 2010.

Bertin, E. M. J., Cuculescu, I., and Theodorescu, R. Unimodality of Probability Measures. Springer, 2013.

Briët, J. and Harremoës, P. Properties of classical and quantum Jensen-Shannon divergence. Physical Review A, 79(5):052311, 2009.

Burnham, K. P. and Anderson, D. R. Model Selection and Multimodel Inference: A Practical Information-theoretic Approach. Springer, 2003.

Chan, S.-O., Diakonikolas, I., Servedio, R. A., and Sun, X. Learning mixtures of structured distributions over discrete domains. In Proceedings of SODA, 2013.

Chen, Y.-C. A tutorial on kernel density estimation and recent advances. arXiv:1704.03924, 2017.

Edelsbrunner, H. and Harer, J. L. Computational Topology: An Introduction. AMS, 2010.

Feng, Y. and Hamerly, G. PG-means: learning the number of clusters in data. In Proceedings of NIPS, 2007.

Figueiredo, M. A. T. and Jain, A. K. Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):381-396, 2002.

Genovese, C. R., Perone-Pacifico, M., Verdinelli, I., and Wasserman, L. Non-parametric inference for density modes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1):99-126, 2016.

Ghrist, R. W. Elementary Applied Topology. CreateSpace, 2014.

Grønlund, A., Larsen, K. G., Mathiasen, A., and Nielsen, J. S. Fast exact k-means, k-medians and Bregman divergence clustering in 1D. arXiv:1701.07204, 2017.

Härdle, W. Smoothing Techniques: With Implementation in S. Springer Science & Business Media, 2012.

Huntsman, S. Topological density estimation. In Proceedings of NOLTA, 2017.

Ibragimov, I. A. On the composition of unimodal distributions. Theory of Probability & Its Applications, 1(2):255-260, 1956.

Kalai, A. T., Moitra, A., and Valiant, G. Disentangling Gaussians. Communications of the ACM, 55(2):113-120, 2012.

Kapur, J. N., Sahoo, P. K., and Wong, A. K. C. A new method for gray-level picture thresholding using the entropy of the histogram. Computer Vision, Graphics, and Image Processing, 29:273-285, 1985.

Keilson, J. and Gerber, H. Some results for discrete unimodality. Journal of the American Statistical Association, 66(334):386-389, 1971.

Kittler, J. and Illingworth, J. Minimum error thresholding. Pattern Recognition, 19(1):41-47, 1986.

Lawson Jr., H. B. Foliations. Bulletin of the American Mathematical Society, 80(3):369-418, 1974.

Lee, S. X. and McLachlan, G. J. Finite mixtures of canonical fundamental skew t-distributions. Statistics and Computing, 26(3):573-589, 2016.

Li, J., Ray, S., and Lindsay, B. G. A nonparametric statistical approach to clustering via mode identification. Journal of Machine Learning Research, 8(Aug):1687-1723, 2007.

Lin, T. I., Lee, J. C., and Hsieh, W. J. Robust mixture modeling using the skew t distribution. Statistics and Computing, 17(2):81-92, 2007.

Liu, H., Liu, W., and Latecki, L. J. Convex shape decomposition. In Proceedings of CVPR, 2010.
Liu, Z. and Huang, Q. A new distance measure for probability distribution function of mixture type. In Proceedings of ICASSP, 2000.

McLachlan, G. J. and Peel, D. Finite Mixture Models. John Wiley & Sons, 2004.

McLachlan, G. J. and Rathnayake, S. On the number of components in a Gaussian mixture model. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5):341-355, 2014.

Melnykov, V. and Maitra, R. Finite mixture models and model-based clustering. Statistics Surveys, 4:80-116, 2010.

Mints, A. and Hekker, S. A unified tool to estimate distances, ages, and masses (UniDAM) from spectrophotometric data. Astronomy & Astrophysics, 604:A108, 2017.

Mirkin, B. Choosing the number of clusters. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(3):252-260, 2011.

Mukhopadhyay, S. Large-scale mode identification and data-driven sciences. Electronic Journal of Statistics, 11(1):215-240, 2017.

Nielsen, F. and Nock, R. Optimal interval clustering: application to Bregman clustering and statistical mixture learning. IEEE Signal Processing Letters, 21(10):1289-1292, 2014.

Nolan, J. P. Stable Distributions: Models for Heavy Tailed Data. Birkhäuser, 2018.

Ochsenbein, F., Bauer, P., and Marcout, J. The VizieR database of astronomical catalogues. Astronomy and Astrophysics Supplement Series, 143(1):23-32, 2000.

O'Rourke, J., Suri, S., and Tóth, C. D. Polygons. In Goodman, J. E., O'Rourke, J., and Tóth, C. D. (eds.), Handbook of Discrete and Computational Geometry. CRC Press, 3rd edition, 2017.

Oudot, S. Y. Persistence Theory: From Quiver Representations to Data Analysis. American Mathematical Society, 2015.

Pevný, T. Loda: lightweight on-line detector of anomalies. Machine Learning, 102:275-304, 2016.

Rodríguez, C. E. and Walker, S. G. Univariate Bayesian nonparametric mixture modeling with unimodal kernels. Statistics and Computing, 24(1):35-49, 2014.

Sezgin, M. and Sankur, B. Survey over image thresholding techniques and quantitative performance evaluation. Journal of Electronic Imaging, 13(1):146-165, 2004.

Silverman, B. W. Density Estimation for Statistics and Data Analysis. CRC Press, 1986.

Spivak, M., Veerapaneni, S. K., and Greengard, L. The fast generalized Gauss transform. SIAM Journal on Scientific Computing, 32(5):3092-3107, 2010.

Sugar, C. A. and James, G. M. Finding the number of clusters in a dataset: an information-theoretic approach. Journal of the American Statistical Association, 98(463):750-763, 2003.

Tancer, M. Recognition of collapsible complexes is NP-complete. Discrete and Computational Geometry, 55(1):21-38, 2016.

Tibshirani, R., Walther, G., and Hastie, T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411-423, 2001.

Uchaikin, V. V. and Zolotarev, V. M. Chance and Stability: Stable Distributions and Their Applications. Walter de Gruyter, 1999.

Walther, G. Inference and modeling with log-concave distributions. Statistical Science, pp. 319-327, 2009.

Wang, H. and Song, M. Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming. The R Journal, 3(2):29-33, 2011.

Xu, D. and Tian, Y. A comprehensive survey of clustering algorithms. Annals of Data Science, 2(2):165-193, 2015.