DATA REPRESENTATION AND COMPRESSION USING LINEAR-PROGRAMMING APPROXIMATIONS

Hristo S. Paskov, Computer Science Department, Stanford University, hpaskov@stanford.edu
John C. Mitchell, Computer Science Department, Stanford University, jcm@stanford.edu
Trevor J. Hastie, Statistics Department, Stanford University, hastie@stanford.edu

Published as a conference paper at ICLR 2016.

ABSTRACT

We propose Dracula, a new framework for unsupervised feature selection from sequential data such as text. Dracula learns a dictionary of n-grams that efficiently compresses a given corpus and recursively compresses its own dictionary; in effect, Dracula is a deep extension of Compressive Feature Learning. It requires solving a binary linear program that may be relaxed to a linear program. Both problems exhibit considerable structure, their solution paths are well behaved, and we identify parameters which control the depth and diversity of the dictionary. We also discuss how to derive features from the compressed documents and show that while certain unregularized linear models are invariant to the structure of the compressed dictionary, this structure may be used to regularize learning. Experiments demonstrate the efficacy of Dracula's features.

1 INTRODUCTION

At the core of any successful machine learning problem is a good feature representation that highlights salient properties of the data and acts as an effective interface to the statistical model used for inference. This paper focuses on using classical ideas from compression to derive useful feature representations for sequential data such as text. The basic tenets of compression abound in machine learning: the minimum description length principle can be used to justify regularization as well as various model selection criteria (Gabrilovich & Markovitch (2004)), while for unsupervised problems deep autoencoders (Salakhutdinov (2009)) and the classical K-means algorithm both seek a parsimonious description of the data. Meanwhile, off-the-shelf compressors such as LZ-77 (Ziv & Lempel (1977)) have been successfully applied to natural language problems as kernels that compute pairwise document similarities (Bratko et al. (2006)).

We propose a new framework, Dracula, so called because it simultaneously finds a useful data representation and compression using linear-programming approximations of the criterion that motivates dictionary-based compressors like LZ-77 (Ziv & Lempel (1977)). Dracula finds an explicit feature representation for the documents in a corpus by learning a dictionary of n-grams that is used to losslessly compress the corpus; it then recursively compresses the dictionary. This recursion makes Dracula a deep extension of Compressive Feature Learning (CFL) (Paskov et al. (2013)) that can find exponentially smaller representations and promotes similar n-grams to enter the dictionary. As noted in Paskov et al. (2013), feature representations derived from off-the-shelf compressors are inferior because the underlying algorithms are sensitive to document order; both Dracula and CFL are invariant to document order. Our framework is expressed as a binary linear program (BLP) that can be viewed as a linear program (LP) over a sufficiently constrained polyhedron, or relaxed to an LP by dropping the integrality constraints. This is a notable departure from traditional deep learners (Salakhutdinov (2009); Socher & Manning (2013); Le Cun et al. (2006)), which are formulated as non-convex, non-linear optimization problems.
This structure makes it possible to analyze Dracula with well-known techniques from convex analysis (e.g. the KKT conditions), polyhedral combinatorics, and graph theory. For example, we show that Dracula is easily parameterized to control the depth and diversity of its dictionary and that its solutions are well behaved as its parameters vary.

This paper introduces Dracula in section 2 and discusses some of its problem structure and computational properties, including its NP-completeness. Section 3 uses Dracula's polyhedral interpretation to explore the compressed representations it finds as its storage cost model varies. It also discusses how to extract features directly from a compression and how to integrate dictionary structure into the features. Finally, section 4 provides empirical evidence that deep compression finds hierarchical structure in data that is useful for learning and compression, and section 5 concludes.

2 DRACULA

This section introduces Dracula by showing how to extend CFL to a deep architecture that compresses its own dictionary elements. We also show how to interpret any Dracula solution as a directed acyclic graph (DAG) that makes precise the notion of depth and provides useful statistical insights. Finally, we prove that Dracula is NP-complete and discuss linear relaxation schemes.

Notation. Throughout this paper Σ is a fixed finite alphabet and C = {D_1, ..., D_N} is a fixed document corpus with each D_k ∈ C a string D_k = c^k_1 ... c^k_j of characters c^k_i ∈ Σ. An n-gram is any substring of some D_k, and 𝒮 is the set of all n-grams in the document corpus, including the original documents. For any s ∈ 𝒮, a pointer p is a triple p = (s, l ∈ {1, ..., |s|}, z ∈ 𝒮) indicating that z = s_l ... s_{l+|z|−1}; we say that p uses z at location l in s. Let 𝒫 be the set of all valid pointers, and for any P ⊆ 𝒫 let P(s) = {p ∈ P | p = (s, l, z)} select the pointers whose first element is s, so that 𝒫 = ∪_{s∈𝒮} 𝒫(s). Moreover, P uses z ∈ 𝒮 if there is some p ∈ P using z, and P reconstructs s ∈ 𝒮 if every location in s is covered by at least one pointer, i.e. ∪_{(s,l,v)∈P(s)} {l, ..., l+|v|−1} = {1, ..., |s|}. Conceptually, s is recovered from P by iterating through the (s, l, v) ∈ P(s) and pasting a copy of v into location l of a blank string. It will be helpful to define 𝒫_C = ∪_{D_k∈C} 𝒫(D_k), the set of pointers that can only be used to reconstruct the corpus.

2.1 COMPRESSIVE FEATURE LEARNING

CFL represents document corpus C by storing a dictionary S ⊆ 𝒮, a set of n-grams, along with a pointer set P ⊆ 𝒫_C that only uses dictionary n-grams and losslessly reconstructs each of the documents in C. Importantly, CFL stores the dictionary directly in plaintext. The overall representation is chosen to minimize its total storage cost under a storage cost model that specifies d_s, the cost of including n-gram s ∈ 𝒮 in the dictionary, as well as c_p, the cost of including pointer p ∈ 𝒫_C in the pointer set. Selecting an optimal CFL representation may thus be expressed as

    minimize_{S ⊆ 𝒮, P ⊆ 𝒫_C}   ∑_{p ∈ P} c_p + ∑_{s ∈ S} d_s
    subject to   P reconstructs D_k  ∀ D_k ∈ C;   P only uses s ∈ S.      (1)

This optimization problem naturally decomposes into subproblems by observing that when the dictionary is fixed, selecting the optimal pointer set decouples into |C| separate problems of optimally reconstructing each corpus document.
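To make the notation concrete, the following minimal sketch (ours, not from the paper; the names and the 0-based indexing are illustrative) enumerates the n-gram set 𝒮 and the document pointer set 𝒫_C for a toy corpus.

```python
def ngram_set(corpus, max_len=None):
    """All distinct substrings (n-grams) of the corpus documents."""
    S = set()
    for doc in corpus:
        top = len(doc) if max_len is None else min(max_len, len(doc))
        for n in range(1, top + 1):
            for i in range(len(doc) - n + 1):
                S.add(doc[i:i + n])
    return S

def document_pointers(corpus, S):
    """All valid pointers (doc_index, location, z) with z a candidate dictionary string.
    Locations are 0-based here, unlike the 1-based convention used in the text."""
    pointers = []
    for k, doc in enumerate(corpus):
        for z in S:
            start = doc.find(z)
            while start != -1:
                pointers.append((k, start, z))
                start = doc.find(z, start + 1)   # overlapping occurrences included
    return pointers

corpus = ["abab", "abba"]
S = ngram_set(corpus, max_len=3)
P_C = document_pointers(corpus, S)
print(len(S), "n-grams;", len(P_C), "document pointers")
```

Even for this tiny corpus the pointer set is sizeable, which is why the formulations that follow exploit the structure of the resulting (B)LPs rather than enumerating subsets.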
We thus define the reconstruction module for document D_k ∈ C, which takes as input a dictionary S and outputs the minimum cost of reconstructing D_k with pointers that only use strings in S. Note that specific pointers and dictionary strings can be disallowed by setting their respective costs to ∞; for example, setting d_s = ∞ for all s ∈ 𝒮 longer than a certain length limits the size of dictionary n-grams. Of course, in practice, any variables with infinite costs are simply disregarded. The reconstruction module can be expressed as a BLP by associating with every pointer p ∈ 𝒫(D_k) a binary indicator variable w_p ∈ {0, 1}, where w_p = 1 indicates that p is included in the optimal pointer set for D_k. We similarly use binary variables t_s ∈ {0, 1} to indicate that s ∈ 𝒮 is included in the dictionary. Since there is a one-to-one correspondence between pointer sets (dictionaries) and w ∈ {0,1}^{|𝒫(D_k)|} (t ∈ {0,1}^{|𝒮|}), the vectors storing the w_p (t_s), we will refer to these vectors directly as pointer sets (dictionaries). Lossless reconstruction is encoded by the constraint X^{D_k} w ≥ 1, where X^{D_k} ∈ {0,1}^{|D_k| × |𝒫(D_k)|} is a binary matrix indicating the indices of D_k that each pointer can reconstruct; in particular, for every p = (D_k, l, z) ∈ 𝒫(D_k), column X^{D_k}_p is all zeros except for a contiguous sequence of 1's in indices l, ..., l+|z|−1. Control of which pointers may be used (based on the dictionary) is achieved by the constraint w ≤ V^{D_k} t, where V^{D_k} ∈ {0,1}^{|𝒫(D_k)| × |𝒮|} contains a row for every pointer indicating the string it uses; in particular, for every p = (D_k, l, z), V^{D_k}_{p,z} = 1 is the only non-zero entry in the row pertaining to p. The BLP may now be expressed as

    R_{D_k}(t; c) =  minimize_{w ∈ {0,1}^{|𝒫(D_k)|}}   ∑_{p ∈ 𝒫(D_k)} w_p c_p
                     subject to   X^{D_k} w ≥ 1;   w ≤ V^{D_k} t.      (2)

The optimization problem corresponding to an optimal CFL representation may now be written as a BLP by sharing the dictionary variable t among the reconstruction modules for all documents in C:

    minimize_{t ∈ {0,1}^{|𝒮|}}   ∑_{D_k ∈ C} R_{D_k}(t; c) + ∑_{s ∈ 𝒮} t_s d_s.      (3)

2.2 ADDING DEPTH WITH DRACULA

The simplicity of CFL's dictionary storage scheme is a fundamental shortcoming that is demonstrated by the string aa...a consisting of the character a replicated 2^{2n} times. Let the cost of using any pointer be c_p = 1 and the cost of storing any dictionary n-gram be its length, i.e. d_s = |s|. The best CFL can do is to store a single dictionary element of length 2^n and repeat it 2^n times, incurring a total storage cost of 2^{n+1}. In contrast, a deep compression scheme that recursively compresses its own dictionary, by allowing dictionary strings to be represented using pointers, attains exponential space savings relative to CFL. In particular, the deep scheme constructs dictionary strings of length 2, 4, ..., 2^{2n−1} recursively and incurs a total storage cost of 4n.¹ Dracula extends CFL precisely in this hierarchical manner by allowing dictionary strings to be expressed as a combination of characters and pointers to shorter dictionary strings. CFL thus corresponds to a shallow special case of Dracula which only uses characters to reconstruct dictionary n-grams. This depth allows Dracula to leverage similarities among the dictionary strings to obtain further compression of the data. It also establishes a hierarchy among dictionary strings that allows us to interpret Dracula's representations as a directed acyclic graph (DAG) that makes precise the notion of representation depth.
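When the dictionary t is fixed and binary, the reconstruction module R_{D_k}(t; c) reduces to a minimum-cost covering problem that can be solved exactly. Below is a minimal sketch (ours; it assumes the pointer cost depends only on the string used, as in the all-costs-equal settings discussed later) that uses a simple dynamic program over the first uncovered position rather than the LP of equation (2).

```python
from functools import lru_cache

def reconstruction_cost(doc, dictionary, cost):
    """Minimum total pointer cost to cover every position of `doc` using only
    substrings in `dictionary`; cost(z) is the cost of one pointer that uses z.
    Returns float('inf') if no lossless reconstruction exists."""
    n = len(doc)
    # all matches (start, end_exclusive, z) of dictionary strings inside doc
    matches = [(l, l + len(z), z) for z in dictionary
               for l in range(n - len(z) + 1) if doc[l:l + len(z)] == z]

    @lru_cache(maxsize=None)
    def f(i):                      # i = first position not yet covered
        if i >= n:
            return 0.0
        best = float('inf')
        for l, e, z in matches:
            if l <= i < e:         # the chosen pointer must cover position i
                best = min(best, cost(z) + f(e))
        return best

    return f(0)

print(reconstruction_cost("abab", {"ab"}, lambda z: 1.0))        # 2 pointers
print(reconstruction_cost("abc", {"ab", "bc"}, lambda z: 1.0))   # overlapping cover, 2 pointers
```

The second example shows why pointers are allowed to overlap: "abc" cannot be partitioned into dictionary strings, but it is covered by "ab" and "bc".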
Formally, a Dracula compression (compression for brevity) of corpus C is a triple D = (S ⊆ 𝒮, P ⊆ 𝒫_C, P̂ ⊆ 𝒫) consisting of a dictionary S, a pointer set P that reconstructs the documents in C, and a pointer set P̂ that reconstructs every dictionary string in S. As with CFL, any pointers in P may only use strings in S. However, a pointer p ∈ P̂ reconstructing a dictionary string s ∈ S is valid if it uses a unigram (irrespective of whether the unigram is in S) or a proper substring of s that is in S. This is necessary because unigrams take on the special role of characters for dictionary strings: they are the atomic units of any dictionary, so the character set Σ is assumed to be globally known for dictionary reconstruction. In contrast, document pointers are not allowed to use characters and may only use a unigram if it is present in S; this ensures that all strings used to reconstruct the corpus are included in the dictionary for use as features.

Finding an optimal Dracula representation may also be expressed as a BLP through simple modifications of CFL's objective function. In essence, the potential dictionary strings in 𝒮 are treated like documents that only need to be reconstructed if they are used by some pointer. We extend the storage cost model to specify costs c_p for all pointers p ∈ 𝒫_C used for document reconstruction as well as costs ĉ_p for all pointers p ∈ 𝒫 used for dictionary reconstruction. In keeping with the aforementioned restrictions we assume that ĉ_p = ∞ if p = (s, 1, s) illegally tries to use s to reconstruct itself and s is not a unigram. The dictionary cost d_s is now interpreted as the overhead cost of including s ∈ 𝒮 in the dictionary without regard to how it is reconstructed; CFL uses the d_s to also encode the cost of storing s in plaintext (i.e. reconstructing it only with characters). Finally, we introduce dictionary reconstruction modules as analogs of the (document) reconstruction modules for dictionary strings: the reconstruction module for s ∈ 𝒮 takes as input a dictionary and outputs the cheapest valid reconstruction of s if s needs to be reconstructed. This can be written as the BLP

    R̂_s(t; ĉ) =  minimize_{w ∈ {0,1}^{|𝒫(s)|}}   ∑_{p ∈ 𝒫(s)} w_p ĉ_p
                  subject to   X^s w ≥ t_s 1;   w ≤ V̂^s t.      (4)

Here X^s is defined analogously to X^{D_k} in equation (2), and V̂^s is analogous to V^{D_k} in equation (2) except that it does not contain any rows for unigram pointers. With this setup in mind, the optimization problem corresponding to an optimal Dracula representation may be written as the BLP

    minimize_{t ∈ {0,1}^{|𝒮|}}   ∑_{D_k ∈ C} R_{D_k}(t; c) + ∑_{s ∈ 𝒮} [ t_s d_s + R̂_s(t; ĉ) ].      (5)

Finally, any compression can be interpreted graphically as, and is equivalent to, a DAG whose vertices correspond to members of Σ, 𝒮, or C and whose labeled edge set is determined by the pointers: for every (s, l, z) ∈ P or P̂ there is a directed edge from z to s with label l. Note that D defines a multi-graph since there may be multiple edges between nodes. Figure 1 shows the graph corresponding to a simple compression. As this graph encodes all of the information stored by D, and vice versa, we will at times treat D directly as a graph.

¹Note that the recursive model is allowed to use pointers in the dictionary and therefore selects from a larger pointer set than CFL. Care must be taken to ensure that the comparison is fair, since the size of a compression is determined by the storage cost model and we could cheat by setting all dictionary pointer costs to 0. Setting all pointer costs to 1 ensures fairness.
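The following sketch (ours; the compression is a made-up toy, not the one in Figure 1) stores a compression as the labeled multi-DAG just described and computes a longest-path layering of the kind discussed in the next section.

```python
from collections import defaultdict

def compression_dag(P, P_hat):
    """Multi-DAG edges: every pointer (s, l, z) becomes an edge z -> s labeled l."""
    edges = defaultdict(list)              # z -> list of (s, l)
    for (s, l, z) in list(P) + list(P_hat):
        edges[z].append((s, l))
    return edges

def layers(P_hat, dictionary, alphabet):
    """Layer 0 = characters; each dictionary string sits one level above the deepest
    string used to reconstruct it (a longest-path stratification).
    NOTE: unigram dictionary strings would need keys distinct from characters,
    e.g. ('char', 'a'); this toy example simply avoids them."""
    depth = {c: 0 for c in alphabet}
    uses = defaultdict(set)
    for (s, l, z) in P_hat:
        uses[s].add(z)
    remaining = set(dictionary)
    while remaining:                       # peel off strings whose inputs are placed
        ready = {s for s in remaining if all(z in depth for z in uses[s])}
        for s in ready:
            depth[s] = 1 + max((depth[z] for z in uses[s]), default=0)
        remaining -= ready
    return depth

# A small hypothetical compression of the single document "aabaabaabaab".
P_hat = [("aab", 1, "a"), ("aab", 2, "a"), ("aab", 3, "b"),
         ("aabaab", 1, "aab"), ("aabaab", 4, "aab")]
P = [("aabaabaabaab", 1, "aabaab"), ("aabaabaabaab", 7, "aabaab")]
print(dict(compression_dag(P, P_hat)))
print(layers(P_hat, {"aab", "aabaab"}, set("ab")))   # {'a': 0, 'b': 0, 'aab': 1, 'aabaab': 2}
```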
Since D has no cycles, we can organize its vertices into layers akin to those formed by deep neural networks, with connections determined by the pointer set: layer 0 consists only of characters (i.e. there is a node for every character in Σ), layer 1 consists of all dictionary n-grams constructed solely from characters, higher levels pertain to longer dictionary n-grams, and the highest level consists of the document corpus C. While there are multiple ways to organize the intermediate layers, a simple stratification is obtained by placing s ∈ S into layer i only if P̂(s) uses a string in layer i−1 and no strings in layers i+1, i+2, .... We note that our architecture differs from most conventional deep learning architectures, which tend to focus on pairwise layer connections; we allow arbitrary connections to higher layers.

Figure 1: Compression of aabaabaax using a 3-layered dictionary. Layer 0 consists of characters; layers 1 and 2 are dictionary n-grams. There are three kinds of pointers: character to dictionary n-gram (dashed blue lines), dictionary n-gram to (longer) dictionary n-gram (solid blue line), and dictionary n-gram to document (double red lines).

2.3 COMPUTATIONAL HARDNESS AND RELAXATION

The document and dictionary reconstruction modules R_{D_k}/R̂_s are the basic building blocks of Dracula; when the dictionary t is fixed, solving equation (5) is tantamount to solving the reconstruction modules separately. The discussion in Appendix section A.1 shows that for a fixed binary t, solving R_{D_k} or R̂_s is easy because of the structure of the constraint matrices X^{D_k}/X^s; in fact, this problem is equivalent to a min-cost flow problem. Similarly, if the pointer sets are known for each document or dictionary string, then it is easy to find the corresponding dictionary t by checking which strings are used (in linear time relative to the number of pointers). One would hope that the easiness of Dracula's subproblems leads to an easy overall learning problem. However, learning the dictionary and pointer sets simultaneously makes this problem hard: Dracula is NP-complete. In particular, it requires solving a binary LP (a class that is NP-complete in general) and it generalizes CFL, which is itself NP-complete (Paskov et al. (2014)) (see section 3.1.1 for how to restrict representations to be shallow).

We thus turn to solving Dracula approximately via its LP relaxation. This is obtained by replacing all binary constraints in equations (2), (4), (5) with interval constraints [0, 1]. We let Q_C denote this LP's constraint polyhedron and note that it is a subset of the unit hypercube. Importantly, we may also interpret the original problem in equation (5) as an LP over a polyhedron Q whose vertices are always binary and which hence always has binary basic solutions. Here Q² is the convex hull of all (binary) Dracula solutions and Q ⊆ Q_C; all valid Dracula solutions may be obtained from the linear relaxation. In fact, the Chvátal–Gomory theorem (Schrijver (2003)) shows that we may prune Q_C into Q by adding additional constraints. We describe additional constraints in Appendix section A.1.1 that leverage insights from suffix trees to prune Q_C into a tighter approximation Q′_C ⊆ Q_C of Q. Remarkably, when applied to natural language data, these constraints allowed Gurobi (Gurobi Optimization (2015)) to quickly find optimal binary solutions.

²Note that unlike Q_C, this polyhedron is likely to be difficult to describe succinctly unless P = NP.
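For concreteness, the suffix-tree constraints of Appendix A.1.1 group n-grams into equivalence classes: s and sα are equivalent when every occurrence of s is followed by the character α, and the constraint ∑_{s∈ε} t_s ≤ 1 is added for each class ε. The sketch below (ours) computes these classes by brute force on a small string; a suffix tree yields them in linear time.

```python
from collections import defaultdict

def occurrences(corpus, s):
    occ = []
    for k, doc in enumerate(corpus):
        start = doc.find(s)
        while start != -1:
            occ.append((k, start))
            start = doc.find(s, start + 1)
    return occ

def equivalence_classes(corpus, max_len=4):
    """Merge n-grams s and s+ch whenever every occurrence of s is followed by ch."""
    ngrams = {doc[i:i + n] for doc in corpus for n in range(1, max_len + 1)
              for i in range(len(doc) - n + 1)}
    parent = {g: g for g in ngrams}
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for s in ngrams:
        occ = occurrences(corpus, s)
        followers = {corpus[k][i + len(s)] for k, i in occ if i + len(s) < len(corpus[k])}
        ends_a_doc = any(i + len(s) == len(corpus[k]) for k, i in occ)
        if len(followers) == 1 and not ends_a_doc:
            ext = s + next(iter(followers))
            if ext in ngrams:
                parent[find(s)] = find(ext)
    classes = defaultdict(set)
    for g in ngrams:
        classes[find(g)].add(g)
    return list(classes.values())

classes = equivalence_classes(["xaxabxabxacxac"])
print([c for c in classes if len(c) > 1])   # non-trivial classes only
```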
While we did not use these binary solutions in our learning experiments, they warrant further investigation. As the pointer and dictionary costs vary, the resulting problems will vary in difficulty as measured by the gap between the objectives of the LP and binary solutions. When the costs force either t or the w^{D_k}/w^s to be binary, our earlier reasoning shows that the entire solution will lie on a binary vertex of Q_C that is necessarily optimal for the corresponding BLP, and the gap will be 0. This reasoning also shows how to round any continuous solution into a binary one by leveraging the easiness of the individual subproblems: first set all non-zero entries in t to 1, then reconstruct the documents and dictionary using this dictionary to yield binary pointers, and finally find the minimum-cost dictionary based on which strings are used in the pointers.

3 LEARNING WITH COMPRESSED FEATURES

This section explores the feature representations and compressions that can be obtained from Dracula. Central to our discussion is the observation of section 2.3 that all compressions obtained from Dracula are the vertices of a polyhedron. Each of these vertices can be obtained as the optimal compression for an appropriate storage cost model³, so we take a dual perspective in which we vary the storage costs to characterize which vertices exist and how they relate to one another. The first part of this section shows how to walk around the surface of Dracula's polyhedron and highlights some landmark compressions that are encountered, including ones that lead to classical bag-of-n-grams features. Our discussion applies to both the binary and relaxed versions of Dracula, since the former can be viewed as an LP over a polyhedron Q with only binary vertices. The second part of this section shows how to incorporate dictionary structure into features via a dictionary diffusion process.

We derive features from a compression in a bag-of-n-grams (BoN) manner by counting the number of pointers that use each dictionary string or character. It will be useful to explicitly distinguish between strings and characters when computing our representations, and we will use square brackets to denote the character inside a unigram, e.g. [c]. Recall that given a compression D = (S, P, P̂), a unigram pointer in P (used to reconstruct a document) is interpreted as a string, whereas a unigram pointer in P̂ is interpreted as a character. We refer to any z ∈ 𝒮 ∪ Σ as a feature and associate with every document D_k ∈ C or dictionary string s ∈ S a BoN feature vector x^{D_k}, x^s ∈ Z_+^{|𝒮|+|Σ|}, respectively. Entry x^{D_k}_z counts the number of pointers that use z to reconstruct D_k, i.e. x^{D_k}_z = |{p ∈ P(D_k) | p = (D_k, l, z)}|, and necessarily x^{D_k}_z = 0 for all z ∈ Σ. Dictionary strings are treated analogously, with the caveat that if p = (s, l, z) ∈ P̂ uses a unigram, then p counts towards the character entry x^s_{[z]}, not x^s_z.

3.1 DRACULA'S SOLUTION PATH

Exploring Dracula's compressions is tantamount to varying the dictionary and pointer costs supplied to Dracula. When these costs can be expressed as continuous functions of a parameter λ ∈ [0, 1], i.e. for all s ∈ 𝒮, p ∈ 𝒫_C, p̂ ∈ 𝒫 the cost functions d_s(λ), c_p(λ), ĉ_{p̂}(λ) are continuous, the optimal solution sets vary in a predictable manner around the surface of Dracula's constraint polyhedron Q or the polyhedron of its relaxation Q_C. We use F(Q) to denote the set of faces of polyhedron Q (including Q), and take the dimension of a face to be the dimension of its affine hull.
We prove the following theorem in Appendix section A.3:

Theorem 1. Let Q ⊆ R^d be a bounded polyhedron with nonempty interior and b : [0, 1] → R^d a continuous function. Then for some N ∈ Z_+ ∪ {∞} there exists a countable partition Γ = {γ_i}_{i=0}^N of [0, 1] with corresponding faces F_i ∈ F(Q) satisfying F_i ≠ F_{i+1} and F_i ∩ F_{i+1} ≠ ∅. For all α ∈ γ_i, the solution set of the LP constrained by Q and using cost vector b(α) is F_i = argmin_{x ∈ Q} x^T b(α). Moreover, F_i never has the same dimension as F_{i+1}, and the boundary between γ_i and γ_{i+1} is of the form ")[" iff dim F_i < dim F_{i+1} and "](" otherwise.

Figure 2: Part (a) shows a nonlinear projection of a subset of Dracula's constraint polyhedron Q in which every vertex corresponds to a distinct compression of "xaxabxabxacxac". Part (b) is the projection's polar; its faces delineate the (linear) costs for which each vertex in (a) is optimal. The red/purple/blue line in (b) demonstrates a continuous family of costs. All red (blue) costs are uniquely minimized by the vertex in (a) highlighted in red (blue), respectively; (c) shows the corresponding compressions. Purple costs lie on the edge between the faces containing the red and blue lines and are minimized by any convex combination of the vertices highlighted in (a).

This theorem generalizes the notion of a continuous solution path, typically seen in the context of regularization (e.g. the Lasso), to the LP setting, where unique solutions are piecewise constant and transitions occur by passing through values of λ for which the solution set is not unique. For instance, suppose that vertex v_0 is uniquely optimal for some λ_0 ∈ [0, 1), another vertex v_1 is uniquely optimal for some λ_1 with λ_0 < λ_1 ≤ 1, and no other vertices are optimal in (λ_0, λ_1). Then Theorem 1 shows that v_0 and v_1 must be connected by a face (typically an edge) and there must be some λ ∈ (λ_0, λ_1) for which this face is optimal. As such, varying Dracula's cost function continuously ensures that the solution set for the binary or relaxed problem will not suddenly jump from one vertex to the next; it must go through an intermediary connecting face. This behavior is depicted in Figure 2 on a nonlinear projection of Dracula's constraint polyhedron for the string "xaxabxabxacxac". It is worth noting that determining the exact value of λ for which the face connecting v_0 and v_1 is optimal is unrealistic in practice, so transitions may appear abrupt. While it is possible to smooth this behavior by adding a strongly convex term to the objective (e.g. an L2 penalty), the important insight provided by this theorem is that the trajectory of the solution path depends entirely on the combinatorial structure of Q or Q_C. This structure is characterized by the face lattice⁴ of the polyhedron: it shows which vertices are connected via edges, 2-faces, ..., facets. It limits, for example, the set of vertices reachable from v_0 when the costs vary continuously, and it ensures that transitions take place only along edges⁵. This predictable behavior is desirable when fine-tuning the compression for a learning task, akin to how one might tune the regularization parameter of a Lasso, and it cannot be guaranteed in general for non-convex functions.

³The storage costs pertaining to each vertex form a polyhedral cone; see Ziegler (1995) for details.
⁴We leave it as an open problem to analytically characterize Dracula's face lattice.
⁵Restricting transitions only to edges is possible with probability 1 by adding a small amount of Gaussian noise to c.
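The piecewise-constant solution path of Theorem 1 is easy to observe numerically. The sketch below (ours; the unit square stands in for Q, and the cost curve b(λ) is arbitrary) sweeps λ and reports the optimal vertex returned by an off-the-shelf LP solver.

```python
import numpy as np
from scipy.optimize import linprog

def b(lam):
    """A continuous cost curve b : [0, 1] -> R^2."""
    return np.array([0.5 - lam, lam - 0.5])

# Q is the unit square [0, 1]^2; its vertices play the role of compressions.
prev = None
for lam in np.linspace(0.0, 1.0, 11):
    res = linprog(c=b(lam), bounds=[(0, 1), (0, 1)], method="highs")
    vertex = tuple(np.round(res.x, 6))
    if vertex != prev:
        print(f"lambda = {lam:.1f}: optimal point {vertex}")
        prev = vertex
# At lambda = 0.5 the cost is zero, so the entire square (a face, not a vertex)
# is optimal and the reported point is arbitrary -- exactly the kind of
# transitional solution set the theorem describes.
```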
We now provide a simple linear cost scheme that has globally predictable effects on the dictionary. For all s ∈ 𝒮, p ∈ 𝒫_C, p̂ ∈ 𝒫 we set d_s = τ, c_p = 1, ĉ_{p̂} = αλ if p̂ uses a unigram (i.e. a character), and ĉ_{p̂} = λ otherwise. We constrain τ, λ ≥ 0 and α ∈ [0, 1]. In words, all document pointer costs are 1, all dictionary costs are τ, and dictionary pointer costs are λ if they use a string and αλ if they use a character. The effects these parameters have on the compression may be understood by varying a single parameter and holding all others constant:

Varying τ controls the minimum frequency with which s ∈ 𝒮 must be used before it enters the dictionary; if few pointers use s it is cheaper to construct s in place using shorter n-grams. Long n-grams appear less frequently, so τ biases the dictionary towards shorter n-grams.

Varying λ has a similar effect to τ in that it becomes more expensive to construct s as λ increases, so the overall cost of dictionary membership increases. The effect is more nuanced, however, since the manner in which s is constructed also matters; s is more likely to enter the dictionary if it shares long substrings with existing dictionary strings. This suggests a kind of grouping effect whereby groups of strings that share many substrings are likely to enter together.

Varying α controls Dracula's propensity to use characters in place of pointers in the dictionary and thereby directly modulates dictionary depth. When α < 1/K for K = 2, 3, ..., all dictionary n-grams of length at most K are constructed entirely from characters.

3.1.1 LANDMARKS ON DRACULA'S POLYHEDRON

While Dracula's representations are typically deep and space saving, it is important to note that valid Dracula solutions include all of CFL's solutions as well as a set of fully redundant representations that use as many pointers as possible. The BoN features computed from these space-maximizing compressions yield the traditional BoN features containing all n-grams up to a maximum length K. A cost scheme that includes all pointers using all n-grams up to length K is obtained by setting all costs to be negative, except for d_s = ∞ for all s ∈ 𝒮 with |s| > K (to disallow these strings). The optimal compression then includes all pointers with negative cost and each document position is reconstructed K times. Moreover, it is possible to restrict representations to be valid CFL solutions by disallowing all non-unigram pointers for dictionary reconstruction, i.e. by setting ĉ_p = ∞ whenever p uses a string that is not a single character.

3.2 DICTIONARY DIFFUSION

We now discuss how to incorporate dictionary information from a compression D = (S, P, P̂) into the BoN features for each corpus document. It will be convenient to store the BoN feature vectors x^{D_k} for each document as rows of a feature matrix X ∈ Z^{|C| × (|𝒮|+|Σ|)} and the BoN feature vectors x^s for each dictionary string as rows of a feature matrix G ∈ Z^{(|𝒮|+|Σ|) × (|𝒮|+|Σ|)}; we also include rows of all 0's for every character in Σ to make G a square matrix, for mathematical convenience. Graphically, this procedure transforms D into a simpler DAG, D_R, by collapsing all multi-edges into single edges and labeling each resulting edge (z, s) with the count x^s_z. For any two features s, z, we say that s is higher (lower) order than z if it is a successor (predecessor) of z in D.
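A minimal sketch (ours; the feature ordering and the bracketed-character convention are illustrative) of assembling X and G from a compression by counting pointer usage:

```python
import numpy as np

def feature_matrices(corpus, dictionary, alphabet, P, P_hat):
    """Rows of X: documents; rows of G: dictionary strings, plus all-zero rows for
    characters. Columns: dictionary strings followed by bracketed characters."""
    feats = sorted(dictionary) + ["[%s]" % c for c in sorted(alphabet)]
    col = {f: j for j, f in enumerate(feats)}
    X = np.zeros((len(corpus), len(feats)), dtype=int)
    G = np.zeros((len(feats), len(feats)), dtype=int)
    doc_row = {d: i for i, d in enumerate(corpus)}
    for (d, l, z) in P:                    # document pointers count as strings
        X[doc_row[d], col[z]] += 1
    for (s, l, z) in P_hat:                # dictionary unigram pointers count as characters
        key = "[%s]" % z if len(z) == 1 else z
        G[col[s], col[key]] += 1
    return X, G, feats

# Reusing the toy compression of "aabaabaabaab" from the earlier sketch.
corpus = ["aabaabaabaab"]
P = [("aabaabaabaab", 1, "aabaab"), ("aabaabaabaab", 7, "aabaab")]
P_hat = [("aab", 1, "a"), ("aab", 2, "a"), ("aab", 3, "b"),
         ("aabaab", 1, "aab"), ("aabaab", 4, "aab")]
X, G, feats = feature_matrices(corpus, {"aab", "aabaab"}, set("ab"), P, P_hat)
print(feats)
print(X)   # document counts: only the top-level string "aabaab" is used directly
print(G)   # dictionary counts: "aab" -> characters, "aabaab" -> "aab"
```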
Once our feature extraction process throws away positional information in the pointers, higher order features capture more information than their lower order constituents, since the presence of an s ∈ S formed by concatenating features z_1 ... z_m indicates the order in which the z_i appear and not just that they appear. Conversely, since each z_i appears in the same locations as s (and typically many others), we can obtain better estimates for coefficients associated with the z_i than for the coefficient of s. If the learning problem does not require the information specified by s, we pay an unnecessary cost in variance by using this feature over the more frequent z_i. In view of this reasoning, the feature matrix X captures the highest order information about the documents but overlooks the features' lower order n-grams (those indirectly used to reconstruct documents). This latter information is provided by the dictionary's structure in G and can be incorporated by a graph diffusion process that propagates the counts of s in each document to its constituent z_i, which propagate these counts to the lower order features used to construct them, and so on. This process stops once we reach the characters comprising s, since they are atomic. We can express this information flow in terms of G by noting that the product G^T x^{D_k} = ∑_{s ∈ 𝒮∪Σ} x^{D_k}_s x^s spreads x^{D_k}_s to each of the z_i used to reconstruct s by multiplying x^{D_k}_s with x^s_{z_i}, the number of times each z_i is directly used in s. Graphically, node s in D_R sends x^{D_k}_s units of flow to each parent z_i, and this flow is modulated in proportion to x^s_{z_i}, the strength of the edge connecting z_i to s. Performing this procedure a second time, i.e. multiplying G^T(G^T x^{D_k}), further spreads x^{D_k}_s x^s_{z_i} to the features used to reconstruct z_i, modulated in proportion to their usage. Iterating this procedure defines a new feature matrix X̂ = XH, where H = I + ∑_{n=1}^∞ G^n spreads the top-level x^{D_k} to the entire graph⁶.

We can interpret the effect of the dictionary diffusion process in view of two equivalent regularized learning problems that learn coefficients β, η ∈ R^{|𝒮∪Σ|} for every feature in 𝒮 ∪ Σ by solving

    minimize_{β ∈ R^{|𝒮∪Σ|}}  L(X̂β) + λR(β)        minimize_{η ∈ R^{|𝒮∪Σ|}}  L(Xη) + λR((I − G)η).      (6)

We assume that L is a convex loss (that may implicitly encode any labels), R is a convex regularization penalty that attains its minimum at β = 0, and that a minimizer β exists. Note that adding an unpenalized offset does not affect our analysis. The two problems are equivalent because H is defined in terms of a convergent Neumann series and, in particular, H = (I − G)^{-1} is invertible. We may switch from one problem to the other by setting β = H^{-1}η or η = Hβ. When λ = 0 the two problems reduce to estimating β/η for unregularized models that only differ in the features they use, X̂ or X respectively. The equivalence of the problems shows, however, that using X̂ in place of X has no effect on the models, as their predictions are always the same. Indeed, if β is optimal for the first problem then η = Hβ is optimal for the second, and for any z ∈ R^{|𝒮∪Σ|} the predictions z^T η = (z^T H)β are the same. Unregularized linear models, including generalized linear models, are therefore invariant to the dictionary reconstruction scheme and only depend on the document feature counts x^{D_k}, i.e. on how documents are reconstructed. When λ > 0, using X̂ in place of X results in a kind of graph Laplacian regularizer that encourages η_s to be close to η^T x^s.

⁶This sum converges because G corresponds to a finite DAG, so G can be permuted to a strictly lower triangular matrix and lim_{n→∞} G^n = 0. See Appendix section A.2 for weighted variations.
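Continuing the toy matrices from the previous sketch, the diffusion and the λ = 0 invariance claim can be checked numerically (ours; as in footnote 6, G is nilpotent on a finite DAG, so H = (I − G)^{-1} equals the finite Neumann series):

```python
import numpy as np

def diffuse(X, G):
    """X_hat = X H with H = I + G + G^2 + ... = (I - G)^{-1} (G nilpotent on a DAG)."""
    m = G.shape[0]
    H = np.linalg.inv(np.eye(m) - G)
    return X @ H, H

# Features = [aab, aabaab, [a], [b]] from the previous sketch.
X = np.array([[0, 2, 0, 0]], dtype=float)
G = np.array([[0, 0, 2, 1],      # aab    -> 2 x [a], 1 x [b]
              [2, 0, 0, 0],      # aabaab -> 2 x aab
              [0, 0, 0, 0],      # [a] (characters are atomic)
              [0, 0, 0, 0]],     # [b]
             dtype=float)
X_hat, H = diffuse(X, G)
print(X_hat)                     # [[4. 2. 8. 4.]]: counts propagated down to aab, [a], [b]

# lambda = 0 invariance: predictions with (X_hat, beta) match (X, H beta).
rng = np.random.default_rng(0)
beta = rng.normal(size=4)
print(np.allclose(X_hat @ beta, X @ (H @ beta)))   # True
```

Note that the diffused character counts recover the document length (8 a's and 4 b's), which is a quick sanity check on G.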
One interpretation of this effect is that η_s acts as a label for s: we use its feature representation to make a prediction for what η_s should be and penalize the model for any deviations. A complementary line of reasoning uses the collapsed DAG D_R to show that (6) favors lower order features. Associated with every node s ∈ 𝒮 ∪ Σ is a flow η_s, and node z sends η_z units of flow to each of its children s. This flow is attenuated (or amplified) by x^s_z, the strength of the edge connecting z to s. In turn, s adds its incoming flows and sends out η_s units of flow to its children; each document's prediction is given by the sum of its incoming flows. Here R acts as a kind of flow conservation penalty that penalizes nodes for sending out a different amount of flow than they receive, while the lowest order nodes (characters) are penalized for any flow. From this viewpoint it follows that the model prefers to disrupt the flow conservation of lower order nodes whenever doing so sufficiently decreases the loss, since they influence the largest number of documents. Higher order nodes influence fewer documents than their lower order constituents and act as high frequency components.

4 EXPERIMENTS

This section presents experiments comparing traditional BoN features with features derived from Dracula and CFL. Our primary goal is to investigate whether deep compression can provide better features for learning than shallow compression or the traditional fully redundant BoN representation (using all n-grams up to a maximum length). Since any of these representations can be obtained from Dracula using an appropriate cost scheme, positive evidence for deep compression implies that Dracula is uncovering hierarchical structure which is simultaneously useful for compression and learning. We also provide a measure of compressed size that counts the number of pointers used by each representation, i.e. the result of evaluating each compression with a common-sense space objective where all costs are 1. We use "Top" to indicate BoN features counting only document pointers (X in the previous section), "Flat" for dictionary diffusion features (i.e. X̂), "CFL" for BoN features from CFL, and "All" for traditional BoN features using all n-grams considered by Dracula.

Figure 3: Proteins represented using the 4th and 5th singular vectors of Top features from Dracula.

Table 1: Bacteria identification accuracy using protein data.

    SVD rank      5      10     15     20     All    # Pointers
    All          59.5   77.7   83.3   77.6   81.1    4.54 × 10^5
    CFL          89.7   85.0   76.9   74.5   74.0    2.69 × 10^4
    Top          87.5   91.2   89.0   83.3   84.3    1.76 × 10^4

We used Gurobi (Gurobi Optimization (2015)) to solve the refined LP relaxation of Dracula for all of our experiments. While Gurobi can solve impressively large LPs, encoding Dracula for a general-purpose solver is inefficient and limited the scale of our experiments. Dedicated algorithms that utilize problem structure, such as the network flow interpretation of the reconstruction modules, are the subject of a follow-up paper and will allow Dracula to scale to large datasets. We limited our parameter tuning to the dictionary pointer cost λ (discussed in the solution path section) as this had the largest effect on performance.
Experiments were performed with τ = 0, α = 1, a maximum n-gram length, and only on n-grams that appear at least twice in each corpus.

Protein Data. We ran Dracula using 7-grams and λ = 1 on 131 protein sequences that are labeled with the kingdom and phylum of their organism of origin (pro). Bacterial proteins (73) dominate this dataset, 68 of which come evenly from Actinobacteria (A) and Firmicutes (F). The first 5 singular values (SVs) of the Top features show a clear separation from the remaining SVs, and Figure 3 plots the proteins when represented by their 4th and 5th principal components. They are labeled by kingdom and, in more interesting cases, by phylum. Note the clear separation of the kingdoms, the two main bacterial phyla, and the cluster of plants separated from the other eukaryotes. Table 1 shows the average accuracy of two binary classification tasks in which bacteria are positive and we hold out either phylum A or F, along with other randomly sampled phyla for negative cases, as a testing set. We compare All features to Top features from Dracula and CFL using an ℓ2-regularized SVM with C = 1. Since there are many more features than training examples, we plot the effect of using the top K principal components of each feature matrix. Flat features did not help, and performance strictly decreased if we limited the n-gram length for All features, indicating that long n-grams contain essential information. Both compression criteria perform well, but using a deep dictionary seems to help, as Dracula's profile is more stable than CFL's.

Stylometry. We extracted 100 sentences from each of the training and testing splits of the Reuters dataset (Liu) for 10 authors, i.e. 2,000 total sentences, and replaced their words with part-of-speech tags. The goal of this task is to predict the author of a given set of writing samples (that all come from the same author). We make predictions by representing each author by the centroid of her 100 training sentences, averaging together the unknown writing samples, and reporting the nearest author centroid to the sample centroid. We ran Dracula on this representation with 10-grams and normalized centroids by their ℓ1 norm and features by their standard deviation. Table 2 compares the performance of All features to Top features derived from various λ's for various testing sample sizes. We report the average of 1,000 trials, where each trial tested every author once and randomly selected a set of sample sentences from the testing-split sentences. As in the protein data, neither Flat nor shorter n-gram features helped, indicating that higher order features contain vital information. CFL with λ = 20 strictly dominated every other CFL representation and is the only one included for brevity.

Table 2: Author identification accuracy.

    # Samples       5      10     25     50     75     # Pointers
    All            36.0   47.9   67.9   80.6   86.4    5.01 × 10^5
    CFL λ = 20     39.6   50.5   73.8   87.5   91.4    3.33 × 10^4
    Top λ = 1      35.1   46.2   68.6   85.3   93.7    2.39 × 10^4
    Top λ = 10     39.6   51.0   75.0   88.9   93.7    3.00 × 10^4
    Top λ = 20     37.7   49.4   73.8   91.5   97.8    3.32 × 10^4
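The nearest-centroid rule used for the stylometry task is simple to implement. Below is a sketch (ours, with synthetic count matrices standing in for Dracula features; the exact normalization order and the scaled Euclidean metric are our reading of the description above):

```python
import numpy as np

def author_centroids(X_train, authors):
    """One centroid per author: mean of that author's sentence features,
    normalized by its l1 norm."""
    labels = sorted(set(authors))
    mask = np.array(authors)
    C = np.array([X_train[mask == a].mean(axis=0) for a in labels])
    C /= np.abs(C).sum(axis=1, keepdims=True) + 1e-12
    return C, labels

def predict_author(C, labels, X_samples, scale):
    """Average the unknown writing samples and report the nearest centroid;
    `scale` holds per-feature standard deviations (a modeling choice here)."""
    q = X_samples.mean(axis=0)
    q /= np.abs(q).sum() + 1e-12
    d = np.linalg.norm((C - q) / scale, axis=1)
    return labels[int(np.argmin(d))]

# Tiny synthetic stand-in for Top features of POS-tagged sentences.
rng = np.random.default_rng(0)
rates_A = np.linspace(1.0, 5.0, 20)
rates_B = rates_A[::-1]
X_train = np.vstack([rng.poisson(rates_A, (100, 20)),
                     rng.poisson(rates_B, (100, 20))]).astype(float)
authors = ["A"] * 100 + ["B"] * 100
scale = X_train.std(axis=0) + 1e-12
C, labels = author_centroids(X_train, authors)
samples = rng.poisson(rates_B, (10, 20)).astype(float)
print(predict_author(C, labels, samples, scale))   # expected: "B"
```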
Dracula with λ = 10 or λ = 20 shows a clear separation from the other schemes, indicating that the deep compression finds useful structure.

Sentiment Prediction. We use a dataset of 10,662 movie review sentences (Pang & Lee (2005)) labeled as having positive or negative sentiment. Bigrams achieve state-of-the-art accuracy on this dataset and unigrams perform nearly as well (Wang & Manning (2012)), so enough information is stored in low order n-grams that the variance from longer n-grams hurts prediction. We ran Dracula using 5-grams to highlight the utility of Flat features, which focus the classifier onto lower order features. Following Wang & Manning (2012), Table 3 compares the 10-fold CV accuracy of a multinomial naïve Bayes (NB) classifier using Top or Flat features with one using all n-grams up to a maximum length.

Table 3: Sentiment classification accuracy.

    Dracula features                                   All n-grams up to a maximum length
    λ      MNL    # Pointers     Top    Flat           Len.   NB All   SVM ℓ1 All   SVM ℓ2 All
    0.25   4.02   1.79 × 10^5    73.9   78.2           5      77.9     76.6         76.9
    0.5    3.78   1.75 × 10^5    75.1   78.8           4      77.9     76.8         77.0
    1      3.19   1.71 × 10^5    76.6   78.2           3      78.4     77.0         77.2
    2      2.51   1.71 × 10^5    78.0   78.1           2      78.8     77.2         77.5
    5      1.96   1.86 × 10^5    78.0   78.0           1      78.0     76.3         76.5

The dictionary diffusion process successfully highlights relevant low order features and allows the Flat representation to be competitive with bigrams (the expected best performer). The table also reports the mean n-gram length (MNL) used by document pointers as a function of λ. The MNL decreases as λ increases, and this eventually pushes the Top features to behave like a mix of bigrams and unigrams. Finally, we also show the performance of ℓ2- or ℓ1-regularized support vector machines for which we tuned the regularization parameter to minimize CV error (to avoid issues with parameter tuning). It is known that NB performs surprisingly well relative to SVMs on a variety of sentiment prediction tasks, so the dropoff in performance is expected. Both SVMs achieve their best accuracy with bigrams; the regularizers are unable to fully remove the spurious features introduced by using overly long n-grams. In contrast, Flat achieves its best performance with larger MNLs, which suggests that Dracula performs a different kind of feature selection than is possible with direct ℓ1/ℓ2 regularization. Moreover, tuning λ combines feature selection with NB or any other kind of classifier, irrespective of whether it natively performs feature selection.

5 CONCLUSION

We have introduced a novel dictionary-based compression framework for feature selection from sequential data such as text. Dracula extends CFL, which finds a shallow dictionary of n-grams with which to compress a document corpus, by applying the compression recursively to the dictionary. It thereby learns a deep representation of the dictionary n-grams and document corpus. Experiments with biological, stylometric, and natural language data confirm the usefulness of features derived from Dracula, suggesting that deep compression uncovers relevant structure in the data. A variety of extensions are possible, the most immediate of which is the design of an algorithm that takes advantage of the problem structure in Dracula. We have identified the basic subproblems comprising Dracula, as well as essential structure in these subproblems, that can be leveraged to scale the compression to large datasets. Ultimately, we hope to use Dracula to explore large and fundamental datasets, such as the human genome, and to investigate the kinds of structures it uncovers.

ACKNOWLEDGEMENTS

Dedicated to Ivan and Kalinka Handjievi. Funding provided by the Air Force Office of Scientific Research and the National Science Foundation.

REFERENCES

Protein classification benchmark collection. http://hydra.icgeb.trieste.it/benchmark/index.php?page=00.
Bertsimas, Dimitris and Weismantel, Robert. Optimization over Integers. Athena Scientific, 2005. ISBN 978-0-97591-462-5.

Bratko, Andrej, Filipič, Bogdan, Cormack, Gordon V., Lynam, Thomas R., and Zupan, Blaž. Spam filtering using statistical data compression models. JMLR, 7:2673–2698, 2006.

Gabrilovich, Evgeniy and Markovitch, Shaul. Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In ICML, 2004.

Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2015. URL http://www.gurobi.com.

Gusfield, Dan. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997. ISBN 0-521-58519-8.

Le Cun, Yann, Chopra, Sumit, Hadsell, Raia, Ranzato, Marc'Aurelio, and Huang, Fu-Jie. A tutorial on energy-based learning. In Bakir, G., Hofmann, T., Schölkopf, B., Smola, A., and Taskar, B. (eds.), Predicting Structured Data. MIT Press, 2006.

Liu, Zhi. Reuter 50 50 data set.

Lu, Shu and Robinson, Stephen M. Normal fans of polyhedral convex sets. Set-Valued Analysis, 16(2-3):281–305, 2008.

Pang, Bo and Lee, Lillian. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 115–124. Association for Computational Linguistics, 2005.

Paskov, Hristo, West, Robert, Mitchell, John, and Hastie, Trevor. Compressive feature learning. In NIPS, 2013.

Paskov, Hristo, Mitchell, John, and Hastie, Trevor. An efficient algorithm for large scale compressive feature learning. In AISTATS, 2014.

Salakhutdinov, Ruslan. Learning deep generative models, 2009.

Schrijver, A. Combinatorial Optimization: Polyhedra and Efficiency. Springer, 2003.

Socher, Richard and Manning, Christopher D. Deep learning for NLP (without magic). In Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings, pp. 1–3, 2013.

Wang, Sida and Manning, Christopher D. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers, Volume 2, pp. 90–94. Association for Computational Linguistics, 2012.

Ziegler, Günter M. Lectures on Polytopes. Graduate Texts in Mathematics. Springer-Verlag, 1995.

Ziv, Jacob and Lempel, Abraham. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977.

A.1 RECONSTRUCTION MODULES

The reconstruction modules R_{D_k}/R̂_s are the basic building blocks of Dracula; when t is fixed, solving (5) is tantamount to solving the reconstruction modules separately. These simple BLPs have a number of properties that result in computational savings because of the structure of the constraint matrices X^{D_k}/X^s. In order to simplify notation we define

    T_s(t, v; b, V) =  minimize_{w ∈ {0,1}^{|𝒫(s)|}}   ∑_{p ∈ 𝒫(s)} w_p b_p
                       subject to   X^s w ≥ v1;   w ≤ V t.      (7)

Using T_{D_k}(t, 1; c, V^{D_k}) = R_{D_k}(t; c) and T_s(t, t_s; ĉ, V̂^s) = R̂_s(t; ĉ) recovers the document and dictionary reconstruction modules, respectively. Now note that every column in X^s is all zeros except for a contiguous sequence of ones, so X^s is an interval matrix and therefore totally unimodular (TUM). Define T^c_s to be the LP relaxation of T_s obtained by relaxing the integrality constraints:

    T^c_s(t, v; b, V) =  minimize_{w ∈ [0,1]^{|𝒫(s)|}}   ∑_{p ∈ 𝒫(s)} w_p b_p
                         subject to   X^s w ≥ v1;   w ≤ V t.      (8)

Aside from X^s, the remaining constraints on w are bound constraints.
It follows from (Bertsimas & Weismantel (2005)) that T c s is an LP over a integral polyhedron so we may conclude that Proposition 1. If the arguments t, v are integral, then all basic solutions of T c s (t, v; b, V ) are binary. Indeed, a simple dynamic program discussed in (Paskov et al. (2013)) solves Ts efficiently. Our second property reformulates T c s by transforming the constraint matrix Xs into a simpler form. The resulting matrix has at most 2 non-zero entries per column instead of up to |s| non-zero entries per column in Xs. This form is more efficient to work with when solving the LP and it shows that T c s is equivalent to a min-cost flow problem over an appropriately defined graph. Define Q {0, 1}|s| |s| be the full rank lower triangular matrix with entries Qs ii = Qs (i+1)i = 1 and 0 elsewhere (and Qs |s||s| = 1). The interval structure of Xs implies that column i of Zs = Qs Xs is all zeros except for Zs ij = Zs ik = 1 where j is the first row in which Xs ij = 1 and k > j is the first row in which Xs ik = 0 after the sequences of ones (if such a k exists). By introducing non-negative slack variables for the Xsw v1 constraint, i.e. writing Xsw = v1+ξ, and noting that Qs1 = e1, where e1 is all zeros except for a 1 as its first entry, we arrive at: T c s (t, v; b, V ) = minimize w,ξ p P(s) wpbp subject to Zsw Qsξ = ve1, 0 w V t, 0 ξ. The matrix Ψ = [Zs| Qs] has special structure since every column has at most one 1 and at most one 1. This allows us to interpret Ψ as the incidence matrix of a directed graph if we add source and sink nodes with which to fill all columns out so that they have exactly one 1 and one 1. T c s may then be interpreted as a min-cost flow problem. A.1.1 POLYHEDRAL REFINEMENT We now show how to tighten Dracula s LP relaxation by adding additional constraints to QC to shrink it closer to Q. If every time we see a string s it is followed by the character α (in a given corpus), the strings s and sα belong to the same equivalence class; the presence of sα conveys the same information as the presence of s. Importantly, the theory of suffix trees shows that all substrings Published as a conference paper at ICLR 2016 of a document corpus can be grouped into at most 2n 1 equivalence classes (Gusfield (1997)) where n is the word count of the corpus. We always take equivalence classes to be inclusion-wise maximal sets and say that equivalence class ε S appears at a location if any (i.e. all) of its members appear at that location. We prove the following theorem below. This theorem verifies common sense and implies that, when the pointer costs do not favor any particular string in ε, adding the constraint P s ε ts 1 to the LP relaxation to tighten QC will not remove any binary solutions. Theorem 2. Let Ωdenote the set of all equivalence classes in corpus C and suppose that all costs are non-negative and ε Ω, z S, s, x ε, the dictionary costs ds = dx are equal, the pointer costs cz p = cz q (ˆcz p = ˆcz q) are equal when p = (l, s) and q = (l, x), and cs p = cx q ( ˆcs p = ˆcx q) whenever pointers p = q = (l, h) refer to the same location and use the same string (or character) h. Then there is an optimal compression D = (S, P ˆP) in which S contains at most one member of ε. Proof. Suppose that the conditions for theorem 1 hold, let ε be an equivalence class, let D = (S, P, ˆP) be an optimal compression, and suppose for the sake of contradiction that s1, s2 ε are both included in the optimal dictionary. Without loss of generality we assume that |s1| < |s2|. 
Consider first document pointer p which uses s1 for document Dk. By assumption there is another pointer q which uses s2 in the same location and c Dk p = c Dk q so we are indifferent in our choice. We thereby may replace all document pointers that use s1 with equivalent ones that use s2 without changing the objective value. Consider next the usage of s1 to construct higher order dictionary elements. We must be careful here since if some dictionary element s3 is in the optimal dictionary S and can be expressed as s3 = zs1 for some string z then we may not use s2 in place of s1 since it would lead to a different dictionary string. The key step here is to realize that s3 must belong to the same equivalence class as string zs2 and we can use zs2 in place of s3 in all documents. If s3 is itself used to construct higher order dictionary elements, we can apply the same argument for s2 to zs2 in an inductive manner. Eventually, since our text is finite, we will reach the highest order strings in the dictionary, none of whose equivalence class peers construct any other dictionary n-grams. Our earlier argument shows that we can simply take the longest of the highest order n-grams that belong to the same equivalence class. Going back to s3, we note that our assumptions imply that the cost of constructing zs2 is identical to the cost of constructing s3 so we may safely replace s3 with zs2. The only remaining place where s1 may be used now is to construct s2. However, our assumptions imply that the cost of constructing s1 in place when constructing s2 is the same. By eliminating s1 we therefore never can do worse, and we may strictly improve the objective if ts1 > 0 or s1 is used to construct s2 and its pointer cost is non-zero. QED. A.2 WEIGHTED DIFFUSION When G is generated from the relaxation of Dracula and t (0, 1]|S| are the dictionary coefficients, any s S with ts < 1 will have Gsz ts z S. In order to prevent overly attenuating the diffusion we may wish to normalize row s in G by t 1 s for consistency. We note that a variety of other weightings are also possible to different effects. For example, weighting G by a scalar ρ 0 attenuates or enhances the entire diffusion process and mitigates or enhances the effect of features the farther away they are from directly constructing any feature directly used in the documents. A.3 PROOF OF PATH THEOREM The fundamental theorem of linear programming states that for any c Rd, S(c, Q) arg minx Q x T c(α) F(Q) since Q has non-empty interior and is therefore non-empty. We will use a construction known as the normal fan of Q, denoted by N(Q), that partitions Rd into a finite set of polyhedral cones pertaining to (linear) objectives for which each face in F(Q) is the solution set. We begin with some helpful definitions. A partition P 2X of a set X is any collection of sets satisfying S p P p = X and p, q P p = q implies p q = . The relative interior of a convex set X Rd, denoted by relint X, is the interior of X with respect to its affine hull. Formally, relint X = {x X | ε > 0, B(x, ε) aff X X}. The following definition is taken from (Lu & Robinson (2008)): A fan is a finite set of nonempty polyhedral convex cones in Rd, N = {N1, N2, . . . , Nm}, satisfying: Published as a conference paper at ICLR 2016 1. any nonempty face of any cone in N is also in N, 2. any nonempty intersection of any two cones in N is a face of both cones. This definition leads to the following lemma, which is adapted from (Lu & Robinson (2008)): Lemma 1. 
Let N be a fan in Rd and S = S N N N the union of its cones. 1. If two cones N1, N2 N satisfy (relint N1) N2 = then N1 N2, 2. The relative interiors of the cones in N partition S, i.e. S N N relint N = S. Lemma 1 is subtle but important as it contains a key geometric insight that allow us to prove our theorem. Next, let Q Rd be a bounded polyhedron with vertex set V and nonempty interior, i.e. whose affine hull is d-dimensional. For any face F F(Q) define V (F) = F V to be the vertices of F and NF = y Rd | x F, z Q, y T x y T z to be the normal cone to F. That NF is a (pointed) polyhedral cone follows from noting that it can be equivalently expressed as a finite collection of linear constraints involving the vertices of F and Q: NF = y Rd | x V (F), z V, y T x y T z . The normal fan for Q, N(Q) = {NF }F F(Q), is defined to be the set of all normal cones for faces of Q. Noting that Q is bounded and therefore has a recession cone of {0}, the following Lemma is implied by Proposition 1 and Corollary 1 of (Lu & Robinson (2008)): Lemma 2. Let N(Q) be the normal fan of a bounded polyhedron Q with non-empty interior in R. Then 1. N(Q) is a fan, 2. for any nonempty faces F1, F2 F(Q), F1 F2 iff NF1 NF2, F F(Q) relint NF = Rd, 4. every nonempty face F F(Q) satisfies relint NF = y Rd | F = S(y, Q) . We will also makes use of the following two results. The first is implied by Theorem 2.7, Corollary 2.14, and Problem 7.1 in Ziegler (1995): Lemma 3. Let Q Rd be a bounded polyhedron with nonempty interior, F F(Q), and NF the normal cone to F. Then dim F + dim NF = d. The second Lemma states a kind of neighborliness for the cones in N(Q): Lemma 4. Let Q Rd be a bounded polyhedron with nonempty interior. For any N N(Q) and x relint N there exists a δ > 0 such that for any y B(x, δ) there is a N N(Q) with y relint N and N N . Proof. Let N N(Q) and x N be given. We say that N N(Q) occurs within δ (for δ > 0) if there is some y B(x, δ) with y relint N . Now suppose that there is an N N(Q) that occurs within δ for all δ > 0. Since N is a closed convex cone it must be that x N so we may conclude from Lemma 1 that N N . Next, let M be the set of cones in N(Q) which do not contain N and suppose that for all δ > 0 there is some N M that occurs within δ. Since |M| is finite, this is only possible if there is a cone N M that occurs within δ for all δ > 0. However, this leads to a contradiction since N must contain N so the Lemma follows. We are now ready to prove our main theorem which is restated below with S(c, Q) = arg minx Q x T c(α) for simplicity. Theorem 3. Let Q Rd be a bounded polyhedron with nonempty interior and c : [0, 1] Rd a continuous function. Then for some N Z+ { } there exists a countable partition Γ = {γi}N i=0 of [0, 1] with corresponding faces Fi F(Q) satisfying Fi = Fi+1 and Fi Fi+1 = and Fi = S(c(α), Q) α γi. Moreover, Fi never has the same dimension as Fi+1 and the boundary between γi, γi+1 is )[ iff dim Fi < dim Fi+1 and ]( otherwise. Published as a conference paper at ICLR 2016 Proof. For ease of notation let f(x) = S(c(x), Q) and for k = 0, . . . , d define ωk = x [0, 1] | dim Nf(x) k to be the set of all arguments to c whose normal cone is at least kdimensional. Moreover, for any x [0, 1] define σ(x) = {y [0, x] | z [y, x], f(x) = f(z)} {y [x, 1] | z [x, y], f(x) = f(z)} to be the largest contiguous set containing x over which f remains constant and let m(x) = inf σ(x) and M(x) = sup σ(x) be its infinimum and supremem, respectively. The proof follows by induction on k = d, d 1, . . . 
, 0 with the inductive hypothesis that for some Nk Z+ { } there exists a countable partition Γk = γk i Nk i=0 of ωk with corresponding faces F k i F(Q) satisfying F k i = S(c(α), Q) α γk i . Base case (k = d): Let x ωd so that σ(x) ωd. Since Nf(x) is d-dimensional, int Nf(x) = relint Nf(x) so continuity of c implies that σ(x) is a (non-empty) open interval with m(x) < M(x). It follows that Γk = {σ(x) | x ωd} defines a partition of ωd into a set of open intervals. Each interval contains (an infinite number) of rational numbers, and we see that Γk is countable by assigning to each interval a rational number that it contains. Inductive step: Let x ωk\ωk+1. There are two cases to consider. If m(x) < M(x) then (m(x), M(x)) σ(x) contains a rational number. Thus, the set Γk o = {σ(x) | x ωk\ωk+1, m(x) < M(x)} is countable. Otherwise, if m(x) = x = M(x) then by Lemma 4 there is a δ > 0 such that if y B(x, δ) then Nf(x) NS(y,Q). Continuity of c implies that there is a ε > 0 for which c((x ε, x + ε)) B(x, δ) and hence (x ε, x + ε)\{x} ωk+1. Assigning to x any rational number in (x ε, x + ε) and letting Γk c = {σ(x) | x ωk\ωk+1, m(x) = M(x)}, we may appeal to the inductive hypothesis to conclude that Γk c is countable. Finally, Γk = Γk o Γk c Γk+1 is a finite union of countable sets and therefore countable. Since ω0 = [0, 1] we have shown that Γ = Γ0 is a countable partition of [0, 1] into intervals over which f is constant. Now consider two consecutive intervals γi, γi+1 Γ and let M be the supremum of γi. If M / γi then since cone NFi is closed, c(M) NFi. Since c(M) relint NFi+1 by assumption, it follows that NFi+1 is a proper subset of NFi and hence that Fi is a proper subset of Fi+1. Otherwise, if M γi then the continuity of c and Lemma 4 imply that NFi is a proper subset of NFi+1 so Fi+1 is a proper subset of Fi. In either case Fi Fi+1 = and Lemma 3 implies the dimensionality result of our Theorem.