# Categorical Feature Compression via Submodular Optimization

Mohammad Hossein Bateni\*¹, Lin Chen\*¹², Hossein Esfandiari\*¹, Thomas Fu\*¹, Vahab S. Mirrokni\*¹, Afshin Rostamizadeh\*¹

## Abstract

In the era of big data, learning from categorical features with very large vocabularies (e.g., 28 million for the Criteo click prediction dataset) has become a practical challenge for machine learning researchers and practitioners. We design a highly scalable vocabulary compression algorithm that seeks to maximize the mutual information between the compressed categorical feature and the target binary labels, and we furthermore show that its solution is guaranteed to be within a 1 − 1/e ≈ 63% factor of the global optimal solution. To achieve this, we introduce a novel reparametrization of the mutual information objective, which we prove is submodular, and design a data structure to query the submodular function in amortized O(log n) time (where n is the input vocabulary size). Our complete algorithm is shown to operate in O(n log n) time. Additionally, we design a distributed implementation in which the query data structure is decomposed across O(k) machines such that each machine only requires O(n/k) space, while still preserving the approximation guarantee and using only logarithmic rounds of computation. We also provide analysis of simple alternative heuristic compression methods to demonstrate that they cannot achieve any approximation guarantee. Using the large-scale Criteo learning task, we demonstrate better performance in retaining mutual information and also verify competitive learning performance compared to other baseline methods.

\*Equal contribution. ¹Google, New York, NY, USA. ²Department of Electrical Engineering, Yale University, New Haven, CT, USA. Correspondence to: Lin Chen.

*Proceedings of the 36th International Conference on Machine Learning*, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

## 1. Introduction

In modern large-scale machine learning tasks, the presence of categorical features with extremely large vocabularies is a standard occurrence. For example, in tasks such as product recommendation and click-through rate prediction, categorical variables corresponding to inventory id, webpage identifier, or other such high-cardinality values can easily contain anywhere from hundreds of thousands to tens of millions of unique values. The size of machine learning models generally grows at least linearly with the vocabulary size and, thus, the memory required to serve the model, the training and inference cost, as well as the risk of overfitting become an issue with very large vocabularies.

In the particular case of neural network models, one generally uses an embedding layer to consume categorical inputs. The number of parameters in the embedding layer is O(nh), where n is the size of the vocabulary and h is the number of units in the first hidden layer. To give a concrete example, the Criteo click prediction benchmark has about 28 million categorical feature values (Criteo Labs, 2014), thus resulting in an embedding layer with more than 1 billion parameters for a modestly sized first hidden layer. Note that this number dwarfs the number of parameters found in the remainder of the neural network.
Again, to give a concrete example, even assuming a very deep fully connected network of depth 10² with hidden layers of size 10³, we have (10³ · 10³) · 10² = 10⁸ parameters in the hidden network, still an order of magnitude smaller than the embedding layer alone. This motivates the problem of compressing the vocabulary into a smaller size while still retaining as much information as possible.

In this work, we model the compression task by considering the problem of maximizing the mutual information between the compressed version of the categorical features and the target label. We first observe a connection between this problem and the quantization problem for discrete memoryless channels, and note a polynomial-time algorithm for the problem (Kurkoski & Yagi, 2014; Iwata & Ozawa, 2014). The resulting algorithm, however, is based on solving a quadratic-time dynamic program, and is not scalable.

Our main goal in this paper is to develop a scalable and distributed algorithm with a guaranteed approximation factor. We achieve this goal by developing a novel connection to submodular optimization. Although in some settings entropy-based set functions are known to be submodular, this is not the case for the mutual information objective we consider (mutual information with respect to the target labels). Our main insight is in proving the submodularity of a particular transformation of the mutual information objective, which still allows us to provide an approximation guarantee on the quality of the solution with respect to the original objective. We also provide a data structure that allows us to query this newly defined submodular function in amortized logarithmic time. This logarithmic-time implementation of the submodular oracle empowers us to incorporate the fastest known algorithm for submodular maximization (Mirzasoleiman et al., 2015), which leads us to a sequential quasi-linear-time (1 − 1/e − ϵ)-approximation algorithm for binary vocabulary compression.

Next, we provide a distributed implementation for binary vocabulary compression. Previous distributed algorithms for submodular maximization assume direct access to the query oracle on every machine (e.g., see Barbosa et al. (2015); Mirrokni & Zadimoghaddam (2015); Mirzasoleiman et al. (2013)). However, the query oracle itself requires O(n) space, which may be restrictive in the large-scale setting. In this work, we provide a truly distributed implementation of the submodular maximization algorithm of Badanidiyuru & Vondrák (2014) (or similarly Kumar et al. (2015)) for our application by distributing the query oracle. In this distributed implementation we manage to decompose the query oracle across O(k) machines such that each machine only requires O(n/k) space to store the partial query oracle. As a result, we successfully provide a distributed (1 − 1/e − ϵ)-approximation algorithm for vocabulary compression in logarithmic rounds of computation.

Our structural result on the submodularity of this new set function is the main technical contribution of this paper, and it can also be of independent interest in other settings that seek to maximize mutual information. We also study a number of heuristic and baseline algorithms for the problem of maximizing mutual information, and show that they do not achieve a guaranteed approximation for the problem.
Furthermore, we study the empirical performance of our algorithms on two fronts. First, we show the effectiveness of our greedy scalable approximation algorithm for maximizing mutual information. Our study confirms that this algorithm not only achieves a theoretical guarantee, but also beats the heuristic algorithms for maximizing mutual information. Finally, we examine the performance of this algorithm on the vocabulary compression problem itself, and confirm the effectiveness of the algorithm in producing a high-quality solution for vocabulary compression in large-scale learning tasks.

**Organization.** In the remainder of this section we review related previous work and introduce the problem formally along with appropriate notation. Then in Section 2, we introduce the novel compression algorithm and corresponding theoretical guarantees, as well as analysis of some basic heuristic baselines. In Section 3, we present our empirical evaluation of optimizing the mutual information objective as well as an end-to-end learning task.

### 1.1. Related Work

**Feature Clustering:** The use of vocabulary compression has been studied previously, especially in text classification applications where it is commonly known as feature (or word) clustering. In particular, Baker & McCallum (1998) and Slonim & Tishby (2001) both propose agglomerative clustering algorithms, which start with singleton clusters that are iteratively merged using a Jensen-Shannon divergence based function to measure similarity between clusters, until the desired number of clusters is found. Both algorithms are greedy in nature and do not provide any guarantee with respect to a global objective. In Dhillon et al. (2003), the authors introduce an algorithm that empirically performs better than the aforementioned methods and that also seeks to optimize the same global mutual information objective that is analyzed in this work. Their proposed iterative algorithm is guaranteed to improve the objective at each iteration and arrive at a local optimum; however, no guarantee with respect to the global optimum is provided. Furthermore, each iteration of the algorithm requires O(mn) time (where m is the size of the compressed vocabulary) and the number of iterations is only guaranteed to be finite (but potentially exponential). Later in this work, we compare the empirical performance of this algorithm with our proposed method.

**Compression in Discrete Memoryless Channels:** An area from information theory that is closely related to our vocabulary compression problem, and from which our algorithm draws inspiration, is compression in discrete memoryless channels (DMC) (Cicalese et al., 2018; Zhang & Kurkoski, 2016; Iwata & Ozawa, 2014). In this problem, we assume there is a DMC which (in machine learning terminology) receives a class label and produces a corresponding categorical feature value drawn according to a fixed underlying distribution. The goal is to design a quantizer that maps the space of categorical feature values into a lower-cardinality set, while preserving as much of the mutual information between the class label and the newly constructed vocabulary as possible.

| DMC quantization | Vocabulary compression |
| --- | --- |
| DMC input | Label |
| DMC output | Feature values |
| Quantizer (to be designed) | Compression function (to be designed) |
| Quantized output | Compressed feature values |

*Figure 1. Translation of terminologies of the DMC quantizer design problem and the feature compression problem.*
In Figure 1, we present a diagram that illustrates the DMC quantization problem and the vocabulary compression problem, as well as the translation of terminologies between these two problems. The results of Kurkoski & Yagi (2014) are of particular interest, as they show that a cubic-time dynamic programming based algorithm is able to provide an optimal solution in the case of binary labels. Following this work, Iwata & Ozawa (2014) improve the computational complexity of this approach to quadratic time using the SMAWK algorithm (Aggarwal et al., 1987). Such algorithms are useful in the smaller-scale regime; however, the use of a cubic- or even quadratic-time algorithm is infeasible for our massive vocabulary size use cases. Finally, Mumey & Gedeon (2003) show that in the general case of more than two class labels, finding the optimal compression is NP-complete. In this work, we focus on the binary label setting.

**Feature Selection:** A related method for dealing with very large vocabularies is to do feature selection, in which we simply select a subset of the vocabulary values and remove the rest (see Guyon & Elisseeff (2003) and the many references therein). One can view this approach as a special case of vocabulary compression, where we are restricted to only singleton clusters. Restricting the problem by selecting a subset of the vocabulary may have some benefits, such as potentially simplifying the optimization problem and the use of a simple filter to transform data at inference time. However, the obvious downside to this restriction is the loss of information and potentially poorer learning performance (see the introduction of Jiang et al. (2011)). In this work we focus on the more general vocabulary compression setting.

**Other Feature Extraction Approaches:** Clustering features in order to compress a vocabulary is only one approach to lower-dimensional feature extraction. There are of course many classical approaches to feature extraction (see Chapter 15 of Mohri et al. (2018)), such as learning linear projections (e.g., Principal Component Analysis, Linear Discriminant Analysis) or non-linear transformations (e.g., Locally Linear Embeddings, ISOMAP, Laplacian Eigenmaps). However, these classical methods generally incur more than quasi-linear computational cost, for both learning and applying the transformation, and are not feasible for our setting.

### 1.2. Notation

In the vocabulary compression problem we are given a correlated pair of random variables X (a categorical feature) and C (a label), where X ∈ {1, 2, . . . , n} and C ∈ {0, 1}. We aim to define a random variable Z ∈ {1, 2, . . . , m} as a function of X that maximizes the mutual information with the label C, i.e., I(Z; C), where for general random variables A and B taking values in $\mathcal{A}$ and $\mathcal{B}$, respectively,

$$
I(A;B) = \sum_{a \in \mathcal{A}} \sum_{b \in \mathcal{B}} \Pr[A=a, B=b] \, \log \frac{\Pr[A=a, B=b]}{\Pr[A=a]\,\Pr[B=b]} .
$$

Note that Z is a function of X and hence we have I(Z; C) ≤ I(X; C). If we let m ≥ n, then Z = X maximizes the mutual information I(Z; C). We are therefore interested in the nontrivial case of m < n. Intuitively, we are compressing the vocabulary of feature X from size n to a smaller size m, while preserving the maximum amount of information about C.

## 2. Algorithm and Analysis

In this section, we first show how to transform the original objective into a set function and then prove that this set function is in fact submodular. Next, we describe the components of a quasi-linear and parallelizable algorithm to optimize the objective. Finally, we consider a few simple intuitive baselines and show that they may create features that fail to capture any mutual information with the label.
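To make the objective concrete, the following minimal Python sketch computes I(A; B) from a joint distribution table and checks that a compression map Z = g(X) can only lose information, i.e., I(Z; C) ≤ I(X; C). The toy distribution, the map `g`, and all helper names are illustrative choices of ours, not part of the paper.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(A;B) for a joint distribution given as {(a, b): Pr[A=a, B=b]}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def compress(joint_xc, g):
    """Joint distribution of (Z, C) where Z = g(X)."""
    joint_zc = defaultdict(float)
    for (x, c), p in joint_xc.items():
        joint_zc[(g(x), c)] += p
    return joint_zc

# Toy example: n = 4 feature values, binary label C.
joint_xc = {(1, 0): 0.05, (1, 1): 0.20,
            (2, 0): 0.10, (2, 1): 0.15,
            (3, 0): 0.15, (3, 1): 0.10,
            (4, 0): 0.20, (4, 1): 0.05}
g = lambda x: 1 if x <= 2 else 2                   # compress vocabulary from n = 4 to m = 2
print(mutual_information(joint_xc))                # I(X; C)
print(mutual_information(compress(joint_xc, g)))   # I(Z; C) <= I(X; C)
```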
### 2.1. Objective Transformation

Without loss of generality, assume Pr[C = 0 | X = i] for i ∈ {1, . . . , n} is sorted in increasing order. Once the feature values are sorted in this order, Lemma 3 of Kurkoski & Yagi (2014) crucially shows that in the optimum solution each value of Z corresponds to a consecutive subsequence of {1, . . . , n}; this is a significant insight that we take from the quantization-for-DMC literature. Thus, we will cluster consecutive feature values into m clusters, with each cluster corresponding to a value in the compressed vocabulary of Z.

Formally, define a function $F\colon 2^{\{1,\dots,n-1\}} \to \mathbb{R}$ as follows. Let S = {s₁, . . . , s_{m−1}}, and assume s₁ < s₂ < · · · < s_{m−1}. For simplicity, and without any loss in quality, we set s₀ = 0 and s_m = n. Let Z be a random variable constructed from X that has value i if and only if s_{i−1} < X ≤ s_i. We define F(S) = I(Z; C). Notice that we have

$$
\max_{S \subseteq \{1,\dots,n-1\}\,:\,|S| = m-1} F(S) \;=\; \max_{Z} I(Z;C) ,
$$

where Z ranges over functions of X with vocabulary size m.

The nonnegativity of mutual information implies that the function F(S) = I(Z; C) is always non-negative (Cover & Thomas, 2006, p. 28). Monotonicity is equivalent to I(Z₁; C) ≤ I(Z₂; C) for any S₁ ⊆ S₂ ⊆ {1, . . . , n − 1}, where Z₁ and Z₂ are the random variables constructed from S₁ and S₂, respectively. Since S₂ represents a subdivision of S₁, Z₁ is a function of Z₂. By the data-processing inequality, we have I(Z₁; C) ≤ I(Z₂; C) (Cover & Thomas, 2006, p. 34). In the following section, we show that the function F(S) is in fact submodular.

### 2.2. Submodularity of F(S)

For a set S ⊆ {1, . . . , n − 1} and a number s ∈ {1, . . . , n − 1} \ S we define ∆_s F(S) = F(S ∪ {s}) − F(S). Let s⁻ be the item right before s when we sort S ∪ {s}. Note that the items that are mapped to s⁻ by F(S) are either mapped to s⁻ or to s by F(S ∪ {s}). We first state the following useful technical lemma (the proofs of all lemmas can be found in the supplement).

**Lemma 1.** Define the quantities p = Pr[Z = s⁻], q = Pr[Z = s], α = Pr[C = 0 | Z = s⁻] and β = Pr[C = 0 | Z = s], where Z is the random variable constructed from S ∪ {s}. Then the following equality holds:

$$
\Delta_s F(S) = p f(\alpha) + q f(\beta) - (p+q)\, f\!\left(\frac{p\alpha + q\beta}{p+q}\right), \tag{2}
$$

where f(·) is the following convex function over (0, 1):

$$
f(t) = t \log \frac{t}{\Pr[C=0]} + (1-t) \log \frac{1-t}{\Pr[C=1]} . \tag{3}
$$

Next, we provide several inequalities that precisely analyze expressions of the same form as (2) with various values of α, β, p and q.

**Lemma 2.** Pick α ≤ β ≤ γ ∈ ℝ and p ∈ [0, 1]. Let q = 1 − p and let f be an arbitrary convex function. We have

$$
p f(\alpha) + q f(\beta) - f(p\alpha + q\beta) \;\le\; p f(\alpha) + q f(\gamma) - f(p\alpha + q\gamma).
$$

Replacing p and q in Lemma 2 with p/(p+q) and q/(p+q) and multiplying both sides by p + q implies the following corollary.

**Corollary 3.** Pick α ≤ β ≤ γ ∈ ℝ and p, q ∈ ℝ₊. Let f be an arbitrary convex function. We have

$$
p f(\alpha) + q f(\beta) - (p+q)\, f\!\left(\frac{p\alpha + q\beta}{p+q}\right) \;\le\; p f(\alpha) + q f(\gamma) - (p+q)\, f\!\left(\frac{p\alpha + q\gamma}{p+q}\right).
$$

Similarly, we have the following corollary (simply by looking at f(−x) instead of f(x)).

**Corollary 4.** Pick γ ≤ α ≤ β ∈ ℝ and p, q ∈ ℝ₊. Let f be an arbitrary convex function. We have

$$
p f(\alpha) + q f(\beta) - (p+q)\, f\!\left(\frac{p\alpha + q\beta}{p+q}\right) \;\le\; p f(\gamma) + q f(\beta) - (p+q)\, f\!\left(\frac{p\gamma + q\beta}{p+q}\right).
$$

We require one final lemma before proceeding to the main theorem.

**Lemma 5.** Pick α, β ∈ ℝ and p, q, q′ ∈ (0, 1] such that q < q′. Let f be an arbitrary convex function. We have

$$
p f(\alpha) + q f(\beta) - (p+q)\, f\!\left(\frac{p\alpha + q\beta}{p+q}\right) \;\le\; p f(\alpha) + q' f(\beta) - (p+q')\, f\!\left(\frac{p\alpha + q'\beta}{p+q'}\right).
$$
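As a sanity check on Lemma 1, the following sketch evaluates F(S) directly as I(Z; C), computes the marginal gain from the closed form of equation (2) with the convex function of equation (3), and confirms numerically that the two agree and that the gain only shrinks as S grows (the diminishing-returns property established in Theorem 6 below). The toy probabilities and function names are ours.

```python
import math

# Toy sorted instance: pr_x[i] = Pr[X = i+1], cond[i] = Pr[C = 0 | X = i+1],
# with cond sorted in increasing order as required in Section 2.1.
pr_x = [0.1, 0.2, 0.3, 0.15, 0.25]
cond = [0.1, 0.3, 0.45, 0.6, 0.9]
pc0 = sum(p * c for p, c in zip(pr_x, cond))      # Pr[C = 0]

def f(t):
    """The convex function of equation (3)."""
    out = 0.0
    if t > 0:
        out += t * math.log(t / pc0)
    if t < 1:
        out += (1 - t) * math.log((1 - t) / (1 - pc0))
    return out

def F(S):
    """F(S) = I(Z; C) for the clustering with boundary set S (Section 2.1)."""
    bounds = [0] + sorted(S) + [len(pr_x)]
    total = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        p = sum(pr_x[lo:hi])                                   # Pr[Z = this cluster]
        a = sum(pr_x[i] * cond[i] for i in range(lo, hi)) / p  # Pr[C = 0 | Z = this cluster]
        total += p * f(a)
    return total

def gain(S, s):
    """Marginal gain Delta_s F(S) via the closed form of equation (2)."""
    bounds = [0] + sorted(S) + [len(pr_x)]
    lo = max(b for b in bounds if b < s)   # boundary right before s
    hi = min(b for b in bounds if b > s)   # boundary right after s
    def stats(a, b):
        p = sum(pr_x[a:b])
        return p, sum(pr_x[i] * cond[i] for i in range(a, b)) / p
    p, alpha = stats(lo, s)
    q, beta = stats(s, hi)
    return p * f(alpha) + q * f(beta) - (p + q) * f((p * alpha + q * beta) / (p + q))

S = {2}
print(abs(gain(S, 4) - (F(S | {4}) - F(S))) < 1e-12)  # equation (2) holds
print(gain(set(), 4) >= gain({2}, 4))                 # diminishing returns
```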
*Figure 2. Illustration of boundaries used in the proof of Theorem 6.*

**Theorem 6 (Submodularity).** For any pair of sets S₁ ⊆ S₂ ⊆ {1, . . . , n − 1} and any s ∈ {1, . . . , n − 1} \ S₂ we have ∆_s F(S₁) ≥ ∆_s F(S₂).

*Proof.* Let s₁⁻ and s₁⁺ be the items right before and right after s when we sort S₁ ∪ {s}. Also, let Z₁ and Z₁′ be the random variables corresponding to F(S₁ ∪ {s}) and F(S₁), respectively. Similarly, let s₂⁻ and s₂⁺ be the items right before and right after s when we sort S₂ ∪ {s}, and let Z₂ and Z₂′ be the random variables corresponding to F(S₂ ∪ {s}) and F(S₂), respectively.

Let us set p₁ = Pr[Z₁ = s₁⁻], q₁ = Pr[Z₁ = s], α₁ = Pr[C = 0 | Z₁ = s₁⁻] and β₁ = Pr[C = 0 | Z₁ = s]. Similarly, let us set p₂ = Pr[Z₂ = s₂⁻], q₂ = Pr[Z₂ = s], α₂ = Pr[C = 0 | Z₂ = s₂⁻] and β₂ = Pr[C = 0 | Z₂ = s]. Note that since S₁ ⊆ S₂, the boundaries s₁⁻ and s₁⁺ are also boundaries for S₂ (possibly the sentinels 0 or n), and hence s₂⁻ ≥ s₁⁻ and s₂⁺ ≤ s₁⁺ (see Figure 2). Therefore, we have the following set of inequalities:

$$
p_2 = \Pr[Z_2 = s_2^-] \le \Pr[Z_1 = s_1^-] = p_1 , \tag{4}
$$

$$
q_2 = \Pr[Z_2 = s] \le \Pr[Z_1 = s] = q_1 . \tag{5}
$$

Since in the definition of F(·) the elements are ordered by Pr[C = 0 | X = x], we have the following set of inequalities:

$$
\alpha_1 = \Pr[C=0 \mid Z_1 = s_1^-] \le \Pr[C=0 \mid Z_1 = s] = \beta_1 , \tag{6}
$$

$$
\alpha_2 = \Pr[C=0 \mid Z_2 = s_2^-] \le \Pr[C=0 \mid Z_2 = s] = \beta_2 , \tag{7}
$$

$$
\alpha_1 = \Pr[C=0 \mid Z_1 = s_1^-] \le \Pr[C=0 \mid Z_2 = s_2^-] = \alpha_2 , \tag{8}
$$

$$
\beta_2 = \Pr[C=0 \mid Z_2 = s] \le \Pr[C=0 \mid Z_1 = s] = \beta_1 . \tag{9}
$$

Finally, we have

$$
\begin{aligned}
\Delta_s F(S_2) &\overset{(a)}{=} p_2 f(\alpha_2) + q_2 f(\beta_2) - (p_2+q_2)\, f\!\left(\tfrac{p_2\alpha_2 + q_2\beta_2}{p_2+q_2}\right) \\
&\overset{(b)}{\le} p_2 f(\alpha_2) + q_2 f(\beta_1) - (p_2+q_2)\, f\!\left(\tfrac{p_2\alpha_2 + q_2\beta_1}{p_2+q_2}\right) \\
&\overset{(c)}{\le} p_2 f(\alpha_1) + q_2 f(\beta_1) - (p_2+q_2)\, f\!\left(\tfrac{p_2\alpha_1 + q_2\beta_1}{p_2+q_2}\right) \\
&\overset{(d)}{\le} p_1 f(\alpha_1) + q_2 f(\beta_1) - (p_1+q_2)\, f\!\left(\tfrac{p_1\alpha_1 + q_2\beta_1}{p_1+q_2}\right) \\
&\overset{(e)}{\le} p_1 f(\alpha_1) + q_1 f(\beta_1) - (p_1+q_1)\, f\!\left(\tfrac{p_1\alpha_1 + q_1\beta_1}{p_1+q_1}\right) \\
&\overset{(f)}{=} \Delta_s F(S_1) ,
\end{aligned}
$$

where (a) and (f) follow from equality (2), (b) follows from Corollary 3 and inequalities (9) and (7), (c) follows from Corollary 4 and inequalities (8) and (6), (d) follows from Lemma 5 and inequality (4), and (e) follows from Lemma 5 and inequality (5). This completes the proof. ∎

### 2.3. Submodular Optimization Algorithms

Given that we have shown F(·) is submodular, we now present two approaches to optimization: a single-machine algorithm that runs in O(n log n) time, as well as an algorithm that allows the input to be processed in a distributed fashion, at the cost of an additional logarithmic factor in running time.

**Single Machine Algorithm:** We will make use of a (1 − 1/e − ϵ)-approximation algorithm for submodular maximization that makes only O(n) queries to ∆_s F(S) (Mirzasoleiman et al., 2015). First, fix an arbitrary small constant ϵ (this appears as a loss in the approximation factor as well as in the running time). The algorithm starts with an empty solution set and then proceeds in m iterations where, in each iteration, we sample a set of (n/m) log(1/ϵ) elements uniformly at random from the elements that have not been added to the solution so far, and then add the sampled element with maximum marginal increase to the solution.

In general, we may expect that computing ∆_s F(S) requires at least Ω(|S|) time, which might be as large as m. However, we note that the algorithm of Mirzasoleiman et al. (2015) (similar to most other algorithms for submodular maximization) only queries ∆_s F(S) for incrementally growing subsets of the final solution S. In that case, we can compute each incremental value of ∆_s F(S) in logarithmic time using a data structure that costs O(n log n) time to construct (see Algorithm 1). By using this query oracle, we do not need to communicate the whole set S for every query. Moreover, we use a red-black tree to maintain S, and hence we can search for the neighbors s⁻ and s⁺ in logarithmic time.
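The following sketch puts the pieces of this paragraph together: prefix sums give O(1) access to the mass and conditional probability of any interval, a sorted Python list stands in for the red-black tree (`bisect` gives logarithmic neighbor search, though list insertion itself is linear; a balanced BST or a `SortedList` would match the stated bounds), and the sampled-greedy loop mirrors the algorithm of Mirzasoleiman et al. (2015). It is a simplified illustration under those assumptions, not the paper's exact implementation.

```python
import bisect, math, random

def compress_vocabulary(pr_x, cond, m, eps=0.1, seed=0):
    """Greedily pick m-1 boundaries to (approximately) maximize I(Z; C).

    pr_x[i] = Pr[X = i+1], cond[i] = Pr[C = 0 | X = i+1], with cond sorted increasingly.
    """
    rng = random.Random(seed)
    n = len(pr_x)
    # Prefix sums: mass[i] = Pr[X <= i], mass0[i] = Pr[X <= i, C = 0].
    mass, mass0 = [0.0] * (n + 1), [0.0] * (n + 1)
    for i in range(n):
        mass[i + 1] = mass[i] + pr_x[i]
        mass0[i + 1] = mass0[i] + pr_x[i] * cond[i]
    pc0 = mass0[n]

    def f(t):  # convex function of equation (3)
        out = 0.0
        if t > 0:
            out += t * math.log(t / pc0)
        if t < 1:
            out += (1 - t) * math.log((1 - t) / (1 - pc0))
        return out

    def interval(a, b):  # Pr[a < X <= b] and Pr[C = 0 | a < X <= b] in O(1)
        p = mass[b] - mass[a]
        return p, (mass0[b] - mass0[a]) / p

    boundaries = [0, n]           # sorted; the interior elements form the current set S
    def gain(s):                  # Delta_s F(S) via equation (2)
        j = bisect.bisect_left(boundaries, s)
        lo, hi = boundaries[j - 1], boundaries[j]
        p, alpha = interval(lo, s)
        q, beta = interval(s, hi)
        return p * f(alpha) + q * f(beta) - (p + q) * f((p * alpha + q * beta) / (p + q))

    chosen = set()
    sample_size = max(1, int(n / m * math.log(1 / eps)) + 1)
    for _ in range(m - 1):        # stochastic ("lazier than lazy") greedy
        candidates = [s for s in range(1, n) if s not in chosen]
        batch = rng.sample(candidates, min(sample_size, len(candidates)))
        best = max(batch, key=gain)
        chosen.add(best)
        bisect.insort(boundaries, best)
    return sorted(chosen)

# Toy usage: compress a 6-value vocabulary down to m = 3 clusters.
print(compress_vocabulary([0.1, 0.2, 0.25, 0.15, 0.2, 0.1],
                          [0.05, 0.2, 0.4, 0.55, 0.7, 0.95], m=3))
```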
Thus, combining the submodular maximization algorithm that requires only a linear number of queries with the logarithmic-time query oracle implies the following theorem.

**Theorem 7.** For any arbitrarily small ϵ > 0, there exists a (1 − 1/e − ϵ)-approximation algorithm for vocabulary compression that runs in O(n log n) time.

**Algorithm 1** Data structure to compute ∆_s F(S)

*Procedure:* Initialization. *Input:* sorted list of probabilities Pr[C = 0 | X = xᵢ] and probabilities Pr[X = xᵢ].

1. Initialize a red-black tree data structure S.
2. Insert 0 and n into S.
3. $p_{<0} \leftarrow 0$
4. $p_{C=0,<0} \leftarrow 0$
5. For i = 1 to n:
6. &nbsp;&nbsp;&nbsp;&nbsp; $p_{<i} \leftarrow p_{<i-1} + \Pr[X = x_i]$
7. &nbsp;&nbsp;&nbsp;&nbsp; $p_{C=0,<i} \leftarrow p_{C=0,<i-1} + \Pr[X = x_i]\cdot\Pr[C=0 \mid X = x_i]$

**Distributed Algorithm:** Again, fix an arbitrarily small number ϵ > 0 (for simplicity, assume ϵk and n/(ϵk) are integers). This distributed implementation uses ϵk machines, requires O(n/(ϵk)) space per machine, and runs in a logarithmic number of rounds of computation. To define our distributed algorithm we start with the (non-distributed) submodular maximization algorithm of Badanidiyuru & Vondrák (2014), which provides a (1 − 1/e − ϵ)-approximate solution using O(n log n) queries to the submodular function oracle. The algorithm works by defining a decreasing sequence of thresholds on the marginal gain and, for each threshold, adding to the solution elements whose marginal gain exceeds the current threshold, until the cardinality budget is met. In our implementation the query oracle itself is decomposed across the machines, with each machine storing only an O(n/k)-size portion of it, which yields the following guarantee.

**Theorem 8.** For any arbitrarily small ϵ > 0, there exists a (1 − 1/e − ϵ)-approximation, O(log n)-round distributed algorithm for vocabulary compression with O(n/k) space per machine and O(n log² n) total work.

### 2.4. Heuristic Algorithms

In this subsection we review a couple of heuristics that can serve as simple alternatives to the algorithm we suggest and show that they can, in fact, fail entirely for some inputs. We also provide an empirical comparison to these, as well as to the algorithm of Dhillon et al. (2003), in Section 3.

**Bucketing Algorithm:** This algorithm splits the range of probabilities [0, 1] into k equal-size intervals [0, 1/k), [1/k, 2/k), . . . , [(k−1)/k, 1]. Then it uses these intervals (or buckets) to form the compressed vocabulary. Specifically, the j-th bucket represents all elements i such that Pr[C = 0 | X = i] ∈ [(j−1)/k, j/k). Note that there exists a set S_b such that F(S_b) corresponds to the mutual information between the outcome of the bucketing algorithm and the labels. First we show that it is possible to give an upper bound on the mutual information loss, i.e., I(X; C) − F(S_b).

**Theorem 9.** Let Z be the random variable provided by the bucketing algorithm. The total mutual information loss of the bucketing algorithm is bounded as follows: I(X; C) − I(Z; C) ≤ ∆_max, where

$$
\Delta_{\max} = \max_{j} \left( \max_{r \in [(j-1)/k,\, j/k)} f(r) \;-\; \min_{r \in [(j-1)/k,\, j/k)} f(r) \right)
$$

and f(·) is defined in equation (3).

*Proof.* Note that, as we showed in Subsection 2.2, we have

$$
I(Z;C) = \sum_{z \in \mathcal{Z}} \Pr[Z=z]\, f(\Pr[C=0 \mid Z=z]) = \sum_{z \in \mathcal{Z}} \Pr[Z=z]\, f\big( \mathbb{E}_{x \in z}\, \Pr[C=0 \mid X=x] \big) . \tag{10}
$$

On the other hand, we have

$$
\begin{aligned}
I(X;C) &= \sum_{x \in \mathcal{X}} \Pr[X=x]\, f(\Pr[C=0 \mid X=x]) = \sum_{z \in \mathcal{Z}} \sum_{x \in z} \Pr[X=x]\, f(\Pr[C=0 \mid X=x]) \\
&= \sum_{z \in \mathcal{Z}} \Pr[Z=z] \sum_{x \in z} \frac{\Pr[X=x]}{\Pr[Z=z]}\, f(\Pr[C=0 \mid X=x]) = \sum_{z \in \mathcal{Z}} \Pr[Z=z]\; \mathbb{E}_{x \in z}\big[ f(\Pr[C=0 \mid X=x]) \big] . \tag{11}
\end{aligned}
$$

Let j be the index of the interval corresponding to z. Then, by convexity of f(·), we have

$$
\mathbb{E}_{x \in z}\big[ f(\Pr[C=0 \mid X=x]) \big] \le \max_{r \in [(j-1)/k,\, j/k)} f(r)
\quad\text{and}\quad
f\big( \mathbb{E}_{x \in z}\, \Pr[C=0 \mid X=x] \big) \ge \min_{r \in [(j-1)/k,\, j/k)} f(r).
$$

Therefore we have

$$
\mathbb{E}_{x \in z}\big[ f(\Pr[C=0 \mid X=x]) \big] - f\big( \mathbb{E}_{x \in z}\, \Pr[C=0 \mid X=x] \big) \le \max_{r \in [(j-1)/k,\, j/k)} f(r) - \min_{r \in [(j-1)/k,\, j/k)} f(r) \le \Delta_{\max} .
$$

This, together with Equations (10) and (11), shows that $I(X;C) - F(S_b) \le \sum_{z \in \mathcal{Z}} \Pr[Z=z]\, \Delta_{\max} = \Delta_{\max}$ and completes the proof. ∎

The above theorem states that the information loss of the bucketing algorithm is no more than the amount by which f(·) changes within one interval of size 1/k. Note that this is an absolute loss and is not comparable to the approximation guarantee that we provide via submodular maximization.
The main problem with the bucketing algorithm is that it is to some extent oblivious to the input data and, thus, will fail badly for certain inputs, as shown in the following proposition.

**Proposition 10.** There is an input X to the bucketing algorithm such that I(X; C) > 0 and I(Z; C) = 0, where Z is the output of the bucketing algorithm.

*Proof.* Fix a number j. In this example, for half of the items we have Pr[C = 0 | X = x] = (j + 1/3)/k and for the other half we have Pr[C = 0 | X = x] = (j + 2/3)/k. We also set the probability of all values of X to be the same, and hence Pr[C = 0] = (j + 0.5)/k. The mutual information of X with the label is non-zero since Pr[C = 0] ≠ Pr[C = 0 | X = x]. However, the bucketing algorithm merges all of the elements in the range [j/k, (j+1)/k), thereby merging all values together, giving us I(Z; C) = 0 and completing the proof. ∎

Note that we can further strengthen the previous example by giving a tiny mass to all buckets, so that not all values collapse into a single bucket. However, even in this case, the bucketing method can only hope to capture a tiny fraction of the mutual information, since the vast majority of the mass falls into a single bucket.

**Frequency Based Filtering:** This is a very simple compression method (more precisely, a feature selection method) that is popular in practice. Given a vocabulary budget, we compute a frequency threshold τ which we use to remove any vocabulary value that appears in fewer than τ instances of our dataset and which results in a vocabulary of the desired size. Even though the frequency based algorithm is not entirely oblivious to the input, it is oblivious to the label and hence oblivious to the conditional distributions. Similar to the bucketing algorithm, with a simple example in the following proposition we show that the frequency based algorithm fails to provide any approximation guarantee.

**Proposition 11.** There is an input X to the frequency based algorithm such that I(X; C) > 0 and I(Z; C) = 0, where Z is the outcome of the frequency based algorithm.

*Proof.* Assume we have n = 3k values for X, namely x₁, . . . , xₙ. For all i ∈ {1, . . . , k} define Pr[X = xᵢ] = 2/n, and for all i ∈ {k + 1, . . . , n} set Pr[X = xᵢ] = 0.5/n. Note that the first k values are the most frequent values; however, we are going to define them such that they are independent of the label. For all i ∈ {1, . . . , k} let Pr[C = 0 | X = xᵢ] = 1/2, for all i ∈ {k + 1, . . . , 2k} let Pr[C = 0 | X = xᵢ] = 0, and for all i ∈ {2k + 1, . . . , 3k} let Pr[C = 0 | X = xᵢ] = 1. Note that we have Pr[C = 0] = 1/2. Therefore, the mutual information of the k most frequent values with the label is zero, which implies that for a certain vocabulary budget, and thereby frequency threshold, I(Z; C) = 0. Observe that even if we merge the last 2k values and use them as a single new value (as opposed to ignoring them), the label corresponding to the merged value is 0 with probability one half, and hence it has no mutual information with the label. However, we have

$$
I(X;C) = \sum_{x \in \mathcal{X}} \Pr[X=x]\, f(\Pr[C=0 \mid X=x]) = \sum_{i=k+1}^{3k} \frac{0.5}{3k} \log 2 = \frac{1}{3} \log 2 > 0 ,
$$

which completes the proof. ∎
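For reference, here are minimal sketches of the two heuristics just analyzed; the function names and toy numbers are ours. The toy call reproduces the setup of Proposition 10 (with j = 1 and k = 3): every value's conditional probability lands in the same bucket, so the compressed feature carries no information.

```python
from collections import Counter

def bucketing(cond, k):
    """Bucketing heuristic: value i goes to the bucket whose interval
    [(j-1)/k, j/k) contains Pr[C = 0 | X = i]."""
    return {i: min(int(c * k), k - 1) for i, c in enumerate(cond)}

def frequency_filter(counts, budget):
    """Frequency heuristic: keep the `budget` most frequent values; the rest
    are dropped (mapped to None, i.e., out of vocabulary)."""
    keep = {x for x, _ in Counter(counts).most_common(budget)}
    return {x: (x if x in keep else None) for x in counts}

# Proposition 10 setup with j = 1, k = 3: conditionals 4/9 and 5/9 both fall
# into the bucket [1/3, 2/3), so all values collapse into a single cluster.
print(bucketing([4/9, 4/9, 5/9, 5/9], k=3))            # {0: 1, 1: 1, 2: 1, 3: 1}
print(frequency_filter({"a": 90, "b": 5, "c": 5}, 1))  # {'a': 'a', 'b': None, 'c': None}
```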
## 3. Empirical Evaluation

In this section we report our empirical evaluation of optimizing the submodular function F(S) described in the previous section. All the experiments are performed using the Criteo click prediction dataset (Criteo Labs, 2014), which consists of 37 million instances for training and 4.4 million held-out points.¹ In addition to 13 numerical features, this dataset contains 26 categorical features with a combined total vocabulary of more than 28 million values. These features have varying vocabulary sizes, from a handful up to millions of values. Five features, in particular, have more than a million distinct feature values each.

¹ Note that we use the labeled training file from this challenge and chronologically partition it into train/hold-out sets.

In order to execute a mutual information based algorithm, we require estimates of the conditional probabilities Pr[C = 0 | X = xᵢ] and marginal probabilities Pr[X = xᵢ]. Here, we simply use the maximum likelihood estimate based on the empirical counts, i.e., given a sample of feature value and label pairs (x̂₁, ĉ₁), . . . , (x̂_k, ĉ_k), we have

$$
\Pr[X = x_i] = \frac{1}{k} \sum_{j=1}^{k} \mathbf{1}\{\hat{x}_j = x_i\}
\qquad\text{and}\qquad
\Pr[C = 0 \mid X = x_i] = \frac{\sum_{j=1}^{k} \mathbf{1}\{\hat{c}_j = 0 \wedge \hat{x}_j = x_i\}}{\sum_{j=1}^{k} \mathbf{1}\{\hat{x}_j = x_i\}} .
$$

We note that such estimates may sometimes be poor, especially when certain feature values appear very rarely. Evaluating more robust estimates is outside the scope of the current study, but is an interesting direction for future work.

### 3.1. Mutual Information Evaluation

We first evaluate the ability of our algorithm to maximize the mutual information retained by the compressed vocabulary and compare it to other baseline methods. In particular, we compare our algorithm to the iterative divisive clustering algorithm (Dhillon et al., 2003), as well as the frequency filtering and bucketing heuristics introduced in the previous section. The divisive clustering algorithm resembles a version of the k-means algorithm where k is set to be the vocabulary size and distances between points and clusters are defined in terms of the KL divergence between the conditional distribution of the label variable given a feature value and the conditional distribution of the label variable given a cluster center. Notice that due to the large size of the dataset, we cannot run the dynamic programming algorithm introduced by Kurkoski & Yagi (2014), which would find the theoretically optimal solution. For ease of reference, we call our algorithm SUBMODULAR, and the other algorithms DIVISIVE, BUCKETING and FREQUENCY.

Note that our algorithm, as well as the previous algorithms, seeks to maximize the mutual information between a single categorical variable and the label, while in the Criteo dataset we have several categorical variables to which we wish to apply a global vocabulary size budget. In the case of the FREQUENCY heuristic, this issue is addressed by sorting the counts of feature values across all categorical variables and applying a global threshold. In the case of SUBMODULAR, we run the almost linear-time algorithm for each categorical variable to obtain a sorted list of feature values and their marginal contributions to the global objective. Then we sort these marginal values and pick the top-scoring feature values to obtain the desired target vocabulary size. Thus, both SUBMODULAR and FREQUENCY are able to naturally employ global strategies in order to allocate the total vocabulary budget across different categorical features.
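One way the global allocation just described could look in code is sketched below: run the single-feature algorithm on each categorical feature, record the marginal contribution of every added boundary, and then keep the globally largest contributions until the total budget is reached. The helper name, its input format, and the toy numbers are illustrative assumptions rather than the paper's implementation.

```python
def allocate_global_budget(per_feature_gains, budget):
    """per_feature_gains[feat] lists the marginal MI contributions recorded while
    running the greedy algorithm on that feature (non-increasing by submodularity).
    Returns how many of the globally best split points each feature receives."""
    scored = [(gain, feat) for feat, gains in per_feature_gains.items() for gain in gains]
    scored.sort(reverse=True)                      # largest contributions first
    allocation = {feat: 0 for feat in per_feature_gains}
    for _, feat in scored[:budget]:
        allocation[feat] += 1
    return allocation

# Toy usage with made-up marginal gains for three categorical features:
gains = {"f1": [0.30, 0.10, 0.02], "f2": [0.25, 0.20, 0.15], "f3": [0.05, 0.01]}
print(allocate_global_budget(gains, budget=4))     # {'f1': 1, 'f2': 3, 'f3': 0}
```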
For the DIVISIVE and BUCKETING algorithms, a natural global allocation policy is not available, as one needs to define an allocation of the vocabulary budget to each categorical feature a priori. In this study, we evaluate two natural allocation mechanisms. The uniform allocation assigns a uniform budget across all categorical features, whereas the MI allocation assigns a budget that is proportional to the mutual information of the particular categorical feature.

The original vocabulary of over 28 million values is compressed by a factor of up to 2000. Using the methods mentioned above, we obtain vocabularies of size 10K, 20K, 40K, 80K, 120K and 160K. Then we compute the loss in average mutual information, which is defined as follows: let Xᵢ denote the mutual information of uncompressed categorical feature i with the label and Zᵢ denote the mutual information of the corresponding compressed feature; then the average mutual information loss is equal to $\big(\sum_i (X_i - Z_i)\big) / \big(\sum_i X_i\big)$.

*Figure 3. The average mutual information loss of several compression methods measured on the Criteo dataset (x-axis: vocabulary size in thousands; y-axis: average MI loss, log scale).*

For the heuristic FREQUENCY algorithm, the measured loss ranges from 0.520 (for a budget of 160K) to 0.654 (for a budget of 10K), while for BUCKETING the loss ranges from 5 × 10⁻⁶ to 5 × 10⁻³. As expected, the mutual information based methods perform significantly better; in particular, the loss for SUBMODULAR ranges from 9 × 10⁻⁷ to 3 × 10⁻⁹ and consistently outperforms the DIVISIVE algorithm (regardless of allocation strategy). Figure 3 provides a closer look at the mutual information based methods. Thus, we find that our method is not only fast and scalable with a theoretical 1 − 1/e lower bound on performance, but that in practice it also retains almost all of the mutual information between the data points and the labels.

*Figure 4. The log-loss of a neural network model trained with compressed vocabularies of several sizes, using several different compression methods (x-axis: vocabulary size in thousands).*

### 3.2. Learning Evaluation

Our focus thus far has been on optimizing the mutual information objective. In this section we also evaluate the compressed vocabularies in an end-to-end task to demonstrate their application in a learning scenario. Given a compressed vocabulary, we train a neural network model on the training split and measure the log-loss on the hold-out set (further details in supplement Section A.2).² In Figure 4 we see that the mutual information based methods perform comparably to each other and significantly outperform the popular heuristic method FREQUENCY. We observe that our scalable quasi-linear compression algorithm with provable approximation guarantees also performs competitively in end-to-end learning.

## 4. Conclusion

In this work we have presented the first scalable quasi-linear compression algorithm for maximizing mutual information with the label that also exhibits a 1 − 1/e factor approximation guarantee. The algorithm, as well as our insights into constructing a submodular objective function, might be of interest in other applications as well (for example, quantization in DMC). One future line of work is extending this work to the multiclass (non-binary) setting.

## Acknowledgements

LC was supported by the Google PhD Fellowship.
² In order to alleviate the potential issue of poor conditional/marginal distribution estimates, we initially start with only feature values that appear in at least 100 instances.

## References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.

Aggarwal, A., Klawe, M. M., Moran, S., Shor, P., and Wilber, R. Geometric applications of a matrix-searching algorithm. Algorithmica, 2(1-4):195–208, 1987.

Badanidiyuru, A. and Vondrák, J. Fast algorithms for maximizing submodular functions. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1497–1514. SIAM, 2014.

Baker, L. D. and McCallum, A. K. Distributional clustering of words for text classification. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 96–103. ACM, 1998.

Barbosa, R., Ene, A., Nguyen, H., and Ward, J. The power of randomization: Distributed submodular maximization on massive datasets. In International Conference on Machine Learning, pp. 1236–1244, 2015.

Cicalese, F., Gargano, L., and Vaccaro, U. Bounds on the entropy of a function of a random variable and their applications. IEEE Transactions on Information Theory, 64(4):2220–2230, 2018.

Cover, T. M. and Thomas, J. A. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, New York, NY, USA, 2006.

Criteo Labs. Display Advertising Challenge, 2014. URL https://www.kaggle.com/c/criteo-display-ad-challenge.

Dhillon, I. S., Mallela, S., and Kumar, R. A divisive information-theoretic feature clustering algorithm for text classification. Journal of Machine Learning Research, 3(Mar):1265–1287, 2003.

Guyon, I. and Elisseeff, A. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157–1182, 2003.

Iwata, K.-i. and Ozawa, S.-y. Quantizer design for outputs of binary-input discrete memoryless channels using SMAWK algorithm. In Information Theory (ISIT), 2014 IEEE International Symposium on, pp. 191–195. IEEE, 2014.

Jiang, J.-Y., Liou, R.-J., and Lee, S.-J. A fuzzy self-constructing feature clustering algorithm for text classification. IEEE Transactions on Knowledge and Data Engineering, 23(3):335–349, 2011.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kumar, R., Moseley, B., Vassilvitskii, S., and Vattani, A. Fast greedy algorithms in MapReduce and streaming. ACM Transactions on Parallel Computing (TOPC), 2(3):14, 2015.

Kurkoski, B. M. and Yagi, H. Quantization of binary-input discrete memoryless channels. IEEE Transactions on Information Theory, 60(8):4544–4552, 2014.

Mirrokni, V. and Zadimoghaddam, M. Randomized composable core-sets for distributed submodular maximization. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 153–162. ACM, 2015.

Mirzasoleiman, B., Karbasi, A., Sarkar, R., and Krause, A. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems, pp. 2049–2057, 2013.

Mirzasoleiman, B., Badanidiyuru, A., Karbasi, A., Vondrák, J., and Krause, A. Lazier than lazy greedy. In AAAI, pp. 1812–1818, 2015.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning.
MIT Press, 2018.

Mumey, B. and Gedeon, T. Optimal mutual information quantization is NP-complete. In Proc. Neural Inf. Coding (NIC) Workshop, 2003.

Slonim, N. and Tishby, N. The power of word clusters for text classification. In 23rd European Colloquium on Information Retrieval Research, volume 1, pp. 200, 2001.

Zhang, J. A. and Kurkoski, B. M. Low-complexity quantization of discrete memoryless channels. In Information Theory and Its Applications (ISITA), 2016 International Symposium on, pp. 448–452. IEEE, 2016.