# Fair and Diverse DPP-Based Data Summarization

L. Elisa Celis¹, Vijay Keswani¹, Damian Straszak¹, Amit Deshpande², Tarun Kathuria³, Nisheeth K. Vishnoi¹

¹École Polytechnique Fédérale de Lausanne (EPFL), Switzerland; ²Microsoft Research, India; ³UC Berkeley. Correspondence to: L. Elisa Celis, Nisheeth K. Vishnoi. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018.

Sampling methods that choose a subset of the data proportional to its diversity in the feature space are popular for data summarization. However, recent studies have noted the occurrence of bias (e.g., under- or over-representation of a particular gender or ethnicity) in such data summarization methods. In this paper we initiate a study of the problem of outputting a diverse and fair summary of a given dataset. We work with a well-studied determinantal measure of diversity and the corresponding distributions (DPPs), and present a framework that allows us to incorporate a general class of fairness constraints into such distributions. Designing efficient algorithms to sample from these constrained determinantal distributions, however, runs into a complexity barrier; we present a fast sampler that is provably good when the input vectors satisfy a natural property. Our empirical results on both real-world and synthetic datasets show that the diversity of the samples produced by adding fairness constraints is not too far from the unconstrained case.

1. Introduction

A problem facing many services, from search engines and news feeds to machine learning, is data summarization: how can one select a small but representative, i.e., diverse, subset from a large dataset? For instance, Google Images outputs a small subset of images from its enormous dataset given a user query. Similarly, in training a learning algorithm one may be required to choose a subset of data points to train on, as training on the entire dataset may be costly.

However, data summarization algorithms prevalent in the online world have recently been shown to be biased with respect to sensitive attributes such as gender, race or ethnicity. For instance, a recent study found evidence of systematic under-representation of women in search results (Kay et al., 2015). Concretely, the above work studied the output of Google Images for various search terms involving occupations and found, e.g., that for the search term "CEO", the percentage of women in the top 100 results was 11%, significantly lower than the ground truth of 27%. Through studies on human subjects, they also found that such misrepresentations have the power to influence people's perception of reality. Beyond humans, since data summaries are used to train algorithms, there is a danger that these biases in the data might be passed on to the algorithms that use them; a phenomenon that is being revealed more and more in automated data-driven processes in education, recruitment, banking, and judiciary systems; see (O'Neil, 2016).

A robust and widely deployed method for data summarization is to associate a diversity score to each subset and select a subset with probability proportional to this score; see (Hesabi et al., 2015).
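As a toy illustration of this paradigm, the following minimal Python sketch (our own, not code from the paper; the dataset and the score function are hypothetical) enumerates all k-subsets of a small dataset and samples one with probability proportional to an arbitrary diversity score:

```python
import numpy as np
from itertools import combinations

def sample_proportional(items, k, score, rng=None):
    """Sample a k-subset with probability proportional to score(S).

    Exhaustive enumeration, so only sensible for tiny datasets;
    this illustrates the paradigm, not a production sampler."""
    rng = rng or np.random.default_rng()
    subsets = list(combinations(items, k))
    w = np.array([score(S) for S in subsets], dtype=float)
    return subsets[rng.choice(len(subsets), p=w / w.sum())]

# Example: score a subset by how spread out its points are on a line.
points = {0: 0.0, 1: 0.1, 2: 0.5, 3: 0.9, 4: 1.0}
spread = lambda S: np.prod([abs(points[a] - points[b])
                            for a, b in combinations(S, 2)])
print(sample_proportional(list(points), 3, spread))
```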
This paper focuses on a concrete geometric measure of diversity of a subset $S$ of a dataset $\{v_x\}_{x \in X}$ of vectors: the determinantal measure, denoted by $G(S)$ (Kulesza & Taskar, 2012); the resulting probability distribution is called a determinantal point process (DPP). $G(S)$ generalizes the correlation measure for two vectors to multiple vectors and, intuitively, the larger $G(S)$, the more diverse $S$ is in the feature space. Among the benefits of $G(\cdot)$ are its overall simplicity, its wide applicability that does not depend on combinatorial properties of the data, and its efficient computability. A potential downside is the additional effort required in modeling, i.e., to represent the data in a suitable vector form so that the geometry of the dataset indeed corresponds to diversity.

Despite the well-acknowledged ability of DPPs to produce diverse subsets, unfortunately, there seems to be no obvious way to ensure that this also guarantees fairness in the DPP samples, in the form of appropriate representation of sensitive attributes in the selected subset. Partially, this is due to the fact that fairness can mean different things in different contexts. For instance, consider a dataset in which each data point has a gender. One notion of fairness, useful in ensuring that the ground truth does not get distorted, is proportional representation: the distribution of sensitive characteristics in the output set should be identical to that of the input dataset (Kay et al., 2015). Another notion of fairness, argued to be necessary to reverse the effect of historical biases (Koriyama et al., 2013), is equal representation: the representation of sensitive characteristics should be equal, independent of their ratio in the input dataset. While these measures of fairness have natural generalizations to the case when there are more than two sensitive types, and can be refined in several ways, one thing remains common: they all operate in the combinatorial space of sensitive attributes of the data points. Simple examples (see, e.g., Figure 1 in the Supplementary File) show that, in certain settings, geometric diversity does not imply fairness and vice-versa; however, there seems to be no intrinsic barrier to attaining both.

We initiate a rigorous study of the problem of incorporating fairness with respect to sensitive attributes of data in DPP-based sampling for data summarization. Our contributions are:

- A framework that can incorporate a wide class of notions of fairness with respect to disjoint sensitive attributes and, conditioned on being fair in the specified sense, outputs subsets where the probability of a set is still proportional to $G(\cdot)$. In particular, we model the problem as sampling from a partition DPP: the parts correspond to different sensitive attributes and the goal is to select a specified number of points from each.

- Unfortunately, the problem of sampling from partition DPPs has recently been shown to be intractable in a strong sense (Celis et al., 2017), and the question of designing fast algorithms for it, at the expense of being approximate, has been open. Our main technical result is a linear-time algorithm (see Section 3.1) to sample from partition DPPs that is guaranteed to output samples from close to the DPP distribution under a natural condition on the data (see Definition 4). We prove that random data matrices satisfy this condition in Section 3.3.
- We run our algorithm on the Adult dataset (Blake & Merz, 1998) and a curated image dataset with various parameter settings and observe a marked improvement in fairness without compromising geometric diversity by much. A theoretical justification of this low price of fairness is provided in Section 4; while there have been a few works on controlling fairness, ours is the first to give a rigorous, quantitative price-of-fairness guarantee in any setting.

Overall, our work gives a general and rigorous algorithmic solution to the problem of controlling bias in DPP-based sampling algorithms for data summarization while maximizing diversity.

Related Work. There are several data pre-processing approaches to reducing bias in training data. For example, in (Kamiran & Calders, 2012) or (He & Garcia, 2009), bias is removed from training data by over- or under-sampling from the dataset with appropriately defined cardinality constraints on the parts of a partition. The sampling approach used is often either uniform or preferential (according to a problem-dependent ranking). We show that sampling using partition DPPs gives better results in ensuring diversity of the sampled subset than any such sampling method.

DPP-based sampling has been deployed for many data summarization tasks including text and images (Kulesza & Taskar, 2011), videos (Gong et al., 2014), documents (Lin & Bilmes, 2012), recommendation systems (Zhou et al., 2010), and sensors (Krause et al., 2008); hence, the study of DPPs with additional budget or resource constraints is of importance. While for unconstrained DPPs there are efficient sampling algorithms (Hough et al., 2006), the problem of sampling from constrained DPPs is intractable; see (Celis et al., 2017), where pseudopolynomial-time algorithms for partition DPPs are presented. There is also work on approximate MCMC algorithms for sampling from various discrete point processes (see (Rebeschini & Karbasi, 2015; Anari et al., 2016) and the references therein), and on algorithms that are efficient for constrained DPPs under certain restrictions on the data matrix and constraints (see (Li et al., 2016) and the references therein). To the best of our knowledge, ours is the first algorithm for constrained DPPs that runs in near-linear time. Our algorithm is a greedy, approximate algorithm, and can be considered an extension of a similar algorithm for unconstrained DPPs given by (Deshpande & Vempala, 2006). Finally, our work contributes to an ongoing effort to measure, understand and incorporate fairness (e.g., see (Barocas & Selbst, 2015; Caliskan et al., 2017; Dwork et al., 2012; Zafar et al., 2017)) in fundamental algorithmic problems, including ranking (Celis et al., 2018b), voting (Celis et al., 2018a), and personalization (Celis & Vishnoi, 2017).

2. Our Model

In this section we present the formal notions, model and other theoretical constructs studied in this paper. $X$ will denote the dataset and we let $m$ denote its size. We assume that for each $x \in X$ we are given a (feature) vector $v_x \in \mathbb{R}^n$, where $n \le m$ is the dimension of the data. Let $V$ denote the $m \times n$ matrix whose rows correspond to the vectors $v_x$ for $x \in X$. For a set $S \subseteq X$, we use $V_S$ to denote the submatrix of $V$ obtained by picking the rows of $V$ corresponding to the elements of $S$. We can now describe geometric diversity formally.
Definition 1 (Geometric Diversity). Given a dataset $X$ and the corresponding feature vectors $V \in \mathbb{R}^{m \times n}$, the geometric diversity of a subset $S \subseteq X$ is defined as $G(S) := \det(V_S V_S^\top)$, which is the squared volume of the parallelepiped spanned by the rows of $V_S$.

This volume generalizes the correlation measure for two vectors to multiple vectors and, intuitively, the larger the volume, the more diverse $S$ is in the feature space; see Figure 2 in the Supplementary File for an illustration. Geometric diversity gives rise to the following distribution on subsets, known as a determinantal point process (DPP).

Definition 2 (DPPs and k-DPPs). Given a dataset $X$ and the corresponding feature vectors $V \in \mathbb{R}^{m \times n}$, the DPP is a distribution over subsets $S \subseteq X$ such that $\Pr[S] \propto \det(V_S V_S^\top)$. The induced probability distribution over $k$-sized subsets is called a $k$-DPP.

A characteristic of a DPP measure is that the inclusion of one item makes the inclusion of other similar items less likely. Consequently, DPPs assign greater probability to subsets of points that are diverse; for example, a DPP prefers search results that cover multiple aspects of a user's query, rather than only the most popular one.

Our Algorithmic Framework: We are given a dataset $X$ along with corresponding feature vectors $V \in \mathbb{R}^{m \times n}$ and a positive number $k \le m$ that denotes the size of the subset or summary to be generated. The dataset $X$ is partitioned into $p$ disjoint classes $X_1 \cup X_2 \cup \cdots \cup X_p$, each corresponding to a sensitive class. A key feature of our model is that we do not fix one notion of fairness; rather, we allow for the specification of fairness constraints with respect to these sensitive classes. Formally, we do this by taking as input $p$ natural numbers $(k_1, k_2, \ldots, k_p)$ such that $\sum_{j=1}^{p} k_j = k$ is the sample size. These numbers give rise to a fair family of allowed subsets, defined as
$$B := \{S \subseteq X : |S \cap X_j| = k_j \text{ for all } j = 1, 2, \ldots, p\}.$$
By setting $(k_1, \ldots, k_p)$ appropriately, the user can ensure their desired notion of fairness. For example, if the dataset has $m_i$ items with the $i$-th sensitive attribute, then we can set $k_i := k\, m_i/m$ (suitably rounded) to obtain proportional representation. Similarly, equal representation can be implemented by setting $k_i = k/p$ for all $i$.

The fair data summarization problem is to sample from a distribution that is supported on $B$. However, there could be many such distributions; we pick the one that is closest to the $k$-DPP described by $V$. We use the Kullback-Leibler (KL) divergence between distributions $q'$ and $q$, defined as $D_{\mathrm{KL}}(q' \,\|\, q) := \sum_{S} q'_S \log \frac{q'_S}{q_S}$. (Note that when there are only two parts, one can recover the percentages of elements from each part from the KL-distance; for multiple parts, the KL-distance is a natural, and general, single-dimensional function of the percentage vector with which to measure the deviation from the target distribution.) The following lemma characterizes the distribution supported on $B$ that has the least KL-divergence to a given distribution (see Appendix B.1 in the Supplementary File for the proof).

Lemma 1. Given a distribution $q$ with support set $C$, let $B \subseteq C$ and let $q'$ be any distribution on $B$. Then the optimal value of $\min_{q'} D_{\mathrm{KL}}(q' \,\|\, q)$ is achieved by the distribution $q^\star$ such that $q^\star_S \propto q_S$ for $S \in B$ and $q^\star_S = 0$ otherwise.

Thus, the distribution above can be thought of as the most diverse while being fair; we call it a partition DPP, or P-DPP.

Definition 3 (P-DPP). Given a dataset $X$, the corresponding feature vectors $V \in \mathbb{R}^{m \times n}$, a partition $X = X_1 \cup X_2 \cup \cdots \cup X_p$ into $p$ parts, and natural numbers $k_1, \ldots, k_p$, the P-DPP is the distribution $q^\star$ over subsets $S \subseteq X$ of size $k = \sum_{i=1}^{p} k_i$ such that
$$q^\star_S := \frac{\det(V_S V_S^\top)}{\sum_{T \in B} \det(V_T V_T^\top)} \ \text{ for } S \in B, \qquad \text{and } q^\star_S = 0 \text{ otherwise.}$$
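To make Definitions 1-3 concrete, the following minimal numpy sketch (ours, on hypothetical toy data) computes $G(S)$ and the exact P-DPP by enumerating the fair family $B$; replacing $B$ with all $k$-subsets would give the $k$-DPP instead:

```python
import numpy as np
from itertools import combinations, product

def G(V, S):
    """Geometric diversity G(S) = det(V_S V_S^T) (Definition 1)."""
    VS = V[list(S)]
    return np.linalg.det(VS @ VS.T)

def pdpp(V, parts, ks):
    """Exact P-DPP (Definition 3) by brute-force enumeration of B."""
    choices = [combinations(part, k) for part, k in zip(parts, ks)]
    # The fair family B: pick k_i elements from each part X_i.
    B = [tuple(sorted(sum(c, ()))) for c in product(*choices)]
    w = np.array([G(V, S) for S in B])
    return B, w / w.sum()

rng = np.random.default_rng(0)
V = rng.normal(size=(6, 4))          # m = 6 points in R^4
parts = [(0, 1, 2, 3), (4, 5)]       # two sensitive classes X_1, X_2
k = 3
ks = [round(k * len(p) / V.shape[0]) for p in parts]  # proportional: [2, 1]
for S, prob in zip(*pdpp(V, parts, ks)):
    print(S, round(prob, 3))
```

Since $|B|$ grows exponentially with $k$, this enumeration is feasible only for tiny instances, which is consistent with the hardness result discussed next.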
Given the results of (Celis et al., 2017), we know that sampling from P-DPPs is #P-hard, so efficient exact sampling algorithms for P-DPPs are unlikely. Correspondingly, the flexibility that our framework provides in specifying fairness constraints comes at a computational cost. In this paper, we give a fast, approximate sampling algorithm for P-DPPs.

3. Our Algorithm

Notions of Volume and Projection. Let us recall the interpretation of determinants in terms of volumes. For $S \subseteq X$, $V_S$ is the set of vectors $\{v_x\}_{x \in S}$. If the vectors in $S$ are pairwise orthogonal, then the matrix $V_S V_S^\top$ is diagonal with entries $\{\|v_x\|^2\}_{x \in S}$ on the diagonal and, hence, $\det(V_S V_S^\top) = \prod_{x \in S} \|v_x\|^2$. In the general case the determinant is not simply the product of the squared norms of the vectors; however, a similar formula still holds. Let $H \subseteq \mathbb{R}^n$ be any linear subspace and let $H^\perp$ be its orthogonal complement, i.e., $H^\perp := \{y \in \mathbb{R}^n : \langle x, y \rangle = 0 \text{ for all } x \in H\}$. Let $\Pi_H : \mathbb{R}^n \to \mathbb{R}^n$ be the orthogonal projection operator onto the subspace $H^\perp$, i.e., whenever $w \in \mathbb{R}^n$ decomposes as $w_1 + w_2$ for $w_1 \in H$ and $w_2 \in H^\perp$, then $\Pi_H(w) = w_2$. By a slight abuse of notation, we also denote by $\Pi_v$ the operator that projects a vector onto the orthogonal complement of a given vector $v \in \mathbb{R}^n$, i.e., $\Pi_v(w) := w - \frac{\langle w, v \rangle}{\|v\|^2} v$. The following lemma is a simple generalization of the formula derived above for orthogonal families of vectors and inspires our algorithm for P-DPPs. The proof of this lemma is presented in Section B.3 in the Supplementary File.

Lemma 2 (Determinant Volume Lemma). Let $w_1, \ldots, w_k \in \mathbb{R}^n$ be the rows of a matrix $W \in \mathbb{R}^{k \times n}$. Then $\det(WW^\top) = \prod_{i=1}^{k} \|\Pi_{H_i} w_i\|^2$, where $H_i$ is the subspace spanned by $\{w_1, \ldots, w_{i-1}\}$ for all $i = 1, 2, \ldots, k$.

3.1. Our Sample and Project Algorithm

Before we describe our algorithm for sampling from P-DPPs, it is instructive to consider the special case of $k$-DPPs and the simple orthogonal scenario where all the vectors $v_x$, for $x \in X$, are pairwise orthogonal. In such a case, there is a simple iterative algorithm: sample $x \in X$ with probability proportional to $\|v_x\|^2$, then add $x$ to $S$ and remove $x$ from $X$; repeat until $|S| = k$. It is intuitively clear, and not hard to prove, that the final probability of obtaining a given set $S$ as a sample is proportional to $\prod_{x \in S} \|v_x\|^2 = \det(V_S V_S^\top)$ and, hence, this recovers the $k$-DPP exactly. In the case of P-DPPs where all the vectors are pairwise orthogonal and we need to sample $k_i$ vectors from part $X_i$, we can sample the required number of elements from each part independently using the procedure in the previous paragraph. The orthogonality of the vectors and the disjointness of the parts imply that this sampling procedure gives the right probability distribution.

Algorithm 1 Sample-And-Project
1: Input: $V$, $(X_1, \ldots, X_p)$, $(k_1, \ldots, k_p)$
2: $S \leftarrow \emptyset$
3: $k \leftarrow k_1 + k_2 + \cdots + k_p$
4: Let $w_x := v_x$ for all $x \in X$
5: while $|S| < k$ do
6: Pick any $i \in \{1, \ldots, p\}$ such that $|S \cap X_i| < k_i$
7: Define $q \in \mathbb{R}^{X_i}$ by $q_x := \|w_x\|^2$ for $x \in X_i$
8: Sample $\hat{x} \in X_i$ from the distribution $\left(q_x / \sum_{x' \in X_i} q_{x'}\right)_{x \in X_i}$
9: $S \leftarrow S \cup \{\hat{x}\}$
10: Let $v := w_{\hat{x}}$
11: For all $x \in X$, set $w_x := \Pi_v(w_x)$
12: end while

However, when the vectors $v_x$ are no longer pairwise orthogonal, the above heuristic can fail miserably. This is where we invoke Lemma 2: it suggests that once we select a vector, we should orthogonalize all the remaining vectors with respect to it before repeating the sampling procedure.
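Below is a minimal numpy transcription of Algorithm 1; this is our illustrative sketch (it assumes each part retains vectors of nonzero residual norm whenever a pick is needed, which degenerate inputs can violate), not the authors' released code. Each iteration costs $O(mn)$, in line with the running time discussed next.

```python
import numpy as np

def sample_and_project(V, parts, ks, rng=None):
    """Sample-and-Project (Algorithm 1 sketch): repeatedly pick an unfinished
    part, sample a point with probability proportional to its squared
    residual norm, then project all vectors orthogonally to the chosen one."""
    rng = rng or np.random.default_rng()
    W = np.array(V, dtype=float)                 # residual vectors w_x := v_x
    remaining = [list(p) for p in parts]
    need = list(ks)
    S = []
    while any(need):
        i = next(j for j, c in enumerate(need) if c > 0)   # part with |S ∩ X_i| < k_i
        idx = remaining[i]
        q = np.array([W[x] @ W[x] for x in idx])           # q_x = ||w_x||^2
        x_hat = idx[rng.choice(len(idx), p=q / q.sum())]
        S.append(x_hat)
        idx.remove(x_hat)
        need[i] -= 1
        v = W[x_hat].copy()
        W -= np.outer(W @ v, v) / (v @ v)        # w_x := Pi_v(w_x) for all x
    return sorted(S)

rng = np.random.default_rng(0)
V = rng.normal(size=(6, 4))
print(sample_and_project(V, [(0, 1, 2, 3), (4, 5)], [2, 1], rng))
```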
For the case of $k$-DPPs, it can be shown that this heuristic outputs a set $S$ with probability no more than $k!$ times its desired probability (Deshpande & Vempala, 2006). The $k!$ term arises primarily because the $k$ vectors can be chosen in any of $k!$ orders. Taking this simple heuristic as a starting point and incorporating an additional idea to deal with partition constraints, we arrive at our Sample and Project algorithm; see Algorithm 1.

Given that we have made several simplifications and informal jumps when deriving the algorithm, one cannot expect the distribution over sets $S$ produced by Algorithm 1 to be exactly the same as the P-DPP. Later in this section we give evidence that the distribution output by the Sample and Project heuristic can in fact be formally related to the P-DPP distribution, and hence the constructed algorithm is provably an approximation to a P-DPP. First, however, we note an attractive feature of this algorithm: it is fast and practical. For a matrix $V \in \mathbb{R}^{m \times n}$ and $k = \sum_{i=1}^{p} k_i$, Algorithm 1 can be implemented in $O(mnk)$ time. Note that the size of the data for this problem is already $\Theta(mn)$; hence, the algorithm does only linear work per sampled point. For P-DPPs there is only one known exact algorithm, which samples in time $m^{O(p)}$; this is polynomial only when $p = O(1)$ (Celis et al., 2017).

Another possible approach for sampling from DPPs is the Markov Chain Monte Carlo (MCMC) method. It was proved in (Anari et al., 2016) that Markov chains can be used to sample from $k$-DPPs in time roughly $\widetilde{O}(mk^4 + mn^2)$ given a "warm start", i.e., a set $S_0$ of significant probability. This approach does not extend to P-DPPs: indeed, in (Anari et al., 2016) the underlying probability distribution is required to be Strongly Rayleigh, a property which holds for $k$-DPPs but fails for P-DPPs whenever the number of parts is at least two. One can still formulate an analogous MCMC algorithm for the case of P-DPPs; it fails on specially crafted bad instances but seems to perform well on real-world data. However, even ignoring the lack of provable guarantees for this algorithm, it does not seem possible to reduce its running time below $O(mk^4 + mn^2)$, which significantly limits its practical applicability.

3.2. Provable Guarantees for Our Algorithm

We now present a theorem which connects the output distribution of Algorithm 1 to the corresponding P-DPP. To establish such a guarantee, we require the following assumption on the singular values of the matrices $V_{X_i}$.

Definition 4 (β-balance). Let $X$ be a set of $m$ elements partitioned into $p$ parts $X_1, \ldots, X_p$ and let $V \in \mathbb{R}^{m \times n}$ be a matrix. Denote by $\sigma_1 \ge \cdots \ge \sigma_n$ the singular values of $V$ and, for each $i \in \{1, 2, \ldots, p\}$, let $\sigma_{i,1} \ge \cdots \ge \sigma_{i,n}$ denote the singular values of $V_{X_i}$. For $\beta \ge 1$, the partition $X_1, \ldots, X_p$ is called β-balanced with respect to $V$ if for all $i \in \{1, \ldots, p\}$ and for all $j \in \{1, \ldots, n\}$, $\sigma_{i,j} \ge \frac{1}{\beta} \sigma_j$.

The β-balance property informally requires that the diversity within each of the parts $V_{X_i}$, relative to $V$, is significant.
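Given the singular-value reading of Definition 4 above, the smallest $\beta$ for which a given partition is β-balanced can be computed directly; the sketch below is our own construction (assuming $m \ge n$, so that $V$ has $n$ singular values):

```python
import numpy as np

def balance_beta(V, parts):
    """Smallest beta >= 1 with sigma_{i,j} >= sigma_j / beta for every part i
    and every j; returns inf if some part is rank-deficient (never balanced)."""
    n = V.shape[1]
    sigma = np.linalg.svd(V, compute_uv=False)       # sigma_1 >= ... >= sigma_n
    beta = 1.0
    for part in parts:
        s = np.zeros(n)
        sv = np.linalg.svd(V[list(part)], compute_uv=False)
        s[:len(sv)] = sv                             # parts with < n rows pad with zeros
        if np.any(s == 0):
            return np.inf
        beta = max(beta, float(np.max(sigma / s)))
    return beta

rng = np.random.default_rng(0)
V = rng.normal(size=(40, 4))
print(balance_beta(V, [range(0, 20), range(20, 40)]))
```

On random Gaussian data such as the above, the computed $\beta$ is a small constant, consistent with the paper's claim (Section 3.3) that random data matrices satisfy the condition.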
A more concrete geometric way to think about this condition is as follows: if one thinks of the positive semidefinite matrix $V^\top V \in \mathbb{R}^{n \times n}$ as representing an ellipsoid in $\mathbb{R}^n$ whose axis lengths are the singular values, then the β-balance condition essentially says that the ellipsoids corresponding to each of the parts are a β-approximation to that of $V$ (see Figure 4 in the Supplementary File). Importantly, Algorithm 1 never outputs a set $S \notin B$; hence, the only way its output distribution could significantly differ from the P-DPP would be if certain sets $S \in B$ appeared in the output with larger probabilities than specified by the P-DPP. Our main theoretical result for Sample and Project is that for β-balanced instances we can control the scale at which such a violation can happen.

Theorem 1 (Approximation Guarantee). Let $X$ be a set of $m$ elements partitioned into $p$ parts $X_1, \ldots, X_p$, let $V \in \mathbb{R}^{m \times n}$ be a matrix, and let $k_1, \ldots, k_p$ be integers such that $X_1, \ldots, X_p$ is a β-balanced partition with respect to $V$; let $k = \sum_{j=1}^{p} k_j$. Let $B \subseteq 2^X$ denote the family of sets
$$B := \{S \subseteq X : |S \cap X_j| = k_j \text{ for all } j = 1, 2, \ldots, p\}.$$
Then Algorithm 1, with $V$, $(X_1, \ldots, X_p)$ and $(k_1, \ldots, k_p)$ as input, returns a subset $S \in B$ with probability
$$q(S) \le \eta_k\, \beta^{2k}\, q^\star_S, \quad \text{where } q^\star_S = \frac{\det(V_S V_S^\top)}{\sum_{T \in B} \det(V_T V_T^\top)} \ \text{ and } \ \eta_k = k_1!\, k_2! \cdots k_p!.$$

The proof of the approximation guarantee uses techniques inspired by (Deshpande & Vempala, 2006), who prove a similar bound for $k$-DPP sampling. We use the following lemmas in the proof of the theorem; the proofs of these lemmas appear in Appendix B.4 and Appendix B.5 in the Supplementary File.

Lemma 3. For any matrix $V \in \mathbb{R}^{m \times n}$ with $m \ge n \ge k$,