# multigroup_learning_for_hierarchical_groups__9cad6447.pdf Multi-group Learning for Hierarchical Groups Samuel Deng 1 Daniel Hsu 1 The multi-group learning model formalizes the learning scenario in which a single predictor must generalize well on multiple, possibly overlapping subgroups of interest. We extend the study of multi-group learning to the natural case where the groups are hierarchically structured. We design an algorithm for this setting that outputs an interpretable and deterministic decision tree predictor with near-optimal sample complexity. We then conduct an empirical evaluation of our algorithm and find that it achieves attractive generalization properties on real datasets with hierarchical group structure. 1. Introduction In the classical statistical learning setup, the goal is to construct a predictor with high average accuracy. However, in many practical learning scenarios, an aggregate, on-average measure of performance is insufficient. In general, averagecase performance can obscure performance for subgroups of examples a predictor that boasts 95% accuracy on average might only be 50% accurate on an important subgroup comprising 10% of the population. Such a subgroup might be difficult for a predictor trained for aggregate performance because it is not represented well during training (Oakden-Rayner et al., 2020) or it admits spurious correlations (Borkan et al., 2019). Recent work has shown that, in various learning domains, constructing a predictor that performs well on multiple subgroups is crucial. For instance, when fairness is a concern, a natural desideratum is that a predictor be accurate not only on average, but also conditional on possibly intersecting subgroups such as race and gender (Hardt et al., 2016; Diana et al., 2021). In medical imaging, a model might sys- *Equal contribution 1Department of Computer Science, Columbia University. Correspondence to: Samuel Deng , Daniel Hsu . Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). tematically err on rarer cancers to achieve a higher average accuracy, leading to possibly life-threatening predictions on individuals with the rare cancer (Oakden-Rayner et al., 2020). Similar demands for group-wise accuracy appear in domains as varied as medical imaging (Bissoto et al., 2019; Oakden-Rayner et al., 2020; De Grave et al., 2021), facial recognition (Buolamwini & Gebru, 2018; Kuehlkamp et al., 2017), object recognition (De Vries et al., 2019), and NLP (Orr et al., 2020; Varma et al., 2021; Borkan et al., 2019). A common theme is that these high error subgroups come from some hierarchical group structure. Demographic subgroups in fairness-aware scenarios naturally arrange via, say, race, gender, and age (Borkan et al., 2019) or subgroups of face images in facial recognition naturally arrange via, say, gender, facial expression, and hair color (Liu et al., 2015). Motivated by these concerns, we study the multi-group (agnostic PAC) learning model first proposed by Rothblum & Yona (2021). In multi-group learning, there is a collection of groups G comprised of possibly overlapping subsets of the input space and a benchmark hypothesis class H of predictors. The objective in multi-group learning is to construct a predictor f whose accuracy is not much worse than the best (possibly different) predictor hg H for each group g G simultaneously. This generalizes the traditional (onaverage) statistical learning setting when G contains the entire input space. Rothblum & Yona (2021) gives an initial boosting-based algorithm that achieves multi-group learning, and Tosh & Hsu (2022) provide simpler algorithms with improved sample complexity. These results all assume no further structure on G. In this paper, we study the natural special case in which G has hierarchical structure. To this end, we present two main contributions. First, we identify a mathematically natural structural property for groups namely, hierarchical structure that permits a simple and efficient algorithm that achieves multi-group learning with near-optimal group-wise error rates. Second, we conduct an extensive empirical evaluation of this algorithm against other baselines for multi-group learning on several real datasets with hierarchical group structure. This partially addresses a question posed by Tosh & Hsu (2022) to design a learning algorithm that outputs a simple, deterministic predictor like their Prepend algorithm enjoys near-optimal error rates. The empirical results show our algorithm indeed achieves multi-group generalization Multi-group Learning for Hierarchical Groups on par with and, on some groups, better than existing multi-group learning methods. This supports our theory and suggests that, in the case where groups are hierarchically structured, the tree representation of our algorithm is a good inductive bias for generalization on the subgroups. 1.1. Summary of Results First, we analyze two algorithms for hierarchical multigroup learning, showing near-optimal group-wise excess error rates for the latter. The na ıve first algorithm, detailed in Section 3.1, simply applies ERM to a partition of the input space formed by the hierarchical groups. This procedure of training a separate decoupled predictor on disjoint subsets of the input space to achieve some desired multi-group property has been studied before (Dwork et al., 2018); we simply extend the analysis to the hierarchical multi-group learning framework. Our main algorithm, MGL-Tree (Algorithm 1), outputs a simple decision tree predictor that is guaranteed to achieve error competitive to the best possible predictor in a benchmark hypothesis class for every group simultaneously. Importantly, the best benchmark hypothesis for each group may differ. Namely, we show that, for finite H, MGL-Tree achieves multi-group learning with a group-wise excess error rate of O p log(|H||G|)/ng , where H is the hypothesis class, G is the collection of hierarchical groups, and ng is the number of training examples for a group g. We note that this also extends to H of bounded VC dimension, and these guarantees hold for any bounded loss function. In comparison, an algorithm of Tosh & Hsu (2022) also achieves the state-of-the-art group-wise excess error rate of O p log(|H||G|)/ng for finite H and G (without the hierarchical restriction on G) with a black-box online-to-batch reduction. The resulting predictor, however, is a randomized ensemble of n (also) random predictors, one for each training sample, and requires the explicit enumeration of a finite hypothesis class. It is unclear if a practical implementation of such an algorithm exists. A separate algorithm of Tosh & Hsu (2022) and Globus-Harris et al. (2022), called Prepend, on the other hand, outputs a simple, interpretable, and determnistic decision list predictor that avoids enumerating H. Its group-wise excess error rate, however, is not optimal: O log(|H||G|)/γng , where γ is the probability of the group with the smallest mass. Resolving this trade-off between simplicity and optimal rates was an unresolved issue posed in Tosh & Hsu (2022). Our algorithm avoids the trade-off and achieves the state-of-the-art excess error rate with a simple predictor that refines its predictions through a breadth-first search on the tree induced by the hierarchical group structure. Second, we conduct an empirical evaluation of existing multi-group learning algorithms on datasets with natural hierarchical group structure. We evaluate MGL-Tree, the Prepend algorithm, and the ERM baseline in several binary classification problems. We evaluate different benchmark hypothesis classes (linear predictors, decision trees, and tree ensembles) and different hierarchical collections of groups. We consider twelve US Census datasets from Ding et al. (2021) for the tasks of predicting income, health care coverage, and employment. We consider four different collections of hierarchically structured groups arising from different collections of demographic attributes: (1) race, sex, and age, (2) race, sex, and education, (3) US state, race, and sex, and (4) US state, race, and age. On each dataset, we evaluate the generalization of each algorithm through classification accuracy on a held-out test set. We find that MGL-Tree consistently improves on the accuracy of the global ERM and group-specific ERM hypotheses from our benchmark class, validating our theory. Our algorithm also consistently achieves equal or higher accuracy than Prepend, suggesting that the decision tree representation of our final predictor is a more favorable inductive bias to hierarchically structured groups than a decision list of possibly arbitrary order. We also find that, in some cases, MGL-Tree even outperforms more complex ERM predictors (e.g., tree ensembles) which were empirically observed in previous work to have good multi-group generalization properties (Gardner et al., 2022; Pfohl et al., 2022). 1.2. Related Work Our work touches upon several areas of theoretical and empirical literature, including multi-group learning, fairness, and hidden stratification. Multi-group learning. The multi-group agnostic PAC learning setting was first formalized by Rothblum & Yona (2021), and initial algorithms for multi-group learning relied on a reduction to the outcome indistinguishability framework of Dwork et al. (2021). Tosh & Hsu (2022) introduced additional algorithms for achieving multi-group learning with improved, near-optimal sample complexity, discussed in Section 1.1. Both of these works, like ours, study this problem in the batch setting. Blum & Lykouris (2020) studies the online setting, where the goal is regret minimization with respect to multiple, possibly intersecting subgroups. Rittler & Chaudhuri (2024) considers the problem in the active learning setting, where the learner can decide which examples to obtain labels on. Blum et al. (2017); Haghtalab et al. (2022) study another learning model in which the learner may decide the number of samples to obtain from different distributions (which are not necessarily subsets of the input space). Multi-group agnostic PAC learning is also related to a recent strand of work in multicalibration for pro- Multi-group Learning for Hierarchical Groups viding more stringent calibration guarantees across multiple, possibly intersecting groups (H ebert-Johnson et al., 2018; Kim et al., 2019; Globus-Harris et al., 2023). Group and minimax fairness. A primary motivation for multi-group learning comes from the group fairness literature, where the goal is to achieve specific fairness criterion over multiple intersecting subgroups of individuals defined by characteristics such as sex or race. H ebert-Johnson et al. (2018) and Kearns et al. (2018) initiated the study of fairness across multiple intersecting subgroups constructed from different combinations of protected attribute values (e.g. intersections of race and gender). These works developed algorithms for various fairness notions in the algorithmic fairness literature (e.g. equalized odds or false positive parity). In the context of fairness, multi-group learning is related to the natural fairness criterion of minimax fairness, in which the objective is to minimize the maximum loss across all groups rather than achieving parity in losses (Diana et al., 2021). We note that, when fairness is a concern, gerrymandered, intersectional groups naturally admit a hierarchical group structure (Kearns et al., 2018) once the defining features are ordered. Slice discovery and hidden stratification. In many other learning scenarios, fine-grained high error subgroups are unknown ahead of time. A plethora of empirical work has uncovered various settings in which predictors with high overall accuracy predict poorly on meaningful subgroups of the input space (Buolamwini & Gebru, 2018; Oakden Rayner et al., 2020; Borkan et al., 2019). The burgeoning literature of slice discovery aims to develop methods that identify high-error subgroups that are not known a priori (Chung et al., 2019; Eyuboglu et al., 2022; Sohoni et al., 2020; Li et al., 2023). These slice discovery methods find these high-error subgroups by recursively slicing features into smaller and smaller subcategories, so the collection of found subgroups are presented in a hierarchical structure. This literature serves as a motivation to our work, but we note that, in multi-group learning, the collection of groups is known ahead of time (perhaps already discovered as a result of one of these methods). 2. Problem Setup 2.1. Setting and notation We are interested in the standard supervised statistical learning setting with additional group structure. We denote the input space as X, the output space as Y, and the decision space as Z. It is not necessary that Z = Y. We denote D as an arbitrary joint probability distribution over X Y. We denote S = {(x1, y1), . . . , (xn, yn)} as an i.i.d. set of random examples drawn from D. A predictor (hypothesis) is a function h : X Z. A benchmark hypothesis class H is a collection of such functions. We will compute probabilities and expectations with respect to the true distribution D, or the empirical distribution over S (where a new example (x, y) is drawn uniformly at random from S). We denote these as P[ ], E[ ] for the true distribution, and PS[ ], ES[ ] for the empirical distribution. For a bounded loss function ℓ: Z Y [0, 1], we denote the risk as LD(f) := ED[ℓ(f(x), y)] and the empirical risk as LS(f) := ES[ℓ(f(x), y)] = 1 n Pn i=1 ℓ(f(xi), yi). Multi-group learning notation. A group is a subset of the input space, g X. We will overload g to also denote the indicator function for the set, g(x) := 1 {x g}. G denotes the collection of groups of interest, so it is a collection of subsets of the input space. For a set of n examples, denote ng as the number of examples where x g, i.e. ng := Pn i=1 g(xi). For a distribution P (in our case, either D or S), we will write P[A | g] (respectively, E[X | g]) to denote the probability of an event A (respectively, expectation of a random variable X) conditioned on the event that, on a random X P, g(X) = 1 (i.e. X g). Our main metric for comparison will be the groupconditional risk: LD(f | g) := ED[ℓ(f(x), y) | g]. We also denote the group-conditional empirical risk as LS(f | g) := 1 ng Pn i=1 g(xi)ℓ(f(xi), yi). For a group g G, we use ˆhg arg minh H LS(h | g) to refer to a hypothesis in H that minimizes the empirical risk on only the examples from g. 2.2. Background on multi-group agnostic learning Multi-group (agnostic PAC) learning is a generalization of the traditional PAC learning setting (Valiant, 1984). The goal in multi-group learning is to construct a predictor that achieves small risk conditional on the membership of an example in a group, for a collection of (potentially overlapping) groups G simultaneously. Small risk is measured with respect to the best hypothesis in a benchmark hypothesis class, H. A multi-group learning algorithm for (G, H) takes as input a dataset of n i.i.d. labeled examples from a distribution D and outputs a predictor f : X Z. The goal of such an algorithm is to ensure that the group-wise excess errors LD(f | g) infh H LD(h | g) are small for all g G. Formally, we say that such an algorithm achieves the multi-group learning property with group-wise excess error bounds of ϵg(n, δ). Definition 2.1 (Multi-group learning property). A multigroup learning algorithm for (G, H) satisfies the multi-group learning property with group-wise excess error bounds ϵg( , ) for g G if it returns a predictor f : X Z such Multi-group Learning for Hierarchical Groups that, with probability at least 1 δ over n i.i.d. examples from D, LD(f | g) inf h H LD(h | g) ϵg(n, δ) for all g G. These group-wise excess errors will all be decreasing functions of ng, and it is straightforward to convert a group-wise excess error bound to an upper bound on the sample size ng sufficient to guarantee a given excess error bound ϵg > 0 for group g. Note that f is a single predictor that simultaneously achieves this objective with respect to the best hypothesis in every group. In fact, every group may have a different best predictor achieving infh H LD(h | g), and there may be no single h H that achieves small risk for all groups simultaneously, as Figure 1 shows. Thus, we focus on the improper learning setting, where f is allowed to be outside of H. Figure 1. No best h for all groups simultaneously. Letting H be the class of halfspaces, the groups g1 (indicated by the green solid line) and g2 (indicated by the red dotted line) overlap, but their optimal predictors hg1 and hg2 are much different. When X G, this is a generalization of the traditional agnostic PAC learning setting, where it is well-known that empirical risk minimization (ERM) is necessary and sufficient. Tosh & Hsu (2022) observe that, if we are a priori concerned with the conditional distribution on each g G, then the optimal excess error from ERM on each group is O( p log(|H|)/ng) for finite H and O( p d log(ng)/ng for H with VC dimension d < . They show that there exists an algorithm that achieves this at a near-optimal rate that incurs an extra O p log |G| factor for finite G, with rate O( p log(|G||H|)/ng). In Section 3.2, we give a much simpler, deterministic algorithm that achieves this rate in the case of hierarchically structured groups. We briefly remark that, when Rothblum & Yona (2021) introduced the learning model in Definition 2.1, they required the additional assumption of multi-PAC compatibility. This is the requirement that H contains a hypothesis that is nearly optimal for every group simultaneously. This precludes settings where the best hypothesis on each group can be vastly different, such as in Figure 1 or the practical settings discussed in Section 1. Therefore, we avoid this assumption and provide group-wise excess error bounds where the best h H is allowed to be different across the groups G. 2.3. Hierarchical group structure We note a couple important features of the collection of groups G in this learning model. First, G is fixed a priori and is known during both training and test. In addition, the group identity of each example is known during training and test; that is, g(x) can always be evaluated. In some applications to fairness, group membership for certain protected attributes may not be available at training or test time due to privacy or legal concerns (Veale & Binns, 2017; Holstein et al., 2019; Lahoti et al., 2020). Because of such restrictions, recent works in fairness have pivoted from demanding fairness on a small, fixed number of coarse demographic attributes (e.g. race, sex) to instead fixing a combinatorially large number of subgroups based on conjunctions of attributes (Kearns et al., 2018). For example, considering all the possible subgroups generated by conjunctions of race, sex, and age generates an exponentially large number of subgroups. This motivates the importance of obtaining near-optimal sample complexity rates that are logarithmic in |G|, the total number of subgroups. Subdividing an input space X based on relevant attributes in a given order generates an exponentially large number of hierarchically structured subgroups. We can also obtain a hierarchically structured collection of groups from dividing an input space top-down into categories and further subcategories. In the unsupervised machine learning context, hierarchically structured groups are obtained via, say, agglomerative clustering or a dyadic partitioning. Definition 2.2 (Hierarchically Structured Groups). A collection of groups G is hierarchically structured if, for every pair of distinct groups g, g G, exactly one of the following holds: 1. g g = (g and g are disjoint). 2. g g (g is contained in g ). 3. g g (g is contained in g). A hierarchically structured collection of groups also naturally admits a rooted tree or collection of trees, where each group g G is a node in the tree. Figure 2 gives an example. Definition 2.3 (Hierarchical Tree). Let X be an input space. A collection of hierarchically structured groups G on X with X G (without loss of generality) admits a (rooted) tree TG such that each node of TG is a group g, X is the root of TG, and, for every pair g, g TG: Multi-group Learning for Hierarchical Groups ... ... ... ... R1 R2 R7 R6+ . . . R8 ... ... R6+ male R6+ female R6+ male age < 35 35 age < 60 R6+ male R6+ male 60 age Figure 2. Example of a hierarchically structured tree. Each level of the tree above corresponds to a demographic attribute (race, sex, and age). Proceeding down the tree yields increasingly granular subgroups. The leaves are the most granular level, with subgroups such as R6+ male age < 35. 1. If g is a child of g , then g g . 2. If g and g are the same distance from the root, then g g = . Subdividing on attributes in this way can create different hierarchical trees depending on the order of attributes. For example, subdividing by race first then sex gives a different tree than subdividing on race first and then age (although the leaves are the same). In our experiments (Section 4), similar findings held for all orderings. In practice, this ordering may be decided by a domain expert. 3. Algorithms for learning hierarchical groups In this section, we analyze two multi-group learning algorithms for hierarchically structured groups. The first, in Section 3.1, is a baseline algorithm of Dwork et al. (2018) studied in the context of fairness, but we include a brief analysis in our setting as a warm-up. It does not achieve the state-of-the-art group error rates for multi-group learning. Our main algorithm, in Section 3.2, outputs a simple decision tree predictor that obtains the near-optimal excess error rates, satisfying a desideratum from Tosh & Hsu (2022) for the case of hierarchically structured G. We restrict attention to finite H; infinite H with finite VC dimenison are considered in Appendices B and C, along with the proofs. 3.1. Algorithm: Decoupled Group ERM We first analyze a simple algorithm for multi-group learning in the hierarchical setting: training a predictor on each disjoint leaf node. This procedure has been considered before in previous work in the context of fairness (Dwork et al., 2018), but we include an analysis of its sample complexity in the multi-group learning framework with hierarchical groups for completeness and comparison. Let G be a hierarchically structured collection of groups, with hierarchical tree TG. Let g1, . . . , g N be the leaves of the tree, and let ˆhgi arg minh H LS(h | gi) be the ERM predictor trained only on samples from leaf gi. Our predictor f : X Z is simply: f(x) := ˆhgi(x) if x gi, (1) which is well-defined if the leaves g1, . . . , g N form a partitioning of the input space X. We note that this decoupled predictor was originally proposed under the assumption that the groups partition the input space (Dwork et al., 2018), so we make that assumption here on the leaves. However, we note that our main algorithm in Section 3.2 avoids this assumption and works with general hierarchically structured groups. Theorem 3.1. Let H be a hypothesis class and let G be a collection of hierarchically structured groups with leaf nodes g1, . . . , g N partitioning the input space X. Let ℓ( , ) [0, 1] be any bounded loss function. Then, with probability 1 δ over n i.i.d. examples (x, y) D over X Y, f in Equation (1) satisfies the multi-group learning property: LD (f | g) inf h H LD (h | g) ϵg(n, δ), (2) for any g G such that g = Sk i=1 gi and ϵg(n, δ) := 9 Pk i=1 P[x gi] Pk j=1 P[x gj] log(8|G||H|/δ) This simple algorithm achieves multi-group learning, but it is not optimal. In the worst case (e.g., when all disjoint leaf nodes have the same probability mass), the error rate grows with p |G|, which is far larger than the p log |G| achievable by other methods, as discussed in Section 2.3. 3.2. Algorithm: Multi-group Tree In this section, we present MGL-Tree (Algorithm 1), which achieves the group-wise excess error of O( p log |H||G|/ng) with a simple and interpretable decision tree predictor. One can view MGL-Tree as an adaptation of the Prepend algorithm of Tosh & Hsu (2022) and Globus-Harris et al. (2022). Our analysis of MGL-Tree depends on the hierarchical structure of G and is very different from that of Prepend, leading to the improved near-optimal sample complexity rate. One unresolved issue stated in Tosh & Hsu (2022) was to find an algorithm for multi-group learning that is both simple and enjoys near-optimal sample complexity. If we were only concerned with the conditional distribution of a single fixed group, standard statistical learning theory suggests that, for finite H, ERM is necessary and sufficient to achieve an excess error of ϵg = O( p log(|H|)/ng). Multi-group Learning for Hierarchical Groups Tosh & Hsu (2022) gives an algorithm (different from Prepend) that achieves the near-optimal excess error of ϵg = O( p log(|H||G|)/ng). The resulting predictor, however, is a randomized ensemble of n (also) random predictors, one for each training sample, and requires the enumeration of a finite hypothesis class. On the other hand, Prepend outputs a simple, interpretable, and determnistic decision list predictor that avoids enumerating H. Its excess error, however, is not optimal: ϵg = O( 3p log(|H||G|)/(γng)), where γ := infg G PD[x g]. This trade-off between simplicity and optimal rates was an unresolved issue stated in Tosh & Hsu (2022). For hierarchical G, MGL-Tree avoids this trade-off between simplicity (and implementability) and near-optimal rates, as stated in Theorems 3.2 and C.3. This allows us to also conduct an empirical evaluation of MGL-Tree in Section 4. Theorem 3.2. Suppose H is a finite benchmark hypothesis class and G is a collection of hierarchically structured groups. Let ℓ( , ) [0, 1] be any bounded loss function. Then, with probability 1 δ over n i.i.d. examples (x, y) D over X Y, MGL-Tree taking input ϵn(g) := 18 2 log(|G||H|) + log(8/δ) outputs a predictor f satisfying the multi-group learning property with: LD(f | g) min h H L(h | g) 2ϵn(g) for all g G. (3) If H is infinite, with finite VC dimension d 1, then setting MGL-Tree with ϵn(g) := 18 2d log(16|G|n/δ) outputs a predictor f satisfying Equation (3). MGL-Tree gradually refines the decision tree predictor f by updating its behavior on children groups when a child s predictor is significantly better than its parent s by margin ϵn(g). Each node g in TG has an associated predictor f g. To evaluate f on the input x, we find the deepest g TG that contains x by following a path starting from the root X and moving from parent to child whenever the child contains x. The prediction f(x) is taken to be f g(x). The main tension that the algorithm resolves is whether to use a predictor for a coarser-grained group (say, the predictor for just R6+) vs. a predictor for a finer-grained group (say, the predictor for R6+ male age < 35). The finer-grained predictor may be more specific to the finergrained group, but the finer-grained group necessarily had less samples to train on, and, hence, higher variance. The Algorithm 1 MGL-Tree Require: 1: S, a training dataset. 2: Collection of hierarchically structured groups G 2X . 3: Error rates ϵn(g) (0, 1) for all g G Ensure: Decision tree f : X Z. 4: Order G into a hierarchical tree TG (Definition 2.3). 5: Initialize the root: f X := ˆh X . 6: for each node g TG \ {X} in breadth-first order do 7: Compute: errg := ES[ℓ(f g(x), y) | x g] ES[ℓ(ˆhg(x), y) | x g] ϵn(g). 8: if errg 0 then 9: Set f g := ˆhg. 10: else 11: Set f g := f pa(g), where pa(g) denotes the parent node of g. 12: end if 13: end for 14: return f : X Z, a decision tree predictor. coarser-grained predictor may be less specific to the finergrained group (and might, therefore, not pick up on the group-specific signal for the labels), but it had more samples to train on. By the hierarchical structure, coarser predictors are always the parents of finer-grained children. The main intuition of the algorithm is to resolve this tradeoff by only using the finer-grained group s predictor when it is significantly better than its coarser-grained parent, where the significance is ϵn(g) in the Step 7 of the algorithm. Lemma 3.3 below establishes a nice monotonicity property every update operation will never make the overall decision tree violate any error bounds it satisfied earlier in the search, so we always make forward progress towards a decision tree that satisfies the multi-group learning objective. To prove the correctness of MGL-Tree, we need to show the main property that the tree s predictions monotonically improve as MGL-Tree runs. This monotonicity property is similar in spirit to the key idea in the analysis of Prepend (Tosh & Hsu, 2022), but we need to exploit breadth-first search ordering on the hierarchical tree to attain this monotonicity property. We present the statements of the key lemma for proving correctness, and the explicit statement of correctness here in the main body. The full proof is in Appendix C. Lemma 3.3. Consider any step of Algorithm 1 where we are considering g G. Let fold be the decision tree at this step before updating (the state of the tree at line 7). Let ˆhg arg minh H LS(h | g) for all g G. Then, for all x g, fold(x) = hg (x) for some g g already visited by the algorithm. Multi-group Learning for Hierarchical Groups Theorem 3.4 (Correctness of MGL-Tree). Let G be a hierarchically structured collection of groups, let S = {(xi, yi)}n i=1 be n i.i.d. training data drawn from any distribution D over X Y, and let ϵn : G (0, 1) be any error rate function. Then, Algorithm 1 run on these parameters outputs a predictor f : X A satisfying: LS(f | g) inf h H LS(h | g) + ϵn(g), for all g G. This predictor also has the bonus that it is easily interpretable. By reading off the nodes of the decision tree predictor, one can interpret the source of model errors for a specific group from its subgroups deeper in the tree. For example, for a hierarchical collection of groups induced by race, sex, and age, one can check if errors on race groups stem from an intersection of race sex or the deeper intersection of race sex age one level down the tree. Because the guarantees from multi-group learning apply to any arbitrary bounded loss function, this property may be attractive in, say, fairness scenarios that demand some notion of fairness that can be encoded by a bounded loss function. We leave such applications for future work. 4. Empirical evaluation In this section, we perform empirical evaluations of Algorithm 1 for several classification problems with natural hierarchical group structure. 4.1. Experiment setup and learning task Models. For the benchmark hypothesis class H, we consider the following models: logistic regression, decision trees (with maximum depth 2, 4, and 8), random forests, and XGBoost. Implementation details are in Appendix E. Dataset overview. We conduct our experiments on twelve U.S. Census datasets from the Folktables package of Ding et al. (2021). These datasets include many demographic attributes or other protected features of individuals, and they are large enough to have intersections with substantial data even when subdividing on multiple attributes. Nine datasets come from each of the U.S. states of New York (NY), California (CA), and Florida (FL), where we consider the Employment, Income, and Coverage binary classification tasks. The other three nationwide datasets come from concatenating the data from 18 U.S. states, determined by the two most populous states in each U.S. Census designated region of the U.S. (U.S. Census, 2023) and considering the same three binary classification tasks. The binary classification tasks are as follows: Income. Predict whether employed individuals older than 16 make over $50,000 USD per year. Employment. Predict whether individuals older than 16 and younger than 90 are employed. Coverage. Predict whether individuals younger than 65 (ineligible for Medicare) with income less than $30,000 USD has coverage from public health insurance. For each of the nine state datasets, we consider two hierarchically structured collections of intersecting groups that arise naturally from subdividing demographic attributes of {race, sex, age} or {race, sex, edu}. For the three nationwide datasets, we divide on {state, race, age} or {state, race, sex}. In fairness settings, by defining the intersecting groups of interest in this way, we interpolate between a coarse, small collection of pre-defined groups from a single attribute to ensuring fairness for individuals (Kearns et al., 2018). Additional details are provided in Appendix F. Setup. For each benchmark hypothesis class, we consider four training procedures: ERM. The predictor from minimizing empirical risk over all the data: f arg minf H LS(f). Prepend. The decision list predictor output by Prepend. Details are given in Appendix D. MGL-Tree. The decision tree from Algorithm 1. Group ERM. The group-specific ERM predictor ˆhg arg minf H LS(f | g) (when evaluating performance on group g). For all ERM procedures, we optimize the logistic loss, a typical surrogate for the misclassification loss. We note that Group ERM is a benchmark procedure for if we were a priori concerned only about the conditional distribution for group g G (as discussed in Section 2.2). We evaluate the multi-group generalization of each predictor by measuring misclassification error on data restricted to each group from a held-out test set of 20% of the data. We repeat over 10 trials and plot the mean and standard errors. 4.2. Main findings The main findings of our empirical evaluations are: 1. MGL-Tree achieves multi-group learning. 2. MGL-Tree performs on par with or better than Prepend on all groups across all datasets. 3. MGL-Tree outperforms subgroup robust baseline methods on certain groups. 4. Simpler hypothesis classes lead to larger improvements when updating the predictor of MGL-Tree. We note that these findings are consistent across all twelve datasets. Due to space, we relegate most of the plots to Appendix H and present the results on a couple of the datasets (CA Employment and CA Income) in the main body. First, we empirically observe that MGL-Tree indeed Multi-group Learning for Hierarchical Groups achieves multi-group learning. This supports the upper bound on excess error of Theorem 3.2. Across all datasets, MGL-Tree performs on-par with or better than the minimum error achieved by the ERM or Group ERM predictors from the benchmark hypothesis class. For instance, for CA Employment, Figure 3 shows that MGL-Tree (labeled TREE ) has consistently lower error than both ERM and Group ERM for logistic regression, depth 2 decision trees, and random forests. In cases where MGL-Tree beats ERM by a significant amount, the predictor from MGL-Tree has identified a high-error subgroup on a specific slice of the population, which may prove useful for further auditing. Though MGL-Tree consistently matches or beats ERM and Group ERM on other datasets, the gaps in group generalization are sometimes not as stark; this may suggest that ERM and Group ERM with the benchmark hypothesis class already achieve close to the Bayes optimal risk. Some evidence supports this: on such datasets, even random forest or XGBoost tree ensembles which are known to be very effective on tabular datasets (Gardner et al., 2022) have comparable error to simpler benchmark classes. Second, MGL-Tree performs on par with or better than Prepend across almost all groups across all datasets. This suggests that, in the case of hierarchially structured groups, MGL-Tree is better at finding an ordering of group predictors that improves the accuracy of the model. This might not be surprising, as MGL-Tree directly exploits an inductive bias towards hierarchically structured groups by constructing its decision tree with knowledge of the hierarchial structure. Prepend, on the other hand, constructs a decision list with a possibly arbitrary ordering of groups and group predictors. We observe this in Figure 3 for the CA Income dataset, with further discussion in Section 4.3. Third, MGL-Tree sometimes outperforms more complex random forest and XGBoost baselines even when using relatively simple benchmark hypothesis classes. A couple of recent works suggest that simply running ERM with a tree ensemble method such as random forest or XGBoost achieves good multi-group performance (Pfohl et al., 2022; Gardner et al., 2022). For example, we observed that MGL-Tree run with the simple benchmark hypothesis class of linear predictors (specifically, logistic regression models) yields a substantial improvement over the ERM random forest baseline on almost all groups in the CA Employment dataset. See Figure 5 in Appendix H for more details. Finally, we also observe that, for simpler model classes, the predictor from MGL-Tree yields a larger improvement over baseline ERM and Group ERM training compared to more complex model classes. We see this in both datasets in Figure 3, where the simplest model class of depth-2 decision trees has the largest gaps from the y = x line. The more complex hypothesis class of linear models from lo- gistic regression yields points closer to the line. The most complex classes of random forests and XGBoost see even less improvement than both logistic regression and depth-2 decision trees from running MGL-Tree, as evident from how many points hug the y = x line. This suggests that for the predictor from MGL-Tree, improvements in subgroup generalization compared to the benchmark hypothesis class becomes more difficult as the class becomes more expressive. This behavior is also predicted from the dependence on H in our excess error rates in Theorem 3.2. 4.3. Comparing Prepend and MGL-Tree One main finding in Section 4.2 was that MGL-Tree generalizes on par with or better than Prepend on all groups across all datasets. Figure 3 shows two examples of this, and Appendix H contains more examples. In this section, we compare the final predictors output by Prepend and MGL-Tree on two datasets from our empirical study when the benchmark class is logistic regression. We find that, perhaps unsurprisingly, MGL-Tree achieves better generalization than Prepend because it prefers using the groupspecific hypotheses for coarser groups at higher levels in the tree, while Prepend tends to overfit to the finer-grained subgroups at the leaves. We summarize these findings here and refer the reader to Appendix G for the full breakdown of the final predictors from MGL-Tree and Prepend. CA Employment. From the upper-right plot in Figure 3, we observe that, for most groups, MGL-Tree performs on par with Prepend when both are instantiated with logistic regression (blue dots). There are still several groups where MGL-Tree performs better, as illustrated by the blue dots above the y = x line, but the vast majority of the groups are on par with Prepend. Upon a closer examination of the group-specific predictors used by Prepend or MGL-Tree, we find that, for CA Employment, Prepend and MGL-Tree mostly resort to using leaf node predictors. In the case of the race-sexage groups, the leaves correspond to conjunctions of three attributes such as R6+ male age < 35. As such, the final predictor for both algorithms is almost identical, but both show a marked improvement over simply using ERM for most groups, as shown in the upper-left plot in Figure 3. CA Income. From the lower-right plot in Figure 3, on the other hand, we see that MGL-Tree generalizes better than Prepend on many groups. Upon inspection of the final predictor, we observe that MGL-Tree employs coarsergrained subgroup predictors further up the tree (e.g. using ALL or R1) on many of the leaves, while, on most groups, Prepend predicts with finer-grained subgroup predictors. This suggests that the arbitrary order of the Prepend decision list can overfit to finer-grained subgroups that may Multi-group Learning for Hierarchical Groups 0.0 0.2 0.4 0.0 Test error (ERM) race-sex-age (ERM vs. TREE) 0.0 0.2 0.4 0.0 Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.0 0.2 0.4 0.0 Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.2 0.4 Test error (TREE) Test error (ERM) 0.0 0.2 0.4 Test error (TREE) Test error (G-ERM) 0.0 0.2 0.4 Test error (TREE) Test error (PREP) Logistic Regression Decision Tree2 Random Forest XGBoost Figure 3. Test accuracy on race-sex-age groups for CA Employment (top row) and CA Income (bottom row). Each point in the plot represents the test error on a specific group. The y = x line represents equal error between our MGL-Tree and the competing method; points above the y = x line are groups where MGL-Tree exhibits better generalization. not have sufficient samples for the finer-grained predictors to generalize. On the other hand, MGL-Tree only predicts with a finer-grained subgroup s predictor when its coarser-grained parent has sufficiently high error due to its breadth-first search on the hierarchical tree. 5. Conclusion In this paper, we study the multi-group (agnostic PAC) learning framework in the natural case where the groups are hierarchically structured. We give an algorithm that achieves multi-group learning by constructing a decision tree of predictors for each group in the hierarchical collection of groups. This algorithm achieves multi-group learning with a near-optimal group-wise excess error of O( p log(|H||G|)/ng) and outputs a simple and determin- istic predictor, addressing an issue posed by (Tosh & Hsu, 2022) for the case of hierarchically structured groups. We then conduct an extensive empirical evaluation of our algorithm and find that it achieves attractive generalization properties on real datasets with hierarchical group structure, as the theory suggests. Acknowledgements We acknowledge support from the NSF under grant IIS2040971. Impact Statement This work focuses on a learning setting initially inspired by the fairness literature (see, e.g., Diana et al. (2021), H ebert- Multi-group Learning for Hierarchical Groups Johnson et al. (2018), Kearns et al. (2018), Buolamwini & Gebru (2018)) requiring some fairness criterion to be met on multiple, possibly overlapping subgroups of individuals. If our algorithms are used in the fairness setting, it is important to note that the jury is still out on what definition of algorithmic fairness is philosophically and legally sound. Our particular learning objective corresponds to ensuring that every subgroup is as well-off as the best possible hypothesis for that subgroup; however, this is at odds with a fairness definition such as, for instance, statistical parity (Barocas et al., 2023). We also note that our setting may be used in applications more general than fairness, when fine-grained accuracy is the only concern. In those cases, the usual societal implications of the application area apply. Balsubramani, A., Dasgupta, S., Freund, y., and Moran, S. An adaptive nearest neighbor rule for classification. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. Barocas, S., Hardt, M., and Narayanan, A. Fairness and Machine Learning: Limitations and Opportunities. MIT Press, 2023. Bissoto, A., Fornaciali, M., Valle, E., and Avila, S. (de) constructing bias on skin lesion datasets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 0 0, 2019. Blum, A. and Lykouris, T. Advancing Subgroup Fairness via Sleeping Experts. In Vidick, T. (ed.), 11th Innovations in Theoretical Computer Science Conference (ITCS 2020), volume 151 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 55:1 55:24, Dagstuhl, Germany, 2020. Schloss Dagstuhl Leibniz-Zentrum f ur Informatik. ISBN 978-395977-134-4. doi: 10.4230/LIPIcs.ITCS.2020.55. URL https://drops.dagstuhl.de/entities/ document/10.4230/LIPIcs.ITCS.2020.55. Blum, A., Haghtalab, N., Procaccia, A. D., and Qiao, M. Collaborative PAC Learning. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. Borkan, D., Dixon, L., Sorensen, J., Thain, N., and Vasserman, L. Nuanced metrics for measuring unintended bias with real data for text classification. In Companion proceedings of the 2019 world wide web conference, pp. 491 500, 2019. Buolamwini, J. and Gebru, T. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, pp. 77 91. PMLR, January 2018. URL https://proceedings.mlr. press/v81/buolamwini18a.html. ISSN: 26403498. Chen, T. and Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785 794, August 2016. doi: 10.1145/2939672.2939785. URL http://arxiv. org/abs/1603.02754. ar Xiv:1603.02754 [cs]. Chung, Y., Kraska, T., Polyzotis, N., Tae, K. H., and Whang, S. E. Slice finder: Automated data slicing for model validation. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 1550 1553. IEEE, 2019. De Vries, T., Misra, I., Wang, C., and Van der Maaten, L. Does object recognition work for everyone? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp. 52 59, 2019. De Grave, A. J., Janizek, J. D., and Lee, S.-I. AI for radiographic COVID-19 detection selects shortcuts over signal. Nature Machine Intelligence, 3(7):610 619, July 2021. ISSN 2522-5839. doi: 10.1038/ s42256-021-00338-7. URL https://www.nature. com/articles/s42256-021-00338-7. Number: 7 Publisher: Nature Publishing Group. Diana, E., Gill, W., Kearns, M., Kenthapadi, K., and Roth, A. Minimax group fairness: Algorithms and experiments. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 66 76, 2021. Ding, F., Hardt, M., Miller, J., and Schmidt, L. Retiring adult: New datasets for fair machine learning. Advances in neural information processing systems, 34:6478 6490, 2021. Dwork, C., Immorlica, N., Kalai, A. T., and Leiserson, M. Decoupled classifiers for group-fair and efficient machine learning. In Conference on fairness, accountability and transparency, pp. 119 133. PMLR, 2018. Dwork, C., Kim, M. P., Reingold, O., Rothblum, G. N., and Yona, G. Outcome indistinguishability. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp. 1095 1108, 2021. Eyuboglu, S., Varma, M., Saab, K., Delbrouck, J.-B., Lee Messer, C., Dunnmon, J., Zou, J., and R e, C. Domino: Discovering systematic errors with cross-modal embeddings. ar Xiv preprint ar Xiv:2203.14960, 2022. Gardner, J., Popovic, Z., and Schmidt, L. Subgroup robustness grows on trees: An empirical baseline investigation. Advances in Neural Information Processing Systems, 35: 9939 9954, 2022. Multi-group Learning for Hierarchical Groups Globus-Harris, I., Kearns, M., and Roth, A. An algorithmic framework for bias bounties. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 1106 1124, 2022. Globus-Harris, I., Harrison, D., Kearns, M., Roth, A., and Sorrell, J. Multicalibration as boosting for regression. In Proceedings of the 40th International Conference on Machine Learning, ICML 23. JMLR.org, 2023. Haghtalab, N., Jordan, M., and Zhao, E. On-Demand Sampling: Learning Optimally from Multiple Distributions. Advances in Neural Information Processing Systems, 35: 406 419, December 2022. Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. Advances in neural information processing systems, 29, 2016. H ebert-Johnson, U., Kim, M., Reingold, O., and Rothblum, G. Multicalibration: Calibration for the (computationallyidentifiable) masses. In International Conference on Machine Learning, pp. 1939 1948. PMLR, 2018. Holstein, K., Vaughan, J. W., Daum e III, H., Dud ık, M., and Wallach, H. Improving fairness in machine learning systems: What do industry practitioners need? In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1 16, May 2019. doi: 10.1145/3290605.3300830. URL http://arxiv. org/abs/1812.05239. ar Xiv:1812.05239 [cs]. Kearns, M., Neel, S., Roth, A., and Wu, Z. S. Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. In Proceedings of the 35th International Conference on Machine Learning, pp. 2564 2572. PMLR, July 2018. ISSN: 2640-3498. Kim, M. P., Ghorbani, A., and Zou, J. Multiaccuracy: Black-Box Post-Processing for Fairness in Classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES 19, pp. 247 254, New York, NY, USA, January 2019. Association for Computing Machinery. ISBN 978-1-4503-6324-2. doi: 10.1145/3306618.3314287. URL https://dl. acm.org/doi/10.1145/3306618.3314287. Kuehlkamp, A., Becker, B., and Bowyer, K. Gender-fromiris or gender-from-mascara? In 2017 IEEE winter conference on applications of computer vision (WACV), pp. 1151 1159. IEEE, 2017. Lahoti, P., Beutel, A., Chen, J., Lee, K., Prost, F., Thain, N., Wang, X., and Chi, E. Fairness without Demographics through Adversarially Reweighted Learning. In Advances in Neural Information Processing Systems, volume 33, pp. 728 740. Curran Associates, Inc., 2020. Li, D., Nguyen, H. L., and Zhang, H. R. Identification of negative transfers in multitask learning using surrogate models. ar Xiv preprint ar Xiv:2303.14582, 2023. Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730 3738, 2015. Oakden-Rayner, L., Dunnmon, J., Carneiro, G., and R e, C. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In Proceedings of the ACM conference on health, inference, and learning, pp. 151 159, 2020. Orr, L., Leszczynski, M., Arora, S., Wu, S., Guha, N., Ling, X., and Re, C. Bootleg: Chasing the tail with selfsupervised named entity disambiguation. ar Xiv preprint ar Xiv:2010.10363, 2020. Pfohl, S. R., Zhang, H., Xu, Y., Foryciarz, A., Ghassemi, M., and Shah, N. H. A comparison of approaches to improve worst-case predictive model performance over patient subpopulations. Scientific Reports, 12(1): 3254, February 2022. ISSN 2045-2322. doi: 10.1038/ s41598-022-07167-7. URL https://www.nature. com/articles/s41598-022-07167-7. Number: 1 Publisher: Nature Publishing Group. Rittler, N. and Chaudhuri, K. Agnostic multi-group active learning. Advances in Neural Information Processing Systems, 36, 2024. Rothblum, G. N. and Yona, G. Multi-group agnostic pac learnability. In International Conference on Machine Learning, pp. 9107 9115. PMLR, 2021. Sohoni, N., Dunnmon, J., Angus, G., Gu, A., and R e, C. No subclass left behind: Fine-grained robustness in coarsegrained classification problems. Advances in Neural Information Processing Systems, 33:19339 19352, 2020. Tosh, C. J. and Hsu, D. Simple and near-optimal algorithms for hidden stratification and multi-group learning. In Proceedings of the 39th International Conference on Machine Learning, pp. 21633 21657. PMLR, June 2022. ISSN: 2640-3498. U.S. Census. U.S. Census Bureau Regions and Divisions of the United States, December 2023. URL https: //www2.census.gov/geo/pdfs/maps-data/ maps/reference/us_regdiv.pdf. Valiant, L. G. A theory of the learnable. Communications of the ACM, 27(11):1134 1142, November 1984. ISSN 0001-0782. doi: 10.1145/1968.1972. URL https:// dl.acm.org/doi/10.1145/1968.1972. Multi-group Learning for Hierarchical Groups Varma, M., Orr, L., Wu, S., Leszczynski, M., Ling, X., and R e, C. Cross-Domain Data Integration for Named Entity Disambiguation in Biomedical Text. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 4566 4575, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp. 388. URL https://aclanthology.org/2021. findings-emnlp.388. Veale, M. and Binns, R. Fairer machine learning in the real world: Mitigating discrimination without collecting sensitive data. Big Data & Society, 4(2):2053951717743530, December 2017. ISSN 2053-9517. doi: 10.1177/ 2053951717743530. URL https://doi.org/10. 1177/2053951717743530. Publisher: SAGE Publications Ltd. Multi-group Learning for Hierarchical Groups A. Technical Lemmas The main technical lemma we will employ in our proofs is a result from Tosh & Hsu (2022) that establishes the uniform convergence of risks conditional on group membership. This generalizes a lemma from Balsubramani et al. (2019) that gives a rate for the uniform convergence of empirical conditional measures. Lemma A.1 (Theorem 1 from Tosh & Hsu (2022)). Let H be a hypothesis class, let G be a collection of groups, and let ℓ: Z Y [0, 1] be a bounded loss function. Given an i.i.d. sample of size n drawn from a distribution D over X Y, with probability at least 1 δ, |LD(h | g) LS(h | g)| 9 2 log(S2n(H)S2n(G)) + log(8/δ) for all h H and g G, where Sk(C) is the kth shattering coefficient of the collection C of sets, and H := {(x, y) 7 1 {ℓ(h(x), y) t} : h H, t R} is the set of (boolean) functions constructed from composing the hypothesis class H with the loss ℓ( , ). Two immediate consequences of Lemma A.1 are Lemmas A.2 and A.3. Lemma A.2. In the setting of Lemma A.1, if H and G are finite, then |LD(h | g) LS(h | g)| 9 2 log(|H||G|) + log(8/δ) for all h H and all g G. Proof. For a finite set, the shattering coefficient is bounded by the size of the set, so S2n(H) |H| and S2n(G) |G|. Lemma A.3. In the setting of Lemma A.1, if H has VC dimension d > 0, then |LD(h | g) LS(h | g)| 9 2d log(2|G|n) + log(8/δ) Proof. By Sauer s lemma, S2n(H) Pd i=0 n i (2n)d. We will also need the following two lemmas for decomposing group conditional risks for hierarchically structured G. Lemma A.4. Let g1, . . . g N be N disjoint groups, and let S = {(xi, yi)}n i=1 be a dataset of n i.i.d. examples. Let n g be the number of examples in SN k=1 gk, and let ngk be the number of examples in gk. Then, the group conditional empirical risk for SK k=1 gk decomposes as follows: ngk n g LS(f | gk). (4) Proof. By definition of group conditional empirical risk, ℓ(f(xi), yi). We note that g1, . . . , g N are disjoint, so: ℓ(f(xi), yi) = 1 n g k=1 gk(xi)ℓ(f(xi), yi) i=1 gk(xi)ℓ(f(xi), yi) k=1 ngk LS (f | gk) = ngk n g LS (f | gk) . Multi-group Learning for Hierarchical Groups In the first equality, the gk(x) are just indicators for disjoint groups, so that immediately follows from boolean algebra on 1 n x SN k=1 gk o . The second equality just switches order of summation, and the third is the definition of group conditional empirical risk again. Lemma A.5. Let g1, . . . g N be N disjoint groups. Then, the group conditional risk for SN k=1 gk decomposes: P[x gk] PK j=1 P[x gj] LD (f | gk) . (5) Proof. By definition of group conditional risk, LD(f | g) = E [ℓ(f(x), y) | x g] . We first claim that LD(f | g) = 1 P[x g]E [g(x)ℓ(f(x), y)] . This follows from: E [g(x)ℓ(f(x), y)] = E [E [g(x)ℓ(f(x), y) | x g]] = E [g(x)E [ℓ(f(x), y) | x g]] = P[x g]E [ℓ(f(x), y) | x g] = P[x g]LD(f | g). Using this fact, we can re-write the group conditional risk as: P h x SN j=1 gj i E Because g1, . . . , g N are disjoint, we can use additivity of P( ): = 1 PN j=1 P [x gj] E k=1 gk(x)ℓ(f(x), y) = 1 PN j=1 P[x gj] k=1 E [gk(x)ℓ(f(x), y)] Using the same fact that P [x gk] LD (f | gk) = E [gk(x)ℓ(f(x), y)], we get the desired result: P [x gk] PN j=1 P [x gj] LD(f | gk). B. Proof of Theorem 3.1 We give the full version of Theorem 3.1 here, including the case where H has finite VC dimension. Theorem B.1. Let H be a hypothesis class with VC dimension d > 0 and let G be a collection of hierarchically structured groups with leaf nodes g1, . . . , g N partitioning the input space X. Let ℓ( , ) [0, 1] be any bounded loss function. Then, with probability 1 δ over n i.i.d. examples (x, y) D over X Y, f in Equation (1) satisfies the multi-group learning property with LD (f | g) inf h H LD (h | g) 9 P[x gi] Pk j=1 P[x gj] 2d log(16|G|n/δ) for any g G such that g = Sk i=1 gi. For finite H, LD (f | g) inf h H LD (h | g) 9 P[x gi] Pk j=1 P[x gj] log(8|G||H|/δ) Multi-group Learning for Hierarchical Groups Proof. We first show the proof for H with finite VC dimension d. For each disjoint atomic group gi, for i [N], the behavior of f is simply to use ˆhgi. We know from uniform convergence of conditional risks (Lemma A.3) that LD(f | gi) = LD(ˆhgi | gi) inf h H LD(h | gi) + 9 2d log(16|G|n/δ) where ngi is the number of examples, all from group gi. Consider the union of k of the N disjoint atomic groups, Sk i=1 gi (without loss of generality for index i, let gi be any one of the N groups). By Lemma A.5, P(gi) Pk j=1 P(gj) LD(f | gi). By Equation (8) and the definition of the predictor f from Section 3.1, P(gi) Pk j=1 P(gj) LD(f | gi) = P(gi) Pk j=1 P(gj) LD(ˆhgi | gi) P(gi) Pk j=1 P(gj) LD(h gi | gi) + 9 2d log(16|G|n/δ) where h gi arg minh H LD(h | gi). Now, let h gi arg minh H LD(h | Sk i=1 gi). Because each h gi is optimal for their conditional risks on their respective group gi, P(gi) Pk j=1 P(gj) LD(h gi | gi) + 9 2d log(16|G|n/δ) P(gi) Pk j=1 P(gj) LD(h gi | gi) + P(gi) Pk j=1 P(gj) 9 2d log(16|G|n/δ) P(gi) Pk j=1 P(gj) 9 2d log(16|G|n/δ) whcih is our our result. The final equality is from applying Lemma A.5 again to combine the first term. The proof for finite H is identical, but use the finite H uniform convergence bound for conditional risks in Lemma A.2 instead of Equation (8). C. Proof of Theorems 3.2 and C.3 We now present the proof of MGL-Tree (Algorithm 1). This algorithm outputs a final decision tree predictor, f : X Z. Each node of the decision tree is a group g G with an associated working predictor f g : X Z. The goal of the algorithm is to determine a good working predictor for each node of the tree. For each group g G and an i.i.d. sample S, we ll denote ˆhg to be the ERM minimizer of group conditional empirical risk: ˆhg := arg min h H LS(h | g) We construct the decision tree as follows. First, we generate the hierarchical tree TG (Definition 2.3) from the hierarchically structured collection of groups G. For simplicity, the root is X. In this tree, every node g is a subset of its ancestors. We begin by initializing the root node s working predictor f X := ˆh X , the ERM minimizer of group conditional empirical risk for all of X this is just standard unconditional empirical risk LS(h). To assign working predictors f g to each node, we start from the root of the tree where g = X and visit all |G| nodes in the tree in breadth-first order. Let f pa(g) denote the working predictor for the parent of node g. The main idea is to set the Multi-group Learning for Hierarchical Groups working predictor at node g to f g := ˆhg only if its parent is insufficient for achieving the desired margin of error ϵn(g). Otherwise, node g inherits its working predictor from its parent: f g := f pa(g). To show that Algorithm 1 is correct, the key is to prove a monotonicity property: at each update operation, the algorithm does not violate any error bounds for groups further up the tree. Let fold denote the state of the decision tree before an update operation. In Algorithm 1, this corresponds to the state of the decision tree at line 7 in each iteration of the main BFS loop. We will analyze fold in the proofs of Theorem C.2 and our main Theorems 3.2 and C.3. We state one more obvious lemma concerning the behavior of f when we choose to inherit the parent s working predictor, f pa(g). When we do this, f is functionally equivalent to fold on group g and all the nodes on the path from g back up to the root. This is referred to as 3.3 in the main body, and is restated here as Lemma C.1 for convenience. Lemma C.1 (Behavior of fold). Consider any step of Algorithm 1 where we are considering g G. Let fold be the decision tree at this step before updating (the state of the tree at line 7). Let ˆhg arg minh H LS(h | g) for all g G. Then, for all x g, fold(x) = hg (x) for some g g already visited by the algorithm. Proof. This just follows by induction. For the first step of Algorithm 1, fold is simply h arg minh H LS(h), the ERM predictor over all of X. Of course, g X for any g. Assume the lemma for all g G , the set of already visited nodes. Suppose we are on step g G in our BFS. Then, if x g, by hierarchical structure and BFS, x g for some g g because we ve visited all parents before their children. Therefore, fold uses hg for some g g. Lemma C.1 allows us to apply Lemma A.1, our conditional uniform convergence bound, on fold, as it is functionally equivalent to some h H, our benchmark hypothesis class. The key to Algorithm 1 is that each update operation at any node g does not make f violate the error bounds it satisfied further up the tree. We observe that, at an update iteration (when errg 0), either we accept the (conditional) ERM predictor f g := ˆhg or we inherit the parent s working predictor f g := f pa(g). See Figure 4. Figure 4. Let g1, the yellow node, be a group that Algorithm 1 has already seen. Suppose f updates on g2. We see that g1 is on the path from g2 to the root. We need to show that the inequality for g1 is not violated after the update. The next theorem, Theorem C.2, establishes the correctness of the algorithm when it terminates. This is stated in the main body as Theorem 3.4, and we state it here as Theorem C.2. Theorem C.2 (Correctness of MGL-Tree). Let G be a hierarchically structured collection of groups, let S = {(xi, yi)}n i=1 be n i.i.d. training data drawn from any distribution D over X Y, and let ϵn : G (0, 1) be any error rate function. Then, Algorithm 1 run on these parameters outputs a predictor f : X A satisfying: LS(f | g) inf h H LS(h | g) + ϵn(g), for all g G. (9) Multi-group Learning for Hierarchical Groups Proof. Consider any g G, which corresponds to a node in the decision tree, f. We analyze the step of the algorithm s breadth-first search concerned with g and argue that this step does not violate any inequalities of the form (9) satisfied up until this step. This is sufficient to prove correctness because g is an arbitrary node, and the tree traversal will visit every node, so (9) will be satisfied for all nodes in the tree. On the current step for g , the algorithm can either decide to update or not. If the if-condition is not satisfied and it does not update, then we keep f := fold from the previous round, and we are done. All inequalities previously satisfied must continue to be satisfied because f did not change. Suppose, then, that the algorithm did update. Let Pg := {ˆg1, . . . , ˆgk} be the set of nodes on the path from g to the root of the tree, including the root. Then, for all nodes g Pg , f continues to satisfy (9) because g g = , so for x g, f(x) = fold(x), as before. This is due to the hierarchical structure any node not on the path from g back up to the root must be disjoint from g . Now, consider our final case: any ˆgj Pg for j [k], a node back up to the root from g . Again, we are in the case where we updated, so f has changed. By the hierarchical structure, if ˆgj is further up the tree from g , then g ˆgj. We need to show that LS(f | ˆgj) LS(fold | ˆgj). Denote g as the complement of g, and apply Lemma A.4: LS(f | ˆgj) = LS(f | (ˆgj g ) (ˆgj g )) = LS(f | ˆgj g )Pr[x ˆgj g ] Pr[x ˆgj] + LS(f | ˆgj g )Pr[x (ˆgj g )] = LS(f | g ) Pr[x g | x ˆgj] + LS(f | ˆgj g ) Pr[x g | x ˆgj] = LS Pr[x g | x ˆgj] + LS(fold|ˆgj g ) Pr[x g | x ˆgj]. (10) The last equality is a result of how the f operates as a decision tree. Observe that, on x g , our updated decision tree f uses h arg minh H LS(h | g ). On x ˆgj g , our decision tree f simply operates as it did before the update: f(x) = fold(x). By definition of h as ERM predictor, LS(h | g ) LS(fold | g ), (11) so combining (10) and (11), LS(f | ˆgj) = LS(h | g ) Pr[x g | x ˆgj] + LS(fold | ˆgj g ) Pr[x g | x ˆgj] LS(fold | g ) Pr[x g | x ˆgj] + LS(fold | ˆgj g ) Pr[x g | x ˆgj] = LS(fold | ˆgj) min h H LS(h | ˆgj) + ϵn(ˆgj), (12) where the final equality in (12) follows from another application of Lemma A.4. Because LS(f | ˆgj) min h H LS(h | ˆgj) + ϵn(ˆgj), (13) where f is the updated decision list, we see that f does not violate any of the inequalities for the nodes ˆgj in the path up to the root, finishing our proof. We state the result for H with finite VC dimension to accompany Theorem 3.2 here. Theorem C.3. For H with VC dimension bounded by d > 0 and the other conditions of Theorem 3.2, running Algorithm 1 with input ϵn(g) := 18 2d log(16|G|n/δ) outputs a predictor f that achieves LD(f | g) inf h H L(h | g) + 36 2d log(16|G|n/δ) ng for all g G. (14) We may now prove Theorem 3.2 and Theorem C.3 for the multi-group learning guarantee of Algorithm 1, MGL-Tree. Multi-group Learning for Hierarchical Groups Proof. We will show this by induction on each iteration (visited group g) of Algorithm 1 for finite G and H. Condition on the event that we drew our i.i.d. dataset of size n and Lemma A.2 holds. For ease of notation, we will denote 2 log(|G||H|) + log(8/δ) so, for all groups g G, we have, uniformly over h H: |LD(h | g) LS(h | g)| UC(g). Note that we choose ϵn(g) = 2UC(g). Our goal will thus be to show that, for all g G, the decision tree f satisfies LD(f | g) min h H LD(h | g) + 4UC(g). (15) For the base case, we just need to show that our starting decision tree f, where each node is initialized with h0 arg minh H LS(h), satisfies inequality (15) for g = X. Note that this is just standard unconditional risk. This is immediately true from standard uniform convergence and f using the ERM predictor h0, so LD(f) min h H LD(h) + 4UC(g). For the inductive hypothesis, assume that we are on the step of the BFS in Algorithm 1 concerned with a group g G. Denote G as the nodes that we already visited in our BFS, and fold as the decision list before the possible update, as before. Then, LD(fold | g ) min h H LD(h | g ) + 4UC(g ) (16) holds for all g G . To prove the induction, we aim to show that (15) is true for our current iteration s node g as well, regardless of whether we updated f or not. Let f be the decision list after the update step of Algorithm 1, and let ˆhg arg minh H LS(h | g). We want to show, for all h H: LD(f | g) LD(h | g) 4UC(g) (17) LD(f | g ) LD(h | g ) 4UC(g ), for all g G . (18) So, showing (17) and (18) are our goals for each of our two cases: whether we update or not. These two cases depend on whether LS(fold | g) LS(ˆhg | g) ϵn(g) or LS(fold | g) LS(ˆhg | g) > ϵn(g), the central comparison in our algorithm. Suppose we are in the first case, when we do not update. Because errg 0, we have LS(fold | g) LS(ˆhg | g) ϵn(g). Then, because we do not update, f = fold, so (18) is immediately fulfilled because f is functionally equivalent to fold, which, by induction satisfied the inequality already for all g G . It suffices to show (17) for this case. Fix any h H. First, with Lemma C.1, we can apply conditional uniform convergence on f = fold. Then, LD(f | g) LD(h | g) = LD(fold | g) LD(h | g) LS(fold | g) LD(h | g) + UC(g) LS(fold | g) LS(h | g) + 2UC(g) LS(fold | g) LS(ˆhg | g) + 2UC(g) ϵn(g) + 2UC(g) = 4UC(g). The first inequality is from Lemma C.1 and Lemma A.2. The third inequality is from the fact that ˆhg is the optimal ERM predictor conditioned on x g. This proves (17) for the first case where we do not update f. Suppose we are in the second case. In this case, we do update and we have that LS(fold | g) LS(ˆhg | g) > ϵn(g). In this case, fold is the decision tree before the update, and f is the tree after the update. Its working predictor has been updated to ˆhg. Immediately, we have LS(f | g) LS(ˆhg | g) = 0 for the current node g, so, for any h H, LD(f | g) LD(h | g) = LD(ˆhg | g) LD(h | g) LS(ˆhg | g) LS(h | g) + 2UC(g) 2UC(g) 4UC(g). Multi-group Learning for Hierarchical Groups The first equality is because f is functionally equivalent to ˆhg on all x g, and the first inequality comes from applying Lemma A.2 twice to get sample risk for h and ˆhg. This proves (17). It suffices to prove (18). Consider any g G , the set of already visited groups. There are two types nodes in G : g p, the nodes on the path back up to the root from g, and g np, the nodes not on the path back up to the root from g. For any g np, by hierarchical structure, g np g = . So, for all x g np, the predictor just outputs as it did before the update: f = fold. Then, for any h H, we maintain the same guarantee we had before, fulfilling (18) for all g np: LD(f | g np) LD(h | g np) = LD(fold | g np) LD(h | g np) 4UC(g np). To finish the proof, it suffices to show that (18) is still fulfilled for all g p, the nodes on a path back up to the root from g. Again, fix some h H. Using Lemma A.5: LD(f | g p) = LD(f | (g p g) (g p \ g)) = Pr[x (g p g)] Pr[x g p] LD(f | g p g) + Pr[x (g p \ g)] Pr[x g p] LD(f | g p \ g) Pr[x g p]LD(f | g) + Pr[x (g p \ g)] Pr[x g p] LD(f | g p \ g). The last equality comes from g g p because all nodes are contained in their ancestors. The updated f now uses ˆhg for all x g. Therefore: LD(f | g p) = Pr[x g] Pr[x g p]LD(ˆhg | g) + Pr[x (g p \ g)] Pr[x g p] LD(fold | g p \ g). Apply Lemma A.2 for conditional uniform convergence on ˆhg: LD(f | g p) Pr[x g] Pr[x g p]LS(ˆhg | g) + Pr[x (g p \ g)] Pr[x g p] LD(fold | g p \ g) + Pr[x g] Pr[x g p]UC(g). Adding and subtracting ϵn(g) to use the fact that we are in the update case where LS(fold | g) > LS(ˆhg | g) + ϵn(g): Pr[x g p]LS(ˆhg | g) + Pr[x g] Pr[x g p]ϵn(g) + Pr[x (g p \ g)] Pr[x g p] LD(fold | g p \ g) + Pr[x g] Pr[x g p]UC(g) Pr[x g] Pr[x g p]ϵn(g) Pr[x g p]LS(fold | g) + Pr[x (g p \ g)] Pr[x g p] LD(fold | g p \ g) + Pr[x g] Pr[x g p]UC(g) Pr[x g] Pr[x g p]ϵn(g). Finally, using Lemma C.1 on fold on x g, applying Lemma A.2 again, and recombining terms with Lemma A.5, Pr[x g p]LD(fold | g) + Pr[x (g p \ g)] Pr[x g p] LD(fold | g p \ g) + 2 Pr[x g] Pr[x g p] UC(g) Pr[x g] Pr[x g p]ϵn(g) = LD(fold | g p) + 2 Pr[x g] Pr[x g p] UC(g) Pr[x g] Pr[x g p]ϵn(g) LD(h | g p) + 4UC(g p). The final line follows by our choice of ϵn(g) = 2UC(g) and the inductive hypothesis on g p. This shows (18) for the second case where we update f, and thus completes our proof. The proof for H with VC dimension d > 0 is identical, but with 2d log(16|G|n/δ) and ϵn(g) = 2UC(g). Multi-group Learning for Hierarchical Groups D. Description of Prepend (Algorithm 1 of Tosh & Hsu (2022)) In this section, we give a brief description of the Prepend Algorithm of Tosh & Hsu (2022), which is closely related to the concurrent decision list algorithm of Globus-Harris et al. (2022). This algorithm outputs a decision list of predictors h H, where the decision nodes are groups g G. Quoting from Tosh & Hsu (2022), such a decision list of length T predicts as follows on input x X by following: if g T (x) = 1 then return h T (x) else if g T 1(x) = 1 then return h T 1(x) else if ...else return h0(x). We shall represent such a decision list by alternating groups and hypotheses; a decision list of length T can be represented as: f T := [g T , h T , g T 1, h T 1, . . . , g1, h1, h0]. To construct such a predictor, the Prepend algorithm maintains a decision list ft and proceeds in rounds t = 1, 2, 3, . . . At round t, the algorithm checks whether there exists a group-hypothesis pair (g, h) that violates a desired empirical error bound. If such a pair exists, the algorithm prepends the group and hypothesis to the front of the list; if no such pair exists, the algorithm terminates. Algorithm 2 displays the algorithm as presented in Tosh & Hsu (2022). Algorithm 2 Prepend Require: 1: S, a training dataset. 2: Collection of groups G 2X . 3: Error rates ϵn(g) (0, 1) for all g G Ensure: Decision list f T : X Z. 4: Compute h0 arg minh H LS(h) 5: for t = 0, 1, 2, ..., do 6: Compute: (gt+1, ht+1) arg max (g,h) G H LS(ft | g) LS(h | g) ϵn(g). 7: if LS(ft | gt+1) LS(ht+1 | g) ϵn(gt+1) then 8: Prepend (gt+1, ht+1) to ft to obtain ft+1 := [gt+1, ht+1, gt, ht, . . . , g1, h1, h0]. 9: else 10: return ft : X Z, a decision list predictor. 11: end if 12: end for E. Experiment hyperparameters In this section, we include the hyperparameters used for training each model class. Logistic regression, decision trees, and random forests all used the implementation from scikit-learn 1. XGBoost was implemented using the open-source xgboost implementation of Chen & Guestrin (2016) 2. All the hyperparameters not listed were set to the default values in scikit-learn or xgboost. Model Hyperparameters Logistic Regression loss = log loss, dual=False, solver=lbfgs Decision Tree criterion = log loss, max depth = {2, 4, 8} Random Forest criterion = log loss XGBoost objective = binary:logistic Table 1. Hyperparamter settings for each choice of benchmark hypothesis class. 1https://scikit-learn.org/stable/ 2https://xgboost.readthedocs.io/en/stable/ Multi-group Learning for Hierarchical Groups F. Additional dataset details Our experiments span twelve datasets from the Folktables package of Ding et al. (2021). These are comprised of nine statewide datasets and three nationwide datasets. Statewide datasets. We consider nine different datasets corresponding to one of the three U.S. states of New York (NY), California (CA), and Florida (FL) and one of the three binary prediction tasks of Employment, Income, and Coverage. The description of each task is found in Section 4.1. We consider a couple of hierarchical group structures from subdividing the following categorical demographic attributes: race. 6 total categories: R1 (White alone), R2 (Black or African American alone), R3+ (Native), R6+ (Asian or Pacific Islander), R7 (other race alone), and R8 (two or more races). sex. 2 total categories: M (Male) and F (Female). age. 3 total categories. For Income and Employment: Ya (age < 35), Ma (35 age < 60), and Oa (age 60). For Coverage, Ma is instead defined as 35 age < 50 and Oa is instead defined as age 50 because Coverage only concerns individuals whose age is less than 65. edu. 4 total categories: HS- (no high school diploma), HS (high school diploma or equivalent, but no college degree), COL (associates or bachelor s degree), and COL+ (master s, professional, or doctorate degree beyond bachelor s). For each of the nine statewide datasets, we subdivide on each of the demographic attributes to attain hierarchically structured groups. We focus on two collections of groups: Attributes: {race, sex, age}. Total of 54 subgroups (6 groups from race, 12 groups from race sex, and 36 groups from race sex age). Attributes: {race, sex, edu}. Total of 66 subgroups (6 groups from race, 12 groups from race sex, and 48 groups from race sex edu). Tables 2 and 3 lists all the nine statewide datasets and the number of examples in each separate demographic attribute. Dataset All R1 R2 R3+ R6+ R7 R8 CA Emp. 376035 231232 18831 3531 58147 46316 17978 NY Emp. 196104 137179 25362 735 16304 11089 5435 FL Emp. 196828 155682 26223 589 5356 4328 4650 CA Inc. 190187 118212 8656 1585 31039 23285 7410 NY Inc. 101270 72655 11970 340 8639 5407 2259 FL Inc. 94507 75218 11978 267 2892 2314 1838 CA Cov. 145994 82486 8652 1626 21820 24526 6884 NY Cov. 71379 44632 11262 332 7244 5754 2155 FL Cov. 73406 53841 12872 274 2350 2235 1834 Table 2. Number of examples for the race attribute, for each of the nine datasets. Emp. stands for Employment, Inc. stands for Income, and Cov. stands for Coverage. Nationwide datasets. We consider three more datasets constructed from the Folktables package with many more samples than the statewide datasets. For each of the classification tasks (Income, Employment, and Coverage), we gather all examples from a total of 18 U.S. states. 2 states were chosen for each of the 9 federally designated Census geographic divisions of the United States (U.S. Census, 2023). These states and their corresponding geographic regions are: MA, CT (New England), NY, PA (Mid-Atlantic), IL, OH (East North Central), MO, MN (West North Central), FL, GA, (South Atlantic), TN, AL (East South Central), TX, LA (West South Central), AZ, CO (Mountain), CA, and WA (Pacific). For each of the three nationwide datasets, we subdivide on each of the demographic attributes to attain hierarchically structured groups. We focus on two collections of groups: Multi-group Learning for Hierarchical Groups Dataset M F Ya Ma Oa HSHS COL COL+ CA Emp. 185603 190432 164324 124293 87418 126469 132156 81714 35696 NY Emp. 94471 101633 81007 64422 50675 57694 71694 43783 22933 FL Emp. 95281 101547 70632 63329 62867 53825 80479 44754 17770 CA Inc. 101125 89062 65087 97507 27593 23840 80594 58824 26929 NY Inc. 51408 49862 33349 51543 16378 8417 42215 33038 17600 FL Inc. 48623 45884 28007 49451 17049 8150 44394 30289 11674 CA Cov. 65287 80707 75484 33174 37336 43646 70617 25871 5860 NY Cov. 31218 40161 36683 15235 19461 17649 35723 14282 3725 FL Cov. 32686 40720 33552 16604 23250 17209 38354 14913 2930 Table 3. Number of examples for the sex, age, and edu attributes, for each of the nine datasets. Emp. stands for Employment, Inc. stands for Income, and Cov. stands for Coverage. Attributes: {state, race, sex}. Total of 342 subgroups (18 groups from state, 108 groups from state race, and 216 groups from state race sex). Attributes: {state, race, age}. Total of 450 subgroups (18 groups from state, 108 groups from state race, and 324 groups from state race age). G. Comparison of Prepend and MGL-Tree (Algorithm 1) on CA Employment and CA Income In Table 4 and Table 5, we include the group-specific predictors of Algorithm 1 (labeled TREE ) and Prepend (labeled PREP ) for each leaf node in the hierarchical group structure induced by splitting by race, sex, and age for the CA Employment and CA Income datasets. Because the leaf nodes partition the input space, this accounts for the behavior of the predictors on any possible test example x X. As described in Section 4.3, we observe that Algorithm 1 prefers using coarser-grained group-specific predictors higher up the tree, such as ALL or R1, while Prepend often resorts to the finer-grained predictors at the leaves. Leaf R1, M, Ya R1, M, Ma R1, M, Oa R1, F, Ya R1, F, Ma R1, F, Oa TREE R1, M, Ya R1, M, Ma R1, M, Oa R1, F, Ya R1, F, Ma R1, F, Oa PREP R1, M, Ya R1, M, Ma R1, M, Oa R1, F, Ya R1, F, Ma R1, F, Oa Leaf R2, M, Ya R2, M, Ma R2, M, Oa R2, F, Ya R2, F, Ma R2, F, Oa TREE R2, M, Ya ALL ALL R2, F, Ya R2, F, Ma R2, F, Oa PREP R2, M, Ya R2, M, Ma R2, M, Oa R2, F, Ya R2, F, Ma R2, F, Oa Leaf R3+, M, Ya R3+, M, Ma R3+, M, Oa R3+, F, Ya R3+, F, Ma R3+, F, Oa TREE ALL ALL ALL R3+, F, Ya R3+, F, Ma ALL PREP R3+, M, Ya R3+, M, Ma R3+, M, Oa R3+, F, Ya R3+, F, Ma R3+, F, Oa Leaf R6+, M, Ya R6+, M, Ma R6+, M, Oa R6+, F, Ya R6+, F, Ma R6+, F, Oa TREE R6+, M, Ya R6+, M, Ma R6+, M, Oa R6+, F, Ya R6+, F, Ma R6+, F, Oa PREP R6+, M, Ya R6+, M, Ma R6+, M, Oa R6+, F, Ya R6+, F, Ma R6+, F, Oa Leaf R7, M, Ya R7, M, Ma R7, M, Oa R7, F, Ya R7, F, Ma R7, F, Oa TREE R7, M, Ya R7, M, Ma R7, M, Oa R7, F, Ya R7, F, Ma ALL PREP R7, M, Ya R7, M, Ma R7, M, Oa R7, F, Ya R7, F, Ma R7, F, Oa Leaf R8, M, Ya R8, M, Ma R8, M, Oa R8, F, Ya R8, F, Ma R8, F, Oa TREE R8, M, Ya R8, M R8, M ALL R8, F, Ma R8, F, Oa PREP R8, M, Ya R8, M, Ma R8, M, Oa R8, F, Ya R8, F, Ma R8, F, Oa Table 4. Predictors for each race-sex-age leaf node in CA Employment dataset. ALL indicates the ERM logistic regression predictor trained on all the data. Bolded entries are leaves where MGL-Tree and Prepend differ in their predictor. Multi-group Learning for Hierarchical Groups Leaf R1, M, Ya R1, M, Ma R1, M, Oa R1, F, Ya R1, F, Ma R1, F, Oa TREE R1 R1 R1 R1 R1 R1, F, Oa PREP R1, M, Ya R1, M, Ma R1, M, Oa R1, F, Ya R1, F, Ma R1, F, Oa Leaf R2, M, Ya R2, M, Ma R2, M, Oa R2, F, Ya R2, F, Ma R2, F, Oa TREE ALL ALL ALL ALL ALL ALL PREP R2, M, Ya R2, M, Ma R2, M, Oa R2, F, Ya ALL R2, F, Oa Leaf R3+, M, Ya R3+, M, Ma R3+, M, Oa R3+, F, Ya R3+, F, Ma R3+, F, Oa TREE ALL ALL ALL ALL ALL ALL PREP R3+, M, Ya ALL R3+, M, Oa R3+, F, Ya R3+, F, Ma R3+, F, Oa Leaf R6+, M, Ya R6+, M, Ma R6+, M, Oa R6+, F, Ya R6+, F, Ma R6+, F, Oa TREE R6+ R6+, M, Ma R6+ R6+ R6+ R6+ PREP ALL ALL ALL ALL R6+, F, Ma R6+, F, Oa Leaf R7, M, Ya R7, M, Ma R7, M, Oa R7, F, Ya R7, F, Ma R7, F, Oa TREE R7, M, Ya R7, M, Ma R7, M, Oa R7, F, Ya R7, F, Ma ALL PREP R7, M, Ya R7, M, Ma R7, M, Oa R7, F, Ya ALL ALL Leaf R8, M, Ya R8, M, Ma R8, M, Oa R8, F, Ya R8, F, Ma R8, F, Oa TREE R8, M, Ya ALL ALL ALL R8, F, Ma R8, F, Oa PREP R8, M, Ya R8, M, Ma R8, M, Oa R8, F, Ya R8, F, Ma R8, F, Oa Table 5. Predictors for each race-sex-age leaf node in CA Income dataset. ALL indicates the ERM logistic regression predictor trained on all the data. Bolded entries are leaves where MGL-Tree and Prepend differ in their predictor. H. Additional experimental results In this section, we provide additional results for the other datasets not shown in the main body. For each dataset, we consider the two collections of hierarchical groups described in Appendix F, and we compare the generalization performance of all the methods and benchmark hypothesis classes described in Section 4.1. For all the datasets, we plot the test error of each method on a held-out test set for each group. Each group corresponds to a point on the plot. The y = x line corresponds to equal test error between the methods; all points above the line indicates that our Algorithm 1 had lower test error. Error bars are from the standard error from 10 random trials on each group (from a fresh train-test split, retraining each model from scratch). Error bars are omitted for the nationwide datasets for visual presentation. Throughout all the figures, Decision Tree2, Decision Tree4, and Decision Tree8 refer to decision trees trained with the max depth parameter set to 2, 4, and 8, respectively. ERM refers to the model trained on all available data, G-ERM refers to the group-stratified model trained only on data from the group corresponding to the point on the plot, and PREP refers to the Prepend algorithm of Tosh & Hsu (2022). We also include a case study of a particular dataset s (CA Employment) error rates group-by-group for logistic regression. We see that the bars for Algorithm 1 are consistently on par with or lower than both the ERM and Group ERM bars, sometimes by a very significant margin. In cases where Algorithm 1 beats ERM by a significant amount (such as on, say, the group R1, M, Oa), the predictor from Algorithm 1 has identified an instance of a high-error subgroup on a specific slice of the population, which may prove useful for further auditing. Multi-group Learning for Hierarchical Groups 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (ERM) race-sex-age (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (ERM) race-sex-edu (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (G-ERM) race-sex-edu (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (PREP) race-sex-edu (PREP vs. TREE) Logistic Regression Decision Tree2 Random Forest XGBoost Figure 6. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups (CA Income). Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: logistic regression, decision trees with max depth 2, random forest, and XGBoost. Multi-group Learning for Hierarchical Groups 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (ERM) race-sex-age (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (ERM) race-sex-edu (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (G-ERM) race-sex-edu (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (PREP) race-sex-edu (PREP vs. TREE) CA income (decision trees) Decision Tree2 Decision Tree4 Decision Tree8 Random Forest Figure 7. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups (CA Income). Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: decision trees with max depth 2, decision trees with max depth 4, decision trees with max depth 8, and random forest. Multi-group Learning for Hierarchical Groups 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (ERM) race-sex-age (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (ERM) race-sex-edu (ERM vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (G-ERM) race-sex-edu (G-ERM vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (PREP) race-sex-edu (PREP vs. TREE) Logistic Regression Decision Tree2 Random Forest XGBoost Figure 8. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups (NY Income). Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: logistic regression, decision trees with max depth 2, random forest, and XGBoost. Multi-group Learning for Hierarchical Groups 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (ERM) race-sex-age (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (ERM) race-sex-edu (ERM vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (G-ERM) race-sex-edu (G-ERM vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (PREP) race-sex-edu (PREP vs. TREE) NY income (decision trees) Decision Tree2 Decision Tree4 Decision Tree8 Random Forest Figure 9. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups (NY Income). Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: decision trees with max depth 2, decision trees with max depth 4, decision trees with max depth 8, and random forest. Multi-group Learning for Hierarchical Groups 0.0 0.2 0.4 0.6 0.8 1.0 Test error (TREE) Test error (ERM) race-sex-age (ERM vs. TREE) 0.0 0.2 0.4 0.6 0.8 1.0 Test error (TREE) Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.0 0.2 0.4 0.6 0.8 1.0 Test error (TREE) Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 Test error (TREE) Test error (ERM) race-sex-edu (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 Test error (TREE) Test error (G-ERM) race-sex-edu (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 Test error (TREE) Test error (PREP) race-sex-edu (PREP vs. TREE) Logistic Regression Decision Tree2 Random Forest XGBoost Figure 10. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups (FL Income). Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: logistic regression, decision trees with max depth 2, random forest, and XGBoost. Multi-group Learning for Hierarchical Groups 0.00 0.25 0.50 0.75 1.00 Test error (TREE) Test error (ERM) race-sex-age (ERM vs. TREE) 0.00 0.25 0.50 0.75 1.00 Test error (TREE) Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.00 0.25 0.50 0.75 1.00 Test error (TREE) Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.2 0.4 Test error (TREE) Test error (ERM) race-sex-edu (ERM vs. TREE) 0.0 0.2 0.4 Test error (TREE) Test error (G-ERM) race-sex-edu (G-ERM vs. TREE) 0.0 0.2 0.4 Test error (TREE) Test error (PREP) race-sex-edu (PREP vs. TREE) FL income (decision trees) Decision Tree2 Decision Tree4 Decision Tree8 Random Forest Figure 11. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups (FL Income). Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: decision trees with max depth 2, decision trees with max depth 4, decision trees with max depth 8, and random forest. Multi-group Learning for Hierarchical Groups 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (ERM) race-sex-age (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (ERM) race-sex-edu (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (G-ERM) race-sex-edu (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (PREP) race-sex-edu (PREP vs. TREE) CA employment Logistic Regression Decision Tree2 Random Forest XGBoost Figure 12. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups (CA Employment). Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: logistic regression, decision trees with max depth 2, random forest, and XGBoost. Multi-group Learning for Hierarchical Groups 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (ERM) race-sex-age (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (ERM) race-sex-edu (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (G-ERM) race-sex-edu (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (PREP) race-sex-edu (PREP vs. TREE) CA employment (decision trees) Decision Tree2 Decision Tree4 Decision Tree8 Random Forest Figure 13. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups (CA Employment). Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: decision trees with max depth 2, decision trees with max depth 4, decision trees with max depth 8, and random forest. Multi-group Learning for Hierarchical Groups 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (ERM) race-sex-age (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (ERM) race-sex-edu (ERM vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (G-ERM) race-sex-edu (G-ERM vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (PREP) race-sex-edu (PREP vs. TREE) NY employment Logistic Regression Decision Tree2 Random Forest XGBoost Figure 14. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups (NY Employment). Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: logistic regression, decision trees with max depth 2, random forest, and XGBoost. Multi-group Learning for Hierarchical Groups 0.0 0.2 0.4 Test error (TREE) Test error (ERM) race-sex-age (ERM vs. TREE) 0.0 0.2 0.4 Test error (TREE) Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.0 0.2 0.4 Test error (TREE) Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (ERM) race-sex-edu (ERM vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (G-ERM) race-sex-edu (G-ERM vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (PREP) race-sex-edu (PREP vs. TREE) NY employment (decision trees) Decision Tree2 Decision Tree4 Decision Tree8 Random Forest Figure 15. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups (NY Employment). Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: decision trees with max depth 2, decision trees with max depth 4, decision trees with max depth 8, and random forest. Multi-group Learning for Hierarchical Groups 0.0 0.1 0.2 0.3 Test error (TREE) Test error (ERM) race-sex-age (ERM vs. TREE) 0.0 0.1 0.2 0.3 Test error (TREE) Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 Test error (TREE) Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (ERM) race-sex-edu (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (G-ERM) race-sex-edu (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (PREP) race-sex-edu (PREP vs. TREE) FL employment Logistic Regression Decision Tree2 Random Forest XGBoost Figure 16. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups (FL Employment). Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: logistic regression, decision trees with max depth 2, random forest, and XGBoost. Multi-group Learning for Hierarchical Groups 0.0 0.1 0.2 0.3 Test error (TREE) Test error (ERM) race-sex-age (ERM vs. TREE) 0.0 0.1 0.2 0.3 Test error (TREE) Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 Test error (TREE) Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (ERM) race-sex-edu (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (G-ERM) race-sex-edu (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (PREP) race-sex-edu (PREP vs. TREE) FL employment (decision trees) Decision Tree2 Decision Tree4 Decision Tree8 Random Forest Figure 17. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups (FL Employment). Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: decision trees with max depth 2, decision trees with max depth 4, decision trees with max depth 8, and random forest. Multi-group Learning for Hierarchical Groups 0.0 0.1 0.2 0.3 0.4 0.5 Test error (TREE) Test error (ERM) race-sex-age (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 Test error (TREE) Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 Test error (TREE) Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 Test error (TREE) Test error (ERM) race-sex-edu (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 Test error (TREE) Test error (G-ERM) race-sex-edu (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 Test error (TREE) Test error (PREP) race-sex-edu (PREP vs. TREE) CA coverage Logistic Regression Decision Tree2 Random Forest XGBoost Figure 18. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups (CA Coverage). Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: logistic regression, decision trees with max depth 2, random forest, and XGBoost. Multi-group Learning for Hierarchical Groups 0.0 0.2 0.4 Test error (TREE) Test error (ERM) race-sex-age (ERM vs. TREE) 0.0 0.2 0.4 Test error (TREE) Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.0 0.2 0.4 Test error (TREE) Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.2 0.4 Test error (TREE) Test error (ERM) race-sex-edu (ERM vs. TREE) 0.0 0.2 0.4 Test error (TREE) Test error (G-ERM) race-sex-edu (G-ERM vs. TREE) 0.0 0.2 0.4 Test error (TREE) Test error (PREP) race-sex-edu (PREP vs. TREE) CA coverage (decision trees) Decision Tree2 Decision Tree4 Decision Tree8 Random Forest Figure 19. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups (CA Coverage). Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: decision trees with max depth 2, decision trees with max depth 4, decision trees with max depth 8, and random forest. Multi-group Learning for Hierarchical Groups 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (ERM) race-sex-age (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (ERM) race-sex-edu (ERM vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (G-ERM) race-sex-edu (G-ERM vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (PREP) race-sex-edu (PREP vs. TREE) NY coverage Logistic Regression Decision Tree2 Random Forest XGBoost Figure 20. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups (NY Coverage). Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: logistic regression, decision trees with max depth 2, random forest, and XGBoost. Multi-group Learning for Hierarchical Groups 0.0 0.2 0.4 Test error (TREE) Test error (ERM) race-sex-age (ERM vs. TREE) 0.0 0.2 0.4 Test error (TREE) Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.0 0.2 0.4 Test error (TREE) Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (ERM) race-sex-edu (ERM vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (G-ERM) race-sex-edu (G-ERM vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (PREP) race-sex-edu (PREP vs. TREE) NY coverage (decision trees) Decision Tree2 Decision Tree4 Decision Tree8 Random Forest Figure 21. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups (NY Coverage). Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: decision trees with max depth 2, decision trees with max depth 4, decision trees with max depth 8, and random forest. Multi-group Learning for Hierarchical Groups 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (ERM) race-sex-age (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (ERM) race-sex-edu (ERM vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (G-ERM) race-sex-edu (G-ERM vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (PREP) race-sex-edu (PREP vs. TREE) FL coverage Logistic Regression Decision Tree2 Random Forest XGBoost Figure 22. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups (FL Coverage). Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: logistic regression, decision trees with max depth 2, random forest, and XGBoost. Multi-group Learning for Hierarchical Groups 0.0 0.2 0.4 Test error (TREE) Test error (ERM) race-sex-age (ERM vs. TREE) 0.0 0.2 0.4 Test error (TREE) Test error (G-ERM) race-sex-age (G-ERM vs. TREE) 0.0 0.2 0.4 Test error (TREE) Test error (PREP) race-sex-age (PREP vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (ERM) race-sex-edu (ERM vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (G-ERM) race-sex-edu (G-ERM vs. TREE) 0.0 0.2 0.4 0.6 Test error (TREE) Test error (PREP) race-sex-edu (PREP vs. TREE) FL coverage (decision trees) Decision Tree2 Decision Tree4 Decision Tree8 Random Forest Figure 23. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups (FL Coverage). Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: decision trees with max depth 2, decision trees with max depth 4, decision trees with max depth 8, and random forest. Multi-group Learning for Hierarchical Groups 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Test error (TREE) Test error (ERM) state-race-age (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Test error (TREE) Test error (G-ERM) state-race-age (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Test error (TREE) Test error (PREP) state-race-age (PREP vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 Test error (TREE) Test error (ERM) state-race-sex (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 Test error (TREE) Test error (G-ERM) state-race-sex (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 Test error (TREE) Test error (PREP) state-race-sex (PREP vs. TREE) nationwide income Logistic Regression Decision Tree2 Random Forest XGBoost Figure 24. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups on nationwide dataset for the Income task. Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: logistic regression, decision trees with max depth 2, random forest, and XGBoost. Multi-group Learning for Hierarchical Groups 0.0 0.2 0.4 0.6 0.8 Test error (TREE) Test error (ERM) state-race-age (ERM vs. TREE) 0.0 0.2 0.4 0.6 0.8 Test error (TREE) Test error (G-ERM) state-race-age (G-ERM vs. TREE) 0.0 0.2 0.4 0.6 0.8 Test error (TREE) Test error (PREP) state-race-age (PREP vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (ERM) state-race-sex (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (G-ERM) state-race-sex (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (PREP) state-race-sex (PREP vs. TREE) nationwide income Decision Tree2 Decision Tree4 Decision Tree8 Random Forest Figure 25. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups on nationwide dataset for the Income task. Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: decision trees with max depth 2, decision trees with max depth 4, decision trees with max depth 8, and random forest. Multi-group Learning for Hierarchical Groups 0.0 0.1 0.2 0.3 0.4 0.5 Test error (TREE) Test error (ERM) state-race-age (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 Test error (TREE) Test error (G-ERM) state-race-age (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 Test error (TREE) Test error (PREP) state-race-age (PREP vs. TREE) 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 Test error (TREE) Test error (ERM) state-race-sex (ERM vs. TREE) 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 Test error (TREE) Test error (G-ERM) state-race-sex (G-ERM vs. TREE) 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 Test error (TREE) Test error (PREP) state-race-sex (PREP vs. TREE) nationwide employment Logistic Regression Decision Tree2 Random Forest XGBoost Figure 26. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups on nationwide dataset for the Employment task. Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: logistic regression, decision trees with max depth 2, random forest, and XGBoost. Multi-group Learning for Hierarchical Groups 0.0 0.1 0.2 0.3 0.4 0.5 Test error (TREE) Test error (ERM) state-race-age (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 Test error (TREE) Test error (G-ERM) state-race-age (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 Test error (TREE) Test error (PREP) state-race-age (PREP vs. TREE) 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 Test error (TREE) Test error (ERM) state-race-sex (ERM vs. TREE) 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 Test error (TREE) Test error (G-ERM) state-race-sex (G-ERM vs. TREE) 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 Test error (TREE) Test error (PREP) state-race-sex (PREP vs. TREE) nationwide employment Decision Tree2 Decision Tree4 Decision Tree8 Random Forest Figure 27. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups on nationwide dataset for the Employment task. Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: decision trees with max depth 2, decision trees with max depth 4, decision trees with max depth 8, and random forest. Multi-group Learning for Hierarchical Groups 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Test error (TREE) Test error (ERM) state-race-age (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Test error (TREE) Test error (G-ERM) state-race-age (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Test error (TREE) Test error (PREP) state-race-age (PREP vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (ERM) state-race-sex (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (G-ERM) state-race-sex (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (PREP) state-race-sex (PREP vs. TREE) nationwide coverage Logistic Regression Decision Tree2 Random Forest XGBoost Figure 28. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups on nationwide dataset for the Coverage task. Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: logistic regression, decision trees with max depth 2, random forest, and XGBoost. Multi-group Learning for Hierarchical Groups 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Test error (TREE) Test error (ERM) state-race-age (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Test error (TREE) Test error (G-ERM) state-race-age (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Test error (TREE) Test error (PREP) state-race-age (PREP vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (ERM) state-race-sex (ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (G-ERM) state-race-sex (G-ERM vs. TREE) 0.0 0.1 0.2 0.3 0.4 Test error (TREE) Test error (PREP) state-race-sex (PREP vs. TREE) nationwide coverage Decision Tree2 Decision Tree4 Decision Tree8 Random Forest Figure 29. Test error of MGL-Tree vs. ERM, Group ERM, and Prepend on race-sex-age and race-sex-edu groups on nationwide dataset for the Coverage task. Each point corresponds to a group; points above the y = x line show that MGL-Tree generalizes better than the competitor method on that particular group. Benchmark hypothesis classes considered: decision trees with max depth 2, decision trees with max depth 4, decision trees with max depth 8, and random forest. Multi-group Learning for Hierarchical Groups Logistic Regression (ERM) Logistic Regression (G-ERM) Logistic Regression (PREP) Logistic Regression (TREE) Random Forest (ERM) hierarchical group errors for CA employment (race-sex-age) Figure 5. Test accuracy of H = Logistic Regression for race-sex-age groups (CA Employment). Test errors across all |G| = 54 hierarchically structured groups from race-sex-age. See Appendix F for more information on the specific categories for each group.