# representative_language_generation__9b1cc63d.pdf

Representative Language Generation

Charlotte Peale * 1 Vinod Raman * 2 Omer Reingold * 1

We introduce representative generation, extending the theoretical framework for generation proposed by Kleinberg et al. (2024) and formalized by Li et al. (2024), to additionally address diversity and bias concerns in generative models. Our notion requires outputs of a generative model to proportionally represent groups of interest from the training data. We characterize representative uniform and non-uniform generation, introducing the group closure dimension as a key combinatorial quantity. For representative generation in the limit, we analyze both information-theoretic and computational aspects, demonstrating feasibility for countably infinite hypothesis classes and collections of groups under certain conditions, but proving a negative result for computability using only membership queries. This contrasts with Kleinberg et al. s (2024) positive results for standard generation in the limit. Our findings provide a rigorous foundation for developing more diverse and representative generative models.

1. Introduction

For decades, a central paradigm in machine learning has been prediction, where models are trained to map input data to specific output variables or categories. This approach encompasses tasks such as classification and regression, where the goal is to accurately estimate outcomes based on given inputs. However, recent years have seen a significant shift toward generative models, such as Large Language Models (LLMs) and diffusion-based image generators. These models are designed not to predict specific outcomes, but to create new data that resembles their training sets, offering a different approach to machine learning tasks.

This shift towards generative models necessitates the devel-

*Equal contribution 1Stanford University 2University of Michigan. Correspondence to: Charlotte Peale <cpeale@stanford.edu>, Vinod Raman <vkraman@umich.edu>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

opment of new theoretical frameworks to rigorously analyze their performance, capabilities, and limitations. Recently, Kleinberg & Mullainathan (2024) proposed a theoretical framework that encapsulates the fundamental objective of generative models: after being shown a sequence of strings from an unknown target language (such as all valid code snippets in java), generate new, unseen strings from the target language. Informally, we say that a model satisfies generation in the limit if it achieves this goal after seeing a finite number of strings from the target language. Kleinberg & Mullainathan (2024) showed that generation in the limit is indeed possible in many scenarios, such as whenever the class of potential target languages is countable, contrasting to classic negative results by Gold and Angluin on identification, which tell us that no model can identify the target language from a sequence of strings for most natural classes of languages (1967; 1979; 1980). This positive result for the task of language generation has spurred a flurry of followup works further formalizing the landscape of generation tasks and understanding the limits of when generation is possible (Li et al., 2024; Kalavasis et al., 2024b; Charikar & Pabbaraju, 2024; Kalavasis et al., 2024a).

While these recent results have demonstrated the possibility of successful generation, a concern remains that models that successfully generate in the limit may do so by generating from a restricted subset of the true language. For instance, consider an image generator trained on a diverse set of animal pictures. If this model were to produce only new images of cats, it would technically satisfy the criteria for generation in the limit, yet fail to capture the full diversity of the true target set of animal pictures. In the spirit of this example, it is not hard to imagine real-life concerns about the usage of LLMs and other generative models. Training data often includes text or images representing diverse groups, characterized by ethnicity, race, gender, and other protected attributes. However, the mere presence of diverse data doesn t guarantee diverse outputs. Beyond demographics, it s essential to ensure that generators intended for widespread use accurately reflect the diverse perspectives found across various communities. Ensuring proportional representation in a generator s outputs can significantly enhance its utility, enabling it to accurately reflect the diverse trends, themes, and ideological perspectives present in human discourse, rather than producing homogeneous responses.

Representative Language Generation

This theoretical concern mirrors practical challenges observed in real-world generative models. One significant issue is the potential for these models to exacerbate biases present in their training data, leading to unfair representation across different demographic groups (Sheng et al., 2019; Bender et al., 2021; Kirk et al., 2021; Sheng et al., 2021; Mei et al., 2023; Gallegos et al., 2024; Zhou et al., 2024). Beyond fairness considerations, models that fail to capture the diversity of their data reduce their effectiveness and practical value. This reduced diversity is commonly observed in generative models and referred to as mode collapse, in which a learned model focuses only on a limited subset of the true data distribution (Arjovsky et al., 2017; Goodfellow et al., 2020; Thanh-Tung & Tran, 2020; Wang et al., 2024).

In this work, we introduce a new definition of generation that protects against these concerns that we term representative generation. Rather than generating a single element at each step, a representative generator generates a distribution over multiple elements at each step. In addition to the now standard requirement that a generator must eventually be consistent, i.e. the distribution must be supported on new elements from the true language, we additionally require that all of the generator s outputted distributions be representative, meaning that for each group in a set of groupsof-interest, the proportion of elements from that group seen in the training sequence thus far must be close to the probability that the generator outputs a member of that group. Coming back to our animal example, if our training data stream consisted of 1/3 cats, 1/3 dogs, and 1/3 rabbits, then a representative generator with respect to groups specified by animal species could not get away with only generating cats, but would be required to generate a roughly even mix of cats, dogs, and rabbits.

We provide a detailed discussion of related notions and additional related work in Appendix A.

1.1. Main Contributions

The loose goal of eventually generating consistently has been taxonomized by prior works into a hierarchy of three notions of generation. The weakest is Kleinberg & Mullainathan (2024) s generation in the limit (Definition 2.12), followed by the stronger notions of non-uniform generation and uniform generation introduced by Li et al. (2024) (Definitions 2.11 and 2.10, respectively).

Our results consider the feasibility of representative generation with respect to all three goals. For uniform and non-uniform generation, we focus on information-theoretic bounds, analogous to sample complexity results in learning theory, without considering computational efficiency. This approach is motivated by the recent barrier to efficiently computable non-uniform generators identified by Charikar & Pabbaraju (2024). For the weaker objective of representa-

tive generation in the limit, we analyze feasibility from both information-theoretic and computational perspectives. We prove a strong negative result, demonstrating the impossibility of achieving representative generation in the limit using only membership queries.

We give a brief overview of our main contributions below.

Formalizing Representative Generation. We propose a new property of generators termed representative generation. This property ensures that at each timestep, a generator s outputs closely approximate the proportions of training data across a collection of groups A 2X . Our definition extends the formal model of generation by Kleinberg & Mullainathan (2024), providing a rigorous framework to understand and address several real-world concerns related to generative models that we highlight below.

First, representative generation can guarantee accurate proportional representation of community opinions and perspectives found in the training data. This is crucial for maintaining the diversity of viewpoints present in the original dataset. Second, when the training data is highly diverse across the set of groups-of-interest, the generator s outputs must also exhibit high diversity. This feature protects against mode collapse and offers a tractable relaxation to existing notions of generation with breadth, which we discuss further in Appendix A.

Representative generation can also be viewed as a theoretical operationalization of some aspects of alignment, the process of fine-tuning generative models to conform with societal and use-specific values. Imagine feeding a representative generator a gold-standard distribution of data that encapsulates a practitioner s values and diverse opinions. Unlike standard generators, which may learn to produce correct outputs but offer no guarantees about maintaining alignment with the input data distribution, representative generators are designed to both generate accurate outputs and preserve the distribution of values and perspectives found in the alignment data. This approach ensures that the model remains faithful to the intended balance of viewpoints and opinions, even as it generates novel content.

Representative Uniform Generation. We give a characterization of which pairs of hypothesis classes H and collections of groups A satisfy representative uniform generation, assuming the groups in A form a partition of X. We characterize the representative uniform generatability of a pair (H, A) by a new combinatorial quantity that we term the group closure dimension (Definition 3.1):

Theorem 1.1 (Informal Statement of Theorem 3.3). A hypothesis class H and countable partition A can be uniformly generated with representation if and only if the group closure dimension of (H, A) is finite.

Representative Language Generation

We instantiate this result in Corollary 3.5, where we show that any finite hypothesis class H and finite partition A can satisfy representative uniform generation. This corollary can be compared with Theorem 2.2 of Kleinberg & Mullainathan (2024), who show that any finite H is uniformly generatable (without concern for representation).

While one might wonder if requiring representation with respect to only a finite collection of groups is always as easy as uniform generation without representation, we provide a counterexample to this conjecture in Corollary 3.6, in which we provide an example of a class H that is uniformly generatable, but cannot satisfy representative uniform generation with respect to a partition A containing just two disjoint groups.

Representative Non-Uniform Generation. We next consider the slightly weaker notion of representative nonuniform generation, and also give a characterization of which pairs of hypothesis classes and collections of disjoint groups can satisfy non-uniform generation with representation.

Theorem 1.2 (Informal Statement of Theorem 3.7). A hypothesis class H and countable partition A can be nonuniformly generated with representation if and only if there exists a non-decreasing sequence of classes H1 H2 such that H = S i=1 Hi and (Hi, A) is uniformly generatable with representation for all i N.

We pair this characterization with a practical instantiation in Corollary 3.8, in which we show that all countable H and finite partitions A satisfy representative non-uniform generatability.

Representative Generation in the Limit. Finally, we consider the weakest notion of generation: representative generation in the limit. As all previous possible results also apply to this weaker setting, the results of the preceding sections already establish that any countable H and finite collection of disjoint groups A can be generated in the limit with representation. We show that the requirement that A be finite and disjoint can be significantly relaxed when we only require generation in the limit. In particular, in Theorem 4.4, we show that any countable collection of (possibly overlapping) groups and countable H satisfying an assumption we term the finite support assumption (Definition 4.2) can be generated with representation in the limit.

We show that this finite support assumption is necessary, and in Lemma 4.3 give an example of a finite H and countable set of disjoint groups A that cannot satisfy representative generation in the limit.

We finally turn to computability, and show a negative result:

Lemma 1.3 (Informal Statement of Lemma 4.9). No gener-

ator can satisfy representative generation in the limit for all finite H and A, and during each timestep use only a finite number of membership queries of the form x supp(h)? or x A? for various h H and A A.

The impossibility contrasts with a positive result of Kleinberg & Mullainathan (2024), who show that generation in the limit (without representation) is possible with only a finite number of membership queries at each timestep.

2. Preliminaries

We consider a countable example space X, and associate with X a hypothesis class H {0, 1}X . Any particular h H can be thought of as an indicator for a valid element of X when h(x) = 1. In the context of languages, h denotes a language with h(x) = 1 when x is a valid string in that language.

We define the support of a hypothesis h to be the set of all valid elements of X according to h, denoted supp(h) := {x X : h(x) = 1}. We will also sometimes reference the support of a distribution µ X, where supp(µ) := {x X : µ(x) > 0} refers to the set of elements that are drawn from the distribution with positive probability. An enumeration of supp(h) is any infinite sequence x1, x2, ... such that S

i N{xi} = supp(h).

We will make the following assumption about hypothesis classes throughout.

Assumption 2.1 (Uniformly Unbounded Support (UUS)). A hypothesis class H {0, 1}X satisfies the Uniformly Unbounded Support (UUS) property if |supp(h)| = for every h H.

Going beyond the standard model of generation that has been considered in prior works, we also endow the space X with a countable set of (possibly overlapping) groups-ofinterest A 2X . We assume that X S

A A A, though this is without loss of generality because any A can be altered to satisfy this by adding a single group to the collection corresponding to the complement of the union of all existing groups.

Our results on representative uniform and non-uniform generation presented in Section 3 focus on the special case where A forms a partition of X:

Definition 2.2 (Countable Partition). Let X be any countable example space. Then, A := {A1, A2, . . . } is a countable partition of X if and only if the following two conditions hold: (1) Ai Aj = for all i = j and (2) S i=1 Ai = X.

We introduce two operators that will be useful for notation throughout. Note that we occasionally denote a finite sequence x1, ..., xn by x1:n, and use |{x1, ..., xn}| to denote

Representative Language Generation

the number of unique elements in x1:n. Definition 2.3 (Set of Consistent Hypotheses). For any finite sequence of samples x1:n and hypothesis class H, we define notation for the set of hypotheses consistent with the sample: H(x1:n) := {h H : {x1:n} supp(h)}. Definition 2.4 (Closure). Given a class H and finite sequence of samples x1, ..., xn, define the closure operator as the set of samples consistent with every hypothesis in H(x1, ..., xn):

x1, ..., xn H :=

(T h H(x1:n) supp(h), if |H(x1:n)| 1 , if |H(x1:n)| = 0.

2.1. Generators

In this paper, we will consider randomized generators. A randomized generator is a map G : X X. Note that by definition, randomized generators output distributions over examples. Given a finite sequence of examples x1, . . . , xt and a randomized generator G, one can always obtain a new example by sampling ˆxt G(x1, . . . , xt). By allowing our generators to output distributions over examples, we can measure representation by comparing how close a distribution s induced group probabilities are to the empirical probabilities of the groups in the stream. To formalize this, we need a few more definitions, starting with induced group probabilities and group empirical probabilities. Definition 2.5 (Induced Group Probabilities). Given a distribution µ X and countable family of groups A = {A}i N, let µ|A denote µ s induced probabilities over A such that for any group i N, we have that µ|A(i) := Prx µ [x Ai] .

Observe that for any countable partition A = {Ai}i N and µ X, µ|A is a probability distribution. If A is not a partition, the probabilities in µ|A may sum to more than 1. Definition 2.6 (Empirical Distribution and Group Empirical Probabilities). Let x1, . . . , xt be a finite sequence of examples, and x 1, . . . x m denote its subsequence of unique examples. The empirical distribution of unique examples in x1, . . . xt is denoted by x1:t. In particular, for every x X, we have x1:t(x) := 1 m Pm j=1 1{x j = x}. Given a countable collection of groups A = {Ai}i N, the group empirical probabilities of the unique examples in x1, . . . , xt with respect to A are defined analogously as the induced group probabilities of the empirical distribution and denoted by x1:t|A.

The final ingredient is a measure of closeness between the induced group probabilities of the output of a randomized generator and the empirical group probabilities: Definition 2.7 (Supremum Distance). Given two vectors π1, π2 [0, 1]N define their supremum distance as ||π1 π2|| := maxi N |π1(i) π2(i)|.

Our choice to focus on the supremum distance draws directly on the foundational principles established in algorithmic fairness literature, which emphasizes the importance of limiting the error experienced by the worst-off group. This perspective prioritizes ensuring good representation for every group rather than merely optimizing for average performance. The supremum distance naturally operationalizes this principle by measuring the maximum disparity across all groups, effectively placing an upper bound on the error that any group might experience.

It s worth noting that for finite group settings, different choices of distance measures (L1 or L2 distance) are typically within constant factors of one another, making the specific choice less critical. However, as we move to settings with infinite groups, these equivalences break down distance measures such as L1 become less informative, potentially obscuring significant disparities among individual groups.

While we selected the supremum distance for its robust guarantees from an algorithmic fairness perspective, exploring representation guarantees under alternative distance metrics is certainly an interesting direction for future research.

Finally, we are now able to rigorously define what it means to be α-representative.

Definition 2.8 (α-Representative Generator). A randomized generator is α-representative with respect to a countable collection of groups A if for every stream of examples x1, x2, . . . and for all t N, we have that ||G(x1:t)|A x1:t|A|| α.

On its own, α-representativeness is trivial to achieve given any collection of groups A and any finite sequence x1, . . . , xt, one can always output exactly the empirical distribution, x1:t. Thus, our goal will be to satisfy representation in addition to existing definitions of correctness for generation, as defined in Section 2.2.

2.2. Representative Generation

In this section, we introduce several notions of representative generatability for a tuple (H, A), each varying in the amount of distinct examples that an α-representative generator can observe before it needs to succeed. In our setting, a α-representative generator succeeds if it is eventually consistent. Formally, given an α-representative generator G, a hypothesis h H, and any stream x1, x2, supp(h), we say that G is consistent from time point t N, if for all s t:

Pr ˆxs G(x1:s) [ˆxs supp(h) \ {x1, . . . , xs}] = 1.

We note that generators are allowed some initial inconsistent outputs, provided they eventually achieve consistency. In

Representative Language Generation

contrast, outputs must be representative of the data stream at every timestep. Thus, our definition implicitly prioritizes group representation over correctness in generation during initial timesteps. While consistency cannot always be verified, it is possible to construct and verify a representative distribution at each time step. This makes it reasonable to require representation throughout, and less reasonable to favor potentially consistent but verifiably unrepresentative alternatives. However, understanding the tradeoffs between representation and consistency is an interesting direction of future research.

Given our definitions of consistency and representativeness, we are now ready to introduce the strongest form of representative generatability α-representative uniform generatability. Roughly speaking, α-representative uniform generatability requires that the amount of unique examples that G needs to observe before being consistent should be uniform over all h H and streams x1, x2, supp(h).

Definition 2.9 (α-Representative Uniform Generatability). Let H {0, 1}X be any hypothesis class satisfying the UUS property and A = {Ai}i N be any collection of groups over X. Then, (H, A) is α-representatively uniformly generatable if there exists an α-representative generator G and d N, such that for every h H and any sequence x1, x2, . . . with {x1, x2, . . . } supp(h), if there exists t N where |{x1, . . . , xt }| = d , then G is consistent from t .

A tuple (H, A) is then representatively uniformly generatable if it is α-representatively uniformly generatable for every α (0, 1].

Definition 2.10 (Representative Uniform Generatability). Let H {0, 1}X be any hypothesis class satisfying the UUS property and A = {Ai}i N be any collection of groups over X. Then, (H, A) is representatively uniformly generatable, if (H, A) is α-representatively uniformly generatable for every α (0, 1].

Representative uniform generatability is a strong property as it requires the sample complexity (i.e. number of distinct examples before consistency is achieved) to be uniform over all choices of the adversary. We can weaken the definition by allowing the sample complexity to depend on the hypothesis, but still be uniform over all possible streams of examples the adversary might pick. This leads to our next notion of representative non-uniform generatability.

Definition 2.11 (Representative Non-uniform Generatability). Let H {0, 1}X be any hypothesis class satisfying the UUS property and A = {Ai}i N be any countable collection of groups over X. Then, (H, A) is representatively non-uniformly generatable, if for every α > 0 there exists an α-representative generator G, such that for every h H, there exists a d N such that for any sequence x1, x2, . . .

with {x1, x2, . . . } supp(h), if there exists t N where |{x1, . . . , xt }| = d , then G is consistent from t .

Finally, we can weaken the requirements even further by allowing the sample complexity to depend on both the hypothesis and stream selected by the adversary.

Definition 2.12 (Representative Generation in the Limit). Let H {0, 1}X be any hypothesis class satisfying the UUS property and A = {Ai}i N be any countable collection of groups over X. Then, (H, A) is representatively generatable in the limit, if for every α > 0 there exists an α-representative generator G such that for every h H and any enumeration x1, x2, ... of supp(h), there exists t N such that G is consistent from t .

We remark that our notions of generatability are direct analogs of those from Li et al. (2024) with an additional representation requirement.

3. Characterizations of Representative Uniform and Non-uniform Generation

In this section, we focus on characterizing representative uniform and non-uniform generatability. In light of the computational barriers for uniform and non-uniform generation established by Charikar & Pabbaraju (2024), our characterization is information-theoretic in nature. Moreover, we will only consider collections of groups A which are countable partitions of X. This is still a very general setting, capturing problems like animal image generation, where one considers groups based on class. We leave the characterization of representative uniform and non-uniform for overlapping collections of groups as future work.

3.1. Representative Uniform Generation

Our starting point is representative uniform generation. In order to characterize which classes are representatively uniformly generatable, we extend the Closure dimension from Li et al. (2024) to account for the group-constraints induced by A. To do so, we define a new scale-sensitive dimension, termed the Group Closure dimension, which accounts for the complexity induced by both H and A.

Definition 3.1 (Group Closure dimension). Let H {0, 1}X be any hypothesis class satisfying the UUS property and A = {Ai}i N be any countable partition of X. The Group Closure dimension of (H, A) at scale α > 0, denoted GCα(H, A), is the largest natural number d N for which there exists distinct x1, . . . , xd X such that x1, . . . , xd H = and either (1) maxi S x1:d|A(i) > α or (2) α|N/S| < P

i S x1:d|A(i), where S := {i N : x1, . . . , xd H Ai \ {x1, . . . , xd} = }. If this is true for arbitrarily large d N, then we say that GCα(H, A) = . On the other hand, if this is not true for d = 1, we say that

Representative Language Generation

GCα(H, A) = 0. Remark 3.2. Our definition of the Group Closure dimension is for countable groups A = {Ai}i N. One can define the Group Closure dimension for finite groups A = {Ai}i [K] by modifying the definition of S and condition (2) in the following way. Define S := {i [K] : x1, . . . , xd H Ai \ {x1, . . . , xd} = } and then change (2) to α(K |S|) < P i S x1:d|A(i). We will use this modified definition when proving Corollary 3.5.

At a high-level, for any given error tolerance α > 0, one should interpret GCα(H, A) as the minimal number of distinct examples that a generator needs to see before it is guaranteed a winning strategy with respect to error level α (i.e. one that is both consistent and α-representative). Indeed, as we will show in this section, GCα(H, A) provides a precise quantitative bound on the sample complexity of representative uniform generation for (H, A). We highlight that unlike the Closure dimension, the Group Closure dimension depends on both H and A and is scale-sensitive and defined for every α > 0. Scale-sensitive combinatorial dimensions are not new to learning theory, and have also been defined to characterize learnability for regression problems (Bartlett et al., 1994; Rakhlin et al., 2015).

Our first result in this section shows that for every α > 0, the finiteness of GCα(H, A) provides a qualitative characterization of α-group constrained generatability. Theorem 3.3 (Characterization of α-Representative Uniform Generatability). Let X be countable, H {0, 1}X

be any class satisfying the UUS property, and A = {Ai}i N be any countable partition of X. Then, (H, A) is α-representatively uniformly generatable if and only if GCα(H, A) < .

An immediate consequence of Theorem 3.3 is a characterization for representative uniform generatability.

Corollary 3.4 (Characterization of Representative Uniform Generatability). Let X be countable, H {0, 1}X be any class satisfying the UUS property, and A = {Ai}i N be any countable partition of X. Then, (H, A) is representatively uniformly generatable if and only if GCα(H, A) < for all α > 0.

Our proof of Theorem 3.3 is constructive. In particular, to show that the finiteness of GCα(H, A) is necessary, we first assume that GCα(H, A) = . Then, for every generator G, we explicitly choose a hypothesis h H and construct a valid stream of examples x1, x2, supp(h) such that G eventually violates either consistency or αrepresentativeness. In fact, if GCα(H, A) = d, a simple adaption of our proof shows that for any generator G, there exists a hypothesis h H and a valid stream of examples x1, x2, supp(h) such that G violates either consistency or α-representativeness after observing d distinct

examples. To prove that the finiteness of GCα(H, A) is sufficient, we construct an α-representative generator G which satisfies consistency after observing GCα(H, A)+1 distinct examples. Unlike uniform generation without representation, our generator G for representative uniform generation computes closures at every time point t N. Combining both the necessary and sufficient directions shows that not only does the finiteness of GCα(H, A) characterize α-group constrained uniform generation, but also that the optimal sample complexity is exactly Θ(GCα(H, A)). We defer the full proof of Theorem 3.3 to Appendix C.1.

Another important consequence of Theorem 3.3 is its implications on representative uniform generation for finite H and A. Our next result uses the Group Closure dimension and Theorem 3.3 to show that all tuples (H, A) where both H and A are finite are representative uniform generatable. The full proof can be found in Appendix C.2

Corollary 3.5 (All Finite Classes and Finite Partitions are Representatively Uniformly Generatable). Let X be countable, H {0, 1}X be any finite class satisfying the UUS property, and A = {Ai}i K be any finite partition of X. Then, (H, A) is representatively uniformly generatable.

As we will show in Lemma 4.3, the finiteness of A is in a weak sense necessary for Corollary 3.5 to hold if A is allowed to be countably infinite in size, there exists a hypothesis class of size one that is not even generatable in the limit with representation! Nevertheless, even when A is finite, representative uniform generation is still harder than (unrepresentative) uniform generation as evidenced by the following Corollary, which we prove in Appendix C.3.

Corollary 3.6 (Representative Uniform Generation = Uniform Generation). Let X be countable. There exists a countable, UUS class H {0, 1}X and a finite partition A = {Ai}i [K] of X such that H is uniformly generatable but (H, A) is not representatively uniformly generatable.

In fact, Corollary 3.6 shows something stronger there are trivially uniformly generatable classes which are not representatively uniformly generatable with just two groups. This brittleness of generatability when forced to satisfy both consistency and representation is not unique to our notion of diversity, but also shown by existing work on generation with breadth (Kalavasis et al., 2024b; Charikar & Pabbaraju, 2024; Kalavasis et al., 2024a).

3.2. Representative Non-uniform Generation

We now proceed to give a characterization of representative non-uniform generation. Recall that for non-uniform generation, we allow the sample complexity (i.e., the number of distinct examples needed before consistency) of the generator to depend on the hypothesis chosen by the adversary, but not the stream. Similar to the characterization of

Representative Language Generation

non-uniform generatability from Li et al. (2024) without representation constraints, our main result in this section provides a characterization of representative non-uniform generation in terms of representative uniform generation.

Theorem 3.7 (Characterization of Representative Non-uniform Generatability). Let X be countable, H {0, 1}X be any class satisfying the UUS property, and A = {Ai}i N be any countable partition of X. Then, (H, A) is representatively non-uniformly generatable if and only if for every α > 0, there exists a non-decreasing sequence of classes H1 H2 such that H = S i=1 Hi and (Hi, A) is α-representative uniformly generatable i N.

The proof of Theorem 3.7 is similar to the proof of Theorem 3.5 in Li et al. (2024), hence we defer the details to Appendix C.4. We instantiate Theorem 3.7 with Corollary 3.8 to show that every tuple (H, A) where H is countably infinite and A is finite is representatively non-uniformly generatable. The proof of Corollary 3.8 is in Appendix C.5. Like for representative uniform generation, the finiteness of A is, in a weak sense, necessary for Corollary 3.8 to hold.

Corollary 3.8 (All Countable Classes and Finite Partitions are Representatively Non-uniformly Generatable). Let X be countable, H {0, 1}X be any countably infinite class satisfying the UUS property, and A = {Ai}i K be any finite partition of X. Then, (H, A) is representatively nonuniformly generatable and hence, also representatively generatable in the limit.

4. Representative Generation in the Limit

We finally consider the weakest definition of successful generation: generation in the limit. As per Definition 2.12, this requires a generator to output a distribution representative of the data stream at each timestep, and after some finite point in time, all outputs must be consistent with the true language. We note that as the weakest notion of generation, all positive results from Section 3 for uniform and non-uniform representative generatability apply to representative generatability in the limit. Notably, Corollary 3.8 implies that any countably infinite H can be generated in the limit with representation for any finite partition A of X.

In this section, we prove that relaxing to representative generatability in the limit expands the set of feasible A s in two ways: (1) allowing overlapping groups instead of disjoint partitions, and (2) permitting countably infinite sets of groups under a finite support assumption (defined below). We maintain that X S

Definition 4.1 (Finite Support Size). For any hypothesis h : X {0, 1} and collection of possibly overlapping groups A, we define h s finite support size with respect to

A S A supp(h)|<

A S A supp(h) .

In other words, the total size of all arbitrary intersections with a subset of groups in A and the support of h. Note that in the case of disjoint groups, this quantity simplifies to the number of individuals in supp(h) that are members of groups that have finite intersection with supp(h). While any finite collection of groups A will always satisfy the finite support assumption, this is not the case for countably infinite sets of groups. A simple example of a collection of groups that does not satisfy the assumption is any infinite collection of groups where the size of every group is finite, such as the collection of all singletons A = {{x} : x X}, or the collection defined in the proof of Lemma 4.3.

Definition 4.2 (Hypothesis Class with Finite Support). We say that a hypothesis class H has finite support with respect to a collection of groups A if for every h H, fh,A < .

While ideally we could show that all classes of countable groups can be generated in the limit with representation without any additional assumptions, the following lemma shows that this finite support assumption is crucially necessary for generatability with representation. Lemma 4.3 (Necessity of Finite Support). There exists a countably infinite partition of X, A, and a finite hypothesis class H with |H| = 1 that is not generatable in the limit with representation for any 0 < α < 1.

The proof of Lemma 4.3 can be found in Appendix D.1. We contrast this impossibility result with a strong positive result: any countable H and countable, possibly overlapping A satisfying the finite support assumption can be generated in the limit with representation. Theorem 4.4. Let X be countable, H {0, 1}X be any countable class satisfying the UUS property, and A = {Ai}i N a countable collection of possibly overlapping subsets of X such that H has finite support with respect to A. Then, (H, A) is representative generatable in the limit.

Before providing the proof in full, we provide a sketch of the main ideas. In Kleinberg & Mullainathan (2024), the generator they provide to generate in the limit uses the notion of a critical hypothesis:

Definition 4.5 (Critical Hypothesis). Given an enumeration of H, h1, h2, ..., we say that a hypothesis hn is critical at step t if n t, hn is consistent with the samples seen thus far, i.e. {x1, ..., xt} supp(hn), and for every i < n with {x1, ..., xt} supp(hi), we have supp(hn) supp(hi).

Given any countable H, the generator constructed by Kleinberg & Mullainathan (2024) works as follows: at time step t

Representative Language Generation

the generator finds the critical hypothesis hj with the largest index j t, and outputs an arbitrary unseen element from that hypothesis s support. The correctness of their algorithm follows from a key property of the true language:

Lemma 4.6 (Kleinberg & Mullainathan (2024), Claim 4.3). Given any countable H = {h1, h2, ...} and enumeration x1, x2, ... of some h H, there exists a timepoint t N such that h is critical for all timesteps s t.

By the definition of a critical language, if h is critical, then the critical hypothesis hj with the largest index at step t must satisfy supp(hj) supp(h). Thus, any point output by the generator after step t must be consistent, as it comes from a subset of the true language.

While sufficient for generating in the limit, this last line of reasoning exposes an obstacle in satisfying group representation: while hj is guaranteed to be a subset of the true hypothesis s support, supp(hj) could consist of only elements from a single group, even if the true hypothesis s support and the data stream thus far are more diverse. Thus, we will need to ignore some critical hypotheses that do not allow the possibility of representative generation. We define a feasible hypothesis to be an h that admits a distribution over its unseen elements that approximates the group proportions in the empirical distribution:

Definition 4.7 (Feasible Hypothesis). Given a hypothesis hi H, we say that a hi is α-feasible at step t if there exists a distribution µ over unseen data points in supp(hi) \ {x1, ..., xt} such that µ|A x1:t|A α.

Note that even the true hypothesis h may not be α-feasible at a given timestep. However, like criticality, we can show that there exists a timestep d N such that for all s d, h must be feasible:

Lemma 4.8. Consider any countable, UUS, H {0, 1}X

and countable collection of possibly overlapping subsets A = {Ai}i N and assume H has finite support with respect to A. Let x1, x2, ... be an enumeration of supp(h) for some h H. Then, for any α > 0, there exists a d N such that for all s d, h is α-feasible at timestep s.

We prove this lemma in Appendix D.2. By definition, the feasibility of a language guarantees that there exists a distribution µ over unseen elements that satisfies the αrepresentative requirement. Thus, we can tweak our algorithm to only consider the critical and feasible hypothesis hj with the largest index j t. Combining lemmas 4.6 and 4.8, we can guarantee that for any α > 0 there exists some finite point t = max{d, t} such that for all s t , the true hypothesis h is both critical and α-feasible. Thus, for all s t at least one such feasible and critical language exists, and by the same reasoning as before, we must have supp(hj) supp(h), and so outputting an α-representative

µ from the right-most α-feasible and critical language guarantees both consistency and representation after t . Thus, our generator satisfies representative generation in the limit.

With the building blocks of Lemmas 4.6 and 4.8 in place, the proof of Theorem 4.4 follows as described in the proof sketch. The formal proof can be found in Appendix D.3.

4.1. Barriers to Achieving Representative Generation in the Limit with only Membership Queries

Thus far, we have considered representative generatability only in an information-theoretic sense, without regard for the amount of computation required by our generators. However, Kleinberg & Mullainathan (2024) provide an interesting positive result: any countable H can be generated in the limit with a generator that only requires a finite number of membership queries of the form x supp(h)? for any h H at each timestep. In contrast to this positive result, Charikar & Pabbaraju (2024) show that no algorithm using only membership queries can be used to non-uniformly generate for all hypothesis classes of size two.

With these results in mind, it s natural to ask whether representative generation in the limit can also be achieved by an algorithm that uses only a membership query oracle. In this new setting where we care about representation as well as consistency, it s natural to assume we can make queries about both group and hypothesis membership, i.e. ask questions of the form x supp(h)? or x Ai? for any h H or Ai A.

We show that unlike in the standard setting of generation in the limit, the additional representation constraint poses a significant barrier to generating with only membership queries. This negative result is with respect to the generation in the limit setting, while Charikar & Pabbaraju (2024) s negative result holds only in the stronger non-uniform generation setting without representation. However, both proofs revolve around a similar obstacle: with only membership queries to single groups or single hypotheses, it s difficult to know whether the intersection of a group and a hypothesis s support (or in the case of Charikar & Pabbaraju (2024), the intersection of the supports of two hypotheses), contains infinitely many elements, or finitely many.

The following lemma shows that no algorithm that works even just for very simple, finite pairs of H and A can generate in the limit with representation using only membership queries. The proof proceeds by contradiction, assuming the existence of such a generator. We then analyze its behavior to construct an enumeration that forces the generator to violate either consistency or representation at each timestep. The complete proof, along with an extended discussion of special cases where representative generation with membership queries is feasible, is provided in Appendix D.4.

Representative Language Generation

Lemma 4.9 (Impossibility of Generating with Group Constraints with Only Membership Queries). For any α < 1/2, there cannot exist a (deterministically computed) randomized generator that satisfies α-representative generation in the limit for any UUS hypothesis class H = {h} and finite partition of X, A = {A1, A2}, and uses only a finite number of membership queries of the form x supp(h)? or x Ai? for h H and i {1, 2} at each step.

5. Discussion and Future Directions

In this paper, we introduced and analyzed the concept of representative generatability, which extends the theoretical framework for generation introduced by Kleinberg & Mullainathan (2024) and Li et al. (2024). This novel property ensures that the outputs of generative models closely approximate the proportions of certain groups-of-interest in the training data. We provide a complete, informationtheoretic characterization of which combinations of hypothesis classes and groups are uniformly and non-uniformly generatable with representation. In addition, we study representative generatability in the limit from both informationtheoretic and computational perspectives. Notably, we demonstrate that, unlike the case of non-representative generatability in the limit, membership queries alone are insufficient for computable algorithms to achieve representative generation in the limit.

Our additional constraint of representational generation highlights key tensions and possibilities between the positive results of Kleinberg & Mullainathan (2024) s model and real-world approaches to generation. Specifically, while real-world approaches typically aim to develop generative models that closely approximate training distributions and it is indeed natural to expect our generations to resemble training data in certain aspects Kleinberg & Mullainathan (2024) s notion of generation in the limit imposes no requirement that generated data must resemble previously observed data, only that it must belong to the true language. Our work maintains the generation-in-the-limit framework while introducing an additional constraint: generations must resemble training data with respect to simple statistical tests measuring the prevalence of certain subpopulations. Arguably, our notion of representation is useful to formalize even in a practical setting, as it addresses an important consideration: generative models can potentially underor over-represent certain subpopulations, even when they demonstrate good overall alignment with the training data. There are still several directions of future work, three of which we review below.

Representative Uniform and Non-uniform Generation for Richer Collections of Groups. In Sections 3.1 and 3.2, we show that for all finite A that form a partition of X,

all finite and countable classes are representative uniformly and non-uniformly generatable, respectively. In the case of representative generation in the limit, however, our positive results extend to a much richer class of group collections: any countable collection of possibly overlapping groups satisfying the finite support assumption. Lemma 4.3 shows that in full generality, this finite support assumption is necessary for representative generation to be possible. However, this still leaves a large gap between the collections of groups we show are uniformly and non-uniformly generatable with representation (finite partitions) vs. the collections of groups we can show are generatable in the limit with representation (countable overlapping groups with finite support). This raises the question of whether this gap can be closed: are all finite and countable classes representative uniformly and non-uniformly generatable respectively if H has finite support with respect to A? If not, what are the minimal assumptions that need to be placed on (H, A) so that all finite and countable H are representative uniformly and non-uniformly generatable, respectively? More generally, what characterizes representative uniform and non-uniform generation when groups may be overlapping?

Representative Generation with Dynamic Groups. Our model of representative generation considers a fixed collection of groups A. However, in practice, group membership may evolve with time, with existing groups growing and shrinking in size, different features signaling membership in particular groups, and even some new groups forming. We leave it as an important direction of future work to extend our characterizations and study of representative generation to the case where there exists a time-indexed collection of groups A1, A2, . . . , capturing the fact that in practice, group memberships may evolve with time.

Representative Generation Beyond the Supremum Distance. In this paper, we quantified the quality of group representation of a generator s outputs via the supremum distance between the empirical probabilities over groups and the induced group probabilities of the generator s output. However, there are other natural ways of enforcing group representation. For example, one could consider swapping the supremum distance with the ℓ1-distance between group probabilities, or some other notion of distance that is more appropriate for a particular application in mind. We leave the study of generatability under these alternate notions of group representation as a important direction of future work.

Impact Statement

This work introduces a theoretical framework for representative generation in machine learning models, addressing critical issues of bias and lack of diversity in generative model outputs. By requiring proportional representation of

Representative Language Generation

different groups from training data, our approach aims to mitigate risks of underrepresentation and promote more fair and inclusive AI systems.

The implementation of representative generation could help reduce societal biases perpetuated by AI systems, potentially leading to more equitable outcomes in machine-learningaided decision-making processes such as hiring or content recommendation. It may also enhance the utility of generative models by ensuring they capture a wider range of perspectives and experiences.

However, risks exist. Defining and categorizing groups for representation could introduce new biases or oversimplify complex social dynamics. As this work is theoretical, practical implementation will require careful consideration of these ethical implications.

Angluin, D. Finding patterns common to a set of strings. In Proceedings of the eleventh annual ACM Symposium on Theory of Computing, pp. 130 141, 1979.

Angluin, D. Inductive inference of formal languages from positive data. Information and control, 45(2):117 135, 1980.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International conference on machine learning, pp. 214 223. PMLR, 2017.

Bartlett, P. L., Long, P. M., and Williamson, R. C. Fatshattering and the learnability of real-valued functions. In Proceedings of the seventh annual conference on Computational learning theory, pp. 299 310, 1994.

Bender, E. M., Gebru, T., Mc Millan-Major, A., and Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp. 610 623, 2021.

Bousquet, O., Hanneke, S., Moran, S., Van Handel, R., and Yehudayoff, A. A theory of universal learning. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp. 532 541, 2021.

Charikar, M. and Pabbaraju, C. Exploring facets of language generation in the limit. ar Xiv preprint ar Xiv:2411.15364, 2024.

Dwork, C., Kim, M. P., Reingold, O., Rothblum, G. N., and Yona, G. Outcome indistinguishability. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp. 1095 1108, 2021.

Gallegos, I. O., Rossi, R. A., Barrow, J., Tanjim, M. M., Kim, S., Dernoncourt, F., Yu, T., Zhang, R., and Ahmed, N. K. Bias and fairness in large language models: A survey. Computational Linguistics, pp. 1 79, 2024.

Gold, E. M. Language identification in the limit. Information and control, 10(5):447 474, 1967.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial networks. Communications of the ACM, 63(11):139 144, 2020.

Gopalan, P., Kalai, A. T., Reingold, O., Sharan, V., and Wieder, U. Omnipredictors. ar Xiv preprint ar Xiv:2109.05389, 2021.

Gopalan, P., Reingold, O., Sharan, V., and Wieder, U. Multicalibrated partitions for importance weights. In International Conference on Algorithmic Learning Theory, pp. 408 435. PMLR, 2022.

H ebert-Johnson, U., Kim, M., Reingold, O., and Rothblum, G. Multicalibration: Calibration for the (computationallyidentifiable) masses. In International Conference on Machine Learning, pp. 1939 1948. PMLR, 2018.

Kalavasis, A., Mehrotra, A., and Velegkas, G. Characterizations of language generation with breadth. ar Xiv preprint ar Xiv:2412.18530, 2024a.

Kalavasis, A., Mehrotra, A., and Velegkas, G. On the limits of language generation: Trade-offs between hallucination and mode collapse. ar Xiv preprint ar Xiv:2411.09642, 2024b.

Kearns, M., Neel, S., Roth, A., and Wu, Z. S. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International conference on machine learning, pp. 2564 2572. PMLR, 2018.

Kirk, H. R., Jun, Y., Volpin, F., Iqbal, H., Benussi, E., Dreyer, F., Shtedritski, A., and Asano, Y. Bias out-of-thebox: An empirical analysis of intersectional occupational biases in popular generative language models. Advances in neural information processing systems, 34:2611 2624, 2021.

Kleinberg, J. and Mullainathan, S. Language generation in the limit. ar Xiv preprint ar Xiv:2404.06757, 2024.

Li, J., Raman, V., and Tewari, A. Generation through the lens of learning theory. ar Xiv preprint ar Xiv:2410.13714, 2024.

Mei, K., Fereidooni, S., and Caliskan, A. Bias against 93 stigmatized groups in masked language models and downstream sentiment classification tasks. In Proceedings of

Representative Language Generation

the 2023 ACM Conference on Fairness, Accountability, and Transparency, pp. 1699 1710, 2023.

Rakhlin, A., Sridharan, K., and Tewari, A. Online learning via sequential complexities. J. Mach. Learn. Res., 16(1): 155 186, 2015.

Sheng, E., Chang, K.-W., Natarajan, P., and Peng, N. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3407 3412, 2019.

Sheng, E., Chang, K.-W., Natarajan, P., and Peng, N. Societal biases in language generation: Progress and challenges. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4275 4293, 2021.

Thanh-Tung, H. and Tran, T. Catastrophic forgetting and mode collapse in gans. In 2020 international joint conference on neural networks (ijcnn), pp. 1 10. IEEE, 2020.

Wang, P., Xu, D., Fan, Z., Wang, D., Mohan, S., Iandola, F., Ranjan, R., Li, Y., Liu, Q., Wang, Z., et al. Taming mode collapse in score distillation for text-to-3d generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9037 9047, 2024.

Zhou, M., Abhishek, V., Derdenger, T., Kim, J., and Srinivasan, K. Bias in generative ai. ar Xiv preprint ar Xiv:2403.02726, 2024.

Representative Language Generation

A. Related Works

We highlight a few existing notions that are closest to our work.

Language Identification and Generation. In his seminal 1967 paper, E Mark Gold introduced the model of Language Identification in the Limit (1967). In this model, there is a countable collection of strings U and a language family L = {L1, L2, . . . }, where each Li U. An adversary plays a sequential game with a player over rounds indexed by t N. Before the game begins, the adversary picks a language K L and an enumeration w1, w2, . . . of the strings in K, with possible repetitions. In each round t N, the player observes the string wt K, and outputs an index it. The goal of the player is to eventually identify the language K by eventually outputting indices it such that Lit = K. More formally, the player has identified K in the limit if there exists an s N such that Lit = K for all t s. The family L is identifiable in the limit, if every language K L is identifiable in the limit. Gold proved that while all finite language families L are identifiable in the limit, there exists simple countable language families where this is not the case. Following this work, (Angluin, 1979; 1980) provides a precise characterization of which language families are identifiable in the limit, further emphasizing the impossibility of language identification in the limit.

Very recently, Kleinberg & Mullainathan (2024) revisit the Gold s model of Language Identification in the Limit with a twist: suppose in each round t N, the player is not asked to output the index of K, but rather a string ˆw U with the hope that ˆw K \ {w1, . . . , wt}. Naming this setting Language Generation in the Limit , Kleinberg & Mullainathan (2024) show a surprisingly different result. Unlike the case of identification, Kleinberg & Mullainathan (2024) show that all countable language families are generatable in the limit. In a follow-up work, Li et al. (2024) extend Kleinberg & Mullainathan (2024) s results beyond language generation in the limit by (1) re-framing the problem in terms of a binary hypothesis classes defined over a countable example space (2) defining new, stronger models of generation termed uniform and non-uniform generation and (3) formalizing an abstract, prompted version of generation. In this paper, we mainly adopt the notation and models of generation from Li et al. (2024). In addition, we highlight connections between representative generation and the model of prompted generation introduced by Li et al. (2024) in Appendix B.

In addition to Li et al. (2024), there have been several follow-up works to Kleinberg & Mullainathan (2024) that study both consistency and breadth in language generation in the limit. We review the consistency results of these papers here, and defer a discussion of their results on breadth to the next section. Concurrently with Li et al. (2024), Kalavasis et al. (2024b) study generation in the stochastic setting, where the positive examples revealed to the generator are sampled i.i.d. from some unknown distribution. In this model, Kalavasis et al. (2024b) quantify the error rates for generation with consistency according to the universal rates framework of Bousquet et al. (2021). Charikar & Pabbaraju (2024) study several facets of language generation. First, they show that all countable classes satisfy the stronger property of non-uniform generation. Then, they show a hardness result the stronger setting of non-uniform generation is not possible using only membership queries. This is in contrast to generatability in the limit, where Kleinberg & Mullainathan (2024) show that every countable class is generatable in the limit using only membership queries. Lastly, Charikar & Pabbaraju (2024) characterize which classes are uniformly generatable when additional feedback is available.

Language Generation with Breadth. In their original paper, Kleinberg & Mullainathan (2024) formally describe a tension between consistency and breadth, defined as producing outputs that represent the full range of valid outputs in some reasonable way. This observation has prompted several follow-up studies proposing and examining various definitions of breadth. Charikar & Pabbaraju (2024) introduce the notion of exhaustive generation, while Kalavasis et al. (2024b) and Kalavasis et al. (2024a) explore three potential definitions: generation with breadth, generation with approximate breadth, and unambiguous generation. Although these notions differ slightly, they all essentially require that the generator s outputs cover nearly (or exactly) the entirety of the true language in the limit. Both sets of authors demonstrate an inherent tension between consistency and breadth, as theorized by Kleinberg & Mullainathan (2024). Notably, Kalavasis et al. (2024b) show that the strongest notion, generation with breadth is as difficult as identification in the limit.

Our notion of representative generation can be contrasted with these definitions of breadth as both stronger and weaker along different axes.

On one hand, representative generation can be viewed as a relaxation of generation with breadth. While it requires the generator to produce diverse outputs with respect to group membership, it doesn t demand complete coverage of the entire language. The generator has two key flexibilities: it can ignore groups that appear only in a tiny fraction of the data sequence, and it need not cover the entirety of every group it does generate from. These allowances make representative generation less

Representative Language Generation

stringent than full breadth requirements. Notably, our notion proves much more tractable compared to prior breadth concepts, making it a significant relaxation. In particular, Theorem 4.4 demonstrates that any countable collection of languages and countable collection of groups, under a minor assumption, can be generated in the limit with group representation.

On the other hand, representative generation can also be seen as a stronger notion compared to breadth. Previous breadth concepts do not consider how the generated output at each step compares to the data seen thus far. For instance, a valid generator with breadth for our diverse set of animal images might produce all possible animal images in the limit but could do so by generating 1000 cat pictures for every picture of a different species. While this approach might achieve coverage in the limit, it would fail to capture the diversity of animals seen in the data stream in the short term. Such a generator would lack practical utility, as it wouldn t reflect the variety in the input data. Representative generation, in contrast, maintains diversity throughout, offering more immediate value.

Multigroup Fairness. Our notion of α-representativeness is closely related to multigroup fairness notions in the algorithmic fairness literature that have been studied mainly in the context of prediction, such as multiaccuracy and multicalibration (H ebert-Johnson et al., 2018; Kearns et al., 2018). These notions require that a predictor satisfy quality metrics like accuracy-in-expectation or calibration not only for the overall population but also for every subset within a rich collection of demographic groups. Perhaps closest to our notion is the work of Gopalan et al. (2022), who study notions of multicalibration for distributions. While they focus on the stronger notion of multicalibration for distributions, they do define an analogous notion of α-multiaccuracy in expectation for distributions (Gopalan et al. (2022), Definition 9). In their framework, the requirement of an α-representative generator can be equivalently expressed as follows: at each step, the generator must produce a distribution that is α-multiaccurate in expectation with respect to both the empirical distribution of points generated thus far and the collection of groups A. Our work, however, diverges from this framework by concentrating on the generative process itself. We explore how group representation can be achieved and maintained throughout the sequential generation of points, in contrast to analyzing the properties of a single estimated distribution.

Outcome Indistinguishability. Although presented in terms of group representation, our notion can be reframed in the language of indistinguishability. From this perspective, the goal of representative generation can be interpreted as producing a generator that is not only consistent in the limit but also, at every step, outputs a distribution indistinguishable from the data seen so far. This indistinguishability is measured with respect to a set of tests, which in our case are precisely the indicator functions for group membership. We see potential in expanding this perspective and extending our results to incorporate richer sets of tests. This direction for future work is particularly promising given the success of related concepts in other domains. For instance, the outcome indistinguishability framework introduced by Dwork et al. (2021) for the prediction setting has proven to be a powerful notion, enabling powerful guarantees for predictors such as omniprediction (Gopalan et al., 2021).

B. Connections to Prompted Generation

Existing works have also explored variants of the generation setting that capture generation tailored to specific prompts at each time step. This concept is referred to as prompted generation by Kleinberg & Mullainathan (2024), and generalized to multiclass generation by Li et al. (2024).

Prompted generation modifies the standard generation setting by expanding the representation of a language from a binary hypothesis h : X {0, 1} to a multiclass hypothesis h : X [k]. At each time step, the adversary provides an element xt, as well as its associated label h(xt). The generator s goal is to generate a consistent element with the same label, i.e. an x X with h(x) = h(xt).

As noted by Remark 5.1 in Li et al. (2024), the multiclass framework could be shaped to be similar to the representative generation setting by selecting a partition A of X and transforming each h : X {0, 1} into a multiclass hypothesis such that h(x) = i if h(x) = 1 and x Ai, and 0 otherwise. However, the existence of such a conversion is less clear in the case of overlapping groups. Additionally, whereas representative generation only requires generations that approximate the empirical distribution, prompted generation could require the generator to output an element with a label that has appeared in only a tiny fraction of the data. For this reason, the generation guarantees for these approaches differ in nature. Prompted generation requires a certain number of elements to be seen per group before generating consistently. On the other hand, representative generation offers consistency guarantees that depend solely on the total number of elements seen, regardless of their group membership.

Representative Language Generation

C. Proofs from Section 3

C.1. Proof of Theorem 3.3

We separate the two directions (necessity and sufficiency) of Theorem 3.3 into two proofs below.

Proof of Necessity in Theorem 3.3. Let X be countable, H {0, 1}X be any class satisfying the UUS property, and A = {Ai}i N be any countable infinite 1 partition of X. Suppose that GCα(H, A) = . It suffices to show that for every generator G and d N, there exists d d, an h H, and a sequence x1, x2, . . . with {x1, x2, . . . } supp(h) such that either

(1) for every t N where |{x1, . . . , xt}| = d , there exists an s t where

Pr ˆxs G(x1:s) [ˆxs supp(h) \ {x1, . . . , xs}] < 1

(2) there exists a t N such that ||G(x1:t)|A x1:t|A|| > α.

To that end, fix a generator G and a number d N. Since GCα(H, A) = , we know there must exist some d d and a sequence of distinct examples x1, . . . , xd such that either

(a) i N such that x1, . . . , xd H Ai \ {x1, . . . , xd } = and x1:d |A(i) > α or

(b) α|N/S| < P

i S x1:d |A(i) where S = {i N : x1, . . . , xd H Ai \ {x1, . . . , xd } = }.

Consider passing to G the sequence x1, . . . , xd . If

Pr ˆxd G(x1:d ) [ˆxd x1, . . . , xd H \ {x1, . . . , xd }] = 1,

then pick any h H such that {x1:d } supp(h) and any completion of the stream xd +1, xd +2, . . . such that {xd +j}j N supp(h) \ {x1:d }. On the other hand, if

Pr ˆxd G(x1:d ) [ˆxd x1, . . . , xd H \ {x1, . . . , xd }] < 1,

then there must exist an h H such that {x1:d } supp(h) while

Pr ˆxd G(x1:d ) [ˆxd supp(h) \ {x1:d }] < 1.

Pick this h H and complete the stream like before using this h H. To see why such an h H must exist, note that if Prˆxd G(x1:d ) [ˆxd x1, . . . , xd H \ {x1, . . . , xd }] < 1, then there must exist an x / x1, . . . , xd H \ {x1, . . . , xd } which G(x1:d ) puts positive mass on. But because this x / x1, . . . , xd H \ {x1, . . . , xd }, there must be an h H which contains x1:d in its support but not x. We now show that the selected hypothesis and stream satisfies either condition (1) or (2) in three cases.

Case 1: Suppose on the input x1, . . . , xd , we have that

Pr ˆxd G(x1:d ) [ˆxd x1, . . . , xd H \ {x1, . . . , xd }] < 1.

Consider the hypothesis h H and the stream x1, x2, . . . chosen above for this case. Note that {x1, x2, . . . } supp(h) by definition. Moreover, since xd +1 = xd , t = d is the only such time point where |{x1, . . . , xt}| = d . Finally, on round s = d t, we have that Prˆxs G(x1:s) [ˆxs supp(h) \ {x1:s}] < 1, satisfying condition (1).

1An identical proof follows if A is instead a finite partition of X.

Representative Language Generation

Case 2: Suppose on the input x1, . . . , xd , we have that

Pr ˆxd G(x1:d ) [ˆxd x1, . . . , xd H \ {x1, . . . , xd }] = 1

and the input x1, . . . , xd satisfies condition (a). Consider the hypothesis h H and the stream x1, x2, . . . chosen above for this case. Note that {x1, x2, . . . } supp(h) by definition. Let i N be the group satisfying the property in condition (a). Observe that on time point t = d , we have that ||G(x1:t)|A x1:t|A|| x1:t|A(i) > α because x1, . . . , xd H \ {x1, . . . , xd } Ai = and G(x1:d )|A puts all its mass on x1, . . . , xd H \ {x1, . . . , xd }. Thus, condition (2) is met and G violates α-representation.

Case 3: Suppose on the input x1, . . . , xd , we have that

Pr ˆxd G(x1:d ) [ˆxd x1, . . . , xd H \ {x1, . . . , xd }] = 1

and the input x1, . . . , xd satisfies condition (b). Consider the hypothesis h H and the stream x1, x2, . . . chosen above for this case. Note that {x1, x2, . . . } supp(h) by construction. Let t = d . We claim that if condition (b) holds, we have that ||G(x1:t)|A x1:t|A|| > α. For the sake of contradiction, suppose that ||G(x1:t)|A x1:t|A|| α. Then, we have that G(x1:t)|A(i) x1:t|A(i) + α for all i N/S and G(x1:t)|A(i) = 0 for all i S, where the latter is true by definition of S. If condition (b) holds, it must be the case that |N \ S| < . Thus, we can write:

i N G(x1:t)|A(i) = X

i N\S G(x1:t)|A(i) X

x1:t|A(i) + α|N \ S|.

If condition (b) is true, then P

i N\S x1:t|A(i) < 1 α|N \ S| giving that P

i N G(x1:t)|A(i) < 1, a contradiction to the fact that G(x1:t)|A is a probability measure (recall that when A is a partition of X, the induced group probabilities of any µ X form a probability measure). Thus, it must be the case that ||G(x1:t)|A x1:t|A|| > α and as in Case 2, G violates α-representation and condition (2) is met.

This completes all cases. The overall proof is complete after noting that the generator G and number d N were picked arbitrarily.

Proof of Sufficiency in Theorem 3.3. Let X be countable, H {0, 1}X be any class satisfying the UUS property, and A = {Ai}i N be any countable infinite partition of X. Suppose that GCα(H, A) < for some α > 0. Let d := GCα(H, A) Then, by definition, we have that for every c d + 1 and sequence of distinct examples x1, x2, . . . , xc such that x1, x2, . . . , xc H = , both of the following conditions hold true:

(1) maxi S x1:c|A(i) α and

(2) α |N/S| P

i S x1:c|A(i),

where S = {i N : x1, . . . , xc H Ai \ {x1, . . . , xc} = }.

We will use this fact to construct an α-representative uniform generator satisfying the properties in Definition 2.9.

Let x1, x2, . . . be any stream of examples. Consider the following generator G. For each round t until d+1 unique examples have been observed, G computes and plays from any µt X such that µt|A is an α-approximation of the group empirical distribution x1:t|A, i.e. µt|A x1:t|A α. Note that such a µt is always guaranteed to exist, as we can choose precisely µt = x1:t. Suppose on round t , we have that |{x1, . . . , xt }| = d + 1. For all rounds t t , G checks whether x1, . . . , xt H = . If this is true, then G computes and plays from any µt X such that µt|A is an α-approximation of the group empirical distribution x1:t|A. Otherwise, G first computes the group empirical distribution x1:t|A(i) and then the set St = {i N : x1, . . . , xt H Ai \ {x1, . . . , xt} = }. If St := , G picks zi x1, . . . , xt H Ai \ {x1, . . . , xt} for all i N and constructs the distribution µt X such that µt(zi) = x1:t|A(i) and µt(x) = 0 for all x X \ {z1, z2, . . . }. Otherwise, if St = , G picks zi x1, . . . , xt H Ai\{x1, . . . , xt} for all i N\St and constructs a distribution µt X in the following way. First, G picks a measure µ : X [0, 1] such that µ t(zi) = x1:t|A(i) for all i N \ St and µ t(x) = 0 for all x X \ {z1, z2, . . . }. At this point, P

x X µ t(x) = P

i N\St x1:t|A(i) and thus, there is still P

i St x1:t|A(i)

Representative Language Generation

amount of mass to be placed. To complete the distribution, G distributes the remaining P

i St x1:t|A(i) mass among {z1, z2, . . . } and obtains a probability measure µt X such that 0 µt(zi) x1:t|A(i) α and P

x X µt(x) = 1. We will prove why this is possible below.

We now show that such a generator is α-representative and consistent after observing d := GCα(H, A) + 1 distinct examples. We first prove consistency.

Proof of consistency: Let G be the generator described above. Let h H be the target hypothesis and x1, x2, supp(h) be the stream chosen by the adversary. Without loss of generality, suppose there are at least d distinct examples in the stream and let t be the first time point such that |{x1, . . . , xt }| = d . We need to show for all s t :

Pr ˆxs G(x1:s) [ˆxs supp(h) \ {x1, . . . , xs}] = 1.

Fix some s t . Then, observe that by construction, G always picks and plays from a distribution µs X such that supp(µs) x1, . . . , xs H \ {x1, . . . , xs}. Since x1, . . . , xs H supp(h), the proof of consistency is complete.

We now prove that G is α-representative.

Proof of α-representativeness: Fix any (not necessarily valid) sequence of examples x1, x2, . . . . It suffices to show that for every t N, we have that

||G(x1:t)|A x1:t|A|| α.

Let t N be the smallest time point such that |{x1, . . . , xt }| = d . By definition, observe that G satisfies αrepresentativeness for all t < t . Fix some t t . There are three cases to consider. Suppose that x1, . . . , xt H = . Then by definition, G plays an α-approximation of x1:t|A and hence is α-representative. Suppose that x1, . . . , xt H = and St = . Then, by construction, G picks and plays from a distribution µt X such that µt|A = x1:t|A and thus α-representativeness is trivially satisfied. Finally, consider the case where x1, . . . , xt H = and St = . We claimed above that in this scenario, G first computes an incomplete measure µ t and then distributes the remaining P

i St x1:t|A(i) mass to obtain a probability measure µt such that 0 µt(zi) x1:t|A(i) α and P x X µt(x) = 1. To see why this is possible, first note that since |{x1, . . . , xt}| GCα(H, A) + 1, we know that conditions (1) and (2) hold. Then, consider two cases: (i) P

i St x1:t|A(i) > α and (ii) P

i St x1:t|A(i) α. In case (i), we know that µ t(zi) < 1 α for all i N \ St. Hence, we can obtain the probability measure µt by adding at most α mass to z1, and then z2, and so on until all of P

i St x1:t|A(i) has been accounted for since α |N/S| P

i St x1:t|A(i). In case (ii), there must exist an i N \ St such that µ t(i) 1 P

i St x1:t|A(i). Hence we can obtain the probability measure µt, by adding all of the mass of P i St x1:t|A(i) on zi since P i St x1:t|A(i) α. The analysis above gives that G plays a measure µt X such that |µt|A(i) x1:t|A(i)| α for all i N\St and µt|A(i) = 0 for all i St. However, since |{x1, . . . , xt}| GCα(H, A)+1, condition (1) holds and thus maxi St x1:t|A(i) α. This gives that |µt|A(i) x1:t|A(i)| α for all i St implying that ||G(x1:t)|A x1:t|A|| α. Since t t , this concludes the proof of α-representativeness and the overall proof.

C.2. Proof of Corollary 3.5

Proof of Corollary 3.5. Let X be countable, H {0, 1}X be any finite class satisfying the UUS property, and A = {Ai}i K be any finite partition of X. Suppose for the sake of contradiction that GCα(H, A) = for some α > 0. Let F := {V : V H, V = } be the set of all non-empty subsets of H. Since H is finite, F is also finite. We will assign a number to every V F. In particular, for every V F, first define and compute:

h V supp(h).

Then, let d V N be the largest finite number for which there exists x1, . . . , xd V V such that

(1) maxi S x1:d V |A(i) > α or

Representative Language Generation

(2) α(K |S|) < P

i S x1:d V |A(i)

where S := {i [K] : V \ {x1, . . . , xd V } Ai = }.

We will use the fact that |K| < to prove that such a finite d V exists for all V F. To that end, fix some V F. Without loss of generality, suppose that | V | = , as otherwise, the claim is trivially true (i.e if | V | < , it must be the case that d V | V | < ). It suffices to show that there exists a d N such that for every d d and sequence x1, . . . , xd V , we have that

(1) maxi S x1:d |A(i) α and

i S x1:d |A(i) α (K |S|)

where S = {i [K] : Ai V \ {x1, . . . , xd } = }. To do so, we need some more notation. First, separate the K groups into two sets. The set R1 = {i [K] : |Ai V | = } contains those groups whose intersection with V is unbounded in size. The set R2 = {i [K] : |Ai V | < } are those groups whose intersection with V is finite. Note that for any x1, . . . , xd, the groups in R1 will never appear in S, and so we have that S R2. Now, define

p := max j R2 |Aj V |.

and observe that p < because |R2| K < .

We are now ready to show the existence of such a d N. In particular, pick d = Kp

α and consider any sequence x1, . . . , xd V for d d. In the worst-case, x1, . . . , xd contains Ai V for all i R2. However, for any given i R2, at most p of the elements from x1, . . . , xd can be from Ai. Accordingly, we have that

max i S x1:d |A(i) α

Moreover, observe that

x1:d |A(i) α|S|

K α α(K |S|),

where the last inequality is true because the groups form a partition of X and so |S| < K. Accordingly, d V Kp

α < . Since V F was chosen arbitrarily, this is true for all such V F. Now, we complete the proof by showing a contradiction. Define

d = max{d V : V F}.

Again, d < because |F| < and d V < for all V F. Because GCα(H, A) = , we know that there exists a t d + 1, and a distinct sequence of examples x1, . . . , xt such that

(1) maxi S x1:t|A(i) > α or

(2) α(K |S|) < P

i S x1:t|A(i)

where S := {i N : x1, . . . , xt H Ai \ {x1, . . . , xt} = }. Take V := H(x1, . . . , xt) and note that x1, . . . , xt H = V . Thus, we have shown that d V t d + 1, which contradicts the fact that d = max{d V : V F} since V F. The proof is complete, showing that GCα(H, A) < .

Representative Language Generation

C.3. Proof of Corollary 3.6

Proof of Corollary 3.6. Let X = Z be the set of integers. Let H {0, 1}N be any class that is not uniformly generatable (for example see Lemma 3.9 in Li et al. (2024)). Define a new class H {0, 1}X such that H := {x 7 1{x supp(h ) Z 0} : h H } and let A := {N, Z 0}. Note that A is finite partition of X of size 2.

First, observe that H is trivially uniformly generatable since T

h H supp(h) Z 0 = . We now show that (H, A) is not representatively uniformly generatable. For the sake of contradiction, suppose that (H, A) was representatively uniformly generatable. Then, for every α (0, 1], there exists an α-representative generator G and a number d N such that after observing d distinct examples, the output of G is consistent (see Definition 2.9). Let G be such a generator for any α < 1. We will use G in a blackbox manner to construct a uniform generator for H .

Consider the following generator G for H . On input x1, . . . , xt N, G passes x1, . . . , xt to G and receives an αrepresentative distribution µt. G plays any x supp(µt) N \ {x1, . . . , xt} if one exists. Otherwise, G plays any x X. We claim that G perfectly generates from H after observing d number of distinct examples. To see why, let h H and x1, . . . , xt supp(h ) be any sequence such that |{x1, . . . , xt}| d . Let h : X {0, 1} be defined as h(x) := 1{x supp(h ) Z 0}. Then, h H, {x1, . . . , xt} supp(h), and therefore by definition of d and G, we have that

(1) Prˆxt µt [ˆxt supp(h) \ {x1, . . . , xt}] = 1 and

(2) ||µt|A x1:t|A|| α.

Because x1:t N and α < 1, in order to satisfy conditions (1) and (2), there must exist an ˆxt supp(µt) N\{x1, . . . , xt}, and hence G , by playing this ˆxt, perfectly generates on round t. Since t N and x1, . . . , xt were chosen arbitrarily, we have that G is a uniform generator for H which contradicts the fact that H was chosen to not be uniformly generatable.

C.4. Proof of Theorem 3.7

We separate the two directions (necessity and sufficiency) of Theorem 3.7 into two proofs below.

Proof of Sufficiency in Theorem 3.7. Let X be countable, H {0, 1}X be any class satisfying the UUS property, and A = {Ai}i N be any countable partition of X.

Fix some α > 0. Suppose there exists a non-decreasing sequence of classes H1 H2 such that H = S i=1 Hi and (Hi, A) is α-representative uniformly generatable for all i N. Then, there exists a α-representative uniform generator Gi for each (Hi, A). For every i N, let ni be the number of distinct examples that Gi needs to see before it is consistent (i.e. ni is the d in Definition 2.9). Observe we can assume without loss of generality that n1, n2, . . . is a non-decreasing sequence by using the α-representative uniform generator constructed in the proof of Theorem 3.3.

We will now use G1, G2, . . . to construct a α-representative non-uniform generator Q. To that end, let x1, x2, . . . be any stream of examples. Consider the following generator Q. On time point t N, Q first computes the number of distinct examples dt := |{x1, . . . , xt}| in the stream so far. Then, Q computes

it = max {i [t] : ni dt} {1}

and plays Git(x1, . . . , xt), the output of Git on input x1, . . . , xt.

We now show that Q is an α-representative non-uniform generator that satisfies Definition 2.11. Let us start by proving that Q is α-representative. Let x1, x2, . . . be any (not necessarily valid) stream of examples. It suffices to show that for every t N, we have that ||Q(x1:t)|A x1:t|A|| α. Recall that Q(x1, . . . , xt) = Git(x1, . . . , xt). Since Git is an α-representative generator for (Hit, A), it must be the case that

||Git(x1:t)|A x1:t|A|| α.

We now complete the proof by establishing the consistency guarantees of Q. To that end, let h H and x1, x2, supp(h) be the hypothesis and stream picked by the adversary. Let j N be the smallest index such that h Hj and let

Representative Language Generation

d := max{j , nj }. We claim that Q is consistent after observing d unique examples. To see why, suppose without loss of generality that x1, x2, . . . contains at least d distinct elements and let t be the smallest time point such that |{x1, . . . , xt }| = d . We need to show that for all s t , we have that Prˆxs Q(x1:s) [ˆxs supp(h) \ {x1, . . . , xs}] = 1. Fix some s t . Then, observe that ds nj and t j . Accordingly, we have that is j and hence h His. Since nis ds, we also know that

Pr ˆxs Gis(x1:s) [ˆxs supp(h) \ {x1, . . . , xs}] = 1.

Finally, since Q(x1:s) = Gis(x1:s) by construction, we have that Q is consistent on round s. Since s t is chosen arbitrarily, this is true for all such s, completing the proof of consistency and the overall. proof that Q is an α-representative non-uniform generator for (H, A).

Proof of Necessity in Theorem 3.7. Let X be countable, H {0, 1}X be any class satisfying the UUS property, and A = {Ai}i N be any countable partition of X. Suppose that (H, A) is representative non-uniformly generatable. We need to show that for every α > 0, there exists a non-decreasing sequence of classes H1 H2 such that H = S i=1 Hi and (Hi, A) is α-representative uniformly generatable i N. To that end, fix some α > 0. If H is representative non-uniformly generatable, then for this error level α, there exists a α-representative non-uniform generator G. For every h H, let dh N be such that for any sequence x1, x2, . . . with {x1, x2, . . . } supp(h), if there exists t N where |{x1, . . . , xt }| = dh, then G is consistent from t . Now, define Hi := {h H : dh i} for all i N. Note that H1 H2 and that S i=1 Hi = H. Finally, observe that G is an α-representative uniform generator for Hi, and hence, (Hi, A) is α-representative uniformly generatable.

C.5. Proof of Corollary 3.8

Proof. Let X be countable, H {0, 1}X be any countably infinite class satisfying the UUS property, and A = {Ai}i K be any finite partition of X. Fix some α > 0. By Theorem 3.7, it suffices to show that there exists a non-decreasing sequence of classes H1 H2 such that H = S i=1 Hi and (Hi, A) is α-representative uniformly generatable i N. Let h1, h2, . . . be any enumeration of H. Define Hi = {h1, . . . , hi} for all i N. Observe that H1 H2 and that H = S i=1 Hi. Finally, since |Hi| = i < and |A| < , by Corollary 3.5, we have that (Hi, A) is α-representative uniformly generatable. Since representative non-uniform generatability implies representative generatability in the limit, our proof is complete.

D. Proofs from Section 4

D.1. Proof of Lemma 4.3

Proof of Lemma 4.3. Select any 0 < α < 1, and let H = {h}, where supp(h) = X (note that this H is trivially generatable in the limit if we do not require group representation). Consider any arbitrary enumeration of supp(h), u1, u2, ..., with each ui distinct. Define a countable partition of X, A = {A1, A2, ...} such that for each i N,

Clearly this is a valid partition as all groups are disjoint, and because 1 1 α > 1 by definition, for every uj, there exists an

i N such that Pi 1 k=0 1 1 α k j < Pi k=0 1 1 α k . Note that for each i N,

|Ai| = |Ai supp(h)| =

Choose u1, u2, ... be the enumeration of supp(h) selected by the adversary. Consider any i N, and note that at step

t = Pi j=0 1 1 α j 1, all elements of Ai have been exhausted, and supp(h) Ai \ {x1, ..., xt} = . Moreover, Ai takes

Representative Language Generation

up more than an α-fraction of the distinct elements seen thus far, as x1:t|A(i) can be lower-bounded as

Pi j=0 1 1 α j 1

1 1 α (1 ( 1 1 α )i) 1 1 1 α

formula for sum of geometric series

1 1 α i+1 1 1 α i

1 1 α i+1 1 1 α

1 1 α 1 1 1 α

This implies that in order to be representative, the generator must output a distribution µt that puts positive mass on some x Ai, otherwise |µt|A(i) x1:t|A(i)| x1:t|A(i) > α.

However, because all elements of Ai have already been exhausted in the sequence, the generator must either violate consistency by putting mass on an x Ai that has already been seen in the sequence, or violate representation by putting no mass on Ai despite appearing in greater than an α-fraction of the sequence. Thus, the generator cannot satisfy both consistency and representation simultaneously at this timestep (in fact, it cannot generate correctly from any of the groups that have been seen so far).

This happens for each Ai, and thus any generator must make infinitely many mistakes at timesteps Pi j=0 1 1 α j 1 for every i N on this sequence, and thus it cannot satisfy the requirement of representative generation in the limit.

D.2. Proof of Lemma 4.8

Proof of Lemma 4.8. We introduce the notion of a group vector v(x) {0, 1}N \ {0N}, which given x X, indicates an x s group membership in all groups in A, with v(x)i = 1[x Ai].

Recall that fh,A denotes the finite support size of h with respect to A (Definition 4.1), which by assumption satisfies fh,A < . We can equivalently write the definition of fh,A in terms of group membership vectors, i.e.

v {0,1}N\{0N}, | T

i N,vi=1 Ai supp(h)|<

i N,vi=1 Ai supp(h) .

Let V {0, 1}N \ {0N} be the subset of all group membership vectors with finite support, i.e. all v {0, 1}N \ {0N} such that

i N,vi=1 Ai supp(h)

The remainder of the proof follows in three key steps:

1. We first show that after enough timesteps, the proportion of the data sequence taken up by unique elements with group membership lying in V must shrink to a very small quantity and stay below that amount. In particular, less than α/2.

Representative Language Generation

2. Next, we show that due to how little mass is placed on these vectors, it s possible to construct a distribution that α-approximates the empirical group probabilities, but places no mass on any x with v(x) V .

3. It follows that h must be feasible at this point, because all other group vectors have infinite support when intersected with supp(h).

Step 1: V takes up a small proportion of of the empirical distribution. Consider the timestep t at which point the sequence has included d = 2fh,A

α unique elements. For any s t, the proportion of elements in the empirical distribution with v(x) V is upper bounded by

Thus, for all s t, the total proportion of unique elements from V in the sequence must be at most α/2, as desired.

Step 2: Constructing an α-approximate distribution that ignores V . We now show that for all s t, because we are guaranteed that the total proportion of unique datapoints from V is at most α/2, we can construct a distribution whose group probabilities α-approximate x1:s|A and places no mass on any x with v(x) V . For now we will ignore exactly what elements the distribution uses, and just care about their group membership vectors. We construct a distribution πs over these elements as follows:

( 0 v(x) V or x x1:s x1:s(x) otherwise .

Clearly by definition πs places no mass on any x with v(x) V . It remains to show that this is actually a valid approximation, i.e πs|A x1:s|A α.

To this end, consider any Ai A. We rewrite the proportions of Ai appearing in πs and the empirical distribution:

|πs|A(i) x1:s|A(i)|

Pr x x1:s[x Ai, v(x) V ] + Prx x1:s[v(x) V ] 1 Prx x1:s[v(x) V ] Pr x x1:s[x Ai, v(x) V ]

Pr x x1:s[x Ai, v(x) V ] + Prx x1:s[v(x) V ] 1 Prx x1:s[v(x) V ] Pr x x1:s[v(x) V ]

2 Pr x x1:s[v(x) V ]

α guarantee from Step 1

Thus, because this holds for any Ai, we conclude that πs|A α-approximates x1:s|A.

Step 3: Constructing a feasible distribution. πs passes all group representation requirements, but is not quite what we want, because supp(πs) is supported on previously seen points in x1:s, whereas we want a distribution supported on supp(h) \ x1:s. Note that it suffices if for every x supp(πs) we can find an x supp(h) \ x1:s such that v(x) = v(x ). Then, the distribution µs defined as µs(x ) = πs(x) would exactly match the group vector proportions of πs, and thus also satisfy µs|A x1:s|A α as well as supp(µs) supp(h) \ x1:s.

This is in fact easy to do, as note that by construction, for every x supp(πs), we must have v(x) V , and thus

i N,v(x)i=1 Ai supp(h)

This means that even after removing x1:s, there are an infinite number of x that we can choose from to replace x for each x supp(πs).

Thus, putting all these steps together, we conclude that at the finite timestep t N after we have seen 2fh,A/α unique elements, for every s t we can find a distribution µs with supp(µs) supp(h) \ x1:s and µs|A x1:s|A α. Thus, h is α-feasible for all s t, completing the proof.

Representative Language Generation

D.3. Proof of Theorem 4.4

Proof of Theorem 4.4. Choose some α > 0, and consider the following mapping from a sequence of examples x1, ..., xt to a distribution over X:

1. Given examples x1, ..., xt, let Ct {h1, ..., ht} be the set of critical hypotheses at step t.

2. Let Ft Ct be the subset of critical hypotheses that are also α-feasible.

3. if Ft is empty, output the distribution µt = x1:t.

4. Otherwise, let hn Ft be the hypothesis in Ft with the largest index n t. Output the distribution µt over supp(hn) \ x1:t that witnesses hn s α-feasibility.

We now show that the generator G that follows this mapping and outputs µt at step t for all t N satisfies representative generation in the limit. To do this, we need to verify that G is α-representative, and for any enumeration x1, x2, ... of an h H and there exists some t N such that G is consistent after timestep t .

Property 1: α-Representative. We first show that the generator s output is representative at every t N. This is trivially true if G outputs µt = x1:t at Step 3, and by the definition of α-feasibility, the µt output at step 4 also α-approximates the empirical distribution of groups. Thus, in either case the µt output satisfies the representation requirement. Because this holds true for any datastream x1, x2, ..., we conclude that G is α-representative.

Property 2: Consistent. It remains to show that there exists some t N such that for all s t , supp(µs) supp(h) \ x1:s. Let t N be the timestep guaranteed to exist by Lemma 4.6 such that for all s t, h is a critical hypothesis, i.e. h Cs. Let d N be the finite timestep guaranteed to exist by Lemma 4.8 such that for all s d, h is α-feasible. Let t = max{d, t}. It follows that for all s t , we have h Fs, as it is both critical and α-feasible.

Thus, for any timestep s t , we are guaranteed that Fs is non-empty, and µs is guaranteed to satisfy supp(µs) supp(hn) \ x1:s, where hn Ft is the hypothesis with the largest index n t. It follows from the definition of a critical hypothesis that we must have supp(hn) supp(h), and thus µs satisfies consistency.

Thus, we have shown that G as described above is an α-representative generator and is consistent for all s t , and thus because α > 0 was chosen arbitrarily, we conclude that (H, A) is generatable in the limit with representation.

D.4. Proof and Discussion of Lemma 4.9

Before providing the proof of Lemma 4.9, we provide some additional discussion of the result and comparison to the positive result of Kleinberg & Mullainathan (2024).

The algorithm of Kleinberg & Mullainathan (2024) that generates in the limit using only a finite number of membership queries at each step crucially relies on the UUS assumption, in particular the fact that finding an element from supp(h) \ x1:t is always possible because |supp(h)| = , and thus one can simply enumerate the elements of X until an unseen point is encountered. In the case of representative generation, however, a generator may need to generate from (supp(h) Ai) \ x1:t, which is not always guaranteed to be infinite, making this problem a lot more difficult. We note that in the case where A partitions X, if we made the strong assumption that supp(h) Ai is either empty or has infinite size for all Ai A and h H, then this would be similar to the assumption made by Kleinberg & Mullainathan (2024), and we could use membership queries to generate with representation using an algorithm almost identical to their membership-query algorithm.

However, the Lemma 4.3 shows that weakening this assumption to allow for groups with finite intersection with hypotheses in H introduces some inherent difficulty. In particular, no algorithm that works even just for very simple, finite pairs of H and A can generate in the limit with representation using only membership queries.

With this in mind, we present the proof below.

Proof of Lemma 4.9. Consider a deterministically computed randomized generator G that at step t, sees examples x1, ..., xt and can make queries of the form x supp(h) or x A1? . Note that because we only have two groups making up the partition, querying about A2 as well can provide no extra information.

Representative Language Generation

Assume for contradiction that at each timestep t, the generator makes a finite number of such membership queries before outputting its generated distribution µt, and satisfies representative generation in the limit.

We examine the execution of this generator and use it to construct a hypothesis h, partition A, and enumeration x1, x2, ... of supp(h) that forces the generator to make infinitely many mistakes.

Let u1, u2, .... be an enumeration of X.

We now imagine running G (this can be run offline prior to execution as the generator is deterministic) and use the run to build up a hypothesis h, an enumeration k1, k2, ... of supp(h), and disjoint groups A1, A2 with A1 A2 = X.

We introduce some variables for keeping track of our constructed example:

A dictionary H : X { 1, 0, 1} that keeps track of the values assigned for the language h thus far, with H(x) = 1 if h(x) = 1, H(x) = 0 if h(x) = 0, and H(x) = 1 if the value has not been assigned. We initialize H(x) = 1 for all x X.

A dictionary A : X { 1, 1, 2} that keeps track of the group membership of each x (with 1 and 2 responding to A1 and A2, respectively, and 1 if x has not yet been assigned). We initialize A(x) = 1 for all x X.

A list K that will keep track of the enumeration of elements. We begin with K = {}.

A queue Q, to which we can insert elements and pull them off the queue in a first-in, first-out manner.

Now, at each timestep t = 1, 2, ..., we perform three stages of actions. The first is to add to the enumeration, the second is to answer the queries of the generator, and the last is to handle the generator s outputted distribution µt for that timestep.

Stage 1: Adding to the enumeration. If t is odd or Q is empty, we find the smallest i 1 such that A(ui) = 1. Such a ui is guaranteed to exist as u1, ... is an enumeration of X, which starts with infinitely many ui with A(ui) = 1, and at each step we will only fix the membership of a finite number of such ui. Having found this ui, we set A(ui) = 1, H(ui) = 1, and append it to the enumeration K, so set kt = ui.

Otherwise, if t is even and Q is non-empty, we pull an element x off of the queue, and append it to K, thus setting kt = x.

Stage 2: Answering G s queries. We now run the generator upon seeing k1, ..., kt, and answer each membership query as follows.

Every time the generator asks x supp(h)? for some x, if H(x) = 1, and the membership has already been fixed, we output the correct membership value. Otherwise, if H(x) = 1, we answer yes, set H(x) = 1, A(x) = 2, and add it to Q.

Every time the generator asks x A1?, if A(x) = 1, we provide the correct answer that has already been fixed. Otherwise, we set A(x) = 2, H(x) = 1, and add it to Q. Note that if A(x) = 1, we necessarily have H(x) = 1.

Stage 3: Handling G s output. After a finite number of queries, G is guaranteed by assumption to output a distribution µt. We make no assumptions about the representation of µt, and thus it could have infinite support.

We perform the following actions depending on the contents of supp(µt):

If there exists an x supp(µt) with H(x) = 0 or x {x1, ..., xt}, we do nothing and end execution for the timestep.

If there exists an x supp(µt) with H(x) = 1, i.e. x has not yet been queried by the generator, we set H(x) = 0, A(x) = 2, and finish execution for timestep t.

Otherwise, note that every x supp(µt) has H(x) = 1, A(x) = 2 by definition of our construction.

We repeat these three steps for each timestep. This completes the definition of the construction.

We now verify the necessary facts for this H, enumeration, and group partition to be well-defined. In particular, H = {h} must be UUS and K must be a valid enumeration of supp(h). Clearly, each x can be assigned only one of A(x) = 1 or A(x) = 2, so we have a valid partition.

Representative Language Generation

We first consider the UUS property of H. We consider two cases. First, if the generator makes a finite number of queries at each time step as promised, then {x X : H(x) = 1} will have infinite size at all timesteps, and so at every timestep we are able to find an unseen x X with H(x) = 1 to add to the enumeration, or pull one off the queue. Because this can be repeated indefinitely, supp(h) will be infinite. On the other hand, if at some finite timestep the generator makes an infinite number of queries, either an infinite number of the queries are on elements with H(x) = 1 or A(x) = 1, in which case each of these elements get set to H(x) = 1 resulting in an infinite support, or there are only a finite number of such queries, and thus the remaining un-queried portion of X must be infinite, and we can set all of it to H(x) = 1 to again get an infinite support. Thus, in all cases supp(h) satisfies the UUS property.

Lastly, we show that K is a valid enumeration. First, consider the case where at some timestep t, the generator makes an infinite number of queries. We can follow the assignment of H above to construct a UUS H, and then append any enumeration of the remaining unseen elements with H(x) = 1 to K to obtain a valid enumeration.

Otherwise, we can assume that at every step, the generator makes a finite number of queries. Clearly by definition of the construction, only elements with H(x) = 1 are added to the enumeration, meaning that S i N{ki} supp(h). Thus, it remains to show that the K covers all of supp(h). Consider any x supp(h), i.e. an x X such that there exists some finite timestep at which H(x) is set to 1. We will show that there exists some j N such that kj = x. Note that because supp(h) X and u1, ... is an enumeration of X, there exists some i N such that ui = x. At timestep 2i 1, we have three possibilities.

k2i 1 = ui.

k2i 1 = ui, but there exists a j < 2i 1 such that kj = ui.

k2i 1 = ui, and ui has not appeared earlier in the enumeration.

If we are in either of the first conditions, we are done because we have guaranteed that there exists a j N such that kj = x. Note that in the third case, this can only happen because ui is currently in the queue. Because at each previous step the generator made a finite number of membership queries, the position of ui in the queue must be some finite s N. Thus, because even timestep outputs an element from the queue if it is nonempty, we are guaranteed that k2i 1+2s 1 = ui, and thus in all cases there exists a j N such that kj = x, and thus K is a valid enumeration of supp(h).

Construction of Contradiction. We finally consider what happens when the generator is run on enumeration K with group and hypothesis membership defined by A and H as above. Consider any timestep t, where the generator outputs a distribution µt.

We consider the three cases considered in Stage 3. Note that in both of the first two cases, the generator must violate consistency, as there exists an x supp(µt) with x supp(h) \ {x1, ..., xt}.

Otherwise, the only other possibility is given by the third case, which guarantees that every x supp(µt) has A(x) = 2, and thus x A2. This means that the generator is not representative of the data sequence thus far, because by definition of the enumeration, x1:t|A(1) 1/2 while µt|A(1) = 0, and thus x1:t|A µt|A 1/2 > α.

Thus, the generator fails to generate consistently with group constraints at every timestep. We thus conclude that the assumption was false, and at some iteration the generator must make an infinite number of membership queries.