# contextures_representations_from_contexts__67881f2f.pdf

Contextures: Representations from Contexts

Runtian Zhai 1 Kai Yang 2 Burak Varıcı 1 Che-Ping Tsai 1 Zico Kolter 1 Pradeep Ravikumar 1

Despite the empirical success of foundation models, we do not have a systematic characterization of the representations that these models learn. In this paper, we establish the contexture theory. It shows that a large class of representation learning methods can be characterized as learning from the association between the input and a context variable. Specifically, we show that many popular methods aim to approximate the top-d singular functions of the expectation operator induced by the context, in which case we say that the representation learns the contexture. We demonstrate the generality of the contexture theory by proving that representation learning within various learning paradigms supervised, self-supervised, and manifold learning can all be studied from such a perspective. We prove that representations that learn the contexture are optimal on those tasks that are compatible with the context. One important implication of our theory is that once the model is large enough to approximate the top singular functions, scaling up the model size yields diminishing returns, so further improvement requires better contexts. To this end, we study how to evaluate a context without knowing the downstream tasks. We propose a metric and show by experiments that it correlates well with the actual performance of the encoder on many real datasets.

1. Introduction

Representation learning underpins the modern deep learning revolution, leading up to the remarkable recent successes of foundation models (Bommasani et al., 2021). But a critical question that has remained unanswered to a satisfactory extent is: why are these models learning anything

1Carnegie Mellon University, Pittsburgh, PA, USA 2Peking University, Beijing, China. Correspondence to: Runtian Zhai <rzhai@alumni.cmu.edu>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

useful, or perhaps even what representations are these models learning? Unlike classical statistical learning theory, where there is no mystery regarding the statistical estimand, the very target itself is unclear in representation learning. For example, what are the representations that BERT (Devlin et al., 2019) trained to do cloze tests is learning, and why are they useful in understanding the sentiment of user reviews on Netflix? What representations do deep neural networks learn, and why are they useful if they cause neural collapse (Papyan et al., 2020), where deep representations could collapse to a few clusters? Do the many different self-supervised learning methods (Balestriero et al., 2023) all learn similar or disparate representations?

The responses to these questions are often muddled. Many analyses are conflated with the mystery of deep learning generalization the ability of large neural networks to learn function approximations that generalize to unseen points. However, this is a different problem from the mechanism of representation learning. Our focus is on what representation learning (or pretraining in the context of foundation models) aims to capture, and why it can be applied to tasks completely different from the objectives used to train the representations. Another way this question is muddled is by recourse to scaling. A popular viewpoint argues that even if the encoder performs poorly on one task, increasing the model size while keeping everything else the same could allow better performance to emerge (Wei et al., 2022). However, substantial evidence suggests that certain abilities cannot emerge solely from scaling. Additional training signals, such as alignment (Ouyang et al., 2022), are necessary.

The above questions are naturally interesting to learning theorists, but why should the broader machine learning community care about understanding the mechanism of representation learning, if empirical success seems to be always achievable with existing approaches by scaling up the model size, an empirical observation known as scaling laws (Kaplan et al., 2020)? This is because sustainable success or progress is not always guaranteed. Ilya Sutskever recently remarked that pretraining as we know it will end (Sutskever, 2024), largely because the current pretraining paradigm is producing diminishing returns. Understanding what representations are learned by foundation models is crucial for designing future generations of pretraining methods, and this is how this field can make scientific progress.

Contextures: Representations from Contexts

In this work, we establish the contexture theory, which provides a unified lens for inspecting a large class of representation learning methods. The central argument of this theory is that representations are learned from the association between the input X and a context variable A. This framework is general enough to encompass a wide variety of learning paradigms, as we demonstrate in Section 3.

Now, suppose we are given a context variable A along with X, how should we learn the representation? In Section 4 we prove that the optimal method is to approximate the top singular functions of the expectation operator induced by the context, in which case we say that the encoder learns the contexture . Such an encoder is optimal as long as the task is compatible with the context, and in Section 4 we define a quantitative measurement of such compatibility.

Our theory implies that the main consequence of enlarging the model is that the learned representation space will be brought closer to the span of the top-d singular functions of the expectation operator, which we empirically verify in Section 4.2. Once the two spaces are close enough, further scaling will yield diminishing returns. We envision that future breakthroughs in pretraining require context scaling, where better contexts are learned from data, not heuristics.

In Section 5 we study how to evaluate contexts. This is a prerequisite for context scaling because if we cannot even determine which contexts are good, then we cannot create better contexts. The key takeaway is that for a context to be useful (meaning that it can lead to good representations), the association between X and A should be moderate neither too strong nor too weak. For example, if A = X, then their association is the strongest; if A is independent of X, then their association is the weakest. However, neither context is useful because A does not provide additional information about X. In Section 5.2, we propose a quantitative measurement of context usefulness that can be efficiently estimated and does not require knowledge of the downstream task. We also empirically verify that the metric correlates with the performance of the encoder on many real datasets.

In one sentence, our key contribution is clarity on the target of a large class of representation learning methods the singular functions of the expectation operator. We do not discuss the numerical aspect of approximating these functions, which requires an expressive model architecture and a good optimizer, and such analyses are left to future work.

1.1. Examples of Representation Learning Methods

Supervised learning is the simplest way to learn representations. For example, neural networks pretrained on Image Net (Russakovsky et al., 2015) were very popular in the early days of the deep learning boom (Huh et al., 2016). Specifically, one uses the output of an intermediate layer,

typically the one before the last linear layer, as the representation. However, it has never been fully explained why the penultimate layer works so well across disparate tasks.

Self-supervised learning (SSL) is currently the most common way of learning representations. There are two main types of SSL: multi-view learning and masked autoencoders. Multi-view learning includes contrastive learning (Oord et al., 2018; Chen et al., 2020) and non-contrastive learning (Grill et al., 2020; Zbontar et al., 2021; Bardes et al., 2022). Masked autoencoders have wide applications, including language (Devlin et al., 2019; Radford et al., 2019), vision (He et al., 2022), videos (Gupta et al., 2023), and more.

Manifold learning is a classical method that aims to capture the geometry of the data. Examples such as locally linear embedding (LLE) (Roweis & Saul, 2000) and Laplacian eigenmaps (Belkin & Niyogi, 2003) formulate manifold learning as node representation learning on a graph, where connected nodes should have similar embeddings.

2. Definitions and Examples

Let X be the input space. Pretraining aims to learn a feature encoder Φ : X Rd. We call Φ(x) the embedding of x, and d the embedding dimension. Let PX be the data distribution. In this work, we assume PX to be fixed.

The central argument of the contexture theory is that representations are learned from the association between two random variables: the input X X and a context variable A A. A is called the context space. Let P +(x, a) be the joint distribution of X and A, with marginal distributions PX and PA. Let L2(PX ) be the L2 function space w.r.t. PX , with inner product f1, f2 PX = EX PX [f1(X)f2(X)] and

norm f PX = f, f 1/2 PX . Define L2(PA), , PA, PA for PA similarly.

2.1. Examples of Contexts

1. Labels are a common type of context. They can take different forms, such as discrete categories in classification, continuous values in regression, or text captions of images. Labels may be obtained from human annotators or in pseudo-forms, such as clusters or teacher models. Typically, labels are provided as compatible pairs sampled from the joint distribution P +(x, a).

2. Random transformations generate different views of the same data point. Common transformations include adding random noise to inputs, as seen in diffusion models and denoising autoencoders, or randomly corrupting/masking inputs, as in Sim CLR and masked autoencoder. These transformations are typically defined by domain experts and are specified through a predefined conditional distribution P +(a | x).

Contextures: Representations from Contexts

3. Graphs provide locality information about the inputs. The edge weights represent the similarity between two inputs. In this case, we have A = X with the conditional distribution P +(a | x) proportional to the edge values between x and a. See Section 3.3.

4. Features are predefined or pretrained mappings from X to a vector space, which we denote by a = Ω(x). This encompasses teacher models that provide stochastic pretrained feature encoders. In contrast to previous instances, here P +(a | x) is directly described via a reparameterization or structural equation for a in terms of x.

2.2. Induced Kernels and the Expectation Operator

The joint distribution P + fully determines the association between X and A. It induces the positive-pair kernel (Johnson et al., 2023) and the dual kernel (Zhai et al., 2024).

Definition 2.1. The positive-pair kernel k+ A and its dual kernel k+ X are defined as

k+ A(a, a ) = P +(a, a ) PA(a)PA(a ) =

R P +(a|x)P +(a |x)d PX (x)

PA(a)PA(a ) ;

k+ X(x, x ) = P +(x, x ) PX (x)PX (x ) =

R P +(x|a)P +(x |a)d PA(a)

PX (x)PX (x ) .

The kernels are density ratios between joints and marginals for A and X, respectively. They capture how more likely (a, a ) or (x, x ) appear together than independently given P +. P + also induces the following expectation operator. Intuitively, given a function g L2(PA), the operator computes the expectation of g(A) conditioned on any x.

Definition 2.2. The expectation operator TP + : L2(PA) L2(PX ) is defined as for all g L2(PA),

(TP +g)(x) = Z g(a)P +(a | x)da = E[g(A) | x].

Its adjoint operator T P + : L2(PX ) L2(PA), which satisfies f, TP +g PX = T P +f, g PA ( f L2(PX ), g L2(PA)), is given by T P +f (a) = R f(x) P +(a | x)PX (x)

PA(a) dx = E[f(X) | a].

Now we discuss the spectral properties of these operators. Define the kernel integral operator Tk+ A : L2(PA)

L2(PA) as (Tk+ Ag)(a) = R g(a )k+ A(a, a )d PA(a ). Define the other operator Tk+ X : L2(PX ) L2(PX ) similarly. It is easy to see that Tk+ A = T P +TP +, and Tk+ X = TP +T P +.

We call λ R an eigenvalue of Tk+ A with eigenfunction ν L2(PA), if Tk+ Aν = λν. Suppose Tk+ A is a Hilbert-Schmidt integral operator. Then, we can order its eigenvalues by 1 = λ0 λ1 0, and the corresponding eigenfunctions ν0, ν1, form an orthonormal basis (ONB) of L2(PA).

Here λi 1 because of Jensen s inequality, and ν0 1 is always an eigenfunction of Tk+ A with λ0 = 1. Similarly, we can order the eigenvalues of Tk+ X by 1 = κ0 κ1 0, and the eigenfunctions µ0, µ1, form an ONB of L2(PX ), where µ0 1. We also have the following result.

Lemma 2.3 (Duality property, Zhai et al. (2024), Proposition 1). For all i, we have λi = κi [0, 1]. And if λi > 0,

then we have µi = λ 1

2 i TP +νi, and νi = λ 1

2 i T P +µi.

We call si = λ

1 2 i a singular value of TP +, associated with left singular function µi L2(PX ) and right singular function νi L2(PA). Since µ0 1 and ν0 1, all other µi (νi) must have zero mean as they are orthogonal to µ0 (ν0). Now, we can spectrally decompose P + as follows.

Lemma 2.4. The spectral decomposition of P+ is P +(x, a) = P

i siµi(x)νi(a)PX (x)PA(a).

Proof. i, D P +(x,a) PX (x)PA(a), νi E

PA = R P +(a | x)νi(a)da =

(TP +νi)(x) = λ

1 2 i µi (x) = siµi(x). Since (νi)i 0 is an

ONB, we have P +(x,a) PX (x)PA(a) = P i=0 siµi(x)νi(a).

The first key result of the contexture theory is that the optimal d-dimensional representation should recover the linear space spanned by the top-d singular functions µ1, , µd. We say that such a representation learns the contexture. Note that the constant function µ0 1 is excluded, as it does not need to be learned there is no benefit in allocating a dimension to encode something already universally present.

Definition 2.5. A d-dimensional encoder Φ = [ϕ1, , ϕd] learns the contexture of P +, if there exists a set of top-d singular functions {µ1, , µd} of TP +, such that span{ϕ1, , ϕd} = span{µ1, , µd}. We also say that Φ extracts the top-d eigenspace of Tk+ X.

If the multiplicity of sd is more than 1, then Φ recovering the span of any top-d singular functions suffices. The intuition why learning the contexture is ideal is that such a representation keeps the most variance of the context, analogous to principal component analysis (PCA) in the finitedimensional case. Suppose X and A are both finite sets. Let N = |X| and M = |A|. Then, a function f L2(PX ) is a vector in RN, g L2(PA) is a vector in RM, and TP + is essentially a matrix T RN M. The goal is learning a d-dimensional embedding E RN d for the N samples in X. PCA states that we should use the top-d left singular vectors of T as E, or the top-d eigenvectors of T T , because they maximize the explained variance. Similarly, functional spaces are essentially infinite-dimensional vector spaces, so the d-dimensional encoder that preserves the most variance of TP + consists of the top-d left singular functions of TP +, or the top-d eigenfunctions of Tk+ X = TP +T P +.

Contextures: Representations from Contexts

3. Learning the Contexture

In this section, we show that every example method in Section 1.1 does one of the following:

(i) Extracts the top-d eigenspace of Tk+ X = TP +T P + (learns the contexture of P +), that is, recovering the span of µ1, , µd (excluding µ0 1);

(ii) Extracts the top-d eigenspace of TP +ΛT P +, where Λ is the integral operator of a kernel kΛ(a, a ), that is (Λg)(a) = R g(a )kΛ(a, a )d PA(a ). kΛ is called the loss kernel, which is defined by the loss function used in the objective. Since the constant function is not necessarily the top eigenfunction of TP +ΛT P +, in this case, we do not exclude any eigenfunction.

Notation: For f L2(PX ), denote its mean by f = EPX [f(X)], and its centered version by f = f f. The same notation is used for multi-dim functions and random variables, when the distribution is clear from context. The covariance matrix of Φ : X Rd, denoted by Cov PX [Φ],

is a d d matrix C where C[i, j] = D ϕi, ϕj E

3.1. Supervised Learning: Label Context

Let A be the label of X. The mean squared error (MSE) is

R(Φ) = min W ,b E (X,A) P +

h W Φ(X) + b A 2 2 i , (1)

where Φ(X) is the output of the layer before the last linear layer in a neural network, and b denotes the bias. If b can be an arbitrary vector, then the linear layer is biased; if b = 0 is fixed, then the linear layer is unbiased. For classification where A is one-hot, we have the following result. Theorem 3.1 (Proof in Appendix A.1). Let A be a onehot random vector. Suppose the linear layer is unbiased, that is b = 0. Then, Φ minimizes R(Φ) if and only if it extracts the top-d eigenspace of TP +ΛT P +, where kΛ(a, a ) = I[a = a ], or (Λg)(a) = g(a)PA(a). If all classes have the same size, then the top-d eigenfunctions of TP +ΛT P + and TP +T P + are the same.

This theorem works for randomized labels, where each x can belong to multiple classes with certain probabilities. Kernel kΛ stems from class imbalance; it puts more weights on larger classes. Indeed, in practice, smaller classes are harder to learn. To get rid of Λ, we can use the class-balanced risk (also known as importance weighting (Shimodaira, 2000)):

Rbal(Φ) = min W ,b E (X,A) P +

" W Φ(X) + b A 2 2 p

Theorem 3.2 (Proof in Appendix A.2). Under the setting of Thm. 3.1, let the linear layer be biased. Then, Φ minimizes Rbal(Φ) if and only if it learns the contexture of P +.

Interestingly, this result can partially explain neural collapse. Papyan et al. (2020) empirically showed that when there are d classes of equal sizes and the label A is deterministic, a sufficiently trained deep representation will collapse to an equiangular tight frame (ETF) ϕ1, , ϕd, which are defined as ϕi(x) = c(I[x belongs to class i] d 1) for all i [d] and some constant c. The span of ϕ1, , ϕd is the same as the top-d eigenspace of TP +ΛT P +. However, it cannot explain why ϕ1, , ϕd converge to the exact functions as above it only proves that they will span the same space. To prove this, one needs to analyze the training dynamics of the specific optimizer, such as gradient-based methods, while our results are independent of the optimizer.

When the classes have different sizes, the dual kernel of TP +ΛT P + is k+ X(x, x ) = I[x and x have the same label]. This is equivalent to the simplex-encoded labels interpolation (SELI) defined by Thrampoulidis et al. (2022, Definition 2), which generalizes neural collapse.

For a regression task where A is an arbitrary Euclidean vector, using the same objective Eqn. (1), a similar result can be proved. See Appendix A.3.

3.2. Self-supervised Learning: Transformation Context

Two major types of SSL are multi-view learning and denoising autoencoders. Multi-view learning independently samples two views A, A+ from P +( |x) for every x. A+ is called a positive sample of A. Then, one trains Ψ : A Rd

such that Ψ(A) Ψ(A+). This Ψ is an encoder on A instead of X, so at downstream we need to convert Ψ(a) to Φ(x), which is typically done via the average encoder:

Φ(x) = (TP +Ψ)(x) = Z Ψ(a)d P +(a | x).

By Lemma 2.3, we have the following corollary. Corollary 3.3. Let sd > 0. The average encoder Φ spanning the span of the left top-d singular functions of TP + is equivalent to Ψ spanning the span of the right top-d singular functions of TP +.

Enforcing Ψ(A) Ψ(A+) alone leads to the degenerate solution where Ψ gives the same embedding to all a. This is called the feature collapse problem. There are two solutions: contrastive learning and non-contrastive learning. Prior work by Hao Chen et al. (2021); Johnson et al. (2023); Zhai et al. (2024) showed that the spectral contrastive loss LC and non-contrastive loss LN can learn the contexture of P +. Let A+ be a positive sample of A, and A be a negative sample drawn from PA independently. Define

LC = E D Ψ(A), Ψ(A+) E + 1

D Ψ(A), Ψ(A ) E2 ;

LN = E h Ψ(A) Ψ(A+) 2 2

i s.t. Cov PA[Ψ] = I,

Contextures: Representations from Contexts

where the (i, j)-th entry of Cov PA[Ψ] is ψi, ψj PA, i, j [d]. Minimizing LN is a constrained optimization problem. The constraint Cov PA[Ψ] = I is called the orthonormality constraint. It makes sure that Ψ must be rank-d, so that it cannot be a constant function on A.

Theorem 3.4 (Proof in Appendix A.4). Ψ minimizes LC or LN if and only if Φ = TP + Ψ learns the contexture.

For denoising autoencoders, a similar result can be proved. See Appendix A.5.

3.3. Node Representation Learning: Graph Context

Let G = (V, E) be an undirected graph, where edge (u, v) has a weight w(u, v) such that w(u, v) = w(v, u) 0. Let the degree of node u be d(u) = P

v V w(u, v), and dsum = P

v d(v). Let PX (u) = d(u)

dsum be a distribution on

V, and Pw(u, v) = w(u,v)

dsum be a distribution on E. Define

P +(v|u) = w(u,v)

d(u) . The following problem with a similar orthonormality constraint learns the contexture of P +.

minimize Φ:X Rd 1 2E(u,v) Pw h Φ(u) Φ(v) 2 2 i

s.t. Cov PX [Φ] = I. (2)

Theorem 3.5 (Proof in Appendix A.6). Let Φ be any solution to Eqn. (2) (so that for any constant c, Φ + c is also a solution). Then, Φ learns the contexture of P +.

4. Optimality of the Contexture

So far, we have shown that commonly used learning objectives can learn the contexture. Now, we address the question of why and when learning the contexture is optimal. Arguably no representations can be good for all downstream tasks, but we show that a feature encoder that learns the contexture is optimal for the class of tasks that are compatible with the context. This provides a quantitative characterization of when a task respects the human prior knowledge the context incorporates. Interestingly, as we detail in the sequel, this has intriguing implications for scaling laws.

4.1. Compatibility

The ultimate evaluation of an encoder is its performance on relevant downstream tasks. Most downstream tasks, such as prediction, clustering, and segmentation, can be associated with a target function f L2(PX ). For example, multiclass classification can be associated with multiple one-vsall labeling functions. Moreover, if we are fitting a linear predictor on top of Φ, then the mean and variance of f do not matter because we can change the weight and bias of the predictor accordingly. Thus, we can assume that f is normalized, that is, it has zero mean and unit variance.

We say that a context P + and a task f are compatible, if P + can help us learn a good predictor for f . Formally, consider the scenario where we have a corrupted training set {(ai, yi)}, where ai P +( | xi) and yi = f (xi). That is, we cannot see the original samples xi, but can only see the corrupted samples ai. To learn a predictor on this training set, we can train a predictor ˆg : A Y, and then use ˆf = E[ˆg(A) | x]. At test time, given input x, we can draw a P +( | x) and then output the average of g (a). For this procedure to work, two conditions are necessary:

There exists g L2(PA) s.t. f (x) = E[g (A) | x].

This g has a low Var[g (A) | x] on average over x.

If Var[g (A) | x] is high, then g (a) will be far away from y = f (x), so fitting ˆg on (a, y) will not work. Based on these insights, we define compatibility as follows.

Definition 4.1. The compatibility with P + of any non-zero f L2(PX ) is defined as

ρ(f, P +) = max g L2(PA),g =0

D f, TP +g E

PX f PX g PA [0, 1].

For further insight, let f = P

i uiµi and g = P

Then, ρ(f, P +) = max vi

i 1 siuivi P

i v2 i = r P

i 1 s2 i u2 i P

i 1 u2 i by Cauchy-Schwarz inequality (the optimal vi satisfy v0 = 0 and vi siui for i 1). For any ϵ > 0, we define the class of (1 ϵ)-compatible tasks as

Fϵ(P +) = f L2(PX ) : E[f] = 0, ρ(f, P +) 1 ϵ .

This class of tasks satisfies the two conditions, i.e. we can find a g with low variance Var[g (A) | x]:

Theorem 4.2 (Proof in Appendix B.1). For any f Fϵ(P +), there exists a g L2(PA) such that f (x) = E[g (A) | x], and g satisfies

E X PX E A,A P +( | X)

h (g (A) g (A ))2i 4ϵ g 2 PA.

Now that we have a class of tasks compatible with P +, we evaluate Φ by its worst-case approximation error on Fϵ(P +). The most common way to evaluate Φ is to fit a linear predictor on top, also called a linear probe, which is the focus of our attention (other methods for using Φ include fitting a small neural network on top, using a kernel method, or using KNN). Specifically, the worst-case approximation error of Φ on F L2(PX ) is the maximum error of the optimal linear probe in estimating any function in F. In this work, we focus on the L2 error.

Contextures: Representations from Contexts

Definition 4.3. Let F L2(PX ) be a function class where f F αf F for all α R. The worst-case approximation error of Φ : X Rd on F is defined as

err(Φ; F) = max f F(P +), f PX =1 err(Φ, f),

where err(Φ, f) = min w Rd, b R

w Φ + b f 2

The following key result shows that the Φ that minimizes err(Φ; Fϵ(P +)) over all d-dimensional encoders must recover the linear space spanned by the µ1, , µd. Here µ0 is excluded since the bias b in the linear predictor implicitly contains µ0.

Theorem 4.4 (Proof in Appendix B.2). Suppose 1 s1

s2 1+s2 2 2 . For any d, among all Φ = [ϕ1, , ϕd] where ϕi L2(PX ) , Φ minimizes err(Φ; Fϵ(P +)) if and only if it learns the contexture of TP +. The error is given by

min Φ:X Rd, ϕi L2(PX ) err Φ; Fϵ(P +) = s2 1 (1 ϵ)2

s2 1 s2 d+1 .

Conversely, for any d-dimensional encoder Φ and any ϵ > 0, there exists f L2(PX ) such that ρ(f, P +) = 1 ϵ, and err(Φ, f) s2 1 (1 ϵ)2

s2 1 s2 d+1 .

This result has two parts. First, we show that if f is compatible (f Fϵ(P +)), the optimal encoder achieves low error on f . Second, we ask what if f is incompatible. We cannot claim that no Φ works for f if one knows f a priori, then one can set ϕ1 = f to achieve zero error. Instead, we show that for any Φ, there exists an f with the same compatibility as f for which Φ performs poorly. Therefore, compatibility reflects whether a context is suitable for a task.

Evaluating an arbitrary encoder. The above result bounds the approximation error of the encoder that learns the contexture. We can also bound the approximation error of an arbitrary encoder. See Appendix C.

4.2. Implications for Neural Scaling Laws

Scaling laws (Kaplan et al., 2020) state that the performance of large neural networks grows with their size. Moreover, models of different architectures learn highly aligned representations when scaled up. Huh et al. (2024) thus proposed the platonic representation hypothesis that scaling makes representations more aligned with an underlying reality, though they did not formally define this reality.

The contexture theory provides a new perspective on the role of scaling. The function class from which the feature encoders Φ are trained is a subset of L2(PX ), and as the model gets larger, the class approaches L2(PX ). This suggests that scaling brings the learned representation closer

512 1024 2048

Figure 1. Alignment between the learned representation and the top-d eigenfunctions of Tk+ X on the abalone dataset. Solid curves: CCA. Dashed curves: mutual KNN. Depth here means the number of hidden layers.

to the span of the top-d singular functions of TP +, explaining why big models learn aligned representations. The key difference is that contexts are designed by humans and thus are more subjective than the so-called underlying reality .

We substantiate this extrapolation with an experiment on the abalone dataset from Open ML. We use KNN (K = 30) as context, where A = X, and P + maps x to one of its K nearest neighbors equiprobably. We compare two ddimensional representations with d = 128 learned in the following two ways: (i) Kernel PCA to obtain the exact topd eigenfunctions of Tk+ X; (ii) non-contrastive learning (LN in Theorem 3.4) implemented with VICReg (Bardes et al., 2022). For (ii), we use a fully-connected neural network with Tanh activation, skip connections, and Adam W optimizer (Kingma & Ba, 2015; Loshchilov & Hutter, 2017). We use the same number of training epochs for every model. For each width and depth, we run the experiments 15 times with different random initializations, and report the average alignment. See Appendix E for more details.

We measure the alignment between the two representations using the canonical-correlation analysis (CCA) metric R2 CCA, and the mutual KNN metric with 10 neighbors like Huh et al. (2024). We center and whiten the representations (making the covariance identity) when using mutual KNN. CCA is invariant to all invertible linear transformations on Φ, which is ideal because such transformations do not affect the performance of the downstream linear probe, since one can adjust W and b of the linear probe accordingly. We do not use linear CKA proposed by Kornblith et al. (2019) because it is only invariant to orthogonal transformations.

Figure 1 plots the alignment between the exact top-d eigenfunctions and the learned deep representation while varying the depth and width of the neural network. When they are chosen optimally, the CCA can be as high as 0.9, and the

Contextures: Representations from Contexts

mutual KNN can be higher than 0.8. Note that these alignment metric values are very high. For example, in Huh et al. (2024), the mutual KNN metric value is usually below 0.2. Hence, the representation learned by the neural network is highly aligned with the top-d eigenfunctions.

The top plot studies neural networks with increasing widths. We observe that when the neural network is relatively narrow, increasing the width improves alignment. However, once the neural network is sufficiently wide, further increasing the width may have a negative effect. For example, when the depth is 3, the alignment is the highest when the width is 512, and the alignment becomes lower when the network is wider than 512. Since increasing the width can only make the function class of Φ larger, this phenomenon is not due to the expressivity of the neural network. We hypothesize that it arises from optimization difficulty, that is larger models are harder to train effectively. Consequently, with the same number of pretraining steps, a larger model will be farther away from the minima, leading to a reduced alignment.

The bottom plot studies neural networks with increasing depths, and a similar trend is observed. When the network is shallow, increasing the depth improves the alignment. However, once the network is sufficiently deep, further increasing the depth may have a negative effect.

In summary, we draw two conclusions from this experiment: (i) the representation learned by a large neural network is highly aligned with the top-d eigenfunctions; (ii) once the neural network is large enough, further increasing its size does not increase the alignment. Hence, we argue that once the model is large enough such that Φ is already highly aligned with the top-d eigenfunctions, further increasing the model size inevitably yields diminishing returns.

5. Context Evaluation

How to create better contexts is a challenging problem. In this section, we take a first step by studying when a context is useful and how to efficiently evaluate its usefulness. The key result is that the usefulness of a context is largely determined by the association level between X and A, and a useful context should have a moderate association. The association level affects the decay rate of the singular values. We propose a usefulness metric that only depends on the singular values. Then, we empirically verify that this metric has a strong correlation with the actual performance of the encoder on many real datasets. As such, the proposed metric can help practitioners to select among various pretraining methods or hyperparameter settings efficiently.

5.1. The Effect of Context Association

A useful context should provide sufficient training signals that are easy for the model to capture. If the association

between the X and A of a context is too weak, then the signals will be insufficient. If the association is too strong, then capturing the signals will be too hard. The association level affects the spectrum of the context the stronger the association, the slower the decay of the singular values.

Case 1: Weak association. Consider the extreme case where A is independent of X. This context is clearly useless because it provides no information. In this case, only the trivial singular function µ0 1 has a positive singular value; all the other singular values are 0. When X and A are nearly independent, k+ X(x, x ) is very close to 1, which causes the singular values to decay too fast. Formally, we have:

Lemma 5.1 (Proof in Appendix G.1). When |k+ X(x, x ) 1| < ϵ for all x, x X, we have P i>0 s2 i < ϵ.

In Appendix F, we empirically verify that low association leads to a small |k+ X(x, x ) 1| for all x, x . In such a scenario, Fϵ(P +) is a very small set, so very few tasks are compatible with and can benefit from the context.

Case 2: Strong association. The context A = X is useless. Contexts whose singular values decay too slow are bad because (i) for pretraining, there are non-smooth singular functions with large singular values, which are hard to learn; (ii) for downstream, a larger d is needed, as more singular functions have non-trivial contributions, and it leads to a higher sample complexity. In Appendix F, we empirically verify that kernel k+ X has a high Lipschitz constant when the association is strong, meaning that the kernel is non-smooth and thus the singular functions are non-smooth.

5.2. Task-agnostic Evaluation of Contexts

A good measurement of context usefulness should be taskagnostic, because we would like the pretrained encoder to be transferable to a variety of tasks, which we might not know at pretrain time. Note that for any task-agnostic metric, one can adversarially create a task for which the metric fails, so there is no universal task-agnostic metric. However, a metric can still be very useful if it provides guidance for most real tasks. To this end, we propose the following metric:

τd = 1 1 s2 d+1 + β Pd i=1 s2 i Pd0 i=1 s2 i , τ = min d τd, (3)

where β > 0 is a parameter, and d0 is the maximum embedding dimension we consider. Typically d0 ranges from 512 to 8192. We choose β = 1 and d0 = 512 in our experiments. τd is a proxy of the prediction error when the embedding dimension is d. Thus, the d that minimizes τd can be viewed as the predicted optimal embedding dimension, and τ evaluates the context when d is chosen optimally.

Metric derivation. Let the target function be f = f0 + f1, where f0, f1 PX = 0, f1 is compatible with the

Contextures: Representations from Contexts

0 100 200 0

0 100 200 0

0 100 200 0

0 100 200 0.4 0.5 0.6 0.7 0.8

(a) Weak association

0 100 200 0.4 0.5 0.6 0.7 0.8

(b) Moderate

0 100 200 0.4 0.5 0.6 0.7 0.8

Figure 2. Metric illustration on abalone. Top row: context spectra. Bottom row: black solid curves are τd divided by 6; red dashed curves are the actual downstream prediction error. We divide τd by 6 to fit it in the same plot.

0 100 200 0

0 100 200 0

0 100 200 0

0 100 200 0.2 0.4 0.6 0.8

(a) Weak association

0 100 200 0.2 0.4 0.6 0.8

(b) Moderate

0 100 200 0.2 0.4 0.6 0.8

Figure 3. Metric illustration on MNIST, similar to Figure 2.

context, and f0 is not compatible with the context. The prediction error can then be decomposed into (i) the approximation error of f1; (ii) the approximation error of f0; (iii) the estimation error. Theorem 4.4 bounds component (i) by s2 1 (1 ϵ)2

s2 1 s2 d+1 . In practice, s1 is usually very close to 1, and we simplify this bound to the first term of Eqn. (3), up to a constant factor. For component (ii), stronger associations imply that more tasks are compatible with the context, reducing this approximation error. Thus, this component should be negatively correlated with Pd0 i=1 s2 i . Component (iii), the estimation error, increases with stronger associations, since higher association typically requires a larger d, and thus greater sample complexity. Based on the results in Zhai et al. (2024), this component can be essentially understood as positively correlated with Pd i=1 s2 i . The second term in Eqn. (3) combines the contributions from components (ii) and (iii), and is designed to be bounded by 1. This metric can be efficiently estimated. It only requires the top-d0 eigenfunctions of Tk+ X, which can be estimated in O(m3) time using a random subset of m = Θ(d0 log d0) samples. See Appendix D for details.

We now conduct an experiment that examines τd on two datasets. First, we use the abalone dataset and KNN as

0 50 100 150 0 0.1 0.2 0.3 0.4 0.5

Cosine similarity

(a) abalone

0 50 100 150 0 0.1 0.2 0.3 0.4 0.5

(b) MNIST Figure 4. Comparison of the downstream task between abalone and MNIST. The y-axis is the cosine similarity between the downstream task and the i-th eigenfunction.

the context, similar to Section 4.2. We adjust the association level by changing K: K = 150 (weak), K = 30 (moderate) and K = 5 (strong). We obtain the exact eigenvalues and eigenfunctions of Tk+ X using kernel PCA. In Figure 2, we plot the spectra of the three contexts in the top row. Then, in the bottom row, we compare τd against the prediction error of the linear probe under different d. We can see that when the association is weak or moderate, τd first decreases and then increases, which tracks the actual error. However, when the association is too strong, τd monotonically decreases with d, and it cannot track the actual error.

Second, we use the MNIST dataset. The context is random cropping with crop ratio α. We adjust the association level by changing α: α = 0.5 (weak), α = 0.2 (moderate) and α = 0.05 (strong). Since kernel PCA is not scalable to datasets as large as MNIST, we instead train a Le Net (Le Cun et al., 1998) using the non-contrastive learning objective (LN in Theorem 3.4) and the Adam W optimizer. Then, we estimate the top eigenvalues using the method in Appendix D. The downstream task is a binary classification task whether the digit is greater than 4. After pretraining, a linear probe is fit on top of Φ using ridge regression. The result is plotted in Figure 3.

Unlike abalone, on MNIST the downstream error monotonically decreases with d. This disparity stems from the difference between the two downstream tasks. To demonstrate this, in Figure 4 we plot the cosine similarity between the target function f and the estimated i-th eigenfunction on the two datasets. We can see that the variance of f on abalone is mostly concentrated on the top-5 eigenfunctions, with the first cosine similarity being almost 0.5. In contrast, the variance of f on MNIST is more scattered, and the cosine similarity is still close to 0.1 for the 150-th eigenfunction. Consequently, having a large d on abalone will have a little impact on the approximation error but will increase the estimation error significantly. On the other hand, having a larger d on MNIST will decrease the approximation error more than it increases the estimation error, which is why the total error monotonically decreases with d. The takeaway from this experiment is that, while a context with moderate association is generally good, its effectiveness ultimately depends on the specific downstream task.

Contextures: Representations from Contexts

The implication is that no evaluation metric would universally work for all contexts and downstream tasks. However, a metric would still be useful if it correlates well with the actual error in most scenarios, and thus can provide insights into choosing the right context and the right hyperparameters, such as the mask or crop ratio.

5.3. Empirical Verification

Now we examine if our metric correlates with the encoder s performance on real datasets. In practice, the performance is influenced by many factors. To create a setting where all factors but the context are controlled, we let the encoder be the exact top-d singular functions obtained by kernel PCA.

Each dataset is randomly split into a pretrain set, a downstream labeled set, and a test set. The downstream linear predictor is fit via ridge regression. Hyperparameter grid search is conducted at both encoder learning and downstream stages. The evaluation metric is the mean squared error. Let errd be the actual prediction error when Φ is ddimensional. We test d up to d0 = 512. Let d be the one that minimizes errd. We use four types of contexts:

RBF kernels: k(x, a) = exp( γ x a 2). We define P + as P +(a | x) k(x, a) for each x.

KNN: P +(a | x) = K 1 if a is a KNN of x, else 0.

RBFmask: First, randomly mask 20% of the features, and then apply RBF kernels to the other features. Specifically, we randomly draw 50 masks, and use the average of P + over all masks as the context.

KNNmask: 20% random masking and then apply KNN.

For each of these contexts, A = X. For each type, we use 35 contexts by adjusting γ for RBF kernels and K for KNN. By doing so, we adjust the association level between X and A. We make sure that contexts in every type range from very weak to very strong association. We do not use masking alone because its dual kernel is hard to estimate.

In Table 1 we report the correlation between τ and errd over all 140 contexts from the 4 types on 28 classification (Cls) and regression (Reg) datasets from Open ML (Vanschoren et al., 2013) widely used in machine learning research. The most common metric is the Pearson correlation, but it can only detect linear correlations, while the correlation between τ and errd is not necessarily linear. Thus, we also report the distance correlation (Sz ekely et al., 2007) that can detect non-linear correlations, but it cannot tell if the correlation is positive or negative because it is always non-negative.

The median reported in the table shows that on more than half of the datasets, there is a Pearson correlation of over 0.5, which is in general considered a strong correlation. The distance correlation is even higher. As expected, the metric

Table 1. Correlation between τ and errd on all 4 types of contexts (clipped). Full results reported in Table 3 in Appendix G.

Dataset Size ( ) #Feat. Type Pearson Dist.

diabetes 768 8 Cls 0.737 0.740 Moneyball 1232 14 Reg 0.680 0.650 yeast 1269 8 Cls 0.221 0.256 splice 3190 60 Cls 0.831 0.801 abalone 4177 8 Reg 0.028 0.470 mushroom 8124 22 Cls 0.185 0.340 pumadyn32nh 8192 32 Reg 0.938 0.961 Speed Dating 8378 120 Cls 0.590 0.656 grid stability 10000 12 Reg 0.925 0.911 brazilian houses 10692 9 Reg -0.290 0.563 fifa 19178 28 Reg -0.349 0.663 kings county 21613 21 Reg 0.842 0.882 cps88wages 28155 6 Reg 0.250 0.479

Mean on 28 datasets 0.431 0.611 Median on 28 datasets 0.587 0.659

does not work on all datasets. For example, the Pearson correlation is very negative on brazilian houses and fifa. To understand the failure modes, in Appendix G we do more analysis on the datasets where our metric fails.

6. Conclusion

We advance the science of representation learning by articulating the target of representation learning the top singular functions of a particular operator induced by a context, which we term contextures. We further prove that such a representation is optimal because it minimizes the worstcase approximation on the class of tasks compatible with the context. We show that most representation learning approaches could be cast as estimating contextures, and empirically verified that large neural networks can learn the top singular functions well. We further analyze when a context can be useful, relating that to its spectrum, and proposed a task-agnostic usefulness metric that correlates well with the encoder s performance on real datasets.

Our analysis has three limitations, which lead to three open problems. First, our analysis focused on the minimizers of the objectives. However, Cohen et al. (2021) showed that deep models trained by popular gradient methods do not find the minimizers, but instead oscillate around the edge of stability. The open problem is how this phenomenon affects our results. Second, we did not discuss the impact of the inductive bias of the model architecture, such as the translation invariance of convolutional neural networks. Such inductive biases can affect the context and, therefore, the encoder. We pose how to integrate the effect of these biases into our theory as an open problem. Third, our theory assumes that PX is fixed. In practice, however, there is always a data distribution shift from upstream to downstream. Refining our theory to handle such distribution shifts is an exciting direction for future work.

Contextures: Representations from Contexts

Acknowledgements

We thank Rattana Pukdee, Hugo Contant, Chenhao Zhang and Zihao Ye for their feedback on this paper. Kai Yang did this work at Carnegie Mellon University, under the PKUCMU undergraduate research program. We acknowledge the support of NSF via IIS-2211907, ONR via N0001423-1-2368, AFRL via FA8750-23-2-1015, and DARPA via HR00112020006.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

Code and Data Availability

The code for this paper can be found at https: //colab.research.google.com/drive/ 1Gd J0Yn-PKi Kfk ZIw Uuon3Wp Tpb NWEt AO?usp= sharing. The data can be downloaded from Open ML.

Balestriero, R., Ibrahim, M., Sobal, V., Morcos, A. S., Shekhar, S., Goldstein, T., Bordes, F., Bardes, A., Mialon, G., Tian, Y., Schwarzschild, A., Wilson, A. G., Geiping, J., Garrido, Q., Fernandez, P., Bar, A., Pirsiavash, H., Le Cun, Y., and Goldblum, M. A cookbook of selfsupervised learning. arxiv:2304.12210, 2023.

Bardes, A., Ponce, J., and Le Cun, Y. VICReg: Varianceinvariance-covariance regularization for self-supervised learning. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=xm6YD62D1Ub.

Baudat, G. and Anouar, F. Generalized discriminant analysis using a kernel approach. Neural computation, 12(10): 2385 2404, 2000.

Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373 1396, 2003.

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. ar Xiv:2108.07258, 2021.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In Proc. International Conference on Machine Learning, July 2020.

Cohen, J., Kaur, S., Li, Y., Kolter, J. Z., and Talwalkar, A. Gradient descent on neural networks typically occurs at the edge of stability. In International Conference on Learning Representations, 2021. URL https: //openreview.net/forum?id=jh-r Ttvk Ge M.

Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, Volume 1 (Long and Short Papers), Minneapolis, MN, June 2019. Association for Computational Linguistics.

Grill, J.-B., Strub, F., Altch e, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., Piot, B., Kavukcuoglu, K., Munos, R., and Valko, M. Bootstrap your own latent - a new approach to self-supervised learning. In Proc. Advances in Neural Information Processing Systems, December 2020.

Gupta, A., Wu, J., Deng, J., and Li, F.-F. Siamese masked autoencoders. In Advances in Neural Information Processing Systems, New Orleans, LA, December 2023.

Hao Chen, J. Z., Wei, C., Gaidon, A., and Ma, T. Provable guarantees for self-supervised deep learning with spectral contrastive loss. In Proc. Advances in Neural Information Processing Systems, December 2021.

He, K., Chen, X., Xie, S., Li, Y., Doll ar, P., and Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, June 2022.

Huh, M., Agrawal, P., and Efros, A. A. What makes imagenet good for transfer learning? ar Xiv:1608.08614, 2016.

Huh, M., Cheung, B., Wang, T., and Isola, P. Position: The platonic representation hypothesis. In Proc. International Conference on Machine Learning, Vienna, Austria, July 2024.

Jing, L., Vincent, P., Le Cun, Y., and Tian, Y. Understanding dimensional collapse in contrastive self-supervised learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/ forum?id=Yevs Q05DEN7.

Johnson, D. D., Hanchi, A. E., and Maddison, C. J. Contrastive learning can find an optimal basis for approximately view-invariant functions. In International Conference on Learning Representations, 2023. URL https: //openreview.net/forum?id=Aj C0KBji Mu.

Contextures: Representations from Contexts

Kaplan, J., Mc Candlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. ar Xiv:2001.08361, 2020.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and Le Cun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http: //arxiv.org/abs/1412.6980.

Kornblith, S., Norouzi, M., Lee, H., and Hinton, G. Similarity of neural network representations revisited. In Proc. International Conference on Machine Learning, Long Beach, CA, June 2019.

Le Cun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998.

Liu, Q., Lu, H., and Ma, S. Improving kernel fisher discriminant analysis for face recognition. IEEE transactions on circuits and systems for video technology, 14(1):42 49, 2004.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. ar Xiv preprint ar Xiv:1711.05101, 2017.

Mika, S., Ratsch, G., Weston, J., Scholkopf, B., and Mullers, K.-R. Fisher discriminant analysis with kernels. In Neural networks for signal processing IX: Proceedings of the 1999 IEEE signal processing society workshop (cat. no. 98th8468), pp. 41 48. Ieee, 1999.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. ar Xiv:1807.03748, 2018.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. In Proc. Advances in Neural Information Processing Systems, New Orleans, LA, December 2022.

Paninski, L. Estimation of entropy and mutual information. Neural computation, 15(6):1191 1253, 2003.

Papyan, V., Han, X., and Donoho, D. L. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652 24663, 2020.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. Open AI blog, 1(8):9, 2019.

Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290 (5500):2323 2326, 2000.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115: 211 252, 2015.

Shawe-Taylor, J., Williams, C. K., Cristianini, N., and Kandola, J. On the eigenspectrum of the gram matrix and the generalization error of kernel-PCA. IEEE Transactions on Information Theory, 51(7):2510 2522, 2005.

Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227 244, 2000.

Sutskever, I. Test of time award talk: Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 2024.

Sz ekely, G. J., Rizzo, M. L., and Bakirov, N. K. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769 2794, 2007. ISSN 00905364. URL http://www.jstor.org/ stable/25464608.

Thrampoulidis, C., Kini, G. R., Vakilian, V., and Behnia, T. Imbalance trouble: Revisiting neural-collapse geometry. In Proc. Advances in Neural Information Processing Systems, New Orleans, LA, December 2022.

Vanschoren, J., van Rijn, J. N., Bischl, B., and Torgo, L. Openml: Networked science in machine learning. SIGKDD Explorations, 15(2):49 60, 2013. doi: 10.1145/2641190.2641198. URL http://doi.acm. org/10.1145/2641190.2641198.

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., and Fedus, W. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https:// openreview.net/forum?id=yzk SU5zdw D. Survey Certification.

Yarotsky, D. Optimal approximation of continuous functions by very deep relu networks. In Conference on Learning Theory, pp. 639 649, Stockholm, Sweden, July 2018.

Zbontar, J., Jing, L., Misra, I., Le Cun, Y., and Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. In Proc. International Conference on Machine Learning, pp. 12310 12320, July 2021.

Contextures: Representations from Contexts

Zhai, R., Liu, B., Risteski, A., Kolter, Z., and Ravikumar, P. Understanding augmentation-based self-supervised representation learning via rkhs approximation and regression. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum? id=Ax2y Rh CQr1.

Contextures: Representations from Contexts

A. Proofs for Section 3

A.1. Proof of Theorem 3.1

Theorem 3.1. Let A be a one-hot random vector. Suppose the linear layer is unbiased, that is b = 0. Then, Φ minimizes R(Φ) if and only if it extracts the top-d eigenspace of TP +ΛT P +, where kΛ(a, a ) = I[a = a ], or (Λg)(a) = g(a)PA(a). If all classes have the same size, then the top-d eigenfunctions of TP +ΛT P + and TP +T P + are the same.

The following lemma will be very useful in the proof.

Lemma A.1. TP +ΛT P + is the integral kernel operator of the following kernel

k(x, x ) = ZZ kΛ(a, a )P +(a|x)P +(a |x )dada .

Proof. By definition, we have

(T P +h)(a ) = Z h(x )P +(x |a )dx .

Thus we can get

(ΛT P +h)(a) = Z (T P +h)(a )kΛ(a, a )PA(a )da

= ZZ h(x )P +(x |a )kΛ(a, a )PA(a )dx da

= ZZ h(x )P +(a |x )kΛ(a, a )PX (x )dx da .

This implies that

(TP +ΛT P +h)(x) = Z (ΛT P +h)(a)P +(a|x)da

= ZZZ h(x )kΛ(a, a )P +(a|x)P +(a |x )PX (x )dada dx

= Z h(x )k(x, x )PX (x )dx ,

as desired.

Then, we finish the proof of Theorem 3.1.

Proof. For any fixed Φ, define

R(Φ, W ) = EP + h A W Φ(X) 2 2 i = E X PX E A P +( |X)

h A W Φ(X) 2 2 i .

Assuming, without loss of generality, that EX PX [ΦiΦj] = δij; otherwise one can perform Gram-Schmidt process on Φi and change the value of W respectively. Thus, it amounts to minimize

R(Φ, W ) = E X PX E A P +( |X)

h A W Φ(X) 2 2 i

= E X PX W Φ(X) 2 2 2 E X PX E A P +( |X) A, W Φ(X) + E A PA A 2 2

= W 2 F 2 E X PX E A P +( |X) A, W Φ(X) + E A PA A 2 2.

Denote W = (wij)1 i d A,1 j d. We have

R wij = 2wij 2 E X PX E A P +( |X) [AiΦj(X)],

Contextures: Representations from Contexts

which implies that for a fixed Φ, the optimal W that minimizes this loss should satisfy

wij = E X PX E A P +( |X) [AiΦj(X)].

Combining the minimizer of W with R and notice that EA PA A 2 2 is a constant, it suffices to maximize

E X PX E A P +( |X) AiΦj(X) 2

j Φj(x1)Φj(x2) a1, a2 PX (x1)P +(a1|x1)PX (x2)P +(a2|x2)dx1da1dx2da2

j Φj(x1)Φj(x2)ˆk(x1, x2)PX (x1)PX (x2)dx1dx2,

ˆk(x1, x2) = ZZ a1, a2 P +(a1|x1)P +(a2|x2)da1da2 (4)

= ZZ I[a1 = a2]P +(a1|x1)P +(a2|x2)da1da2.

Thus Φ is a minimizer of R(Φ) if Φ extracts the top-d eigenfunctions of ˆk(x1, x2). Combining with Lemma A.1 yields that kΛ(a, a ) = I[a = a ]. Furthermore, we have (Λg)(a) = R g(a )kΛ(a, a )d PA(a ) = g(a)PA(a), as desired.

If all classes have the same size, we have PA(a) c (0, 1) where c is a constant. Thus (Λg)(a) = g(a)PA(a) = cg(a), which implies that TP +ΛT P + = c TP +T P +. This concludes that TP +ΛT P + and TP +T P + share the same top-d eigenfunctions.

A.2. Proof of Theorem 3.2

Theorem 3.2. Under the setting of Theorem 3.1, let the linear layer be biased. Then, Φ minimizes Rbal(Φ) if and only if it learns the contexture of P +.

Proof. For any fixed Φ, define

R(Φ, W ) = EP +

PA(A) A W Φ(X) 2 2

= E X PX E A P +( |X)

PA(A) A W Φ(X) 2 2

Assuming, without loss of generality,

E X PX E A P +( |X)

otherwise we can perform Gram-Schmidt process on Φi and change the value of W respectively. Thus it suffices to minimize

R(Φ, W ) = E X PX E A P +( |X)

PA(A) A W Φ(X) 2 2

= E X PX E A P +( |X)

PA(A) W Φ(X) 2 2

2 E X PX E A P +( |X)

PA(A) , W Φ(X)

= W 2 F 2 E X PX E A P +( |X)

PA(A) , W Φ(X)

Contextures: Representations from Contexts

Denote W = (wij)1 i d A,1 j d. We have

R wij = 2wij 2 E X PX E A P +( |X)

PA(A) Φj(X)

which implies that for a fixed Φ, the minimizer of W satisfies

wij = E X PX E A P +( |X)

PA(A) Φj(X)

Combining the minimizer of W with R, it suffices to maximize

E X PX E A P +( |X) Ai p

PA(A) Φj(X)

j Φj(x1)Φj(x2)ˆk(x1, x2)PX (x1)PX (x2)dx1dx2,

where ˆk(x1, x2) = ZZ a1, a2 p

PA(a1)PA(a2) P +(a1|x1)P +(a2|x2)da1da2

= ZZ I[a1 = a2] p

PA(a1)PA(a2) P +(a1|x1)P +(a2|x2)da1da2

= Z P +(a|x1)P +(a|x2)

Thus, Φ is a minimizer of R(Φ) if Φ extracts the top-d eigenfunctions of ˆk(x1, x2). Combining with Definitions 2.1 and 2.5 yields the desired results.

A.3. Result for Regression

For regression where A is an arbitrary Euclidean vector, using the same objective as Eqn. (1), we can prove the following result.

Theorem A.2. Φ minimizes Eqn. (1) if and only if Φ extracts the top-d eigenspace of TP +ΛT P +. If the linear layer is unbiased (b = 0), then kΛ(a, a ) = a, a ; if it is biased (b can be arbitrary), then kΛ(a, a ) = a, a .

Remark A.3. Kernel kΛ(a, a ) = a, a is called the linear kernel on A, and kΛ(a, a ) = a, a is called the centered linear kernel w.r.t. PA. Theorem 3.1 is a special case of Theorem A.2.

Proof. For the unbiased linear model, the proof is similar to that of Theorem 3.1. Combining Eqn. (4) and Lemma A.1 yields the desired result.

Next, we consider a biased linear model. For a variable z, we denote z = E[Z], and z = z E[Z] as its centered version.

For any fixed Φ, define R(Φ, W , b) = E X PX E A P +( |X)

h A W Φ(X) b 2 2 i

= E X PX E A P +( |X)

h A W Φ(X) b 2 2 i

= E X PX E A P +( |X)

A W Φ(X) ˆb 2

= E X PX E A P +( |X)

where ˆb = W EX PX [Φ(X)] EA PA[A]+b. Thus, for any fixed Φ, W , the optimal b = EA PA[A] W EX PX [Φ(X)].

Assuming, without loss of generality, EX PX [ Φi Φj] = δij; otherwise we can perform Gram-Schmidt process on Φi and change the value of W respectively. Thus, it suffices to minimize

Contextures: Representations from Contexts

ˆR(Φ, W ) = E X PX E A P +( |X)

2 2 E X PX E A P +( |X)

D Y , W Φ(X) E + E A PA

= W 2 F 2 E X PX E A P +( |X)

D A, W Φ(X) E + E A PA

Denote W = (wij)1 i dy,1 j d. We have

ˆR wij = 2wij 2 E X PX E A P +( |X)

h Ai Φj(X) i ,

which implies that for a fixed Φ, the minimizer of W satisfies

wij = E X PX E A P +( |X)

h Ai Φj(X) i .

Combining the minimizer of W with ˆR and notice that EA PA A 2

2 is a constant, it suffices to maximize

E X PX E A P +( |X) Ai Φj(X) 2

j Φj(x1) Φj(x2) a1, a2 PX (x1)P +(a1|x1)PX (x2)P +(a2|x2)dx1da1dx2da2

j Φj(x1) Φj(x2)ˆk(x1, x2)PX (x1)PX (x2)dx1dx2,

where ˆk(x1, x2) = ZZ a1, a2 P +(a1|x1)P +(a2|x2)da1da2.

Notice that ZZ ˆk(x1, x2)PX (x1)PX (x2)dx1dx2 = Z a1, a2 P +(x1, a1)P +(x2, a2)dx1da1dx2da2

= Z a1, a2 PA(a1)PA(a2)da1da2 = 0,

thus Φ is a minimizer of R(Φ) if Φ extracts the top-d eigenfunctions of ˆk(x1, x2). Combining with Lemma A.1 yields the desired results.

A.4. Proof of Theorem 3.4

Theorem 3.4. Ψ minimizes LC or LN if and only if Φ = TP + Ψ learns the contexture.

Proof. (i) The spectral contrastive loss is

LC(Ψ) = E X PX E A,A+ P +( |X)

D Ψ(A), Ψ(A+) E + 1

D Ψ(A), Ψ(A ) E2 .

Suppose ψi = P

j 0 cijνj where νj is the ONB of L2(PA) in Lemma 2.3. Since νj is the ONB of L2(PA) and ν0 1, we can get for j 1, EPA[νj(a)] = δ0,j = 0. Thus we can get ψi = ψi E[ψi] = P

Contextures: Representations from Contexts

Denote matrix C = (cij)1 i d,j 1, matrix B = (bij) := C C, and matrix D = diag(s2 1, s2 2, ) where si is the singular value of TP +. We have

E X PX E A,A+ P +( |X)

h D Ψ(A), Ψ(A+) Ei

= ZZZ D Ψ(a), Ψ(a+) E P +(a|x)P +(a+|x)PX (x)dxdada+

= Z Z Ψ(a)P +(a|x)dy, Z Ψ(a+)P +(a+|x)da+ p(x)dx

= Z D TP + Ψ(x), TP + Ψ(x) E p(x)dx = TP + Ψ 2 PX

i s2 i bii;

D Ψ(A), Ψ(A ) E2 = ZZ " d X

i=1 ψi(a) ψi(a )

d PA(a)d PA(a )

Z ψi(a) ψj(a)d PA(a) 2

Thus, we have

i s2 i bii + 1

i,j b2 ij = B D 2 F D 2 F .

So if suffices to minimize B D 2 F where rank(B) d. By Eckart-Young-Mirsky Theorem, we know the minimizer of B is B = diag(s2 1, , s2 d). Thus the minimizer of C should be C = Udiag(s1, , sd) where U Rd d is an orthonormal matrix. This indicates the minimizer Ψ extracts the top-d singular functions of TP +, and Φ learns the contexture of P +.

(ii) The non-contrastive loss is

LN(Ψ) = E X PX E A,A+ P +( |X)

h D Ψ(A), Ψ(A+) Ei ;

L N(Ψ) = E X PX E A,A+ P +( |X)

h Ψ(A) Ψ(A+) 2 2

where Cov PA[Ψ] = I. Since for any Ψ,

L N(Ψ) LN(Ψ) = 2 E A PA

is a constant, thus Ψ minimizes LN(Ψ) Ψ minimizes L N(Ψ).

Suppose ψi = P

j 0 cijνj where νj is the ONB of L2(PA) in Lemma 2.4. Since EPA[νj(a)] = δ0,j, we can get ψi = ψi E[ψi] = P

We now consider the minimizer of LN(Ψ). By the calculation in (i), we obtain

LN(Ψ) = E X PX E A,A+ P +( |X)

h D Ψ(A), Ψ(A+) Ei = TP + Ψ 2 PX = X

i s2 i bii.

By EPA h ψi ψj i = δij, we have X

i,j c2 ij = d.

Contextures: Representations from Contexts

Since νi is an ONB of L2(PA), ψ1, , ψd are orthogonal, we have

j=1 c2 ji =

D ψj, νi E2

PA νi 2 PA = 1. (5)

Thus, we conclude that

i=1 s2 i (1 bii) X

i>d s2 i bii

i=1 s2 d(1 bii) X

i>d s2 dbii = 0,

which implies that LN(Ψ) Pd i=1 s2 i . To attain equality, we will have bii = 1 for i = 1, , d, and bii = 0 for i d + 1. By Eqn. (5), we can know Ψ extracts the span of ν1, , νd, indicating that Ψ extracts the top-d singular functions of TP + and Φ learns the contexture of P +.

A.5. Result for Denoising Autoencoders

For denoising autoencoders, suppose X Rd X. Then, consider minimizing the following objective:

R(Ψ) = min W Rd X d, b Rd X E (X,A) P +

h W Ψ(A) + b X 2 2 i . (6)

Theorem A.4. Let Ψ be any minimizer of Eqn. (6). Then, Ψ extracts the top-d eigenspace of T P +ΛTP +, where Λ is the integral operator of kΛ(x, x ) = x, x if b can be an arbitrary vector, or kΛ(x, x ) = x, x if b = 0. Consequently, Φ = TP + Ψ extracts the top-d eigenspace of TP +T P +Λ.

Proof. The proof is the same as Theorem A.2.

A.6. Proof of Theorem 3.5

Theorem 3.5. Let Φ be any solution to Eqn. (2) (so that for any constant c, Φ + c is also a solution). Then, Φ learns the contexture of P +.

Proof. Without loss of generality, suppose Φ = 0. We have

(TP +f)(u) = X

v f(v)w(u, v)

d(u) ; TP +f, g PX = X

u,v f(u)g(v)w(u, v)

dsum = f, TP +g PX ,

which implies that TP + is self-adjoint. Therefore, the eigenfunctions of TP + are the same as those of T P +TP +, with square root eigenvalues.

For the objective of Eqn. (2), we have

1 2E(u,v) Pw h Φ(u) Φ(v) 2 2 i = E (u,v) Pw

h Φ(u) 2 2 Φ(u), Φ(v) i

ϕi 2 PX ϕi, TP +ϕi PX

i=1 ϕi, TP +ϕi PX .

Note that (u, v) and (v, u) can be drawn from Pw with equal probability. We conclude that Φ extracts the top-d eigenfunctions of TP +, which are the same as the top-d eigenfunctions of T P +TP +, or the top-d singular functions of TP +. This implies that Φ learns the contexture of TP +.

Contextures: Representations from Contexts

B. Proofs for Section 4

B.1. Proof of Theorem 4.2

Theorem 4.2. For any f Fϵ(P +), there exists a g L2(PA) such that f (x) = E[g (A) | x], and g satisfies

E X PX E A,A P +( | X)

h (g (A) g (A ))2i 4ϵ g 2 PA. (7)

Proof. Let g = P siuiνi. We have already explained that if f Fϵ(P +), i.e., E[f ] = 0 and ρ(f , P +) 1 ϵ, then it must satisfy the condition w.r.t. g : f , TP +g PX f PX g PA 1 ϵ.

For Eqn. (7), we have P(A |A = a) = R P +(A |X = x)P +(x|A = a)dx, where P +(x|a) = P +(a|x)PX (x)

PA(a) by Bayes rule. Then, using Definition 2.1 we have P(A |A = a) = k+ A(a, a )PA(a ), which implies that

EX PX EA,A P +( |X)[g (A)g (A )] = EA PAEA P ( |A)[g (A)g (A )]

g (A) Z g (a )P(a |A)da = EA

g (A) Z g (a )k+ A(a, a )PA(a )da = D g , Tk+ Ag E

Since Tk+ Ag = T P +TP +g = P s3 i uiνi, Eqn. (7) is equivalent to P(s2 i s4 i )u2 i 2ϵ P s2 i u2 i . Meanwhile, we have P s2 i u2 i (1 ϵ)2 P u2 i (1 2ϵ) P u2 i . By Cauchy-Schwarz inequality, we have (P s4 i u2 i )(P u2 i ) (P s2 i u2 i )2 (1 2ϵ)(P u2 i )(P s2 i u2 i ), which proves Eqn. (7).

B.2. Proof of Theorem 4.4

Theorem 4.4. Suppose 1 s1 ϵ 1 q

s2 1+s2 2 2 . For any d, among all Φ = [ϕ1, , ϕd] where ϕi L2(PX ) , Φ minimizes err(Φ; Fϵ(P +)) if and only if it learns the contexture of TP +. The error is given by

min Φ:X Rd, ϕi L2(PX ) err Φ; Fϵ(P +) = s2 1 (1 ϵ)2

s2 1 s2 d+1 .

Conversely, for any d-dimensional encoder Φ and any ϵ > 0, there exists f L2(PX ) such that ρ(f, P +) = 1 ϵ, and err(Φ, f) s2 1 (1 ϵ)2

s2 1 s2 d+1 .

Proof. Necessity: Since span(Φ) is at most rank-d, thus there exists f1 span{µ1, , µd+1} with f1 PX = 1 that is orthogonal to span(Φ). Thus there exists f1, f2 span{µ1, , µd+1} with f1 PX = f2 PX = 1, f1 is orthogonal to span(Φ) and f2 span(Φ) (thus f1 f2), and µ1 span{f1, f2}. Suppose µ1 = α1f1 + α2f2 (without loss of generosity, assuming α1, α2 [0, 1]) and denote f0 = α2f1 α1f2. Then f0 PX = 1 and µ1, f0 PX = 0. Since f1, f2 span{µ1, , µd+1}, we have f0 span{µ2, , µd+1} and thus E[f0] = 0.

Consider f = β1µ1 + β2f0 where β2 1 + β2 2 = 1, β1, β2 [0, 1]. Denote f = P

i 1 uiµi, we can get P

i u2 i = 1 and

β2 2 s2 1 (1 ϵ)2

s2 1 s2 d+1 = X

i 1 s2 i u2 i s2 1β2 1 + s2 d+1β2 2 = s2 1 (s2 1 s2 d+1)β2 2 (1 ϵ)2 X

Obviously, f F(P +). We have

f = β1µ1 + β2f0 = β1(α1f1 + α2f2) + β2(α2f1 α1f2) = (α1β1 + α2β2)f1 + (α2β1 α1β2)f2.

By the definition of f1, f2 we can know the approximation error for f is (α1β1 + α2β2)2. We can show F(α1) = α1β1 + α2β2 = α1β1 + p

1 α2 1β2 (α1 [0, 1]) first increases then decreases when β1, β2 [0, 1]. Thus F(α1)2

min{F(0)2, F(1)2} = min{β2 1, β2 2}. Take β2 2 = s2 1 (1 ϵ)2

s2 1 s2 d+1 1

2, we can get for f, the approximation error is always at

least s2 1 (1 ϵ)2

s2 1 s2 d+1 .

Contextures: Representations from Contexts

To attain equality, we must have P

i 1 s2 i u2 i = s2 1β2 1 + s2 d+1β2 2. This implies that f1 = µd+1, indicating that span(ϕ1, , ϕd) = span(µ1, , µd). Thus Φ learns the contexture of TP +.

Furthermore, denote f0 = k2µ2 + + kd+1µd+1 and consider f = β1µ1 + β2f0 where β2 1 + β2 2 = 1, β1, β2 [0, 1]. By the definition of f0 and f, we have f PX = 1 and f = β1µ1 + β2µ2 = β1µ1 + β2k2µ2 + + β2kd+1µd+1. Thus

ρ2(f, P +) = s2 1β2 1 + β2 2

i=2 s2 i k2 i = s2 1

i=2 s2 i k2 i

β2 2 = s2 1 (1 ϵ)2

s2 1 Pd+1 i=2 s2 i k2 i s2 1 (1 ϵ)2

s2 1 s2 2 s2 1 s2 1+s2 2 2 s2 1 s2 2 = 1

we have ρ(f, P +) = 1 ϵ. Similarly, the approximation error for f is

(α1β1 + α2β2)2 min{β2 1, β2 2} = β2 2 = s2 1 (1 ϵ)2

s2 1 Pd+1 i=2 s2 i k2 i s2 1 (1 ϵ)2

s2 1 s2 d+1 .

Sufficiency: For any f F(P +) with f PX = 1 and E[f] = 0, denote f = P

i 1 uiµi where P

i 1 u2 i = 1. Obviously we have (1 ϵ)2 P

i 1 s2 i u2 i 1. Notice that when span(ϕ1, , ϕd) = span(µ1, , µd) since Φ learns the contexture of TP +, the approximation of f will be P

i d+1 u2 i := A. By the given conditions, we have

i 1 s2 i u2 i s2 1

i=1 u2 i + s2 d+1 X

i d+1 u2 i = s2 1 (s2 1 s2 d+1)A,

and this implies that

A = min w Rd, b R

w Φ + b f 2

PX s2 1 (1 ϵ)2

s2 1 s2 d+1 .

When u2 1 = 1 s2 1 (1 ϵ)2

s2 1 s2 d+1 , u2 d+1 = s2 1 (1 ϵ)2

s2 1 s2 d+1 , the equality holds. Thus the approximation error reaches its lower bound when Φ learns the contexture of TP +.

C. Evaluating an Arbitrary Encoder

Given a context that is compatible with the task, the encoder that learns the contexture is optimal. Now, what about an arbitrary encoder Φ? Is it possible to bound its worst-case approximation error on the class of compatible tasks? To derive such a bound, two key objects are necessary: the induced RKHS and the ratio trace. They were originally defined in (Zhai et al., 2024) for self-supervised learning, and here we extend them to a broader scope.

Denote the range of T P + by R(T P +) = T P +f f L2(PX ) .

Definition C.1. The induced RKHS of P +, denoted by HP +, is the Hilbert space R(T P +) with the inner product given by T P +f1, T P +f2

HP + = f1, f2 PX .

An alternative formula is that for any h1, h2 HP + where h1 = P uiνi and h2 = P viνi, there is h1, h2 HP + = P uivi

Proposition C.2. The induced RKHS HP + has the following properties:

(i) k+ A is the reproducing kernel, such that h(a) = h, k+ A(a, )

HP + for all h HP +.

(ii) HP + is isometric to span{µi : si > 0}, which is a subspace of L2(PX ).

(iii) f Fϵ(P +) is equivalent to h = T P +f satisfying the following isometry property:

(1 ϵ) h HP + h PA h HP +. (8)

Contextures: Representations from Contexts

Proof. For any h HP + where h = T P +f and f = P uiµi, we have

h, k+ A(a, )

HP + = DX siuiνi, X s2 i νi(a)νi E

HP + = X siuiνi(a) = h(a),

which proves (i). (ii) is obvious. Regarding (iii), recall that f = P uiµi Fϵ(P +) is equivalent to P

i 1 s2 i u2 i

i 1 u2 i , and this is h PA (1 ϵ) h HP +. Meanwhile, it is obvious that h PA h HP + always

Definition C.3. Define covariance matrices CΦ = Cov PX [Φ], and BΦ = Cov PA[T P +Φ]. If CΦ is invertible, then the ratio trace of Φ w.r.t. P + is defined as RT(Φ; P +) = RT(ϕ1, , ϕd; P +) := Tr(C 1 Φ BΦ); otherwise, let Φ = [ϕi1, , ϕit] be the maximal linearly independent subset of [ϕ1, , ϕd], and define the ratio trace of Φ the same as the ratio trace of Φ .

The ratio trace of any Φ essentially measures how well Φ is aligned with the contexture of P +. Multiplying Φ by any invertible matrix does not change its ratio trace. If Φ learns the contexture, then its ratio trace is s2 1 + + s2 d, which can be easily shown by setting ϕi = µi. In fact, this is the maximum ratio trace of any d-dimensional encoder.

Lemma C.4. Suppose ϕ1, , ϕd are orthonormal and all have zero mean. Then, we have

T P +ϕ1 2 PA + + T P +ϕd 2 PA s2 1 + + s2 d.

Proof. Let ϕi = P

j 1 qijµj for i [d]. Then, Q = (qij) is a matrix with d orthonormal rows and infinitely many columns. It is easy to see that the left-hand side is equal to Tr(QDQ ), where D = diag s2 1, s2 2, . Let qj be the j-th column of Q. For all j [d], there is Pj i=1 q i qi j; and for any j > d, there is Pj i=1 q i qi d. Thus, using Abel transformation, we have

Tr(QDQ ) = Tr(DQ Q) =

j=1 s2 jq j qj =

! s2 j s2 j+1

as desired.

The ratio trace induces a key quantity in the approximation error bound called the trace gap, which reflects the gap between Φ and the top-d singular functions. The larger the trace gap is, the larger the approximation error will be. A simple definition is s2 1 + + s2 d+1 RT(Φ; P +), whose lower bound s2 d+1 can be achieved by the top-d singular functions, the optimal encoder. However, there is an issue with this definition. For example, consider an encoder with d = 1000. It learns the top-10 singular functions, but the other 990 dimensions are complete noise that has zero contribution to RT(Φ; P +). The approximation error of this encoder should be no higher than that of the top-10 singular functions, because adding more dimensions will never make the approximation error higher. However, if d becomes larger and RT(Φ; P +) stays the same, then s2 1 + + s2 d+1 RT(Φ; P +) will become larger, so this quantity does not correlate with the approximation error in this scenario. The following definition fixes this issue. Definition C.5. For any linearly independent f1, , fd L2(PX ), denote F = [f1, , fd ], CF = Cov PX [F], and BF = Cov PA[F]. The trace gap of Φ w.r.t. P + is defined as

TG(Φ; P +) := inf d d inf f1, ,fd s2 1 + + s2 d +1 Tr(C 1 F BF ) .

Obviously, this definition of trace gap is upper bounded by s2 1 + + s2 d+1 RT(Φ; P +). It solves the issue in the previous example because having completely noisy dimensions does not affect the trace gap. The following result bounds the approximation error.

Theorem C.6. Suppose TG(Φ; P +) < s2 1, and ϵ > 1 s1. Then,

err(Φ; Fϵ(P +)) s2 1 (1 ϵ)2 + s1TG(Φ; P +)

s2 1 TG(Φ; P +)2 .

Remark C.7. This bound is fairly tight. If Φ learns the contexture, then by Theorem 4.4 we have err(Φ; Fϵ(P +)) = s2 1 (1 ϵ)2

s2 1 s2 d+1 , and TG(Φ; P +) = sd+1. Compared to this exact formula, the above upper bound only has an extra s1TG(Φ; P +) term in the numerator.

Contextures: Representations from Contexts

Proof. Let f1, , fd be the functions that minimize s2 1 + + s2 d +1 Tr(C 1 F BF ). Without loss of generality, assume that f1, , fd have zero mean and are orthonormal. Let F = span{f1, , fd }, and H = span T P +f1, , T P +fd . For any f Fϵ(P +) with f PX = 1, let h = T P +f HP +, and let f F be the projection of f onto F. Since err(Φ; Fϵ(P +)) is upper bounded by f f F 2 PX , it suffices to show that f f F 2 PX is upper bounded by the right-hand side.

Let α2 = f F 2 PX , and β2 = f f F 2 PX , where α and β are non-negative. Then, α2 + β2 = f 2 PX = 1 = h 2 HP +.

The isometry property says that (1 ϵ)2(α2 + β2) h 2 PA. Let f f F = βf0 where f0 PX = 1. Let h F = T P +h F and h0 = T P +f0. Then, we have h F 2 PA s2 1 f F 2 PX = s2 1α2. Meanwhile, since f0 is orthogonal to f1, , fd ,

by Lemma C.4 we have T P +f0 2 PA + T P +f1 2 PA + + T P +fd 2 PA s2 1 + + s2 d +1, which implies that T P +f0 2 PA s2 1 + + s2 d +1 Tr(C 1 F B 1 F ). Let τ = TG(Φ; P +). Then, we have

h 2 PA = h F + βh0 2 PA h F 2 PA + β2 h0 2 PA + 2β h F PA h0 PA s2 1α2 + τ 2β2 + 2s1ταβ.

Thus, we have (1 ϵ)2(α2 + β2) s2 1α2 + τ 2β2 + 2s1ταβ, which implies that (s2 1 τ 2)β2 [s2 1 (1 ϵ)2](α2 + β2) + 2s1ταβ [s2 1 (1 ϵ)2 + s1τ](α2 + β2), as desired.

Connection to Fisher discriminant analysis. Fisher discriminant analysis (Mika et al., 1999; Baudat & Anouar, 2000; Liu et al., 2004), or more generally linear discriminant analysis (LDA), is a classical method of learning linear classifiers in statistics. Here we show that Fisher discriminant analysis has a strong connection to the contexture theory. Suppose X Rd X . Fisher discriminant analysis defines the following between-class covariance matrix SB Rd X d X and within-class covariance matrix SW Rd X d X :

SB = ZZ n (E[X | A = a1] E[X | A = a2])(E[X | A = a1] E[X | A = a2]) o ;

SW = Z EP + h (X E[X | A = a])(X E[X | A = a]) A = a i d PA(a).

In the original formulation of Fisher discriminant analysis, A is the label of X. Here we extend it to a general context variable. Consider a linear encoder Φ(x) = W x, where W Rd d X . Then, one solves the following optimization problem to find W : maximize W Rd d X J(W ) = Tr h W SBW W SW W 1i s.t. W SW W is invertible.

Here, J(W ) is called the Fisher discriminant. Define Ψ(a) = EP +[W X|A = a]. Then, we can see that

W SBW = ZZ (Ψ(a1) Ψ(a2))(Ψ(a1) Ψ(a2)) d PA(a1)d PA(a2);

W SW W = Z EP + h (Φ(X) Ψ(a))(Φ(X) Ψ(a)) A = a i d PA(a).

Let CΦ = E[ Φ(X) Φ(X) ] and BΦ = E[ Ψ(A) Ψ(A) ]. Then, we have

W SBW = 2 E Ψ(A)Ψ(A) Ψ Ψ = 2E h Ψ(A) Ψ(A) i = 2BΦ;

W SW W = Z EP + Φ(X)Φ(X) Ψ(a)Ψ(a) A = a d PA(a)

= E Φ(X)Φ(X) E Ψ(A)Ψ(A)

= E h Φ(X) Φ(X) i E h Ψ(A) Ψ(A) i = CΦ BΦ.

Therefore, J(W ) = 2 Tr[(CΦ BΦ) 1BΦ], which is very similar to the ratio trace defined in Definition C.3. Recall that an encoder that learns the contexture maximizes the ratio trace. A well-known result is that J(W ) is maximized when W consists of the top-d eigenvectors of S 1 W SB. Hence, Fisher discriminant analysis is almost equivalent to contexture learning under the constraint that the encoder must be linear.

Contextures: Representations from Contexts

0 100 200 0

(a) abalone (n = 4177)

0 100 200 0

(b) fifa (n = 19178)

0 100 200 0

(c) kings county (n = 21613)

Ground truth

m = 300 m = 1000 m = 2000

Figure 5. Estimating the eigenvalues using a random subset of m samples, from n total training samples.

D. Efficient Estimation of the Context Usefulness Metric

To estimate the metric τd defined in Eqn. (3), it suffices to estimate the top-d0 eigenvalues of the context. This can be efficiently done with the following procedure:

(i) Train an encoder Φ whose output dimension is at least d0 to learn the contexture with a random subset of m samples.

(ii) Estimate the covariance matrix CΦ Rd d = Cov PX [Φ] with Monte Carlo.

(iii) Estimate BΦ Rd d, where BΦ[i, j] = D ϕi, Tk+ X ϕj E

PX , with Monte Carlo.

(iv) Solve the generalized eigenvalue problem BΦv = λCΦv. The eigenvalues λ1 λd0 are estimates of s2 1, , s2 d0. Moreover, let Q = [v1, , vd] where (vi) are the orthonormal eigenvectors corresponding to (λi). Let Φ be the normalized version of ΦQ such that each dimension is scaled to unit variance. Then, ϕ 1, , ϕ d0 are estimates of µ1, , µd0.

If we only need to estimate the eigenvalues, then (Shawe-Taylor et al., 2005) showed that for any fixed d, the sum s2 1+ +s2 d can be estimated with low error using Θ(d) i.i.d. samples. By union bound, all s2 1, , s2 d0 can be estimated with low error using m = Θ(d0 log d0) i.i.d. samples. However, if we want to estimate the eigenfunctions as well, then usually we need to use the entire training set.

Let us demonstrate this method on 3 real datasets from Open ML (Vanschoren et al., 2013): abalone, fifa, and kings county. We use KNN with K = 60 as context, where A = X, and P +(x |x) = K 1 if x is a K-nearest neighbor of x and 0 otherwise. For this context, we can exactly compute k+ X, and thus we can obtain the exact eigenvalues (ground truth) using kernel PCA. Meanwhile, we pretrain Φ with one of the variational objectives using a random subset of m samples, and estimate the eigenvalues using the post-hoc approach. Then, we compare the estimation with the ground truth.

We use a 2-layer wide Tanh-activated neural network with embedding dimension d = 512 and hidden dimension 20,000 as Φ. We train the model through non-contrastive learning with the orthonormality constraint implemented by VICReg (Bardes et al., 2022), and Adam W (Kingma & Ba, 2015; Loshchilov & Hutter, 2017) as the optimizer. We vary m and compare the estimated top-d0 eigenvalues with the ground truth, where d0 = 256. The estimated eigenvalues and the ground truth are plotted in Figure 5. From the plots, we observe that the eigenvalues estimated by our estimation method decay faster than the ground truth, even if the full dataset is used. We hypothesize that the main reason is that even though we use a very wide neural network, its function class is still a subset of L2(PX ). Consequently, the inductive bias of the model architecture has an impact on the encoder, and therefore the learned contexture can be viewed as a mixture of the inductive bias and the original KNN context. This mixture causes the eigenvalues to decay faster, which explains the observation in Figure 5. Another reason is related to optimization. Since the model is non-convex, gradient methods cannot find the minima of the objective.

The average estimation error of the top-256 eigenvalues is reported in Table 2. The error is defined as 1 d0 Pd0 i=1 |ˆλi s2 i |, where ˆλi is the estimated eigenvalue. The table shows that when m [600, 1000] [0.5d0 log d0, 0.7d0 log d0], the

Contextures: Representations from Contexts

Dataset m = 100 m = 300 m = 600 m = 1000 m = 2000 Full dataset

abalone 0.157 0.124 0.088 0.104 0.110 0.088 fifa 0.218 0.151 0.137 0.134 0.133 0.131 kings county 0.278 0.264 0.190 0.183 0.177 0.177

Table 2. Average estimation error of the top-256 eigenvalues.

performance is comparable to using the full dataset, which verifies the theoretical result of (Shawe-Taylor et al., 2005). The estimation error is not zero even if the full dataset is used due to the aforementioned reasons. In summary, the post-hoc method can estimate the eigenvalues using a small subset of samples, but the estimated eigenvalues decay faster than the ground truth.

E. Scaling Law Experiment Details

Here we provide a more detailed description of the experiment setting in Section 4.2.

Experiment overview. The purpose of this experiment is to examine whether a large neural network can learn the contexture well, and whether scaling up the model size makes the learned representation more aligned to the top-d eigenfunctions. We compare two encoders. The first encoder is obtained via kernel PCA on the dual kernel, so it consists of the exact top-d eigenfunctions. The second encoder is obtained via training a large neural network to optimize an objective that can learn the contexture. Then, we compute the representational alignment of these two encoders. The most classical metric is the canonical-correlation analysis (CCA) metric R2 CCA, which is invariant under invertible linear transformations to the encoders. (Kornblith et al., 2019) proposed a variant called linear CKA, which is only invariant under orthogonal transformations. In our setting, since we only care about the span of ϕ1, , ϕd, we would like the metric to be invariant under all invertible transformations, which is why we use CCA. In addition, we also use the mutual KNN metric with 10 neighbors proposed by (Huh et al., 2024), which measures the intersection over union (Io U) of nearest neighbors between the two representations. This metric is not invariant under invertible linear transformations, so we whiten the two representations such that their covariance matrices are both identities.

Setup. We use the abalone dataset from Open ML, and split the dataset into a pretrain set, a downstream train set and a downstream test set by 70%-15%-15%. We use K-nearest neighbors (KNN) with K = 30 as the context. The embedding dimension is set to d = 128. For the second encoder, we train a fully-connected neural network with Tanh activation and skip connections for a sufficient number of steps with full-batch Adam W, and vary the depth and width of the network so that we can study their effect on the alignment. Here, depth refers to the number of hidden layers for example, a 2-layer neural network has depth 1. For each width and depth, we run the experiments 15 times with different random initializations and report the average alignment.

In our experiments, we observe the dimension collapse problem (Jing et al., 2022) if we set the output dimension of the neural network to be d, then the rank of the learned representation will usually be less than d, meaning that it can only extract the top-d eigenspace for some d < d. (Jing et al., 2022) proved that the training dynamics of self-supervised learning can cause this problem, that is, a large neural network trained with a gradient method cannot find the exact minima, but will find a low-rank solution instead.

To fix this issue, we set the output dimension of the neural network to be d1 = 512 > d. After we obtain the d1-dimensional encoder, similar to Appendix D we estimate the matrices CΦ and BΦ, and solve the generalized eigenvalue problem BΦv = λCΦv. Let V = [v1, , vd] Rd1 d be the top-d eigenvectors; then, we use ΦV as the d-dimensional representation. In other words, we use the 128 principal components of the 512-dimensional embedding.

F. More on Context Evaluation

In this section, we offer guidance for practitioners on identifying contexts with weak or strong associations with inputs. We then show that both excessively weak and overly strong associations degrade downstream performance and demonstrate that the proposed quantitative measurements accurately capture association strength in our controlled experiments. Moreover, we provide full experimental results that complements Table 1. Finally, we provide proofs for the lemmas in Section 5.1.

Contextures: Representations from Contexts

F.1. Quantitative measurements for level of association

While mutual information captures mutual dependence between random variables, estimating it from samples remains a long-standing challenge as it requires the joint density function to be known (Paninski, 2003). To address this, we propose alternative metrics that are computationally more tractable.

Decay rate (all association): As we mentioned, the decay rate of the singular values (si)i 0 reflects the strength of association. To estimate the decay rate λ, we assume the singular values decay exponentially and fit the regression model s2 i = exp ( λi). When λ is large, it indicates a fast decay rate and context has low association. Conversely, when λ is low, it implies a slow decay and highly associated context.

Expected kernel deviation (weak association): Since the kernel values k+ A(a, a ) can be close to 1 for the contexts with low association, we propose using the expected absolute deviation from 1, i.e. Ex,x PX |k+ X(x, x ) 1| , as a measure to indicate the weak association setting. Given samples x1, , xn X, we use Monte Carlo sampling to approximate it, i.e. 1 n2

j=1 |k+ X(xi, xj) 1|.

Lipschitz constant (strong association): We empirically measure the Lipschitz constant of k+ X for detecting contexts with strong association. Specifically, given samples x1, , xn X, we use the following estimation:

Lk+ X = sup x,y,z X

|k+ X(z, x) k+ X(z, y)| ||x y||2 max 1 i<j n,1 k n |k+ X(xk, xi) k+ X(xk, xj)| ||xi xj||2 .

We note that the first metric requires estimating singular values (s2 i )i 0 and the last two metrics rely on estimating kernel values k+ X(x, x ). For estimating singular values, we employ the same technique as the task agnostic metric τ in Eqn. (3). Estimating kernel values, on the other hand, necessitates training an encode Φ and approximating k+ X(x, x ) as Φ(x) Φ(x ), which may require a large training set. Thus, decay rate measurements are preferred due to their simpler estimation process.

Additionally, for the non-smooth kernel k+ X, we have the following lemma showing that we may have the singular functions could be non-smooth and are difficult to estimate.

Lemma F.1. Let the Lipschitz constant for the positive pair kernel k+ X be Lk+ X = supx1,x2,x3 X |k+ X(x3,x1) k+ X(x3,x2)| ||x1 x2||2 and

the maximum Lipschitz constant of its eigenfunctions be Lµ = maxi 1 supx1,x2 X |µi(x1) µi(x2)|

||x1 x2||2 . Assume that all the eigenfunctions µi are bounded by c, i.e. µi(x) c for all i > 0 and x X. Then we have Lk+ X c Lµ P

Proof. We have

Lk+ X = sup x1,x2,x3 X

|k+ X(x3, x1) k+ X(x3, x2)| x1 x2 2

= sup x1,x2,x3 X

i 0 s2 i µi(x3)µi(x1) P

i 0 s2 i µi(x3)µi(x2)|

= sup x1,x2,x3 X

i 0 s2 i µi(x3)(µi(x1) µi(x2)|

= sup x1,x2,x3 X

i>0 s2 i µi(x3)(µi(x1) µi(x2)|

x1 x2 2 (µ0 1)

sup x1,x2,x3 X

i>0 s2 i µi(x3)|µi(x1) µi(x2)|

i>0 s2 i µi(x3) sup x1,x2 X

|µi(x1) µi(x2)|

i>0 s2 i µi(x3)Lµ

Contextures: Representations from Contexts

sup x3 X c Lµ X

Assume that there exists a universal c that bounds all eigenfunctions. For a highly non-smooth kernel k+ X with high Lipschiz constant Lk+ X, the lemma implies that we have either (i) smooth singular functions with large Lµ and slow decay in singular values with small P

i=1 s2 i , or (ii) non-smooth singular functions with large Lµ and fast decay in singular values with small P i=1 s2 i . For (i), we need a larger d to approximate the kernel well, which leads to a higher downstream sample complexity. For (ii), the function approximation by neural networks becomes more difficult for non-smooth functions (Yarotsky, 2018).

F.1.1. EMPIRICAL VERIFICATION

Setup. We provide empirical evidence showing (1) downstream performance is worse for contexts with weak and strong associations, (2) the proposed quantitative measurements are positively correlated with the association level. To control the level of association, we use RBF kernels and KNN.

For the estimation of kernel value k+ X, we use k+ X(x, x ) = R P +(a|x)P +(a|x )

PA(a) da since P +(a|x) can be efficiently computed for these contexts. The decay rate is estimated using a non-linear least squares approach to fit the regression model. For computing the expected kernel deviation, we utilize the entire training set. To estimate the Lipschitz constant, we restrict the sample size to n = 1000 for computational efficiency. Other experimental setups are the same as in Section 5.3.

Results. Figure 6 and Figure 7 illustrate the relationship between association level and both the linear probe error errd and the decay rate λ for KNN and RBF contexts, respectively. The results show that errd increases at both extremes, with most blue curves exhibiting a U-shape. This suggests that both weak and strong association levels lead to higher errors. Additionally, the red curve indicates that the decay rate λ increases as the association level strengthens, highlighting a strong correlation between association level and spectral decay, which is effectively captured by the estimated decay rate.

We report the relationship between association level and both the expected kernel deviation and the Lipschitz constant Lk+ X in Figure 8 for the KNN context and Figure 9 for the RBF context. The results show that contexts with low association exhibit small expected kernel deviations, while those with high association have large Lipschitz constants. These findings align with our theoretical developments in Section 5.1.

G. More Experiment Details and Results

See Table 3 for the full results.

G.1. Proof of Lemma 5.1

Proof. Define k+ X (x, x ) = k+ X(x, x ) 1 = P

i>0 s2 i µi(x)µi(x ) that is the positive pair kernel without the trivial mode (s0, µ0), where the equality follows the definition of Tk+ X. We also denote the corresponding kernel integral operator as

(Tk+ X g)(x) = R g(x )k+ X (x, x )d PX (x ). Then we have

i>0 s2 i = Tr(Tk+ X ) = Z k+ X (x, x )d PX (x ) < Z ϵd PX (x ) = ϵ,

as desired.

Contextures: Representations from Contexts

Figure 6. Association level vs errd and the decay rate λ for the KNN context on the 28 datasets. A larger K/N indicates a lower association level, while a smaller K/N corresponds to a higher association level.

Figure 7. Association level vs errd and the decay rate λ for the RBF context on the 28 datasets. A larger γ indicates a higher association level, while a smaller γ corresponds to a higher association level.

Contextures: Representations from Contexts

Figure 8. Association level vs the kernel deviation Ex,x PX |k+ X(x, x ) 1| and the Lipschitz constant Lk+ X for the KNN context on the 28 datasets. A larger K/N indicates a lower association level, while a smaller K/N corresponds to a higher association level. We multiply Lk+ X by the input feature dimension dfeat to normalize the L2 distance in the denominator.

Figure 9. Association level vs the kernel deviation Ex,x PX |k+ X(x, x ) 1| and the Lipschitz constant Lk+ X for the RBF context on the 28 datasets. A larger γ indicates a higher association level, while a smaller γ corresponds to a higher association level. We multiply Lk+ X by the input feature dimension dfeat to normalize the L2 distance in the denominator.

Contextures: Representations from Contexts

(a) diabetes

1.2 1.5 1.8

(c) kings-county

1.85 1.9 1.95 2

(d) brazilian houses

RBF KNN RBF Masking KNN Masking

Figure 10. Scatter plots of τ versus errd . Dashed line: Linear fit.

Contextures: Representations from Contexts

Table 3. Correlation between τ and the actual error errd on all 4 types of contexts.

Dataset Size ( ) #Feature Type Pearson Distribution

credit-approval 690 15 Cls 0.583 0.683 breast-w 699 9 Cls 0.072 0.255 diabetes 768 8 Cls 0.737 0.740 solar flare 1066 10 Reg 0.019 0.262 Moneyball 1232 14 Reg 0.680 0.650 yeast 1269 8 Cls 0.221 0.256 cmc 1473 9 Cls 0.867 0.860 Wine 1599 11 Reg -0.084 0.212 scene 2407 299 Cls 0.608 0.685 dna 3186 180 Cls 0.881 0.843 splice 3190 60 Cls 0.831 0.801 kr-vs-kp 3196 36 Cls 0.543 0.512 abalone 4177 8 Reg 0.028 0.470 spambase 4601 57 Cls 0.775 0.858 colleges 7603 44 Reg 0.155 0.387 mushroom 8124 22 Cls 0.185 0.340 kin8nm 8192 8 Reg 0.805 0.760 pumadyn32nh 8192 32 Reg 0.938 0.961 cpu activity 8192 21 Reg 0.709 0.825 Speed Dating 8378 120 Cls 0.590 0.656 grid stability 10000 12 Reg 0.925 0.911 sulfur 10081 6 Reg -0.180 0.487 brazilian houses 10692 9 Reg -0.290 0.563 fifa 19178 28 Reg -0.349 0.663 superconductivity 21263 81 Reg 0.141 0.367 kings county 21613 21 Reg 0.842 0.882 health insurance 22272 11 Reg 0.601 0.749 cps88wages 28155 6 Reg 0.250 0.479

Mean 0.431 0.611 Median 0.587 0.659