# why_finegrained_labels_in_pretraining_benefit_generalization__2233ec12.pdf

Published in Transactions on Machine Learning Research (10/2024)

Why Fine-grained Labels in Pretraining Benefit Generalization?

Guan Zhe Hong hong288@purdue.edu Purdue University

Yin Cui yinc@nvidia.com NVIDIA

Ariel Fuxman afuxman@google.com Google Research

Stanley H. Chan stanchan@purdue.edu Purdue University

Enming Luo enming@google.com Google Research

Reviewed on Open Review: https: // openreview. net/ forum? id= Foj AV72ow K

Recent studies show that pretraining a deep neural network with fine-grained labeled data, followed by fine-tuning on coarse-labeled data for downstream tasks, often yields better generalization than pretraining with coarse-labeled data. While there is ample empirical evidence supporting this, the theoretical justification remains an open problem. This paper addresses this gap by introducing a hierarchical multi-view structure to confine the input data distribution. Under this framework, we prove that: 1) coarse-grained pretraining only allows a neural network to learn the common features well, while 2) fine-grained pretraining helps the network learn the rare features in addition to the common ones, leading to improved accuracy on hard downstream test samples.

1 Introduction

We consider the theory of label granularity in deep learning. By label granularity, we mean a hierarchy of training labels specifying how detailed each label subclass needs to be (See Figure 1).

Having access to different granularity of labels offers us the freedom of training a classifier using a different level of precision. For example, instead of differentiating between dogs and cats, we can train a classifier to differentiate a Poodle dog and a Persian cat. The latter classification task is undoubtedly harder. However, recent studies found that if one uses fine-grained labels to pre-train a backbone, the pre-trained backbone will help the downstream neural networks generalize better (Chen et al., 2018). Vision transformers, for example, are well-known to require pretraining on large datasets with thousands of classes for effective downstream generalization (Dosovitskiy et al., 2021; He et al., 2016; Krizhevsky et al., 2012).

To convince readers who are less familiar with this particular training strategy, we conduct an experiment on Image Net with details described in Appendix A.2 (we also include experiments on i Naturalist 2021 in Appendix A). Our experiment is limited in scale due to its high demand on the computing resources. Figure 2 shows an experiment of pre-training on Image Net21k and fine-tuning the pre-trained network using Image Net1k. The labels used in the Image Net21k is based on Word Net Hierarchy. The downstream task

Work done at Google Research.

Published in Transactions on Machine Learning Research (10/2024)

Appenzeller

Pekinese Golden Retriever

Dog Coarse Label Fine-Grained Label

Pre-training

Fine-tuning

Pre-training

Fine-tuning

Figure 1: The goal of this paper is to provide a theoretical justification of why fine-grained labels in pretraining benefit generalization.

is Image Net1k classification. The x-axis of this plot indicates the number of pre-training classes whereas the y-axis shows the validation accuracy for the Image Net1k classification task. It is evident from the plot that as we increase the number of classes (hence a finer label granularity in pre-training), the downstream classification task s performance is improved.

102 103 104

Label Granularity, i.e., Number of Classes in Pretraining

Image Net1k Validation Accuracy

Baseline Hierarchy (Image Net1k) Fine-grained Hierarchy (Word Net on Image Net21k)

Figure 2: Image Net21k Image Net1k transfer using a Vi T-B/16 model. [Blue]: pretrained on the Word Net hierarchy of Image Net21k, finetuned on Image Net1k. [Red]: baseline, trained and evaluated on Image Net1k.

1.1 Goal of this paper

The above experimental finding may sound familiar to practitioners who frequently train large models. In fact, experimental evidence on this subject is abundant (Mahajan et al., 2018; Singh et al., 2022; Yan et al., 2020; Shnarch et al., 2022; Juan et al., 2020; Yang et al., 2021; Chen et al., 2018; Ridnik et al., 2021; Son et al., 2023; Ngiam et al., 2018; Cui et al., 2018; 2019a). However, the theoretical explanation remains an open problem. Our goal in this paper is to provide a theoretical justification. The core question we ask is:

Theoretical Question Why does pretraining at a high label granularity benefit generalization?

Certainly, this grand challenge can be impossible to answer in full because of the uncontrollable complexity of the practical situations. To say something concrete, we focus on a tractable (sub-)problem under a controlled setting:

Simple scheme: We pretrain a backbone on a classification task and then finetune it for a target problem;

Published in Transactions on Machine Learning Research (10/2024)

Assume negligible distribution shift between the input distributions of the source and target datasets;

The label functions for both datasets align well in terms of the features which they consider discriminative;

The labels are error-free.

1.2 Main results and theoretical contributions

Our main result is based on analyzing a two-layer convolutional neural network with Re LU activation. We assume that the data distribution satisfies a certain hierarchical multi-view condition (to be discussed in Section 4.1). The optimization algorithm is stochastic gradient descent. Such problem settings are consistent with published works on this subject (Allen-Zhu & Li, 2023b; 2022; Shen et al., 2022b; Jelassi & Li, 2022). Our conclusions are as follows.

Theoretical results 1. Coarse-grained pretraining only allows the neural network to learn the common features well. Therefore, when testing, the test error on easy samples is o(1) (i.e., small) whereas the error on hard samples is Ω(1) (i.e., large). 2. Fine-grained pretraining helps the network learn the rare features in addition to the common ones, thus improving its test error on hard samples. In particular, the test error rate on both easy and hard test samples are o(1) (i.e., small).

To our knowledge, a precise characterization of the test error presented in this paper has never been reported in the literature. The key enablers of our theoretical finding are the concepts of hierarchical multi-view and representation-label correspondence. We summarize these two concepts below:

1. Hierarchical multi-view. To understand the label granularity problem, we argue that it is necessary for coarse and fine-grained classes to be distinguished by their corresponding input features. This is consistent with the multi-view data property pioneered by Allen-Zhu & Li (2023b). We call this a hierarchical multi-view structure. The hierarchical multi-view structure on the data makes us different from many other deep learning theory works that assume simple or no structure in the input data (Kawaguchi, 2016; Allen-Zhu & Li, 2023a; Ba et al., 2022; 2023; Damian et al., 2022; Kumar et al., 2023; Ju et al., 2021). 2. Representation-label correspondence. Representation learning aims to recognize features in the input data. As will be shown later in the paper, under the hierarchical multi-view data assumption, label complexity (i.e., how complex the labels are) during training influences the representation complexity (i.e., how many and what types of features are learnt), which further influences the model s generalization performance. Studying label granularity through understanding the neural network s feature-learning process is a departure from the literature which focuses on feature selection (Jacot et al., 2018; Ju et al., 2021; 2022; Pezeshki et al., 2021; Arora et al., 2019), i.e., selecting a subset of pre-determined features.

2 Related Work

2.1 Our theoretical setting compared to the literature

The subject of label granularity is immensely related to how to make a deep neural network (DNN) generalize better. In the existing literature, this is mostly explained through the lens of implicit regularization and bias towards simpler solutions to prevent overfitting even when DNNs are highly overparameterized (Lyu et al., 2021; Kalimeris et al., 2019; Ji & Telgarsky, 2019; De Palma et al., 2019; Huh et al., 2017). An alternative approach is the concept of shortcut learning which argues that deep networks can learn overly simple solutions. As such, deep networks achieve high training and testing accuracy on in-distribution data but generalize poorly to challenging downstream tasks (Geirhos et al., 2020; Shah et al., 2020; Pezeshki et al., 2021).

By examining these papers, we believe that Shah et al. (2020); Pezeshki et al. (2021) are the closest to ours because they demonstrate that DNNs perform shortcut learning and respond weakly to features that have a weak presence in the training data. However, our work departs from Shah et al. (2020); Pezeshki et al. (2021) in several key ways.

Published in Transactions on Machine Learning Research (10/2024)

1. We focus on how the pretraining label space affects classification generalization, while Shah et al. (2020); Pezeshki et al. (2021) primarily focus on demonstrating that simplicity bias can be harmful to generalization. 2. The core theoretical tool used by Pezeshki et al. (2021) is the neural tangent kernel (NTK) model, which is unsuitable for analyzing the label granularity problem because the feature extractor of an NTK model barely changes after pretraining. 3. The theoretical setting in Shah et al. (2020) is limited because they use the hinge loss while we use a more standard exponential-tailed cross-entropy loss. 4. Our data distribution assumptions are more realistic, as they capture feature hierarchies in natural images, which has direct impact on the downstream generalization power of the pretrained model.

2.2 Our analytic tool compared to literature

Our theoretical analysis is inspired by a recent line of work by Allen-Zhu & Li (2022; 2023b); Shen et al. (2022b). These papers analyze the feature learning dynamics of neural networks by tracking how the hidden neurons of shallow nonlinear neural networks evolve to solve dictionary-learning-like problems. We adopt a multi-view approach to the data distribution which was first proposed in Allen-Zhu & Li (2023b). However, the learning problems we analyze and the results we aim to show are fundamentally different. As such, we derive the gradient descent dynamics of the neural network from scratch.

2.3 Consistency with existing empirical results

We stress that our theoretical findings are consistent with the reported empirical results in the literature, especially those that aim to improve classification accuracy by manipulating the pre-training label space (Mahajan et al., 2018; Singh et al., 2022; Yan et al., 2020; Shnarch et al., 2022; Juan et al., 2020; Yang et al., 2021; Chen et al., 2018; Ridnik et al., 2021; Son et al., 2023; Ngiam et al., 2018; Cui et al., 2018; 2019a). For example, Mahajan et al. (2018); Singh et al. (2022) use hashtags from Instagram as pretraining labels, Yan et al. (2020); Shnarch et al. (2022) apply clustering on the data first and then treat the cluster IDs as pretraining labels, Juan et al. (2020) use the queries from image search results, Yang et al. (2021) apply image transformations such as rotation to augment the label space, and Chen et al. (2018); Ridnik et al. (2021) include fine-grained manual hierarchies in their pretraining processes. Our results corroborate the utility of pretraining on fine-grained label space.

On the empirical end, there is also work focusing on exploiting the hierarchical structures present in (humangenerated) label space to improve classification accuracy (Yan et al., 2015; Zhu & Bain, 2017; Goyal & Ghosh, 2020; Sun et al., 2017; Zelikman et al., 2022; Silla & Freitas, 2011; Shkodrani et al., 2021; Bilal et al., 2017; Goo et al., 2016). For example, Yan et al. (2015) adapt the network architecture to learn super-classes at each hierarchical level, Zhu & Bain (2017) add hierarchical losses in the hierarchical classification task, Goyal & Ghosh (2020) propose a hierarchical curriculum loss for curriculum learning. Our results do not directly validate these practices because we are more interested in understanding the influence of label granularity on model generalization.

3 Notations and Intuitions

3.1 Notations and training schemes

For a DNN-based classifier, given input image X, we can write its (pre-logit) output for class c as

Fc(X) | {z } pre-logit output for class c

= D ac |{z} linear classifier

, h |{z} backbone network

( Θ |{z} network parameter

; X) E , (1)

where ac is the linear classifier for class c, h(Θ; ) is the network backbone with parameter Θ.

Referring to Figure 1, label granularity concerns about two datasets: X src for the source (typically fine-grained) and X tgt for the target (typically coarse-grained). The corresponding labels are Ysrc and Ytgt, respectively. A dataset can be represented as D = (X, Y). For instance, the source training dataset is Dsrc train = (X src train, Ysrc train).

Published in Transactions on Machine Learning Research (10/2024)

Poodle feature

Golden retriever feature

Dog feature

Dog feature

Persian feature

Siamese feature

Cat feature

Cat feature

Figure 3: A simplified symbolic representation of the cat versus dog problem.

The relevant training and testing datasets are denoted as Dsrc train, Dtgt train, Dtgt test. Finally, the granularity of a label set is denoted as G(Y), which represents the total number of classes.

The two learning methodologies of interest are as follows.

1. Baseline: Train Fc( ) using Dtgt train. Test Fc( ) using Dtgt test. 2. Fine-to-coarse: Train Fc( ) using Dsrc train. This gives us the pretrained feature extractor h(Θsrc train; ). Then finetune Fc( ) using Dtgt train. Test the resulting Fc( ) using Dtgt test.

3.2 Intuition: why higher granularity improves generalization

Consider the following toy example. There are two classes: cat and dog. Our goal is to build a binary classifier. Let s discuss how the two training schemes would work, with an illustration shown in Figure 3.

1. Baseline. The baseline method tries to identify the common features that can distinguish most of the cats from dogs, for instance, the shape of the animal s ear as shown in Figure 3. These features are often the most noticeable ones because they appear the most frequently. Of course, there are hard samples, e.g., a close-up shot of a cat s fur. They pose limited influence during training because they are relatively rare in natural images. 2. Fine-to-coarse. With fine-grained labels, each subclass has its own unique visual features that are only dominant within that subclass. However, fine-grained features are not as common in the dataset, hence making them more difficult to be noticed. Therefore, if we only present the coarse labels in the pre-training stage, the learner is allowed to take shortcuts by learning only the common features to achieve low training loss. One strategy to force the learner to learn the rarer features is to explicitly label the fine-grained classes. This means that within each fine-grained class, the fine-grained features become as easy to notice as the common features. As a result, even if common features are weakly present or missing in a hard test sample, the network can still be reasonably robust to distracting irrelevant patterns due to its ability to recognize (some of) the finer-grained features.

4 Problem Formulation

Our first theoretical contribution is a new data model, the hierarchical multi-view model. This model consists of four definitions. Compared to existing theories studying feature learning of neural networks in the literature (Allen-Zhu & Li, 2023b; 2022; Shen et al., 2022b; Jelassi & Li, 2022), these four definitions are better formulated to the label granularity problem. For the sake of brevity, we present the core concepts of our data model here, and delay its full specification to Appendix B. Following data model specifications, we also discuss characteristics of the learner, a two-layer nonlinear convolutional neural network.

Published in Transactions on Machine Learning Research (10/2024)

4.1 New data model: hierarchical multi-view

We consider the setting where an input sample X Rd P consists of P patches x1, x2, ..., x P with xp Rd, where d is sufficiently large, and all our asymptotic statements are made with respect to d.

For analytic tractability, we consider two levels of label hierarchy. The root of this hierarchy has two superclasses +1 and 1. The superclass +1 has k+ subclasses. We denote these k+ subclasses as (+1, c1), . . . , (+1, ck+). We can do the same for the superclass 1 which has k subclasses. Each subclass has two types of features: the common features and the fine-grained features. The two types of features are sufficiently different in the sense they have zero correlation and equal magnitude. This leads to the following definition. Definition 4.1 (Features). We define features as elements of a fixed orthonormal dictionary V = {vi}d i=1 Rd. The common and fine-grained features are

Common feature: v+ V and v V Fine-grained feature of subclass c: v+,c and v ,c V

The usage of an orthonormal dictionary is again a choice of our model. We choose so because it is more tractable. With features defined, we can now specify patches in an input sample. Definition 4.2 (Input patches). We define three types of patches for y {+, }:

(Common-feature patches) are defined as xp = αpvy + ζp, where αp 1, and ζp N(0, σ2 ζId). (Subclass-feature patches) are defined as xp = αpvy,c + ζp, where αp 1, and ζp N(0, σ2 ζId). (Noise patches) are defined as xp = ζp.

Within an input sample X = (x1, x2, ..., xp), there are approximately s common-feature patches and s

subclass-feature patches, the rest are all noise patches. Moreover, within a sample, the choice of y has to be consistent across the feature patches. Lastly, the positions of the features patches are random.

These definitions of the input patches are illustrated in Figure 4.

Common cat feature patch

Fine-grained Persian feature patch

Figure 4: Illustration of features and patches.

Some comments: An easy sample is generated according to Definition 4.2. A hard sample is generated in the same way as easy samples, except the common-feature patches are replaced by noise patches, and we replace a small number of noise patches by feature-noise patches, which are of the form xp = α pv + ζp, where α p o(1), and set one of the noise patches to ζ N(0, σ2 ζ Id) with σζ σζ; these patches serve the role of distracting patterns . Definition 4.3 (Source dataset s label mapping). We say a sample X belongs to the +1 superclass if any one of its commonor subclass-feature patches contains v+ or v+,c for any c [k+]. It belongs to the (+, c) subclass if any one of its subclass-feature patches contains v+,c. Definition 4.4 (Source training set). We assume the input samples of the source training set as X src train are generated as in Definition 4.2; the corresponding labels are generated following Definition 4.3. Overall, we denote the source training dataset Dsrc train.

Relation to multi-view. Our data model is inspired by the multi-view concept first proposed in Allen-Zhu & Li (2023b), as we (1) use an orthonormal dictionary to define the features, (2) define an input consisting

Published in Transactions on Machine Learning Research (10/2024)

of many disjoint high-dimensional patches, and (3) assume the existence of multiple discriminative features per class. The reason why the original multi-view property is insufficient for our problem is that it does not consider any label hierarchy nor its link to the input structure. We resolve this issue by following our intuition that classes at different hierarchy levels should be distinguished by their corresponding features: this naturally defines a feature hierarchy, with an exact correspondence with the label hierarchy.

Target dataset. To ensure that baseline and fine-grained training have no unfair advantage over each other, we post a set of new characterizations on the target dataset:

1. The input samples in the target dataset is generated according to Definition 4.2. 2. The true label function is identical across the source and target datasets. 3. Since we are studying the fine-to-coarse transfer direction, the target problem s label space is the root of the hierarchy, meaning that any element of Ytgt train or Ytgt test must belong to the label space {+1, 1}.

Therefore, in our setting, only Ysrc and Ytgt can differ (in distribution) due to different choices in the label granularity level. In this idealized setting, we have essentially made baseline training and coarse-grained pretraining the same procedure. Therefore, an equally valid way to view our theory s setting is to consider Dtgt train the same as Dsrc train except with coarse-grained labels. In other words, we pretrain the network on two versions of the source dataset Dsrc,coarse train and Dsrc,fine train , and then compare the two models on Dtgt test (which has coarse-grained labels).

4.2 Characteristics about the learner

Our model about the learner is consistent with Allen-Zhu & Li (2023b; 2022); Shen et al. (2022b). The learner is a two-layer average-pooling convolutional Re LU network:

p=1 σ( wc,r, xp + bc,r), (2)

where m is a low-degree polynomial in d and denotes the width of the network, σ( ) = max(0, ) is the Re LU nonlinearity, and c denotes the class. We perform a random initialization of w(0) c,r N(0, σ2 0Id) with

σ2 0 = 1/poly(d); we set b(0) c,r = Θ σ0 p

ln(d) and manually tune it, similar to Allen-Zhu & Li (2022). Crossentropy is the training loss for both baseline and transfer training. To simplify analysis and to focus solely on the learning of the feature extractor, we freeze ac,r = 1 during all baseline and transfer training phases, and we use the fine-grained model for binary classification as follows: b F+(X) = maxc [k+] F+,c(X), b F (X) = maxc [k ] F ,c(X). See Appendix B.2 and the beginning of Appendix G for details of learner characteristics and training algorithm.

5 Theoretical results and proof strategy

Our second theoretical contribution lies in establishing a correspondence between the complexity of the labels and complexity of the network s representations. Under the assumption of the hierarchical multi-view data structure, the following are true:

1. If trained with coarse-grained labels (i.e. overly simple labels), the network only learns the common features well, so its representations of the data is overly simple; 2. In contrast, training with fine-grained labels helps the network learn the fine-grained features well in addition to the common ones, so its representation of the data is more complex.

The difference in representation complexity leads to the difference in the network s downstream test accuracy.

5.1 Main results

Theorem 5.1 (Coarse-label training: baseline). (Summary). Let the number of subclasses be lower-bounded: ky polylog(d). With high probability, with proper choice of step size, there exists a time T poly(d) such

Published in Transactions on Machine Learning Research (10/2024)

that for any T [T , poly(d)], the training loss is upper bounded according to

L(F (T )) o(1) (3)

Moreover, for an easy test sample (Xeasy, y), the probability of making a classification mistake is small:

P h F (T ) y (Xeasy) F (T ) y (Xeasy) i o(1), for y = y. (4)

However, for all t [0, poly(d)], given a hard test sample (Xhard, y), the probability of making a classification mistake is large:

P h F (t) y (Xhard) F (t) y (Xhard) i Ω(1), for y = y. (5)

This theorem essentially says that, with a mild lower bound on the number of fine-grained classes, if we only train on the easy samples with coarse labels, it is virtually impossible for the network to learn the fine-grained features even if we give it as much practically reachable amount of time and training samples as possible. Consequently, the network would perform poorly on the hard downstream test samples: if the sample is missing the common features, then the network can be easily misled by the noise present in the sample. To see the full setup and statement of this theorem, please see Appendix B and E. Its proof spans Appendix C to E.

Theorem 5.2 (Fine-grained-label training). (Summary). Assume the same setting as in Theorem 5.1, except let the labels be fine-grained and ky d0.4 (number of subclasses not pathologically large; see Section 7 for its discussion). Within poly(d) time, the probability of making a classification mistake is small:

P h b F (T ) y (X) b F (T ) y (X) i o(1) for y = y, (6)

on the target binary problem on both easy and hard test samples.

The full version of this result is presented in Appendix G.4, and its proof in Appendix G. After fine-grained pretraining, the network s feature extractor gains a strong response to the fine-grained features, therefore its accuracy on the downstream hard test samples increases significantly.

Remark. One concern about the above theorems is that the neural networks are trained only on easy samples. As noted in Sections 1 and 3.2, easy samples should make up the majority of the training and testing samples. Pretraining at higher label granularities only improves network performance on rare samples. Our theoretical result presents the feature-learning bias of a neural network in an exaggerated fashion. Therefore, it is natural to start with the case of no hard training samples. In reality, even if a small portion of hard training samples is present, finite-sized training datasets can have many flaws that can cause the network to overfit severely before learning the fine-grained features, especially since rarer features are learnt more slowly and corrupted by greater amount of noise. We leave these deeper considerations for future theoretical work.

5.2 Proof strategy: representation-label correspondence

The key idea of the proof is to establish a correspondence between the complexity of the labels and complexity of the network s representations. We show that when trained on coarse-grained labels (i.e. overly simple labels), the network only learns the common features well, so its representations of the data is overly simple. In contrast, training with fine-grained labels helps the network learn the fine-grained features well in addition to the common ones, so its representations are more complex.

We first sketch the proof of baseline training which uses coarse-grained labels.

Feature detector neurons. We show that, at initialization, with high probability, for every feature v V, there exists a small group of lucky neurons, denoted S (0) y (v) (with y indicating the superclass), that only activate on v-dominated feature patches. We prove that if v is a feature of class y, then with high probability,

Published in Transactions on Machine Learning Research (10/2024)

the lucky neurons will remain activated on v-dominated patches throughout training, and dominate the feature extractor s response to the feature v. In particular, given any v-dominated patch xp = αpv + ζp,

r=1 σ D w(t) y,r, xp E + b(t) y,r

| {z } network representation of v-dominated patch xp

r S (0) y (v)

σ D w(t) y,r, xp E + b(t) y,r

| {z } detector neurons response to v-dominated patch xp

, t [0, poly(d)]. (7)

Therefore, we call neurons in S (0) y (v) the detector neurons of feature v.

The significance of equation 7 is that, we may now argue about the network s representation of the input data solely based on the behavior of the feature detector neurons.

Impartial representation at initalization. At initialization, the feature extractor s response to common and fine-grained features are very close. The reason is that, S (0) y (v) S (0) y (v ) for all superclasses y, y

and features v, v , and they all have a similar magnitude of activation strength. Written explicitly, given any common-feature patch xcom = αvy + ζ and subclass-feature patch xsub = α vy,c + ζ (from the training or testing distribution), with high probability,

r S (0) y (vy)

σ D w(0) y,r, αvy + ζ E

| {z } network representation of common-feature patch xcom at t = 0

r S (0) y (vy,c)

σ D w(0) y,r, α vy,c + ζ E

| {z } network representation of subclass-feature patch xsub at t = 0

So what happened during training which caused a strong imbalance of representation of the common and fine-grained features in the end? The answer below is the core of the proof.

Overly simple labels = overly simple representations. The imbalance of growth is a result of the subclass-feature patches occurring with less frequency in the training set than the common-feature patches. Recall that the number of subclasses is ky: for any subclass (y, c), subclass-feature patches dominated by vy,c are about ky times rarer than the common feature patches. This has a direct impact on the growth speed of the common and fine-grained detector neurons: for any neuron rcom S (0) y (vy) and any rfine S (0) y (vy,c), w(t) y,rcom, vy Θ (ky) w(t) y,rfine, vy,c .

With careful arguments on the influence of noise and bias on the activation values, we can show that, for t sufficiently large, the fine-grained detector neurons are about Θ(ky) times weaker in strength:

r S (0) y (vy)

σ D w(t) y,r, αvy + ζ E

| {z } network representation of common-feature patch xcom at large t

r S (0) y (vy,c)

σ D w(t) y,r, α vy,c + ζ E

| {z } network representation of subclass-feature patch xsub at large t

Furthermore, we prove that, due to the exponential tail of cross-entropy, by the end of training,

r S (0) y (vy)

σ D w(t) y,r, αvy + ζ E

+ b(t) y,r = Θ (log(d)) (10)

which causes the representation of subclass-feature patches to be vanishing in strength:

r S (0) y (vy,c)

σ D w(t) y,r, α vy,c + ζ E

+ b(t) y,r O log(d)

< o(1), t poly(d). (11)

In other words, the neural network almost cannot detect subclass features by the end of baseline training. Therefore, even though it can classify the easy test samples correctly since it learned the common features

Published in Transactions on Machine Learning Research (10/2024)

well, it simply cannot classify the hard ones, which requires the model to solely rely on subclass-feature patches for inference.

Fine-grained training alleviates this issue.

Complex labels = complex representations. The proof of fine-grained training proceeds in a very similar fashion as the case of coarse-grained training. The main difference lies in the gradient updates. During training, for any neuron rcom S (0) (y,c)(vy) and any rfine S (0) (y,c)(vy,c),

D w(t) (y,c),rcom, vy

D w(t) (y,c),rfine, vy,c

In other words, the common and fine-grained detector neurons for each subclass grow at similar speeds now, because the commonand subclass-feature patches occur with similar frequency in each subclass. Again with careful analysis of how the noise and bias influence the activation values, we arrive at

r S (0) (y,c)(vy)

σ D w(t) (y,c),r, αvy + ζ E

+ b(t) (y,c),r

| {z } network representation of common-feature patch xcom, end of training

r S (0) (y,c)(vy,c)

σ D w(t) (y,c),r, α vy,c + ζ E

+ b(t) (y,c),r

| {z } network representation of subclass-feature patch xsub, end of training Ω(1) Ω(1)

Therefore, both the common and fine-grained features are learnt well. It follows that the model can correctly utilize the commonand subclass-feature patches in the input, so it can classify easy and hard test samples with high accuracy.

6 Empirical Results

Building on our theoretical analysis in an idealized setting, this section discusses conditions on the source and target label functions that we observed to be important for fine-grained pretraining to work in practice, while remaining in the controlled setting described in Section 1 for the sake of tractability. We present the core experimental results obtained on Image Net21k and i Naturalist 2021 in the main text, and leave the experimental details and ablation studies to Appendix A.

6.1 Image Net21k Image Net1k transfer experiment

This subsection provides more details about the experiment shown in Figure 2. Specifically, we show that the common practice of pretraining on Image Net21k using leaf labels is indeed better than pretraining at lower granularities in the manual hierarchy.

Hierarchy definition. The label hierarchy in Image Net21k is based on Word Net Miller (1995); Deng et al. (2009). To define fine-grained labels, we first define the leaf labels of the dataset as Hierarchy level 0. For each image, we trace the path from the leaf label to the root using the Word Net hierarchy. We then set the k-th synset (or the root synset, if it is higher in the hierarchy) as the level-k label of this image. This procedure also applies to the multi-label samples. This is how we generate the hierarchies shown in Figure 2.

Network choice and training. For this dataset, we use the more recent Vision Transformer Vi T-B/16 Dosovitskiy et al. (2021). Our pretraining pipeline is almost identical to the one in Dosovitskiy et al. (2021). For fine-tuning, we experimented with several strategies and report only the best results in the main text; the finer details are discussed in Appendix A.1.2 and A.2. To ensure a fair comparison, we also used these strategies to find the best baseline result by using Dtgt train for pretraining.

6.2 Transfer experiment on i Naturalist 2021

We conduct a systematic study of the transfer method within the label hierarchies of i Naturalist 2021 (Horn & macaodha, 2021). This dataset is well-suited for our analysis because it has a manually defined label

Published in Transactions on Machine Learning Research (10/2024)

hierarchy that is based on the biological traits of the creatures in the images. Additionally, the large sample size of this dataset reduces the likelihood of sample-starved pretraining on reasonably fine-grained hierarchy levels.

Our experiments on this dataset again demonstrate that, as long as the finer-grained labels contain little noise, are well-aligned with the target label space, and sample count per subclass is not too limited, then we observe improvement in the model s generalization performance. However, we also show negative results outside of the aforementioned nice regime : when sample count per sub-class is limited, or the fine-grained labels are noisy, or potentially misaligned with the target label space s, finer-grained labels do not necessarily improve generalization significantly.

Relevant datasets. We perform transfer experiments within i Naturalist2021. More specifically, we set X src train and X tgt train both equal to the training split of the input samples in i Naturalist2021, and set X tgt train to the testing split of the input samples in i Naturalist2021. To focus on the fine-to-coarse transfer setting, the target problem is to classify the root level of the manual hierarchy, which contains 11 superclasses. To generate a greater gap between the performance of different hierarchies and to shorten training time, we use the mini version of the training set in all our experiments.

Alternative hierarchies generation. To better understand the transfer method s operating regime, we experiment with different ways of generating the fine-grained labels for pretraining: we perform k Means clustering on the Vi T-L/14-based CLIP embedding Radford et al. (2021); Dehghani et al. (2022) of every sample in the training set and use the cluster IDs as pretraining class labels. We carry out this experiment in two ways. The green curve in Figure 5 comes from performing k Means clustering on the embedding of each superclass separately, while the purple one s cluster IDs are from performing k Means on the whole dataset. The former way preserves the implicit hierarchy of the superclasses in the cluster IDs: samples from superclass k cannot possibly share a cluster ID with samples belonging to superclass k = k. Therefore, its label function is forced to align better with that of the 11 superclasses than the purple curve s. We also assign random class IDs to samples.

Network choice and training. We experiment with Res Net 34 and 50 on this dataset. For pretraining on Dsrc train with fine-grained labels, we adopt a standard 90-epoch large-batch-size training procedure commonly used on Image Net He et al. (2016); Goyal et al. (2017). Then we finetune the network for 90 epochs and test it on the 11-superclass Dtgt train and Dtgt test, respectively, using the pretrained backbone h(Θsrc; ). To ensure a fair comparison, we trained the baseline model using exactly the same training pipeline, except that the pretraining stage uses Dtgt train. We observed that this retraining baseline consistently outperformed the naive one-pass 90-epoch training baseline on this dataset. Due to space limitations, we leave the results of Res Net50 to the appendix.

Interpretation of results. Figure 5 shows the validation errors of the resulting models on the 11-superclass problem. We make the following observations.

Reasonably fine-grained labels benefit generalization, but there is a catch. We can observe that in the blue curve of Figure 5 that, as long as the number of subclasses is less than 103, we see obvious decline in the validation error on the target labels. In other words, reasonably fine-grained pretraining is indeed beneficial in this setting. We should note, however, the overall curve exhibits a U shape: overly fine-grained labels are not beneficial to downstream generalization. This is intuitive. If the pretraining granularity is too close to the target one, we should not expect improvement. On the other extreme, if we assign a unique label to every sample in the training data, it is highly likely that the only differences a model can find between each class would be frivolous details of the images, which would not be considered discriminative by the label function of the target coarse-label problem. In this case, the pretraining stage is almost meaningless and can be misleading, as evidenced by the very high label-per-sample error (red star in Figure 5).

High granularity can be helpful, but label-assignment consistency is critical. Random class ID pretraining (orange curve) performs the worst of all the alternatives. The label function of this type does not generate a meaningful hierarchy because it has no consistency in the features it considers discriminative when decomposing the superclasses. This is in stark contrast to the manual hierarchies, which decompose the superclasses based on the finer biological traits (mostly visual in nature) of the creatures in the image.

Published in Transactions on Machine Learning Research (10/2024)

101 102 103 104 105 106

Label Granularity, i.e., Number of Classes in Pretraining

Validation Error

Baseline: Retraining Baseline: One-pass Subclass ID = Random Subclass ID = k Mean/Class Subclass ID = k Mean/All Subclass ID = Manual

Figure 5: In-dataset transfer. Res Net34 validation error (with standard deviation) of finetuning on 11 superclasses of i Naturalist 2021, pretrained on various label hierarchies. The manual hierarchy outperforms the baseline and every other hierarchy, and exhibits a U-shaped curve.

Alignment between fine-grained and target label spaces are important. For fine-grained pretraining to be effective, the features that the pretraining label function considers discriminative must align well with those valued by the label function of the 11-superclass hierarchy. To see this point, observe that for models trained on cluster IDs obtained by performing k Means on the CLIP embedding samples in each superclass separately (green curve in Figure 5), their validation errors are much lower than those trained on cluster IDs obtained by performing k Means on the whole dataset (purple curve in Figure 5). As expected, the manually defined fine-grained label functions align best with that of the 11 superclasses, and the results corroborate this view.

7 Discussion

Q: Are there other reasons why fine-grained labels benefit neural network generalization?

A: Yes, it is possible, e.g., the optimization landscape induced by finer-grained labels could contain less saddle points, making it friendlier to SGD. We did not analyze this because our focus is primarily on the generalization instead of optimization aspect of the problem.

Q: Does a higher label granularity always imply better generalization?

A: No. There is an operating regime. Training a model with pathologically high label granularity is harmful. For example, if we assign a unique class to every sample in the dataset, the model will be forced to rely on the frivolous differences between each sample. We verify this intuition in Figure 5 in Appendix A.1.1 on the i Naturalist 2021 dataset. These extreme scenarios do not arise in common practice, so we do not focus on them in this paper.

Q: The theoretical setting appears restrictive.

A: Our theoretical setting is consistent with Cao et al. (2022); Allen-Zhu & Li (2023b; 2022); Shen et al. (2022b); Jelassi & Li (2022). With a limited number of available analytic tools in the literature, we believe these settings are necessary to keep things tractable.

Published in Transactions on Machine Learning Research (10/2024)

8 Conclusion

In this paper, we formally studied the influence of pretraining label granularity on the generalization of DNNs, and performed large-scale experiments to complement our theoretical results. Under the new data model, hierarchical multi-view, we theoretically showed that higher label complexity leads to higher representation complexity, through which we explained why pretraining with fine-grained labels is beneficial to generalization. We complement our theory with experiments on Image Net and i Naturalist, demonstrating that in the controlled setting of this paper, pretraining on reasonably fine-grained labels indeed benefits generalization.

Broader Impact Statement

This paper presents work whose goal is to advance the theory of deep learning. There are potential societal consequences of our work, none which we feel must be specifically highlighted here.

Zeyuan Allen-Zhu and Yuanzhi Li. Feature purification: How adversarial training performs robust deep learning. In FOCS, 2022.

Zeyuan Allen-Zhu and Yuanzhi Li. Backward feature correction: How deep learning performs deep (hierarchical) learning. In COLT, 2023a.

Zeyuan Allen-Zhu and Yuanzhi Li. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. In ICLR, 2023b.

Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Neur IPS, 2019.

Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. In Neur IPS, 2022.

Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, and Denny Wu. Learning in the presence of low-dimensional structure: A spiked random matrix perspective. In Neur IPS, 2023.

Alsallakh Bilal, Amin Jourabloo, Mao Ye, Xiaoming Liu, and Liu Ren. Do convolutional neural networks learn class hierarchy? IEEE transactions on visualization and computer graphics, 2017.

Yuan Cao, Zixiang Chen, Misha Belkin, and Quanquan Gu. Benign overfitting in two-layer convolutional neural networks. In Neur IPS, 2022.

Zhuo Chen, Ruizhou Ding, Ting-Wu Chin, and Diana Marculescu. Understanding the impact of label granularity on cnn-based image classification. In ICDMW, 2018.

Yin Cui, Yang Song, Chen Sun, Andrew Howard, and Serge Belongie. Large scale fine-grained categorization and domain-specific transfer learning. In CVPR, 2018.

Yin Cui, Zeqi Gu, Dhruv Mahajan, Laurens Van Der Maaten, Serge Belongie, and Ser-Nam Lim. Measuring dataset granularity. ar Xiv preprint ar Xiv:1912.10154, 2019a.

Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In CVPR, 2019b.

Alexandru Damian, Jason Lee, and Mahdi Soltanolkotabi. Neural networks can learn representations with gradient descent. In COLT, 2022.

Giacomo De Palma, Bobak Kiani, and Seth Lloyd. Random deep neural networks are biased towards simple functions. In Neur IPS, 2019.

Mostafa Dehghani, Alexey Gritsenko, Anurag Arnab, Matthias Minderer, and Yi Tay. Scenic: A jax library for computer vision research and beyond. In CVPR, 2022.

Published in Transactions on Machine Learning Research (10/2024)

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks. In Nature Machine Intelligence, 2020.

Wonjoon Goo, Juyong Kim, Gunhee Kim, and Sung Ju Hwang. Taxonomy-regularized semantic deep convolutional neural networks. In ECCV, 2016.

Palash Goyal and Shalini Ghosh. Hierarchical class-based curriculum loss. ar Xiv preprint ar Xiv:2006.03629, 2020.

Priya Goyal, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. ar Xiv:1706.02677, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Grant Van Horn and macaodha. inat challenge 2021 - fgvc8, 2021. URL https://kaggle.com/competitions/ inaturalist-2021.

Minyoung Huh, Hossein Mobahi, Richard Zhang, Brian Cheung, Pulkit Agrawal, and Phillip Isola. The low-rank simplicity bias in deep networks. ar Xiv:2103.10427, 2017.

Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Neur IPS, 2018.

Samy Jelassi and Yuanzhi Li. Towards understanding how momentum improves generalization in deep learning. In ICML, 2022.

Ziwei Ji and Matus Telgarsky. Gradient descent aligns the layers of deep linear networks. In ICLR, 2019.

Ralph P. Boas Jr. and Jr. John W. Wrench. Partial sums of the harmonic series. The American Mathematical Monthly, 1971.

Peizhong Ju, Xiaojun Lin, and Ness Shroff. On the generalization power of overfitted two-layer neural tangent kernel models. In ICML, 2021.

Peizhong Ju, Xiaojun Lin, and Ness Shroff. On the generalization power of the overfitted three-layer neural tangent kernel model. In Neur IPS, 2022.

Da-Cheng Juan, Chun-Ta Lu, Zhen Li, Futang Peng, Aleksei Timofeev, Yi-Ting Chen, Yaxi Gao, Tom Duerig, Andrew Tomkins, and Sujith Ravi. Ultra fine-grained image semantic embedding. In WSDM, 2020.

Dimitris Kalimeris, Gal Kaplun, Preetum Nakkiran, Benjamin Edelman, Tristan Yang, Boaz Barak, and Haofeng Zhang. Sgd on neural networks learns functions of increasing complexity. In Neur IPS, 2019.

Stefani Karp, Ezra Winston, Yuanzhi Li, and Aarti Singh. Local signal adaptivity: Provable feature learning in neural networks beyond kernels. In Neur IPS, 2021.

Kenji Kawaguchi. Deep learning without poor local minima. In Neur IPS, 2016.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Neur IPS, 2012.

Tanishq Kumar, Blake Bordelon, Samuel J. Gershman, and Cengiz Pehlevan. Grokking as the transition from lazy to rich training dynamics. ar Xiv:2310.06110, 2023.

Published in Transactions on Machine Learning Research (10/2024)

Béatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5), 2000.

Kuang-Huei Lee, Anurag Arnab, Sergio Guadarrama, John Canny, and Ian Fischer. Compressive visual representations. In Neur IPS, 2021.

Kaifeng Lyu, Zhiyuan Li, Runzhe Wang, and Sanjeev Arora. Gradient descent on two-layer nets: Margin maximization and simplicity bias. In Neur IPS, 2021.

Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.

George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 1995.

Jiquan Ngiam, Daiyi Peng, Vijay Vasudevan, Simon Kornblith, Quoc V Le, and Ruoming Pang. Domain adaptive transfer learning with specialist models. ar Xiv preprint ar Xiv:1811.07056, 2018.

Mohammad Pezeshki, Oumar Kaba, Yoshua Bengio, Aaron C Courville, Doina Precup, and Guillaume Lajoie. Gradient starvation: A learning proclivity in neural networks. In Neur IPS, 2021.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.

Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik. Imagenet-21k pretraining for the masses. In Neur IPS Track on Datasets and Benchmarks, 2021.

Harshay Shah, Kaustav Tamuly, Aditi Raghunathan, Prateek Jain, and Praneeth Netrapalli. The pitfalls of simplicity bias in neural networks. In Neur IPS, 2020.

Ruoqi Shen, Sebastien Bubeck, and Suriya Gunasekar. Data augmentation as feature manipulation. In ICML, 2022a.

Ruoqi Shen, Sebastien Bubeck, and Suriya Gunasekar. Data augmentation as feature manipulation. In ICML, 2022b.

Sindi Shkodrani, Yu Wang, Marco Manfredi, and Nóra Baka. United we learn better: Harvesting learning improvements from class hierarchies across tasks. ar Xiv preprint ar Xiv:2107.13627, 2021.

Eyal Shnarch, Ariel Gera, Alon Halfon, Lena Dankin, Leshem Choshen, Ranit Aharonov, and Noam Slonim. Cluster & tune: Boost cold start performance in text classification. ar Xiv preprint ar Xiv:2203.10581, 2022.

Carlos N Silla and Alex A Freitas. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 2011.

Mannat Singh, Laura Gustafson, Aaron Adcock, Vinicius de Freitas Reis, Bugra Gedik, Raj Prateek Kosaraju, Dhruv Mahajan, Ross Girshick, Piotr Dollár, and Laurens Van Der Maaten. Revisiting weakly supervised pre-training of visual perception models. In CVPR, 2022.

Donghyun Son, Byounggyu Lew, Kwanghee Choi, Yongsu Baek, Seungwoo Choi, Beomjun Shin, Sungjoo Ha, and Buru Chang. Reliable decision from multiple subtasks through threshold optimization: Content moderation in the wild. In WSDM, 2023.

Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.

Xueting Yan, Ishan Misra, Abhinav Gupta, Deepti Ghadiyaram, and Dhruv Mahajan. Clusterfit: Improving generalization of visual representations. In CVPR, 2020.

Published in Transactions on Machine Learning Research (10/2024)

Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis De Coste, Wei Di, and Yizhou Yu. Hd-cnn: hierarchical deep convolutional neural networks for large scale visual recognition. In ICCV, 2015.

Chuanguang Yang, Zhulin An, Linhang Cai, and Yongjun Xu. Hierarchical self-supervised augmented knowledge distillation. ar Xiv preprint ar Xiv:2107.13715, 2021.

Eric Zelikman, Jesse Mu, Noah D Goodman, and Yuhuai Tony Wu. Star: Self-taught reasoner bootstrapping reasoning with reasoning. In Neur IPS, 2022.

Xinqi Zhu and Michael Bain. B-cnn: branch convolutional neural network for hierarchical classification. ar Xiv preprint ar Xiv:1709.09890, 2017.

Published in Transactions on Machine Learning Research (10/2024)

A Additional Experimental Results

In this section, we present the full details of our experiments and relevant ablation studies. All of our experiments were performed using tools in the Scenic library Dehghani et al. (2022).

A.1 In-dataset transfer results

To clarify, in this transfer setting, we are essentially transferring within a dataset. More specifically, we set X src = X tgt and only the label spaces Ysrc and Ytgt may differ (in distribution). The baseline in this setting is clear: train on Dtgt train and test on Dtgt test. In contrast, after pretraining the backbone network h(Θ; ) on Ysrc, we finetune or linear probe it on Dtgt train using the backbone and then test on Dtgt test.

A.1.1 i Naturalist 2021

i Naturalist 2021 is well-suited for our analysis because it has a high-quality, manually defined label hierarchy that is based on the biological traits of the creatures in the images. Additionally, the large sample size of this dataset reduces the likelihood of sample-starved pretraining on reasonably fine-grained hierarchy levels. We use the mini training dataset with size 500,000 instead of the full training dataset to show a greater gap between the results of different hierarchies and speed up training.

We use the architectures Res Net 34 and 50 He et al. (2016).

Training details. Our pretraining pipeline on i Naturalist is essentially the same as the standard large-batchsize Image Net-type training for Res Nets He et al. (2016); Goyal et al. (2017). The following pipeline applies to model pretraining on any hierarchy.

Optimization: SGD with 0.9 momentum coefficient, 0.00005 weight decay, 4096 batch size, 90 epochs total training length. We perform 7 epochs of linear warmup in the beginning of training until the learning rate reaches 0.1 4096/256 = 1.6, and then apply the cosine annealing schedule. Each training instance is run on 16 TPU v4 chips, taking around 2 hours per run.

Data augmentation: subtracting mean and dividing by standard deviation, image (original or its horizontal flip) resized such that its shorter side is 256 pixels, then a 224 224 random crop is taken.

For finetuning, we keep everything in the pipeline the same except setting the batch size to 4096/4 = 1024 and base learning rate 1.6/4 = 0.4. We found that finetuning at higher batch size and learning rate resulted in training instabilities and severely affected the final finetuned model s validation accuracy, while finetuning at lower batch size and learning rate than the chosen one resulted in lower validation accuracy at the end even though their training dynamics was stabler.

For the baseline accuracy, as mentioned in the main text, to ensure fairness of comparison, in addition to only training the network on the target 11-superclass problem for 90 epochs (using the same pretraining pipeline), we also perform retraining : follow the exact training process of the models trained on the various hierarchies, but use Dtgt train as the training dataset in both the pretrianing and finetuning stage. We observed consistent increase in the final validation accuracy of the model, so we report this as the baseline accuracy. Without retraining (so naive one-pass 90-epoch training on 11 superclasses), the average accuracy with standard deviation is 94.13, 0.025.

Clustering. To obtain the cluster-ID-based labels, we perform the following procedure.

1. For every sample Xn in the mini training dataset of i Naturalist 2021, obtain its Vi T-L/14 CLIP embedding En.

2. Per-superclass k Means clustering. Let C be the predefined number of clusters per class.

Published in Transactions on Machine Learning Research (10/2024)

Table 1: In-dataset transfer, i Naturalist 2021. Res Net34 average finetuning validation error and standard deviation on 11 superclasses in i Naturalist 2021, pretrained on various label hierarchies with different label granularity. Baseline (11-superclass) and best performance are highlighted.

Manual Hierarchy G(Ysrc) 11 13 51 273 1103 4884 6485 Validation error 5.25 0.051 5.40 0.075 5.10 0.038 4.83 0.041 4.79 0.045 4.82 0.056 4.84 0.033 Random class ID G(Ysrc) 22 88 352 1,408 5,632 11,264 500,000 Validation error 6.61 0.215 6.30 0.070 6.12 0.77 6.10 0.053 6.12 0.042 6.10 0.057 6.54 0.758 CLIP+k Means G(Ysrc) 22 88 352 1408 2816 5632 22528 per superclass Validation error 5.14 0.049 5.16 0.033 5.17 0.027 5.24 0.029 5.30 0.029 5.31 0.077 5.37 0.032 C+k per supclass G(Ysrc) 88 218 320 608 1040 1984 Class rebalanced Validation error 5.18 0.054 5.17 0.038 5.23 0.052 5.28 0.045 5.26 0.035 5.21 0.040 CLIP+k Means G(Ysrc) 22 44 88 352 1408 2816 5632 whole dataset Validation error 5.52 0.015 5.42 0.047 5.45 0.049 5.46 0.019 5.60 0.029 5.50 0.029 5.47 0.029

(a) For every superclass k, for the set of embedding {(En, yn = k)} belonging to that superclass, perform k Means clustering with cluster size set to C. (b) Given a sample with superclass ID k {1, 2, ..., 11} and cluster ID c {1, 2, ..., C}, define its fine-grained ID as C k + c.

3. Whole-dataset k Means clustering. Let C be the predefined number of clusters on the whole dataset.

(a) Perform k Means on the embedding of all the samples in the dataset, with the number of clusters set to C. Set the fine-grained class ID of a sample to its cluster ID.

Some might have the concern that having the same number of k Means clusters per superclass could cause certain classes to have too few samples, which could be a reason for why the cluster ID hierarchies perform worse than the manual hierarchies. Indeed, the number of samples per superclass on i Naturalist is different, so in addition to the above uniform-number-of-cluster-per-superclass hierarchy, we add an extra label hierarchy by performing the following procedure to balance the sample size of each cluster:

1. Perform k Means for each superclass with number of clusters set to 2, 8, 32, 64, 128, 256, 512, 1024 and save the corresponding image-ID-to-cluster-ID dictionaries (so we are basically reusing the clustering results of the CLIP+k Means per superclass experiment)

2. For each superclass, find the image-ID-to-cluster-ID dictionary with the highest granularity while still keeping the minimum number of samples for each cluster > predefined threshold (e.g. 1000 samples per subclass)

3. Now we have nonuniform granularity for each superclass while ensuring that the sample count per cluster is above some predefined threshold.

This simple procedure somewhat improves the balance of sample count per cluster, for example, Figure 6 shows the sample count per cluster for the cases of total number of clusters = 608 and 1984. Unfortunately, we do not observe any meaningful improvement on the model s validation accuracy trained on this more refined hierarchy.

Experimental procedures. All the validation accuracies we report on Res Net34 are the averaged results of experiments performed on at least 6 random seeds: 2 random seeds for backbone pretraining and 3 random seeds for finetuning. We report the average accuracies with their standard deviation on various hierarchies in Table 1.

An additional experiment we performed with Res Net34 is a small grid search over what checkpoint of a pretrained backbone we should use for finetuning on the 11-superclass method; we tried the 50-, 70and 90-epoch checkpoints of the backbone on the manual hierarchies. We report these results in Table 2. As we can see, 90-epoch checkpoints performs almost equally well as the 70-epoch checkpoints and better than the 50-epoch ones by a nontrivial margin. With this observation, we chose to use the end-of-pretraining 90-epoch checkpoints in all our other experiments without further ablation studies on those hierarchies.

Published in Transactions on Machine Learning Research (10/2024)

Table 2: In-dataset transfer, i Naturalist 2021. Res Net34 average finetuned validation error and standard deviation on 11 superclasses in i Naturalist 2021, pretrained on the manual hierarchies, with different backbone checkpoints.

90-Epoch ckpt G(Ysrc) 13 51 273 1103 4884 6485 Validation error 5.40 0.075 5.10 0.038 4.83 0.041 4.79 0.045 4.82 0.056 4.84 0.033 70-Epoch ckpt G(Ysrc) 13 51 273 1103 4884 6485 Validation error 5.43 0.055 5.08 0.029 4.86 0.037 4.82 0.034 4.83 0.064 4.85 0.018 50-Epoch ckpt G(Ysrc) 13 51 273 1103 4884 6485 Validation error 5.53 0.036 5.2 0.031 4.90 0.038 4.9 0.042 4.91 0.020 4.95 0.026

Table 3: In-dataset transfer, i Naturalist 2021. Res Net50 finetuned average validation error and standard deviation on 11 superclasses in i Naturalist 2021, pretrained on label hierarchies with different label granularity.

Manual Hierarchy G(Ysrc) 11 13 51 273 1103 4884 6485 Validation error 4.43 0.029 4.44 0.063 4.36 0.062 4.22 0.021 4.20 0.035 4.23 0.054 4.33 0.037 Random class ID G(Ysrc) 22 88 352 1,408 5,632 11,264 500,000 Validation error 5.36 0.111 5.31 0.079 5.24 0.093 5.38 0.052 5.37 0.033 5.40 0.033 5.13 0.072

0 250 500 750 1000 1250 1500 1750 2000 Subclass ID (sorted)

Sample Count

Sample count per cluster. Number of clusters = 1984

0 100 200 300 400 500 600 Subclass ID (sorted)

Sample Count

Sample count per cluster. Number of clusters = 608

Figure 6: In-dataset transfer, i Naturalist 2021. Number of samples per cluster in the case of 608 and 1984 total clusters, after applying the sample size rebalancing procedure described in subsection A.1.1. Observe that the sample sizes are reasonably balanced across almost all the subclasses.

Table 4: In-dataset transfer. Vi T-B/16 validation error on the binary problem is this object a living thing? of Image Net21k. Pretrained on various hierarchy levels of Image Net21k, finetuned on the binary problem. Observe that the maximal improvement appears at the leaf labels, and as G(Ysrc) approaches 2, the percentage improvement approaches 0.

Hierarchy level G(Ysrc) Validation error Baseline 2 7.90 0 (leaf) 21843 6.56 1 5995 6.76 2 2281 6.70 4 519 6.97 6 160 7.31 9 38 7.55

Our Res Net50 results are not as extensive as those on Res Net34. We present the average accuracies and standard deviations in Table 3.

Published in Transactions on Machine Learning Research (10/2024)

A.1.2 Image Net21k

The Image Net21k dataset we experiment on contains a total of 12,743,321 training samples and 102,400 validation samples, with 21843 leaf labels. A small portion of samples have multiple labels.

Caution: due to the high demand on computational resources of training Vi T models on Image Net21k, all of our experiments that require (pre-)training or finetuning/linear probing on this dataset were performed with one random seed.

Hierarchy generation. To define fine-grained labels, we start by defining the leaf labels of the dataset to be Hierarchy level 0. For every image, we trace from the leaf synset to the root synset relying on the Word Net hierarchy, and set the k-th synset (or the root synset, whichever is higher in level) as the level-k label of this image; this procedure also applies to the multi-label samples. This is the way we generate the manual hierarchies shown in the main text.

Due to the lack of a predefined coarse-label problem, we manually define our target problem to be a binary one: given an image, if the synset Living Thing is present on the path tracing from the leaf label of the image to the root, assign label 1 to this image; otherwise, assign 0. This problem almost evenly splits the training and validation sets of Image Net21k: 5,448,549:7,294,772 for training, 43,745:58,655 for validation.

Network choice and pretraining pipeline. We experiment with the Vi T-B/16 model Dosovitskiy et al. (2021). The pretraining pipeline of this model follows the one in Dosovitskiy et al. (2021) exactly: we train the model for 90 epochs using the Adam optimizer, with β1 = 0.9, β2 = 0.999, weight decay coefficient equal to 0.03 and a batch size of 4096; we let the dropout rate be 0.1; the output dense layer s bias is initialized to 10.0 to prevent huge loss value coming from the off-diagonal classes near the beginning of training Cui et al. (2019b); for learning rate, we perform linear warmup for 10,000 steps until the learning rate reaches 10 3, then it is linearly decayed to 10 5. The data augmentations are the common ones in Image Net-type training Dosovitskiy et al. (2021); He et al. (2016): random cropping and horizontal flipping. Note that we use the sigmoid cross-entropy for training since the dataset has multi-label samples.

Each training instance (90 epochs) is run on 64 TPU v4 chips, taking approximately 1.5 to 2 days.

Evaluation on the binary problem. After the 90-epoch pretraining on the manual hierarchies, we evaluate the model on the binary problem. We report the best accuracies on each hierarchy level in Table 4. To get a sense of how the relevant hyperparameters influence final accuracy of the model, we try out the following finetuning/linear probing strategies on the backbone trained on the leaf labels and the target binary problem of the dataset, and report the results in Table 5 (similar to our experiments on i Naturalist, we include the backbone trained on the binary problem in these ablation studies to ensure that our comparisons against the baseline are fair) :

1. 90-epochs finetuning in the same fashion as the pretraining stage, but with a small grid search over

(batch size, base learning rate) ={(4096, 0.001), (4096/4 = 1024, 0.001/4 = 0.00025),

(4096/8 = 512, 0.001/8 = 0.000125)}.

2. Linear probing with 20 epochs training length, using exactly the same training pipeline as in pretraining. We ran a small grid search over (batch size, base learning rate) = {(4096, 0.001), (4096/8 = 512, 0.001/8 = 0.000125)}.

3. 10-epochs finetuning, no linear warmup, 3 epochs of constant learning rate in the beginning followed by 7 epochs of linear decay, with a small grid search over (batch size, base learning rate) = {(4096, 0.001), (4096/8 = 512, 0.001/8 = 0.000125)}.

Table 5 helps us decide the best accuracies to report. First, as expected the linear probing results are much worse than the finetuning ones. Second, the retraining accuracy of 92.102 is the best baseline we can report (the same thing happened in the i Naturalist case) if we only train the model for 90 epochs (the naive one-pass training) on the binary problem, then the model s final validation accuracy is 91.746%, which is lower than 92.102% by a nontrivial margin. In contrast, the short 10-epoch finetuning strategy

Published in Transactions on Machine Learning Research (10/2024)

Table 5: In-dataset transfer, Image Net21k. Vi T-B/16 validation accuracy on the binary problem Is the object a Living Thing on Image Net21k. Ablation study on the exact finetuning/linear probing strategy.

Eval strategy 90-epoch finetune Linear probe 10-epoch finetune

Leaf-pretrained (Batch size, base lr) (4096,1e-3) (1024,2.5e-4) (512,1.25e-4) (4096, 1e-3) (512, 1.25e-4) (4096,1e-3) (512,1.25e-4) Validation error 92.782 93.177 93.295 87.497 87.493 92.294 93.439

Baseline (Batch size, base lr) (4096,1e-3) (1024,2.5e-4) (512,1.25e-4) (4096, 1e-3) (512, 1.25e-4) (4096,1e-3) (512,1.25e-4) Validation error 92.102 91.971 91.939 91.703 91.719 92.002 91.856

Table 6: In-dataset transfer, Image Net1k. Res Net50 finetuned average validation error and standard deviation on the vanilla 1000 classes, pretrained on label hierarchies with different label granularity.

Res Net50 CLIP+k Means G(Ysrc) 2000 4000 8000 per-class Validation error 23.4 0.13 23.48 0.098 23.49 0.204 Vi T-L/14 CLIP+k Means G(Ysrc) 2000 4000 8000 per-class Validation error 23.4 0.127 23.47 0.074 23.78 0.048 Random ID G(Ysrc) 2000 4000 8000 per-class Validation error 23.4 0.068 23.4 0.070 23.65 0.071

works best for the backbone trained on the leaf labels, therefore, we also use this strategy to evaluate the backbones trained on all the other manual hierarchies. A peculiar observation we made was that, finetuning the leaf-labels-pretrained backbone for extended period of time on the binary problem caused it to overfit severely: for batch size and base learning rate in the set {(4096, 0.001), (1024, 0.00025), (512, 0.000125)}, throughout the 90 epochs of finetuning, although its training loss exhibits the normal behavior of staying mostly monotonically decreasing, its validation accuracy actually reached its peak during the linear warmup period!

A.1.3 Image Net1k

Our Image Net1k in-dataset transfer experiments are done in a very similar fashion to the i Naturalist ones. In particular, the pretraining and finetuning pipeline for Res Net50 is exactly the same as the one in the i Naturalist case, so we do not repeat it here.

Due to a lack of more fine-grained manual label on this dataset, we generate fine-grained labels by performing k Means on the Vi T-L/14 CLIP embedding of the dataset separately for each class; the exact procedure is also identical to the i Naturalist case. The CLIP backbones we use here are the Res Net50 version and the Vi T-L/14 version. We report the average accuracies and their standard deviation in Table 6. All results are obtained from at least one random seed during pretraining and 3 random seeds during finetuning.

The best baseline we report is the one using retraining: if we adopt the pretrain-then-finetune procedure but with Dtgt train (i.e. the vanilla 1000-class labels) set as the pretraining dataset, then we obtain an average validation error of 23.28% with standard deviation of 0.103, averaged over results of 3 random seeds. In comparison, if we only perform the naive one-pass 90-epoch training, we obtain average valiation error 24.04%, with standard deviation 0.057.

From Table 6, we see that there is virtually no difference between the baseline and the best errors obtained by the models trained on the custom hierarchies: they are almost equally bad. Noting that the sample size of each class in Image Net1k is only around 103, and the fact that Image Net1k classification is a hard problem it is a problem of high sample complexity further decomposing the classes causes each fine-grained class to have too few samples, leading to the above negative results. This reflects the intuition that higher label granularity does not necessarily mean better model generalization, since the sample size per class might become too small.

Published in Transactions on Machine Learning Research (10/2024)

Table 7: Cross-dataset transfer. Vi T-B/16 average finetuning validation accuracy on Image Net1k along with standard deviation, pretrained on various hierarchy levels of Image Net21k, and a small grid search over the base learning rate.

Pretrained on / Base lr 3 10 3 3 10 2 6 10 2 3 10 1

Image Net21k, Hier. lv. 0 80.87 0.012 82.48 0.005 82.51 0.042 81.40 0.041 Image Net21k, Hier. lv. 1 77.38 0.037 81.03 0.054 81.28 0.045 80.40 0.087 Image Net21k, Hier. lv. 2 74.91 0.012 79.76 0.021 80.26 0.05 79.7 0.019 Image Net21k, Hier. lv. 4 63.65 0.052 76.43 0.033 77.32 0.088 77.53 0.078 Image Net21k, Hier. lv. 6 62.17 0.012 73.65 0.033 73.92 0.073 75.53 0.024 Image Net21k, Hier. lv. 9 53.68 0.034 69.33 0.045 71.08 0.068 72.75 0.071

Table 8: Cross-dataset transfer. Vi T-B/16 average linear-probing validation accuracy on Image Net1k along with standard deviation, pretrained on various hierarchy levels of Image Net21k.

Pretrained on Hier. lv G(Ysrc) Validation acc. IM21k 0 (leaf) 21843 81.45 0.021

1 5995 78.33 0.018

2 2281 75.66 0.005

4 519 68.95 0.051

6 160 63.65 0.035

9 38 57.35 0.016

A.2 Cross-dataset transfer, Image Net21k Image Net1k

In this subsection, we report the average validation accuracy and standard deviation of the cross-dataset transfer experiment from Image Net21k to Image Net1k, as discussed in Figure 2 and Section 1 in the main text.

Network choice. We use the same architecture Vi T-B/16 as the one in the in-dataset Image Net21k transfer experiment and follow the same training procedure, which we repeat here for the reader s convenience. The pretraining pipeline of this model follows the one in Dosovitskiy et al. (2021): we train the model for 90 epochs using the Adam optimizer, with β1 = 0.9, β2 = 0.999, weight decay coefficient equal to 0.03 and a batch size of 4096; we let the dropout rate be 0.1; the output dense layer s bias is initialized to 10.0 to prevent huge loss value coming from the off-diagonal classes near the beginning of training Cui et al. (2019b); for learning rate, we perform linear warmup for 10,000 steps until the learning rate reaches 10 3, then it is linearly decayed to 10 5. The data augmentations are the common ones in Image Net-type training Dosovitskiy et al. (2021); He et al. (2016): random cropping and horizontal flipping. Note that we use the sigmoid cross-entropy for training since the dataset has multi-label samples.

Additionally, each training instance (90 epochs) is run on 64 TPU v4 chips, taking approximately 1.5 to 2 days.

Finetuning. For finetuning on Image Net1k, our procedure is very similar to the one in the original Vi T paper Dosovitskiy et al. (2021), described in its Appendix B.1.1. We optimize the network for 8 epochs using SGD with momentum factor set to 0.9, zero weight decay, and batch size of 512. The dropout rate, unlike in pretraining, is set to 0. Gradient clipping at 1.0 is applied. Unlike Dosovitskiy et al. (2021), we still finetune at the resolution of 224 224. For learning rate, we apply linear warmup for 500 epochs until it reaches the base learning rate, then cosine annealing is applied; we perform a small grid search of base learning rate = {3 10 3, 3 10 2, 6 10 2, 3 10 1}. Every one of these grid search is repeated over

Published in Transactions on Machine Learning Research (10/2024)

3 random seeds. We report the Image Net1k validation accuracies and their standard deviations in Table 7. In the main text, we report the best accuracy for each hierarchy level.

Linear probing. For linear probing, we use the following procedure. We optimize the linear classifier for 40 epochs (similar to Lee et al. (2021)) using SGD with Nesterov momentum factor set to 0.9, a small weight decay coefficient 10 6, and batch size 512. We start with a base learning rate of 0.9, and multiply it by 0.97 per 0.5 epoch. In terms of data augmentation, we adopt the standard ones like before: horizontal flipping and random cropping of size 224 224. We repeat this linear probing procedure over 3 random seeds given the pretrained backbone, and report the average validation accuracy and standard deviation in Table 8.

Baseline. The baseline accuracy on Image Net1k is directly taken from the Vi T paper Dosovitskiy et al. (2021) (see Table 5 in it), in which the Vi T-B/16 model is trained for 300 epochs on Image Net1k.

Published in Transactions on Machine Learning Research (10/2024)

B Theory, Problem Setup

B.1 Data Properties

1. Coarse classification: a binary task, +1 vs. 1.

2. An input sample X Rd P consists of P patches, each with dimension d. In this work, always assume d is sufficiently large1;

3. Assume there exists k+ subclasses of the superclass + , and k subclasses of the superclass . Let k+ = k .

4. Assume orthonormal dictionary V = {v1, ..., vd} Rd, which forms an orthonormal basis of Rd. Define v+ V to be the common feature of class + . For each subclass (+, c) (where c [k+]), denote the subclass feature of it as v+,c V. Similar for the class.

5. For an easy sample X belonging to the (+, c) class (for c [k+]), we sample its patches as follows:

Definition: we define the function P : Rd P V [P] (so (X; v) 7 I [P]) to extract, from sample X, the indices of the patches on which the dictionary word v D dominates.

(a) (Common-feature patches) With probability s

P , a patch xp in X is a common-feature patch, on which xp = αpv+ + ζp for some (random) αp 1 ι, 1 + ι ;

(b) (Subclass-feature patches) With probability s P |P(X;v+)|, a patch with index p ([P] P(X; v+)) is a subclass-feature patch, on which xp = αpv+,c + ζp, for random αp 1 ι, 1 + ι ; (c) (Noise patches) For the remaining P |P(X; v+)| |P(X; v+,c)| patches, xp = ζp.

6. A hard sample Xhard for class (+, c) is exactly the same as an easy one except:

(a) Its common-feature patches are replaced by noise patches;

(b) (Feature noise patches) With probability s P |P(X;v+,c)|, a patch with index p ([P] P(X; v+,c))

is a feature-noise patch, on which xp = α pv + ζp for some (random) αp h ι lower, ι upper i ;

(c) Set one of the noise patches to ζ N(0, σ2 ζ Id).

7. A sample X belongs to the + superclass if |P(X; v+)| > 0 or |P(X; v+,c)| > 0 for any c (excluding feature-noise patches).

8. The above sample definitions also apply to the classes by switching the class signs.

9. A training batch of samples contains exactly N/2k+ samples for each (+, c) and ( , c) subclass. This also means that each training batch contains exactly N/2 samples belonging to the +1 superclass, and N/2 samples for the 1 superclass.

10. As discussed in the main text, for both coarse-grained (baseline) and fine-grained training, we only train on easy samples.

B.2 Learner Assumptions and Training Algorithm

Assume the learner is a two-layer convolutional Re LU network:

p=1 σ( wc,r, xp + bc,r) (14)

To simplify analysis and only focus on the learning of the feature extractor, we freeze ac,r = 1 throughout training. The nonlinear activation σ( ) = max(0, ) is Re LU. Note that the convolution kernels have dimension d and stride d.

1Consider each d-dimensional patch of the input as an embedding of the input image generated by, for instance, an intermediate layer of a DNN.

Published in Transactions on Machine Learning Research (10/2024)

Remark. One difference between this architecture and a CNN used in practice is that we do not allow feature sharing across classes: for each class c, we are assigning a disjoint group of neurons wc,r to it. Separating neurons for each class is a somewhat common trick to lower the complexity of analysis in DNN theory literature Allen-Zhu & Li (2023b); Karp et al. (2021); Cao et al. (2022), as it reduces complex coupling between neurons across classes which is not the central focus of our study in this paper.

Now we discuss the training algorithm.

Initialization.

Sample w(0) c,r N(0, σ2 0Id), and set b(0) c,r = σ0cb p

We adopt the standard cross-entropy training:

n=1 L(F; Xn, yn) =

exp(Fyn(Xn)) PC c=1 exp(Fc(Xn))

This induces the stochastic gradient descent update for each hidden neuron (c [k], r [m]) per minibatch of N iid samples:

w(t+1) c,r = w(t) c,r + η 1

1{yn = c}[1 logit(t) c (X(t) n )] X

p [P ] σ ( w(t) c,r, x(t) n,p + b(t) c,r)x(t) n,p+

1{yn = c}[ logit(t) c (X(t) n )] X

p [P ] σ ( w(t) c,r, x(t) n,p + b(t) c,r)x(t) n,p

logit(t) c (X) = exp(Fc(X)) PC y=1 exp(Fy(X)) (17)

As for the bias,

b(t+1) c,r = b(t) c,r w(t+1) c,r w(t) c,r 2 log5(d) (18)

Remark. 1. The initialization strategy is similar to the one in Allen-Zhu & Li (2022).

2. Since the only difference between the training samples of coarse and fine-grained pretraining is the label space, the form of SGD update is identical. The only difference is the number of output nodes of the network: for coarse training, the output nodes are just F+ and F (binary classification), while for fine-grained training, the output nodes are F+,1, F+,2, ..., F+,k+, F ,1, F ,2, ..., F ,k , a total of k+ + k nodes.

3. The bias is for thresholding out the neuron s noisy activations that grow slower than 1/ log5(d) times the activations on the features which the neuron detects. This way, the bias does not really influence updates to the neuron s response to the (common and/or fine-grained) features which it activates strongly on, since 1 1 log5(d) 1, while it removes useless low-magnitude noisy activations. This in fact creates a (generalization) gap between the nonlinear model that we are studying and linear models. Due to our parameter choices (as discussed below), if the model has no nonlinearity (remove the Re LU activations), then even if the model can be written as F+(X) = P

p [P ] c+ v+, xp + c+,1 v+,1, xp + ... + c+,k+ v+,k+, xp and F (X) = P

p [P ] c v , xp + c ,1 v ,1, xp + ... + c ,k v ,k , xp for

any sequence of nonnegative real numbers c+, c , {c+,j}k+ j=1, {c ,j}k j=1 (which is the ideal situation since the true features are not corrupted by anything), it is impossible for the model to reach o(1) error on the input samples, because the number of noise patches will accumulate to a variance of (P O(s )) σζ O(s ), which significantly overwhelms the signal from the true features. On the

Published in Transactions on Machine Learning Research (10/2024)

other hand, each noise patch is sufficiently small in magnitude with high probability (their strength is o(1/ log5(d))), so a slightly negative bias, as described above, can threshold out these noise-based signals and prevent them from accumulating across the patches.

An important difference between our bias update rule and the one in Allen-Zhu & Li (2022) is that, our rule depends on the ℓ2 norm of the neuron s update, while the one in Allen-Zhu & Li (2022) is hard-coded and not dependent on the neuron weights. The reason that we should not hard code the bias update rate is that, the neurons that are responsible for detecting the common features will grow more quickly in norm than those responsible for detecting the fine-grained features, therefore, to ensure fairness between the different groups of neurons (i.e. only using the bias to remove useless activations on the noise patches while creating minimal disturbance to the neurons activation on feature-dominated patches), we rely on our neuron-dependent bias update rule.

B.3 Parameter Choices

The following are fixed choices of parameters for the sake of simplicity in our proofs.

1. Always assume d is sufficiently large. All of our asymptotic results are presented with respect to d;

2. poly(d) denotes the asymptotic order polynomial in d ;

3. polylog(d) aymptotic order polylogarithmic in d ;

4. polylog(d) k+ = k d0.4 and s log5(d) k+ (i.e. k+ lower bounded by polynomial of log(d) of sufficiently high degree);

5. Small positive constant c0 (0, 0.1);

6. For coarse-grained (baseline) training, set cb = 4 + 2c0, and for fine-grained training, set cb = 2 + 2c0;

7. 0 ι 1 polylog(d);

8. ι lower 1 log4(d), and s ι upper O 1 log(d) ;

10. s polylog(d) with a degree > 15;

11. σζ = 1 log10(d)

12. σζ h ω polylog(d)

, O 1 polylog(d) i ;

13. Pσζ ω(polylog(d)), and P poly(d);

14. σ0 O 1 d3s log(d) , and set η = Θ(σ0) for simplicity;

15. Batch of samples B(t) at every iteration has a deterministic size of N (Ω(polylog(d)k+d), poly(d)).

16. Note: we sometimes abuse the notation x = a b as an abbreviation for x [a b, a + b].

Remark. We believe the range of parameter choice can be (asymptotically) wider than what is considered here, but for the purpose of illustrating the main messages of the paper, we do not consider a more general set of parameter choice necessary because having a wider range of it can significantly complicate and obscure the already lengthy proofs without adding to the core messages.

Published in Transactions on Machine Learning Research (10/2024)

B.4 Plan of presentation and central ideas

We shall devote the majority of our effort to proving results for the coarse-label learning dynamics, starting with appendix section C and ending on E, and only devote section G to the fine-grained-label learning dynamics, since the analysis of fine-grained training overlaps significantly with the coarse-grained one.

One technical difficulty in making the above ideas rigorous lies in the Re LU activation (with time-dependent bias): due to randomness in the gradient updates and the initialization, it is possible for individual hidden neurons that activate on v-dominated patches at one time iterate to no longer do so at the next iterate, and the opposite can happen. This can be problematic: for instance, it is possible that certain lucky neurons for v+ at one iterate become dead on v+-dominated patches at the next iterate, while some unlucky neurons that were dead on v+,c-dominated patches before start activating on these patches at the current iterate. In our proof, we show that this kind of situation does not happen too frequently nor do they contribute too much to the overall behavior of the neural network, by carefully keeping track of each hidden neuron s response to feature vectors and noise vectors throughout training.

Published in Transactions on Machine Learning Research (10/2024)

C Coarse-grained training, Initialization Geometry

For coarse-grained training, assume m = Θ(d2+2c0).

Definition C.1. Define the following sets of interest of the hidden neurons:

1. U(0) +,r = {v V : w(0) +,r, v σ0 4 + 2c0 q

log(d) 1 log5(d)}

2. Given v V, S (0) + (v) + [m] satisfies:

(a) w(0) +,r, v σ0 4 + 2c0 q

log(d) + 1 log5(d)

(b) v V s.t. v v, w(0) +,r, v < σ0 4 + 2c0 q

log(d) 1 log5(d)

3. Given v D, S(0) + (v) + [m] satisfies:

(a) w(0) +,r, v σ0 4 + 2c0 q

log(d) 1 log5(d)

4. For any (+, r) S (0) +,reg + [m]:

(a) w(0) +,r, v σ0

U(0) +,r O(1)

Proposition 1. Assume m = Θ(d2+2c0), i.e. the number of neurons assigned to the + and class are equal and set to Θ(d2+2c0).

At t = 0, for all v V, the following properties are true with probability at least 1 d 2 over the randomness of the initialized kernels:

1. |S (0) + (v)|, |S(0) + (v)| = Θ 1

2. In particular, for any v, v V, |S (0) + (v)|

|S (0) + (v )| 1 , |S (0) + (v)|

|S(0) + (v )| 1 O 1 log5(d)

3. S(0) +,reg = [m]

Proof. Recall the tail bound of g N(0, 1) for every ϵ > 0:

2π ϵ ϵ2 + 1e ϵ2/2 P [g ϵ] 1

2π 1 ϵ e ϵ2/2 (19)

First note that for any r [m], { w(0) +,r, v }v V is a sequence of iid random variables with distribution N(0, σ2 0).

The proof of the first point proceeds in two steps.

Published in Transactions on Machine Learning Research (10/2024)

1. The following properties hold at t = 0:

w(0) +,r, v σ0 4 + 2c0

log(d) + 1 log5(d)

8π d 2 c0e( 2 c0)/ log5(d)

(4 + 2c0) log(d) + 1 log5(d)

(4 + 2c0) log(d) + 1 log5(d) + 1 , 1 r

(4 + 2c0) log(d) + 1 log5(d)

w(0) +,r, v σ0 4 + 2c0

log(d) 1 log5(d)

8π d 2 c0e ( 2 c0)/ log5(d)

(4 + 2c0) log(d) 1 log5(d)

(4 + 2c0) log(d) 1 log5(d) + 1 , 1 r

(4 + 2c0) log(d) 1 log5(d)

Therefore, for any r [m], the random event described in S (0) + holds with probability

p1 (1 p2)d 1 =Θ

d 2 c0 !d 1

The last equality holds because defining f(d) = d 2 c0 and d being sufficiently large,

g(d) := |(d 1) log(1 f(d))| (d 1) (f(d) + O(f(d)2)) O(d 1) (23)

which means (1 f(d))d 1 = e g(d) (1 O(d 1), 1) (24)

2. Given v V, |S (0) + (v)| is a binomial random variable, with each Bernoulli trial (ranging over

r [m]) having success probability p1(1 p2)d 1. Therefore, E h |S (0) + (v)| i = mp1(1 p2)d 1 =

Now recall the Chernoff bound of binomial random variables. Let {Xn}m n=1 be an iid sequence of Bernoulli random variable with success rate p, and Sn = Pm n=1 Xn. Then for any δ (0, 1),

P[Sn (1 + δ)mp] exp δ2mp

P[Sn (1 δ)mp] exp δ2mp

Published in Transactions on Machine Learning Research (10/2024)

It follows that, for each v V, |S (0) + (v)| = Θ 1

dc0 with probability at least 1

exp( Ω(log 1/2(d))dc0). Taking union bound over all possible v D, the random event still holds with probability at least 1 exp( Ω(log 1/2(d))dc0 + O(log(d))) 1 exp( Ω(d0.5c0)) (in sufficiently high dimension).

The proof for S(0) + (v) proceeds in virtually the same way, so we omit the calculations here.

To show the second point, in particular |S (0) + (v)|

|S(0) + (v )| 1 O 1 log5(d) , we need to be a bit more careful in our

bounds of the relevant sets. In particular, we need to directly use the CDF of gaussian random variables: P

w(0) +,r, v σ0 4 + 2c0

log(d) + 1 log5(d)

w(0) +,r, v σ0 4 + c0

log(d) 1 log5(d)

log(d)+ 1 log5(d)

log(d) 1 log5(d) e ϵ2/2dϵ + O

2π d 2 c0e(2+c0)/ log5(d) 4 + 2c0

log(d) + 1 log5(d)

log(d) 1 log5(d)

2π d 2 c0e(2+c0)/ log5(d) 4 + 2c0

2 log5(d) q

log(d) + 1 log5(d) + q

log(d) 1 log5(d) + O

The expected difference in number between the two sets is just the above expression multiplied by m = Θ(d2+2c0), and with probability at least 1 exp( Ω(d c0/4)), the difference term satisfies

2π (1 d c0/2)Θ(dc0)e(2+c0)/ log5(d) 4 + 2c0

2 log5(d) q

log(d) + 1 log5(d) + q

log(d) 1 log5(d)

dc0 1 log5(d)

By further noting from before that |S(0) + (v)| = Θ 1

dc0, |S (0) + (v)|

|S(0) + (v )| 1 O 1 log5(d) follows. The

proof of |S (0) + (v)|

|S (0) + (v )| 1 O 1 log5(d) follows a very similar argument, so we omit the calculations here.

Now, as for the set S(0) reg, we know for any r [m] and vi D,

P h w(0) +,r, vi σ0

Taking the union bound over r and i yields

P h r and i s.t. w(0) +,r, vi σ0

log(d) i md O

d 5 < d 2. (29)

Published in Transactions on Machine Learning Research (10/2024)

Finally, to show U(0) +,r O(1) holds for every (+, r), we just need to note that for any arbitrary (+, r)

neuron, the probability of U(0) +,r > 4 is no greater than

d 8 4c0 d4 O 1 log2 d

d 4 4c0 (30)

Taking union bound over all m O d2+2c0 neurons yields the desired result.

Published in Transactions on Machine Learning Research (10/2024)

D Coarse-grained SGD Phase I: (Almost) Constant Loss, Neurons Diversify

Definition D.1. We define T0 to be the first time which there exists some sample n such that

F (T0) c (X(T0) n ) d 1 (31)

Without loss of generality assume c = +. Define phase I to be the time t [0, T0).

D.1 Main results

Theorem D.1 (Phase 1 SGD update properties). The following properties hold with probability at least

1 O m NP k+t

poly(d) O(e Ω(log2(d))) for every t [0, T0).

1. (On-diagonal common-feature neuron growth) For every (+, r), (+, r ) S (0) + (v+),

w(t) +,r w(0) +,r = w(t) +,r w(0) +,r (32)

w(t) +,r =η

1 ι 1 s 1/3 O 1 log10(d)

2P v+ + ζ(t) +,r (33)

where ζ(t) +,r N(0, σ(t)2 ζ+,r I), σ(t) ζ+,r = ησζ 1

2N , and |ψ1| d 1.

Furthermore, every (+, r) S (0) + (v+) activates on v+-dominated patches at time t.

2. (On-diagonal finegrained-feature neuron growth) For every possible choice of c and every (+, r), (+, r ) S (0) + (v+,c),

w(t) +,r w(0) +,r = w(t) +,r w(0) +,r (34)

w(t) +,r =η

1 ι 1 s 1/3 O 1 log10(d)

2k+P v+,c + ζ(t) +,r (35)

where ζ(t) +,r N(0, σ(t)2 ζ+,r I), and σ(t) ζ+,r = ησζ 1

Furthermore, every (+, r) S (0) + (v+,c) activates on v+-dominated patches at time t.

3. The above results also hold with the + and signs flipped.

Proof. The SGD update rule produces the following update:

w(t+1) +,r =w(t) +,r + η 1

1{yn = +}[1 logit(t) + (X(t) n )] X

p [P ] σ ( w(t) +,r, x(t) n,p + b(t) +,r)x(t) n,p (37)

+ 1{yn = }[ logit(t) + (X(t) n )] X

p [P ] σ ( w(t) +,r, x(t) n,p + b(t) +,r)x(t) n,p

Published in Transactions on Machine Learning Research (10/2024)

In particular,

equation 37 =

n=1 1{yn = +} 1

1{|P(X(t) n ; v+)| > 0}

p P(X(t) n ;v+) σ ( w(t) +,r, α(t) n,pv+ + ζ(t) n,p + b(t) +,r) α(t) n,pv+ + ζ(t) n,p

p/ P(X(t) n ;v+) σ ( w(t) +,r, x(t) n,p + b(t) +,r)x(t) n,p

+ 1{|P(X(t) n ; v+)| = 0} X

p [P ] σ ( w(t) +,r, x(t) n,p + b(t) +,r)x(t) n,p

n=1 1{yn = +} 1

1{|P(X(t) n ; v+)| > 0}

p P(X(t) n ;v+) 1 n w(t) +,r, α(t) n,pv+ + ζ(t) n,p b(t) +,r o α(t) n,pv+ + ζ(t) n,p

p/ P(X(t) n ;v+) 1 n w(t) +,r, x(t) n,p b(t) +,r o x(t) n,p

+ 1{|P(X(t) n ; v+)| = 0} X

p [P ] 1 n w(t) +,r, x(t) n,p b(t) +,r o x(t) n,p

The rest of the proof proceeds by induction (in Phase 1).

First, recall that we set b(0) c,r = 4 + 2c0 p

log(d), and b(t) c,r = w(t) c,r 2 log5(d) for all t in phase 1, and for any

+-class sample Xn with p P(X(t) n ; v+), α(t) n,p 1 ι by our data assumption.

Base case t = 0.

1. (On-diagonal common-feature neuron growth)

The base case for the neuron expression of point 1. is trivially true.

We show that the neurons (+, r) S (0) + (v+) only activate on v+-dominated patches at time t = 0.

With probability at least 1 O m NP poly(d) , by Lemma H.3, we have for all possible choices of r, n, p:

w(0) +,r, ζ(0) n,p O(σ0σζ p

d log(d)) O σ0 log9(d)

It follows that

w(0) +,r, α(0) n,pv+ + ζ(0) n,p

1 ι 4 + 2c0 q

log(d) + 1/ log5(d),

log(d) 1 log9(d)

1 ι 4 + 2c0 q

log(d) + 1/ log5(d),

log(d) 1 log9(d)

Published in Transactions on Machine Learning Research (10/2024)

Employing the basic identity a b = a2 b2

a+b , we have the lower bound

σ 1 0 w(0) +,r, α(0) n,pv+ + ζ(0) n,p + b(0) +,r

(1 ι)(4 + 2c0)(log(d) + 1/ log5(d)) p

(4 + 2c0) log(d) O 1 log9(d)

= (1 ι)(4 + 2c0)(log(d) + 1/ log5(d)) (4 + 2c0) log(d) q

(1 ι)(4 + 2c0)(log(d) + 1/ log5(d)) + p

(4 + 2c0) log(d) O 1 log9(d)

= (4 + 2c0)( ι log(d) + (1 ι)/ log5(d)) q

(1 ι)(4 + 2c0)(log(d) + 1/ log5(d)) + p

(4 + 2c0) log(d) O 1 log9(d)

The last inequality holds since ι 1 polylog(d) and d is sufficiently large such that 1 log9(d) does not drive the positive term down past 0.

Therefore, the neurons in S (0) + (v+) indeed activate on the v+-dominated patches at t = 0.

The rest of the patches x(0) n,p is either a feature patch (not dominated by v+) or a noise patch. By definition, (+, r) S (0) + (v+) = (+, r) S(0) + (v+). Therefore, by Theorem F.1, with probability at least 1

poly(d) , at time t = 0, the (+, r) S (0) + (v+) neurons we are considering cannot activate on any feature patch dominated by v v+, nor on any noise patches.

It follows that the expression equation 37 at time t = 0 is as follows:

equation 37 =

n=1 1{yn = +} 1

1{|P(X(0) n ; v+)| > 0}

p P(X(0) n ;v+)

1 ιv+ + ζ(0) n,p + X

p/ P(X(0) n ;v+) 0

+ 1{|P(X(0) n ; v+)| = 0} X

n=1 1{yn = +, |P(X(0) n ; v+)| > 0} X

p P(X(0) n ;v+)

1 ιv+ + ζ(0) n,p

n (n, p) [N] [P] : yn = +, |P(X(0) n ; v+)| > 0, p P(X(0) n ; v+) o

p P(X(0) n ;v+) {yn = +} 1

On average, E h n (n, p) [N] [P] : yn = +, |P(X(0) n ; v+)| > 0, p P(X(0) n ; v+) o i

Furthermore, with our parameter choices, and by concentration of binomial random variables, with probability at least 1 e Ω(polylog(d)), n (n, p) [N] [P] : yn = +, |P(X(0) n ; v+)| > 0, p P(X(0) n ; v+) o = s N

1 s 1/3 (45)

Published in Transactions on Machine Learning Research (10/2024)

must be true.

It follows that

equation 37 = 1

p P(X(0) n ;v+) {yn = +} 1

ζ(0) n,p (46)

The other component expression equation 38 is zero with probability at least 1 O mk+NP

poly(d) by Theorem F.1.

By noting that

Var ζ(0) +,r =Var

p P(X(0) n ;v+) {yn = +} 1

1 s 1/3 σ2 ζ,

E h ζ(0) +,r i = E

p P(X(0) n ;v+) {yn = +} 1

we finish the proof of the base case for point 1.

2. (On-diagonal finegrained-feature neuron growth)

The proof of the base case of point 2. is virtually identical to point 1, so we omit the computations here.

Inductive step: We condition on the high probability events of the induction hypothesis for t [0, T] (with T < T0 of course), and prove the statements for t = T + 1.

1. (On-diagonal common-feature neuron growth)

By the induction hypothesis, up to time t = T, with probability at least 1 O mk+NP T

poly(d) , for all (+, r)

S (T ) + (v+),

w(t) +,r =η

1 ι 1 s 1/3 ! s

2P v+ + ζ(t) +,r (49)

where ζ(t) +,r N(0, σ(t)2 ζ I), σ(t) ζ = ησζ 1

Expression of w(T +1) +,r .

Conditioning on the high-probability event of the induction hypothesis, at time t = T + 1,

w(T +1) +,r =w(0) +,r +

τ=0 w(τ) +,r

1 ι 1 s 1/3 ! s

2P v+ + ζ(t) +,r

where ζ(t) +,r N(0, σ(t)2 ζ I), σ(t) ζ = ησζ

Let us compute w(T +1) +,r .

Published in Transactions on Machine Learning Research (10/2024)

We first want to show that w(T +1) +,r activates on v+-dominated patches x(T +1) n,p = 1 ιv+ + ζ(T +1) n,p . We need to show that the following expression is above 0:

w(T +1) +,r , x(T +1) n,p + b(T +1) +,r

= w(0) +,r,

1 ιv+ + ζ(T +1) n,p + b(0) +,r

1 ι 1 s 1/3 O 1 log10(d)

2P v+ + ζ(T +1) +,r ,

1 ιv+ + ζ(T +1) n,p

τ=0 b(τ) +,r

Let us treat the three terms (on three lines) separately.

First, following virtually the same argument as in the base case, the following lower bound holds with probability at least 1 O m NP poly(d) for all n, p and (+, r) S (T ) + (v+):

1 ιv+ + ζ(T +1) n,p + b(0) +,r

(1 ι)(4 + 2c0)(log(d) + 1/ log5(d)) p

(4 + 2c0) log(d) O 1 log9(d)

Now consider the second term.

We know, with probability at least 1 e Ω(d), for all n and p,

ζ(T +1) n,p , v+ O 1 log10(d)

therefore, ηT

1 ι 1 s 1/3 O 1 log10(d)

2P v+, ζ(T +1) n,p

2P O 1 log10(d)

Moreover, with probability at least 1 e Ω(d),

ζ(T +1) +,r , v+ η

2N O 1 log10(d)

and with probability at least 1 e Ω(d),

ζ(T ) +,r, ζ(T +1) n,p O σζσ(T ) ζ d O η

2N 1 log20(d)dd η

2N 1 log19(d) (56)

ηTζ(T +1) +,r ,

1 ιv+ + ζ(T +1) n,p η

2N O 1 log10(d)

Published in Transactions on Machine Learning Research (10/2024)

It follows that with probability at least 1 O(e Ω(d)),

1 ι 1 s 1/3 ! s

2P v+ + ζ(T +1) +,r ,

1 ιv+ + ζ(T +1) n,p

1 ι 1 s 1/3 ! s

1 ι 1 s 1/3 ! s

2P v+, ζ(T +1) n,p

+ ηζ(T +1) +,r ,

1 ιv+ + ζ(T +1) n,p

2 ψ(T +1) 1

(1 ι) 1 s 1/3 s

2N O 1 log10(d)

Now we compute the third term. By the induction hypothesis,

t=0 b(t) +,r

w(t) +,r 2 log5(d)

1 ι 1 s 1/3 s

2P v+ + ζ(t) +,r

1 log5(d)η 1

1 + ι 1 + s 1/3 s

= 1 log5(d)ηT 1

1 + ι 1 + s 1/3 s

With probability at least 1 O m T poly(d) , for all t [0, T] and r in consideration,

ζ(t) +,r 2 η

2N O 1 log10(d)

t=0 b(t) +,r

1 + ι 1 + s 1/3 s

2N O 1 log10(d)

Published in Transactions on Machine Learning Research (10/2024)

Combining our calculations of the three terms from above, we find the following estimate:

w(T +1) +,r , x(T +1) n,p + b(T +1) +,r > 0

(1 ι) 1 s 1/3 s

2N O 1 log10(d)

1 + ι 1 + s 1/3 s

2N O 1 log10(d)

(1 ι) 1 s 1/3 O 1 log4(d)

On the other hand, by Theorem F.1, with probability at least 1 O mk+NP T

poly(d) , none of the (+, r) S (T ) + (v+)

can activate on x(T +1) n,p that are feature-patches dominated by v v+ or noise patches.

Combining the above observations, with probability at least 1 O mk+NP (T +1)

poly(d) , the update expressions up to time t = T + 1 can be written as follows:

w(t) +,r = 1

( n (n, p) [N] [P] : yn = +, |P(X(t) n ; v+)| > 0, p P(X(t) n ; v+) o

p P(X(0) n ;v+) {yn = +} 1

The rest of the derivations proceeds virtually the same as in the base case; we just need to rely on the concentration of binomial random variables to calculate n (n, p) [N] [P] : yn = +, |P(X(0) n ; v+)| > 0, p P(X(0) n ; v+) o = s N

1 s 1/3 (64)

which completes the proof of the expression of w(t) +,r.

Additionally, to show w(T +1) +,r w(0) +,r = w(T +1) +,r w(0) +,r (65)

we just need to note that, by the above sequence of derivations, for every (+, r) S (0) + (v+), these neurons receive exactly the same update at time t = T + 1

n=1 1{yn = +}1{|P(X(T +1) n ; v+)| > 0}[1 logit(T +1) + (X(T +1) n )] X

p P(X(T +1) n ;v+)

α(T +1) n,p v+ + ζ(T +1) n,p . (66)

2. (On-diagonal finegrained-feature neuron growth)

For point 2, the proof strategy is almost identical, the only difference is that at every iteration, the expected number of patches in which subclass features appear in is n (n, p) [N] ([P] P(X(T ) n ); v+,c) : yn = +, |P(X(T ) n ; v+,c)| > 0, p P(X(T ) n ; v+,c) o

1 s 1/3 (67)

which holds with probability at least 1 e Ω(log2(d)) for the relevant neurons.

Published in Transactions on Machine Learning Research (10/2024)

Corollary D.1.1. T0 < O η s

P 1 poly(d).

Proof. Follows from Theorem D.1.

Lemma D.2. During the time t [0, T0), for any X(t) n ,

1 logit(t) + (X(t) n ) = 1

2 O(d 1) (68)

The same holds for 1 logit(t) (X(t) n ).

Therefore, |ψ1| O(d 1) for t [0, T0).

Proof. By definition of T0, for any t [0, T0], we have F (t) c (X(t) n ) < d 1 + O (η) for all n, therefore, using Taylor approximation,

1 logit(t) + (X(t) n ) = exp(F (t) (X(t) n ))

exp(F (t) + (X(t) n )) + exp(F (t) (X(t) n )) < exp(d 1)

2 + O(d 1) (69)

The lower bound can be proven due to convexity of the exponential:

exp(F (t) (X(t) n ))

exp(F (t) + (X(t) n )) + exp(F (t) (X(t) n )) > 1

2 exp( d 1) 1

Published in Transactions on Machine Learning Research (10/2024)

E Coarse-grained SGD Phase II: Loss Convergence, Large Neuron Movement

Recall that the desired probability events in Phase I happens with probability at least 1 o(1).

In phase II, common-feature neurons start gaining large movement and drive the training loss down to o(1). We show that the desired probability events occur with probability at least 1 o(1).

We study the case of T1 poly(d), where T1 denotes the time step at the end of training.

E.1 Main results

Theorem E.1. With probability at least 1 O mk+NP T1

poly(d) , the following events take place:

1. There exists time T poly(d) such that for any t [T , poly(d)], for any n [N], the training loss L(F; X(t) n , yn) o(1).

2. (Easy sample test accuracy is nearly perfect) Given an easy test sample (Xeasy, y), for y {+1, 1} {y}, for t [T , poly(d)],

P h F (t) y (Xeasy) F (t) y (Xeasy) i o(1). (71)

3. (Hard sample test accuracy is bad) However, for all t [0, poly(d)], given a hard test sample (Xhard, y),

P h F (t) y (Xhard) F (t) y (Xhard) i Ω(1). (72)

Proof. The training loss property follows from Lemma E.3 and Lemma E.4. We can set T = T1,1 or any time beyond it (and upper bounded by poly(d)).

The test accuracy properties follow from Lemma E.8 and Lemma E.9.

Lemma E.2 (Phase II, Update Expressions). For any T1 poly(d), with probability at least 1 O m NP k+t

during t [T0, T1], for any (+, r) S (0) + (v+),

n=1 1{yn = +} exp n F (t) + (X(t) n ) o

exp(F (t) (X(t) n ))

exp F (t) (X(t) n ) F (t) + (X(t) n ) + 1 (1 s 1/3) s

1 ιv+ + ζ(t) n,p ,

(where ct n denotes the subclass index of sample X(t) n ) and for any (+, r) S (0) + (v+,c),

1 ι 1 O 1 log5(d)

s A (t) +,r S (0) + (v+) + A (t) +,c,r S (0) + (v+,c) )

n=1 1{yn = (+, c)} exp(F (t) (X(t) n ))

exp F (t) (X(t) n ) F (t) + (X(t) n ) + 1 (1 s 1/3) s

1 ιv+,c + ζ(t) n,p ,

Published in Transactions on Machine Learning Research (10/2024)

In fact, for any v {v+} {v+,c}k+ c=1, every neuron in S (0) + (v) remain activated (on v-dominated patches) and receive exactly the same updates at every iteration as shown above.

For simpler exposition, for any (+, r ) S (0) + (v+), we write A (t) +,r := w(t) +,r , v+ ; similarly for A (t) +,c,r :=

w+,r , v+,c for neurons (+, r ) S (0) + (v+,c).

Moreover, on + -class samples, the neural network response satisfies the estimate for every (+, r ) S (0) + (v+):

F (t) + (X(t) n )

1 ι 1 O 1 log5(d)

s A (t) +,r S (0) + (v+) + A (t) +,ctn,r S (0) + (v+,ctn) , (75)

The same claims hold for the class neurons (with the class signs flipped).

Proof. In this proof we focus on the neurons in S (0) + (v+); the proof for the update expressions for those in S (0) + (v+,c) are proven in virtually the same way.

Base case, t = T0.

First define A (t) +,r := w+,r , v+ , (+, r ) S (0) + (v+); similarly for A (t) +,c,r := w+,r , v+,c . Note that

the choice of r does not really matter, since we know from phase I that every neuron in S (0) + (v+) evolve at exactly the same rate, so by the end of phase I, w(T0) +,r w(T0) +,r 2 O(σ0 log(d)) w(T0) +,r 2 for any

(+, r), (+, r ) S (0) + (v+).

Let (+, r) S (0) + (v+). Similar to phase I, consider the update equation

w(t+1) +,r =w(t) +,r + η 1

1{yn = +}[1 logit(t) + (X(t) n )] X

p [P ] σ ( w(t) +,r, x(t) n,p + b(t) +,r)x(t) n,p (77)

+ 1{yn = }[ logit(t) + (X(t) n )] X

p [P ] σ ( w(t) +,r, x(t) n,p + b(t) +,r)x(t) n,p

For the on-diagonal update expression, we have

n=1 1{yn = +}[1 logit(t) + (X(t) n )] X

p [P ] σ ( w(t) +,r, x(t) n,p + b(t) +,r)x(t) n,p

n=1 1{yn = +}[1 logit(t) + (X(t) n )]

1{|P(X(t) n ; v+)| > 0}

p P(X(t) n ;v+) 1 n w(t) +,r, α(t) n,pv+ + ζ(t) n,p b(t) +,r o α(t) n,pv+ + ζ(t) n,p

p/ P(X(t) n ;v+) 1 n w(t) +,r, x(t) n,p b(t) +,r o x(t) n,p

+ 1{|P(X(t) n ; v+)| = 0} X

p [P ] 1 n w(t) +,r, x(t) n,p b(t) +,r o x(t) n,p

Published in Transactions on Machine Learning Research (10/2024)

Following from Theorem D.1 and F.1, the neurons non-activation on the patches that do not contain v+, and activation on the v+-dominated patches hold with probability at least 1 O m NP k+

poly(d) at time T0. Therefore, the above update expression reduces to

n=1 1{yn = +, |P(X(t) n ; v+)| > 0}[1 logit(t) + (X(t) n )] X

p P(X(t) n ;v+)

α(t) n,pv+ + ζ(t) n,p (80)

Note that for samples X(t) n with yn = +,

[1 logit(t) + (X(t) n )] = exp(F (t) (X(t) n ))

exp(F (t) (X(t) n )) + exp(F (t) + (X(t) n )) (81)

Now we need to estimate the network response F (t) + (X(t) n ). With probability at least 1 exp( Ω(s 1/3)), we have the upper bound (let (+, ct n) denote the subclass which sample X(t) n belongs to):

F (t) + (X(t) n )

p P(X(t);v+)

(+,r) S(0) + (v+) w(t) +,r, v+ + ζ(t) n,p + b(t) +,r

p P(X(t);v+,ctn)

(+,r) S(0) + (v+,ctn) w(t) +,r, v+,ctn + ζ(t) n,p + b(t) +,r

(1 + s 1/3)

1 + ιs 1 + O 1 log9(d)

A (t) +,r S(0) + (v+) + A (t) +,ctn,r S(0) + (v+,ctn)

The second inequality is true since maxr w(t) +,r, v+ A (t) +,r + O(σ0 log(d)), and for any (+, r) S(0) + (v+),

| w(t) +,r, ζ(t) n,p | O(1/ log9(d))A (t) +,r . The bias value is negative (and so less than 0).

To further refine the bound, we recall S (0) + (v) / S (0) + (v ) , S (0) + (v) / S(0) + (v ) = 1 O(1/ log5(d)).

Therefore, we obtain the bound

F (t) + (X(t) n ) (1 + s 1/3)

1 + ιs 1 + O 1 log5(d)

1 + O 1 log5(d)

A (t) +,r S (0) + (v+) + A (t) +,ctn,r S (0) + (v+,ctn) (83)

Following a similar argument, we also have the lower bound

F (t) + (X(t) n )

p P(X(t);v+)

(+,r) S (0) + (v+) σ w(t) +,r, v+ + ζ(t) n,p + b(t) +,r

p P(X(t);v+,ctn)

(+,r) S (0) + (v+,ctn) σ w(t) +,r, v+,ctn + ζ(t) n,p + b(t) +,r

1 ιs 1 O 1 log5(d)

1 O 1 log5(d)

A (t) +,r S (0) + (v+) + A (t) +,ctn,r S (0) + (v+,ct n)

Published in Transactions on Machine Learning Research (10/2024)

The neurons in S (0) + (v+) have to activate, therefore they serve a key role in the lower bound, the bias bound for them is simply A (t) +,r Θ(1/ log5(d)); the neurons in S(0) + (v+,c) contribute at least 0 due to the Re LU

activation; the rest of the neurons do not activate. The same reasoning holds for the S (0) + (v+,c).

Knowing that neurons in S (0) + (v+) cannot activate on the patches in samples belonging to the class, now we may write the update expression for every (+, r) S (t) + (v+) as (their updates are identical, same as in phase I):

n=1 1{yn = +}[1 logit(t) + (X(t) n )] X

p [P ] σ ( w(t) +,r, x(t) n,p + b(t) +,r)x(t) n,p

n=1 1{yn = +, |P(X(t) n ; v+)| > 0} exp( F (t) + (X(t) n ))

exp(F (t) (X(t) n ))

exp F (t) (X(t) n ) exp(F (t) + (X(t) n )) + 1

p P(X(t) n ;v+)

α(t) n,pv+ + ζ(t) n,p

n=1 1{yn = +} exp

(1 + s 1/3)

1 + ιs 1 + O 1 log5(d)

A (t) +,r S (0) + (v+) + A (t) +,ctn,r S (0) + (v+,ctn) )

exp(F (t) (X(t) n ))

exp F (t) (X(t) n ) F (t) + (X(t) n ) + 1 (1 s 1/3) s

1 ιv+ + ζ(t) n,p

This concludes the proof of the base case.

Induction step. Assume the statements hold for time period [T0, t], prove for time t + 1.

At step t + 1, based on the induction hypothesis, we know that with probability at least 1 O m NP k+t

during time τ [T0, t], for any (+, r) S (0) + (v+),

n=1 1{yn = +} exp

(1 + s 1/3)

1 + ιs 1 + O 1 log5(d)

A (τ) +,r S (0) + (v+) + A (τ) +,ctn,r S (0) + (v+,ctn) )

exp(F (τ) (X(τ) n ))

exp F (τ) (X(τ) n ) exp(F (τ) + (X(τ) n )) + 1 (1 s 1/3) s

1 ιv+ + ζ(τ) n,p

Published in Transactions on Machine Learning Research (10/2024)

and for the bias,

η 1 log5(d)

n=1 1{yn = +} exp

(1 + s 1/3)

1 + ιs 1 + O 1 log5(d)

A (τ) +,r S (0) + (v+) + A (τ) +,ctn,r S (0) + (v+,ctn) )

(1 s 1/3) s

1 ι 1 log10(d)

exp(F (τ) (X(τ) n ))

exp F (τ) (X(τ) n ) exp(F (τ) + (X(τ) n )) + 1

Conditioning on the high-probability events of the induction hypothesis,

n=1 1{yn = +} exp

(1 + s 1/3)

1 + ιs 1 + O 1 log5(d)

A (τ) +,r S (0) + (v+) + A (τ) +,ct n,r S (0) + (v+,ct n) )

exp(F (τ) (X(τ) n ))

exp F (τ) (X(τ) n ) exp(F (τ) + (X(τ) n )) + 1 (1 s 1/3) s

1 ιv+ + ζ(τ) n,p

It follows that, with probability at least 1 O m NP poly(d) , for all v+-dominated patch x(t+1) n,p ,

Published in Transactions on Machine Learning Research (10/2024)

w(t+1) +,r , x(t+1) n,p + b(t+1) +,r

= w(T0) +,r ,

1 ιv+ + ζ(t+1) n,p + b(T0) +,r

n=1 1{yn = +} exp

(1 + s 1/3)

1 + ιs 1 + O 1 log5(d)

A (τ) +,r S (0) + (v+) + A (τ) +,ctn,r S (0) + (v+,ctn) )

exp(F (τ) (X(τ) n ))

exp F (τ) (X(τ) n ) F (τ) + (X(τ) n ) + 1 (1 s 1/3) s

1 ιv+ + ζ(τ) n,p,

1 ιv+ + ζ(t+1) n,p + b(τ) +,r 0

n=1 1{yn = +} exp

(1 + s 1/3)

1 + ιs 1 + O 1 log5(d)

A (τ) +,r S (0) + (v+) + A (τ) +,ctn,r S (0) + (v+,ct n) )

exp(F (τ) (X(τ) n ))

exp F (τ) (X(τ) n ) F (τ) + (X(τ) n ) + 1 (1 s 1/3) s

1 ι O 1 log5(d)

Therefore the neurons (+, r) S (0) + (v+) activate on the v+-dominated patches x(t+1) n,p . We also know that they cannot activate on patches that are not dominated by v+ by Theorem F.1. Following a similar derivation to the base case, we arrive at the result that, conditioning on the events of the induction hypothesis, with probability at least 1 O m NP k+

poly(d) , for all (+, r) S (0) + (v+),

n=1 1{yn = +} exp

(1 + s 1/3)

1 + ιs 1 + O 1 log5(d)

A (t+1) +,r S (0) + (v+) + A (t+1) +,ctn,r S (0) + (v+,ctn) )

(1 s 1/3) s

NP exp(F (t+1) (X(t+1) n ))

exp F (t+1) (X(t+1) n ) F (t+1) + (X(t+1) n ) + 1

1 ιv+ + ζ(t+1) n,p

Published in Transactions on Machine Learning Research (10/2024)

Consequently, with probability at least 1 O m NP poly(d) ,

n=1 1{yn = +}η exp

(1 + s 1/3)

1 + ιs 1 + O 1 log5(d)

A (t+1) +,r S (0) + (v+) + A (t+1) +,ctn,r S (0) + (v+,ctn) )

exp(F (t+1) (X(t+1) n ))

exp F (t+1) (X(t+1) n ) F (t+1) + (X(t+1) n ) + 1 (1 s 1/3) s

1 ι O 1 log9(d)

Utilizing the definition of conditional probability, we conclude that the expressions for w(τ) +,r and b(t+1) +,r are

indeed as described in the theorem during time τ [T0, t + 1] with probability at least 1 O m NP k+t

poly(d) 1 O m NP k+

poly(d) 1 O m NP k+(t+1)

Moreover, based on the expression of w(τ) +,r and b(t+1) +,r , following virtually the same argument as in the

base case, we can estimate the network output for any (X(t+1) n , yn = +):

F (t+1) + (X(t+1) n ) =(1 s 1/3)

1 ι 1 O 1 log5(d)

A (t) +,r S (0) + (v+) + A (t) +,ctn,r S (0) + (v+,ctn) (92)

Lemma E.3. Define time T1,1 to be the first point in time which the following identity holds on all X(t) n belonging to the + class:

exp(F (t) (X(t) n ))

exp(F (t) (X(t) n ) F (t) + (X(t) n )) + 1 1 O 1 log5(d)

Then T1,1 poly(d), and for all t [T1,1, T1], the above holds. The following also holds for this time period:

[1 logit(t) + (X(t) n )] O 1 log5(d)

The same results also hold with the class signs flipped.

Proof. We first note that, the training loss [1 logit(t) + (X(t) n )] on samples belonging to the + class at any time during t [T0, T1] is, asymptotically speaking, monotonically decreasing from 1

2 O(d 1). This can

be easily proven by observing the way s A (t) +,r S (0) + (v+) + A (t) +,c,r S (0) + (v+,c) monotonically increases

from the proof of Lemma E.2: before F (t) + (X(t) n ) log log5(d) on all X(t) n belonging to the + class, there

Published in Transactions on Machine Learning Research (10/2024)

must be some samples X(t) n on which

[1 logit(t) + (X(t) n )] = exp(F (t) (X(t) n ))

exp(F (t) (X(t) n )) + exp(F (t) + (X(t) n ))

1 O(σ0 log(d)s dc0) 1 + O(σ0 log(d)s dc0) + log5(d)

Ω 1 log5(d)

Therefore, by the update expressions in the proof of Lemma E.2, F (t) + (X(t) n ) can reach log log5(d) in time at

most O NP log5(d)

ηs poly(d) (in the worst case scenario). At time T1,1 and beyond,

1 exp(F (t) (X(t) n ))

exp(F (t) (X(t) n ) F (t) + (X(t) n )) + 1 1 exp(1 O(σ0dc0s )) exp(1 + O(σ0dc0s )) 1 log5(d) + 1

O 1 log5(d)

Lemma E.4. Denote C = η s 2k+P , and write (for any c [k+])

Ac(t) = s A (t) +,r S (0) + (v+) + A (t) +,c,r S (0) + (v+,c) (97)

(see Lemma E.2 for definition of A (t) ). Define tc,0 = exp(Ac(T1,1)). We write A(t) and t0 below for cleaner notations.

Then with probability at least 1 o(1), during t [T1,1, T1],

A(t) = log(C(t T1,1) + t0) + E(t) (98)

where |E(t)| O 1 log4(d) Pt T1,1+C 1t0 τ=C 1t0 1 τ O log(t) log(C 1t0)

The same results also hold with the class signs flipped.

Proof. Sidenote: To make the writing a bit cleaner, we assume in the proof below that C 1t0 is an integer. The general case is easy to extend to by observing that 1 t T1,1+ C 1t0 1 t T1,1+C 1t0

1 (t T1,1+ C 1t0 )(t T1,1+C 1t0), which can be absorbed into the error term at every iteration since

1 t T1,1+ C 1t0 1 log4(d) due to C 1t0 Ω(σ 1 0 /(polylog(d)dc0)) d log4(d).

Based on result from Lemmas E.2 and E.3, as long as A(t) O (log(d)), we know during time t [T1,1, T1] the update rule for A(t) is as follows:

A(t + 1) A(t) =C exp (1 s 1/3)

1 ι 1 O 1 log5(d)

1 O 1 log5(d)

1 ι 1 log10(d)

=C exp { A(t)} exp O 1 log4(d)

1 O 1 log5(d)

=C exp { A(t)} 1 C1 log4(d)

where we write C1 in place of O( ) for a more concrete update expression.

Published in Transactions on Machine Learning Research (10/2024)

The base case t = T1,1 is trivially true.

We proceed with the induction step. Assume the hypothesis true for t [T1,1, T], prove for t + 1 = T + 1.

Note that by Lemma E.10,

A(t + 1) = log(C(t T1,1) + t0) + E(t)

+ C exp { log(C(t T1,1) + t0) E(t))} 1 C1 log4(d)

= log(C) + log(t T1,1 + C 1t0) + E(t)

+ C 1 C(t T1,1) + t0

1 E(t) O(E(t)2) 1 C1 log4(d)

t T1,1+C 1t0 1 X

2 1 t T1,1 + C 1t0 + 0, 1

8 1 (t T1,1 + C 1t0)2

+ 1 t T1,1 + C 1t0 C1 log4(d) 1 t T1,1 + C 1t0

+ E(t) + 1 t T1,1 + C 1t0

E(t) O(E(t)2) 1 C1 log4(d)

t T1,1+C 1t0 X

2 1 t T1,1 + C 1t0 + 0, 1

8 1 (t T1,1 + C 1t0)2

C1 log4(d) 1 t T1,1 + C 1t0

+ E(t) + 1 t T1,1 + C 1t0

E(t) O(E(t)2) 1 C1 log4(d)

Invoking Lemma E.10 again,

A(t + 1) = log(C) + log(t + 1 T1,1 + C 1t0)

2 1 t + 1 T1,1 + C 1t0 + 1

2 1 t T1,1 + C 1t0

8 1 (t + 1 T1,1 + C 1t0)2 , 0 + 0, 1

8 1 (t T1,1 + C 1t0)2

C1 log4(d) 1 t T1,1 + C 1t0

+ E(t) + 1 t T1,1 + C 1t0

E(t) O(E(t)2) 1 C1 log4(d)

= log(C(t + 1 T1,1) + t0)

2 1 (t + 1 T1,1 + C 1t0)(t T1,1 + C 1t0) O 1 (t + 1 T1,1 + C 1t0)2

C1 log4(d) 1 t T1,1 + C 1t0

+ E(t) + 1 t T1,1 + C 1t0

E(t) O(E(t)2) 1 C1 log4(d)

Published in Transactions on Machine Learning Research (10/2024)

To further refine the expression, first note that the error passed down from the previous step t does not grow in this step (in fact it slightly decreases): E(t) + 1 t T1,1 + C 1t0

E(t) O(E(t)2) 1 C1 log4(d)

O 1 log4(d)

t T1,1+C 1t0 X

Moreover, notice that at step t + 1, since 1 t+1 T1,1+C 1t0 1 log4(d), the error term |E(t + 1)| = |A(t + 1)

log(C(t + 1 T1,1) + t0)| O 1 log4(d) Pt+1 T1,1+C 1t0 τ=C 1t0 1 τ , which finishes the inductive step.

Lemma E.5. With probability at least 1 O m NP k+T1

poly(d) , for all t [0, T1], all c [k+],

A (t) +,c,r

A (t) +,r = Θ 1

A (t) +,c,r

A (t) +,r = Θ 1

The same identity holds for the -classes.

Proof. The statements in the lemma follow trivially from Theorem D.1 for time period [0, T0]. Let us focus on the phase [T0, T1].

In this proof, we condition on the high-probability events of Lemma E.4 and Lemma E.2.

First of all, based on Lemma E.4, we know that s A (t) +,r S (0) + (v+) O(log(d)). We will make use of this fact later.

Base case, t = T0.

The base case directly follows from our Theorem D.1.

Induction step, assume statement holds for τ [T0, t], prove statement for t + 1.

By Lemma E.2, we know that

1 ιs 1 O 1 log5(d)

A (t) +,r S (0) + (v+) + A (t) +,c,r S (0) + (v+,c) )

[1/3, 1](1 s 1/3) s

1 ι O 1 log9(d)

(104) and for any c [k+],

A (t) +,c,r

1 ιs 1 O 1 log5(d)

A (t) +,r S (0) + (v+) + A (t) +,c,r S (0) + (v+,c) )

[1/3, 1](1 s 1/3) s

1 ι O 1 log9(d)

Published in Transactions on Machine Learning Research (10/2024)

Relying on the induction hypothesis, we can reduce the above expressions to

1 ι 1 O 1 log5(d)

s A (t) +,r S (0) + (v+)

[1/3, 1](1 s 1/3) s

1 ι O 1 log9(d)

1 ι 1 O 1 log5(d)

s A (t) +,r S (0) + (v+)

and for any c [k+],

A (t) +,c,r

1 ι 1 O 1 log5(d)

s A (t) +,r S (0) + (v+)

By invoking the property that s A (t) +,r S (0) + (v+) O(log(d)), we find that for all c [k+],

A (t) +,c,r

A (t) +,r = exp

O 1 log5(d)

s A (t) +,r S (0) + (v+)

= 1 O 1 log4(d)

Therefore, we can finish our induction step:

A (t+1) +,c,r

A (t+1) +,r = A (t) +,c,r + A (t) +,c,r

A (t) +,r + A (t) +,r = A (t) +,c,r + A (t) +,c,r

Θ (k+) A (t) +,c,r + A (t) +,c,r = Θ 1

Lemma E.6. Let TΩ(1) be the first point in time such that either s A (t) +,r S (0) + (v+) Ω(1) or

s A (t) ,r S (0) (v ) Ω(1). Then for any t < TΩ(1),

A (t) +,r = Θ(1) (110)

and for any t [TΩ(1), T1],

A (t) +,r , A (t) +,r

A (t) ,r Ω 1 log(d)

Published in Transactions on Machine Learning Research (10/2024)

Proof. This lemma is a consequence of Theorem D.1, Lemma E.2 and Lemma E.4.

Due to Theorem D.1, we already know that A (t) ,r

A (t) +,r = Θ(1) up to time T0. In addition, with Lemma E.2 we

know that before s A (t) +,r S (0) + (v+) Ω(1), the loss term (on a +-class sample) 1 logit(t) + (X(t) n ) = Θ(1)

(the same holds with the class signs flipped), in which case it is also easy to derive A (t) ,r

A (t) +,r = Θ(1) by noting

that the update expressions A (t) ,r / A (t) +,r = Θ(1).

Beyond time TΩ(1), by Lemma E.4, we know that s A (t) +,r S (0) + (v+) , s A (t) ,r S (0) (v ) O(log(d)). With

the understanding that s A (t) +,r S (0) + (v+) , s A (t) ,r S (0) (v ) Ω(1) beyond TΩ(1) due to the monotonicity

of these functions, and the property |S (0) (v )|

|S (0) + (v+)| 1 O 1 log5(d) from Proposition 1, the rest of the lemma

Lemma E.7. With probability at least 1 O m NP k+t

poly(d) , for all t [0, T1] and all (+, r) S (0) + (v+),

b(t) +,r A(t) +,r = Θ 1 log5(d)

The same holds with the +-class signs replaced by the -class signs.

Proof. Choose any (+, r) S (0) + (v+).

The statement in this lemma for time period t [0, T0] follows easily from Theorem D.1 and its proof. Let us examine the period t [T0, T1].

Based on Lemma E.2 and its proof and Lemma E.5, we know that for t [T0, T1], with probability at least

1 O m NP k+t

1 ι 1 O 1 log5(d)

s A (t) +,r S (0) + (v+)

(1 s 1/3) s

1 ι O 1 log9(d)

n=1 1{yn = +} exp(F (t) (X(t) n ))

exp F (t) (X(t) n ) F (t) + (X(t) n ) + 1

Furthermore,

= w(t) +,r 2 log5(d)

= η 1 log5(d) exp

1 ι 1 O 1 log5(d)

s A (t) +,r S (0) + (v+)

(1 s 1/3) s

1 ι 1 log9(d)

n=1 1{yn = +} exp(F (t) (X(t) n ))

exp F (t) (X(t) n ) F (t) + (X(t) n ) + 1

With the understanding that s A (t) +,r S (0) + (v+) O(log(d)) from Lemma E.4 and the fact that

exp(F (t) (X(t) n ))

exp F (t) (X(t) n ) F (t) + (X(t) n ) +1 = Θ(1), we have

Published in Transactions on Machine Learning Research (10/2024)

b(t) +,r A(t) +,r = Θ 1 log5(d)

1 O 1 log5(d)

s A (t) +,r S (0) + (v+)

= Θ 1 log5(d)

1 O 1 log4(d)

= Θ 1 log5(d)

Lemma E.8 (Probability of mistake on hard samples is high). For all t [0, T1], given a hard test sample (Xhard, y), y = y,

P h F (T ) y (Xhard) F (T ) y (Xhard) i Ω(1). (116)

Proof. We first show that at time t = 0, the probability of the network making a mistake on hard test samples is Ω(1), then prove that for the rest of the time, i.e. t (0, T1], the model still makes mistake on hard test samples with probability Ω(1).

At time t = 0, by Lemma H.3, we know that for any r [m], with probability Ω(1),

w(0) +,r, ζ Ω(σ0σζ

d) Ω(σ0polylog(d)) Ω σ0 p

log(d) . (117)

Relying on concentration of the binomial random variable, with probability at least 1 e Ω(polylog(d)),

r=1 σ w(0) +,r, ζ + b(0) +,r Ω(mσ0σζ

which is asymptotically larger than the activation from the features, which, following from Proposition 1, is upper bounded by O σ0 p

log(d)s dc0 . The same can be said for the class. In other words,

F (0) (Xhard) F (0) + (Xhard) > 0

r=1 1{ w(0) ,r, ζ + b(0) ,r > 0} w(0) ,r, ζ

r=1 1{ w(0) +,r, ζ + b(0) +,r > 0} w(0) +,r, ζ

(1 o(1)) > 0

which clearly holds with probability Ω(1).

Now consider t (0, T1].

During this period of time, by Theorem D.1 and Lemma E.2, we note that for any c [k+] and (+, r)

S (0) + (v+,c), ζ(t) +,r N(0, σ(t)2 ζ+,r Id), with σ(t) ζ+,r = Θ A(t) +,c,r q

. The same can be said for (+, r)

S (0) + (v+), although with the A(t) +,c,r q

2k+ s N factor replaced by A(t) +,r q

2 s N . Also from the proofs of Theorem

D.1 and Lemma E.2, and using the property |U(0) +,r| O(1) from Proposition 1, we know that for all neurons,

the updates to the neurons also take the feature-plus-Gaussian-noise form of P

v U(0) +,r c(t)(v )v + ζ(t) +,r, with

c(t)(v ) 1 + O 1 log5(d) A(t) +,c,r if v = v+,c for some c [k+], or c(t)(v ) 1 + O 1 log5(d) A(t) +,r if

v = v+ (because the v component of a v -singleton neuron s update is already the maximum possible).

Published in Transactions on Machine Learning Research (10/2024)

Moreover, if v+ U(0) +,r, then σ(t) ζ+,r O A(t) +,r q

2 s N σζ + O A(t) +,c,r q

O A(t) +,r q

otherwise, if U(0) +,r only contains the fine-grained features, then σ(t) ζ+,r O A(t) +,c,r q

With the understanding that only neurons in S(0) y (vy) and S(0) y (vy,c) can possibly activate on the feature patches of a sample when t T1 (coming from Theorem F.1), we have

F (t) + (Xhard) X

(+,r) S(0) + (v+,c)

p P(Xhard;v+,c) σ

τ=0 w(τ) +,r,

1 ιv+,c + ζp + b(t) +,r

τ=0 w(τ) +,r, ζ + b(t) +,r

(+,r) S(0) + (v )

p P(Xhard;v ) σ

τ=0 w(τ) +,r, α pv + ζp + b(t) +,r

To further refine this upper bound, we first note that with probability at least 1 O m NP k+t

poly(d) , the following

holds with arbitrary choice of (+, r ) S(0) + (v+,c):

(+,r) S(0) + (v+,c)

p P(Xhard;v+,c)

τ=0 w(τ) +,r,

1 ιv+,c + ζp O

s S(0) + (v+,c)

τ=0 A(τ) +,c,r

Invoking Lemma E.5, we obtain (for arbitrary (+, r ) S(0) + (v+)):

(+,r) S(0) + (v+,c)

p P(Xhard;v+,c)

τ=0 w(τ) +,r,

1 ιv+,c + ζp O

1 k+ s S(0) + (v+,c)

τ=0 A(τ) +,r

Let us examine the term P

r [m] σ w(0) +,r + Pt 1 τ=0 w(τ) +,r, ζ + b(t) +,r more carefully. First of all, denoting

S(0) + = k+ c=1S(0) + (v+,c) k c=1S(0) + (v ,c) S(0) + (v+) S(0) + (v ), neurons (+, r) / S(0) + cannot receive any update at all during training due to Theorem F.1. Therefore we can rewrite the term

τ=0 w(τ) +,r, ζ + b(t) +,r

(+,r) S(0) +

τ=0 w(τ) +,r, ζ + b(t) +,r

(+,r)/ S(0) +

σ w(0) +,r, ζ + b(0) +,r (123)

Relying on Corollary F.1.1, we know

τ=0 b(τ) +,r <

τ=0 Ω polylog(d)

w(τ) +,r, ζ . (124)

Therefore, we know that for r [m],

τ=0 w(τ) +,r, ζ + b(τ) +,r 0 (125)

Published in Transactions on Machine Learning Research (10/2024)

As a consequence, we can write the naive upper bound

τ=0 w(τ) +,r, ζ + b(t) +,r

(+,r) S(0) +

σ w(0) +,r, ζ + b(0) +,r + X

(+,r)/ S(0) +

σ w(0) +,r, ζ + b(0) +,r

r [m] σ w(0) +,r, ζ + b(0) +,r

Additionally, due to Theorem F.1 (and its proof), we know that

(+,r) S(0) + (v )

p P(Xhard;v ) σ

τ=0 w(τ) +,r, α pv + ζp + b(t) +,r

(+,r) S(0) + (v )

p P(Xhard;v ) σ w(0) +,r, α pv + ζp + b(0) +,r (127)

It follows that

F (t) + (Xhard)

1 k+ s S(0) + (v+,c)

τ=0 A(τ) +,r

(+,r) S(0) + (v+,c)

p P(Xhard;v+,c)

1 ιv+,c + ζp

r [m] σ w(0) +,r, ζ + b(0) +,r + X

(+,r) S(0) + (v )

p P(Xhard;v ) σ w(0) +,r, α pv + ζp + b(0) +,r (128)

On the other hand, for the neurons, denoting S(0) = k+ c=1S(0) (v+,c) k c=1S(0) (v ,c) S(0) (v+) S(0) (v ),

F (t) (Xhard) X

(+,r) S (0) (v )

p P(Xhard;v ) σ

τ=0 w(τ) ,r, α pv + ζp + b(t) +,r

(+,r)/ S(0)

σ w(0) ,r, ζ + b(0) +,r , (129)

note that the last line is true because neurons outside the set S(0) cannot receive any update during training

with probability at least 1 O m NP k+t

poly(d) due to Theorem F.1. Estimating the activation value of the neurons

from S (0) (v ) on the feature noise patches requires some care. We define time t to be the first point in time such that any ( , r ) S (0) (v ) satisfies Pt τ=0 A(τ) ,r σ0 log5(d), and beyond this point in time,

i.e. for t [t , T1], the neurons in S (0) (v ) have to activate with high probability, since

τ=0 w(τ) ,r, α pv + ζp + b(t) +,r 1 O 1 log5(d)

σ0 log5(d)/ log4(d) O(σ0 p

Now we can proceed to prove the lemma for t (0, T1] by combining the above estimates for F (t) + (Xhard) and F (t) (Xhard).

Published in Transactions on Machine Learning Research (10/2024)

For t (0, t ], relying argument similar to the situation of t = 0 and the fact that m |S(0) | = (1 o(1))m, ( X

(+,r)/ S(0)

1{ w(0) ,r, ζ + b(0) ,r > 0} w(0) ,r, ζ

r=1 1{ w(0) +,r, ζ + b(0) +,r > 0} w(0) +,r, ζ

(1 o(1)) > 0

= F (t) (Xhard) F (t) + (Xhard) > 0

which has to be true with probability Ω(1).

On the other hand, with t (t , T1], we have

F (t) (Xhard) F (t) + (Xhard)

1 O 1 log5(d)

s |S (0) (v )| A(τ) ,r O(σ0 p

1 k+ s S(0) + (v+,c)

τ=0 A(τ) +,r

(+,r)/ S(0)

σ w(0) ,r, ζ + b(0) +,r X

(+,r) S(0) + (v+,c)

p P(Xhard;v+,c)

1 ιv+,c + ζp

r [m] σ w(0) +,r, ζ + b(0) +,r X

(+,r) S(0) + (v )

p P(Xhard;v ) σ w(0) +,r, α pv + ζp + b(0) +,r )

Let us begin analyzing the first { } bracket.

By Proposition 1 we know that S (0) (v ) = (1 O(1/ log5(d))) S(0) + (v+,c) , and by Lemma E.5, we know

that A(τ) +,r O(log(d) A(τ) ,r ), therefore,

1 k+ s S(0) + (v+,c)

τ=0 A(τ) +,r

k+ s S (0) (v )

τ=0 A(τ) ,r

1 O 1 log5(d)

s |S (0) (v )| A(τ) ,r O(σ0 p

Therefore, we obtained the simpler lower bound

F (t) (Xhard) F (t) + (Xhard)

(+,r)/ S(0)

σ w(0) ,r, ζ + b(0) +,r X

(+,r) S(0) + (v+,c)

p P(Xhard;v+,c)

1 ιv+,c + ζp

r [m] σ w(0) +,r, ζ + b(0) +,r X

(+,r) S(0) + (v )

p P(Xhard;v ) σ w(0) +,r, α pv + ζp + b(0) +,r ) (134)

which is greater than 0 with probability Ω(1) (by relying on an argument almost identical to the t = 0 case again, and noting that m |S(0) | = (1 o(1))m). This concludes the proof.

Published in Transactions on Machine Learning Research (10/2024)

Lemma E.9 (Probability of mistake on easy samples is low after training). For t [T1,1, T1], given an easy test sample (Xeasy, y),

P h F (T ) y (Xeasy) F (T ) y (Xeasy) i o(1). (135)

Proof. Without loss of generality, assume the true label of Xeasy is +1. Assume t T1,1.

Firstly, conditioning on the events of Theorem F.1, the following upper bound on F (t) (Xeasy) holds with

probability at least 1 O m poly(d) :

F (t) (Xeasy) = X

( ,r) S(0) (v+)

p P(Xeasy;v+) σ w(t) ,r,

1 ιv+ + ζp + b(t) ,r

( ,r) S(0) (v+,c)

p P(Xeasy;v+,c) σ w(t) ,r,

1 ιv+,c + ζp + b(t) ,r

( ,r) S(0) (v+)

p P(Xeasy;v+) σ w(0) ,r,

1 ιv+ + ζp + b(0) ,r

( ,r) S(0) (v+,c)

p P(Xeasy;v+,c) σ w(0) ,r,

1 ιv+,c + ζp + b(0) ,r

<O (s dc0σ0)

and on the other hand,

F (t) + (Xeasy) X

(+,r) S (0) + (v+)

p P(Xeasy;v+) σ w(t) +,r,

1 ιv+ + ζp + b(t) +,r

(+,r) S (0) + (v+,c)

p P(Xeasy;v+,c) σ w(t) +,r,

1 ιv+,c + ζp + b(t) +,r

Therefore, F (t) + (Xeasy) F (t) (Xeasy), which completes the proof.

Lemma E.10 (Jr. & John W. Wrench (1971)). The partial sum of harmonic series satisfies the following identity: n 1 X

1 k = log(n) + E 1

2n ϵn (138)

where E is the Euler Mascheroni constant (approximately 0.58), and ϵn [0, 1/8n2].

Published in Transactions on Machine Learning Research (10/2024)

F Coarse-grained SGD, Poly-time properties

In this section, set Te poly(d).

Please note that we are performing stochastic gradient descent on easy samples only.

Theorem F.1. Fix any t [0, Te].

1. (Non-activation invariance) For any τ t, with probability at least 1 O mk+NP t

poly(d) , any feature

v {v+,c}k+ c=1 {v ,c}k c=1 {v+, v }, any t t, (+, r) / S(0) + (v) and v-dominated patch sample x(τ) n,p = α(τ) n,pv + ζ(τ) n,p, the following holds:

σ w(t ) +,r, x(τ) n,p + b(t ) +,r = 0 (139)

2. (Non-activation on noise patches) For any τ t, with probability at least 1 O m NP t poly(d) , for every

t t, r [m] and noise patch x(τ) n,p = ζ(τ) n,p, the following holds:

σ w(t ) +,r, x(τ) n,p + b(t ) +,r = 0 (140)

3. (Off-diagonal nonpositive growth) For any τ t, with probability at least 1 O mk+NP t

poly(d) , for

any t t, any feature v {v ,c}k c=1 {v }, any (+, r) S(0) + (v) and v-dominated patch x(τ) n,p =

α(τ) n,pv + ζ(τ) n,p, σ w(t ) +,r, x(τ) n,p + b(t ) +,r σ w(0) +,r, x(τ) n,p + b(0) +,r .

Proof. Base case t = 0.

1. (Nonactivation invariance)

Choose any τ 0, v from the set {v+,c}k+ c=1 {v ,c}k c=1 {v+, v }. We will work with neuron sets in the + class in this proof; the -class case can be handled in the same way.

First, we need to show that, for every n such that |P(X(τ) n ; v )| > 0 and p P(X(τ) n ; v ), for every (+, r) neuron index,

w(0) +,r, v < σ0 4 + 2c0

log(d) 1 log5(d) = σ w(0) +,r, x(τ) n,p + b(0) +,r = 0 (141)

This is indeed true. The following holds with probability at least 1 O m NP poly(d) for all (+, r) / S(0) + (v) and

all such x(τ) n,p:

w(0) +,r, x(τ) n,p + b(0) +,r σ0

(4 + 2c0)(log(d) 1/ log5(d)) + O σ0 log9(d)

(4 + 2c0)(1 + ι)(log(d) 1/ log5(d)) (4 + 2c0) log(d) q

(4 + 2c0)(log(d) 1/ log5(d)) + 4 + 2c0 p

log(d) + O 1 log9(d)

(4 + 2c0)ι log(d) (1 + ι)/ log5(d) q

(4 + 2c0)(log(d) 1/ log5(d)) + 4 + 2c0 p

log(d) + O 1 log9(d)

The first equality holds by utilizing the identity a b = a2 b2

a+b . As a consequence, σ( w(0) +,r, x(τ) n,p + b(0) +,r) = 0.

Published in Transactions on Machine Learning Research (10/2024)

2. (Non-activation on noise patches) Invoking Lemma H.3, for any τ 0, with probability at least

1 O m NP poly(d) , we have for all possible choices of r [m] and the noise patches x(τ) n,p = ζ(τ) n,p:

w(0) +,r, ζ(τ) n,p O(σ0σζ p

d log(d)) O σ0 log9(d)

b(0) +,r. (143)

Therefore, no neuron can activate on the noise patches at time t = 0.

3. (Off-diagonal nonpositive growth) This point is trivially true at t = 0.

Inductive step: we assume the induction hypothesis for t [0, T] (with T < Te of course), and prove the statements for t = T + 1.

1. (Nonactivation invariance)

Choose any v from the set {v+,c}k+ c=1 {v ,c}k c=1 {v+, v }. We will work with neuron sets in the + class in this proof; the -class case can be handled in the same way.

We need to prove that given τ T + 1, with probability at least 1 O m NP (T +1)

poly(d) , for every t T + 1,

(+, r) neuron index and v -dominated patch x(τ) n,p,

(+, r) / S(0) + (v ) = σ w(t ) +,r, x(τ) n,p + b(t ) +,r = 0. (144)

Conditioning on the (high-probability) event of the induction hypothesis of point 1., the following is already true on all the v -dominated patches at time t T:

(+, r) / S(0) + (v ) = σ w(t ) +,r, x(T ) n,p + b(t ) +,r = 0. (145)

In particular, σ w(T ) +,r, x(T ) n,p + b(T ) +,r = 0.

In other words, no (+, r) / S(0) + (v ) can be updated on the v -dominated patches at time t = T. Furthermore, the induction hypothesis of point 2. also states that the network cannot activate on any noise patch x(T ) n,p = ζ(T ) n,p with probability at least 1 O m NP T poly(d) . Therefore, the neuron update for those (+, r) / S(0) + (v ) takes the form

w(T ) +,r = η

n=1 1{|P(X(T ) n ; v)| > 0}[1{yn = +} logit(T ) + (X(T ) n )]

p P(X(T ) n ;v) 1{ w(T ) +,r, α(T ) n,pv + ζ(T ) n,p + b(T ) c,r > 0} α(T ) n,pv + ζ(T ) n,p (146)

Now we can invoke Lemma F.2 and obtain that, with probability at least 1 O m NP poly(d) , the following holds for all relevant neurons and v -dominated patches:

w(T ) +,r, x(τ) n,p + b(T ) +,r < 0. (147)

In conclusion, with τ T + 1, with probability at least 1 O m NP poly(d) , for every (+, r) / S(0) + (v ) and

relevant (n, p) s,

w(T ) +,r + w(T ) +,r, x(τ) n,p + b(T ) +,r + b(T ) +,r = w(T +1) +,r , x(τ) n,p + b(T +1) +,r < 0, (148)

Published in Transactions on Machine Learning Research (10/2024)

which leads to w(t ) +,r, x(τ) n,p + b(t ) +,r < 0 for all t T + 1 with probability at least 1 O mk+NP (T +1)

poly(d) (also

taking union bound over all the possible choices of v ). This finishes the inductive step for point 1.

2. (Non-activation on noise patches)

Relying on the event of the induction hypothesis, for any τ T, the following holds for every r [m] and noise patch x(τ) n,p = ζ(τ) n,p, w(T ) +,r, x(τ) n,p + b(T ) +,r < 0. (149)

Conditioning on this high-probability event, this means no neuron w(T ) +,r can be updated on the noise patches.

Denoting the set of features M = {v+,c}k+ c=1 {v ,c}k c=1 {v+, v }, for every r [m], its update is reduced to

w(T ) +,r = η

n=1 1{|P(X(T ) n ; v)| > 0}[1{yn = +} logit(T ) + (X(T ) n )]

p P(X(T ) n ;v) 1{ w(T ) +,r, α(T ) n,pv + ζ(T ) n,p + b(T ) c,r > 0} α(T ) n,pv + ζ(T ) n,p , (150)

Invoking Lemma F.3, we have that, for any τ T + 1, the following inequality holds with probability at least 1 O m NP poly(d) for every r [m] and noise patches,

w(T ) +,r, x(τ) n,p + b(T ) +,r < 0. (151)

Consequently, for any τ T + 1, the following inequality holds with probability at least 1 O m NP poly(d) for

every r [m] and noise patches x(τ) n,p = ζ(τ) n,p:

w(T ) +,r + w(T ) +,r, x(τ) n,p + b(T ) +,r + b(T ) +,r = w(T +1) +,r , x(τ) n,p + b(T +1) +,r < 0. (152)

This finishes the inductive step for point 2.

3. (Off-diagonal nonpositive growth) Choose any v {v } {v ,c}k c=1.

Choose any neuron with index (+, r). Similar to our proof for point 2., we know that its update, when taken inner product with a v -dominated patch x(τ) n,p = 1 ιv + ζ(τ) n,p, has to take the form

1 ιv + ζ(τ) n,p

n=1 1{|P(X(T ) n ; v)| > 0}[1{yn = +} logit(T ) + (X(T ) n )]

p P(X(T ) n ;v) 1{ w(T ) +,r, α(T ) n,pv + ζ(T ) n,p + b(T ) +,r > 0} α(T ) n,pv + ζ(T ) n,p ,

1 ιv + ζ(τ) n,p

n=1 1{|P(X(T ) n ; v)| > 0}[1{yn = +} logit(T ) + (X(T ) n )]

p P(X(T ) n ;v) 1{ w(T ) +,r, α(T ) n,pv + ζ(T ) n,p + b(T ) +,r > 0} ζ(T ) n,p ,

1 ιv + α(T ) n,pv + ζ(T ) n,p , ζ(τ) n,p

n=1 1{|P(X(T ) n ; v )| > 0}[logit(T ) + (X(T ) n )]

p P(X(T ) n ;v) 1{ w(T ) +,r, α(T ) n,pv + ζ(T ) n,p + b(T ) +,r > 0} α(T ) n,pv + ζ(T ) n,p ,

1 ιv + ζ(τ) n,p

Published in Transactions on Machine Learning Research (10/2024)

With probability at least 1 O NP poly(d) , α(T ) n,pv +ζ(T ) n,p , 1 ιv +ζ(τ) n,p > 0, and ζ(T ) n,p , 1 ιv + α(T ) n,pv+

ζ(T ) n,p , ζ(τ) n,p < O(1/ log9(d)). Therefore,

w(T ) +,r, v < η

n=1 1{|P(X(T ) n ; v)| > 0}[1{yn = +} logit(T ) + (X(T ) n )]

p P(X(T ) n ;v) 1{ w(T ) +,r, α(T ) n,pv + ζ(T ) n,p + b(T ) +,r > 0}O 1 log9(d)

Invoking Lemma F.3, we know that

1 log5(d) η NP

1 ι 1 log9(d)

n=1 1{|P(X(T ) n ; v)| > 0} 1{yn = +} logit(T ) + (X(T ) n )

p P(X(T ) n ;v) 1{ w(T ) +,r, α(T ) n,pv + ζ(T ) n,p + b(T ) +,r > 0}

It follows that

1 ιv + ζ(τ) n,p + b(T ) +,r

<O 1 log9(d)

n=1 1{|P(X(T ) n ; v)| > 0}[1{yn = +} logit(T ) + (X(T ) n )]

p P(X(T ) n ;v) 1{ w(T ) +,r, α(T ) n,pv + ζ(T ) n,p + b(T ) c,r > 0}

Ω 1 log5(d)

n=1 1{|P(X(T ) n ; v)| > 0} 1{yn = +} logit(T ) + (X(T ) n )

p P(X(T ) n ;v) 1{ w(T ) +,r, α(T ) n,pv + ζ(T ) n,p + b(T ) c,r > 0}

Consequently,

σ w(T +1) +,r ,

1 ιv + ζ(τ) n,p + b(T +1) +,r

=σ w(T ) +,r,

1 ιv + ζ(τ) n,p + b(T ) +,r + w(T ) +,r,

1 ιv + ζ(τ) n,p + b(T ) +,r

σ w(T ) +,r,

1 ιv + ζ(τ) n,p + b(T ) +,r

σ w(0) +,r,

1 ιv + ζ(τ) n,p + b(0) +,r .

Corollary F.1.1 (Bias update upper bound). Choose any Te poly(d). With probability at least 1

poly(d) , for all t [0, Te], any neuron w+,r, and any v U(0) +,r,

b(t) +,r < Ω polylog(d)

w(t) +,r, ζ . (158)

Published in Transactions on Machine Learning Research (10/2024)

Proof. Conditioning on the high-probability events of Theorem F.1 above, we know that for any neuron indexed (+, r), at any time t Te, its update takes the form

w(t) +,r = η

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = +} logit(t) + (X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0} α(t) n,pv + ζ(t) n,p ,

It follows that, with probability at least 1 O 1 poly(d) ,

w(t) +,r, ζ =

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = +} logit(t) + (X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0} α(t) n,pv + ζ(t) n,p, ζ

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = +} logit(t) + (X(t) n )

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0}O 1 polylog(d)

On the other hand,

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = +} logit(t) + (X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0}α(t) n,pv

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = +} logit(t) + (X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0}ζ(t) n,p

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = +} logit(t) + (X(t) n )

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0}

1 ι O 1 log9(d)

Clearly, w(t) +,r 2 Ω polylog(d) w(t) +,r, ζ . (162)

The conclusion follows.

Published in Transactions on Machine Learning Research (10/2024)

Lemma F.2 (Nonactivation invariance). Let the assumptions in Theorem D.1 hold.

Denote the set of features C(v ) = {v+,c}k+ c=1 {v ,c}k c=1 {v+, v } {v }. If the update term for neuron w(t) +,r can be written as follows

w(t) +,r = η

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = +} logit(t) + (X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0} α(t) n,pv + ζ(t) n,p , (163)

then given any τ > t, the following inequality holds with probability at least 1 O NP poly(d) for all v -dominated

patch x(τ) n,p:

w(t) +,r, x(τ) n,p + b(t) +,r < 0 (164)

Proof. Let us fix a neuron w+,r satisfying the update expression in the Lemma statement, and fix some τ > t.

Firstly, the bias update for this neuron can be upper bounded via the reverse triangle inequality:

w(t) +,r 2 log5(d)

1 log5(d) η NP

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = +} logit(t) + (X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0}α(t) n,pv

+ 1 log5(d) η NP

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = +} logit(t) + (X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0}ζ(t) n,p

Published in Transactions on Machine Learning Research (10/2024)

Let us further upper bound the two 2 terms separately. Firstly,

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = +} logit(t) + (X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0}α(t) n,pv

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = +} logit(t) + (X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0}α(t) n,pv

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = +} logit(t) + (X(t) n )

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0}α(t) n,p v 2

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = +} logit(t) + (X(t) n )

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0}

Secondly, with probability at least 1 O NP poly(d) ,

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = +} logit(t) + (X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0}ζ(t) n,p

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = +} logit(t) + (X(t) n )

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0} ζ(t) n,p 2

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = +} logit(t) + (X(t) n )

p P(X(0) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0} 1 log9(d)

Published in Transactions on Machine Learning Research (10/2024)

Therefore, with probability at least 1 O NP poly(d) , we can bound the update to the bias as follows:

1 log5(d) η NP

1 ι 1 log9(d)

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = +} logit(t) + (X(t) n )

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0}

Furthermore, with probability at least 1 e Ω(d)+O(log(d)) > 1 O NP poly(d) , the following holds for all n, p:

α(t) n,pv, ζ(τ) n,p , ζ(t) n,p, α(τ) n,pv , ζ(t) n,p, ζ(τ) n,p < O 1 log9(d)

Combining the above derivations, they imply that with probability at least 1 O NP poly(d) , for any x(τ) n,p dominated by v ,

w(t) +,r, x(τ) n,p + b(t) +,r

= w(t) +,r, α(τ) n,pv + ζ(τ) n,p + b(t) +,r

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = +} logit(t) + (X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0} α(t) n,pv + ζ(t) n,p, α(τ) n,pv + ζ(τ) n,p + b(t) +,r

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = +} logit(t) + (X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0} α(t) n,pv, ζ(τ) n,p + ζ(t) n,p, α(τ) n,pv + ζ(t) n,p, ζ(τ) n,p

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = +} logit(t) + (X(t) n )

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0} O 1 log9(d)

O 1 log9(d)

1 ι 1 log9(d)

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = +} logit(t) + (X(t) n )

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0}

Published in Transactions on Machine Learning Research (10/2024)

This completes the proof.

Lemma F.3 (Nonactivation on noise patches). Let the assumptions in Theorem D.1 hold.

Denote the set of features M = {v+,c}k+ c=1 {v ,c}k c=1 {v+, v }. If the update term for neuron w(t) +,r can be written as follows

w(t) +,r = η

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = +} logit(t) + (X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0} α(t) n,pv + ζ(t) n,p , (171)

1 log5(d) η NP

1 ι 1 log9(d)

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = +} logit(t) + (X(t) n )

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0}

Moreover, for any τ > t, the following inequality holds with probability at least 1 O NP poly(d) for all noise

patches x(τ) n,p = ζ(τ) n,p:

w(t) +,r, x(τ) n,p + b(t) +,r < 0 (173)

Proof. Similar to the proof of Lemma F.2, we can estimate the update to the bias term

1 log5(d) η NP

1 ι 1 log9(d)

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = +} logit(t) + (X(t) n )

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0}

Published in Transactions on Machine Learning Research (10/2024)

Then for any x(τ) n,p = ζ(τ) n,p with τ > t, with probability at least 1 O m NP poly(d) ,

w(t) +,r, x(τ) n,p + b(t) +,r

= w(t) +,r, ζ(τ) n,p + b(t) +,r

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = +} logit(t) + (X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0} α(t) n,pv + ζ(t) n,p, ζ(τ) n,p + b(t) +,r

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = +} logit(t) + (X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0} α(t) n,pv, ζ(τ) n,p + ζ(t) n,p, ζ(τ) n,p + b(t) +,r

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = +} logit(t) + (X(t) n )

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0} O 1 log9(d)

O 1 log9(d)

1 ι 1 log9(d)

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = +} logit(t) + (X(t) n )

p P(X(t) n ;v) 1{ w(t) +,r, α(t) n,pv + ζ(t) n,p + b(t) c,r > 0}

Published in Transactions on Machine Learning Research (10/2024)

G Fine-grained Learning

This section treats the learning dynamics of using fine-grained labels to train the NN; the analysis will be much simpler since the technical analysis overlaps significantly with that in the previous sections.

The training procedure is exactly the same as in the coarse-grained training setting. We explicitly write them out here to avoid any possible confusion.

The learner for fine-grained classification is written as follows for c [k+]:

p=1 σ( w+,c,r, xp + b+,c,r), c [k+] (176)

with frozen linear classifier weights a+,c,r = 1. Same definition applies to the classes.

The SGD dynamics induced by the training loss is now

w(t+1) +,c,r = w(t) +,c,r + η 1

1{yn = (+, c)}[1 logit(t) +,c(X(t) n )] X

p [P ] σ ( w(t) +,c,r, x(t) n,p + b(t) c,r)x(t) n,p+

1{yn = (+, c)}[ logit(t) +,c(X(t) n )] X

p [P ] σ ( w(t) +,c,r, x(t) n,p + b(t) c,r)x(t) n,p

The bias is manually tuned according to the update rule

b(t+1) +,c,r = b(t) +,c,r w(t) +,c,r 2 log5(d) (178)

We assign m+,c = Θ(d1+2c0) neurons to each subclass (+, c). For convenience, we write m = dm+,c.

The initialization scheme is identical to the coarse-training case, except we choose a slightly less negative b(0) c,r = σ0 2 + 2c0 p

The parameter choices remain the same as before.

G.1 Initialization geometry

Definition G.1. Define the following sets of interest of the hidden neurons:

1. U(0) +,c,r = {v V : w(0) +,c,r, v σ0 2 + 2c0 q

log(d) 1 log5(d)}

2. Given v V, S (0) +,c (v) (+, c) [m+,c] satisfies:

(a) w(0) +,c,r, v σ0 2 + 2c0 q

log(d) + 1 log5(d)

(b) v V s.t. v v, w(0) +,c,r, v < σ0 2 + 2c0 q

log(d) 1 log5(d)

3. Given v V, S(0) +,c(v) (+, c) [m+,c] satisfies:

(a) w(0) +,c,r, v σ0 2 + 2c0 q

log(d) 1 log5(d)

4. For any (+, c, r) S (0) +,c,reg (+, c) [m+,c]:

(a) w(0) +,c,r, v σ0

U(0) +,c,r O(1)

Published in Transactions on Machine Learning Research (10/2024)

The same definitions apply to the -class neurons.

Proposition 2. At t = 0, for all v D, the following properties are true with probability at least 1 d 2

over the randomness of the initialized kernels:

1. |S (0) +,c (v)|, |S(0) +,c(v)| = Θ 1

2. In particular, |S (0) y (v)|

|S(0) y (v )| 1 = O 1 log5(d) and |S (0) y (v)|

|S (0) y (v )| 1 = O 1 log5(d) for any y, y

{(+, c)}k+ c=1 {( , c)}k c=1 and common or fine-grained features v, v .

3. S(0) +,c,reg = [m+,c]

The same properties apply to the -class neurons.

Proof. This proof proceeds in virtually the same way as in the proof of Proposition 1, so we omit it here.

G.2 Poly-time properties

Theorem G.1. Fix any t [0, Te], assuming Te poly(d).

1. (Non-activation invariance) For any τ t, with probability at least 1 O mk+NP t

poly(d) , for any feature

v {v+,c}k+ c=1 {v ,c}k c=1 {v+, v }, for every t t, (+, c, r) / S(0) +,c(v) and v-dominated patch

sample x(τ) n,p = α(τ) n,pv + ζ(τ) n,p, the following holds:

σ w(t ) +,c,r, x(τ) n,p + b(t ) +,c,r = 0 (179)

2. (Non-activation on noise patches) For any τ t, with probability at least 1 O m NP t poly(d) , for every

c [k+], r [m] and noise patch x(τ) n,p = ζ(τ) n,p, the following holds:

σ w(t) +,c,r, x(τ) n,p + b(t) +,c,r = 0 (180)

3. (Off-diagonal nonpositive growth) Given fine-grained class (+, c) and any τ t, with probability at least 1 O mk+NP t

poly(d) , for any t t, any feature v {v ,c}k c=1 {v } {v+,c }c =c, any neuron

w+,c,r S(0) +,c(v) and any v-dominated patch x(τ) n,p = α(τ) n,pv + ζ(τ) n,p, σ w(t ) +,c,r, x(τ) n,p + b(t ) +,c,r

σ w(0) +,c,r, x(τ) n,p + b(0) +,c,r .

Proof. The proof of this theorem is similar to that of Theorem F.1, but with some subtle differences.

Base case t = 0.

1. (Nonactivation invariance)

Choose any v from the set {v+,c}k+ c=1 {v ,c}k c=1 {v+, v }. We will work with neuron sets in the + class in this proof; the -class case can be handled in the same way.

First, given τ 0, we need to show that, for every n such that |P(X(τ) n ; v )| > 0 and p P(X(τ) n ; v ), for every (+, c, r) neuron index,

w(0) +,c,r, v < σ0 2 + 2c0

log(d) 1 log5(d) = σ w(0) +,c,r, x(τ) n,p + b(0) +,c,r = 0 (181)

Published in Transactions on Machine Learning Research (10/2024)

This is indeed true. The following holds with probability at least 1 O m NP poly(d) for all (+, r) / S(0) + (v) and

all such x(τ) n,p:

w(0) +,c,r, x(τ) n,p + b(0) +,c,r

(2 + 2c0)(log(d) 1/ log5(d)) + O σ0 log9(d)

(2 + 2c0)(1 + ι)(log(d) 1/ log5(d)) (2 + 2c0) log(d) q

(2 + 2c0)(log(d) 1/ log5(d)) + 4 + 2c0 p

log(d) + O 1 log9(d)

(2 + 2c0)(ι log(d) (1 + ι)/ log5(d)) q

(2 + 2c0)(log(d) 1/ log5(d)) + 2 + 2c0 p

log(d) + O 1 log9(d)

The first equality holds by utilizing the identity a b = a2 b2

a+b . As a consequence, σ( w(0) +,c,r, x(τ) n,p + b(0) +,r) = 0.

2. (Non-activation on noise patches) Invoking Lemma H.3, for any τ 0, with probability at least

1 O m NP poly(d) , we have for all possible choices of r [m] and the noise patches x(τ) n,p = ζ(τ) n,p:

w(0) +,c,r, ζ(τ) n,p O(σ0σζ p

d log(d)) O σ0 log9(d)

b(0) +,r. (183)

Therefore, no neuron can activate on the noise patches at time t = 0.

3. (Off-diagonal nonpositive growth) This point is trivially true at t = 0.

Inductive step: we assume the induction hypothesis for t [0, T] (with T < Te of course), and prove the statements for t = T + 1.

1. (Nonactivation invariance)

Again, choose any v from the set {v+,c}k+ c=1 {v ,c}k c=1 {v+, v }.

We need to prove that given τ T + 1, with probability at least 1 O mk+NP (T +1)

poly(d) , for every t T + 1,

(+, c, r) neuron index and v -dominated patch x(τ) n,p,

(+, c, r) / S(0) +,c(v ) = σ w(t ) +,c,r, x(τ) n,p + b(t ) +,c,r = 0. (184)

By the induction hypothesis of point 1., with probability at least 1 O mk+NP T

poly(d) , the following is already

true on all the v -dominated patches at time t T:

(+, c, r) / S(0) +,c(v ) = σ w(t ) +,c,r, x(T ) n,p + b(t ) +,c,r = 0. (185)

In particular, σ w(T ) +,c,r, x(T ) n,p + b(T ) +,c,r = 0.

In other words, no (+, c, r) / S(0) +,c(v ) can be updated on the v -dominated patches at time t = T. Furthermore, the induction hypothesis of point 2. also states that the network cannot activate on any noise patch x(T ) n,p = ζ(T ) n,p with probability at least 1 O m NP T poly(d) . Therefore, the neuron update for those

Published in Transactions on Machine Learning Research (10/2024)

(+, c, r) / S(0) +,c(v ) takes the form

w(T ) +,c,r = η

n=1 1{|P(X(T ) n ; v)| > 0}[1{yn = (+, c)} logit(T ) +,c(X(T ) n )]

p P(X(T ) n ;v) 1{ w(T ) +,c,r, α(T ) n,pv + ζ(T ) n,p + b(T ) +,c,r > 0} α(T ) n,pv + ζ(T ) n,p (186)

Conditioning on this high-probability event, we have

b(t) +,c,r =

w(t) +,c,r 2 log5(d)

1 log5(d) η NP

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = (+, c)} logit(t) +,c(X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,c,r, α(t) n,pv + ζ(t) n,p + b(t) +,c,r > 0}α(t) n,pv

+ 1 log5(d) η NP

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = (+, c)} logit(t) +,c(X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,c,r, α(t) n,pv + ζ(t) n,p + b(t) +,c,r > 0}ζ(t) n,p

Let us further upper bound the two 2 terms separately. Firstly,

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = (+, c)} logit(t) +,c(X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,c,r, α(t) n,pv + ζ(t) n,p + b(t) +,c,r > 0}α(t) n,pv

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = (+, c)} logit(t) +,c(X(t) n )

p P(X(t) n ;v) 1{ w(t) +,c,r, α(t) n,pv + ζ(t) n,p + b(t) +,c,r > 0}α(t) n,p v 2

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = (+, c)} logit(t) +,c(X(t) n )

p P(X(t) n ;v) 1{ w(t) +,c,r, α(t) n,pv + ζ(t) n,p + b(t) +,c,r > 0}

For the second 2 term consisting purely of noise, note that since all the ζ(t) n,p s are independent Gaussian random vectors, the standard deviation of the sum is in fact ( X

p P(X(t) n ;v) 1{|P(X(t) n ; v)| > 0}1{ w(t) +,c,r, α(t) n,pv + ζ(t) n,p + b(t) +,c,r > 0}

[1{yn = (+, c)} logit(t) +,c(X(t) n )]2 )1/2

Published in Transactions on Machine Learning Research (10/2024)

With the basic property that q P

j |cj| for any sequence of real numbers c1, c2, ..., we know this standard deviation can be upper bounded by

p P(X(t) n ;v) 1{|P(X(t) n ; v)| > 0}1{ w(t) +,c,r, α(t) n,pv + ζ(t) n,p + b(t) +,c,r > 0}

1{yn = (+, c)} logit(t) +,c(X(t) n ) σζ

It follows that with probability at least 1 O 1 poly(d) ,

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = (+, c)} logit(t) +,c(X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,c,r, α(t) n,pv + ζ(t) n,p + b(t) +,c,r > 0}ζ(t) n,p

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = (+, c)} logit(t) +,c(X(t) n )

p P(X(t) n ;v) 1{ w(t) +,c,r, α(t) n,pv + ζ(t) n,p + b(t) +,c,r > 0} 1 log9(d)

Therefore, we can upper bound the bias update as follows:

b(t) +,c,r 1 log5(d) η NP

1 ι 1 log9(d)

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = (+, c)} logit(t) +,c(X(t) n )

p P(X(t) n ;v) 1{ w(t) +,c,r, α(t) n,pv + ζ(t) n,p + b(t) +,c,r > 0}

Furthermore, with probability at least 1 O NP poly(d) , the following holds for all n, p:

α(t) n,pv, ζ(τ) n,p , ζ(t) n,p, α(τ) n,pv , ζ(t) n,p, ζ(τ) n,p < O 1 log9(d)

Published in Transactions on Machine Learning Research (10/2024)

Combining the above derivations, they imply that with probability at least 1 O NP poly(d) , for any x(τ) n,p dominated by v ,

w(t) +,c,r, x(τ) n,p + b(t) +,c,r

= w(t) +,c,r, α(τ) n,pv + ζ(τ) n,p + b(t) +,c,r

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = (+, c)} logit(t) +,c(X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,c,r, α(t) n,pv + ζ(t) n,p + b(t) +,c,r > 0} α(t) n,pv + ζ(t) n,p, α(τ) n,pv + ζ(τ) n,p + b(t) +,c,r

n=1 1{|P(X(t) n ; v)| > 0}[1{yn = (+, c)} logit(t) +,c(X(t) n )]

p P(X(t) n ;v) 1{ w(t) +,c,r, α(t) n,pv + ζ(t) n,p + b(t) +,c,r > 0} α(t) n,pv, ζ(τ) n,p + ζ(t) n,p, α(τ) n,pv + ζ(t) n,p, ζ(τ) n,p

+ b(t) +,c,r

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = (+, c)} logit(t) +,c(X(t) n )

p P(X(t) n ;v) 1{ w(t) +,c,r, α(t) n,pv + ζ(t) n,p + b(t) +,c,r > 0} O 1 log9(d)

+ b(t) +,c,r

O 1 log9(d)

1 ι 1 log9(d)

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = (+, c)} logit(t) +,c(X(t) n )

p P(X(t) n ;v) 1{ w(t) +,c,r, α(t) n,pv + ζ(t) n,p + b(t) +,c,r > 0}

Therefore, with probability at least 1 O m NP poly(d) , the following holds for the relevant neurons and v - dominated patches: w(T ) +,c,r, x(τ) n,p + b(T ) +,c,r < 0. (195)

In conclusion, with τ T + 1, with probability at least 1 O m NP poly(d) , for every (+, c, r) / S(0) +,c(v ) and

relevant (n, p) s,

w(T ) +,c,r + w(T ) +,c,r, x(τ) n,p + b(T ) +,c,r + b(T ) +,c,r = w(T +1) +,c,r , x(τ) n,p + b(T +1) +,c,r < 0, (196)

which leads to w(t ) +,c,r, x(τ) n,p + b(t ) +,c,r < 0 for all t T + 1 with probability at least 1 O mk+NP (T +1)

(also by taking union bound over all the possible choices of v at time T + 1). This finishes the inductive step for point 1.

2. (Non-activation on noise patches)

The inductive step for this part is very similar to (and even simpler than) the inductive step of point 1, so we omit the calculations here.

Published in Transactions on Machine Learning Research (10/2024)

3. (Off-diagonal nonpositive growth) By the induction hypothesis s high-probability event, we already have that, given any fine-grained class (+, c), τ T + 1, for any feature v {v ,c}k c=1 {v } {v+,c }c =c and any neuron w+,c,r, σ w(T ) +,c,r, x(τ) n,p + b(T ) +,c,r σ w(0) +,c,r, x(τ) n,p + b(0) +,c,r . We just need to show that

w(t) +,c,r, x(τ) n,p + b(T ) +,r 0 to finish the proof; the rest proceeds in a similar fashion to the induction step of point 3 in the proof of Theorem F.1.

Similar to the induction step of point 1, denoting M to be the set of all common and fine-grained features, the update expression of any neuron (+, c, r) has to be

w(T ) +,c,r = η

n=1 1{|P(X(T ) n ; v)| > 0}[1{yn = (+, c)} logit(T ) +,c(X(T ) n )]

p P(X(T ) n ;v) 1{ w(T ) +,c,r, α(T ) n,pv + ζ(T ) n,p + b(T ) +,c,r > 0} α(T ) n,pv + ζ(T ) n,p (197)

Written more explicitly,

w(T ) +,c,r = η

n=1 1{|P(X(T ) n ; v)| > 0}1{yn = (+, c)}[1 logit(T ) +,c(X(T ) n )]

p P(X(T ) n ;v) 1{ w(T ) +,c,r, α(T ) n,pv + ζ(T ) n,p + b(T ) +,c,r > 0} α(T ) n,pv + ζ(T ) n,p

n=1 1{yn = (+, c)}1{|P(X(T ) n ; v )| > 0}[logit(T ) +,c(X(T ) n )]

p P(X(T ) n ;v ) 1{ w(T ) +,c,r, α(T ) n,pv + ζ(T ) n,p + b(T ) +,c,r > 0} α(T ) n,pv + ζ(T ) n,p

It follows that with probability at least 1 O m NP poly(d) , for relevant n, p, r, we have

w(T ) +,c,r, α(τ) n,pv + ζ(τ) n,p

n=1 1{|P(X(T ) n ; v)| > 0}1{yn = (+, c)}[1 logit(T ) +,c(X(T ) n )]

p P(X(T ) n ;v) 1{ w(T ) +,c,r, α(T ) n,pv + ζ(T ) n,p + b(T ) +,c,r > 0}O 1 log9(d)

Furthermore, similar to the induction step of point 1, we can estimate the bias update as follows:

Ω 1 log5(d)

n=1 1{|P(X(t) n ; v)| > 0} 1{yn = (+, c)} logit(t) +,c(X(t) n )

p P(X(t) n ;v) 1{ w(t) +,c,r, α(t) n,pv + ζ(t) n,p + b(t) +,c,r > 0}

It follows that, indeed, w(T ) +,c,r, x(τ) n,p + b(T ) +,c,r 0, which completes the induction step of point 3.

Published in Transactions on Machine Learning Research (10/2024)

G.3 Training

Choose an arbitrary constant B [Ω(1), log(3/2)].

Definition G.2. Let T0(B) > 0 be the first time that there exists some X(t) n and c such that F (T0(B)) y (X(T0(B)) n ) B for any n [N] and y {(+, c)}k+ c=1 {( , c)}k c=1.

We write T0(B) as T0 for simplicity of notation when the context is clear.

Lemma G.2. With probability at least 1 O mk+NP T0

poly(d) , the following holds for all t [0, T0):

1. (On-diagonal common-feature neuron growth) For every c [k+], every (+, c, r), (+, c, r ) S (0) +,c (v+),

w(t) +,c,r w(0) +,c,r = w(t) +,c,r w(0) +,c,r (201)

w(t) +,r =[1/4, 2/3]

1 ι 1 s 1/3 η s

2k+P v+ + ζ(t) +,r (202)

where ζ(t) +,c,r N(0, σ(t)2 ζ+,c,r I), σ(t) ζ+,c,r = Θ(1) ησζ

The bias updates satisfy

b(t) +,c,r = Θ ηs

k+P log5(d)

Furthermore, every (+, r) S (0) + (v+) activates on all the v+-dominated patches at time t.

2. (On-diagonal finegrained-feature neuron growth) For every c [k+] and every (+, c, r), (+, c, r ) S (0) +,c (v+,c),

w(t) +,c,r w(0) +,c,r = w(t) +,c,r w(0) +,c,r (204)

w(t) +,c,r = 1 O 1

1 ι 1 s 1/3 η s

2k+P v+,c + ζ(t) +,r (205)

where ζ(t) +,c,r N(0, σ(t)2 ζ+,cr I), and σ(t) ζ+,r = 1 O 1 k+

1 s 1/3 ησζ

The bias updates satisfy

b(t) +,c,r = Θ ηs

k+P log5(d)

Furthermore, every (+, c, r) S (0) +,c (v+,c) activates on all the v+-dominated patches at time t.

3. The above results also hold with the + and class signs flipped.

Proof. The proof of this theorem proceeds in a similar fashion to Theorem D.1, with some variations for the common-feature neurons.

We shall prove the statements in this theorem via induction. We focus on the +-class neurons; -class neurons proofs are done in the same fashion.

First of all, relying on the (high-probability) event of Theorem G.1, we know that we can simplify the update expressions for the neurons in S (0) +,c (v+,c) to the form

w(t) +,c,r = η

n=1 1{yn = (+, c)}[1 logit(t) +,c(X(t) n )]

p P(X(t) n ;v+,c) 1{ w(t) +,c,r, α(t) n,pv+,c + ζ(t) n,p + b(t) +,c,r > 0} α(t) n,pv+,c + ζ(t) n,p , (207)

Published in Transactions on Machine Learning Research (10/2024)

and for the neurons in S (0) +,c (v+), the updates take the form

1{yn = (+, c)}[1 logit(t) +,c(X(t) n )] + X

c [k+] {c} 1{yn = (+, c )}[ logit(t) +,c(X(t) n )]

p P(X(t) n ;v+) 1{ w(t) +,c,r, α(t) n,pv+ + ζ(t) n,p + b(t) +,c,r > 0} α(t) n,pv+ + ζ(t) n,p .

By definition of T0 and the fact that B log(3/2), for any n [N] and t < T0, we can write down a simple upper bound of logit(t) +,c(X(t) n ):

logit(t) +,c(X(t) n ) = exp(F+,c(X(t) n )) Pk+ c =1 exp(F+,c (X(t) n )) + Pk c =1 exp(F ,c (X(t) n ))

3 2 2k+ = 3 4k+ ,

and we can lower bound it as follows

logit(t) +,c(X(t) n ) 1 2k+ 3

2 = 1 3k+ , (210)

The inductive proof for the fine-grained neurons S (0) +,c (v+,c) is almost identical to that in the proof of Theorem

D.1. The only notable difference here is that [1 logit(t) +,c(X(t) n )] has the estimate 1 O 1 k+

The inductive proof of the common-feature neurons S (0) +,c (v+) requires more care as its update expression equation 208 is qualitatively different from the coarse-grained training case in Theorem D.1, so we present the full proof here.

Base case, t = 0.

With probability at least 1 O m NP poly(d) , for every c [k+] and every (+, c, r) S (0) +,c (v+),

w(0) +,c,r, α(0) n,pv+ + ζ(0) n,p + b(0) +,c,r

(1 ι)(2 + 2c0)(log(d) + 1/ log5(d)) p

(2 + 2c0) log(d) O 1 log9(d)

(1 ι)(2 + 2c0)(log(d) + 1/ log5(d)) (2 + 2c0) log(d) q

(1 ι)(2 + 2c0)(log(d) + 1/ log5(d)) + p

(2 + 2c0) log(d) O 1 log9(d)

(2 + 2c0)( ι log(d) + (1 ι)/ log5(d)) q

(1 ι)(2 + 2c0)(log(d) + 1/ log5(d)) + p

(2 + 2c0) log(d) O 1 log9(d)

Published in Transactions on Machine Learning Research (10/2024)

This means all the v+-singleton neurons will be updated on all the v+-dominated patches at time t = 0. Therefore, we can write update expression equation 208 as follows

1{yn = (+, c)}[1 logit(0) +,c(X(0) n )] + X

c [k+] {c} 1{yn = (+, c )}[ logit(0) +,c(X(0) n )]

p P(X(0) n ;v+)

α(0) n,pv+ + ζ(0) n,p .

By concentration of the binomial random variable, we know that with probability at least 1 e Ω(log2(d)), for all n, P(X(0) n ; v+) = 1 s 1/3 s . (213)

Now, with the estimates we derived for logit(t) +,c(X(t) n ) at the beginning of the proof and the independence of

all the noise vectors ζ(0) n,p s, we arrive at

w(0) +,r =[1/4, 2/3]

1 ι 1 s 1/3 η s

2k+P v+ + ζ(0) +,r (214)

where σ(0) ζ+,c,r = Θ(1) ησζ

Additionally, a byproduct of the above proof steps is that all the S (0) +,c (v+) neurons indeed activate on all the v+-dominated patches at t = 0 with high probability.

Now we examine the bias update. We first estimate w(0) +,c,r 2. With probability at least 1 O m poly(d)

the following upper bound holds for all neurons in S (0) +,c (v+):

w(0) +,c,r 2 O η s

v+ 2 + ζ(0) +,r 2

and the following lower bound holds (via the reverse triangle inequality):

w(0) +,c,r 2 Ω η s

v+ 2 ζ(0) +,r 2

It follows that w(0) +,c,r 2 = Θ η s k+P , which means

b(0) +,c,r =

w(0) +,c,r 2 log5(d) = Θ ηs

k+P log5(d)

This completes the proof of the base case.

Published in Transactions on Machine Learning Research (10/2024)

Induction step. Assume statements for time [0, t], prove for t + 1.

First, by the induction hypothesis, we know that neurons in S (0) +,c (v+) must activate on all the v+-dominated patches at time t. Therefore, we can write the update expression equation 208 as follows:

1{yn = (+, c)}[1 logit(t) +,c(X(t) n )] + X

c [k+] {c} 1{yn = (+, c )}[ logit(t) +,c(X(t) n )]

p P(X(t) n ;v+)

α(t) n,pv+ + ζ(t) n,p .

Following the same argument as in the base case, we have that with probability at least 1 O m NP poly(d) ,

w(t) +,c,r =[1/4, 2/3]

1 ι 1 s 1/3 η s

2k+P v+ + ζ(t) +,c,r, (219)

and σ(t) ζ+,c,r = Θ(1) ησζ

Now we need to show that w(t+1) +,c,r indeed activate on all the v+-dominated patches at time t + 1 with high probability.

So far, we know that for τ [0, t + 1],

w(τ) +,c,r =[1/4, 2/3]

1 ι 1 s 1/3 η s

2k+P v+ + ζ(τ) +,c,r, (220)

and σ(τ) ζ+,c,r = Θ(1) ησζ

2N . It follows that

w(t+1) +,r =w(0) +,c,r + (t + 1)[1/4, 2/3]

1 ι 1 s 1/3 η s

2k+P v+ + ζ(t+1) +,c,r , (221)

where σ(t+1) ζ+,c,r = Θ(1) t + 1ησζ

The following holds with probability at least 1 O m NP poly(d) over all the v+-dominated patches x(t+1) n,p =

α(t+1) n,p v+ + ζ(t+1) n,p (which are independent of w(t+1) +,r ) and the v+-singleton neurons:

w(t+1) +,c,r , α(t+1) n,p v+ + ζ(t+1) n,p

= w(0) +,c,rα(t+1) n,p v+ + ζ(t+1) n,p + (t + 1)[1/4, 2/3](1 ι) 1 s 1/3 1 O 1 log9(d)

+ ζ(t+1) +,c,r , α(t+1) n,p v+ + ζ(t+1) n,p

Note that with probability at least 1 O 1 poly(d) ,

ζ(t+1) +,c,r , α(t+1) n,p v+ O(1)

d log(d), (223)

and since t + 1 t + 1,

s < s , σζ p

d log(d) < 1 log9(d), and N > dk+, we know that

ζ(t+1) +,c,r , α(t+1) n,p v+ O 1

2k+P . (224)

Published in Transactions on Machine Learning Research (10/2024)

Similarly, with probability at least 1 O 1 poly(d) ,

ζ(t+1) +,c,r , α(t+1) n,p v+ + ζ(t+1) n,p O(1)

d log(d) O 1

2k+P . (225)

It follows that with probability at least 1 O m NP poly(d) ,

w(t+1) +,c,r , α(t+1) n,p v+ + ζ(t+1) n,p

w(0) +,c,r, α(t+1) n,p v+ + ζ(t+1) n,p + 1

4(t + 1)(1 ι) 1 s 1/3 1 O 1 log9(d)

2k+P . (226)

Next, let us estimate the bias updates for τ [0, t + 1].

Estimating b(t) +,c,r follows an almost identical argument as in the base case (with the only main difference being relying on Theorem G.1 for non-activation on non-v+-dominated patches), so we skip its calculations.

Therefore, b(t+1) +,c,r = b(0) +,c,r + Θ ηs (t+1) k+P log5(d) . This means

w(t+1) +,c,r , α(t+1) n,p v+ + ζ(t+1) n,p + b(t+1) +,c,r

w(0) +,c,r, α(t+1) n,p v+ + ζ(t+1) n,p + b(0) +,c,r

4(t + 1)(1 ι) 1 s 1/3 1 O 1 log9(d)

2k+P O ηs (t + 1)

k+P log5(d)

This completes the inductive step.

Corollary G.2.1. At time t = T0, ηs

k+P s S (0) +,c (v+) , ηs

k+P s S (0) +,c (v+,c) = Θ(1).

Proof. Directly follows from Lemma G.2 and Theorem G.1.

G.4 Model error after training

In this subsection, we show the model s error after fine-grained training. We also discuss that finetuning the model further increases its feature extractor s response to the true features, so it is even more robust/generalizing in downstream classification tasks.

Theorem G.3. Define b F+(X) = maxc [k+] F+,c(X), b F (X) = maxc [k ] F ,c(X).

With probability at least 1 O mk2 +NP T0 poly(d) , the following events take place:

1. (Fine-grained easy & hard sample test accuracies are nearly perfect) Given an easy or hard fine-grained

test sample (X, y) where y {(+, c)}k+ c=1 {( , c)}k c=1, P h F (T0) y (X) maxy =y F (T0) y (X) i o(1).

2. (Coarse-grained easy & hard sample test accuracy are nearly perfect) Given an easy or hard coarse-

grained test sample (X, y) where y {+1, 1}, P h b F (T0) y (X) b F (T0) y (X) i o(1).

Proof. Probability of mistake on easy samples.

Without loss of generality, assume X is a (+, c)-class easy sample.

Conditioning on the events of Theorem G.1 and Lemma G.2, we know that for all c [k ],

F (T0) ,c O(m+,c σ0 p

log(d)) o(1), (228)

Published in Transactions on Machine Learning Research (10/2024)

and for all c [k+] {c},

F (T0) +,c X

(+,r) S(0) +,c (v+) σ w(T0) +,r , αn,pv+ + ζn,p + b(T0) +,c ,r + O(m+,c σ0 p

s S(0) +,c (v+) 2

3(1 + ι) 1 + s 1/3 1 + 1 log9(d)

F (T0) +,c X

(+,r) S (0) +,c (v+) σ w(T0) +,c,r, αn,pv+ + ζn,p + b(T0) +,c,r

p P(X;v+,c)

(+,r) S (0) +,c (v+,c) σ w(T0) +,c,r, αn,pv+,c + ζn,p + b(T0) +,c,r

s S (0) +,c (v+) 1

4(1 ι) 1 s 1/3 1 1 log5(d)

+ s S (0) +,c (v+,c) 1 O 1

(1 ι) 1 s 1/3 1 1 log5(d)

Relying on Proposition 2, we know S(0) +,c (v+) = 1 1 log5(d) S (0) +,c (v+) and S (0) +,c (v+,c) = 1 1 log5(d) S (0) +,c (v+) , therefore F (T0) +,c (X) > maxc =c F (T0) +,c (X) has to be true. With Corollary G.2.1,

we also have F (T0) +,c (X) Ω(1) > o(1) maxc [k ] F (T0) ,c (X). It follows that the probability of mistake on an easy test sample is indeed at most o(1).

Probability of mistake on hard samples. Without loss of generality, assume X is a (+, c)-class hard sample.

By Theorem G.1 (and its proof) and Lemma G.2, we know that for any c [k+], the neurons w+,c ,r can only possibly receive update on v-dominated patches for v U(0) +,c ,r, and the updates to the neurons take the

feature-plus-Gaussian-noise form of P

v U(0) +,c ,r c(v )v + ζ(t) +,c ,r, with c(v ) 1 + ι 1 + s 1/3 η s 2k+P if

v is a fine-grained feature, or c(v ) 2

3 1 + ι 1 + s 1/3 η s 2k+P if v = v+ (because the v component of a

v -singleton neuron s update is already the maximum possible). Moreover, σ(t) ζ+,c ,r O ησζ

Relying on Theorem G.1, Lemma G.2, Corollary G.2.1 and previous observations, we have

F (T0) +,c (X) X

p P(X;v+,c)

(+,c,r) S (0) +,c (v+,c) σ w(T0) +,c,r, αn,pv+,c + ζn,p + b(T0) +,c,r

s S (0) +,c (v+,c) 1 O 1

(1 ι) 1 s 1/3 1 O 1 log5(d)

Published in Transactions on Machine Learning Research (10/2024)

and for c = c,

F (T0) +,c (X)

r=1 σ w(T0) +,c ,r, ζ + b(T0) +,c ,r

p P(X;v+,c)

(+,c ,r) S(0) +,c (v+,c) σ w(T0) +,c ,r, αn,pv+,c + ζn,p + b(T0) +,c ,r

(+,c ,r) S(0) +,c (v ) σ w(T0) +,c ,r, α n,pv + ζn,p + b(T0) +,c ,r

(+,c ,r) U(0) +,c ,r

τ=0 w(τ) +,c, r, ζ + X

r [m+,c ] w(0) +,c, r, ζ

p P(X;v+,c)

(+,c ,r) S(0) +,c (v+,c) σ w(0) +,c ,r, αn,pv+,c + ζn,p + b(0) +,c ,r

(+,c ,r) S(0) +,c (v ) σ w(0) +,c ,r, α n,pv + ζn,p + b(0) +,c ,r

O 1 polylog(d)

Moreover, for any c [k ], similar to before,

F (T0) ,c (X)

r=1 σ w(T0) ,c ,r, ζ + b(T0) ,c ,r

p P(X;v+,c)

( ,c ,r) S(0) ,c (v+,c) σ w(T0) ,c ,r, αn,pv+,c + ζn,p + b(T0) ,c ,r

( ,c ,r) S(0) ,c (v ) σ w(T0) ,c ,r, α n,pv + ζn,p + b(T0) ,c ,r

( ,c ,r) U(0) ,c ,r

w(T0) ,c, r, ζ + X

r [m ,c ] w(0) ,c, r, ζ

p P(X;v+,c)

( ,c ,r) S(0) ,c (v+,c) σ w(0) ,c ,r, αn,pv+,c + ζn,p + b(0) ,c ,r

+ O(1) s S(0) ,c (v ) ι upper + O(σ0 log(d))

O 1 polylog(d)

log(d) + O 1 log(d)

Therefore, F (T0) +,c (X) > maxy =(+,c) F (T0) y (X), which means b F (T0) + (X) > b F (T0) (X) indeed.

Remark. First of all, note that the feature extractor, after fine-grained training, is already well-performing, as it responds strongly (Ω(1) strength) to the true features, and very weakly (o(1) strength) to any off-diagonal features and noise. In other words, we stop training when the margin is at least Ω(1), i.e. when we have F (T )

y(T ) n (X(T ) n ) maxy =y(T ) n F (T ) y (X(T ) n ) Ω(1) for all n at some T poly(d), and with high probability, we just

Published in Transactions on Machine Learning Research (10/2024)

need T0 time to reach it. This can already help us explain the linear-probing result we saw on Image Net21k in Appendix A.2, since linear probing does not alter the the feature extractor after fine-grained pretraining (on Image Net21k), it only retrains a new linear classifier on top of the feature extractor for classifying on the target Image Net1k dataset.

At a high level, finetuning b F can only further enhance the feature extractor s response to the features, therefore making the model even more robust for challenging downstream classification problems; it will not degrade the feature extractor s response to any true feature. A rigorous proof of this statement is almost a repetition of the proofs for fine-grained training, so we do not repeat them here. Intuitively speaking, we just need to note that the properties stated in Theorem G.1 will continue to hold during finetuning (as long as we stay in polynomial time), and with similar argument to those in the proof of Lemma G.2, we note that the neurons responsible for detecting fine-grained features, i.e. the S (0) +,c (v+,c), will continue to only receive (positive) updates on the v+,c-dominated patches of the following form:

w(t) +,c,r = η

n=1 1{yn = (+, c)}[1 logit(t) + (X(t) n )]

p P(X(t) n ;v+,c) 1{ w(t) +,c,r, α(t) n,pv+,c + ζ(t) n,p + b(t) +,c,r > 0} α(t) n,pv+,c + ζ(t) n,p , (234)

and similar update expression can be stated for the S (0) +,c (v+) neurons:

n=1 1{yn = (+, c)}[1 logit(t) + (X(t) n )]

p P(X(t) n ;v+) 1{ w(t) +,c,r, α(t) n,pv+ + ζ(t) n,p + b(t) +,c,r > 0} α(t) n,pv+ + ζ(t) n,p .

Indeed, these feature-detector neurons will continue growing in the direction of the features they are responsible for detecting instead of degrade in strength.

Published in Transactions on Machine Learning Research (10/2024)

H Probability Lemmas

Lemma H.1 (Laurent-Massart χ2 Concentration (Laurent & Massart (2000) Lemma 1)). Let g N(0, Id). For any vector a Rd 0, any t > 0, the following concentration inequality holds:

i=1 aig2 i a 1 + 2 a 2

Lemma H.2. Let g N(0, σ2Id). Then,

P g 2 2 5σ2d e d (237)

Proof. By Lemma H.1, setting ai = 1 for all i and t = d yields

P g 2 2 σ2d + 2σ2d + 2σ2d e d (238)

Lemma H.3 (Shen et al. (2022a)). Let g1 N(0, σ2 1Id) and g2 N(0, σ2 2Id) be independent. Then, for any δ (0, 1) and sufficiently large d, there exist constants c1, c2 such that

P h | g1, g2 | c1σ1σ2 p

d log(1/δ) i 1 δ (239)

P h g1, g2 c2σ1σ2