# Deep Active Learning by Leveraging Training Dynamics

Haonan Wang1, Wei Huang2, Ziwei Wu1, Andrew Margenot1, Hanghang Tong1, Jingrui He1
1University of Illinois Urbana-Champaign, 2University of New South Wales
1{haonan3,ziweiwu2,margenot,htong,jingrui}@illinois.edu, 2{weihuang.uts}@gmail.com

Abstract

Active learning theories and methods have been extensively studied in classical statistical learning settings. However, deep active learning, i.e., active learning with deep learning models, is usually based on empirical criteria without solid theoretical justification, and thus suffers from heavy doubts when some of those criteria fail to provide benefits in real applications. In this paper, by exploring the connection between the generalization performance and the training dynamics, we propose a theory-driven deep active learning method (dynamicAL) which selects samples to maximize training dynamics. In particular, we prove that the convergence speed of training and the generalization performance are positively correlated under the ultra-wide condition, and we show that maximizing the training dynamics leads to better generalization performance. Furthermore, to scale up to large deep neural networks and data sets, we introduce two relaxations for the subset selection problem and reduce the time complexity from polynomial to constant. Empirical results show that dynamicAL not only outperforms the other baselines consistently but also scales well on large deep learning models. We hope our work will inspire more attempts at bridging the theoretical findings on deep networks and the practical impact of deep active learning in real applications.

1 Introduction

Training deep learning (DL) models usually requires a large amount of high-quality labeled data [1] to optimize a model with a massive number of parameters. The acquisition of such annotated data is usually time-consuming and expensive, making it unaffordable in fields that require high domain expertise. A promising approach for minimizing the labeling effort is active learning (AL), which aims to identify and label the maximally informative samples, so that a high-performing classifier can be trained with minimal labeling effort [2]. Under classical statistical learning settings, theories of active learning have been extensively studied from the perspective of VC dimension [3]. As a result, a variety of methods have been proposed, such as (i) the version-space-based approaches, which require maintaining a set of models [4, 5], and (ii) the clustering-based approaches, which assume that the data within the same cluster have pure labels [6]. However, the theoretical analyses for these classical settings may not hold for over-parameterized deep neural networks, where the traditional wisdom is ineffective [1]. For example, margin-based methods select the labeling examples in the vicinity of the learned decision boundary [7, 8]. However, in the over-parameterized regime, every labeled example could potentially be near the learned decision boundary [9]. As a result, theoretically, such analysis can hardly guide us in designing practical active learning methods. Besides, empirically, multiple deep active learning works that borrow observations and insights from the classical theories and methods have been observed to be unable to outperform their passive learning counterparts in a few application scenarios [10, 11].
On the other hand, the analysis of neural networks' optimization and generalization performance has witnessed several exciting developments in recent years in terms of deep learning theory [12, 13, 14]. It has been shown that the training dynamics of deep neural networks trained by gradient descent can be characterized by the Neural Tangent Kernel (NTK) of infinite-width [12] or finite-width [15] networks. This is further leveraged to characterize the generalization of over-parameterized networks through Rademacher complexity analysis [13, 16]. We are therefore inspired to ask: How can we design a practical and generic active learning method for deep neural networks with theoretical justifications? To answer this question, we first explore the connection between the model performance on testing data and the convergence speed on training data for over-parameterized deep neural networks. Based on the NTK framework [12, 13], we theoretically show that if a deep neural network converges faster ("Train Faster"), then it tends to have better generalization performance ("Generalize Better"), which matches the existing observations [17, 18, 19, 20, 21]. Motivated by the aforementioned connection, we first introduce Training Dynamics, the derivative of the training loss with respect to iteration, as a proxy to quantitatively describe the training process. On top of it, we formally propose our generic and theoretically motivated deep active learning method, dynamicAL, which queries labels for a subset of unlabeled samples that maximally increases the training dynamics. In order to compute the training dynamics by merely using the unlabeled samples, we leverage two relaxations, Pseudo-labeling and Subset Approximation, to solve this non-trivial subset selection problem. Our relaxed approaches are capable of effectively estimating the training dynamics as well as efficiently solving the subset selection problem by reducing the complexity from $O(N^b)$ to $O(b)$. On the theory side, we coin a new term, Alignment, to measure the length of the label vector's projection on the neural tangent kernel space. Then, we demonstrate that higher alignment usually comes with a faster convergence speed and a lower generalization bound. Furthermore, with the help of the maximum mean discrepancy [22], we extend the previous analysis to an active learning setting where the i.i.d. assumption may not hold. Finally, we show that alignment is positively correlated with our active learning goal, training dynamics, which implies that maximizing the training dynamics will lead to better generalization performance. Regarding experiments, we have empirically verified our theory by conducting extensive experiments on three datasets, CIFAR10 [23], SVHN [24], and Caltech101 [25], using three types of network structures: vanilla CNN, ResNet [26], and VGG [27]. We first show that the result of the subset selection problem delivered by the subset approximation is close to the global optimal solution. Furthermore, under the active learning setting, our method not only outperforms other baselines but also scales well on large deep learning models. The main contributions of our paper can be summarized as follows:

- We propose a theory-driven deep active learning method, dynamicAL, inspired by the observation of "train faster, generalize better". To this end, we introduce the Training Dynamics as a proxy to describe the training process.
- We demonstrate that the convergence speed of training and the generalization performance are strongly (positively) correlated under the ultra-wide condition; we also show that maximizing the training dynamics will lead to a lower generalization error in the scenario of active learning.
- Our method is easy to implement. We conduct extensive experiments to evaluate the effectiveness of dynamicAL and empirically show that our method consistently outperforms other methods in a wide range of active learning settings.

2 Background

Notation. We use the random variable $x \in \mathcal{X}$ to represent the input data feature and $y \in \mathcal{Y} = [K]$ as the label, where $K$ is the number of classes and $[K] := \{1, 2, \dots, K\}$. We are given a non-degenerate data source $\mathcal{D}$ with unknown distribution $p(x, y)$. We further denote the concatenation of $x$ as $X = [x_1, x_2, \dots, x_M]^\top$ and that of $y$ as $Y = [y_1, y_2, \dots, y_M]^\top$. We consider a deep learning classifier $h_\theta(x) = \operatorname{argmax} \sigma(f(x; \theta)) : \mathcal{X} \to \mathcal{Y}$ parameterized by $\theta \in \mathbb{R}^p$, where $\sigma(\cdot)$ is the softmax function and $f$ is a neural network. Let $\otimes$ be the Kronecker product and $I_K \in \mathbb{R}^{K \times K}$ be an identity matrix.

Active learning. The goal of active learning is to improve the learning efficiency of a model with a limited labeling budget. In this work, we consider the pool-based AL setup, where a finite data set $S = \{(x_l, y_l)\}_{l=1}^M$ with $M$ points is i.i.d. sampled from $p(x, y)$ as the (initial) labeled set. The AL model receives an unlabeled data set $U$ sampled from $p(x)$ and requests labels according to $p(y|x)$ for any $x \in U$ in each query round. There are $R$ rounds in total, and for each round, a query set $Q$ consisting of $b$ unlabeled samples can be queried. The total budget size is $B = b \cdot R$.

Neural Tangent Kernel. The Neural Tangent Kernel [12] has been widely applied to analyze the dynamics of neural networks. If a neural network is sufficiently wide, properly initialized, and trained by gradient descent with infinitesimal step size (i.e., gradient flow), then the neural network is equivalent to a kernel regression predictor with a deterministic kernel $\Theta(\cdot, \cdot)$, called the Neural Tangent Kernel (NTK). When minimizing the mean squared error loss, at iteration $t$, the dynamics of the neural network $f$ has a closed-form expression:
$$\frac{d f(X; \theta(t))}{dt} = -K_t(X, X)\,\big(f(X; \theta(t)) - Y\big), \qquad (1)$$
where $\theta(t)$ denotes the parameters of the neural network at iteration $t$, $K_t(X, X) \in \mathbb{R}^{|X|K \times |X|K}$ is called the empirical NTK, and $K_t^{i,j}(x, x') = \langle \nabla_\theta f^i(x; \theta(t)), \nabla_\theta f^j(x'; \theta(t)) \rangle$ is the inner product of the gradient of the $i$-th class probability and the gradient of the $j$-th class probability for two samples $x, x' \in X$ and $i, j \in [K]$. The time-variant kernel $K_t(\cdot, \cdot)$ is equivalent to the (time-invariant) NTK with high probability, i.e., if the neural network is sufficiently wide and properly initialized, then:
$$K_t(X, X) = \Theta(X, X) \otimes I_K. \qquad (2)$$
The final learned neural network at iteration $t$ is equivalent to the kernel regression solution with respect to the NTK [14]. For any input $x$ and training data $\{X, Y\}$ we have
$$f(x; \theta(t)) \approx \Theta(x, X)\, \Theta(X, X)^{-1} \big(I - e^{-\eta \Theta(X, X) t}\big)\, Y, \qquad (3)$$
where $\eta$ is the learning rate and $\Theta(x, X)$ is the NTK matrix between the input $x$ and all samples in the training data $X$.
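The kernel-regression view in Equation (3) can be made concrete with a few lines of linear algebra. Below is a minimal numpy sketch, assuming the NTK Gram matrices have already been computed (e.g., with an external NTK library); the function and argument names are illustrative rather than part of the method.

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential for e^{-eta * Theta(X, X) * t}

def ntk_regression_prediction(theta_xX, theta_XX, Y, eta, t):
    """Kernel-regression predictor of a wide network after time t of gradient flow (Eq. 3).

    theta_xX : (m, n) NTK between m test inputs and the n training inputs
    theta_XX : (n, n) NTK Gram matrix on the training set
    Y        : (n, K) one-hot training labels
    eta, t   : learning rate and training time
    """
    n = theta_XX.shape[0]
    decay = np.eye(n) - expm(-eta * t * theta_XX)            # I - e^{-eta * Theta * t}
    return theta_xX @ np.linalg.solve(theta_XX, decay @ Y)   # Theta(x, X) Theta(X, X)^{-1} (I - e^{...}) Y
```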
In Section 3.1, we introduce the notion of training dynamics, which can be used to describe the training process. Then, in Section 3.2, based on the training dynamics, we propose dynamicAL. In Section 3.3, we discuss the connection between dynamicAL and existing deep active learning methods.

3.1 Training dynamics

In this section, we introduce the notion of training dynamics. The cross-entropy loss over the labeled set $S$ is defined as:
$$L(S) = \sum_{(x_l, y_l) \in S} \ell(f(x_l; \theta), y_l) = -\sum_{(x_l, y_l) \in S} \sum_{i \in [K]} y_l^i \log \sigma^i(f(x_l; \theta)), \qquad (4)$$
where $\sigma^i(f(x; \theta)) = \frac{\exp(f^i(x; \theta))}{\sum_j \exp(f^j(x; \theta))}$. We first analyze the dynamics of the training loss, with respect to iteration $t$, on one labeled sample (the derivation is in Appendix A.1):
$$\frac{\partial \ell(f(x; \theta), y)}{\partial t} = \sum_{i \in [K]} \big(\sigma^i(f(x; \theta)) - y^i\big)\, \nabla_\theta f^i(x; \theta)^\top\, \partial_t \theta. \qquad (5)$$
For neural networks trained by gradient descent, if the learning rate $\eta$ is small, then $\partial_t \theta = \theta_{t+1} - \theta_t = -\eta \sum_{(x_l, y_l) \in S} \frac{\partial \ell(f(x_l; \theta), y_l)}{\partial \theta}$. Taking the partial derivative of the training loss with respect to the parameters, we have (the derivation of the following equation can be found in Appendix A.2):
$$\frac{\partial \ell(f(x; \theta), y)}{\partial \theta} = \sum_{j \in [K]} \big(\sigma^j(f(x; \theta)) - y^j\big)\, \nabla_\theta f^j(x; \theta). \qquad (6)$$
Therefore, we can further get the following result for the dynamics of the training loss:
$$\frac{\partial \ell(f(x; \theta), y)}{\partial t} = -\eta \sum_{i \in [K]} \sum_{j \in [K]} \big(\sigma^i(f(x; \theta)) - y^i\big) \sum_{(x_{l'}, y_{l'}) \in S} \nabla_\theta f^i(x; \theta)^\top \nabla_\theta f^j(x_{l'}; \theta)\, \big(\sigma^j(f(x_{l'}; \theta)) - y_{l'}^j\big). \qquad (7)$$
Furthermore, we define $d^i(X, Y) = \sigma^i(f(X; \theta)) - Y^i$, where $Y^i$ is the label vector of all samples for the $i$-th class. Then, the training dynamics (the rate at which the training loss decreases, with the constant factor $\eta$ absorbed) over the training set $S$, computed with the empirical NTK $K^{ij}(X, X)$, is denoted by $G(S) \in \mathbb{R}$:
$$G(S) = -\frac{1}{\eta} \sum_{(x_l, y_l) \in S} \frac{\partial \ell(f(x_l; \theta), y_l)}{\partial t} = \sum_{i \in [K]} \sum_{j \in [K]} d^i(X, Y)^\top\, K^{ij}(X, X)\, d^j(X, Y). \qquad (8)$$
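In practice, $G(S)$ need not be computed by materializing the empirical NTK: expanding Equation (8) with Equation (6) shows that $G(S)$ equals the squared Euclidean norm of the summed per-sample loss gradients, $\big\|\sum_{(x_l, y_l) \in S} \nabla_\theta \ell(f(x_l; \theta), y_l)\big\|^2$. The following PyTorch sketch evaluates it that way; the model and tensor handles are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def training_dynamics(model, X, Y):
    """Evaluate G(S) of Eq. (8) as the squared norm of the total loss gradient.

    X: (n, ...) batch of labeled inputs, Y: (n,) integer class labels.
    Because sum_{i,j} d_i^T K_ij d_j = || sum_l grad_theta l(f(x_l), y_l) ||^2,
    a single backward pass over the summed loss suffices.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    loss = F.cross_entropy(model(X), Y, reduction="sum")   # summed loss, as in Eq. (4)
    grads = torch.autograd.grad(loss, params)
    return sum(g.pow(2).sum().item() for g in grads)
```

This identity is also what makes the per-sample criterion derived in Section 3.3 cheap to evaluate.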
3.2 Active learning by activating training dynamics

Before we present dynamicAL, we state Proposition 1, which serves as the theoretical guidance for dynamicAL and will be proved in Section 4.

Proposition 1. For deep neural networks, converging faster leads to a lower worst-case generalization error.

Motivated by the connection between convergence speed and generalization performance, we propose the general-purpose active learning method dynamicAL, which aims to accelerate convergence by querying labels for unlabeled samples. As described in the previous section, the training dynamics can be used to describe the training process. Therefore, we employ the training dynamics as a proxy to design an active learning method. Specifically, at each query round, dynamicAL queries labels for the samples which maximize the training dynamics $G(S)$, i.e.,
$$Q^* = \operatorname*{argmax}_{Q \subseteq U} G(S \cup \bar{Q}), \quad \text{s.t. } |Q| = b, \qquad (9)$$
where $\bar{Q}$ is the corresponding data set for $Q$ with ground-truth labels. Notice that when applying the above objective in practice, we face two major challenges. First, $G(S \cup \bar{Q})$ cannot be directly computed, because the label information of unlabeled examples is not available before the query. Second, the subset selection problem can be computationally prohibitive if we enumerate all possible sets of size $b$. Therefore, we employ the following two relaxations, which allow this maximization problem to be solved with constant time complexity.

Pseudo labeling. To estimate the training dynamics, we use the predicted label $\hat{y}_u$ for each sample $x_u$ in the unlabeled data set $U$ to compute $G$. Note that the effectiveness of this adaptation has been demonstrated in recent gradient-based methods [11, 28], which compute the gradient as if the model's current prediction on the example were the true label. Therefore, the maximization problem in Equation (9) is changed to
$$Q^* = \operatorname*{argmax}_{Q \subseteq U} G(S \cup \widehat{Q}), \qquad (10)$$
where $\widehat{Q}$ is the corresponding data set for $Q$ with pseudo labels $\widehat{Y}_Q$.

Subset approximation. The subset selection problem of Equation (10) still requires enumerating all possible subsets of $U$ with size $b$, which is $O(n^b)$. We simplify the selection problem to the following problem without causing any change in the result:
$$\operatorname*{argmax}_{Q \subseteq U} G(S \cup \widehat{Q}) = \operatorname*{argmax}_{Q \subseteq U} \Delta(\widehat{Q}\,|\,S), \qquad (11)$$
where $\Delta(\widehat{Q}\,|\,S) = G(S \cup \widehat{Q}) - G(S)$ is defined as the change of training dynamics. We approximate the change of training dynamics caused by the query set $Q$ using the summation of the changes of training dynamics caused by each sample in the query set. Then the maximization problem can be converted to Equation (12), which can be solved by a greedy algorithm with $O(b)$ complexity:
$$Q^* = \operatorname*{argmax}_{Q \subseteq U} \sum_{(x, \hat{y}) \in \widehat{Q}} \Delta(\{(x, \hat{y})\}\,|\,S), \quad \text{s.t. } |Q| = b. \qquad (12)$$
To further show that the approximated result is reasonably good, we decompose the change of training dynamics as (derivation in Appendix A.4):
$$\Delta(\widehat{Q}\,|\,S) = \sum_{(x, \hat{y}) \in \widehat{Q}} \Delta(\{(x, \hat{y})\}\,|\,S) + \sum_{\substack{(x, \hat{y}), (x', \hat{y}') \in \widehat{Q} \\ (x, \hat{y}) \neq (x', \hat{y}')}} \sum_{i, j} d^i(x, \hat{y})^\top\, K^{ij}(x, x')\, d^j(x', \hat{y}'), \qquad (13)$$
where $K^{ij}(x, x')$ is the empirical NTK. The first term on the right-hand side is the approximated change of training dynamics. Then, we further define the Approximation Ratio (14), which measures the approximation quality:
$$R(\widehat{Q}\,|\,S) = \frac{\sum_{(x, \hat{y}) \in \widehat{Q}} \Delta(\{(x, \hat{y})\}\,|\,S)}{\Delta(\widehat{Q}\,|\,S)}. \qquad (14)$$
We empirically measure the expectation of the Approximation Ratio on two data sets with two different neural networks under three different batch sizes. As shown in Figure 4, the expectation $\mathbb{E}_{Q \subseteq U}\, R(\widehat{Q}\,|\,S) \approx 1$ when the model is converged. Therefore, the approximated result delivered by the greedy algorithm is close to the global optimal solution of the original maximization problem, Equation (10), especially when the model is converged. Based on the above two approximations, we present the proposed method dynamicAL in Algorithm 1. As described below, the algorithm starts by training a neural network $f(\cdot; \theta)$ on the initial labeled set $S$ until convergence. Then, for every unlabeled sample $x_u$, we compute the pseudo label $\hat{y}_u$ and the change of training dynamics $\Delta(\{(x_u, \hat{y}_u)\}\,|\,S)$. After that, dynamicAL queries labels for the top-$b$ samples causing the maximal change in training dynamics, trains the neural network on the extended labeled set, and repeats the process. Note that, to keep close to the theoretical analysis, re-initialization is not used after each query, which also enables dynamicAL to get rid of the computational overhead of retraining the deep neural network after every query.

Algorithm 1 Deep Active Learning by Leveraging Training Dynamics
Input: Neural network $f(\cdot; \theta)$, unlabeled sample set $U$, initial labeled set $S$, number of query rounds $R$, query batch size $b$.
for $r = 1$ to $R$ do
  Train $f(\cdot; \theta)$ on $S$ with cross-entropy loss until convergence.
  for $x_u \in U$ do
    Compute its pseudo label $\hat{y}_u = \operatorname{argmax} f(x_u; \theta)$.
    Compute $\Delta(\{(x_u, \hat{y}_u)\}\,|\,S)$.
  end for
  Select the $b$ query samples $Q$ with the highest values, and request their labels from the oracle.
  Update the labeled data set $S = S \cup \bar{Q}$.
end for
return Final model $f(\cdot; \theta)$.

3.3 Relation to existing methods

Although existing deep active learning methods are usually designed based on heuristic criteria, some of them have empirically shown their effectiveness [11, 29, 30]. Interestingly, we find that our theoretically motivated method dynamicAL has some connections with those existing methods from the perspective of the active learning criterion. The proposed active learning criterion in Equation (12) can be explicitly written as (derivation in Appendix A.5):
$$\Delta(\{(x_u, \hat{y}_u)\}\,|\,S) = \big\|\nabla_\theta \ell(f(x_u; \theta), \hat{y}_u)\big\|^2 + 2 \sum_{(x, y) \in S} \nabla_\theta \ell(f(x_u; \theta), \hat{y}_u)^\top\, \nabla_\theta \ell(f(x; \theta), y). \qquad (15)$$
Note: the first term on the right-hand side can be interpreted as the square of the gradient length (2-norm), which reflects the uncertainty of the model on the example and has been widely used as an active learning criterion in existing works [30, 11, 31]. The second term can be viewed as the influence function [32] with an identity Hessian matrix; a code sketch of this selection criterion is given below.
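The per-sample score in Equation (15) only needs the flattened loss gradient of the candidate (under its pseudo label) and the sum of loss gradients over the labeled set. A minimal PyTorch sketch of the greedy top-b selection follows; the helper names, the per-example loop, and the data containers are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def flat_loss_grad(model, x, y):
    """Flattened gradient of the cross-entropy loss for one example (y: scalar class-index tensor)."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = F.cross_entropy(model(x.unsqueeze(0)), y.view(1))
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def select_queries(model, labeled, unlabeled, b):
    """Greedy top-b selection with the criterion of Eq. (15).

    labeled:   list of (x, y) pairs with ground-truth labels
    unlabeled: list of inputs x_u
    Returns indices of the b unlabeled points with the largest approximate
    change of training dynamics Delta({(x_u, y_hat_u)} | S).
    """
    g_S = sum(flat_loss_grad(model, x, y) for x, y in labeled)      # sum of labeled gradients
    scores = []
    for x_u in unlabeled:
        with torch.no_grad():
            y_hat = model(x_u.unsqueeze(0)).argmax(dim=1)           # pseudo label
        g_u = flat_loss_grad(model, x_u, y_hat)
        scores.append(float(g_u.dot(g_u) + 2.0 * g_u.dot(g_S)))     # ||g_u||^2 + 2 <g_u, sum_S grad>
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:b]
```

In practice one would batch these per-sample gradient computations (e.g., with tooling such as BackPACK [54]) rather than loop over examples one by one.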
Recently, [29] empirically showed the effectiveness of using the influence function with an identity Hessian matrix as an active learning criterion. We hope our theoretical analysis can also shed some light on the interpretation of previous methods.

4 Theoretical analysis

In this section, we study the correlation between the convergence rate of the training loss and the generalization error under the ultra-wide condition [12, 13]. We define a measure named alignment to quantify the convergence rate and further show its connection with the generalization bound. The analysis provides a theoretical guarantee for the phenomenon of "Train Faster, Generalize Better", as well as for our active learning method dynamicAL, with a rigorous treatment. Finally, we show that the active learning proxy, training dynamics, is correlated with alignment, which indicates that increasing the training dynamics leads to a larger convergence rate and better generalization performance. We leave all proofs of theorems and details of the verification experiments to Appendices B and D, respectively.

4.1 Train faster provably generalize better

Given an ultra-wide neural network, gradient descent can achieve a near-zero training error [12, 33], and its generalization ability on unseen data can be bounded [13]. It has been shown that both the convergence and the generalization of a neural network can be analyzed using the NTK [13]. However, the question of what the relation between the convergence rate and the generalization bound is has not been answered. We formally give a solution by introducing the concept of alignment, which is defined as follows:

Definition 1 (Alignment). Given a data set $S = \{X, Y\}$, the alignment is a measure of the correlation between $X$ and $Y$ projected in the NTK space. In particular, the alignment can be computed by $A(X, Y) = \operatorname{Tr}[Y^\top \Theta(X, X)\, Y] = \sum_{k=1}^{K} \sum_{i=1}^{n} \lambda_i (\mathbf{v}_i^\top Y^k)^2$.

In the following, we demonstrate why "Train Faster" leads to "Generalize Better" through alignment. In particular, the relation of the convergence rate and the generalization bound with alignment is analyzed. The convergence rate of gradient descent for ultra-wide networks is presented in the following lemma:

Lemma 1 (Convergence Analysis with NTK, Theorem 4.1 of [13]). Suppose $\lambda_0 = \lambda_{\min}(\Theta) > 0$ for all subsets of data samples. For $\delta \in (0, 1)$, if $m = \Omega\big(\frac{n^7}{\lambda_0^4 \delta^4 \epsilon^2}\big)$ and $\eta = O\big(\frac{\lambda_0}{n^2}\big)$, then with probability at least $1 - \delta$ the network can achieve near-zero training error:
$$\big\|Y - f(X; \theta(t))\big\|_2 = \sqrt{\sum_{k=1}^{K} \sum_{i=1}^{n} (1 - \eta \lambda_i)^{2t}\, (\mathbf{v}_i^\top Y^k)^2}\; \pm\; \epsilon, \qquad (16)$$
where $n$ denotes the number of training samples and $m$ denotes the width of the hidden layers. The NTK admits the eigendecomposition $\Theta = V \Lambda V^\top$, where $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ is a diagonal matrix of eigenvalues and $V = [\mathbf{v}_1, \dots, \mathbf{v}_n]$ is a unitary matrix.

In this lemma, we take the mean square error (MSE) loss as an example for convenience of illustration. The conclusion can be extended to other loss functions such as the cross-entropy loss (see Appendix B.2 in [14]). From the lemma, we find the convergence rate is governed by the dominant term in (16), $E_t(X, Y) = \sqrt{\sum_{k=1}^{K} \sum_{i=1}^{n} (1 - \eta \lambda_i)^{2t}\, (\mathbf{v}_i^\top Y^k)^2}$, which is correlated with the alignment:

Theorem 1 (Relationship between the convergence rate and alignment). Under the same assumptions as in Lemma 1, the convergence rate described by $E_t$ satisfies
$$\operatorname{Tr}[Y^\top Y] - 2t\eta\, A(X, Y) \;\le\; E_t^2(X, Y) \;\le\; \operatorname{Tr}[Y^\top Y] - \eta\, A(X, Y). \qquad (17)$$

Remark 1. In the above theorem, we demonstrate that the alignment can measure the convergence rate. In particular, both the upper bound and the lower bound on the error $E_t(X, Y)$ decrease as the alignment grows, which implies that higher alignment leads to faster convergence.
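Given a concrete NTK Gram matrix, the alignment of Definition 1 is straightforward to evaluate. The numpy sketch below shows both the trace form and the equivalent spectral form; it assumes one-hot labels and a precomputed Gram matrix.

```python
import numpy as np

def alignment(theta_XX, Y_onehot):
    """Alignment A(X, Y) = Tr[Y^T Theta(X, X) Y] from Definition 1.

    theta_XX : (n, n) NTK Gram matrix, Y_onehot : (n, K) one-hot labels.
    """
    return float(np.trace(Y_onehot.T @ theta_XX @ Y_onehot))

def alignment_spectral(theta_XX, Y_onehot):
    """Equivalent spectral form: sum_k sum_i lambda_i (v_i^T Y^k)^2."""
    lam, V = np.linalg.eigh(theta_XX)      # columns of V are the eigenvectors v_i
    proj = V.T @ Y_onehot                  # (n, K) entries v_i^T Y^k
    return float(np.sum(lam[:, None] * proj ** 2))
```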
Now we analyze the generalization performance of the proposed method through a complexity analysis. We demonstrate that ultra-wide networks can achieve a reasonable generalization bound.

Lemma 2 (Generalization bound with NTK, Theorem 5.1 of [13]). Suppose the data $S = \{(x_i, y_i)\}_{i=1}^{n}$ are i.i.d. samples from a non-degenerate distribution $p(x, y)$, and $m \ge \operatorname{poly}(n, \lambda_0^{-1}, \delta^{-1})$. Consider any loss function $\ell: \mathbb{R} \times \mathbb{R} \to [0, 1]$ that is 1-Lipschitz. Then, with probability at least $1 - \delta$ over the random initialization, the network trained by gradient descent for $T \ge \Omega\big(\frac{1}{\eta \lambda_0} \log \frac{n}{\delta}\big)$ iterations has population risk $L_p = \mathbb{E}_{(x, y) \sim p(x, y)}[\ell(f_T(x; \theta), y)]$ that is bounded as follows:
$$L_p \le \sqrt{\frac{2\, \operatorname{Tr}[Y^\top \Theta^{-1}(X, X)\, Y]}{n}} + O\left(\sqrt{\frac{\log \frac{n}{\lambda_0 \delta}}{n}}\right). \qquad (18)$$
In this lemma, we see that the dominant term in the generalization upper bound is $B(X, Y) = \sqrt{\frac{2\, \operatorname{Tr}[Y^\top \Theta^{-1} Y]}{n}}$. In the following theorem, we further prove that this bound is inversely proportional to the alignment $A(X, Y)$.

Theorem 2 (Relationship between the generalization bound and alignment). Under the same assumptions as in Lemma 2, if we define the generalization upper bound as $B(X, Y) = \sqrt{\frac{2\, \operatorname{Tr}[Y^\top \Theta^{-1} Y]}{n}}$, then it can be bounded with the alignment as follows:
$$\frac{2\, \operatorname{Tr}^2[Y^\top Y]}{n\, A(X, Y)} \;\le\; B^2(X, Y) \;\le\; \frac{\lambda_{\max}}{\lambda_{\min}} \cdot \frac{2\, \operatorname{Tr}^2[Y^\top Y]}{n\, A(X, Y)}. \qquad (19)$$

Remark 2. Theorems 1 and 2 reveal that the cause of the correlated phenomena "Train Faster" and "Generalize Better" is the projection of the label vector on the NTK space (the alignment).

4.2 Train Faster, Generalize Better for active learning

Figure 1: Comparison between Empirical Generalization Bound and MMD.

In the NTK framework [13], the empirical average requires the data in $S$ to be i.i.d. samples (Lemma 2). However, this assumption may not hold in the active learning setting with multiple query rounds, because the training data is composed of the i.i.d. sampled initial labeled set and the samples queried by the active learning policy. To extend the previous analysis principle to active learning, we follow [34] and reformulate Lemma 2 as:
$$L_p \le (L_p - L_q) + \sqrt{\frac{2\, \operatorname{Tr}[Y^\top \Theta^{-1}(X, X)\, Y]}{n}} + O\left(\sqrt{\frac{\log \frac{n}{\lambda_0 \delta}}{n}}\right), \qquad (20)$$
where $L_q = \mathbb{E}_{(x, y) \sim q(x, y)}[\ell(f(x; \theta), y)]$, $q(x, y)$ denotes the data distribution after the query, and $X, Y$ include the initial training samples and the samples after the query. There is a new term in the upper bound, which is the difference between the true risks under the two data distributions:
$$L_p - L_q = \mathbb{E}_{(x, y) \sim p(x, y)}[\ell(f(x; \theta), y)] - \mathbb{E}_{(x, y) \sim q(x, y)}[\ell(f(x; \theta), y)]. \qquad (21)$$
Though in active learning the data distribution of the labeled samples may differ from the original distribution, they share the same conditional probability $p(y|x)$. We define $g(x) = \int_y \ell(f(x; \theta), y)\, p(y|x)\, dy$, and then we have:
$$L_p - L_q = \int_x g(x)\, p(x)\, dx - \int_x g(x)\, q(x)\, dx. \qquad (22)$$
To measure the distance between the two distributions, we employ the Maximum Mean Discrepancy (MMD) with the neural tangent kernel [35] (derivation in Appendix B.3):
$$L_p - L_q \le \operatorname{MMD}(S_0, S, \mathcal{H}_\Theta) + O\left(\sqrt{\frac{C}{|S|}}\right). \qquad (23)$$
Slightly overloading the notation, we denote the initial labeled set by $S_0$, by $\mathcal{H}_\Theta$ the Reproducing Kernel Hilbert Space associated with the NTK $\Theta$, and assume $\Theta(x, x') \le C$ for all $x, x' \in S$. Note that $\operatorname{MMD}(S_0, S, \mathcal{H}_\Theta)$ is the empirical measure of $\operatorname{MMD}(p(x), q(x), \mathcal{H}_\Theta)$. We empirically compute the MMD and the dominant term of the generalization upper bound $B$ under the active learning setting with our method dynamicAL.
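The empirical MMD in Equation (23) can be estimated directly from an NTK Gram matrix over the labeled pool. Below is a minimal sketch of the standard biased estimator; the convention that the first $n_0$ rows of the Gram matrix correspond to $S_0$ is an assumption made for illustration.

```python
import numpy as np

def empirical_mmd(K, n0):
    """Biased empirical MMD between S0 (first n0 samples) and S (all samples)
    under a precomputed kernel Gram matrix K, e.g., the NTK Theta on S.

    MMD^2 = mean(K[S0, S0]) - 2 * mean(K[S0, S]) + mean(K[S, S]).
    """
    mmd_sq = K[:n0, :n0].mean() - 2.0 * K[:n0, :].mean() + K.mean()
    return float(np.sqrt(max(mmd_sq, 0.0)))
```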
As shown in Figure 1, on CIFAR10 with a CNN target model (three convolutional layers with global average pooling), an initial labeled set of size $|S| = 500$, query round $R = 1$, and budget size $b \in \{250, 500, 1000\}$, we observe that, under different active learning settings, the MMD is always much smaller than $B$. Besides, we further investigate the MMD and $B$ for $R \ge 2$ and observe similar results. Therefore, Lemma 2 still holds for the target model trained with dynamicAL. More results and discussions for $R \ge 2$ are in Appendix E.4, and the computation details of the MMD and the NTK are in Appendix D.1.

4.3 Alignment and training dynamics in active learning

Figure 2: Relation between Alignment and Training Dynamics.

In this section, we show the relationship between the alignment and the training dynamics. To be consistent with the previous theoretical analysis (Theorems 1 and 2), we use the training dynamics with the mean square error under the ultra-wide condition, which can be expressed as $G_{\mathrm{MSE}}(S) = \operatorname{Tr}\big[(f(X; \theta) - Y)^\top\, \Theta(X, X)\, (f(X; \theta) - Y)\big]$. Due to the limited space, we leave the derivation to Appendix A.3. To further quantitatively evaluate the correlation between $G_{\mathrm{MSE}}(S \cup Q)$ and $A(X \cup X_Q, Y \cup Y_Q)$, we utilize the Kendall $\tau$ coefficient [36] to empirically measure their relation. As shown in Figure 2, for a CNN on CIFAR10 in the active learning setting with $|S| = 500$ and $|Q| = 250$, there is a strong agreement between $G_{\mathrm{MSE}}(S \cup Q)$ and $A(X \cup X_Q, Y \cup Y_Q)$, which further indicates that increasing the training dynamics will lead to faster convergence and better generalization performance. More details about this verification experiment are in Appendix D.2.
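A minimal sketch of this verification step is given below: it evaluates $G_{\mathrm{MSE}}$ from a Gram matrix and rank-correlates it with the alignment via the Kendall $\tau$ coefficient. The numeric values in the usage example are placeholders for illustration, not results from the paper.

```python
import numpy as np
from scipy.stats import kendalltau

def g_mse(theta_XX, preds, Y_onehot):
    """G_MSE(S) = Tr[(f(X; theta) - Y)^T Theta(X, X) (f(X; theta) - Y)]."""
    resid = preds - Y_onehot                      # (n, K) residuals f(X; theta) - Y
    return float(np.trace(resid.T @ theta_XX @ resid))

# Rank-correlate G_MSE(S u Q) with A(X u X_Q, Y u Y_Q) over candidate query sets.
g_values = [12.3, 9.1, 15.7, 11.0]    # placeholder G_MSE values, one per candidate set
a_values = [30.2, 25.9, 35.1, 28.4]   # placeholder alignment values for the same sets
tau, p_value = kendalltau(g_values, a_values)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```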
5 Experiments

5.1 Experiment setup

Baselines. We compare dynamicAL with the following baselines: Random, Coreset, Confidence Sampling (Conf), Margin Sampling (Marg), Entropy, Active Learning by Learning (ALBL), and Batch Active learning by Diverse Gradient Embeddings (BADGE). A description of the baseline methods is in Appendix E.1.

Data sets and Target Model. We evaluate all methods on three benchmark data sets, namely CIFAR10 [23], SVHN [24], and Caltech101 [25]. We use accuracy as the evaluation metric and report the mean value over 5 runs. We consider three neural network architectures: vanilla CNN, ResNet18 [26], and VGG11 [27]. For each model, we keep the hyper-parameters used in their official implementations. More information about the implementation is in Appendix C.1.

Active Learning Protocol. Following the previous evaluation protocol [11], we compare all active learning methods in a batch-mode setup with an initial set size $M = 500$ for all three data sets and a batch size $b$ varying over $\{250, 500, 1000\}$. For the test set, we use the benchmark splits of CIFAR10 [23] and SVHN [24], and sample 20% from each class to form the test set for Caltech101 [25].

5.2 Results and analysis

The main experimental results are provided as plots due to the limited space. We also provide tables reporting the mean and standard deviation for each plot in Appendix E.3.

Overall results. The average test accuracy at each query round is shown in Figure 3. Our method dynamicAL consistently outperforms the other methods for all query rounds. This suggests that dynamicAL is a good choice regardless of the labeling budget. We also notice that dynamicAL works well on data sets with a large number of classes, such as Caltech101. However, the previous state-of-the-art method, BADGE, cannot be scaled up to those data sets, because its required memory is linear in the number of classes. Besides, because dynamicAL depends on pseudo labeling, a relatively large initial labeled set can provide advantages for dynamicAL. Therefore, it is important to examine whether dynamicAL can work well with a small initial labeled set. As shown in Figure 3, dynamicAL is able to work well with a relatively small initial labeled set ($M = 500$). Due to the limited space, we only show the results under three different settings in Figure 3. More evaluation results are in Appendix E.2. Moreover, although the re-initialization trick makes dynamicAL deviate from the dynamics analysis, we investigate its effect on dynamicAL and provide the empirical observations and analysis in Appendix E.5.

Figure 3: Active learning test accuracy versus the number of query rounds for a range of conditions (CIFAR10 + CNN, SVHN + VGG, and Caltech101 + ResNet; query batch size 500, initial set size 500).

Effect of query size and query round. Given the total label budget $B$, increasing the query size always decreases the number of query rounds. We study the influence of different query sizes and query rounds on dynamicAL from two perspectives. First, we study the expected approximation ratio with different query batch sizes on different data sets. As shown in Figure 4, under different settings the expected approximation ratio always converges to 1 as the number of training epochs increases, which further indicates that the query set selected by using the approximated change of training dynamics is a reasonably good solution to the query set selection problem. Second, we study the influence of the number of query rounds on the actual performance of the target models. The performance of different target models on different data sets with total budget size $B = 1000$ is shown in Table 1. For a fixed query budget, our active learning algorithm can be further improved if more query rounds are allowed.

Figure 4: The Expectation of the Approximation Ratio with different query batch sizes $b \in \{250, 500, 1000\}$ (one panel shows CIFAR10 + ResNet).

Table 1: Accuracy (%) of dynamicAL with different query batch sizes $b$ (total budget $B = 1000$).

| Setting | CIFAR10+CNN | CIFAR10+ResNet | SVHN+VGG | Caltech101+ResNet |
|---|---|---|---|---|
| R = 10, b = 100 | 36.84 | 40.92 | 76.34 | 37.06 |
| R = 4, b = 250 | 36.72 | 40.78 | 75.26 | 36.48 |
| R = 2, b = 500 | 36.71 | 40.46 | 74.10 | 35.91 |
| R = 1, b = 1000 | 36.67 | 40.09 | 70.04 | 33.82 |

Comparison with different variants. The active learning criterion of dynamicAL can be written as $\|\nabla_\theta \ell(f(x_u; \theta), \hat{y}_u)\|^2 + \gamma \sum_{(x, y) \in S} \nabla_\theta \ell(f(x_u; \theta), \hat{y}_u)^\top \nabla_\theta \ell(f(x; \theta), y)$. We empirically show the performance for $\gamma \in \{0, 1, 2, \infty\}$ in Figure 5. With $\gamma = 0$, the criterion is close to the expected gradient length method [31]. With $\gamma = \infty$, the selected samples are the same as those selected by using the influence function with an identity Hessian matrix as the criterion [29]. As shown in Figure 5, the model achieves the best performance with $\gamma = 2$, which is aligned with the value indicated by the theoretical analysis (Equation (15)). The result confirms the importance of theoretical analysis for the design of deep active learning methods. A sketch of this $\gamma$-weighted variant is given below.
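For completeness, the $\gamma$-weighted variant can be scored with the same flattened gradients used in the selection sketch of Section 3.3; the helper below is an illustrative assumption, not the authors' code.

```python
import torch

def variant_score(g_u: torch.Tensor, g_S: torch.Tensor, gamma: float) -> float:
    """Score an unlabeled point under the gamma-weighted criterion.

    g_u : flattened loss gradient at the unlabeled point (with its pseudo label)
    g_S : sum of flattened loss gradients over the labeled set S
    gamma = 0 gives an expected-gradient-length-style score, gamma = 2 recovers
    Eq. (15), and a very large gamma approaches the influence-function-with-
    identity-Hessian criterion.
    """
    return float(g_u.dot(g_u) + gamma * g_u.dot(g_S))
```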
6 Related work

Neural Tangent Kernel (NTK): Recent studies have shown that, under proper conditions, an infinite-width neural network can be simplified to a linear model with the Neural Tangent Kernel (NTK) [12]. Since then, the NTK has become a powerful theoretical tool to analyze the behavior of deep learning architectures (CNN, GNN, RNN) [33, 37, 38], random initialization [39], stochastic neural networks [40], and graph neural networks [41] from their output dynamics, and to characterize the convergence and generalization error [13]. Besides, [15] studies the finite-width NTK, aiming to make the NTK more practical.

Figure 5: Test Accuracy of different variants (CIFAR10 + CNN, SVHN + VGG, and CIFAR10 + ResNet; query batch size 500).

Active Learning: Active learning aims to interactively query labels for unlabeled data points to maximize model performance [2]. Among others, there are two popular strategies for active learning, i.e., diversity sampling [42, 43, 44] and uncertainty sampling [45, 46, 47, 11, 48, 49, 29]. Recently, several papers proposed to use the gradient to measure uncertainty [49, 11, 29]. However, those methods need to compute the gradient for each class, and thus they can hardly be applied to data sets with a large number of classes. Besides, recent works [50, 51] leverage the NTK to analyze contextual bandits with streaming data, which are hard to apply to our pool-based setting.

7 Conclusion

In this work, we bridge the gap between the theoretical findings on deep neural networks and real-world deep active learning applications. By exploring the connection between the generalization performance and the training dynamics, we propose a theory-driven method, dynamicAL, which selects samples to maximize the training dynamics. We prove that the convergence speed of training and the generalization performance are strongly (positively) correlated under the ultra-wide condition, and we show that maximizing the training dynamics will lead to a lower generalization error. Empirically, our work shows that dynamicAL not only consistently outperforms strong baselines across various settings, but also scales well on large deep learning models.

8 Acknowledgment

This work is supported by the National Science Foundation (IIS-1947203, IIS-2117902, IIS-2137468, IIS-2134079, and CNS-2125626), a joint ACES-ICGA funding initiative via USDA Hatch ILLU802-946, and Agriculture and Food Research Initiative (AFRI) grant no. 2020-67021-32799/project accession no. 1024178 from the USDA National Institute of Food and Agriculture. The views and conclusions are those of the authors and should not be interpreted as representing the official policies of the funding agencies or the government.

References

[1] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
[2] Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison, 2009.
[3] Steve Hanneke et al. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131-309, 2014.
[4] David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.
[5] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78-89, 2009.
[6] Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In International Conference on Machine Learning, pages 208-215, 2008.
[7] Maria-Florina Balcan, Andrei Z. Broder, and Tong Zhang. Margin based active learning. In Conference on Learning Theory, volume 4539, pages 35-50, 2007.
[8] Maria-Florina Balcan and Philip M. Long. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory, volume 30, pages 288-316, 2013.
[9] Mina Karzand and Robert D. Nowak. Active learning in the overparameterized and interpolating regime. CoRR, abs/1905.12782, 2019.
[10] Andreas Kirsch, Joost van Amersfoort, and Yarin Gal. BatchBALD: Efficient and diverse batch acquisition for deep bayesian active learning. In Neural Information Processing Systems, pages 7024-7035, 2019.
[11] Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In International Conference on Learning Representations, 2020.
[12] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: convergence and generalization in neural networks. In Neural Information Processing Systems, pages 8580-8589, 2018.
[13] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, volume 97, pages 322-332, 2019.
[14] Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Neural Information Processing Systems, pages 8570-8581, 2019.
[15] Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. In International Conference on Learning Representations, 2020.
[16] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems, 32:10836-10846, 2019.
[17] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, volume 48, pages 1225-1234, 2016.
[18] Tongliang Liu, Gábor Lugosi, Gergely Neu, and Dacheng Tao. Algorithmic stability and hypothesis complexity. In International Conference on Machine Learning, volume 70, pages 2159-2167, 2017.
[19] Clare Lyle, Lisa Schut, Robin Ru, Yarin Gal, and Mark van der Wilk. A bayesian perspective on training speed and model selection. In Neural Information Processing Systems, 2020.
[20] Binxin Ru, Clare Lyle, Lisa Schut, Mark van der Wilk, and Yarin Gal. Revisiting the train loss: an efficient performance estimator for neural architecture search. arXiv preprint arXiv:2006.04492, 2020.
[21] Keyulu Xu, Mozhi Zhang, Stefanie Jegelka, and Kenji Kawaguchi. Optimization of graph neural networks: Implicit acceleration by skip connections and more depth. In International Conference on Machine Learning, volume 139, pages 11592-11602, 2021.
[22] Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, Bernhard Schölkopf, and Alex J Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49-e57, 2006.
[23] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[24] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning, 2011.
[25] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pages 178-178. IEEE, 2004.
[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[27] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[28] Fangzhou Mu, Yingyu Liang, and Yin Li. Gradients as features for deep representation learning. In International Conference on Learning Representations, 2020.
[29] Zhuoming Liu, Hao Ding, Huaping Zhong, Weijia Li, Jifeng Dai, and Conghui He. Influence selection for active learning. arXiv preprint arXiv:2108.09331, 2021.
[30] Jiaji Huang, Rewon Child, Vinay Rao, Hairong Liu, Sanjeev Satheesh, and Adam Coates. Active learning for speech recognition: the power of gradients. arXiv preprint arXiv:1612.03226, 2016.
[31] Megh Shukla. EGL++: Extending expected gradient length to active learning for human pose estimation. arXiv preprint arXiv:2104.09493, 2021.
[32] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, volume 70, pages 1885-1894, 2017.
[33] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Neural Information Processing Systems, pages 8139-8148, 2019.
[34] Zheng Wang and Jieping Ye. Querying discriminative and representative samples for batch mode active learning. ACM Transactions on Knowledge Discovery from Data, 9(3):1-23, 2015.
[35] Sheng Jia, Ehsan Nezhadarya, Yuhuai Wu, and Jimmy Ba. Efficient statistical tests: A neural tangent kernel approach. In International Conference on Machine Learning, volume 139, pages 4893-4903, 2021.
[36] Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81-93, 1938.
[37] Simon S Du, Kangcheng Hou, Barnabás Póczos, Ruslan Salakhutdinov, Ruosong Wang, and Keyulu Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. arXiv preprint arXiv:1905.13192, 2019.
[38] Sina Alemohammad, Zichao Wang, Randall Balestriero, and Richard Baraniuk. The recurrent neural tangent kernel. In International Conference on Learning Representations, 2020.
[39] Wei Huang, Weitao Du, and Richard Yi Da Xu. On the neural tangent kernel of deep networks with orthogonal initialization. arXiv preprint arXiv:2004.05867, 2020.
[40] Wei Huang, Chunrui Liu, Yilan Chen, Tianyu Liu, and Richard Yi Da Xu. Demystify optimization and generalization of over-parameterized pac-bayesian learning. arXiv preprint arXiv:2202.01958, 2022.
[41] Wei Huang, Yayong Li, Weitao Du, Richard Yi Da Xu, Jie Yin, Ling Chen, and Miao Zhang. Towards deepening graph neural networks: A gntk-based optimization perspective. arXiv preprint arXiv:2103.03113, 2021.
[42] Bo Du, Zengmao Wang, Lefei Zhang, Liangpei Zhang, Wei Liu, Jialie Shen, and Dacheng Tao. Exploring representativeness and informativeness for active learning. IEEE Transactions on Cybernetics, 47(1):14-26, 2015.
[43] Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John C. Duchi, Vittorio Murino, and Silvio Savarese. Generalizing to unseen domains via adversarial data augmentation. In Neural Information Processing Systems, pages 5339-5349, 2018.
[44] Fedor Zhdanov. Diverse mini-batch active learning. arXiv preprint arXiv:1901.05954, 2019.
[45] Nicholas Roy and Andrew McCallum. Toward optimal active learning through sampling estimation of error reduction. In International Conference on Machine Learning, pages 441-448, 2001.
[46] Jingbo Zhu and Matthew Ma. Uncertainty-based active learning with instability estimation for text classification. ACM Transactions on Speech and Language Processing (TSLP), 8(4):1-21, 2012.
[47] Yazhou Yang and Marco Loog. Active learning using uncertainty information. In International Conference on Pattern Recognition, pages 2646-2651, 2016.
[48] Donggeun Yoo and In So Kweon. Learning loss for active learning. In IEEE Conference on Computer Vision and Pattern Recognition, pages 93-102, 2019.
[49] Burr Settles, Mark Craven, and Soumya Ray. Multiple-instance active learning. In Neural Information Processing Systems, pages 1289-1296, 2007.
[50] Pranjal Awasthi, Christoph Dann, Claudio Gentile, Ayush Sekhari, and Zhilei Wang. Neural active learning with performance guarantees. arXiv preprint arXiv:2106.03243, 2021.
[51] Yikun Ban, Yuheng Zhang, Hanghang Tong, Arindam Banerjee, and Jingrui He. Improved algorithms for neural active learning. 2022.
[52] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
[53] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723-773, 2012.
[54] Felix Dangel, Frederik Kunstner, and Philipp Hennig. BackPACK: Packing more into backprop. In International Conference on Learning Representations, 2020.
[55] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
[56] Alexandre Rame, Corentin Dancette, and Matthieu Cord. Fishr: Invariant gradient variances for out-of-distribution generalization. In International Conference on Machine Learning, pages 18347-18377. PMLR, 2022.
[57] Pranav Subramani, Nicholas Vadivelu, and Gautam Kamath. Enabling fast differentially private sgd via just-in-time compilation and vectorization. Advances in Neural Information Processing Systems, 34:26409-26421, 2021.
[58] Amir Zandieh, Insu Han, Haim Avron, Neta Shoham, Chaewon Kim, and Jinwoo Shin. Scaling neural tangent kernels via sketching and random features. arXiv preprint arXiv:2106.07880, 2021.
[59] R Botsch. Chapter 12: Significance and measures of association. Scopes and Methods of Political Science, 2011.
[60] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
[61] Dan Wang and Yi Shang. A new active labeling method for deep learning. In International Joint Conference on Neural Networks, pages 112-119, 2014.
[62] Dan Roth and Kevin Small. Margin-based active learning for structured output spaces. In Machine Learning: ECML 2006, 17th European Conference on Machine Learning, Berlin, Germany, September 18-22, 2006, Proceedings, volume 4212, pages 413-424, 2006.
[63] Wei-Ning Hsu and Hsuan-Tien Lin. Active learning by learning. In AAAI Conference on Artificial Intelligence, pages 2659-2665, 2015.
[64] Jaehoon Lee, Samuel Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, and Jascha Sohl-Dickstein. Finite versus infinite neural networks: an empirical study. Advances in Neural Information Processing Systems, 33:15156-15172, 2020.
[65] Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M Roy, and Surya Ganguli. Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. Advances in Neural Information Processing Systems, 33:5850-5861, 2020.
[66] Daniel S Park, Jaehoon Lee, Daiyi Peng, Yuan Cao, and Jascha Sohl-Dickstein. Towards nngp-guided neural architecture search. arXiv preprint arXiv:2011.06006, 2020.
[67] Wuyang Chen, Xinyu Gong, and Zhangyang Wang. Neural architecture search on imagenet in four gpu hours: A theoretically inspired perspective. arXiv preprint arXiv:2102.11535, 2021.
[68] Xiangning Chen, Cho-Jui Hsieh, and Boqing Gong. When vision transformers outperform resnets without pre-training or strong data augmentations. arXiv preprint arXiv:2106.01548, 2021.
[69] Aditya Deshpande, Alessandro Achille, Avinash Ravichandran, Hao Li, Luca Zancato, Charless Fowlkes, Rahul Bhotika, Stefano Soatto, and Pietro Perona. A linearized framework and a new benchmark for model selection for fine-tuning. arXiv preprint arXiv:2102.00084, 2021.