# An In-depth Study of Stochastic Backpropagation

Jun Fang, Mingze Xu, Hao Chen, Bing Shuai, Zhuowen Tu, Joseph Tighe
AWS AI Labs
{junfa, xumingze, hxen, bshuai, ztu, tighej}@amazon.com

Corresponding Author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).

In this paper, we provide an in-depth study of Stochastic Backpropagation (SBP) when training deep neural networks for standard image classification and object detection tasks. During backward propagation, SBP calculates gradients by using only a subset of feature maps to save GPU memory and computational cost. We interpret SBP as an efficient way to implement stochastic gradient descent by performing backpropagation dropout, which leads to significant memory savings and training run-time reduction, with a minimal impact on the overall model accuracy. We offer best practices for applying SBP when training image recognition models, which can be adopted in learning a wide range of deep neural networks. Experiments on image classification and object detection show that SBP can save up to 40% of GPU memory with less than 1% accuracy degradation. Code is available at: https://github.com/amazon-research/stochastic-backpropagation

1 Introduction

A common practice to improve accuracy when training deep neural networks is to increase the input resolution [19], model depth [7], and model width [21]. However, this can significantly increase the GPU memory usage, making training on devices with limited resources difficult. For example, training an object detector using ConvNeXt-Base [14] as the backbone with batch size 2 requires over 17 GB of memory, which cannot fit on many modern GPUs.

A number of prior works have explored memory-efficient training designs, especially for training large models on high-resolution images or videos. Mixed-precision training [17] uses lower precision to represent the network weights and activations for certain layers of the network. Gradient checkpointing [2] restores intermediate feature maps by recomputing nodes in the computation graph during the backward pass, which requires extra computation. Gradient accumulation [12] splits a mini-batch into smaller chunks to calculate gradients iteratively, which slows down the training process.

A recent method, Stochastic Backpropagation (SBP) [3], offers a new perspective on memory-efficient model training. It performs partial execution of gradients during backpropagation on randomly selected frames when training video models [13, 24, 23, 22]. It achieves considerable memory savings by caching only a subset of video frame feature maps, without significant accuracy loss. However, SBP in [3] is only applied to video-based models using two architectures, and its design space has not been thoroughly explored on a wider range of tasks. The explanation provided in [3] for the effectiveness of SBP is simple but narrow: it attributes the validity of the method to the high redundancy of video frames. This leaves a number of questions around SBP unanswered: (1) Is SBP's effectiveness limited to the temporal redundancy of video, or can it also leverage spatial redundancy effectively? (2) If so, can SBP be generalized to image-based tasks, which have more limited redundancy? (3) Is SBP generalizable to different networks (e.g., MLP, CNN, Transformer) and operators? (4) What are the key design choices when implementing SBP?
To answer these questions, we first formalize the SBP process and analyze the effect of SBP on the calculation of gradients. With this formulation, we show how to generalize the idea of SBP to image models and present an in-depth study of the design choices when using SBP. During backward propagation, SBP calculates gradients from a subset of spatial feature maps to save memory and computational cost. We observe that the gradients calculated by SBP are highly correlated with the standard SGD gradients, indicating that SBP is a reasonable approximation. We also provide a new, straightforward implementation of the idea that requires only a few lines of code. We explore several important design strategies, which have a noticeable impact on performance. We validate the generalizability of SBP on two common tasks (image classification and object detection) with two popular network architectures (ViT [6] and ConvNeXt [14]). We show that SBP can effectively train image models with less than 1% accuracy loss in a memory-efficient fashion.

2 Related Work

Since deep neural networks have expanded in both depth [7] and breadth [21] in recent years, the area of designing memory-efficient training methods has attracted a fair amount of attention. By saving and recomputing nodes in the computation graph during backpropagation, gradient checkpointing [2] can save a large amount of memory. Gradient accumulation [12] splits a large sample batch into several mini-batches and runs them iteratively without updating model parameters. This saves GPU memory proportionally without affecting accuracy, because accumulating the gradients of these sub-iterations and then updating the model parameters is identical to directly optimizing the model with the global batch size. However, these two methods slow down the training process. Multigrid [20] proposes to use different mini-batch shapes to speed up model training. Sparse networks [5] particularly target recognition tasks but can only save memory in theory. Sideways [15] and its follow-up [16] reduce the memory cost by overwriting activations whenever new ones become available, but they are limited to causal models. SBP [3] reduces a large portion of GPU memory by backpropagating gradients only from an incomplete execution for video models. In this paper, we generalize SBP to more standard computer vision tasks, including image recognition.

3 Understanding Stochastic Backpropagation (SBP)

In this section, we first formalize the process of Stochastic Backpropagation (SBP) (Sec. 3.1) and show how to compute the gradient for two common operations (Sec. 3.2). With these formulations established, we analyze how the chain rule of gradient calculation behaves under SBP (Sec. 3.3). Finally, we present a simple implementation of SBP in practice with only a few lines of code (Sec. 3.4).

3.1 Formulation of SBP

We first present a general formulation of SBP [3]. We denote a loss function $L(\Theta)$ to learn a model on a dataset $\mathcal{D} = \{d_j\}_{j=1}^{N}$, where $\Theta$ is the collection (vector, matrix, or tensor) of all parameters of the model, $N$ is the total number of training samples, and $d_j \in \mathbb{R}^{T \times H \times W \times C}$ is an input data sample, where $H$ and $W$ are the spatial height and width respectively, $C$ is the input channel size, $T = 1$ represents an image, and $T > 1$ represents a video. In each training iteration, the model with SBP processes a mini-batch of data samples $S \subset \mathcal{D}$ with $S \in \mathbb{R}^{B \times T \times H \times W \times C}$ to compute the loss after the forward pass, where $B$ is the batch size.
During the backward pass, different from traditional backpropagation that relies on the full feature maps $X_i \in \mathbb{R}^{B \times T_i \times H_i \times W_i \times C_i}$ ($i$ is the index of the layer), an incomplete execution of backpropagation (i.e., SBP) only utilizes a subset of the feature maps $X_i^{\text{sub}} \in \mathbb{R}^{B \times T_i^{\text{sub}} \times H_i^{\text{sub}} \times W_i^{\text{sub}} \times C_i}$ to approximate the gradient calculation. Specifically, it updates the model weights as

$$\Theta_{\text{iter}+1} = \Theta_{\text{iter}} - \eta \, \frac{1}{|X_i^{\text{sub}}|} \sum_{x_i^{\text{sub}} \in X_i^{\text{sub}}} \nabla_\Theta L(x_i^{\text{sub}}, \Theta_{\text{iter}}), \qquad (1)$$

where $\eta$ is the learning rate, $\nabla_\Theta$ denotes the model weight gradients, and $\Theta_{\text{iter}}$ and $\Theta_{\text{iter}+1}$ are the model weights at the current and next iteration, respectively. SBP computes all forward paths, but only caches a subset of feature maps in the backward pass. Note that the method in [3] is a special case of (1), as it operates on the temporal dimension by using a subset of video frames $X_i^{T^{\text{sub}}} \in \mathbb{R}^{B \times T_i^{\text{sub}} \times H_i \times W_i \times C_i}$.

3.2 Backward Gradient Calculation

Next, we derive equations to compute the gradient during the backward phase of SBP. We denote that at layer $i$ of the model, an operator $f_i$ with learnable weights $W_{f_i}$ processes the feature maps $X_i \in \mathbb{R}^{B \times T_i \times H_i \times W_i \times C_i^{\text{in}}}$, where $C_i^{\text{in}}$ refers to the input channel size.

3.2.1 Calculation of Linear Layers

We start by exploring $f_i$ as a point-wise convolutional (PW-Conv) layer. PW-Conv is widely used in state-of-the-art computer vision models such as ViT [6] and ConvNeXt [14]. It is equivalent to the linear layers in Transformer MLP blocks [14], and is also known as a channel-wise fully connected layer. It can be implemented either through a linear layer or a $1 \times 1$ convolutional layer. It is mainly used to enrich the channel information with weights $W_{f_i} \in \mathbb{R}^{C_i^{\text{in}} \times C_i^{\text{out}}}$, and the analysis on PW-Conv can be easily extended to other layers. The forward pass of this layer is (the bias term is omitted for simplicity):

$$X_{i+1} = f_i(X_i) = X_i W_{f_i}, \qquad (2)$$

where $X_{i+1} \in \mathbb{R}^{(B \cdot T_i \cdot H_i \cdot W_i) \times C_i^{\text{out}}}$ ($X_i$ and $X_{i+1}$ are reshaped into matrix format). During the traditional backward pass with full backpropagation, the gradients of both activations and weights are calculated by:

$$dX_{i+1} = \partial \text{Loss} / \partial X_{i+1}, \qquad dW_{f_i} = \partial \text{Loss} / \partial W_{f_i} = X_i^T \, dX_{i+1}, \qquad dX_i = \partial \text{Loss} / \partial X_i = dX_{i+1} W_{f_i}^T. \qquad (3)$$

This requires storing the weights $W_{f_i} \in \mathbb{R}^{C_i^{\text{in}} \times C_i^{\text{out}}}$ and the full feature maps $X_i \in \mathbb{R}^{(B \cdot T_i \cdot H_i \cdot W_i) \times C_i^{\text{in}}}$ to compute the gradients. In many cases, $B \cdot T_i \cdot H_i \cdot W_i \gg C_i^{\text{out}}$, i.e., the caching of feature maps dominates the memory usage. In order to effectively save GPU memory during training, SBP only caches a subset of the feature maps for backpropagation.

In the backward pass of the SBP process, we split the full feature maps $X_i$ into two subsets $X_i^{\text{keep}}$ and $X_i^{\text{drop}}$, which denote the corresponding indices of the sampled feature maps that are kept and dropped, respectively. An example of $X_i^{\text{keep}}$ can be $X_i[:, \text{even index on } T, :, :, :]$ for video input, or $X_i[:, :, \text{even index on } H, \text{odd index on } W, :]$ for image input (e.g., Fig. 3 left), and $X_i^{\text{drop}}$ is the complementary part of the feature maps. We have $X_i^{\text{keep}} \cap X_i^{\text{drop}} = \emptyset$ and $X_i^{\text{keep}} \cup X_i^{\text{drop}} = X_i$. We denote the keep-ratio $r$ as the number of gradient-kept indices $j \in Z_i^{\text{keep}}$ over the number of all indices. One advantage of simplifying the SBP problem by using the PW-Conv operator is that nodes in the spatial or temporal dimensions can be calculated independently through the forward and backward passes.
Therefore, we can write the forward pass separately on the kept and dropped indices:

$$X_{i+1} = [X_{i+1}^{\text{keep}}, X_{i+1}^{\text{drop}}] = [X_i^{\text{keep}}, X_i^{\text{drop}}] W_{f_i}, \quad \text{i.e.,} \quad X_{i+1}^{\text{keep}} = X_i^{\text{keep}} W_{f_i}, \quad X_{i+1}^{\text{drop}} = X_i^{\text{drop}} W_{f_i}, \qquad (4)$$

as well as the backward pass:

$$dX_{i+1} = [dX_{i+1}^{\text{keep}}, dX_{i+1}^{\text{drop}}], \qquad dX_i^{\text{keep}} = dX_{i+1}^{\text{keep}} W_{f_i}^T, \qquad dX_i^{\text{drop}} = dX_{i+1}^{\text{drop}} W_{f_i}^T,$$
$$dW_{f_i} = X_i^T \, dX_{i+1} = [X_i^{\text{keep}}, X_i^{\text{drop}}]^T [dX_{i+1}^{\text{keep}}, dX_{i+1}^{\text{drop}}] = (X_i^{\text{keep}})^T dX_{i+1}^{\text{keep}} + (X_i^{\text{drop}})^T dX_{i+1}^{\text{drop}}. \qquad (5)$$

SBP drops all gradient calculations with superscript "drop" and keeps all gradient calculations with superscript "keep"; thus it saves memory and computation on the dependencies of the dropped indices. In other words, the memory usage and the backward computation cost are reduced to a keep-ratio fraction of the original case. Mathematically, it is equivalent to setting $dX_{i+1}^{\text{drop}} = 0$ (thus $dX_i^{\text{drop}} = 0$) and using $dW_{f_i}^{\text{keep}} = (X_i^{\text{keep}})^T dX_{i+1}^{\text{keep}}$ as a stochastic approximation of the original weight gradients $dW_{f_i}$ to update the model weights.

3.2.2 Calculation of Convolutional Layers

We then derive the calculation of a general convolutional layer (Conv Layer) as $f_i$. Unlike PW-Conv, nodes of a general Conv Layer with kernel size $> 1$ are no longer independent across the spatial or temporal dimensions in the forward and backward calculations. Assume that we apply a convolutional kernel $k \times k$ with stride $s$. The regular backpropagation of the Conv Layer is calculated as follows:

$$dX_i = \text{zeropad}(dX_{i+1}) * \tilde{W}_{f_i}, \qquad dW_{f_i} = X_i * dX_{i+1}, \qquad (6)$$

where $*$ is the convolution operation, zeropad pads the edge of the matrix with zeros, and $\tilde{W}_{f_i}$ represents the 180° rotation of $W_{f_i}$. From Eq. (6), in the gradient calculation of a Conv Layer at a given location $j$ on the input feature map, $x_i^j \in X_i$, all the elements of $dX_{i+1}$ inside the region of the kernel size $k$ contribute to the gradient calculation of $dx_i^j$. Therefore, in SBP for a Conv Layer, when we set $dX_{i+1}^{\text{drop}} = 0$ for certain areas, two things will in general happen: (a) even if $j \in Z^{\text{drop}}$ and $dx_{i+1}^j = 0$, $dx_i^j$ will not be 0, due to the non-zero gradient contributions of the neighbors of $j$ on $dX_{i+1}$; (b) $dx_i^j$ will be an approximated gradient unless all $dx_{i+1}^p$ with $p$ in the neighborhood of $j$ in terms of the kernel size $k$ (i.e., indices $p$ within the convolutional kernel of index $j$) are kept and there is no chain-rule effect on $dx_{i+1}^p$. However, there is a special scenario: when $s \geq k$, the neighbor effects disappear, because $dX_{i+1}$ needs to be zero-interweaved in the gradient calculation. As a result, nodes in the spatial or temporal dimensions are patch-wise independent, which is similar to the case of PW-Conv.

3.2.3 Discussion

From the above derivation, we see that the approximation effect of SBP behaves differently as the operator changes. For layers such as a linear layer or PW-Conv, the activation gradients (i.e., $dx_i$) are either 0 or exact.² For convolutional layers with kernel size larger than 1, most of the gradients on the activations are neither 0 nor exact. Due to space limitations, we only derive SBP on the two most popular operators. For derivations of other operators, please refer to the supplementary material.

²By exact we mean the exact calculation of the gradient for the given mini-batch, though it is still an estimation due to SGD.
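To make this distinction concrete, the following is a minimal PyTorch sketch (ours, not the authors' code; the layer sizes, the mask, and the helper names are illustrative) that emulates SBP by zeroing the output gradient $dX_{i+1}$ at dropped spatial positions and comparing the result against full backpropagation for a $1 \times 1$ convolution (PW-Conv) and a $3 \times 3$ convolution with stride 1. Consistent with the derivation, the activation gradients of the point-wise layer are exact at the kept positions, whereas those of the $3 \times 3$ kernel are only approximate there; in both cases the SBP weight gradient is a correlated partial estimate of the dense one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, C, H, W = 2, 8, 16, 16
x = torch.randn(B, C, H, W)

# grid-wise keep mask over spatial positions (keep-ratio 0.5): keep every other column
mask = torch.zeros(1, 1, H, W)
mask[..., ::2] = 1.0

def grads(layer, zero_drop):
    """Return (dW, dX) for a toy loss, with the upstream gradient dX_{i+1}
    optionally zeroed at the dropped spatial positions (the SBP approximation)."""
    layer.weight.grad = None
    xi = x.clone().requires_grad_(True)
    y = layer(xi)
    g = torch.ones_like(y) * (mask if zero_drop else 1.0)  # upstream gradient dX_{i+1}
    (y * g).sum().backward()                               # injects g as the gradient w.r.t. y
    return layer.weight.grad.clone(), xi.grad.clone()

for name, layer in [("1x1 conv (PW-Conv)", nn.Conv2d(C, C, 1, bias=False)),
                    ("3x3 conv, stride 1", nn.Conv2d(C, C, 3, padding=1, bias=False))]:
    dW_sbp, dX_sbp = grads(layer, zero_drop=True)
    dW_full, dX_full = grads(layer, zero_drop=False)
    keep = mask.bool().expand_as(dX_full)
    exact = bool(torch.allclose(dX_sbp[keep], dX_full[keep]))
    cos = F.cosine_similarity(dW_sbp.flatten(), dW_full.flatten(), dim=0).item()
    print(f"{name} | dX exact at kept positions: {exact} | cos(dW_SBP, dW_full): {cos:.3f}")
```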
3.3 Chain Rule Effect

We have derived the calculation of SBP on a single layer in the previous section. In this section, we derive the calculation of SBP on a stack of multiple layers. For simplicity, we take two layers as an example, which can be easily generalized to more layers.

3.3.1 Two PW-Conv Layers

We first consider a scenario with two PW-Conv layers. We assume the layer before $f_i$, i.e., $f_{i-1}$, is another PW-Conv layer with weights $W_{f_{i-1}}$. The forward pass of this layer is (the bias term is also omitted):

$$X_i = f_{i-1}(X_{i-1}) = X_{i-1} W_{f_{i-1}}. \qquad (7)$$

Then we can calculate the gradients of the weights and activations in layer $f_{i-1}$ similarly to Eq. (3):

$$dX_{i-1} = dX_i W_{f_{i-1}}^T = dX_{i+1} W_{f_i}^T W_{f_{i-1}}^T, \qquad dW_{f_{i-1}} = X_{i-1}^T dX_i = X_{i-1}^T dX_{i+1} W_{f_i}^T. \qquad (8)$$

We first consider the case when there is no additional gradient dropout at layer $f_{i-1}$. Since $dX_{i+1}$ has non-zero gradients at the kept indices (i.e., $dX_{i+1}^{\text{keep}}$) and zero gradients at the dropped indices (i.e., $dX_{i+1}^{\text{drop}}$), Eq. (8) can be updated to:

$$dX_{i-1} = [dX_{i-1}^{\text{keep}}, dX_{i-1}^{\text{drop}}], \quad dX_{i-1}^{\text{keep}} = dX_{i+1}^{\text{keep}} W_{f_i}^T W_{f_{i-1}}^T, \quad dX_{i-1}^{\text{drop}} = dX_{i+1}^{\text{drop}} W_{f_i}^T W_{f_{i-1}}^T = 0,$$
$$dW_{f_{i-1}} = (X_{i-1}^{\text{keep}})^T dX_{i+1}^{\text{keep}} W_{f_i}^T. \qquad (9)$$

From this equation, we can see that if no additional gradient dropout is applied at layer $f_{i-1}$, the keep and drop sets transfer identically from $X_{i+1}$ to $X_{i-1}$ through the chain rule.

If we apply gradient dropout at layer $f_{i-1}$ as well, we split $dX_i$ into $dX_i^{\text{keep}}$ and $dX_i^{\text{drop}}$. Assume that the keep and drop index sets at layer $i+1$ are $Z_{i+1}^{\text{keep}}$ and $Z_{i+1}^{\text{drop}}$, and at layer $i$ they are $Z_i^{\text{keep}}$ and $Z_i^{\text{drop}}$, respectively. Then, through Eq. (8) and Eq. (9), the keep and drop index sets at layer $i-1$ become:

$$Z_{i-1}^{\text{keep}} = Z_{i+1}^{\text{keep}} \cap Z_i^{\text{keep}}, \qquad Z_{i-1}^{\text{drop}} = Z_{i+1}^{\text{drop}} \cup Z_i^{\text{drop}}. \qquad (10)$$

Eq. (9) becomes:

$$dX_{i-1} = [dX_{i-1}^{\text{keep}}, dX_{i-1}^{\text{drop}}], \quad dX_{i-1}^{\text{keep}} = dX_{i+1}^{\text{keep}} W_{f_i}^T W_{f_{i-1}}^T, \quad dX_{i-1}^{\text{drop}} = 0, \quad dW_{f_{i-1}} = (X_{i-1}^{\text{keep}})^T dX_{i+1}^{\text{keep}} W_{f_i}^T, \qquad (11)$$

where keep $= Z_{i-1}^{\text{keep}}$ and drop $= Z_{i-1}^{\text{drop}}$.

3.3.2 PW-Conv + Conv Layers

Now let us look at the chain-rule effect in the case where $f_{i-1}$ is a Conv Layer with kernel size $k > 1$ and stride $s$. Using the chain rule, we can update Eq. (6) as follows:

$$dX_{i-1} = \text{zeropad}(dX_i) * \tilde{W}_{f_{i-1}} = \text{zeropad}(dX_{i+1} W_{f_i}^T) * \tilde{W}_{f_{i-1}}, \qquad dW_{f_{i-1}} = X_{i-1} * dX_i = X_{i-1} * (dX_{i+1} W_{f_i}^T). \qquad (12)$$

We can see from the above equation that each $dx_{i-1}^j$ receives backward gradients from the neighbors of $dx_i^j$, which might be partially dropped out; thus the gradient at $dx_{i-1}^j$ is no longer exact. From Eq. (12), the dropout at layer $i$ makes the matrix $dX_i$ sparse (only indices in $Z_{i+1}^{\text{keep}}$ are non-zero). If we also conduct dropout at the current layer $f_{i-1}$ (i.e., we have $Z_i^{\text{keep}}$ and $Z_i^{\text{drop}}$), the matrix $dX_{i-1}$ will become even sparser unless the dropout happens at the same locations, $Z_{i+1}^{\text{drop}} = Z_i^{\text{drop}}$. If we stack PW-Conv + Conv Layers further and further in an interweaving way (i.e., $Z_{i+1}^{\text{drop}} \neq Z_i^{\text{drop}}$), the chain-rule effect may lead to a very sparse output derivative matrix (e.g., $dX_{i-1}$). However, stacking Conv Layers (due to space limitations, we skip the mathematical derivation) does not have such a trend, as gradient dropout in a Conv Layer only creates approximated gradients but not many zeros, because of the neighbor effect in the gradient calculation.

3.3.3 Discussion

Overall, in terms of the chain-rule effect, we can see from Eq. (11) that stacking PW-Conv layers, especially in an interweaving way, may lead to a sparse keep set $Z_{i-1}^{\text{keep}}$ and a vanishing gradient of $dW_{f_{i-1}}$. Thus, in practice, we recommend dropping gradients at the same positions across layers to avoid gradient vanishing and maintain high accuracy. Stacking PW-Conv + Conv Layers (Eq. (12)) also tends to make the keep set sparse and thus should be avoided. While stacking Conv Layers does not have such an effect, it stacks non-exact gradients at almost every position and tends to make the gradients gradually drift from the dense gradients.
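The recommendation to drop gradients at the same positions across layers can be checked numerically. Below is a small PyTorch sketch (ours; the two-layer stack and the index sets are illustrative) that stacks two linear layers, the PW-Conv-equivalent case of Eq. (11), and zeroes each layer's backward gradient outside its own keep set via hooks. When the drop positions are aligned, the gradient reaching the input is non-zero exactly on the shared keep set; when the two layers drop at interleaved positions, the effective keep set is the intersection of the two keep sets, which here is empty, and the input gradient vanishes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_tokens, dim = 8, 4
x = torch.randn(n_tokens, dim, requires_grad=True)
f1 = nn.Linear(dim, dim, bias=False)   # plays the role of f_{i-1}
f2 = nn.Linear(dim, dim, bias=False)   # plays the role of f_i

def nonzero_input_tokens(keep_1, keep_2):
    """Apply SBP-style gradient dropout at both layers and count the token
    positions of the input that still receive a non-zero gradient."""
    x.grad = None
    h1 = f1(x)
    h1.register_hook(lambda g: g * keep_1.unsqueeze(1))   # zero dX_i outside Z_i^keep
    h2 = f2(h1)
    h2.register_hook(lambda g: g * keep_2.unsqueeze(1))   # zero dX_{i+1} outside Z_{i+1}^keep
    h2.sum().backward()
    return int((x.grad.abs().sum(dim=1) > 0).sum())

even = torch.zeros(n_tokens)
even[0::2] = 1.0                       # keep even token positions
odd = torch.zeros(n_tokens)
odd[1::2] = 1.0                        # keep odd token positions

print("aligned drop positions:    ", nonzero_input_tokens(even, even), "tokens with non-zero dX")  # 4
print("interleaved drop positions:", nonzero_input_tokens(even, odd), "tokens with non-zero dX")   # 0
```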
3.4 Efficient Implementation

With a deeper understanding of how SBP works, we provide a simple and efficient implementation of the SBP technique in Alg. 1. We point out that this is more efficient than the prior work [3], as there is (1) no need to recompute the forward pass during the backward pass, which makes it computationally more efficient, and (2) no need to manually cache the intermediate feature maps and fill in zero gradients in the backward pass, which makes it much simpler. The forward pass needs only a few lines of code, and the backward pass does not need to be implemented manually: it is handled seamlessly by the autograd engine of modern deep learning frameworks. It supports point-wise convolutional layers, dot-product attention layers, and the multi-layer perceptron (MLP) in popular networks including ViT [6] and its variants [6, 11, 13], ConvNeXt [14], and hybrid architectures [4].

Algorithm 1: PyTorch-like pseudocode of SBP for an arbitrary operation f.

```python
import torch

# f: an arbitrary operation (assumed to change only the channel / last dimension)
# grad_keep_idx: sampled indices where gradients are kept
# grad_drop_idx: sampled indices where gradients are dropped
def sbp_f(f, inputs, grad_keep_idx, grad_drop_idx):
    # forward on the kept indices; activations are cached and their gradients
    # will be calculated by torch.autograd
    with torch.enable_grad():
        kept_out = f(inputs[grad_keep_idx])
    # forward on the dropped indices without caching activations for backward
    with torch.no_grad():
        drop_out = f(inputs[grad_drop_idx])
    # initiate the full outputs and scatter both parts back to their positions
    outputs = kept_out.new_zeros(inputs.shape[:-1] + kept_out.shape[-1:])
    outputs[grad_keep_idx] = kept_out
    outputs[grad_drop_idx] = drop_out
    return outputs
```

Figure 1: Illustration of the effect of applying SBP to different network layers and with different keep-ratios. Left: accuracy when applying SBP to different percentages of layers (SBP keep-ratio 0.5). Middle: trade-off of keep-ratio versus accuracy, where every SBP layer uses the same keep-ratio. Right: comparison of uniform, decreasing, and increasing keep-ratios for the SBP layers.

Figure 2: Cosine similarity of weight gradients with and without SBP on ViT-Tiny, for uniform, decreasing, and increasing keep-ratios. Results for layer 20 (MHSA), 40 (MLP), 60 (LayerNorm), and 80 (MLP) are displayed; other layers show similar behavior.
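As a usage sketch of Alg. 1 above (ours, not the authors' released code; the tensor shapes and the checkerboard pattern are illustrative), the snippet below builds a grid-wise keep mask over the spatial positions with keep-ratio 0.5 and wraps the first linear layer of an MLP block with sbp_f. The forward output is unchanged, but only the kept positions cache activations for the backward pass. This assumes the sbp_f definition from Alg. 1 is in scope and that the wrapped operator only changes the channel (last) dimension.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, H, W, C = 2, 14, 14, 192                    # ViT-Tiny-like token grid and width
x = torch.randn(B, H, W, C, requires_grad=True)

# grid-wise keep mask over spatial positions, keep-ratio 0.5 (cf. Fig. 3 left);
# in practice the mask is re-sampled at every training step
keep = torch.zeros(B, H, W, dtype=torch.bool)
keep[:, ::2, ::2] = True
keep[:, 1::2, 1::2] = True
drop = ~keep

fc = nn.Linear(C, 4 * C)                       # first linear layer of an MLP block
out = sbp_f(fc, x, keep, drop)                 # same values as fc(x); only the kept
                                               # positions cache activations
out.sum().backward()

print(out.shape)                               # torch.Size([2, 14, 14, 768])
print(torch.allclose(out, fc(x)))              # True: the forward pass is unchanged
```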
4 Design Strategies

In this section, we investigate the effectiveness of different design strategies for SBP on image tasks. We use the ViT-Tiny [6] model and evaluate top-1 accuracy on the ImageNet [18] validation set. ViT-Tiny is a smaller version of the ViT [6] variants: it uses patch size 16 with 12 transformer blocks (more than 80 layers), an embedding dimension of 192, and 3 heads, i.e., each head has dimension 64. The experiments in this section are trained for 100 epochs on the ImageNet dataset. The conclusions remain consistent for larger models and longer training schedules.

How many layers should SBP be applied to? To answer this question, we progressively apply SBP to transformer layers from the last layers to the early layers and present the results in Fig. 1 Left. As shown, we observe a sweet spot when applying SBP to 8 out of the 12 block layers, which reduces GPU memory consumption by 33% with an accuracy drop within 1.5%. Although applying SBP to all transformer layers can save more GPU memory, it causes a large accuracy drop. Therefore, unless stated otherwise, we apply SBP to 2/3 of all layers of ViT models for the rest of this work.

Figure 3: Left two: an example of the gradient keep mask for grid-wise sampling and random sampling. Right two: cosine similarity of weight gradients with and without SBP on ViT-Tiny, where SBP is applied with a keep-ratio of 0.5 using these two mask sampling methods.

How much of the gradients should be kept in SBP? In this section, we discuss the trade-off between the gradient keep-ratio and accuracy in SBP. As a reminder, the keep-ratio $r$ is the number of gradient-kept indices over the number of all indices, and it represents the percentage of spatial feature map values used to calculate the gradients. In general, the higher the keep-ratio, the more gradient information is preserved and the higher the accuracy, but the less memory is saved. A natural question arises: what is a good keep-ratio to balance the accuracy drop and the memory saving? In Fig. 1 Middle, we show the trade-off between model accuracy and the keep-ratio of gradients. In this case, we apply the same keep-ratio to all SBP layers at the same gradient-kept indices. We name this strategy "uniform keep-ratio". We observe that setting the keep-ratio to 0.5 achieves a reasonable trade-off, while a keep-ratio of 0.25 is too aggressive and results in a dramatic accuracy drop (3.04%). We also explore different keep-ratio schedules by linearly increasing or decreasing the keep-ratios from early to late layers while keeping their average keep-ratio at 0.5. Specifically, we use keep-ratios of [0.25, 0.32, 0.39, 0.46, 0.53, 0.60, 0.68, 0.75] (increasing) or its inverse (decreasing) on the 8 transformer blocks. As shown in Fig. 1 Right, both models perform significantly worse.

We simulate the gradient calculation on ViT-Tiny with 2000 data batches randomly sampled from the ImageNet training dataset. We apply the different keep-ratio methods to the same model with the same fixed weights, and calculate the correlation (measured by cosine similarity) between the weight gradients with SBP and the original exact weight gradients, i.e., $\text{Cosine}(dW_{\text{SBP}}, dW_{\text{no SBP}})$. From Fig. 2, we observe that the weight gradients have a stronger correlation when using uniform keep-ratios compared to increasing or decreasing keep-ratios. We believe that keeping the locations of the dropped gradients consistent between different layers, as uniform keep-ratios do, helps preserve the gradient information at the kept spatial locations. In contrast, non-uniform keep-ratios drop gradients at different locations in different layers, which adds noise, produces a less accurate estimate of the original gradients, and hence lowers accuracy.
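The correlation study above can be reproduced in miniature. The sketch below (ours; the single linear layer, the simulated upstream gradient, and the random masks stand in for the ViT-Tiny layers and real data batches) computes $\text{Cosine}(dW_{\text{SBP}}, dW_{\text{no SBP}})$ for one layer by zeroing the upstream gradient at dropped token positions; in this toy setting the correlation grows with the keep-ratio, in line with the trade-off discussed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch, n_tokens, dim = 8, 196, 192
layer = nn.Linear(dim, dim, bias=False)
x = torch.randn(batch, n_tokens, dim)
g_up = torch.randn(batch, n_tokens, dim)           # a fixed, simulated upstream gradient dX_{i+1}

def weight_grad(keep_mask=None):
    """dW for the layer, optionally zeroing the upstream gradient at dropped tokens."""
    layer.weight.grad = None
    y = layer(x)
    g = g_up if keep_mask is None else g_up * keep_mask.unsqueeze(-1)
    (y * g).sum().backward()                       # injects g as the gradient w.r.t. y
    return layer.weight.grad.clone()

dW_full = weight_grad()
for r in (0.75, 0.5, 0.25):
    keep = (torch.rand(batch, n_tokens) < r).float()   # random keep mask with keep-ratio r
    cos = F.cosine_similarity(weight_grad(keep).flatten(), dW_full.flatten(), dim=0)
    print(f"keep-ratio {r}: cos(dW_SBP, dW_no_SBP) = {cos.item():.3f}")
```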
How to sample the gradient keep mask? SBP calculates the gradients by using only a subset of feature maps. A key question is how to sample this subset, i.e., the mask of kept gradients. If we fix the spatial locations of the gradient-kept indices for every training step, the backward pass only propagates valid (non-zero) gradients at these fixed locations, which forces the model to learn from fixed partial spatial information and therefore severely hurts model performance. A better strategy is to randomly sample the gradient-kept indices at each training step, so that every spatial location is statistically visited with equal importance. We examine two mask sampling strategies: grid-wise mask sampling and random mask sampling; see Fig. 3 left for an example with keep-ratio 0.5. We report the accuracy in Fig. 4 left. Overall, grid-wise mask sampling achieves 0.5%-1.7% higher accuracy than random mask sampling. In Fig. 3, a stronger correlation is also observed when using grid-wise sampling. The more structured grid-wise sampling appears to provide a more accurate estimate of the gradients. Unless stated otherwise, we use grid-wise sampling in this paper.

Figure 4: Left: effect of different sampling methods for the gradient keep mask. Right: effect of different SBP dropping methods on Multi-Head Self-Attention (MHSA).

Figure 5: Cosine similarity of weight gradients for different methods of applying SBP to ViT-Tiny MHSA layers with a keep-ratio of 0.5. Left: Drop QKV; middle: Drop Query only; right: Drop Head.

How to apply SBP on MHSA? In general, a transformer block consists of two sub-blocks: Multi-Head Self-Attention (MHSA) and a Multi-Layer Perceptron (MLP). Theoretically, the memory usage of applying SBP on the MLP (two linear layers) is proportional to the keep-ratio $r$. However, the memory saving of applying SBP on MHSA depends on the sampling method for the query, key, and value (QKV), and on the ratio of the head dimension $d$ to the number of tokens $n$. From [3], the activation maps needed for gradient calculation in MHSA include the inputs and outputs ($2hdn$), the QKV vectors (each with $hdn$), and the attention weight maps ($2hn^2$, before and after the softmax), thus requiring a memory of $3hdn + 2hn^2$ in total, where $h$ is the number of heads. Dropping gradients on the query only (Drop Query only) or on all of QKV (Drop QKV) gives the following memory usage ratios:

$$\text{Drop Query only: } \frac{2hdn + r\,hdn + 2r\,hn^2}{3hdn + 2hn^2} = \frac{(2+r)\tfrac{d}{n} + 2r}{3\tfrac{d}{n} + 2}, \qquad \text{Drop QKV: } \frac{3r\,hdn + 2r^2 hn^2}{3hdn + 2hn^2} = r\,\frac{3\tfrac{d}{n} + 2r}{3\tfrac{d}{n} + 2}.$$

In video transformers, the number of tokens $n$ is much larger than the head dimension $d$ (e.g., in the video Swin Transformer [13], $n = 392$ and $d = 32$, so $d/n = 0.082$), and thus the attention weight maps $2hn^2$ comprise most of the memory.
Therefore, dropping gradients on the query only is good enough to save memory in the video model, making its memory roughly linearly proportional to the keep-ratio $r$ [3]. However, in image models such as ViT with a typical input size of 224 and patch size 16, $n = 14 \times 14 = 196$ and $d = 64$, so $d/n = 0.327$. With keep-ratio $r = 0.5$ (or $r = 0.25$), dropping gradients on the query alone uses $0.61\times$ (or $0.42\times$) of the memory, while dropping all of QKV reduces the memory usage ratio to $0.46\times$ (or $0.22\times$). From Fig. 4, dropping gradients on all of QKV also achieves better accuracy. We also report the performance of dropping gradients along the attention heads (Drop Head), but we did not observe an accuracy improvement. Hence, we apply SBP on all of QKV for image-based transformers.

We plot the cosine similarity between the weight gradients with SBP and the original exact gradients in Fig. 5. One interesting observation is that Drop QKV has a slightly higher correlation than Drop Query only and a much higher correlation than Drop Head. We believe that an accurate estimate of the gradients may be an important factor in the performance of models trained with SBP, which provides insight for further investigation.

5 Generalizability of SBP

We evaluate the generalizability of our proposed SBP on two computer vision benchmarks: image classification on ImageNet [18] and object detection on COCO [10]. In order to make a fair comparison for the same network, we keep the training hyper-parameters (optimizer, augmentation, regularization, batch size, and learning rate) the same for a given model, and only adopt different stochastic depth augmentation [9] for different model sizes. Mixed Precision Training [17] is enabled for faster training. All experiments are conducted on machines with 8 Tesla V100 16GB GPUs. Each accuracy number is reported as the average of three runs with different random seeds. For more details of the experimental settings, please refer to the supplementary material.

Table 1: Accuracy and memory results of applying SBP to ViT and ConvNeXt on ImageNet.

Network | Keep-ratio | Batch size | Memory (MB / GPU) | Top-1 accuracy (%)
ViT-Tiny | no SBP | 256 | 8248 | 73.68
ViT-Tiny | 0.5 | 256 | 5587 (0.68×) | 73.09 (-0.59)
ViT-Base | no SBP | 64 | 10083 | 81.22
ViT-Base | 0.5 | 64 | 7436 (0.74×) | 80.62 (-0.60)
ConvNeXt-Tiny | no SBP | 128 | 12134 | 82.1
ConvNeXt-Tiny | 0.5 | 128 | 7059 (0.58×) | 81.61 (-0.49)
ConvNeXt-Base | no SBP | 64 | 14130 | 83.8
ConvNeXt-Base | 0.5 | 64 | 8758 (0.62×) | 83.27 (-0.53)

5.1 Classification on ImageNet

We report the memory and accuracy trade-off in Tab. 1 for two state-of-the-art networks: the transformer-based ViT [6] and the convolution-based ConvNeXt [14]. We train both networks for 300 epochs following the training recipe of ConvNeXt [14]. For ViT models, we apply SBP on the last 8 transformer blocks, including the MHSA and MLP layers, with a keep-ratio of 0.5, reducing GPU memory usage by approximately 30%. Note that the memory saving ratio of the network is not proportional to the drop-ratio (1 - keep-ratio) because we only apply SBP on 2/3 of all layers. With a keep-ratio of 0.5, SBP can still learn effectively, with an accuracy drop of only 0.59% and 0.60% for ViT-Tiny and ViT-Base, respectively.

ConvNeXt builds its basic block with a depth-wise convolutional (DW-Conv) layer with kernel size 7×7 followed by two point-wise convolutional (PW-Conv) layers,³ and the DW-Conv layer consumes only a small portion of GPU memory (less than 10%). It has multi-scale stages, where the early stages have higher redundancy and occupy most of the GPU memory. Therefore, we apply SBP on the PW-Conv and down-sampling layers in the first three stages, and do not apply it on the last stage. With a keep-ratio of 0.5, ConvNeXt-Tiny and ConvNeXt-Base can efficiently learn the model with a GPU memory saving above 40% and an accuracy drop within 0.53%.

³The two PW-Conv layers are equivalent to an MLP block in transformers.
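As a rough back-of-the-envelope check of the ViT memory numbers (our approximation, assuming cached activations dominate training memory and are spread roughly evenly across the 12 transformer blocks), applying SBP with keep-ratio $r = 0.5$ to 8 of the 12 blocks keeps about

$$1 - \frac{8}{12}\,(1 - r) = 1 - \frac{8}{12} \cdot 0.5 \approx 0.67$$

of the original memory, close to the measured $0.68\times$ for ViT-Tiny in Tab. 1. The larger ratio for ViT-Base ($0.74\times$) plausibly reflects memory that SBP does not reduce, such as weights, optimizer states, and the layers left at full backpropagation, while the smaller ratios for ConvNeXt are consistent with the SBP-covered early stages holding most of the activation memory.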
5.2 Object Detection and Segmentation on COCO

We fine-tune the vanilla Mask R-CNN [8] and Cascade Mask R-CNN [1] on the COCO dataset [10] with ConvNeXt-Tiny and ConvNeXt-Base backbones pretrained on ImageNet-1K, respectively. Following the same hyper-parameter settings as [14], we apply SBP on the ConvNeXt backbones to save training memory. Results in Tab. 2 show that SBP with a keep-ratio of 0.5 can still learn the detection task reasonably well, with a 0.2% and 0.7% loss on the box and mask average precision (AP) for the ConvNeXt-Base and ConvNeXt-Tiny backbones, respectively, while consuming only about 0.7× of the GPU memory. Note that training detection models on high-resolution images (up to 800×1333) is memory intensive even for a very small batch size. For example, with a ConvNeXt-Base backbone and a batch size of 2, model training requires 17.4GB of GPU memory, which may lead to an out-of-memory error on common devices such as a 16GB Tesla V100. With the help of SBP, the model can still be effectively trained within 12.5GB of GPU memory, which is much more feasible for many research groups. Here we apply SBP only in the backbone; we believe there is potential to apply SBP to more layers of the detection model to achieve even more promising results in terms of memory efficiency.

Table 2: COCO object detection and segmentation results using Mask R-CNN with a ConvNeXt-T backbone and Cascade Mask R-CNN with a ConvNeXt-B backbone.

Backbone | Keep-ratio | Batch size | Memory (GB / GPU) | AP^box | AP^box_50 | AP^box_75 | AP^mask | AP^mask_50 | AP^mask_75
ConvNeXt-T | no SBP | 2 | 8.6 | 46.2 | 67.9 | 50.8 | 41.7 | 65.0 | 44.9
ConvNeXt-T | 0.5 | 2 | 5.9 (0.69×) | 45.5 | 67.4 | 50.1 | 41.1 | 64.4 | 44.1
ConvNeXt-B | no SBP | 2 | 17.4 | 52.7 | 71.3 | 57.2 | 45.6 | 68.9 | 49.5
ConvNeXt-B | 0.5 | 2 | 12.5 (0.72×) | 52.5 | 71.3 | 57.2 | 45.4 | 68.7 | 49.2

6 Discussion

In this work, we present a comprehensive study of the Stochastic Backpropagation (SBP) mechanism with a generalized implementation. We analyze the effect of different design strategies to optimize the trade-off between accuracy and memory. We show that our approach can reduce up to 40% of the GPU memory when training image recognition models with various deep learning backbones.

Limitations: In general, SBP is a memory-efficient training method, but it still causes a slight loss of accuracy. In our current results, SBP produces only a small training speedup (~1.1×). How to further speed up the training process is part of our future work.

Societal impact: Our work studies the stochastic backpropagation mechanism, which can be used for training general deep learning models. Misuse can potentially cause societal harm. However, we believe our work is a general approach and is consistent with common practice in machine learning.

7 Acknowledgments and Disclosure of Funding

We thank the anonymous reviewers for their helpful suggestions. This work was funded by Amazon.

References

[1] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, 2018.
[2] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv:1604.06174, 2016.
[3] Feng Cheng, Mingze Xu, Yuanjun Xiong, Hao Chen, Xinyu Li, Wei Li, and Wei Xia. Stochastic backpropagation: A memory efficient strategy for training video models. In CVPR, 2022.
[4] Zihang Dai, Hanxiao Liu, Quoc V. Le, and Mingxing Tan. CoAtNet: Marrying convolution and attention for all data sizes. In NeurIPS, 2021.
[5] Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. arXiv:1907.04840, 2019.
[6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929, 2020.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[8] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[9] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[10] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[11] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[12] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Official implementation of Swin Transformer. https://github.com/microsoft/Swin-Transformer, 2021. [Online; accessed 12-Sep-2022].
[13] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video Swin Transformer. arXiv:2106.13230, 2021.
[14] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In CVPR, 2022.
[15] Mateusz Malinowski, Grzegorz Swirszcz, Joao Carreira, and Viorica Patraucean. Sideways: Depth-parallel training of video models. In CVPR, 2020.
[16] Mateusz Malinowski, Dimitrios Vytiniotis, Grzegorz Swirszcz, Viorica Patraucean, and Joao Carreira. Gradient forward-propagation for large-scale temporal video modelling. In CVPR, 2021.
[17] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv:1710.03740, 2017.
[18] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
[19] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
[20] Chao-Yuan Wu, Ross Girshick, Kaiming He, Christoph Feichtenhofer, and Philipp Krähenbühl. A multigrid method for efficiently training video models. In CVPR, 2020.
[21] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
[22] Mengmeng Xu, Chen Zhao, David S. Rojas, Ali Thabet, and Bernard Ghanem. G-TAD: Sub-graph localization for temporal action detection. In CVPR, 2020.
[23] Mingze Xu, Mingfei Gao, Yi-Ting Chen, Larry S. Davis, and David J. Crandall. Temporal recurrent networks for online action detection. In ICCV, 2019.
[24] Mingze Xu, Yuanjun Xiong, Hao Chen, Xinyu Li, Wei Xia, Zhuowen Tu, and Stefano Soatto. Long short-term transformer for online action detection. In NeurIPS, 2021.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] See Section 6.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] See Section 6.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Section 5's first paragraph and Section 8.3 in the supplementary material.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 5's first paragraph and Section 5.1's first paragraph.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] Each accuracy we report is the average of three runs with different random seeds.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Section 5's first paragraph and Section 8.3 in the supplementary material.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] We cited all the datasets we used, such as ImageNet [18] and COCO [10].
(b) Did you mention the license of the assets? [No] ImageNet is under the BSD 3-Clause License and COCO is under the Creative Commons Attribution 4.0 License.
(c) Did you include any new assets either in the supplemental material or as a URL? [No]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [No]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No] We do not have sufficient resources to exhaustively screen the datasets we used for offensive content. Because the datasets only contain images of people performing daily activities and of common objects, we expect that they are unlikely to contain offensive content. They are both very widely used.
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]