# online_continual_learning_via_logit_adjusted_softmax__9cb643b6.pdf

Published in Transactions on Machine Learning Research (05/2024)

Online Continual Learning via Logit Adjusted Softmax

Zhehao Huang kinght_h@sjtu.edu.cn Institute of Image Processing and Pattern Recognition Shanghai Jiao Tong University

Tao Li li.tao@sjtu.edu.cn Institute of Image Processing and Pattern Recognition Shanghai Jiao Tong University

Chenhe Yuan vernunft@sjtu.edu.cn Department of Automation Shanghai Jiao Tong University

Yingwen Wu yingwen_wu@sjtu.edu.cn Institute of Image Processing and Pattern Recognition Shanghai Jiao Tong University

Xiaolin Huang xiaolinhuang@sjtu.edu.cn Institute of Image Processing and Pattern Recognition Shanghai Jiao Tong University

Reviewed on OpenReview: https://openreview.net/forum?id=MyQKcQAte6

Online continual learning is a challenging problem where models must learn from a non-stationary data stream while avoiding catastrophic forgetting. Inter-class imbalance during training has been identified as a major cause of forgetting, leading to model prediction bias towards recently learned classes. In this paper, we show theoretically that inter-class imbalance is entirely attributable to imbalanced class-priors, and that the function learned from intra-class intrinsic distributions is the optimal classifier that minimizes the class-balanced error. Accordingly, we show that a simple adjustment of model logits during training can effectively resist prior class bias and pursue the corresponding optimum. Our proposed method, Logit Adjusted Softmax, mitigates the impact of inter-class imbalance not only in class-incremental learning but also in realistic scenarios that combine class- and domain-incremental learning, with little additional computational cost. We evaluate our approach on various benchmarks and demonstrate significant performance improvements over prior art. For example, our approach improves the best baseline by 4.6% on CIFAR10. Code is available at https://github.com/K1nght/online_CL_logit_adjusted_softmax.

1 Introduction

Continual learning (CL) has emerged to equip deep learning models with the ability to handle multiple tasks on an unbounded data stream. This paper focuses on the online class-incremental (class-IL) (Lange et al., 2019) CL problem (Zhou et al., 2023), which holds high relevance to real-world applications (Wang et al., 2022). In online CL, also known as task-free CL, data is obtained from an unknown non-stationary stream for single-pass training. Class-IL learning, in contrast to task-IL learning, continuously introduces new classes to the model as the data stream distribution changes, without task-identiﬁers to assist classiﬁcation (van de Ven et al., 2022).

Catastrophic forgetting (McCloskey & Cohen, 1989; Ratcliff, 1990) is a major obstacle to deploying deep learning models in CL. Recent research attributes catastrophic forgetting to recency bias (Chrysakis & Moens, 2023), which causes deep neural networks to classify samples into currently learned classes. This bias in CL is similar to the dominance of head classes in long-tailed distribution learning (Menon et al., 2021). A vanilla model trained on a long-tailed distribution suffers from inter-class imbalance and tends to assign samples to the classes that hold the majority of samples. Previous works (Ahn et al., 2020) have also observed that one of the primary causes of catastrophic forgetting is inter-class imbalance throughout training. Growing attention to recency bias and inter-class imbalance has given rise to methods (Koh et al., 2022) that alleviate the negative impact of imbalance, among which rehearsal-based methods have recently been highly successful but still have limitations. Replay buffers become ineffective for long sequential data streams or tasks with numerous categories. Some methods (Guo et al., 2022; Prabhu et al., 2020) train only on replayed samples to achieve balanced learning, but they sacrifice most of the valuable training data and risk overfitting on the buffer. Methods (Caccia et al., 2022) that separate gradient updates for old and new classes effectively prevent the impact of imbalance between them but fail to construct clear classification boundaries between old and new classes.

By decomposing the sample probability in non-stationary data streams through conditional probability (sample probability = class-conditional × class-prior), we reveal that recency bias caused by inter-class imbalance is entirely attributable to imbalanced class-priors. The underlying class-conditional invariant in online class-IL data streams motivates us to learn a function from intrinsic intra-class distributions instead of traditional sample distributions. We propose Logit Adjusted Softmax (LAS) to resist the impact of class-priors and capture class-conditionals by simply adjusting the model's logit outputs via input label frequencies during training. Our method is grounded in the optimal classifier that minimizes the class-balanced error in the online class-IL setup. Moreover, in the challenging online CL scenarios (Xu et al., 2021) that combine class- and domain-IL, we show that preserving knowledge in the class-conditional function can better adapt the learner to changing domains. LAS provides the following three practical benefits in comparison to previous online CL methods: (1) It can eliminate the prediction bias caused by the imbalance between old and new classes, as well as by the inherent inter-class imbalance of the data stream. (2) It is orthogonal to methods that improve replay strategies and can be plugged into most rehearsal-based methods. (3) It improves performance with nearly no additional computational overhead.

We evaluate LAS on extensive benchmarks over various datasets and multiple setups. LAS lifts the plainest Experience Replay (ER) (Chaudhry et al., 2019) to state-of-the-art performance without model expansion or computationally intensive techniques, e.g., improving the accuracy of the best baseline by 4.6% on CIFAR10 (Krizhevsky, 2009) in the online class-IL setup. Furthermore, we notice that inter-class imbalance dominates forgetting in long sequential data streams, which is rarely evaluated and always underestimated in previous work, so we evaluate on the challenging ImageNet (Deng et al., 2009) and iNaturalist (Horn et al., 2017), where our proposed method consistently outperforms previous approaches. In addition to the class-IL setup, LAS also succeeds in the blurry setup and in scenarios that combine class- and domain-IL.

Key contributions of this paper include: (1) We discover the class-conditional invariant and the optimality of the class-conditional function, which minimizes the class-balanced error in online class-IL. (2) We propose eliminating class-priors and learning class-conditionals separately in scenarios that combine class- and domain-IL. (3) We adjust model logit outputs during training with a batch-wise sliding-window estimator for time-varying class-priors to pursue the class-conditional function.

2 Problem Setup

Beyond the task-IL setting (Li & Hoiem, 2016) with clear task-boundaries, we consider a more realistic environment where task-identifiers and task-boundaries are absent at any time, and the total number of labels is unknown. Specifically, let $\mathcal{X}$ be the instance set and $\mathcal{Y}$ be the corresponding label set. In online CL, $|\mathcal{Y}| = \infty$. At time $t \in \mathcal{T} = \{1, 2, \dots\}$, given an unknown non-stationary data stream $\mathcal{D}_t$ over $\mathcal{X} \times \mathcal{Y}$, the learner samples a data batch $\mathcal{B}_t = \{x_i, y_i\}_{i=1}^{|\mathcal{B}_t|} \sim P(x, y \mid \mathcal{D}_t)$. We refer to $\mathcal{B}_t$ as the incoming batch. If an instance-label pair is not stored in the memory, it will be inaccessible in subsequent training unless resampled.

Commonly, a constrained memory $\mathcal{M}$ ($|\mathcal{M}| \le M$) is utilized to enhance online CL: if the buffer is not empty at time $t$, a Retrieval program assembles several instances and other specific information $I$ to form a buffer batch $\mathcal{B}^{\mathcal{M}}_t = \mathrm{Retrieval}(\mathcal{B}_t, \mathcal{M}_t) = \{x_i, I_i\}_{i=1}^{|\mathcal{B}^{\mathcal{M}}_t|} \sim P(x, I \mid \mathcal{M}_t)$. The buffer is updated with incoming batches, $\mathcal{M}_{t+1} \leftarrow \mathrm{Update}(\mathcal{B}_t, \mathcal{M}_t)$. Typically, ER (Chaudhry et al., 2019) stores instances and labels, $I_i = y_i$, retrieves by random replay, and updates via reservoir sampling (Vitter, 1985). Rehearsal helps to alleviate inter-class imbalance when the number of classes is limited, but cannot fundamentally eliminate its impact. The minimum class-prior in memory is bounded by the inverse of the number of observed classes, $\min_{y \in \mathcal{Y}_t} P(y \mid \mathcal{M}_t) \le 1/|\mathcal{Y}_t| \to 0$ ($t \to \infty$). When the number of seen classes surges, rehearsal will no longer be able to support balanced inter-class learning.
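The Update and Retrieval steps described above for ER can be illustrated with a minimal sketch. It assumes the textbook form of reservoir sampling (Algorithm R) and uniform random replay; the class name `ReservoirBuffer` and its methods are hypothetical and are not taken from the authors' code.

```python
import random


class ReservoirBuffer:
    """Fixed-capacity memory updated by reservoir sampling (hypothetical helper)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []   # stored (x, y) pairs
        self.seen = 0     # number of stream items observed so far
        self.rng = random.Random(seed)

    def update(self, batch):
        """Update step: each stream item ends up stored with probability capacity / seen."""
        for pair in batch:
            self.seen += 1
            if len(self.items) < self.capacity:
                self.items.append(pair)
            else:
                j = self.rng.randrange(self.seen)  # uniform in [0, seen)
                if j < self.capacity:
                    self.items[j] = pair

    def retrieve(self, batch_size):
        """Retrieval step: uniform random replay without replacement."""
        k = min(batch_size, len(self.items))
        return self.rng.sample(self.items, k)


# Toy stream: 100 items arriving in incoming batches of 5, with 4 labels.
buffer = ReservoirBuffer(capacity=10)
for start in range(0, 100, 5):
    buffer.update([(f"x{i}", i % 4) for i in range(start, start + 5)])
replay = buffer.retrieve(4)
```

Because the stored set is a uniform sample of the stream seen so far, the label frequencies in memory follow those of the stream, which is why the memory alone does not guarantee balanced classes.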

The learner is a neural network parameterized by $\Theta = \{\theta, w\}$. The function $f_\theta : \mathcal{X} \to \mathbb{R}^D$ extracts feature embeddings of dimension $D$. Following the feature extractor, a single-head linear classifier produces logits, $\Phi(\cdot) = w^\top f_\theta(\cdot) : \mathcal{X} \to \mathbb{R}^{|\mathcal{Y}_t|}$ (for short, $\Phi_y(\cdot) = w_y^\top f_\theta(\cdot)$), where $w \in \mathbb{R}^D \times \mathbb{R}^{|\mathcal{Y}_t|}$ represents the weights corresponding to the target classes. The dimension of the classifier weights can grow as more classes are observed. The learner trains through a surrogate loss averaged over all input instances, $\mathcal{L}_t : \mathcal{Y}_t \times \mathbb{R}^{|\mathcal{Y}_t|} \to \mathbb{R}$ ($\mathcal{Y}_t$ is the set of all observed labels), typically the softmax cross-entropy loss:

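$$\mathcal{L}_{\mathrm{CE}}(y, \Phi(x)) = -\log \frac{e^{\Phi_y(x)}}{\sum_{y' \in \mathcal{Y}_t} e^{\Phi_{y'}(x)}} = \log\Big[1 + \sum_{y' \neq y} e^{\Phi_{y'}(x) - \Phi_y(x)}\Big]. \tag{1}$$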
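As a quick numerical check of Equation 1, the following sketch evaluates the softmax cross-entropy in both of its forms (negative log-softmax and the pairwise-margin form) and confirms that they agree. The function names and the example logits are illustrative assumptions.

```python
import math


def ce_softmax(logits, y):
    """Negative log-softmax of class y, computed with a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[y]


def ce_margin(logits, y):
    """Pairwise-margin form: log(1 + sum over y' != y of exp(logit[y'] - logit[y]))."""
    others = (math.exp(v - logits[y]) for k, v in enumerate(logits) if k != y)
    return math.log(1.0 + sum(others))


logits = [2.0, -1.0, 0.5]  # made-up logits for three classes
for y in range(len(logits)):
    print(y, ce_softmax(logits, y), ce_margin(logits, y))
```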

3 Statistical View for Time-varying Distribution Learning

Standard CL methods learn from the sample probability $P(x, y \mid \rho_t)$ of the target distribution $\rho_t$ (for example, $\mathcal{D}_t$ in practice). The model is encouraged to pursue a posterior probability function $P(y \mid x, \rho_t)$ and to minimize the misclassification error $\mathbb{E}_{\rho_t}[\mathbb{E}_{x, y \mid \rho_t}[y \neq \arg\max_{y' \in \mathcal{Y}_t} \Phi_{y'}(x)]]$. From Bayes' rule and the definition of conditional probability, we notice that $P(y \mid x, \rho_t) \propto P(x, y \mid \rho_t) = P(x \mid y, \rho_t) \cdot P(y \mid \rho_t)$, revealing that the sample probability $P(x, y \mid \rho_t)$ of a time-varying distribution is controlled by the class-conditional $P(x \mid y, \rho_t)$ and the class-prior $P(y \mid \rho_t)$. In unknown non-stationary data streams, inter-class imbalance is entirely attributable to time-varying class-priors and is independent of class-conditionals. Therefore, this factorization motivates us to learn a class-balanced classifier by exclusively pursuing a class-conditional function $P(x \mid y, \rho_t)$, which is agnostic to arbitrarily imbalanced class-priors. The class-conditional function has been widely studied in statistical learning on stable distributions (Long et al., 2017). A similar motivation regarding the decomposition of the sample probability has also been discussed in previous work (Van De Ven et al., 2021). We extend the analysis of the class-conditional function to non-stationary stream distribution learning and discover the class-balanced optimality of the class-conditional function when learning stream distributions without domain drift, i.e., with fixed class-conditionals, as stated in Theorem 3.1.

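Theorem 3.1. For the time-varying distribution $\rho_t$, given that its class-conditionals stay the same throughout time, i.e., $\forall t,\ P(x \mid y, \rho_t) = P(x \mid y, \rho_0)$, the class-conditional function yields the optimal classifier $\Phi^*_t$ that minimizes the class-balanced error,

$$\Phi^*_t \in \arg\min_{\Phi : \mathcal{X} \to \mathbb{R}^{|\mathcal{Y}_t|}} \mathrm{CBE}(\Phi, \mathcal{Y}_t), \qquad \arg\max_{y \in \mathcal{Y}_t} \Phi^*_{t,y}(x) = \arg\max_{y \in \mathcal{Y}_t} P(x \mid y, \rho_t). \tag{2}$$

$$\mathrm{CBE}(\Phi, \mathcal{Y}_t) = \frac{1}{|\mathcal{Y}_t|} \sum_{y \in \mathcal{Y}_t} \mathbb{E}_{\rho_t}\Big[\mathbb{E}_{x \mid y, \rho_t}\big[y \neq \arg\max_{y' \in \mathcal{Y}_t} \Phi_{y'}(x)\big]\Big]. \tag{3}$$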

In other words, the optimal class-balanced estimate is the class under which the sample is most likely to appear. $\mathrm{CBE}(\Phi, \mathcal{Y}_t)$ is the Class-Balanced Error (Menon et al., 2013) on the current label set $\mathcal{Y}_t$, extended from the misclassification error for class-balanced evaluation and defined formally in Equation 3. Note that the class-conditional function corresponds to the Bayes optimal classifier when class-priors are uniform. However, in scenarios where the class-priors are inherently imbalanced, the Bayes optimal classifier fails to minimize the class-balanced error (Menon et al., 2013). Bias towards the most recently occurring classes does not help reduce the class-balanced error, whereas approximating the real underlying class-conditionals helps balanced classification, because the class-balanced error is the average of the per-class error rates. Therefore, to address the impact of inter-class imbalance and leverage knowledge from intra-class intrinsic distributions, we propose eliminating class-priors and constructing a class-conditional function in online CL. The proof of Theorem 3.1 is in Appendix A. In the following, we discuss two distinct CL scenarios that differ in the critical condition on class-conditionals.
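To make the difference between the misclassification error and the class-balanced error of Equation 3 concrete, the sketch below computes empirical versions of both on a toy label set. The function names and the data are illustrative assumptions; the estimators simply replace the expectations by sample averages.

```python
from collections import defaultdict


def misclassification_error(labels, preds):
    """Fraction of all samples that are predicted incorrectly."""
    wrong = sum(1 for y, p in zip(labels, preds) if y != p)
    return wrong / len(labels)


def class_balanced_error(labels, preds):
    """Average of the per-class error rates (empirical counterpart of Equation 3)."""
    total = defaultdict(int)
    wrong = defaultdict(int)
    for y, p in zip(labels, preds):
        total[y] += 1
        if y != p:
            wrong[y] += 1
    return sum(wrong[y] / total[y] for y in total) / len(total)


# Imbalanced toy data: nine samples of class 0, one sample of class 1.
labels = [0] * 9 + [1]
preds = [0] * 10  # a classifier that always predicts the majority class

mis = misclassification_error(labels, preds)  # 1 wrong out of 10 samples
cbe = class_balanced_error(labels, preds)     # (0/9 + 1/1) / 2
```

Here the majority-biased classifier looks good under the misclassification error (0.1) but poor under the class-balanced error (0.5), which is the effect the theorem is concerned with.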

Discussion on online class-IL with time-invariant class-conditionals. Prior works (Chrysakis & Moens, 2023) have typically assumed that no domain drift occurs during the learning process in online class-IL. Although domain drift should be taken into account in realistic scenarios, nearly time-invariant class-conditionals are genuinely feasible in practical situations. For instance, an agent acting as a lifelong species observer in the wild can find that the target class-conditionals conform to their natural distributions, determined by their semantic information and occurrence frequencies. Without intentional human interference, the concept of natural semantics will remain almost unchanged over a prolonged time, i.e., $\forall t,\ P(x \mid y, \rho_t) \approx P(x \mid y, \rho_0)$. In the experiments, we mainly adhere to the conventional class-IL configuration without domain drift and focus on addressing inter-class imbalance and the forgetting induced by recency bias.

Discussion on online CL scenarios with time-varying class-conditionals. In the challenging online CL scenarios where P(y|ρt) and P(x|y, ρt) ﬂuctuate as the data stream ﬂows, both inter-class imbalance and intra-class domain drift are crucial considerations. While this setting has been studied in oﬄine incremental setups (Xie et al., 2022), there has been no research on this topic under online conditions, to the best of our knowledge. We now present our contribution to bridging this gap. In this online CL setup, we eliminate class-priors and focus on the class-conditional function, which should not favor any speciﬁc domain but should blend all observed domains uniformly for optimal decision-making,

$$\arg\max_{y \in \mathcal{Y}_t} \Phi^*_{t,y}(x) = \arg\max_{y \in \mathcal{Y}_t} \sum_{i=1}^{t} P(x \mid y, \rho_i). \tag{4}$$

Since previous distributions are unavailable in CL, determining the optimal uniform domain distribution is intractable. Nevertheless, the disparity between the optimal classifier that minimizes the class-balanced error and the learned class-conditional function can be measured by the similarity between their underlying intra-class distributions. We combine our approach with knowledge distillation techniques (Li & Hoiem, 2016; Tao et al., 2020; Kang et al., 2022; Dong et al., 2023) to narrow that disparity in probability space. Results in Section 6.3 show that preserving the knowledge in the class-balanced class-conditional function after eliminating class-priors adapts better to domain drift than the standard posterior function. Therefore, our proposal scales to online settings that combine class- and domain-IL. Furthermore, it paves the way for developing more efficient solutions for this online scenario by minimizing the gap between intra-class probabilities and the optimal class-conditional function, which we intend to explore in future research.

4.1 Logit Adjustment Technique

Our objective now becomes excluding class-priors and establishing an estimator for the current class-conditionals, i.e., $\Phi_t : \mathcal{X} \to \mathbb{R}^{|\mathcal{Y}_t|}$ with $\exp(\Phi_{t,y}) \propto P(x \mid y, \rho_t)$. However, it is notoriously difficult to model the class-conditionals explicitly (Van De Ven et al., 2021; Zając et al., 2023). To sidestep this problem, we draw on the Logit Adjustment technique proposed by Menon et al. (2021). Suppose the optimum scorer obtained by minimizing the misclassification error on the target distribution $\rho_t$ at time $t$ is $s^*_t : \mathcal{X} \to \mathbb{R}^{|\mathcal{Y}_t|}$ with $\exp(s^*_{t,y}) \propto P(y \mid x, \rho_t)$. Recalling that $P(y \mid x, \rho_t) \propto P(x \mid y, \rho_t) \cdot P(y \mid \rho_t)$, we can derive the relationship between the class-conditional estimator $\Phi_t$ and the optimum scorer $s^*_t$ as follows:

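$$\arg\max_{y \in \mathcal{Y}_t} e^{s^*_{t,y}(x)} = \arg\max_{y \in \mathcal{Y}_t} e^{\Phi_{t,y}(x)} \cdot P(y \mid \rho_t) = \arg\max_{y \in \mathcal{Y}_t} \big(\Phi_{t,y}(x) + \ln P(y \mid \rho_t)\big). \tag{5}$$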

Equation 5 induces a straightforward method to approximate class-conditionals and to achieve a class-balanced classifier: adjust the model's logit outputs according to the class-priors $P(y \mid \rho_t)$ and directly optimize the softmax cross-entropy loss.
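A small numerical illustration of Equation 5 may help. The sketch below uses made-up numbers for a single input x and two classes: the posterior scorer favors the majority class, while removing the log class-prior from its scores recovers the class under which x is most likely. All values and names are assumptions for illustration only.

```python
import math

# Assumed toy quantities for one fixed input x and two classes.
prior = [0.9, 0.1]       # imbalanced class-priors P(y)
likelihood = [0.4, 0.6]  # class-conditionals P(x | y)

# Posterior P(y | x) is proportional to P(x | y) * P(y).
joint = [l * p for l, p in zip(likelihood, prior)]
posterior = [j / sum(joint) for j in joint]

# Optimum posterior scorer: exp(s_y) proportional to P(y | x).
s = [math.log(p) for p in posterior]

# Class-conditional scores via Equation 5: Phi_y = s_y - ln P(y).
phi = [s_y - math.log(p_y) for s_y, p_y in zip(s, prior)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


print(argmax(s), argmax(phi), argmax(likelihood))
```

The posterior argmax selects class 0 because of its large prior, whereas the adjusted scores select class 1, matching the argmax of the class-conditionals.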

Figure 1: Left: diagram of Experience Replay (ER) with our proposed Logit Adjusted Softmax and a batch-wise sliding-window estimator (ER-LAS). LAS helps mitigate the inter-class imbalance problem by adding label frequencies to predicted logits. The model in ER-LAS is still trained via the softmax cross-entropy loss. Right: model predictions on test samples by Fine-Tune, ER, and ER-LAS on C-CIFAR100 (10 tasks). The gray dashed line indicates the ground-truth task-wise distribution (1k for each). Predictions are counted according to the task to which the predicted class belongs.

4.2 Logit Adjusted Softmax Cross-entropy Loss

We now show how to incorporate Logit Adjustment technique into the softmax cross-entropy loss for the aim of addressing the inter-class imbalance issues in online CL. The modiﬁed Logit Adjusted Softmax cross-entropy loss is deﬁned as follows:

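$$\mathcal{L}_{\mathrm{LAS}}(y, \Phi(x)) = -\log \frac{e^{\Phi_y(x) + \tau \log \pi_{y,t}}}{\sum_{y' \in \mathcal{Y}_t} e^{\Phi_{y'}(x) + \tau \log \pi_{y',t}}} = \log\Big[1 + \sum_{y' \neq y} \Big(\frac{\pi_{y',t}}{\pi_{y,t}}\Big)^{\tau} \cdot e^{\Phi_{y'}(x) - \Phi_y(x)}\Big], \tag{6}$$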

where $\tau$ is the temperature scalar, and $\pi_{y,t}$ is the class-prior $P(y \mid \mathcal{S}_t)$ at time $t$. In practice, $\mathcal{S}_t$ represents the collection of data points from which the model samples its input batch at each step. Because $\mathcal{S}_t$ is uncertain, it is impossible to pinpoint the class-priors at each moment. To overcome this barrier, Section 4.3 provides a simple yet effective method for estimating class-priors in the flowing input stream. Applied to rehearsal-based methods, $\mathcal{L}_{\mathrm{LAS}}$ acts on both incoming and buffer batches to fully exploit the input data. The right-hand side of Equation 6 illustrates its distinction from the cross-entropy loss in Equation 1: it enforces a large relative margin between the major class and the minor class, i.e., $(\pi_{\mathrm{major},t} / \pi_{\mathrm{minor},t})^{\tau} > 1$ (Cao et al., 2019; Tan et al., 2020; Menon et al., 2021).
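The loss in Equation 6 can be sketched in a few lines. The functions below are a plain-Python illustration rather than the authors' implementation: the names are hypothetical, and the sketch assumes strictly positive priors, since this section does not specify how classes with zero estimated frequency are handled.

```python
import math


def las_loss(logits, y, priors, tau=1.0):
    """Softmax cross-entropy on logits shifted by tau * log(prior) (left form of Eq. 6)."""
    adjusted = [v + tau * math.log(p) for v, p in zip(logits, priors)]
    m = max(adjusted)
    log_z = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_z - adjusted[y]


def las_loss_margin(logits, y, priors, tau=1.0):
    """Pairwise-margin form (right form of Eq. 6); assumes priors[y] > 0."""
    total = 0.0
    for k, v in enumerate(logits):
        if k != y:
            total += (priors[k] / priors[y]) ** tau * math.exp(v - logits[y])
    return math.log(1.0 + total)


logits = [0.0, 0.0]  # an undecided model
priors = [0.9, 0.1]  # class 0 is the majority class

loss_major = las_loss(logits, 0, priors)          # log(1 + 1/9)
loss_minor = las_loss(logits, 1, priors)          # log(1 + 9)
loss_plain = las_loss(logits, 1, priors, tau=0.0) # reduces to plain cross-entropy, log(2)
```

With equal logits, the minority-class sample incurs a larger loss than under plain cross-entropy, so the model is pushed to give minority classes a larger margin; setting `tau=0.0` removes the adjustment entirely.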

4.3 Estimator for Time-varying Class-priors

In stationary distribution, the Logit Adjustment technique (Collell et al., 2016) can determine class-priors based on a large amount of training data. But when facing an unknown time-varying input data stream, it is required to continuously estimate class-priors πy,t for LLAS at each time t. Therefore, we propose an intuitive batch-wise estimator with sliding window, where the occurrence frequency of a label in input batches covered by the sliding time window approximates the corresponding class-prior to that label. Given the length l > 0 of the time frame, πy,t is calculated as follows:

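$$\pi_{y,t} = \frac{\sum_{i=t-l+1}^{t} \sum_{\{x', y'\} \in \mathcal{B}^{\mathcal{S}}_i} \mathbb{1}(y' = y)}{\sum_{i=t-l+1}^{t} |\mathcal{B}^{\mathcal{S}}_i|}, \tag{7}$$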

where $\mathbb{1}(\cdot)$ is the indicator function of label $y$ and $\mathcal{B}^{\mathcal{S}}_t$ is the input batch sampled from the data point collection $\mathcal{S}_t$. For rehearsal-based methods, the input batch often consists of the incoming and buffer batches, i.e., $\mathcal{B}^{\mathcal{S}}_t = \mathcal{B}_t \cup \mathcal{B}^{\mathcal{M}}_t$. The length $l$ of the time window involves a sensitivity-stability trade-off (Nagengast et al., 2011) with respect to the estimation of class-priors, which we study further in the sensitivity analysis of Section 6.4.
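The estimator in Equation 7 amounts to counting labels over the last l input batches. A minimal sketch, assuming each input batch is given as a list of labels (incoming and buffer labels concatenated); the class name `SlidingWindowPrior` is hypothetical.

```python
from collections import Counter, deque


class SlidingWindowPrior:
    """Batch-wise sliding-window estimate of class-priors (sketch of Equation 7)."""

    def __init__(self, window_len=1):
        if window_len < 1:
            raise ValueError("window_len must be a positive integer")
        # Only the label counts of the last `window_len` batches are kept.
        self.batches = deque(maxlen=window_len)

    def update(self, labels):
        """Push the labels of the current input batch into the window."""
        self.batches.append(Counter(labels))

    def prior(self, y):
        """Frequency of label y among all samples covered by the window."""
        total = sum(sum(c.values()) for c in self.batches)
        if total == 0:
            raise ValueError("no batch has been observed yet")
        return sum(c[y] for c in self.batches) / total


estimator = SlidingWindowPrior(window_len=2)
estimator.update([0, 0, 1, 1])
estimator.update([2, 2, 2, 1])
estimator.update([3, 3, 3, 3])  # the first batch now falls out of the window

p3 = estimator.prior(3)  # 4 of the 8 samples in the window
p2 = estimator.prior(2)  # 3 of 8
p0 = estimator.prior(0)  # class 0 is no longer covered by the window
```

A short window reacts quickly to changes in the stream but gives noisier estimates, while a long window is smoother but lags behind, which is the trade-off mentioned above. Note that labels outside the window receive a zero estimate, so a loss that takes the logarithm of the prior needs some convention for such classes.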

Discussion. The Logit Adjusted Softmax cross-entropy loss and the batch-wise estimator with sliding window together constitute our proposed LAS approach. Our method is orthogonal to previous methods based on various replay strategies and knowledge distillation techniques. The exact joint label distribution of the non-stationary data stream and the memory retrieval program is unnecessary for our approach, allowing us to effortlessly incorporate LAS into existing methods and correct their model prediction bias caused by inter-class imbalance at nearly no additional computational cost. The experiment in Figure 1 (right), which follows the same setting as in Section 6, verifies the effect of LAS on correcting the prediction bias. Fine-Tune, which trains without any precautions against catastrophic forgetting and inter-class imbalance, categorizes all test samples into the most recently learned task classes. ER (Chaudhry et al., 2019) includes a constrained memory to store previously observed data but still assigns about 38% (instead of the expected 10%) of test samples to the most recently learned classes. By contrast, our ER-LAS, shown in Figure 1 (left), eliminates the recency bias, achieves balanced class-posteriors similar to the ground-truth distribution, and significantly improves ER performance, as evaluated in Section 6. The algorithm of LAS is in Appendix B.

Implementation in online GCL. We combine LAS with knowledge distillation in online GCL to preserve a class-balanced class-conditional function over averaged domain distributions. We directly calculate the distillation loss between the outputs of the old and current models without logit adjustment. Note that distillation requires well-defined task-boundaries in order to preserve the previous model. This requirement presents a formidable obstacle in online CL settings, where such boundaries are absent. To investigate the efficacy of our proposed method under the online GCL setting, we allow access to task-boundaries in the relevant experiments. The algorithm of LAS with knowledge distillation for online GCL is Algorithm 3 in Appendix B.

5 Related Work

We next provide some intuition on the eﬀectiveness of our proposed approach by comparing LAS to prior work from the perspective of traditional and continual imbalanced distribution learning. We also highlight the computational eﬃciency in online conditions.

Methods for mitigating inter-class imbalance in stable distributions. Logit Adjustment (Menon et al., 2021) technique appears similar to Loss weighting (Cui et al., 2019) methods, yet the two diﬀer signiﬁcantly in addressing inter-class imbalance. While Loss weighting methods can balance the representation learning on minority class samples by weighting after the loss between logits and ground truth, it cannot rectify prior class bias and therefore cannot address recency bias. In contrast, Logit Adjustment technique directly balances the class-priors on logits, eradicating the impact of prior class imbalance on model classiﬁcation and resolving recency bias. In addition to Loss weighting methods, there are also other methods such as Weight normalization (Kang et al., 2020), Resampling (Kubát & Matwin, 1997), and Post-hoc correction (Collell et al., 2016). Diﬀerent from these methods and the original Logit Adjustment technique, our adapted LAS possesses ﬁrm statistical grounding for non-stationary distributions. We compare with these inter-class imbalance mitigation methods in Appendix F.4.

Methods for mitigating inter-class imbalance in non-stationary distributions. The fundamental ER (Chaudhry et al., 2019) and the recently proposed ER-ACE (Caccia et al., 2022) represent two extreme cases of our approach. ER corresponds to the case where $\tau = 0$: $\mathcal{L}_{\mathrm{LAS}}$ degenerates into the conventional cross-entropy loss $\mathcal{L}_{\mathrm{CE}}$ in Equation 1 and loses the ability to alleviate inter-class imbalance. ER-ACE employs asymmetric losses for incoming and buffer batches, considering only the classes present in the current batch for incoming data, i.e., $\tau \to \infty$, and all previously seen classes for replay, i.e., $\tau = 0$, to mitigate representation shift. However, completely separating the gradients of current and past classes blocks the construction of inter-class decision boundaries. Our method lies between ER and ER-ACE, not only pursuing the class-conditional function but also encouraging large relative margins between old and new classes in online class-IL, i.e., always $(\pi_{\mathrm{new},t} / \pi_{\mathrm{old},t})^{\tau} \ge 1$ in Equation 6, derived from their imbalance. We also note the closely related Logit Rectify methods (Zhou et al., 2023) designed for offline task-IL, which we compare against in Appendix F.5.

Computational efficiency. Online CL cannot ignore real-time requirements because memory and training time are usually limited in practical scenarios. Compared to the traditional Softmax, Logit Adjusted Softmax slightly increases the computational cost by $O(|\mathcal{Y}_t|)$. Our suggested estimator raises the calculation time by $O(|\mathcal{B}_t| + |\mathcal{B}^{\mathcal{M}}_t| + |\mathcal{Y}_t|)$ and the memory cost by $O(|\mathcal{Y}_t|)$. Relative to the time and storage overhead of the model and the memory, such an increase is negligible and lower than in previous works. Our experiments primarily compare methods with computational costs similar to our approach. Note that CL methods based on contrastive learning (Guo et al., 2022; 2023) may consume substantially more computational resources than our algorithm. We present a performance comparison with these methods in Appendix F.6.

6 Experiment

In this section, we conduct comprehensive experiments to demonstrate the effectiveness of our proposed LAS. First, we investigate the performance of LAS in the online class-IL scenario with class-disjoint tasks and in the online blurry CL scenario without clear task-boundaries. Then, we evaluate the gains of LAS on rehearsal-based methods in the online class-IL setup and on knowledge distillation approaches in the online CL setup that combines class- and domain-IL. Finally, we study the extreme variants of our method, the necessity of the suggested batch-wise estimator with sliding window, and the hyperparameter sensitivity of LAS.

Benchmark setups. We use 5 image classification datasets combined with 3 kinds of CL setups to form 8 benchmarks. Among the datasets, CIFAR10 (Krizhevsky, 2009) has 10 classes. CIFAR100 (Krizhevsky, 2009) has 100 classes, which can also be categorized into 20 superclasses with 5 domains. Tiny ImageNet (Le & Yang, 2015) has 200 classes. ImageNet ILSVRC 2012 (Deng et al., 2009) has 1,000 classes and evaluates method performance on a long sequential data stream. iNaturalist 2017 (Horn et al., 2017) has 5,089 classes. The distribution of images per category in iNaturalist follows the observation frequency of the species in the wild, so the data stream possesses inherent inter-class imbalance. As for CL setups, online class-IL (C) (Aljundi et al., 2019) splits a dataset into multiple tasks with uniform disjoint classes, e.g., C-CIFAR10 (5 tasks) is split into 5 disjoint tasks with 2 classes each, except for C-iNaturalist (26 tasks), which is organized into 26 disjoint tasks according to the initial letter of each class. Online blurry CL (B) (Koh et al., 2022) has both class-IL distributions and blurry task-boundaries. It divides the classes into an $N_{\mathrm{blurry}}\%$ disjoint part and a $(100 - N_{\mathrm{blurry}})\%$ blurry part. The disjoint-part classes only appear in fixed tasks, while the blurry-part classes occur throughout the data stream but with inherent inter-class imbalance represented by the blurry level $M_{\mathrm{blurry}}$. We split CIFAR100 and Tiny ImageNet into 10 blurry tasks according to Koh et al. (2022) with disjoint ratio $N_{\mathrm{blurry}} = 50$ and blurry level $M_{\mathrm{blurry}} = 10$. Online Sum-Class-Domain CL (S) (Xie et al., 2022) covers the class- and domain-IL setup, where incoming data contains images from new classes and new domains. We only apply this online CL setup to CIFAR100. The learner needs to predict superclass labels. Each superclass has 5 subclasses representing 5 different domains within the same class. See Appendix C for more details about the benchmark setups.
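The class-IL split described above (for example, C-CIFAR10 with 5 tasks of 2 classes each) can be sketched as follows. The function name is hypothetical, and assigning classes to tasks in their given order is an assumption of this sketch; the actual benchmarks may use a different class order.

```python
def split_class_il(classes, num_tasks):
    """Partition a list of class labels into equally sized, disjoint tasks."""
    if num_tasks < 1 or len(classes) % num_tasks != 0:
        raise ValueError("classes must divide evenly into num_tasks")
    per_task = len(classes) // num_tasks
    return [classes[i * per_task:(i + 1) * per_task] for i in range(num_tasks)]


# C-CIFAR10 (5 tasks): 10 classes split into 5 disjoint tasks of 2 classes each.
tasks = split_class_il(list(range(10)), num_tasks=5)
print(tasks)
```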

Training Protocol. For all experiments, unless otherwise specified, we follow Buzzega et al. (2020) and use the full ResNet18 as the feature extractor. For small-scale datasets, we start training from scratch. For C-ImageNet, we pre-train models on 100 randomly selected classes and then perform online learning on the remaining 900 classes (Gallardo et al., 2021). For C-iNaturalist, we pre-train models on the entire ImageNet dataset. A single-head classifier is applied to classify all seen labels. We use the SGD optimizer without momentum and weight decay. The learning rate is set to 0.03 and kept constant. Incoming and buffer batch sizes are both 32. On C-ImageNet and C-iNaturalist, we set both batch sizes to 128. We apply standard data augmentation, including random-resized-crop, horizontal-flip, and normalization. Some literature (Koh et al., 2022) assumes that data arrive one-by-one in online CL, in which case we can accumulate samples into a batch to help optimization converge. We discuss the performance under varying batch sizes and per-sample updating in Appendix F.3. For online CL, only one epoch is used to run all methods for each task, and gradient descent is performed only once per incoming batch. By default, we set $\tau = 1.0$ and $l = 1$ for LAS. We report means and standard deviations of all results across 10 independent runs.

Evaluation Protocol. A commonly used metric is the ﬁnal average accuracy AT . Another common metric is the ﬁnal average forgetting (Chaudhry et al., 2020) FT . For blurry setup, we follow (Koh et al., 2022) to add the Area Under the Curve of Accuracy AAUC to evaluate the model performance throughout training. The detailed computation of each metric is given in Appendix D.

Baselines. We consider 8 rehearsal-based methods for online CL to compare: ER (Chaudhry et al., 2019) uses reservoir update and random replay. DER++ (Buzzega et al., 2020) replays samples with previous logits for distillation loss. MRO (Chrysakis & Moens, 2023) only trains from memory. SS-IL (Ahn et al., 2020) separates the loss for present and absent classes. CLIB (Koh et al., 2022) updates by sample-wise importance and only trains on replayed samples. ER-ACE (Caccia et al., 2022) employs the asymmetric loss to reduce representation shift. ER-OBC (Chrysakis & Moens, 2023) additionally updates the classifier by balanced buffer batches. ER-CBA (Wang et al., 2023) introduces a continual bias adapter inserted after the classifier and conducts dual optimization on input and buffer batches. In addition, we enhance 3 replay-strategy methods: MIR (Aljundi et al., 2019) retrieves the memory samples most interfered with by the model update. ASERµ (Shim et al., 2020) calculates Shapley values of samples to update and retrieve. OCS (Yoon et al., 2022) selects a coreset with high affinity to replay. Also, the knowledge distillation losses in 3 approaches are augmented by LAS: LwF (Li & Hoiem, 2016) distills on logits of previous classes. LUCIR (Hou et al., 2019) distills on normalized features. GeoDL (Simon et al., 2021) also distills in the feature space but measures by the geodesic path.

Table 1: Final average accuracy AT (higher is better) on C-CIFAR10 (5 tasks), C-CIFAR100 (10 tasks), and C-Tiny ImageNet (10 tasks). M is memory size.

| Method | C-CIFAR10, M = 0.5k | C-CIFAR10, M = 1k | C-CIFAR10, M = 2k | C-CIFAR100, M = 0.5k | C-CIFAR100, M = 1k | C-CIFAR100, M = 2k | C-Tiny ImageNet, M = 0.5k | C-Tiny ImageNet, M = 1k | C-Tiny ImageNet, M = 2k |
|---|---|---|---|---|---|---|---|---|---|
| ER | 40.9 ± 1.2 | 45.4 ± 1.8 | 50.3 ± 1.1 | 12.9 ± 0.3 | 16.5 ± 0.4 | 19.8 ± 0.6 | 8.8 ± 0.2 | 11.0 ± 0.2 | 14.3 ± 0.3 |
| DER++ | 49.4 ± 1.0 | 49.7 ± 3.0 | 48.9 ± 0.9 | 8.9 ± 0.4 | 13.1 ± 0.4 | 12.3 ± 0.4 | 5.9 ± 0.2 | 8.0 ± 0.3 | 9.5 ± 0.3 |
| MRO | 43.4 ± 1.0 | 49.3 ± 1.1 | 55.9 ± 0.6 | 11.5 ± 0.1 | 18.3 ± 0.2 | 23.1 ± 0.1 | 5.9 ± 0.1 | 9.2 ± 0.1 | 13.4 ± 0.2 |
| SS-IL | 47.7 ± 0.7 | 52.6 ± 0.5 | 51.7 ± 0.4 | 19.2 ± 0.2 | 21.5 ± 0.2 | 24.2 ± 0.2 | 13.1 ± 0.2 | 14.9 ± 0.1 | 17.1 ± 0.9 |
| CLIB | 48.4 ± 0.9 | 54.8 ± 1.0 | 55.9 ± 1.0 | 15.9 ± 0.2 | 20.7 ± 0.2 | 25.3 ± 0.3 | 8.3 ± 0.1 | 12.1 ± 0.2 | 15.9 ± 0.2 |
| ER-ACE | 44.4 ± 1.0 | 48.1 ± 1.1 | 51.2 ± 1.2 | 18.6 ± 0.4 | 22.5 ± 0.5 | 25.0 ± 0.9 | 11.4 ± 0.2 | 14.8 ± 0.2 | 16.4 ± 0.4 |
| ER-OBC | 45.1 ± 0.6 | 46.4 ± 0.6 | 46.0 ± 0.4 | 15.6 ± 0.2 | 17.9 ± 0.2 | 22.1 ± 0.3 | 9.1 ± 0.1 | 13.2 ± 0.1 | 16.4 ± 0.1 |
| ER-CBA | 45.0 ± 1.6 | 54.2 ± 1.1 | 56.3 ± 0.9 | 20.1 ± 0.6 | 23.0 ± 0.3 | 26.0 ± 0.6 | 12.3 ± 0.7 | 13.6 ± 0.5 | 17.0 ± 0.5 |
| ER-LAS | 51.7 ± 0.9 | 55.3 ± 1.6 | 60.5 ± 0.8 | 20.1 ± 0.2 | 25.7 ± 0.3 | 27.0 ± 0.3 | 13.7 ± 0.2 | 15.5 ± 0.2 | 18.7 ± 0.2 |

Table 2: Final average accuracy AT (higher is better) and final average forgetting FT (lower is better) on C-ImageNet (90 tasks) and C-iNaturalist (26 tasks). We show the results of top-3 methods. Memory sizes are M = 20k.

| Method | C-ImageNet, AT / FT | C-iNaturalist, AT / FT |
|---|---|---|
| ER | 31.8 ± 0.1 / 38.6 ± 0.2 | 4.7 ± 0.0 / 18.0 ± 0.0 |
| ER-ACE | 33.4 ± 0.2 / 11.3 ± 0.1 | 5.7 ± 0.0 / 1.1 ± 0.0 |
| MRO | 35.8 ± 0.1 / 10.2 ± 0.2 | 5.0 ± 0.0 / 0.4 ± 0.0 |
| ER-LAS | 39.3 ± 0.1 / 9.0 ± 0.1 | 8.1 ± 0.0 / 2.8 ± 0.0 |

Table 3: AUC of Accuracy AAUC and final average accuracy AT (both higher is better) on B-CIFAR100 (10 tasks) and B-Tiny ImageNet (10 tasks). We show the results of top-3 methods. Memory sizes are M = 2k.

| Method | B-CIFAR100, AT / AAUC | B-Tiny ImageNet, AT / AAUC |
|---|---|---|
| ER | 19.6 ± 1.6 / 16.1 ± 0.1 | 16.2 ± 0.2 / 12.4 ± 0.0 |
| ER-ACE | 18.3 ± 1.0 / 15.2 ± 0.0 | 16.4 ± 0.3 / 12.2 ± 0.1 |
| CLIB | 21.9 ± 0.3 / 18.0 ± 0.1 | 15.9 ± 0.2 / 12.6 ± 0.1 |
| ER-LAS | 24.9 ± 0.5 / 20.3 ± 0.0 | 19.4 ± 0.4 / 15.1 ± 0.0 |

6.1 Results on Online Class-IL Scenarios

Accuracy results. Table 1 and Table 2 show the final average accuracy for C-CIFAR10, C-CIFAR100, C-Tiny ImageNet, C-ImageNet, and C-iNaturalist with various memory sizes. ER-LAS consistently outperforms all compared baselines, achieving 60.5% (+4.6%), 27.0% (+1.7%), and 18.7% (+1.6%) on C-CIFAR10, C-CIFAR100, and C-Tiny ImageNet, respectively, compared to the best baselines. Whereas MRO and CLIB consider only replayed samples, and SS-IL and ER-ACE separate the gradients between old and new classes, LAS optimizes a class-balanced function over both incoming and buffer batches and enforces large relative margins between imbalanced classes, resulting in better performance. The challenging C-ImageNet and C-iNaturalist benchmarks have substantially longer data streams than the above three benchmarks, so the recency bias caused by inter-class imbalance becomes critical; we therefore also apply LAS to boost the performance of ER on them. We present the results of the top-3


Table 4: Final average accuracy A_T (higher is better) by replay strategy methods w/o and w/ LAS on C-CIFAR100 (10 tasks). Gains are shown in parentheses. M is memory size.

| Method (C-CIFAR100) | M = 0.1k | M = 0.5k |
|---|---|---|
| ER | 6.5±0.2 | 12.9±0.3 |
| ER-LAS | 10.7±0.2 (+4.2) | 20.1±0.2 (+7.2) |
| MIR | 6.6±0.3 | 12.0±0.3 |
| MIR-LAS | 11.8±0.1 (+5.2) | 21.1±0.2 (+9.1) |
| ASERµ | 7.8±0.2 | 13.8±0.3 |
| ASERµ-LAS | 9.5±0.4 (+1.7) | 18.0±0.3 (+4.2) |
| OCS | 9.4±0.1 | 16.2±0.2 |
| OCS-LAS | 12.7±0.2 (+3.3) | 21.0±0.3 (+4.8) |

Table 5: Final average accuracy A_T (higher is better) by knowledge distillation approaches w/o and w/ LAS on S-CIFAR100 (20 tasks). Gains are shown in parentheses. M is memory size.

| Method (S-CIFAR100) | M = 0.1k | M = 0.5k |
|---|---|---|
| ER | 20.4±0.2 | 27.3±0.4 |
| ER-LAS | 24.1±0.2 (+3.7) | 31.5±0.5 (+4.2) |
| LwF | 23.9±0.3 | 30.1±0.3 |
| LwF-LAS | 26.0±0.2 (+2.1) | 32.4±0.1 (+2.3) |
| LUCIR | 20.1±0.1 | 29.4±0.3 |
| LUCIR-LAS | 25.0±0.2 (+4.9) | 32.6±0.3 (+3.2) |
| GeoDL | 20.6±0.2 | 30.1±0.2 |
| GeoDL-LAS | 25.2±0.2 (+4.6) | 32.8±0.2 (+2.7) |

baselines (MRO, ER-ACE, ER) on C-ImageNet and C-iNaturalist. ER-LAS obtains 39.3% (+3.5%) on C-ImageNet and 8.1% (+2.4%) on C-iNaturalist compared to the best baselines. To ensure a fair comparison on C-iNaturalist, we present both the final average accuracy evaluated directly on the test data, as other methods do, and the final average class-balanced accuracy, which aligns with our Theorem 3.1, in Appendix F.7. Our extensive evaluations demonstrate the superior performance of LAS, which effectively alleviates inter-class imbalance in the online class-IL setup at nearly no additional computation cost (Table 7). ER-LAS is only slightly slower than ER, which favors its use in real-world online applications.

Forgetting rate. We compare the final average forgetting of ER-LAS with the top-3 performing baselines (MRO, ER-ACE, ER) on C-ImageNet and C-iNaturalist. As shown in Table 2, ER-LAS achieves the lowest forgetting rate on C-ImageNet and forgets more than only MRO and ER-ACE on C-iNaturalist. However, the lowest forgetting rate (e.g., 0.4% for MRO) does not necessarily guarantee the highest accuracy (8.1% for ER-LAS) because of the stability-plasticity dilemma (Kim & Han, 2023). In the sensitivity analysis of Section 6.4, we show that although a lower forgetting rate can be obtained by deliberately tuning the hyperparameters of LAS, the optimal hyperparameters achieve a better stability-plasticity trade-off. It is worth noting that in long-sequence benchmarks, compared to ER, which does not consider inter-class imbalance, methods that address recency bias not only remarkably reduce forgetting rates but also improve accuracy, underscoring the importance of inter-class imbalance as a top priority in lifelong class-IL. We provide prediction results on C-ImageNet in Appendix F.8 to further support the efficacy of our method in eliminating recency bias. We also evaluate the final average forgetting on C-CIFAR10, C-CIFAR100, and C-Tiny ImageNet in Appendix F.7.
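The two metrics discussed above can be illustrated with a small sketch. The paper's exact definitions of A_T and F_T are not reproduced in this section, so the code assumes the definitions commonly used in continual learning: A_T averages the accuracy on every task after the last task, and F_T averages, over all but the last task, the drop from the best earlier accuracy to the final accuracy. The accuracy matrix is hypothetical.

```python
from typing import List


def final_average_accuracy(acc: List[List[float]]) -> float:
    """acc[i][j] is the accuracy on task j after training on task i (0-indexed)."""
    last_row = acc[-1]
    return sum(last_row) / len(last_row)


def final_average_forgetting(acc: List[List[float]]) -> float:
    """Mean drop from the best earlier accuracy to the final accuracy (assumed definition)."""
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0
    drops = []
    for j in range(num_tasks - 1):
        # Best accuracy on task j observed before the final task was learned.
        best_earlier = max(acc[i][j] for i in range(j, num_tasks - 1))
        drops.append(best_earlier - acc[num_tasks - 1][j])
    return sum(drops) / len(drops)


# Hypothetical accuracy matrix for a 3-task stream (not taken from the paper).
acc = [
    [0.90, 0.00, 0.00],
    [0.60, 0.80, 0.00],
    [0.50, 0.60, 0.70],
]
a_t = final_average_accuracy(acc)
f_t = final_average_forgetting(acc)
```

A method can reach a low F_T simply by never learning new tasks well, which is why the two numbers are read together.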

6.2 Results on Online Blurry CL Scenarios

We compare ER-LAS with the best three baselines on B-CIFAR100 and B-Tiny ImageNet. Table 3 shows that ER-LAS outperforms all baselines on A_T and A_AUC. For example, ER-LAS improves on the best baseline by 3.0% A_T and 2.5% A_AUC on B-Tiny ImageNet. Our method is particularly suitable for the online blurry CL setup because LAS alleviates the detrimental effects of inter-class imbalance both inherent in the data stream and between new and old classes. The results of ER-LAS further confirm that this advantage helps obtain high accuracy throughout learning under the realistic online blurry CL setup with its challenging inter-class imbalance.


Table 6: Ablation study about two extreme situations of τ and about randomly assigned (Random) or macro statistical (Macro) class-priors on C-CIFAR100 (10 tasks). M = 2k.

| Method | τ = 0 | τ = ∞ | Random | Macro | LAS |
|---|---|---|---|---|---|
| A_T | 19.4±0.4 | 22.7±0.2 | 20.6±0.2 | 22.1±0.6 | 27.0±0.3 |
| F_T | 29.1±0.4 | 2.7±0.4 | 23.5±0.2 | 14.2±0.8 | 10.7±0.4 |

Table 7: Training time compared with the three fastest methods on C-CIFAR100 (10 tasks), measured on one NVIDIA GeForce RTX 2080 Ti. M = 2k.

| Method | ER | ER-ACE | MRO | ER-LAS |
|---|---|---|---|---|
| Training Time (s) | 77.4 (×0.94) | 84.7 (×1.02) | 99.0 (×1.20) | 82.6 (×1.00) |

Figure 2: Final average accuracy (darker is better, left) and final average forgetting (lighter is better, right) of various hyperparameter combinations on C-CIFAR100 (10 tasks). M = 2k.

6.3 Gains on Enhanced Methods

Rehearsal-based methods on online class-IL scenarios. We verify the performance boost of LAS by plugging it into ER, MIR, ASERµ, and OCS. The latter three baselines train via softmax cross-entropy loss with different replay strategies, which harmonize with our approach. Table 4 shows that LAS significantly improves ER and its variants (+1.7% to +9.1%) in the online class-IL setup. Although all of these methods, with their various memory management strategies, benefit from LAS, the gains depend on how well class-priors can be estimated from retrieval: a relatively smaller boost is observed on ASERµ, which has a sophisticated strategy to manage memory.

Knowledge distillation methods on online Sum-Class-Domain CL scenarios. To further investigate the effectiveness of LAS in alleviating inter-class imbalance, we combine it with knowledge distillation approaches in the difficult and realistic online CL setup that sums up class- and domain-IL. Table 5 summarizes the results. L_CE represents the basic CE loss used in ER. Knowledge distillation losses obtain higher accuracy by adapting to intra-class domain drift. Augmented by LAS, they show consistent gains (+2.1% to +4.9%) from eliminating the class-imbalanced prior bias. The results demonstrate the validity of our proposal to handle class-conditionals and class-priors separately in non-stationary stream learning, and showcase the improvement obtained by eliminating imbalanced class-priors in this CL setup. Note that we allowed the knowledge distillation methods to preserve old models at task boundaries, which is intractable in real-world online CL. In future studies, we will explore efficient and task-free methods for handling intra-class domain drift to further refine the solution to online Sum-Class-Domain CL.

6.4 Ablation Studies

Extreme variants of LAS. We investigate the performance of two variants of our method by pushing τ towards two extremes. When τ = 0, LAS degenerates into the traditional softmax cross-entropy loss in ER. When τ = ∞, we set $(\pi_{y',t}/\pi_{y,t})^{\tau} = 0$ in Equation 6 whenever $\pi_{y',t}/\pi_{y,t} < 1$; otherwise we keep this coefficient and set τ = 1 so that training remains runnable. This achieves an effect similar to separating the gradients of new and old categories in ER-ACE, reducing representation shift. As shown in Table 6, the performance of τ = 0 is similar to ER, as expected. τ = ∞ benefits from a remarkably low forgetting rate. However, our proposed LAS with τ = 1 achieves the highest accuracy, indicating that enforcing a relative margin between classes based on the imbalanced class-priors obtains a better stability-plasticity trade-off.
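The role of τ can be seen in a minimal sketch of the loss. Equation 6 is not reproduced in this section; the code assumes it has the pairwise form referred to above, log(1 + Σ_{y'≠y} (π_{y'}/π_y)^τ exp(z_{y'} − z_y)), which is algebraically the same as softmax cross-entropy on logits shifted by τ·log π. The function name `las_loss` and its arguments are illustrative and are not taken from the authors' code.

```python
import math
from typing import Sequence


def las_loss(logits: Sequence[float], priors: Sequence[float],
             label: int, tau: float = 1.0) -> float:
    """Logit-adjusted softmax cross-entropy for one sample (assumed form).

    priors must be strictly positive for every class that has a logit.
    """
    # Shift each logit by tau * log(prior); tau = 0 recovers plain cross-entropy.
    adjusted = [z + tau * math.log(p) for z, p in zip(logits, priors)]
    # Numerically stable log-sum-exp over the adjusted logits.
    m = max(adjusted)
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_norm - adjusted[label]


def ce_loss(logits: Sequence[float], label: int) -> float:
    """Plain softmax cross-entropy, used here only for comparison."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits)) - logits[label]


# Two classes with equal logits; class 1 is rare in the current batches.
rare_loss = las_loss([0.0, 0.0], [0.9, 0.1], label=1, tau=1.0)
plain_loss = ce_loss([0.0, 0.0], label=1)
```

With equal logits the rare class receives a larger loss than under plain cross-entropy, so the model is pushed to give it a larger margin; this is the relative-margin effect the ablation attributes to τ = 1.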

Necessity of batch-wise estimator with sliding window. We empirically validate the necessity of our designed estimator. As Random, we draw the prior of each seen class from a uniform distribution U[0, 1] and normalize the priors to sum to 1. As Macro, we explicitly calculate the joint label distribution of the current data stream and the memory replay, which is intractable in practice. Results in Table 6 demonstrate that Random degrades to performance similar to ER, and Macro is also inferior to our proposed estimator. We conjecture that the online CL model is more concerned with the distribution within current or short-term


input batches than the macro distribution of the sequential data stream and memory. Therefore, our batch-wise estimator can better exploit the Logit Adjustment technique to improve performance.
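A batch-wise sliding-window estimator of this kind can be sketched as follows. Equation 7 is not reproduced in this section, so the code assumes the simplest reading: the prior of a class is its label frequency over the incoming and replayed labels of the last l training batches. The class name `SlidingWindowPrior` is hypothetical.

```python
from collections import Counter, deque
from typing import Dict, Hashable, Iterable


class SlidingWindowPrior:
    """Estimates class-priors from the labels of the last `window_len` batches."""

    def __init__(self, window_len: int) -> None:
        if window_len < 1:
            raise ValueError("window_len must be at least 1")
        # deque with maxlen drops the oldest batch automatically.
        self.window = deque(maxlen=window_len)

    def update(self, incoming_labels: Iterable[Hashable],
               replay_labels: Iterable[Hashable]) -> None:
        # One entry per training step: counts over incoming plus replayed labels.
        self.window.append(Counter(incoming_labels) + Counter(replay_labels))

    def priors(self) -> Dict[Hashable, float]:
        total = Counter()
        for counts in self.window:
            total.update(counts)
        n = sum(total.values())
        return {y: c / n for y, c in total.items()} if n else {}


est = SlidingWindowPrior(window_len=2)
est.update([0, 0, 1], [])        # step 1: only incoming samples
est.update([2, 2], [0, 1])       # step 2: new class 2 plus replay of 0 and 1
priors_after_two = est.priors()
est.update([3, 3], [3])          # step 3: step 1 falls out of the window
priors_after_three = est.priors()
```

Classes that were seen earlier but do not occur inside the window get no entry here; a real implementation would need a rule for them (for example a small floor value), which this sketch leaves out.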

Hyperparameter sensitivity analysis. We conduct the sensitivity analysis of the hyperparameters τ and l in our method in Figure 2. ER-LAS is robust to a wide range of l. In practice, if the distribution fluctuations in the stream can be discerned, we recommend a short l for streams that change rapidly and a long l otherwise. The temperature scalar τ has distinct impacts on accuracy and forgetting rate. Although a larger τ enables models to forget remarkably less, the best accuracy is achieved around τ = 1.0. Therefore, the stability-plasticity trade-off for target applications can be achieved by tuning τ and l together.

7 Conclusion

We discover the class-conditional invariant and prove the optimality of the class-conditional function that minimizes the class-balanced error in online class-IL. As a corollary of our theoretical analysis, we introduce Logit Adjusted Softmax with a batch-wise sliding-window estimator to pursue the class-conditional function. Extended to online GCL, knowledge of the learned class-conditional function should be preserved for adaptation to domain drift. Without model expansion or computationally intensive techniques, extensive experiments demonstrate that LAS achieves state-of-the-art performance on various benchmarks with minimal additional computational overhead, confirming the effectiveness and efficiency of our method in mitigating inter-class imbalance. It is effortless to implement LAS and plug it into rehearsal-based methods to correct their recency bias and boost their accuracy. Rehearsal-free approaches with LAS for online CL could be a subject of further study. Furthermore, we will continue to investigate efficient approaches to handling online domain drift, contributing to practical online GCL applications in the real world.

8 Broader Impacts and Limitations

Broader Impacts. The implication of our research for the CL community is two-fold. Firstly, our proposed LAS method is simple yet highly effective in eliminating the recency bias caused by inter-class imbalance. LAS can be easily implemented and enhances the performance of other online CL methods. Moreover, LAS readily adapts to real-world uses, such as robot environment adaptation, object detection in autonomous driving, and real-time recommendation systems, improving the accuracy of the augmented applications. Secondly, we propose partitioning non-stationary data stream learning into the class-conditional and class-prior functions and empirically demonstrate its effectiveness. Our approach lays the framework for future research on online general CL and can provoke more task-free methods to address domain drift. Overall, our work is unlikely to have a negative impact on society.

Limitations. The focus of our proposed method is primarily on addressing the impact of inter-class imbalance and the forgetting induced by recency bias on CL performance, so we eliminate the imbalanced class-priors to improve performance. However, the method cannot provide any benefit in domain-IL scenarios with only intra-class domain drift and no inter-class imbalance. Our next research focuses on developing task-free domain-IL methods to address intra-class domain drift efficiently. We hope to integrate our method with task-free domain-IL methods to form a comprehensive solution for online general CL. Another limitation of our approach is that it is not effectively applicable to rehearsal-free online CL. This drawback does not stem from the constraints of our theoretical framework but rather from the difficulty of balancing inter-class margins when using the logit adjustment technique to pursue the class-conditional function. For detailed discussions, please refer to the experimental results in Appendix F.2. In the future, we will continue to research rehearsal-free online CL.

Acknowledgments

The authors would like to thank the anonymous reviewers for their insightful comments.

The research leading to these results has received funding from National Key Research Development Project (2023YFF1104202), National Natural Science Foundation of China (62376155), Shanghai Municipal Science and Technology Research Program (22511105600) and Major Project (2021SHZDZX0102).


Hongjoon Ahn, Jihwan Kwak, Su Fang Lim, Hyeonsu Bang, Hyojun Kim, and Taesup Moon. Ss-il: Separated softmax for incremental learning. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 824-833, 2020.

Rahaf Aljundi, Lucas Caccia, Eugene Belilovsky, Massimo Caccia, Min Lin, Laurent Charlin, and Tinne Tuytelaars. Online continual learning with maximally interfered retrieval. ArXiv, abs/1908.04742, 2019.

Eden Belouadah and Adrian Daniel Popescu. Il2m: Class incremental learning with dual memory. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 583-592, 2019.

Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020 (NeurIPS), 2020.

Lucas Caccia, Rahaf Aljundi, Nader Asadi, Tinne Tuytelaars, Joelle Pineau, and Eugene Belilovsky. New insights on reducing abrupt representation change in online continual learning. The Tenth International Conference on Learning Representations (ICLR), 2022.

Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019 (NIPS), 32, 2019.

Francisco Manuel Castro, Manuel J. Marín-Jiménez, Nicolás Guil Mata, Cordelia Schmid, and Alahari Karteek. End-to-end incremental learning. ArXiv, abs/1807.09536, 2018.

Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet Kumar Dokania, Philip H. S. Torr, and Marc Aurelio Ranzato. Continual learning with tiny episodic memories. ArXiv, abs/1902.10486, 2019.

Arslan Chaudhry, Naeemullah Khan, Puneet Kumar Dokania, and Philip H. S. Torr. Continual learning in low-rank orthogonal subspaces. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020 (NeurIPS), 2020.

Aristotelis Chrysakis and Marie-Francine Moens. Online bias correction for task-free continual learning. In The Eleventh International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/forum?id=18XzeuYZh_.

Guillem Collell, Drazen Prelec, and Kaustubh R. Patil. Reviving threshold-moving: a simple plug-in bagging ensemble for binary and multiclass imbalanced data. ArXiv, abs/1606.08698, 2016.

Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge J. Belongie. Class-balanced loss based on effective number of samples. 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9260-9269, 2019.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, K. Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248-255, 2009.

Jiahua Dong, Wenqi Liang, Yang Cong, and Gan Sun. Heterogeneous forgetting compensation for class-incremental learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11742-11751, 2023.

Jhair Gallardo, Tyler L. Hayes, and Christopher Kanan. Self-supervised training enhances online continual learning. In British Machine Vision Conference, 2021. URL https://api.semanticscholar.org/CorpusID:232352548.

Yiduo Guo, B. Liu, and Dongyan Zhao. Online continual learning through mutual information maximization. In International Conference on Machine Learning (ICML), 2022.


Yiduo Guo, Bing Liu, and Dongyan Zhao. Dealing with cross-task class discrimination in online continual learning. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alexander Shepard, Hartwig Adam, Pietro Perona, and Serge J. Belongie. The inaturalist species classification and detection dataset. 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8769-8778, 2017.

Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. Learning a unified classifier incrementally via rebalancing. 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 831-839, 2019.

Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, and Yannis Kalantidis. Decoupling representation and classifier for long-tailed recognition. 8th International Conference on Learning Representations (ICLR), 2020.

Minsoo Kang, Jaeyoo Park, and Bohyung Han. Class-incremental learning by knowledge distillation with adaptive feature consolidation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16071-16080, 2022.

Dongwan Kim and Bohyung Han. On the stability-plasticity dilemma of class-incremental learning. ArXiv, abs/2304.01663, 2023.

James Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114:3521-3526, 2016.

Hyunseo Koh, Dahyun Kim, Jung-Woo Ha, and Jonghyun Choi. Online continual learning on class incremental blurry task configuration with anytime inference. The Tenth International Conference on Learning Representations (ICLR), 2022.

Alex Krizhevsky. Learning multiple layers of features from tiny images. In ArXiv, 2009.

Miroslav Kubát and Stan Matwin. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML), 1997.

Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Gregory G. Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44:3366-3385, 2019.

Ya Le and Xuan S. Yang. Tiny imagenet visual recognition challenge. In ArXiv, 2015.

Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40:2935-2947, 2016.

Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I. Jordan. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018 (NIPS), 2017.

Zheda Mai, Ruiwen Li, Hyunwoo J. Kim, and Scott Sanner. Supervised contrastive replay: Revisiting the nearest class mean classifier in online class-incremental continual learning. 2021 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), pp. 3584-3594, 2021.

Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109-165, 1989.

Aditya Krishna Menon, Harikrishna Narasimhan, Shivani Agarwal, and Sanjay Chawla. On the statistical consistency of algorithms for binary classification under class imbalance. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.


Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar. Long-tail learning via logit adjustment. 9th International Conference on Learning Representations (ICLR), 2021.

Arne J. Nagengast, Daniel A. Braun, and Daniel M. Wolpert. Risk-sensitivity and the mean-variance tradeoff: decision making in sensorimotor control. Proceedings of the Royal Society B: Biological Sciences, 278:2325-2332, 2011.

Ameya Prabhu, Philip H. S. Torr, and Puneet Kumar Dokania. Gdumb: A simple approach that questions our progress in continual learning. In 16th European Conference on Computer Vision (ECCV), 2020.

Roger Ratcliff. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological review, 97(2):285-308, 1990.

Dongsub Shim, Zheda Mai, Jihwan Jeong, Scott Sanner, Hyunwoo J. Kim, and Jongseong Jang. Online class-incremental continual learning with adversarial shapley value. In Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI), 2020.

Christian Simon, Piotr Koniusz, and Mehrtash Harandi. On learning the geodesic path for incremental learning. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1591-1600, 2021.

Jingru Tan, Changbao Wang, Buyu Li, Quanquan Li, Wanli Ouyang, Changqing Yin, and Junjie Yan. Equalization loss for long-tailed object recognition. In 2020 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11662-11671, 2020.

Xiaoyu Tao, Xinyuan Chang, Xiaopeng Hong, Xing Wei, and Yihong Gong. Topology-preserving class-incremental learning. In 16th European Conference on Computer Vision (ECCV), 2020.

Gido M Van De Ven, Zhe Li, and Andreas S Tolias. Class-incremental learning with generative classifiers. In 2021 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), pp. 3611-3620, 2021.

Gido M van de Ven, Tinne Tuytelaars, and Andreas S Tolias. Three types of incremental learning. Nature Machine Intelligence, 4(12):1185-1197, 2022.

Jeffrey Scott Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11:37-57, 1985.

Quanziang Wang, Renzhen Wang, Yichen Wu, Xixi Jia, and Deyu Meng. Cba: Improving online continual learning via continual bias adaptor. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 19082-19092, 2023.

Zifeng Wang, Zheng Zhan, Yifan Gong, Geng Yuan, Wei Niu, Tong Jian, Bin Ren, Stratis Ioannidis, Yanzhi Wang, and Jennifer G. Dy. Sparcl: Sparse continual learning on the edge. ArXiv, abs/2209.09476, 2022.

Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Raymond Fu. Large scale incremental learning. 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 374-382, 2019.

Jiangwei Xie, Shipeng Yan, and Xuming He. General incremental learning with domain-aware categorical representations. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14331-14340, 2022.

Mengya Xu, Mobarakol Islam, Chwee Ming Lim, and Hongliang Ren. Class-incremental domain adaptation with smoothing and calibration for surgical report generation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2021.

Jaehong Yoon, Divyam Madaan, Eunho Yang, and Sung Ju Hwang. Online coreset selection for rehearsal-based continual learning. In International Conference on Learning Representations (ICLR), 2022. URL https://openreview.net/forum?id=f9D-5WNG4Nv.


Michał Zając, Tinne Tuytelaars, and Gido M van de Ven. Prediction error-based classification for class-incremental learning. ArXiv, abs/2305.18806, 2023.

Chen Zeno, Itay Golan, Elad Hoffer, and Daniel Soudry. Task agnostic continual learning using online variational bayes. ArXiv, abs/1803.10123, 2018.

Da-Wei Zhou, Qiwen Wang, Zhiyuan Qi, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. Deep class-incremental learning: A survey. ArXiv, abs/2302.03648, 2023.


A.1 Proof in Online Class-incremental Scenarios

To tackle the issue of inter-class imbalance, extensive research (Menon et al., 2013; Collell et al., 2016; Menon et al., 2021) has been conducted on the optimal classifier that minimizes the class-balanced error for stable distributions. Prior work has established the following theorem about this optimal classifier:

Theorem A.1. For time-invariant distributions, the optimal estimate that minimizes the class-balanced error is the class under which the sample probability is most likely:

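$$\Phi^{*} \in \arg\min_{\Phi:\mathcal{X}\to\mathbb{R}^{|\mathcal{Y}|}} \mathrm{CBE}(\Phi, \mathcal{Y}), \qquad \arg\max_{y \in |\mathcal{Y}|} \Phi^{*}_{y}(x) = \arg\max_{y \in |\mathcal{Y}|} P(x \mid y) \tag{8}$$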

Theorem A.1 (Menon et al., 2013; Collell et al., 2016) states that this optimal classifier is independent of arbitrary imbalanced label distributions P(y). The class-conditional function in stable distributions naturally minimizes the class-balanced error. From Theorem A.1 and the condition of fixed class-conditionals, i.e., $\forall t,\ P(x \mid y, \rho_t) = P(x \mid y, \rho_0)$, we can derive the proof of Theorem 3.1 as follows:

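$$\arg\max_{y \in |\mathcal{Y}_t|} \Phi^{*}_{t,y}(x) = \arg\max_{y \in |\mathcal{Y}_t|} \frac{1}{t}\sum_{i=1}^{t} P(x \mid y, \rho_i) = \arg\max_{y \in |\mathcal{Y}_t|} P(x \mid y, \rho_t) \tag{9}$$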
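The two steps behind Theorem A.1 and Equation 9 can be written out with Bayes' rule. The short derivation below is a sketch in the notation of this appendix; the intermediate form $P(y \mid x)/P(y)$ is the standard balanced-error argument and is stated here as an assumption about how the cited theorem is obtained, not as a quotation from it.

```latex
\begin{align*}
\arg\max_{y} \frac{P(y \mid x)}{P(y)}
  &= \arg\max_{y} \frac{P(x \mid y)}{P(x)}
   = \arg\max_{y} P(x \mid y), \\
\frac{1}{t}\sum_{i=1}^{t} P(x \mid y, \rho_i)
  &= \frac{1}{t}\cdot t \cdot P(x \mid y, \rho_0)
   = P(x \mid y, \rho_t)
  \quad \text{when } P(x \mid y, \rho_i) = P(x \mid y, \rho_0)\ \ \forall i .
\end{align*}
```

The first line shows why the label prior drops out of the balanced-error optimum; the second shows why averaging over time steps changes nothing when the class-conditionals are fixed.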

A.2 Proof in Online General Continual Learning Scenarios

Without any prior information about the distribution of the test data, we assume that its distribution should conform to a uniform joint distribution of all observed class distributions. Therefore, the final intra-class distribution is $\frac{1}{t}\sum_{i=1}^{t} P(x \mid y, \rho_i)$, and the result of Equation 4 follows by definition.

Let $p_{x|y}$ and $q_{x|y}$ be the underlying distributions represented by the optimal classifier that minimizes the class-balanced error and by the learned class-conditional function, respectively. The class-balanced error gap between the optimal classifier $\exp(\Phi^{*}_{t,y}(x)) \propto P(x \mid y, p) = p_{x|y}$ and the learned class-conditional function $\exp(\Phi_{t,y}(x)) \propto P(x \mid y, q) = q_{x|y}$ can be formalized as follows:

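$$\underbrace{\left|\,\mathrm{CBE}(\Phi^{*}, \mathcal{Y}_t) - \mathrm{CBE}(\Phi, \mathcal{Y}_t)\,\right|}_{\epsilon_t(\Phi^{*},\Phi)} \le \underbrace{\frac{1}{|\mathcal{Y}_t|}\sum_{y \in \mathcal{Y}_t} \mathbb{E}_{\rho_t}\!\left[\mathbb{E}_{x|y,\rho_t}\!\left[\arg\max_{y \in \mathcal{Y}_t} p_{x|y} \neq \arg\max_{y \in \mathcal{Y}_t} q_{x|y}\right]\right]}_{d_t(p_{x|y},\,q_{x|y})} \tag{10}$$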

Equation 10 describes the disparity $\epsilon_t(\cdot,\cdot)$ from the optimal solution by a similarity measure $d_t(\cdot,\cdot)$ in the probability space. Aligning two class-conditionals requires techniques for domain generalization and concept shift. In the future, we will explore efficient class-conditional alignment techniques in the context of online CL.

B Algorithm

We give the algorithm of Experience Replay in Algorithm 1. The algorithm of our proposed Logit Adjusted Softmax enhanced Experience Replay in Algorithm 2 is mainly based on Algorithm 1. We also apply our method to online GCL by combining with knowledge distillation, as shown in Algorithm 3.
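The control flow shared by these algorithms (retrieve from memory, train on the concatenated batch, then update memory) can be sketched in a few lines. The Retrieval and Update routines are left abstract in the algorithms; this sketch assumes uniform random retrieval and reservoir sampling (Vitter, 1985) for the update, and `ReservoirBuffer`, `er_stream`, and `train_step` are hypothetical names, not the authors' implementation.

```python
import random
from typing import Callable, Iterable, List, Tuple

Sample = Tuple[int, int]  # (sample id, label); stands in for an (image, label) pair


class ReservoirBuffer:
    """Fixed-size memory in which every seen sample is kept with equal probability."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self.capacity = capacity
        self.data: List[Sample] = []
        self.seen = 0
        self.rng = random.Random(seed)

    def update(self, batch: Iterable[Sample]) -> None:
        for item in batch:
            self.seen += 1
            if len(self.data) < self.capacity:
                self.data.append(item)
            else:
                # Replace a random slot with probability capacity / seen.
                j = self.rng.randrange(self.seen)
                if j < self.capacity:
                    self.data[j] = item

    def retrieve(self, k: int) -> List[Sample]:
        return self.rng.sample(self.data, min(k, len(self.data)))


def er_stream(stream: Iterable[List[Sample]], buffer: ReservoirBuffer,
              train_step: Callable[[List[Sample]], None], replay_size: int) -> None:
    for incoming in stream:
        replay = buffer.retrieve(replay_size)  # B_t^M <- Retrieval(B_t, M_t)
        train_step(incoming + replay)          # one SGD step on the joint batch
        buffer.update(incoming)                # M_{t+1} <- Update(B_t, M_t)


# Toy stream: 6 batches of 4 samples; train_step only records the batch size.
batch_sizes: List[int] = []
buf = ReservoirBuffer(capacity=5, seed=0)
toy_stream = [[(10 * b + k, k % 2) for k in range(4)] for b in range(6)]
er_stream(toy_stream, buf, lambda batch: batch_sizes.append(len(batch)), replay_size=3)
```

In the LAS variant, the only change to this loop is that class-priors are estimated from the labels of the joint batch before the loss is computed.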

C Benchmark Details

C.1 Dataset Details

We list the image size, the total number of training samples, the total number of test samples, and the total number of classes for the 5 datasets (CIFAR10 Krizhevsky (2009), CIFAR100 Krizhevsky (2009),


Algorithm 1 Experience Replay (ER) (Chaudhry et al., 2019)

Input: Data stream {D_t}_{t=1}^{T}
Initialize: Learner Φ(·), model parameter Θ, memory buffer M_1 ← {}, label set Y_1 ← {}.
for t = 1 to T do
    Sample incoming batch B_t from D_t
    Y_t ← Y_{t-1} ∪ set({y_i}_{i=1}^{|B_t|})
    B_t^M ← Retrieval(B_t, M_t)
    z ← Φ(concat(B_t, B_t^M), Θ)
    SGD( 1/(|B_t| + |B_t^M|) Σ_{i=1}^{|B_t| + |B_t^M|} L_CE(y_i, z_i), Θ )
    M_{t+1} ← Update(B_t, M_t)
end for

Algorithm 2 Experience Replay with Logit Adjusted Softmax (ER-LAS)

Input: Data stream {D_t}_{t=1}^{T}, temperature scalar τ, sliding window estimator length l
Initialize: Learner Φ(·), model parameter Θ, memory buffer M_1 ← {}, label set Y_1 ← {}.
for t = 1 to T do
    Sample incoming batch B_t from D_t
    Y_t ← Y_{t-1} ∪ set({y_i}_{i=1}^{|B_t|})
    B_t^M ← Retrieval(B_t, M_t)
    for y in Y_t do
        π_{y,t} ← compute class-priors from Equation 7
    end for
    z ← Φ(concat(B_t, B_t^M), Θ)
    SGD( 1/(|B_t| + |B_t^M|) Σ_{i=1}^{|B_t| + |B_t^M|} L_LAS(y_i, z_i), Θ )
    M_{t+1} ← Update(B_t, M_t)
end for

Tiny ImageNet Le & Yang (2015), ImageNet Deng et al. (2009), and iNaturalist Horn et al. (2017)) in Table 8. In the former four class-balanced datasets, every category contains the same number of training samples and the same number of test samples. iNaturalist, however, has an inherent imbalance between classes, posing a greater challenge. We download the iNaturalist dataset from https://github.com/visipedia/inat_comp.

C.2 Continual Learning Setup Details

In online class-IL (Aljundi et al., 2019), the classes of CIFAR10, CIFAR100, Tiny ImageNet, and ImageNet are evenly split into tasks, and the classes in different tasks are disjoint. For example, C-CIFAR10 (5 tasks) is split into 5 disjoint tasks with 2 classes each. As a result, the numbers of training samples, test samples, and classes are the same in each task, except for iNaturalist. We divide iNaturalist into 26 disjoint tasks according to the initial letter of the category. The number of classes in each task is shown in Figure 3, which shows that it varies significantly among tasks. Note that the classes within each task are also imbalanced. The comprehensive inter-class imbalance issues of C-iNaturalist (26 tasks) pose great challenges to online CL methods.
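The even, disjoint split used for the class-balanced datasets can be sketched as below. The assignment of classes to tasks in label order is an assumption made for illustration; the actual class order used in the experiments is not stated in this section, and `split_classes` is a hypothetical helper.

```python
from typing import List


def split_classes(num_classes: int, num_tasks: int) -> List[List[int]]:
    """Splits class ids 0..num_classes-1 into equally sized, disjoint tasks."""
    if num_classes % num_tasks != 0:
        raise ValueError("classes must divide evenly across tasks")
    per_task = num_classes // num_tasks
    return [list(range(t * per_task, (t + 1) * per_task)) for t in range(num_tasks)]


c_cifar10 = split_classes(10, 5)     # C-CIFAR10 (5 tasks): 2 classes per task
c_cifar100 = split_classes(100, 10)  # C-CIFAR100 (10 tasks): 10 classes per task
```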

In online blurry CL (Koh et al., 2022), the classes are divided into a disjoint part comprising N_blurry% of the classes and a blurry part comprising the remaining (100 − N_blurry)%. The classes that belong to the disjoint part only appear in fixed tasks, while all classes in the blurry part occur throughout the data stream. In each task, (#train − (T − 1) × M_blurry) instances are sampled from the training data of each head blurry class and M_blurry instances are sampled from the training data of each remaining blurry class, which forms the markedly class-imbalanced blurry part. The classes in the blurry part take turns playing the role of head classes across different tasks. During inference, the model predicts on test samples from all currently observed classes. We split CIFAR100 and Tiny ImageNet into 10 blurry tasks according to Koh et al. (2022) with a fixed disjoint ratio N_blurry = 50 and blurry level M_blurry = 10. Next we take B-CIFAR100 (10 tasks) as an example.


Algorithm 3 Knowledge distillation with Logit Adjusted Softmax (KD-LAS)

Input: Data stream {D_t}_{t=1}^{T}, temperature scalar τ, sliding window estimator length l
Initialize: Learner Φ(·), model parameter Θ, memory buffer M_1 ← {}, label set Y_1 ← {}.
for t = 1 to T do
    Sample incoming batch B_t from D_t
    Y_t ← Y_{t-1} ∪ set({y_i}_{i=1}^{|B_t|})
    B_t^M ← Retrieval(B_t, M_t)
    for y in Y_t do
        π_{y,t} ← compute class-priors from Equation 7
    end for
    z ← Φ(concat(B_t, B_t^M), Θ)
    z^old ← Φ(concat(B_t, B_t^M), Θ_old)
    SGD( 1/(|B_t| + |B_t^M|) Σ_{i=1}^{|B_t| + |B_t^M|} (L_LAS(y_i, z_i) + L_KD(z_i, z_i^old)), Θ )
    M_{t+1} ← Update(B_t, M_t)
    if t ends a task then
        Save the old model Θ_old ← Θ
    end if
end for

Table 8: Dataset information for CIFAR10, CIFAR100, Tiny ImageNet, ImageNet, and iNaturalist.

| Dataset | Image Size | # Train | # Test | # Class |
|---|---|---|---|---|
| CIFAR10 (Krizhevsky, 2009) | 3×32×32 | 50,000 | 10,000 | 10 |
| CIFAR100 (Krizhevsky, 2009) | 3×32×32 | 50,000 | 10,000 | 100 |
| Tiny ImageNet (Le & Yang, 2015) | 3×64×64 | 100,000 | 10,000 | 200 |
| ImageNet (Deng et al., 2009) | 3×224×224 | 1,281,167 | 50,000 | 1000 |
| iNaturalist (Horn et al., 2017) | 3×299×299 | 579,184 | 95,986 | 5089 |

In B-CIFAR100 (10 tasks, $N_{\text{blurry}} = 50$, $M_{\text{blurry}} = 10$, #train = 500 per class, #class = 100), the disjoint part contains 50 classes, and each task possesses 5 disjoint classes of training data. The blurry part comprises the other 50 classes, and each task has 5 head classes. Each head class contributes $500 - 9 \times 10 = 410$ training samples, whereas each of the remaining 45 blurry classes has only 10 training samples for the current task. Therefore, the model in this setup continuously observes new disjoint classes as the stream flows, while imbalanced blurry classes overlap across all tasks, leading to a severe problem of inter-class imbalance.
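The counts quoted for B-CIFAR100 follow from simple arithmetic on the split parameters. The sketch below reproduces them; the function name `blurry_split_counts` and its argument names are hypothetical and are not taken from the authors' code or from Koh et al. (2022).

```python
def blurry_split_counts(num_classes, train_per_class, num_tasks, n_blurry, m_blurry):
    """Per-task counts for the online blurry CL split described above.

    n_blurry: percentage of classes in the disjoint part (N_blurry).
    m_blurry: samples drawn per task from each non-head blurry class (M_blurry).
    All names here are illustrative assumptions.
    """
    disjoint_classes = num_classes * n_blurry // 100
    blurry_classes = num_classes - disjoint_classes
    head_per_task = blurry_classes // num_tasks
    # A head class keeps whatever is not handed out to the other T - 1 tasks.
    head_samples = train_per_class - (num_tasks - 1) * m_blurry
    return {
        "disjoint_classes_per_task": disjoint_classes // num_tasks,
        "head_classes_per_task": head_per_task,
        "head_samples_per_class": head_samples,
        "tail_classes_per_task": blurry_classes - head_per_task,
        "tail_samples_per_class": m_blurry,
    }


# B-CIFAR100: 100 classes, 500 training images each, 10 tasks.
counts = blurry_split_counts(100, 500, 10, n_blurry=50, m_blurry=10)
print(counts)
```

Within one task, a head class is therefore 41 times more frequent than a tail blurry class, which is the imbalance the paragraph above refers to.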

In online Sum-Class-Domain CL (Xie et al., 2022), incoming data contains images from new classes and new domains. We apply online Sum-Class-Domain CL only to CIFAR100. Similar to the online class-IL setup, we partition the CIFAR100 dataset into 20 tasks, each with 5 subclasses. However, the model is required to predict superclasses, with each subclass representing a distinct domain within its superclass. Each domain of a superclass has the same number of training samples. As depicted in Figure 4, different superclasses appear in different tasks, and the number of superclasses varies from task to task. The distribution within each superclass also changes across domains. Therefore, S-CIFAR100 (20 tasks) possesses both inter-class imbalance and intra-class domain drift, i.e., changing class-priors and class-conditionals. Next, we discuss scenarios where mixed class and domain distributions jointly change, presenting a greater challenge than the sum of the individual online class-IL and domain-IL problems. We refer to this case as "Mix-Class-Domain." The increased difficulty arises from the possible coupling between domain shift and class distribution shift in Mix-Class-Domain CL.


Figure 3: The number of classes per task in the divided iNaturalist. Each of the 26 tasks contains the categories that share the corresponding initial letter.

Class-IL: The joint distribution of sample $x$ and label $y$ can be written as $p(x, y, t) = p(x \mid y, t)\, p(y, t)$, where $p(x \mid y, t)$ is constant throughout time $t$, while $p(y, t)$ changes with the data flow and brings about inter-class imbalance.

Domain-IL: A common assumption in domain-IL is that each class has a latent domain indicator $z$. The joint distribution of sample $x$, label $y$, and domain indicator $z$ can be decomposed into $p(x, y, z, t) = p(x \mid z, y, t)\, p(z \mid y, t)\, p(y, t)$, where only $p(z \mid y, t)$ changes over time $t$, while $p(x \mid z, y, t)$ and $p(y, t)$ remain constant.

Sum-Class-Domain: This is a special case of the Mix-Class-Domain CL problems, where domain shift and class distribution shift are separable. In the decomposition of the joint distribution, $p(x, y, z, t) = p(x \mid z, y, t)\, p(z \mid y, t)\, p(y, t)$, only $p(x \mid z, y, t)$ remains invariant and independent of $t$, while $p(z \mid y, t)$ and $p(y, t)$ vary over time, giving rise to intra-class imbalance and inter-class imbalance, respectively.

Mix-Class-Domain: We consider the inherent coupling between domain indicators $z$ and class labels $y$, rather than the separability assumed above that allows for the three-factor decomposition. In this case, the joint distribution can only be decomposed as $p(x, y, z, t) = p(x \mid z, y, t)\, p(z, y, t)$. Because domain indicators and class labels are inseparable, this is a more challenging problem than the separable Sum-Class-Domain CL and may lead to more forgetting.

Unfortunately, there is also a lack of standard benchmarks for the indecomposable case. As one of the most challenging and realistic scenarios for practical applications, it deserves further in-depth investigation in future research.

D Metrics Details

Assume the test samples of task $j$ are $S_j = \{(x_n, y_n)\}_{n=1}^{N^j}$. The number of test samples in each class $y$ for task $j$ is $N^j_y$. The model trained on task $i$ is $\Phi_i$. The seen class set at task $i$ is $Y_i$. The accuracy $a_{i,j}$ on task $j$ after training on task $i$ is formalized as follows:

$$a_{i,j} = \frac{1}{N^j} \sum_{(x_n, y_n) \in S_j} \mathbb{1}\Big[\arg\max_{y' \in Y_i} \Phi_{i,y'}(x_n) = y_n\Big].$$


Figure 4: An illustration of the occurrence of subclasses within each superclass for every task in S-CIFAR100 (20 tasks). The y-axis represents the number of occurrences of subclasses. The x-axis represents the 20 superclasses. Note that each subclass is a distinct domain.

For long-tailed data streams with inherent inter-class imbalance, we also consider a more appropriate metric, namely class-balanced accuracy $a^{cbl}_{i,j}$, instead of standard accuracy $a_{i,j}$ for evaluation. Class-balanced accuracy excludes prior class imbalances and prevents the overestimation of trivial solutions that assign high probabilities to major classes.

$$a^{cbl}_{i,j} = \frac{1}{|Y_i|} \sum_{y \in Y_i} \frac{1}{N^j_y} \sum_{\{(x_n, y_n) \mid (x_n, y_n) \in S_j,\, y_n = y\}} \mathbb{1}\Big[\arg\max_{y' \in Y_i} \Phi_{i,y'}(x_n) = y\Big]. \quad (12)$$

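The corresponding final average accuracy $A_T$ and final average class-balanced accuracy $A^{cbl}_T$ can be calculated as follows:

$$A_T = \frac{1}{T} \sum_{j=1}^{T} a_{T,j}, \quad (13)$$

$$A^{cbl}_T = \frac{1}{T} \sum_{j=1}^{T} a^{cbl}_{T,j}. \quad (14)$$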

In datasets such as CIFAR-10 and CIFAR-100, where class-priors are uniform, the ﬁnal average accuracy is equal to the ﬁnal average class-balanced accuracy, consistent with our analysis.
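The difference between the two metrics, and their agreement under uniform class-priors, can be illustrated with a small NumPy sketch. The function names are hypothetical, and treating a seen class with no test samples in the task as contributing zero is an assumption of this sketch rather than something stated in the paper.

```python
import numpy as np


def accuracy(preds, labels):
    """Standard accuracy: fraction of test samples predicted correctly."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    return float(np.mean(preds == labels))


def class_balanced_accuracy(preds, labels, seen_classes):
    """Per-class recall, summed over the seen classes and divided by |Y_i|."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    total = 0.0
    for y in seen_classes:
        mask = labels == y
        if mask.any():  # assumption: a class absent from the test set adds 0
            total += np.mean(preds[mask] == y)
    return float(total / len(seen_classes))


# Imbalanced test set: a trivial "always predict class 0" model looks good
# under standard accuracy but not under class-balanced accuracy.
labels_imb = [0] * 8 + [1] * 2
preds_imb = [0] * 10
acc_imb = accuracy(preds_imb, labels_imb)
cbl_imb = class_balanced_accuracy(preds_imb, labels_imb, [0, 1])

# Balanced test set: the two metrics coincide.
labels_bal = [0, 0, 1, 1]
preds_bal = [0, 1, 1, 1]
acc_bal = accuracy(preds_bal, labels_bal)
cbl_bal = class_balanced_accuracy(preds_bal, labels_bal, [0, 1])
```

On the imbalanced example the standard accuracy is 0.8 while the class-balanced accuracy is 0.5; on the balanced example both equal 0.75.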


Following Chaudhry et al. (2020), the final average forgetting $F_T$ is computed as follows:

$$F_T = \frac{1}{T - 1} \sum_{j=1}^{T-1} \max_{i \in \{1, \ldots, T-1\}} (a_{i,j} - a_{T,j}). \quad (15)$$
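As a minimal sketch of how the final average accuracy and final average forgetting are obtained from a matrix of task accuracies, consider the following. The layout `acc[i, j]` (accuracy on task j after training on task i, zero-indexed, with zeros for tasks not yet seen) and the toy numbers are assumptions made for illustration only.

```python
import numpy as np


def final_average_accuracy(acc):
    """Mean of the last row: accuracy on every task after the final task."""
    acc = np.asarray(acc, dtype=float)
    return float(acc[-1].mean())


def final_average_forgetting(acc):
    """Mean drop from the best earlier accuracy to the final accuracy,
    taken over the first T - 1 tasks."""
    acc = np.asarray(acc, dtype=float)
    T = acc.shape[0]
    # Best accuracy on tasks 1..T-1 achieved by any of the models 1..T-1.
    best_earlier = acc[: T - 1, : T - 1].max(axis=0)
    return float(np.mean(best_earlier - acc[-1, : T - 1]))


# Toy 3-task run (made-up numbers, not results from the paper).
toy_acc = [
    [0.9, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.5, 0.6, 0.7],
]
A_T = final_average_accuracy(toy_acc)
F_T = final_average_forgetting(toy_acc)
```

For this toy matrix, the final average accuracy is 0.6 and the final average forgetting is 0.3, since the first two tasks drop by 0.4 and 0.2 from their best earlier values.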

We follow Koh et al. (2022) and add the Area Under the Curve of Accuracy $A_{\text{AUC}}$ in the online blurry CL setup. $A_{\text{AUC}}$ is the area under the accuracy-to-{# of samples} curve. We simplify the calculation of $A_{\text{AUC}}$ by replacing {# of samples} with {# of steps}. This metric can then be calculated as follows:

$$A_{\text{AUC}} = \sum_{k=1}^{K} f(k \cdot n) \cdot n, \quad (16)$$

where $N$ represents the total number of training steps, $n$ denotes that we sample the accuracy $f(\cdot)$ of the model every $n$ steps, and $K$ is the total number of sample intervals. We set $n = 5$ in the experiments.

E Implementation Details

E.1 Baseline Implementation

We list below the hyperparameter configurations for the baseline methods mentioned in this paper, along with the sources of their code implementations.

For ER (Chaudhry et al., 2019), we set the learning rate as 0.03. The code source is https://github.com/aimagelab/mammoth.

For DER++ (Buzzega et al., 2020), we set the learning rate as 0.03. α is set to 0.1, and β is set to 0.5. The code source is https://github.com/aimagelab/mammoth.

For MRO (Chrysakis & Moens, 2023), we set the learning rate as 0.03. The code source is https://github.com/aimagelab/mammoth.

For SS-IL (Ahn et al., 2020), we set the learning rate as 0.03. We update the teacher model every 100 steps. The code source is https://github.com/hongjoon0805/SS-IL-Official.

For CLIB (Koh et al., 2022), we set the learning rate as 0.03. The period between sample-wise importance updates is set to 3. The code source is https://github.com/naver-ai/i-Blurry.

For ER-ACE (Caccia et al., 2022), we set the learning rate as 0.03. The code source is https://github.com/pclucas14/AML.

For ER-OBC (Chrysakis & Moens, 2023), we set the learning rate as 0.03 for both training and bias correction. The code source is https://github.com/chrysakis/OBC.

For ER-CBA (Wang et al., 2023), we set the learning rate as 0.001 for C-CIFAR10, and 0.01 for C-CIFAR100 and C-Tiny ImageNet. The code source is https://github.com/wqza/CBA-online-CL.

For MIR (Aljundi et al., 2019), we set the learning rate as 0.03. The number of subsampling in replay is 160. The code source is https://github.com/optimass/Maximally_Interfered_Retrieval.

For ASERµ (Shim et al., 2020), we set the learning rate as 0.03. The number of nearest neighbors K used by ASER is 5. We use the mean values of adversarial Shapley values and cooperative Shapley values. The maximum number of samples per class for random sampling is 6.0 times the incoming batch size. The code source is https://github.com/RaptorMai/online-continual-learning.

For OCS (Yoon et al., 2022), we set the learning rate as 0.03. The hyperparameter τ that controls the degree of model plasticity and stability is set to 1000.0. The code source is https://openreview.net/forum?id=f9D-5WNG4Nv.

For LwF (Li & Hoiem, 2016), we set the learning rate as 0.03. The penalty weight α is set to 0.5 and the temperature scalar is set to 2.0. The code source is https://github.com/aimagelab/mammoth.


For LUCIR (Hou et al., 2019), we set the learning rate as 0.03. λbase is set to 5.0 for all the experiments. The code source is https://github.com/hshustc/CVPR19_Incremental_Learning.

For GeoDL (Simon et al., 2021), we set the learning rate as 0.03. The adaptive weight β is set to 5.0. The code source is https://github.com/chrysts/geodesic_continual_learning.

We also list the hyperparameter conﬁguration for the baseline methods used in this appendix with their sources of code implementation.

For SCR (Mai et al., 2021), we set the learning rate as 0.03 and the temperature as τ = 0.07. The code source is https://github.com/RaptorMai/online-continual-learning.

For OCM (Guo et al., 2022), we use Adam optimizer and set the learning rate as 0.001. The code source is https://github.com/gydpku/OCM.

For BiC (Wu et al., 2019), we set the learning rate as 0.03. We split off 10% of the training data as a validation set for training the bias injector for 50 epochs. The softmax temperature T is 2.0. Distillation loss is also applied after bias correction. The code source is https://github.com/sairin1202/BIC.

For E2E (Castro et al., 2018), we set the learning rate as 0.03. In the process of balanced fine-tuning, we set the learning rate as 0.003 and train for 30 epochs. The code source is https://github.com/PatrickZH/End-to-End-Incremental-Learning.

For IL2M (Belouadah & Popescu, 2019), we set the learning rate as 0.03. We calculate the mean and variance of each batch online to re-scale the outputs. The code source is https://github.com/EdenBelouadah/class-incremental-learning.

For LUCIR (Hou et al., 2019), we set the learning rate as 0.03. λbase is set to 5.0, K is set to 2, and m is set to 0.5 for all the experiments. The code source is https://github.com/hshustc/CVPR19_Incremental_Learning.

E.2 Ablation Implementation

In the ablation study of Section 6.4, we employ two extreme variants of LAS (τ = 0 and τ = ∞), along with two estimators. The Random estimator assigns class-priors randomly. The Macro estimator assigns class-priors using statistical information. We now elaborate on how these four methods are implemented. Recall our proposed Logit Adjusted Softmax cross-entropy loss in Equation 6:

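$$L_{LAS}(y, \Phi(x)) = -\log \frac{e^{\Phi_y(x) + \tau \log \pi_{y,t}}}{\sum_{y' \in Y_t} e^{\Phi_{y'}(x) + \tau \log \pi_{y',t}}} = \log\Big[1 + \sum_{y' \neq y} \Big(\frac{\pi_{y',t}}{\pi_{y,t}}\Big)^{\tau} e^{(\Phi_{y'}(x) - \Phi_y(x))}\Big]. \quad (17)$$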
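The equality of the two forms in Equation 17 (a softmax cross-entropy on prior-shifted logits, and a sum of pairwise margins weighted by prior ratios) can be checked numerically. The following NumPy sketch is illustrative only: the function names, logits, and priors are made up for the check and are not taken from the authors' implementation.

```python
import numpy as np


def las_loss_softmax_form(logits, priors, y, tau=1.0):
    """Cross-entropy on logits shifted by tau * log(prior)."""
    adjusted = logits + tau * np.log(priors)
    adjusted = adjusted - adjusted.max()  # stabilise the exponentials
    return float(-(adjusted[y] - np.log(np.exp(adjusted).sum())))


def las_loss_margin_form(logits, priors, y, tau=1.0):
    """log[1 + sum over y' != y of (pi_y' / pi_y)^tau * exp(Phi_y' - Phi_y)]."""
    others = np.arange(len(logits)) != y
    coeff = (priors[others] / priors[y]) ** tau
    return float(np.log1p(np.sum(coeff * np.exp(logits[others] - logits[y]))))


# Hypothetical logits and class-priors for four seen classes.
logits = np.array([2.0, 0.5, -1.0, 0.3])
priors = np.array([0.25, 0.25, 0.0625, 0.4375])

gaps = [
    abs(las_loss_softmax_form(logits, priors, y, tau)
        - las_loss_margin_form(logits, priors, y, tau))
    for y in range(4)
    for tau in (0.0, 1.0, 2.0)
]
```

With `tau = 0.0` every coefficient equals 1 and both functions reduce to the ordinary softmax cross-entropy; larger `tau` values scale each pairwise term by a power of the prior ratio.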

τ = 0 is a simple special case that sets the temperature scalar to 0.

τ = ∞ needs modification, because directly setting the hyperparameter τ to a large value to approximate τ → ∞ causes trouble when $\pi_{y',t}/\pi_{y,t} > 1$: the coefficient $(\pi_{y',t}/\pi_{y,t})^{\tau}$ diverges and the gradients explode, obstructing gradient descent optimization. Therefore, as shown in Equation 18, we set the coefficient to 0 only when $\pi_{y',t}/\pi_{y,t} < 1$, while retaining τ = 1 in all other situations to enable successful model training. The significantly low forgetting rate and competitive accuracy observed in the experimental results suggest that this approach closely approximates τ = ∞ as expected.

$$(\pi_{y',t}/\pi_{y,t})^{\tau} = \begin{cases} 0, & \pi_{y',t}/\pi_{y,t} < 1 \\ \pi_{y',t}/\pi_{y,t}, & \text{otherwise} \end{cases} \quad (18)$$
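A minimal sketch of the coefficient rule in Equation 18, plugged into the margin form of the loss, is given below. The function names and the example priors are hypothetical; only the thresholding rule itself comes from the text.

```python
import numpy as np


def clipped_coefficients(priors, y):
    """Coefficients of Equation 18 for target class y.

    Ratios below 1 are set to 0; all other ratios are kept as they are
    (i.e. tau = 1). The entry at index y is not used by the loss.
    """
    ratio = priors / priors[y]
    return np.where(ratio < 1.0, 0.0, ratio)


def las_loss_tau_inf(logits, priors, y):
    """Margin-form loss using the clipped coefficients."""
    others = np.arange(len(logits)) != y
    coeff = clipped_coefficients(priors, y)[others]
    return float(np.log1p(np.sum(coeff * np.exp(logits[others] - logits[y]))))


# Hypothetical priors: class 3 is the major class, class 2 the rarest.
priors = np.array([0.25, 0.25, 0.0625, 0.4375])
logits = np.array([2.0, 0.5, -1.0, 0.3])

coeff_for_class0 = clipped_coefficients(priors, 0)
loss_major = las_loss_tau_inf(logits, priors, 3)
loss_minor = las_loss_tau_inf(logits, priors, 2)
```

Under this rule a sample of the class with the largest prior has every competing coefficient set to 0, so its loss vanishes, while a sample of the rarest class keeps all of its margin terms. This is one way to see why the variant behaves like a very large τ without producing unbounded coefficients.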

Random samples the prior of each seen class from a uniform distribution U[0, 1]. The sampled priors are then normalized to sum to 1.

Macro computes the joint label distribution by taking into account the occurrence frequencies of each class in the current data stream, as well as the label probabilities in the memory buffer, to serve as the current class-priors. It is worth noting that since the distribution of the data stream is unknown during training, Macro cannot be obtained directly and serves only as a reference for comparing and validating the necessity of batch-wise estimators. For instance, in C-CIFAR10 (5 tasks), when it comes to the 2nd task, the 2 classes in the data stream are of the same quantity, and the 2 classes in the memory buffer also contain a similar number of samples from the previous task. The incoming and buffer batch sizes are also the same. At this point, the 4 classes that may appear in the input batch are all equally likely, i.e., each has probability 1/4. When it comes to the 5th task, the data stream still consists of 2 classes with the same label probabilities, but the memory buffer now stores the 8 classes that have appeared before. Therefore, the class-priors of the 2 classes in the data stream are 1/4, while the class-priors of the 8 classes in the memory buffer are 1/16. Macro represents a statistical oracle, but experiments show that its performance is inferior to that of batch-wise estimators, indicating that in online CL, the model may pay more attention to the label distributions within each batch than to the label distributions across the sequential tasks.
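The two reference estimators can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and `macro_priors` assumes that classes are equally likely within the stream and within the memory buffer, as in the C-CIFAR10 example above.

```python
import random
from fractions import Fraction


def random_priors(seen_classes, rng):
    """Random estimator: draw U[0, 1] per seen class, then normalise to sum to 1."""
    raw = [rng.uniform(0.0, 1.0) for _ in seen_classes]
    total = sum(raw)
    return {c: r / total for c, r in zip(seen_classes, raw)}


def macro_priors(stream_classes, buffer_classes, incoming_bs, buffer_bs):
    """Macro estimator: label distribution of a concatenated input batch.

    Assumes uniform class frequencies inside the stream and inside the buffer.
    """
    total = incoming_bs + buffer_bs
    priors = {}
    for c in stream_classes:
        share = Fraction(incoming_bs, total) / len(stream_classes)
        priors[c] = priors.get(c, Fraction(0)) + share
    for c in buffer_classes:
        share = Fraction(buffer_bs, total) / len(buffer_classes)
        priors[c] = priors.get(c, Fraction(0)) + share
    return priors


# C-CIFAR10 (5 tasks), equal incoming and buffer batch sizes.
task2 = macro_priors([2, 3], [0, 1], incoming_bs=32, buffer_bs=32)
task5 = macro_priors([8, 9], list(range(8)), incoming_bs=32, buffer_bs=32)

rand = random_priors(list(range(10)), random.Random(0))
```

Using exact fractions makes it easy to confirm the values in the text: 1/4 for all four classes at the 2nd task, and 1/4 versus 1/16 for stream and buffer classes at the 5th task.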

F More Experimental Results

F.1 Results on C-MNIST

We provide the results for MNIST in the online class-IL setting in Table 9. ER-LAS still achieves competitive performance, highlighting the effectiveness of our method in addressing simple online CL problems. However, we observe that on small datasets like MNIST, the performance improvement achieved by our method is limited. This limitation arises because the tasks in MNIST are relatively easy, and the forgetting caused by inter-class imbalance is not prominent. Therefore, in our experiments, we have primarily focused on more challenging scenarios with many classes or large-scale datasets like ImageNet. Considering that online CL aims to handle potentially infinite data streams, we believe that scalability to large-scale datasets is crucial in validating the effectiveness of online CL algorithms.

Table 9: Final average accuracy $A_T$ (higher is better) on C-MNIST (5 tasks). M is the memory size.

| Method | M = 0.5k | M = 1k | M = 2k |
|---|---|---|---|
| ER | 87.0 ± 0.2 | 88.0 ± 0.2 | 90.7 ± 0.1 |
| DER++ | 92.3 ± 0.1 | 93.9 ± 0.0 | 94.2 ± 0.1 |
| MRO | 87.4 ± 0.2 | 88.9 ± 0.1 | 92.6 ± 0.1 |
| SS-IL | 88.7 ± 0.3 | 90.3 ± 0.2 | 91.7 ± 0.1 |
| CLIB | 88.4 ± 0.3 | 90.6 ± 0.0 | 91.9 ± 0.0 |
| ER-ACE | 90.4 ± 0.0 | 92.4 ± 0.1 | 93.8 ± 0.2 |
| ER-OBC | 90.0 ± 0.1 | 89.7 ± 0.1 | 91.6 ± 0.4 |
| ER-CBA | 90.1 ± 0.3 | 90.2 ± 0.1 | 91.3 ± 0.1 |
| ER-LAS | 91.7 ± 0.1 | 92.8 ± 0.1 | 94.0 ± 0.1 |

F.2 Results on MNIST without Rehearsal

Although current mainstream online continual learning methods incorporate memory replay of samples and achieve satisfactory performance, a range of studies (Zając et al., 2023; Li & Hoiem, 2016; Kirkpatrick et al., 2016; Zeno et al., 2018) have also explored effective continual learning approaches without rehearsal. Our theoretical foundation of the class-conditional function paves the way for rehearsal-free applications. The proposed LAS loss function does not itself require replayed samples, allowing for direct application in online continual learning scenarios without replay. We conducted experiments on the C-MNIST dataset with no memory. The results in Table 10 show that LAS outperforms LwF, EWC, and the labels trick. PEC achieves superior performance to LAS, benefiting from the expansion of a new model for each class. Although our theorems hold without the need for rehearsal, in practice the logit adjustment technique requires support from replay data. As indicated on the right-hand side of Equation 6, LAS adjusts inter-class classification margins by leveraging the imbalanced class-priors between major and minor classes, achieving balanced learning. When the number of minor classes decreases to zero, LAS fails to adjust the inter-class classification margins and degrades into learning based only on the currently encountered classes.

Table 10: Final average accuracy $A_T$ (higher is better) on C-MNIST (5 tasks) without rehearsal.

| Method | $A_T$ |
|---|---|
| LwF | 19.8 ± 0.0 |
| EWC | 19.8 ± 0.0 |
| Labels trick | 45.7 ± 3.5 |
| PEC | 92.3 ± 0.1 |
| LAS | 48.4 ± 1.2 |

F.3 Results when Batch Sizes are Varied

While samples typically arrive one by one in an online data stream, advanced algorithms (Caccia et al., 2022; Chrysakis & Moens, 2023) are commonly designed to update the model by accumulating a certain number of incoming samples as a batch. This is because per-batch updating is generally more advantageous for convergence and for forming well-defined classification boundaries than updating on each individual sample. However, in some situations with constrained computational resources, only very small batch sizes are available, or the batch sizes vary. Based on this concern, we consider two setups related to changing the batch size: one uses a different fixed batch size for the entire online training process, and the other varies the batch size during training. We conduct experiments on online C-CIFAR10. We begin with brief introductions to these two setups.

1. Evaluating the batch size change for the entire online training process examines the macro robustness of our method to the hyperparameter of batch size. In the manuscript, we set both incoming and buﬀer batch sizes to 32. We now experiment with corresponding batch sizes of 4, 16, 64. Smaller batch sizes bring more gradient updates for the model, but each contains less information for forming inter-class margins. Larger batch sizes may lead to overﬁtting on the memory buﬀer, thereby reducing performance.

2. Varying the batch size throughout the entire online training process examines the micro robustness of the batch size. This is a practical scenario where the frequency of incoming data may vary at diﬀerent stages, requiring time-varying batch sizes. In this experiment, we only vary incoming batch sizes while keeping buﬀer batch sizes at 32. We consider 3 cases of changing incoming batch sizes:

Increase: the incoming batch size increases during training, specifically 2, 4, 8, 16, 32 for C-CIFAR10. The inter-class imbalance issue intensifies.

Decrease: the incoming batch size decreases during training, specifically 32, 16, 8, 4, 2 for C-CIFAR10. The inter-class imbalance issue is alleviated.

Random: the incoming batch size is sampled from a uniform distribution U[2, 32] at each stage. This is a fusion of the previous two cases, where the impact of inter-class imbalance varies during training.
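The three incoming batch size schedules can be written down as a short helper. This is only a sketch: the function name is hypothetical, and two details are assumptions not stated in the text, namely that a "stage" corresponds to one of the five C-CIFAR10 tasks and that the Random case draws integer sizes.

```python
import random


def incoming_batch_sizes(mode, num_stages=5, rng=None):
    """Return one incoming batch size per stage for the given schedule."""
    sizes = [2 ** k for k in range(1, num_stages + 1)]  # 2, 4, 8, 16, 32
    if mode == "increase":
        return sizes
    if mode == "decrease":
        return sizes[::-1]
    if mode == "random":
        rng = rng or random.Random(0)
        # Assumption: integer batch sizes drawn uniformly from [2, 32].
        return [rng.randint(2, 32) for _ in range(num_stages)]
    raise ValueError(f"unknown schedule: {mode}")


increase = incoming_batch_sizes("increase")
decrease = incoming_batch_sizes("decrease")
randomized = incoming_batch_sizes("random", rng=random.Random(0))
```

In all three cases the buffer batch size stays at 32, so a smaller incoming batch means the current classes occupy a smaller share of each concatenated batch.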

The results in Table 11 and Table 12 show that ER-LAS consistently achieves the best accuracy across both fixed and time-varying batch sizes, highlighting the robustness of our method to batch size variations. In theory, changing the batch size, or varying it during training, poses no threat to our principle of mitigating inter-class imbalance through the elimination of class-priors. It only affects our estimation of time-varying class-priors. Moreover, the ablation study of estimators in Section 6.4 indicates that online CL models may pay more attention to the current input class distributions. Therefore, our batch-wise estimator can provide a timely and effective approximation at various batch sizes. Potential issues may arise when batch sizes become extremely small, such as 1. Below, we discuss this problem in detail and provide recommendations for improvement.


Table 11: Comparison of final average accuracy on online C-CIFAR10 with various batch sizes. In the manuscript, we set both incoming and buffer batch sizes to 32. We experiment with corresponding batch sizes of 4, 16, and 64. Experimental settings are the same as in Table 1. Memory sizes are M = 1k.

| Batch size | 4 | 16 | 32 | 64 |
|---|---|---|---|---|
| ER | 54.0 ± 1.8 | 52.4 ± 1.8 | 45.4 ± 1.8 | 45.4 ± 1.9 |
| ER-ACE | 55.1 ± 1.9 | 56.7 ± 2.1 | 48.1 ± 1.1 | 46.2 ± 1.9 |
| ER-OBC | 48.6 ± 1.8 | 54.7 ± 1.4 | 46.4 ± 0.6 | 39.0 ± 1.8 |
| ER-LAS | 59.2 ± 1.2 | 57.5 ± 1.3 | 55.3 ± 1.6 | 53.1 ± 1.2 |

Table 12: Comparison of final average accuracy on online C-CIFAR10 with varying batch sizes during training. We only vary incoming batch sizes while keeping buffer batch sizes at 32. Increase: incoming batch sizes increase during training, i.e., 2, 4, 8, 16, 32. Decrease: incoming batch sizes decrease during training, i.e., 32, 16, 8, 4, 2. Random: incoming batch sizes are sampled from a uniform distribution U[2, 32] at each stage. Experimental settings are the same as in Table 1. Memory sizes are M = 1k.

| Incoming batch size | Increase | Decrease | Random |
|---|---|---|---|
| ER | 52.7 ± 1.8 | 65.2 ± 1.9 | 55.1 ± 1.8 |
| ER-ACE | 50.8 ± 1.2 | 62.5 ± 1.1 | 55.1 ± 1.9 |
| ER-OBC | 53.1 ± 1.4 | 65.3 ± 1.1 | 51.4 ± 1.7 |
| ER-LAS | 59.8 ± 1.1 | 65.7 ± 0.7 | 59.3 ± 1.9 |

In fact, training on a single incoming sample goes against the theory of traditional stochastic gradient descent, which may harm model convergence and hinder the establishment of classiﬁcation boundaries. Therefore, we maintain the incoming batch size of 1 and consider concatenating various numbers of buﬀer batch sizes to ensure valid training and practical performance. The experiments are conducted on online C-CIFAR10 in order to explore the impact of changed buﬀer batch sizes on a single incoming batch size.

Table 13: Comparison of final average accuracy on online C-CIFAR10 with a fixed incoming batch size of 1 and various buffer batch sizes. Experimental settings are the same as in Table 1. Memory sizes are M = 1k.

| Buffer batch size | 1 | 4 | 16 | 64 |
|---|---|---|---|---|
| ER | 39.3 ± 2.0 | 62.4 ± 2.0 | 63.7 ± 1.8 | 60.6 ± 1.9 |
| ER-ACE | 27.9 ± 2.1 | 57.7 ± 1.7 | 58.1 ± 1.8 | 54.5 ± 1.7 |
| ER-OBC | 33.2 ± 1.9 | 64.3 ± 1.8 | 65.3 ± 1.8 | 60.9 ± 1.7 |
| ER-LAS | 36.9 ± 1.5 | 66.4 ± 1.5 | 67.2 ± 1.5 | 62.2 ± 1.3 |

The results in Table 13 show that when both incoming and buffer batch sizes are 1, ER-LAS performs slightly worse than the ER baseline. Nevertheless, simply increasing the buffer batch size enables ER-LAS to achieve the highest accuracy. This is because an extremely small batch size of 1 degrades our estimation of time-varying class-priors and hinders the construction of classification margins, and slightly increasing the buffer batch size is an effective way to restore our method. Note that an excessive buffer batch size can lead to overfitting on the memory buffer and harm performance, as shown in the rightmost column of Table 13.

F.4 Comparison to Inter-class Imbalance Mitigation Methods

We discussed the differences between LAS and other class imbalance mitigation methods from an analytical perspective in Section 5. Since these methods have not been deliberately designed for or applied to online CL in previous works, we were concerned that our direct application of them might lack credibility, and we therefore do not compare with them in the main experiments. However, we did conduct experiments with them in the preliminary exploration phase of our method. Here, we briefly describe our applications and provide experimental comparisons and analysis. We compare four class imbalance mitigation methods originally designed for stationary distributions, covering three families. Loss weighting: following Cui et al. (2019), we apply the Class-Balanced loss, which re-weights the loss terms of each class based on the input class distribution, as ER-CBL. Weight normalization: following Kang et al. (2020), we normalize the classifier weights by $\|w_y\|_2$, as ER-WN. Resampling (Kubát & Matwin, 1997): we upsample the buffer samples (ER-Up) and downsample by randomly ignoring some incoming samples (ER-Down) to maintain consistent input class distributions.
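For readers unfamiliar with the loss weighting and weight normalization baselines, the sketch below shows the core computation of each. It is not the authors' implementation: the function names are hypothetical, the effective-number formula is the one from Cui et al. (2019), the value `beta = 0.999` is an assumed default, and how the class counts are gathered in the online setting is left open.

```python
import numpy as np


def class_balanced_weights(counts, beta=0.999):
    """Effective-number weights (1 - beta) / (1 - beta^n_y) per class,
    rescaled so that the weights sum to the number of classes."""
    counts = np.asarray(counts, dtype=float)
    weights = (1.0 - beta) / (1.0 - np.power(beta, counts))
    return weights * len(counts) / weights.sum()


def normalize_classifier_weights(W):
    """Divide each class weight vector w_y (a row of W) by its L2 norm."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.clip(norms, 1e-12, None)


# Hypothetical batch: 30 samples of a current class, 2 of a replayed class.
w = class_balanced_weights([30, 2])

# Hypothetical 2-class linear classifier with very different weight norms.
W = np.array([[3.0, 4.0], [0.0, 2.0]])
W_normed = normalize_classifier_weights(W)
```

The first helper gives the rarer class a larger loss weight, and the second removes differences in weight norm between classes. Neither changes the logits by a prior-dependent offset, which is the distinguishing feature of LAS discussed in the analysis of Table 14.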

Table 14: Comparison of final average accuracy by ER, ER-LAS, and imbalance mitigation methods. Experimental settings are the same as in Table 1. Memory sizes are M = 1k.

| Dataset | C-CIFAR10 | C-CIFAR100 | C-Tiny ImageNet |
|---|---|---|---|
| ER | 45.4 ± 1.8 | 16.5 ± 0.4 | 11.0 ± 0.2 |
| ER-CBL | 48.1 ± 1.6 | 18.8 ± 0.2 | 11.1 ± 0.2 |
| ER-WN | 46.1 ± 1.5 | 16.6 ± 0.8 | 11.0 ± 0.1 |
| ER-Up | 53.6 ± 1.6 | 23.4 ± 0.5 | 15.0 ± 0.1 |
| ER-Down | 48.2 ± 1.6 | 18.9 ± 0.3 | 13.4 ± 0.2 |
| ER-LAS | 55.3 ± 1.6 | 25.7 ± 0.3 | 15.5 ± 0.2 |

The results in Table 14 show that our ER-LAS outperforms all other compared methods for mitigating class imbalance. We now analyze the shortcomings of these methods. ER-CBL re-weights the loss after computing the logits and ground truth, which helps in learning better features of minority classes but fails to eliminate the influence of class-priors and therefore does not achieve balanced posterior outputs. ER-WN ensures that the model output is not affected by class weight bias. However, we find that CL models are still affected by feature drift (Caccia et al., 2022), leading to recency bias. Therefore, these two methods cannot truly solve the inter-class imbalance problem in online CL. ER-Up is the closest method to our ER-LAS, but as the inter-class imbalance problem intensifies, it incurs a significant computational burden, whereas our method costs almost no additional computational resources. ER-Down, although maintaining inter-class balance during training, discards a majority of valuable incoming training samples. Furthermore, unlike these four methods, our proposed LAS is supported by a statistical grounding in the underlying class-conditionals.

F.5 Results on Oﬄine task-IL Scenarios

In offline task-IL settings, learners have access to the whole dataset for each task and can train for multiple epochs. Prior works that are highly related to ours have proposed Logit Rectify methods to alleviate inter-class imbalance in offline CL and improve learning performance. BiC (Wu et al., 2019) adds a bias correction layer to the model and stores a portion of the input data as a held-out validation set to calibrate this layer and lessen the model's task-recency bias. E2E (Castro et al., 2018) fine-tunes the model with a balanced dataset after each task. IL2M (Belouadah & Popescu, 2019) rescales the model output with historical statistics. LUCIR (Hou et al., 2019) combines cosine normalization, a less-forget constraint, and inter-class separation to mitigate the adverse effects of class imbalance. We also compare with successful offline CL methods (ER (Chaudhry et al., 2019), DER++ (Buzzega et al., 2020), and ER-ACE (Caccia et al., 2022)). Our experimental settings follow Buzzega et al. (2020). LUCIR fails to work on C-Tiny ImageNet due to too low memory size.

The results in Table 15 demonstrate that ER-LAS still achieves competitive performance in offline CL. When combined with the knowledge distillation method LwF (Li & Hoiem, 2016), LwF-LAS outperforms the other compared methods. These findings indicate that our approach can also effectively mitigate inter-class imbalance and improve performance over previously proposed Logit Rectify methods in offline task-IL setups. Considering the severe impact of recency bias in online CL, our main focus is how to eliminate the adverse effects caused by inter-class imbalance in online settings, and we design a widely applicable LAS algorithm.


Table 15: Final average accuracy $A_T$ (higher is better) on C-CIFAR10 (5 tasks), C-CIFAR100 (10 tasks), and C-Tiny ImageNet (10 tasks) in the offline condition. M = 100. The number of epochs is set to 50.

| Dataset | C-CIFAR10 | C-CIFAR100 | C-Tiny ImageNet |
|---|---|---|---|
| BiC | 23.4 ± 0.8 | 15.3 ± 0.1 | 10.1 ± 0.1 |
| E2E | 51.6 ± 0.3 | 16.7 ± 0.1 | 9.0 ± 0.0 |
| IL2M | 42.1 ± 0.6 | 11.0 ± 0.2 | 8.4 ± 0.1 |
| LUCIR | 28.9 ± 1.0 | 15.7 ± 0.7 | 10.2 ± 0.1 |
| ER | 39.4 ± 0.3 | 11.5 ± 0.1 | 8.1 ± 0.0 |
| DER++ | 55.3 ± 1.2 | 14.8 ± 1.8 | 9.4 ± 0.3 |
| ER-ACE | 55.9 ± 1.0 | 17.7 ± 0.7 | 8.7 ± 0.2 |
| ER-LAS | 53.9 ± 1.0 | 16.4 ± 0.2 | 10.3 ± 0.1 |
| LwF-LAS | 57.5 ± 0.2 | 22.6 ± 0.1 | 12.4 ± 0.1 |

Figure 5: Comparison with online CL methods based on contrastive learning on C-CIFAR10 (5 tasks). Memory size M = 1k. The x-axis represents training time, and the y-axis represents the final average accuracy $A_T$ (higher is better). We evaluate the accuracy and the time efficiency of SCR, OCM, and our ER-LAS at batch sizes of 8, 16, 32, and 64. Note that the time consumption increases as the batch size decreases.

F.6 Comparison to CL Methods Based on Contrastive Learning

We compare our method with the online CL methods (SCR Mai et al. (2021) and OCM Guo et al. (2022)) based on contrastive learning. SCR utilizes the NCM classiﬁer and is trained via supervised contrastive learning. OCM employs contrastive learning to maximize mutual information. These methods based on contrastive learning typically require more computational resources, and their performance is inﬂuenced by the number of negative samples, but they often achieve better performance. As shown in Figure 5, we evaluate the training time and the ﬁnal average accuracy of our ER-LAS and the contrastive learning-based online CL methods under diﬀerent batch sizes. SCR and OCM require 2x and 30x more computational time than our method, respectively. Although they achieve higher accuracy than our method, LAS exhibits superior overall computational eﬃciency.


Table 16: Final average forgetting $F_T$ (lower is better) on C-CIFAR10 (5 tasks), C-CIFAR100 (10 tasks), and C-Tiny ImageNet (10 tasks). M is the memory buffer size.

| Method | C-CIFAR10, M = 0.5k | C-CIFAR10, M = 1k | C-CIFAR10, M = 2k | C-CIFAR100, M = 0.5k | C-CIFAR100, M = 1k | C-CIFAR100, M = 2k | C-Tiny ImageNet, M = 0.5k | C-Tiny ImageNet, M = 1k | C-Tiny ImageNet, M = 2k |
|---|---|---|---|---|---|---|---|---|---|
| ER | 43.0 ± 1.5 | 36.2 ± 1.6 | 24.5 ± 1.3 | 31.1 ± 0.6 | 23.2 ± 1.0 | 23.2 ± 0.6 | 38.5 ± 0.5 | 33.4 ± 0.3 | 27.8 ± 0.3 |
| DER++ | 29.3 ± 1.2 | 31.6 ± 2.9 | 32.4 ± 2.3 | 37.6 ± 0.5 | 34.5 ± 0.5 | 36.4 ± 0.6 | 38.6 ± 0.2 | 37.2 ± 0.3 | 37.2 ± 0.2 |
| MRO | 26.1 ± 1.5 | 21.0 ± 1.2 | 8.9 ± 0.8 | 13.5 ± 0.3 | 9.3 ± 0.3 | 6.3 ± 0.2 | 11.1 ± 0.1 | 10.9 ± 0.2 | 8.4 ± 0.2 |
| SS-IL | 22.0 ± 0.8 | 20.0 ± 0.9 | 16.5 ± 0.5 | 11.8 ± 0.3 | 10.0 ± 0.2 | 8.1 ± 0.3 | 14.5 ± 0.2 | 12.2 ± 0.2 | 10.0 ± 0.8 |
| CLIB | 30.4 ± 1.6 | 17.7 ± 1.5 | 16.1 ± 1.3 | 25.9 ± 0.3 | 14.9 ± 0.3 | 7.6 ± 0.3 | 30.1 ± 0.3 | 20.4 ± 0.3 | 10.8 ± 0.3 |
| ER-ACE | 11.0 ± 1.6 | 16.1 ± 1.3 | 10.1 ± 1.3 | 9.3 ± 0.7 | 7.9 ± 0.5 | 5.6 ± 0.7 | 13.8 ± 0.3 | 9.9 ± 0.3 | 7.5 ± 0.4 |
| ER-OBC | 37.3 ± 0.8 | 19.5 ± 0.6 | 30.5 ± 0.8 | 24.0 ± 0.3 | 22.4 ± 0.4 | 18.7 ± 0.4 | 36.2 ± 0.2 | 29.4 ± 0.2 | 21.9 ± 0.2 |
| ER-LAS | 28.5 ± 1.3 | 18.5 ± 1.4 | 7.1 ± 1.1 | 22.1 ± 0.4 | 11.5 ± 0.6 | 9.3 ± 0.7 | 26.9 ± 0.2 | 19.9 ± 0.4 | 10.3 ± 0.2 |

F.7 Forgetting Rate

We compare the forgetting rate of each method on C-CIFAR10, C-CIFAR100, and C-Tiny ImageNet in Table 16. In most settings, ER-ACE achieves the lowest forgetting rate, the exceptions being our proposed ER-LAS on C-CIFAR10 with M = 2k and MRO on C-Tiny ImageNet with M = 0.5k. Note that the lowest forgetting rate does not necessarily correspond to the highest accuracy. Moreover, the forgetting rate of our method can be reduced markedly by deliberately adjusting its hyperparameters, but at the cost of accuracy. With the current settings, our method achieves the best stability-plasticity trade-off.

F.8 Prediction Results on C-Image Net

Figure 6: Prediction results by ER and ER-LAS on C-ImageNet (90 tasks). We calculate the average accuracy of classes within each task to demonstrate the recency bias.

We present the prediction results of ER and our proposed ER-LAS on C-ImageNet after training in Figure 6. Recall that ER assigns 38% of the test samples to the most recently learned classes on C-CIFAR100 (Figure 1 in Section 4). On C-ImageNet, ER outperforms ER-LAS on the last task but is inferior to ER-LAS on all other tasks. This is due to the longer task sequence and the larger total number of classes in C-ImageNet than in C-CIFAR100, which result in a much more severe recency bias for the ER method. In contrast, our ER-LAS successfully eliminates the recency bias as expected and, as a result, achieves a remarkably lower forgetting rate and the highest accuracy in the experiments of Section 6.1. These results validate that inter-class imbalance is more severe in long sequential tasks and demonstrate that our method can adapt to learning from such highly imbalanced data streams by pursuing the class-conditional function.

F.9 Class-balanced Accuracy on C-iNaturalist

Table 17: Final average accuracy $A_T$ and final average class-balanced accuracy $A^{cbl}_T$ (higher is better for both) on C-iNaturalist (26 tasks). We show the results of the top-3 methods. Memory sizes are M = 20k.

| Method | $A_T$ | $A^{cbl}_T$ |
|---|---|---|
| ER | 4.66 ± 0.01 | 6.25 ± 0.01 |
| ER-ACE | 5.68 ± 0.01 | 6.32 ± 0.01 |
| MRO | 4.96 ± 0.0 | 4.47 ± 0.01 |
| ER-LAS | 8.11 ± 0.01 | 8.62 ± 0.01 |

As we aim to pursue the optimal classifier that minimizes the class-balanced error on imbalanced data streams, we also evaluate the class-balanced accuracy of our method and baselines on C-iNaturalist. As shown in Table 17, ER-LAS achieves the best performance in both accuracy and class-balanced accuracy, validating the effectiveness of our optimization towards this optimal estimator.