# Online Continual Learning from Imbalanced Data

Aristotelis Chrysakis¹  Marie-Francine Moens¹

¹Department of Computer Science, KU Leuven, Leuven, Belgium. Correspondence to: Aristotelis Chrysakis.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

A well-documented weakness of neural networks is the fact that they suffer from catastrophic forgetting when trained on data provided by a non-stationary distribution. Recent work in the field of continual learning attempts to understand and overcome this issue. Unfortunately, the majority of relevant work embraces the implicit assumption that the distribution of observed data is perfectly balanced, despite the fact that, in the real world, humans and animals learn from observations that are temporally correlated and severely imbalanced. Motivated by this remark, we aim to evaluate memory population methods that are used in online continual learning, when dealing with highly imbalanced and temporally correlated streams of data. More importantly, we introduce a new memory population approach, which we call class-balancing reservoir sampling (CBRS). We demonstrate that CBRS outperforms the state-of-the-art memory population algorithms in a considerably challenging learning setting, over a range of different datasets, and for multiple architectures.

1. Introduction

Over the past decade, deep neural networks have been used to tackle an impressive range of problems. Such problems vary from classifying (Krizhevsky et al., 2012) or generating images (Radford et al., 2015) to translating natural language (Bahdanau et al., 2015) and outperforming humans at playing several board games (Silver et al., 2018). The models utilized to solve the aforementioned tasks are typically trained offline by performing multiple passes over large amounts of previously collected (mostly labeled) data.

The accumulation of knowledge and experience by humans is a vastly different story. Over our lives we perceive a stream of temporally correlated, unlabeled observations, and rarely revisit the same observation multiple times (Parisi et al., 2019). Moreover, the learning and application of new knowledge take place concurrently, rather than in two distinct stages, by interacting with our surrounding environment (Tani, 2016). Finally, humans learn to address a large number of problems (i.e., visual understanding of our surroundings, communication, use of our limbs, etc.), instead of only just one (Cangelosi & Schlesinger, 2015).

In general, neural networks perform poorly when trained on a non-stationary distribution of data. Assuming a network is being trained on a sequence of distinct tasks, sometimes called a continuum, it learns the currently presented task but fails to remember previously learned ones (Goodfellow et al., 2014). This phenomenon is defined as catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999).

The field researching how to overcome catastrophic forgetting is called continual learning (CL). Two distinct CL settings have primarily been investigated (De Lange et al., 2019). The first one, termed task-incremental CL, assumes that each task of the continuum can be learned offline. Specifically, the learner is given access to all the data from the current task and can perform numerous passes over it.
Under the second CL setting, referred to as online CL, the learner is presented with a stream of tiny batches of observations and cannot revisit previously seen batches from the current or the previous tasks.

A majority of recent work has centered on the task-incremental setting. In this work, however, we focus on online CL, as it more closely resembles the way humans and animals learn (Cangelosi & Schlesinger, 2015). In particular, we focus on replay-based online CL, under which a small part of the incoming stream of observations is memorized, so that it can be used to retrain the learner in the future. Research in the field of neuroscience has shown that replay is also present in living beings and is connected to the process of memory consolidation (Girardeau et al., 2009; Ego-Stengel & Wilson, 2010).

Our main motivation has been to emphasize a gap that exists in CL research. Specifically, most of the benchmarks used to evaluate CL approaches are perfectly balanced, both in terms of the sizes of the different tasks, and in terms of the classes that comprise each task. Consequently, the majority of the proposed approaches assume either explicitly or implicitly that the data used to train the model is perfectly balanced. In contrast, a living being will experience major imbalances in the experiences it acquires over its lifetime.

Here, we consider an online CL setting that contains significant data imbalances and does not provide task identifiers or boundaries. To tackle such settings, we propose class-balancing reservoir sampling (CBRS), a novel memory population approach, which does not require any prior knowledge about the incoming stream and does not make any assumptions about its distribution. CBRS is designed to balance the stored data instances with respect to their class labels, having made only one pass over the stream. We compare CBRS to state-of-the-art memory population methods, over four different datasets, and two different neural network architectures. We demonstrate that it consistently outperforms the state of the art (with relative differences of up to more than 40%), while being equally or more efficient computationally. We present a sizeable amount of experimental results that support our claims.

1.1. Document Structure

In Section 2 we present the main techniques proposed in this work, namely the CBRS algorithm and the use of weighted replay. In Section 3 we describe our experimental work and discuss the corresponding results. In Section 4 we review other work relevant to ours, and finally, in Section 5, we offer our concluding remarks.

1.2. Notation

We use lower-case boldface characters to denote a vector (e.g., v) and lower-case italic characters for scalars (e.g., s), with the only exception being loss values, where the special symbol L is used. Upper-case boldface characters signify a matrix (e.g., M).

2. Methodology

2.1. Online Continual Learning

Online continual learning can be formally defined as learning from a lengthy stream of data that is produced by a non-stationary distribution (Aljundi et al., 2019a). The stream is generally modeled as a sequence of distinct tasks, each of them containing instances from a specific set of classes (Chaudhry et al., 2019a). The data distribution is typically assumed to be stationary throughout each task (Aljundi et al., 2019a).
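As a concrete illustration of this setting (our own sketch, not code from the paper), a split benchmark with one class per task (the configuration used later in Subsection 3.3) can be turned into such a stream as follows; the function name and the use of NumPy are assumptions.

```python
import numpy as np

def split_stream(X, y, batch_size=10, seed=0):
    """Model a continuum as a sequence of tasks, one class per task.

    Yields small batches (X_t, y_t) exactly once each, so the learner
    cannot revisit earlier batches, as required by the online CL protocol.
    """
    rng = np.random.default_rng(seed)
    for c in np.unique(y):                      # each class forms one task
        idx = np.flatnonzero(y == c)
        rng.shuffle(idx)                        # iid order *within* the task
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            yield X[batch], y[batch]            # a single, non-revisitable batch

# Usage: for X_t, y_t in split_stream(X_train, y_train): train_step(X_t, y_t)
```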
The stream provides the learner with data instances (x_t, y_t) or small batches of them (X_t, y_t), where x_t and y_t are the input and label of an instance respectively, and t refers to the current time-step. An incoming instance or batch cannot be revisited once the next instance or batch is received (Chaudhry et al., 2019a). A more informative stream might also provide the learner with so-called task boundaries, hinting that the current task has been completed and the next one is about to start.

The learner is typically evaluated after the stream has been received in its entirety. During the evaluation process, some learners require task identifiers, that is, information that communicates to them the task to which the instances to be predicted belong, before they make their prediction (Lopez-Paz & Ranzato, 2017). Having access to task identifiers simplifies CL significantly (van de Ven & Tolias, 2019), but is inconsistent with human learning.

In this work, we consider a minimal-prerequisite scenario, where task identifiers and boundaries are not provided during training and evaluation. Additionally, we assume the absence of any prior knowledge regarding the incoming stream (i.e., length, task composition, etc.). Our motivation for making these choices is to constrain the learning process so that it approximates the human aggregation of experiences and knowledge.

2.2. Class-Balancing Reservoir Sampling

In this section we describe in detail the main contributions of this work. We propose an algorithm called class-balancing reservoir sampling (CBRS) that decides which data instances of the input stream to store for future replay. In essence, having read the stream in its entirety, the aim of CBRS is to store an independent and identically distributed (iid) sample from each class, and keep the classes as balanced as possible. In other words, the algorithm preserves the distribution of each class, while at the same time altering the distribution of the stream so that class imbalances are mitigated.

Let us first describe our notation. Given a memory of size m, we will say that the memory is filled when all its m storage units are occupied. When a certain class contains the most instances among all the different classes present in the memory, we will call it the largest. Two or more classes can be the largest, if they are equal in size and also the most numerous. Finally, we will call a class full if it currently is, or has been in one of the previous time steps, the largest class. Once a class becomes full, it remains so in the future. Such classes are called full, because CBRS does not allow them to grow in size.

As in the case of reservoir sampling (Vitter, 1985), the CBRS sampling scheme can be split into two stages. During the first stage, and as long as the memory is not filled, all the incoming stream instances are stored in memory. After the memory is filled, we move on to the second stage.
During this stage, when a stream instance (x_i, y_i) is received, the algorithm checks at first whether y_i belongs to a full class. If it does not, the received instance is stored in the place of another instance that belongs to the largest class. This is the component of CBRS that mitigates class imbalances in memory. In the opposite case, the received instance replaces a randomly selected stored instance of the same class c with probability m_c/n_c, where m_c is the number of instances of the class c currently stored in memory, and n_c the number of stream instances of class c that we have encountered so far. We sketch the pseudocode for the CBRS algorithm in Algorithm 1.

Algorithm 1: Class-Balancing Reservoir Sampling

    input: stream {(x_i, y_i)}, i = 1 ... n
    for i = 1 to n do
        if memory is not filled then
            store (x_i, y_i)
        else if class c = y_i is not a full class then
            find all instances of the largest class
            select from them an instance at random
            overwrite the selected instance with (x_i, y_i)
        else
            m_c <- number of currently stored instances of class c = y_i
            n_c <- number of stream instances of class c = y_i encountered thus far
            sample u ~ Uniform(0, 1)
            if u <= m_c / n_c then
                pick a stored instance of class c = y_i at random
                replace it with (x_i, y_i)
            else
                ignore the instance (x_i, y_i)
            end if
        end if
    end for

In Figure 1 we compare the memory distributions of CBRS and two state-of-the-art memory population algorithms (Vitter, 1985; Aljundi et al., 2019b).¹ The input sequence is an imbalanced stream of the first five (for illustrative purposes) classes of the MNIST dataset (LeCun et al., 2010). Evidently, the resulting distributions of stored instances under reservoir sampling and GSS are significantly influenced by the stream distribution. In contrast, CBRS stores all instances of the two smallest classes (classes 0 and 2) and balances the remaining three.

Figure 1. A simple illustration of three memory population methods when learning from an imbalanced stream. All methods employ a memory of size m = 1000. We describe the panels from left to right. (i) An imbalanced stream containing instances from the first five classes of MNIST. (ii), (iii) The resulting memory composition when using reservoir sampling and GSS-Greedy respectively; both methods are considerably affected by the distribution of the incoming stream. (iv) CBRS, on the other hand, does not discard any instances from the significantly underrepresented classes 0 and 2 and balances the remaining three.

Storing a balanced subset of the input stream for future replay represents a prior belief that all the observed classes are equally difficult and equally important to learn. This principle is applied because of the absence of any more specific knowledge about the contents of the stream. Nonetheless, if one possesses such knowledge, it is trivial to extend CBRS to take it into account. We describe how in the supplementary material.

¹ For more details on these two algorithms, see Subsection 3.1.

As we show in Section 3, CBRS exhibits superior performance compared to the state of the art. At this point, however, we would like to highlight two important properties that CBRS possesses. Assuming a stream that contains instances of n_c distinct classes, and a memory of size m, the following two statements stand.²

First, if the stream contains a class with fewer than m/n_c instances, then all of these instances will be stored in memory after the stream has been read in whole. This is a remarkable attribute of CBRS, which guarantees that not even a single instance of severely underrepresented classes will be discarded. We can see this property illustrated in Figure 1. The stream contains instances from n_c = 5 classes, and the memory size is m = 1000. Thus, the two classes of the stream that are composed of fewer than 200 instances (i.e., classes 0 and 2) are stored in their entirety by CBRS.

Second, the subset of the instances from each class that are stored in memory is iid with respect to the instances of the same class contained in the stream. This is another important characteristic of CBRS, which is also present when performing reservoir sampling. Since we cannot assume that the instances of each class are presented in an iid manner over time, it is desirable that we capture a representative sample from each class in memory.

² We prove both statements in the supplementary material.
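To make the two-stage procedure concrete, the sketch below is our own illustrative Python implementation of Algorithm 1, not the authors' released code. The class and method names are assumptions, the linear scans stand in for the per-class index structure that keeps the paper's implementation time-efficient (see Subsection 3.7), and we interpret "largest/full" as being tracked from the moment the memory has been filled.

```python
import random
from collections import defaultdict

class CBRSMemory:
    """A minimal sketch of class-balancing reservoir sampling (Algorithm 1)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []                            # stored (x, y) pairs
        self.full_classes = set()                 # classes that are, or have been, the largest
        self.stream_counts = defaultdict(int)     # n_c: stream instances of class c seen so far

    def _indices_of(self, cls):
        return [i for i, (_, y) in enumerate(self.data) if y == cls]

    def _largest_classes(self):
        counts = defaultdict(int)
        for _, y in self.data:
            counts[y] += 1
        top = max(counts.values())
        return [c for c, n in counts.items() if n == top]

    def observe(self, x, y):
        self.stream_counts[y] += 1
        if len(self.data) < self.capacity:        # stage 1: store everything
            self.data.append((x, y))
            return
        # stage 2: the memory is filled
        if y in self._largest_classes():          # once largest, a class stays full forever
            self.full_classes.add(y)
        if y not in self.full_classes:
            # mitigate imbalance: overwrite a random instance of a largest class
            victim_cls = random.choice(self._largest_classes())
            victim = random.choice(self._indices_of(victim_cls))
            self.data[victim] = (x, y)
        else:
            # per-class reservoir step: keep an iid sample of class y
            m_c = len(self._indices_of(y))
            n_c = self.stream_counts[y]
            if random.random() <= m_c / n_c:
                victim = random.choice(self._indices_of(y))
                self.data[victim] = (x, y)
```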
2.3. Weighted Random Replay

In some cases it is impossible to fully utilize the whole capacity of the memory and keep it balanced at the same time. This is well exemplified by Figure 1. Assuming a memory of size m = 1000, and keeping in mind that the smallest class of the stream contains 51 instances, one way to keep the memory balanced would be to store 51 instances from each class. In this case, however, only a quarter of the memory would be utilized, which is very wasteful. Thus, we could fill the entire memory using CBRS (as in Figure 1), and try to mitigate the influence of the resulting imbalance later.

A well-known and effective way to deal with such imbalances is oversampling the minority classes (Branco et al., 2016). In our case, we propose the use of a custom replay sampling scheme, where the probability of replaying a certain stored instance is inversely proportional to the number of stored instances of the same class. In other words, instances of a smaller class will have a higher probability of getting replayed, in comparison to instances of a larger one. We will call this scheme weighted replay, as opposed to uniform replay, under which all stored instances are equally likely to be replayed.
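The following sketch shows one way weighted replay could be implemented on top of the stored (x, y) pairs; the function name and the decision to sample with replacement are our own assumptions, since the paper does not fix those details. Note that with weights of 1 over the class size, every class receives the same total sampling probability regardless of how many of its instances are stored.

```python
import random
from collections import Counter

def sample_replay_batch(memory, batch_size, weighted=True):
    """Sample a replay batch from the stored (x, y) pairs.

    Under weighted replay, an instance's probability is inversely
    proportional to the number of stored instances of its class,
    which oversamples the smaller classes.
    """
    if not weighted:
        return random.choices(memory, k=batch_size)          # uniform replay
    counts = Counter(y for _, y in memory)                    # class sizes in memory
    weights = [1.0 / counts[y] for _, y in memory]            # 1 / (class size)
    return random.choices(memory, weights=weights, k=batch_size)
```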
2.4. Putting it All Together

At this point, we describe a general replay-based online CL training process (see Algorithm 2) that is a generalization of previously proposed ones (Chaudhry et al., 2019b; Aljundi et al., 2019a). During the learning process, the model is trained both on the stream, and via replay of the stored instances. The process we describe can be used with any stream, classifier, loss function, memory population algorithm, and replay sampling scheme.

In order to be able to adapt to changes in the stream and at the same time prevent catastrophic forgetting, the model updates are guided by a two-component loss. The first component L_s is computed with respect to the currently observed stream batch (X_t, y_t) of size b, and the second component L_r with respect to a batch (X_r, y_r), also of size b, that was sampled from memory to be replayed. In our implementation, we use as a loss function the cross-entropy between the model predictions and the true labels, but any other differentiable loss could be used. The joint loss is computed as the convex combination of the two loss components

    L = a L_s + (1 − a) L_r,    (1)

where a ∈ [0, 1] controls the relative importance between them. In practice, a represents a trade-off between quickly adapting to temporal changes in the stream and safeguarding currently possessed knowledge.

Algorithm 2: Replay-Based Online Continual Learning

    input: stream {(x_i, y_i)}, i = 1 ... n; model f(·); loss function ℓ(·, ·);
           batch size b; steps per batch n_b
    repeat
        receive batch (X_t, y_t) of size b from the stream
        n_c <- number of classes encountered so far
        a <- 1 / n_c
        for n_b steps do
            predict outputs: ŷ_t = f(X_t)
            stream loss: L_s = ℓ(ŷ_t, y_t)
            sample batch (X_r, y_r) of size b from memory
            predict outputs: ŷ_r = f(X_r)
            replay loss: L_r = ℓ(ŷ_r, y_r)
            joint loss: L = a L_s + (1 − a) L_r
            update model f according to L
        end for
        for j = 1 to b do
            (x_{t,j}, y_{t,j}) <- j-th instance of batch (X_t, y_t)
            decide whether to store (x_{t,j}, y_{t,j}) according to the
            selected memory population algorithm
        end for
    until the stream has been read in full

Previous work either treats the loss components as equally important (Chaudhry et al., 2019b; Aljundi et al., 2019b), or decreases a with the number of completed tasks (Shin et al., 2017; Li & Hoiem, 2017). In our setting, however, we are not provided with task boundaries. We set a = 1/n_c, where n_c is the number of distinct classes encountered so far, thus sidestepping the issue of not having access to task information. As we show in the supplementary material, considering the loss components as equally important results in consistently inferior performance.

Note that, although we cannot revisit previously seen batches, we can perform multiple parameter updates on the currently observed batch, as is common practice in online CL (Chaudhry et al., 2019b). In practice, we find that performing just one update per time-step might result in underfitting the data. On the other hand, performing multiple (n_b) updates has an additional computational cost, but is nevertheless beneficial in the sense that it allows for faster adaptation to the non-stationary nature of the stream. Additionally, it permits us to perform multiple replay steps (with a different batch of stored instances each time) for each incoming stream batch, which in turn mitigates forgetting more effectively.

Finally, the learner goes over the data instances contained in the currently observed batch, one at a time, and decides which ones to store, according to the selected memory population algorithm.
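Putting the pieces together, a minimal replay-based training loop in the spirit of Algorithm 2 might look as follows. Cross-entropy is used as in the paper's implementation, but the choice of PyTorch and the function itself are our own sketch, and it assumes the `CBRSMemory` and `sample_replay_batch` helpers from the earlier sketches.

```python
import torch
import torch.nn.functional as F

def train_online(stream, model, optimizer, memory, n_b=5, device="cpu"):
    """A minimal sketch of the replay-based training loop of Algorithm 2."""
    seen_classes = set()
    for X_t, y_t in stream:                               # one small batch at a time
        X_t = torch.as_tensor(X_t, dtype=torch.float32, device=device)
        y_t = torch.as_tensor(y_t, dtype=torch.long, device=device)
        seen_classes.update(y_t.tolist())
        a = 1.0 / len(seen_classes)                       # a = 1 / n_c

        for _ in range(n_b):                              # n_b updates per incoming batch
            loss_s = F.cross_entropy(model(X_t), y_t)     # stream loss L_s
            if len(memory.data) > 0:
                batch = sample_replay_batch(memory.data, batch_size=len(y_t))
                X_r = torch.stack([torch.as_tensor(x, dtype=torch.float32)
                                   for x, _ in batch]).to(device)
                y_r = torch.as_tensor([y for _, y in batch],
                                      dtype=torch.long, device=device)
                loss_r = F.cross_entropy(model(X_r), y_r)  # replay loss L_r
                loss = a * loss_s + (1.0 - a) * loss_r     # joint loss (Eq. 1)
            else:
                loss = loss_s                              # nothing stored yet
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        for x_j, y_j in zip(X_t.cpu().numpy(), y_t.cpu().numpy()):
            memory.observe(x_j, int(y_j))                  # memory population (e.g., CBRS)
```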
3. Experimental Work

In this section, we aim to compare CBRS to the state of the art when it comes to memory population strategies under online CL scenarios. We exclusively consider replay-based training, as it is the only one that is suitable to our setting.

3.1. Memory Population Approaches

At the time of writing, reservoir sampling (Vitter, 1985) and gradient-space sampling (Aljundi et al., 2019b) are considered the state-of-the-art approaches at populating the memory during online CL. We refer to these methods as RESERVOIR and GSS respectively.³

The memory population strategy that RESERVOIR follows is split into two phases. During the first phase, which lasts until the memory gets filled, all encountered data instances are stored in empty memory spots. In the second phase, which starts once the memory gets filled and continues from then on, the currently observed data instance is stored with probability m/n, where m is the size of the memory and n is the number of data instances encountered so far. The data instance is stored in a memory spot that is uniformly selected, so all currently stored instances are equally likely to be overwritten. It can be proven that RESERVOIR is equivalent to extracting an iid subset of size m from the stream.

GSS attempts to greedily maximize the variance of the gradient directions of the samples contained in memory (Aljundi et al., 2019b). To that end, it computes a score of similarity between the incoming instance and some randomly sampled stored instances. If the similarity score is small, the instance is more likely to be stored, with instances that have a high score being more likely to get overwritten in the process.

In addition to the aforementioned methods, we consider two weaker baselines: one that trains a model exclusively on the stream without replaying stored memories, and another that stores the current stream instance with 50% probability, in the place of a randomly picked stored instance. We call these methods NAIVE and RANDOM respectively.

³ Of the two GSS alternatives proposed in Aljundi et al. (2019b), we select the best-performing one, namely GSS-Greedy.

3.2. Simulating Imbalances

In this subsection, we describe our approach to creating imbalanced streams. For each class, we define its retention factor as the percentage of its instances in the original dataset that will be present in the stream. We define a vector r containing k retention factors as follows:

    r = (r_1, r_2, ..., r_k).    (2)

We distribute the retention factors to each class at random and without replacement, starting over if the number of classes in the dataset is larger than k. In practice, we use

    r = (10^-2, 10^-1.5, 10^-1, 10^-0.5, 10^0)    (3)

for all of our experiments. Given this choice, the minimum imbalance that can exist between two classes in a stream is approximately three to one, while the maximum is 100 to one. The selection of these retention factors represents a trade-off between simulating extreme imbalances and ensuring adequate class representation. Specifically, we take into account the fact that each class in CIFAR-100 (see Subsection 3.3) contains only 500 instances. Consequently, this selection of r ensures that at least five instances from each class will be present in the resulting stream, while allowing for imbalances of up to two orders of magnitude.

The retention factors for each class are selected at random in every run, and thus each run is performed using a stream with different imbalances. We want to stress, however, that in each experiment, all memory population methods are compared on exactly the same group of imbalanced streams. We do not discard other sources of stochasticity, such as the random initialization of a model or the replay sampling.
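As a sketch of this construction, the snippet below subsamples a labeled dataset according to the retention factors of Equation (3); the function name, the random-number handling, and the exact way of cycling through the factors when there are more classes than factors are our own choices.

```python
import numpy as np

# Retention factors from Eq. (3): each class keeps between 1% and 100% of its instances.
RETENTION_FACTORS = np.array([10**-2, 10**-1.5, 10**-1, 10**-0.5, 10**0])

def make_imbalanced_stream(X, y, retention=RETENTION_FACTORS, seed=0):
    """Subsample each class according to a randomly assigned retention factor."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    # assign factors at random and without replacement, starting over if there
    # are more classes than factors (e.g., CIFAR-100)
    reps = int(np.ceil(len(classes) / len(retention)))
    factors = np.concatenate([rng.permutation(retention) for _ in range(reps)])
    factors = factors[:len(classes)]

    kept_X, kept_y = [], []
    for c, f in zip(classes, factors):
        idx = np.flatnonzero(y == c)
        n_keep = max(1, int(round(f * len(idx))))        # keep a fraction f of class c
        keep = rng.choice(idx, size=n_keep, replace=False)
        kept_X.append(X[keep])
        kept_y.append(y[keep])
    # arrays ordered class by class; feed the result to split_stream() above
    return np.concatenate(kept_X), np.concatenate(kept_y)
```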
3.3. Benchmarks

Datasets. Following Aljundi et al. (2019b), we select MNIST (LeCun et al., 2010) and CIFAR-10 (Krizhevsky, 2012) for our experiments. In addition, we use Fashion-MNIST (Xiao et al., 2017) and CIFAR-100 (Krizhevsky, 2012). All four of the datasets used in this work are freely available online. We evaluate each method on the standard test set of each selected dataset.

We opt for using one class per task, with the classes being presented in increasing order. This way, the continuum is more difficult to learn, since it is split into more distinct tasks. When two or more classes are allowed to be present and iid in each task, as is usually the case in the relevant research (Chaudhry et al., 2019b; Aljundi et al., 2019b), the stream remains stationary for longer periods of time steps, and is thus easier to learn. All datasets are used as split benchmarks, that is to say, the learner is first presented with the instances of the first class, then with the ones of the second, and so on. We found the influence of the class ordering on the results to be negligible.⁴ We intentionally refrain from using the permuted MNIST benchmark (Goodfellow et al., 2014), due to the criticism it has received in Farquhar & Gal (2018) for being too simple and unrealistic for CL.

⁴ For more details see the supplementary material.

Models. Our choices here are largely influenced by previous work (Chaudhry et al., 2019b; Lopez-Paz & Ranzato, 2017). For MNIST and Fashion-MNIST, we train a multi-layer perceptron (MLP) consisting of two hidden layers with 250 neurons and ReLU activations. Since we need a model with greater modeling capacity for CIFAR-10 and CIFAR-100, we pick a ResNet-18 (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009).

Hyperparameters. We use a learning rate of 0.05 when training the MLP and 0.01 when training the ResNet-18. Both were selected via grid search in the range [0.001, 0.1]. Following Aljundi et al. (2019b) and Chaudhry et al. (2019b), we set the batch size at b = 10, abiding by the assumption that batches provided by the stream should be relatively small, and we perform n_b = 5 update steps per incoming batch, as this is a good trade-off between minimizing the training time and maximizing the predictive performance.
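For concreteness, the models and learning rates above could be instantiated roughly as follows; the flattened 784-dimensional input, the output dimensionality, the torchvision weights identifier, and the use of plain SGD are our assumptions, since the text specifies only layer sizes, activations, pre-training, and learning rates.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# MLP for MNIST / Fashion-MNIST: two hidden layers of 250 ReLU units.
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 250), nn.ReLU(),
    nn.Linear(250, 250), nn.ReLU(),
    nn.Linear(250, 10),
)
mlp_optimizer = torch.optim.SGD(mlp.parameters(), lr=0.05)

# ResNet-18 pre-trained on ImageNet for CIFAR-10 / CIFAR-100,
# with the final layer replaced to match the number of classes.
cnn = resnet18(weights="IMAGENET1K_V1")
cnn.fc = nn.Linear(cnn.fc.in_features, 10)   # 100 for CIFAR-100
cnn_optimizer = torch.optim.SGD(cnn.parameters(), lr=0.01)
```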
3.4. Comparison of Memory Population Methods

Here, we compare the five selected methods (NAIVE, RANDOM, RESERVOIR, GSS, and CBRS) over the four selected datasets. Complying with previous work (Aljundi et al., 2019b), we set the memory size at m = 500 for MNIST and Fashion-MNIST, while for the more difficult CIFAR-10 and CIFAR-100, we set it at m = 1000. We report the 95% confidence interval of the test set accuracy (over ten runs) of each model, after having been trained with the corresponding memory population algorithm. The relevant results are presented in Table 1.

Table 1. Comparison of the five learning methods over four different benchmarks. For MNIST and Fashion-MNIST we set the memory size at m = 500, while for the two CIFAR datasets we set it at m = 1000. All methods use weighted replay so that they can be compared on equal terms. We report the 95% confidence interval of the accuracy on the test set after the training is complete. Each experiment is repeated for ten different streams.

| METHODS   | MNIST      | F-MNIST    | CIFAR-10   | CIFAR-100  |
|-----------|------------|------------|------------|------------|
| NAIVE     | 10.1 ± 0.0 | 10.0 ± 0.0 | 10.0 ± 0.0 | 1.0 ± 0.0  |
| RANDOM    | 37.8 ± 5.6 | 38.9 ± 6.7 | 52.1 ± 3.8 | 25.8 ± 1.0 |
| RESERVOIR | 67.9 ± 3.0 | 64.1 ± 1.8 | 54.7 ± 2.2 | 28.1 ± 1.2 |
| GSS       | 74.2 ± 3.4 | 64.9 ± 3.1 | 63.3 ± 2.3 | 23.0 ± 1.1 |
| CBRS      | 83.3 ± 2.5 | 75.0 ± 2.5 | 73.4 ± 2.2 | 40.2 ± 1.0 |

We observe that CBRS outperforms the other four approaches in all four datasets by significant margins. Especially in the case of CIFAR-100, where some classes are represented only by a single-digit number of instances in the stream, the relative gap over the second-best method is more than 40%. As expected, we notice that the weaker baselines (i.e., NAIVE and RANDOM) are significantly outperformed by the other three approaches.

In all experiments excluding the ablation, all replay methods use weighted replay. By setting up the experiments in this manner, we isolate the difference in performance between the different memory population schemes. In Subsection 3.6, we perform an ablation study that probes the use of the two different types of replay.

3.5. Varying the Memory Size

The goal of this experiment is to survey the impact of memory size on the final learning performance. We evaluate the four methods that use a memory (i.e., all except for NAIVE) over four different memory sizes, ranging from very small (i.e., m = 100) to very large (i.e., m = 5000). Since we are using only one dataset in this experiment, we opt for one of moderate difficulty, namely Fashion-MNIST. For the same reason mentioned in the previous subsection, all methods use weighted replay. We present the results in Table 2.

Table 2. Comparison of the four memory population methods on Fashion-MNIST for various memory sizes. All methods use weighted replay so that they can be compared on equal terms. We report the 95% confidence interval of the accuracy on the test set after the training is complete. Each experiment is repeated for ten different streams.

| METHODS   | m = 100    | m = 500    | m = 1000   | m = 5000   |
|-----------|------------|------------|------------|------------|
| RANDOM    | 21.7 ± 4.2 | 38.7 ± 5.6 | 57.6 ± 5.1 | 76.1 ± 1.3 |
| RESERVOIR | 53.7 ± 3.0 | 64.3 ± 2.8 | 68.2 ± 2.0 | 75.2 ± 1.9 |
| GSS       | 59.5 ± 3.6 | 65.0 ± 3.8 | 65.1 ± 3.9 | 70.7 ± 3.2 |
| CBRS      | 66.5 ± 1.5 | 75.8 ± 2.0 | 76.7 ± 2.4 | 77.1 ± 1.8 |

The results are consistent across all memory sizes. CBRS outperforms all other methods in every case, with the relative performance improvement being more tangible for smaller memory sizes.

Table 3. Ablation study. We want to isolate the performance boost that CBRS provides both with and without weighted replay. We present results on two different datasets (MNIST and CIFAR-10) and for two different memory sizes (m = 1000 and m = 5000). We report the 95% confidence interval of the accuracy on the test set after the training is complete. Each experiment is repeated for ten different streams.

m = 1000:

| METHODS   | MNIST, uniform | MNIST, weighted | CIFAR-10, uniform | CIFAR-10, weighted |
|-----------|----------------|-----------------|-------------------|--------------------|
| RESERVOIR | 71.3 ± 2.0     | 76.3 ± 2.8      | 54.2 ± 2.1        | 56.7 ± 1.4         |
| GSS       | 75.4 ± 2.6     | 77.9 ± 2.3      | 62.9 ± 2.2        | 62.8 ± 2.1         |
| CBRS      | 86.0 ± 2.6     | 86.1 ± 2.6      | 74.4 ± 2.5        | 73.7 ± 2.6         |

m = 5000:

| METHODS   | MNIST, uniform | MNIST, weighted | CIFAR-10, uniform | CIFAR-10, weighted |
|-----------|----------------|-----------------|-------------------|--------------------|
| RESERVOIR | 77.9 ± 4.3     | 86.3 ± 1.8      | 68.7 ± 4.3        | 73.1 ± 1.7         |
| GSS       | 76.6 ± 4.8     | 84.2 ± 1.9      | 68.5 ± 4.1        | 71.8 ± 2.0         |
| CBRS      | 85.6 ± 2.8     | 90.3 ± 1.3      | 76.8 ± 2.3        | 78.3 ± 1.9         |

Table 4. Comparison of the four memory population methods with respect to their computational efficiency. With regard to time complexity, we report the 95% confidence interval of the wall-clock time per incoming batch in milliseconds over ten runs. With regard to memory overhead, we report the relative increase in storage space, compared to storing only the selected data instances, that each method results in. For more details on the contents of this table see Subsection 3.7.

Wall-clock time per incoming batch (ms):

| METHODS   | MNIST       | F-MNIST     | CIFAR-10     | CIFAR-100    |
|-----------|-------------|-------------|--------------|--------------|
| RANDOM    | 20.0 ± 0.5  | 19.3 ± 1.1  | 363.4 ± 6.4  | 357.6 ± 4.8  |
| RESERVOIR | 19.8 ± 0.5  | 17.7 ± 0.7  | 367.6 ± 4.6  | 358.7 ± 4.8  |
| GSS       | 295.9 ± 18. | 316.3 ± 11. | 3671.1 ± 46. | 3647.8 ± 27. |
| CBRS      | 19.3 ± 0.2  | 20.4 ± 1.0  | 364.9 ± 5.8  | 357.9 ± 5.1  |

Memory overhead:

| METHODS   | MNIST   | F-MNIST | CIFAR-10 | CIFAR-100 |
|-----------|---------|---------|----------|-----------|
| RANDOM    | 0.00%   | 0.00%   | 0.00%    | 0.00%     |
| RESERVOIR | 0.00%   | 0.00%   | 0.00%    | 0.00%     |
| GSS       | 827.19% | 827.19% | 3639.86% | 3639.86%  |
| CBRS      | 0.13%   | 0.13%   | 0.03%    | 0.03%     |
3.6. Ablation Study

At this point, we would like to isolate the influence of using CBRS in the place of another strong baseline (i.e., RESERVOIR or GSS), from that of using weighted instead of uniform replay. We select an easier dataset (MNIST) and a harder one (CIFAR-10), and measure the accuracy of the trained classifier for all combinations of the three selected memory population methods (i.e., RESERVOIR, GSS, CBRS), two memory sizes (m = 1000 and m = 5000), and the use of either uniform or weighted replay. Each combination is repeated ten times and the results are presented in Table 3.

There are three points we would like to emphasize here. First, we note that the relative performance gaps between CBRS and the other two methods are higher when using uniform replay. When the memory is populated using RESERVOIR or GSS, two methods that are significantly influenced by the imbalanced distribution of the stream, the probabilities of replaying instances of different classes are very disparate. These disparities, in turn, cause the learner to be able to classify instances of some classes much better than others, and thus deteriorate its accuracy on the test set, where all the classes are more or less balanced.

Second, we observe that using weighted replay is almost always beneficial, compared to using uniform replay. This observation follows naturally from what we described in the previous paragraph. In short, the use of weighted replay partially masks the effects of the imbalanced memory by oversampling the underrepresented classes.

Finally, we notice that when using uniform replay, one of the CBRS entries in Table 3 is worse for m = 5000 than it is for m = 1000. This abnormality can be intuitively understood by realizing that, for the same stream, a memory with higher storage capacity will end up being more imbalanced. In Figure 1 for instance, CBRS would keep the memory perfectly balanced if its size were m = 255, but since it actually is m = 1000, it cannot.

3.7. Computational Efficiency

An important characteristic of any online CL approach is its computational efficiency. Keeping that in mind, we contrast the time and memory complexity of the four selected replay methods (RANDOM, RESERVOIR, GSS and CBRS). All experiments are run on an NVIDIA TITAN Xp. For the implementation of GSS, we followed the pseudocode in Aljundi et al. (2019b), while CBRS, RANDOM and RESERVOIR were optimized to the best of our knowledge. The experimental setup is identical to that of Subsection 3.4. All relevant results are presented in Table 4.

Timing the memory population methods is relatively simple. We repeat the learning process for all replay-based methods over ten different streams, and report the 95% confidence interval of the elapsed time per incoming batch in milliseconds. We observe that RANDOM, RESERVOIR and CBRS require roughly the same amount of time. In contrast, GSS is about an order of magnitude more time-consuming than the other three methods. By profiling GSS, we found that its main bottleneck is the additional computation of gradients that is involved in the calculation of the similarity score for each incoming instance.

Quantifying the memory overhead is slightly more complex. First of all, the number we report is a memory overhead factor. We define this factor as the additional storage space required by each memory population method, divided by the space that the m stored instances occupy. RANDOM and RESERVOIR require no additional storage space, hence their memory overhead factor is zero in every case. GSS requires additional storage space for 10 gradient vectors, each of them containing approximately 3.2 × 10^5 (for the MLP) or 1.1 × 10^7 (for the ResNet-18) 32-bit floating-point numbers. In order for CBRS to be time-efficient, it utilizes a data structure that keeps track of which memory locations correspond to each class, and thus requires additional storage space equivalent to m 32-bit integers in every case. Therefore, it has a meager 0.13% (MNIST and Fashion-MNIST) and 0.03% (CIFAR-10 and CIFAR-100) memory overhead. Conversely, the additional memory requirements of GSS are equivalent to having an eight (for MNIST and Fashion-MNIST) and 36 (for CIFAR-10 and CIFAR-100) times larger memory than the one it actually uses for the stored instances.
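As a quick sanity check of these figures, the snippet below recomputes the overhead factors of Table 4 under the assumption that stored instances and auxiliary values both occupy 32 bits per element; because the parameter counts quoted in the text are approximate, the GSS values land near, rather than exactly on, the reported numbers.

```python
# Back-of-the-envelope check of the memory-overhead factors in Table 4.

def overhead(extra_values, m, values_per_instance):
    """Additional 32-bit values divided by the 32-bit values of the m stored instances."""
    return extra_values / (m * values_per_instance)

# CBRS: one 32-bit integer per memory slot.
print(f"CBRS, MNIST/F-MNIST: {overhead(500, 500, 28 * 28):.2%}")        # 0.13%
print(f"CBRS, CIFAR-10/100:  {overhead(1000, 1000, 3 * 32 * 32):.2%}")  # 0.03%

# GSS: ten stored gradient vectors, each with one entry per model parameter.
print(f"GSS, MNIST/F-MNIST: {overhead(10 * 3.2e5, 500, 28 * 28):.0%}")        # ~816%, reported 827.19%
print(f"GSS, CIFAR-10/100:  {overhead(10 * 1.1e7, 1000, 3 * 32 * 32):.0%}")   # ~3581%, reported 3639.86%
```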
3.8. Discussion

In this subsection, we attempt to interpret our results qualitatively. As we showed experimentally, CBRS outperforms RESERVOIR and GSS in terms of predictive accuracy during evaluation. This performance gap can be explained in two steps.

First, a more balanced memory translates into all classes being replayed more or less with the same frequency, and thus not being forgotten. In the opposite case, if a certain class is severely underrepresented in memory, it will not be replayed as often, and will thus be more prone to being forgotten. This claim is indirectly exemplified in Table 3, where CBRS is the method that has the smallest relative increase in accuracy when switching from uniform to weighted replay, since it stores a more balanced subset of the data instances provided by the imbalanced stream.

Second, and more importantly, RESERVOIR and GSS are biased by the imbalanced distribution of the stream and fail to store an adequate number of data instances from highly underrepresented classes, as is demonstrated in Figure 1. As a consequence, these classes are largely forgotten come evaluation time. In contrast, CBRS is guaranteed to store all data instances from significantly underrepresented classes, and thus is in the best possible position to not forget them in the future. As evidence of this statement, we again refer to Table 3. We observe that even when using weighted replay, CBRS still has a higher accuracy than the other replay methods in all examined cases, as it stores enough instances to be able to better remember even the most severely underrepresented classes of the stream.

4. Related Work

There are three main CL paradigms in the relevant research. The first one, called regularization-based CL, applies one or more additional loss terms when learning a task, to ensure previously acquired knowledge is not forgotten. Methods that follow this paradigm are, for instance, learning without forgetting (Li & Hoiem, 2017), synaptic intelligence (Zenke et al., 2017) and elastic weight consolidation (Kirkpatrick et al., 2017). Approaches following the parameter isolation paradigm sidestep catastrophic forgetting by allocating non-overlapping sets of model parameters to each task, with relevant examples being the work of Serrà et al. (2018) and Mallya & Lazebnik (2018). Most such methods require task identifiers during training and prediction time.

Replay-based methods are another major CL paradigm. These methods replay previously observed data instances during future learning in order to mitigate catastrophic forgetting.
The replay can take place either directly, via storing a small subset of the observed data (Isele & Cosgun, 2018), or indirectly, with the aid of generative models (Shin et al., 2017).

The majority of research pertinent to CL focuses on the task-incremental setting, with some approaches (von Oswald et al., 2020; Shin et al., 2017; Rebuffi et al., 2016) achieving strong predictive performance while at the same time minimizing forgetting. In contrast, the more challenging online CL setting has not been explored as much. Relevant approaches store a small subset of the observed instances, which are then exploited either via replay (Chaudhry et al., 2019b; Aljundi et al., 2019a;b), or by regularizing the training process using data-dependent constraints (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019a). The latter approach is not applicable to our learning setting, since it requires task identifiers both at training and at evaluation time. Therefore, replay-based methods are the only ones that can be practically applied to the minimal-assumption learning setting that we adopt in this work.

Reservoir sampling (Vitter, 1985) extracts an iid subset of the observations it receives, and was, until recently, considered to be the state of the art in selecting which data instances to store during online continual learning (Isele & Cosgun, 2018; Chaudhry et al., 2019b). Aljundi et al. (2019b) introduced two approaches (i.e., GSS-IQP and GSS-Greedy) that try to maximize the variance of the stored memories with respect to the gradient direction of the model update they would generate. GSS-IQP utilizes integer quadratic programming, while GSS-Greedy is a more efficient heuristic approach that actually outperforms GSS-IQP. Additionally, Aljundi et al. (2019b) demonstrate that both their proposed algorithms achieve higher accuracy than reservoir sampling when learning moderately imbalanced streams of MNIST (LeCun et al., 2010) digits with two classes per task.

5. Conclusion

In this work, we examined the issue of online continual learning from severely imbalanced, temporally correlated streams. Moreover, we proposed CBRS, an efficient memory population approach that outperforms the current state of the art in such learning settings. We provided an interpretation for the performance gap between CBRS and the current state of the art, accompanied by supportive empirical evidence. Specifically, we argued that underrepresented classes in memory tend to be forgotten more quickly, partly because they are not replayed as often as more sizeable classes, and also because of the scarcity of their corresponding stored instances. Further improvements in replay-based CL methods could be possible if memory population methods are able to infer which classes are more challenging to learn and store their instances preferentially.

Acknowledgements

We would like to thank Ivan Vulić and Pieter Delobelle for providing valuable feedback on this work. In addition, we want to thank the anonymous reviewers for their comments. This work is part of the CALCULUS project (http://calculus-project.eu), which is funded by the ERC Advanced Grant H2020-ERC-2017-ADG 788506.

References

Aljundi, R., Belilovsky, E., Tuytelaars, T., Charlin, L., Caccia, M., Lin, M., and Page-Caccia, L. Online continual learning with maximal interfered retrieval. In Advances in Neural Information Processing Systems 32, pp. 11849–11860, 2019a.
Aljundi, R., Lin, M., Goujaud, B., and Bengio, Y. Gradient-based sample selection for online continual learning. In Advances in Neural Information Processing Systems 32, pp. 11816–11825, 2019b.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

Branco, P., Torgo, L., and Ribeiro, R. P. A survey of predictive modeling on imbalanced domains. ACM Comput. Surv., 49(2), August 2016.

Cangelosi, A. and Schlesinger, M. Developmental Robotics: From Babies to Robots. MIT Press, 2015.

Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. Efficient lifelong learning with A-GEM. In International Conference on Learning Representations, 2019a.

Chaudhry, A., Rohrbach, M., Elhoseiny, M., Ajanthan, T., Dokania, P. K., Torr, P. H., and Ranzato, M. Continual learning with tiny episodic memories. arXiv preprint arXiv:1902.10486, 2019b.

De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., and Tuytelaars, T. Continual learning: A comparative study on how to defy forgetting in classification tasks. arXiv preprint arXiv:1909.08383v1, 2019.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009.

Ego-Stengel, V. and Wilson, M. A. Disruption of ripple-associated hippocampal activity during rest impairs spatial learning in the rat. Hippocampus, 20(1):1–10, 2010.

Farquhar, S. and Gal, Y. Towards robust evaluations of continual learning. In Lifelong Learning: A Reinforcement Learning Approach workshop, ICML, 2018.

French, R. M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.

Girardeau, G., Benchenane, K., Wiener, S. I., Buzsáki, G., and Zugaro, M. B. Selective suppression of hippocampal ripples impairs spatial memory. Nature Neuroscience, 12(10):1222, 2009.

Goodfellow, I. J., Mirza, M., Da, X., Courville, A. C., and Bengio, Y. An empirical investigation of catastrophic forgetting in gradient-based neural networks. CoRR, abs/1312.6211, 2014.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.

Isele, D. and Cosgun, A. Selective experience replay for lifelong learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

Krizhevsky, A. Learning multiple layers of features from tiny images. University of Toronto, 05 2012.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1097–1105, 2012.

LeCun, Y., Cortes, C., and Burges, C. MNIST handwritten digit database. Yann LeCun's website, 2010. URL http://yann.lecun.com/exdb/mnist/.

Li, Z. and Hoiem, D. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017.
Lopez-Paz, D. and Ranzato, M. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems 30, pp. 6467–6476, 2017.

Mallya, A. and Lazebnik, S. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7765–7773, 2018.

McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, Volume 24, pp. 109–165, 1989.

Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.

Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.

Rebuffi, S.-A., Kolesnikov, A., Sperl, G., and Lampert, C. H. iCaRL: Incremental classifier and representation learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5533–5542, 2016.

Serrà, J., Surís, D., Miron, M., and Karatzoglou, A. Overcoming catastrophic forgetting with hard attention to the task. arXiv preprint arXiv:1801.01423, 2018.

Shin, H., Lee, J. K., Kim, J., and Kim, J. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems 30, pp. 2990–2999, 2017.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.

Tani, J. Exploring Robotic Minds: Actions, Symbols, and Consciousness as Self-Organizing Dynamic Phenomena. Oxford University Press, 2016.

van de Ven, G. M. and Tolias, A. S. Three scenarios for continual learning. arXiv preprint arXiv:1904.07734, 2019.

Vitter, J. S. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37–57, March 1985.

von Oswald, J., Henning, C., Sacramento, J., and Grewe, B. F. Continual learning with hypernetworks. In International Conference on Learning Representations, 2020.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Zenke, F., Poole, B., and Ganguli, S. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3987–3995, 2017.