Published as a conference paper at ICLR 2023

RECURSIVE TIME SERIES DATA AUGMENTATION

Amine Mohamed Aboussalah¹, Minjae Kwon², Raj G Patel³, Cheng Chi³, Chi-Guhn Lee³
¹New York University, ²University of Virginia, ³University of Toronto
ama10288@nyu.edu, hbt9su@virginia.edu, {rajg.patel, cheng.chi}@mail.utoronto.ca, cglee@mie.utoronto.ca

ABSTRACT

Time series observations can be seen as realizations of an underlying dynamical system governed by rules that we typically do not know. For time series learning tasks we create our model using the available data. Training on the available realizations, where data is limited, often induces severe over-fitting and thereby prevents generalization. To address this issue, we introduce a general recursive framework for time series augmentation, which we call the Recursive Interpolation Method (RIM). New augmented time series are generated from the original time series using a recursive interpolation function and are then used in training. We perform a theoretical analysis to characterize the proposed RIM and to guarantee its performance under certain conditions. We apply RIM to diverse synthetic and real-world time series cases and achieve strong performance over non-augmented data on a variety of learning tasks. Our method is also computationally more efficient and leads to better performance than state-of-the-art time series data augmentation.

1 INTRODUCTION

The recent success of machine learning (ML) algorithms depends on the availability of a large amount of data and prodigious computing power, which in practice are not always available. In real-world applications, it is often impossible to sample indefinitely, and ideally we would like the ML model to make good decisions with a limited number of samples. To overcome these issues, we can exploit additional information, such as the structure or invariances in the data, that helps ML algorithms learn efficiently and focus on the features most important for solving the task. In ML, the exploitation of structure in the data has been handled using four different yet complementary approaches: 1) architecture design, 2) transfer learning, 3) data representation, and 4) data augmentation. Our focus in this work is on data augmentation approaches in the context of time series learning.

Time series representations do not expose the full information of the underlying dynamical system Prado (1998) in a way that ML can easily recognize. For instance, in financial time series data there are patterns at various scales that can be learned to improve performance. At a more fundamental level, time series are one-dimensional projections of a hypersurface of data called the phase space of a dynamical system. This projection results in a loss of information regarding the dynamics of the system. However, we can still make inferences about the dynamical system that projects a time series realization. Our approach is to use these inferences to generate additional time series data from the original realization, build richer representations, and improve time series pattern identification, resulting in better parameters and reduced variance. We show that our methodology is applicable to a variety of ML algorithms.

Time series learning problems depend on the observed historical data used for training. We often use a set of time series data to train the ML model. Each element in the set can be viewed as a sample derived from the underlying stochastic dynamical system.
However, each historical time series data sample is only one particular realization of the underlying stochastic dynamical system in the real world that we are trying to learn. Our work focuses on problems where the available realizations are limited, but it is not restricted to them. In fact, our method can be applied to any time series learning task: tasks such as stock price prediction, where we often have a single realization, as well as problems with numerous realizations, such as speech recognition, where many audio clips are available for training.

Let us consider the stock price prediction problem. The task is to predict or classify the trend of the future price. Ideally we want our model to perform well by capturing the stochastic dynamics of stock markets. However, we only train the model using a single time series realization or limited historical realizations. As a result, we do not truly capture the characteristic behaviour of the underlying dynamical system. Using the original training data, and hence one or a few realizations of the underlying dynamical system, usually induces over-fitting. This is not ideal, as we want our model to perform well on the stochastic system instead of just a specific realization of that system.

Contributions. The contributions of our work are as follows:
- We present a time series augmentation technique based on recursive interpolation.
- We provide a theoretical analysis of the learning improvement for the proposed time series augmentation method: we show that our recursive augmentation allows us to control by how much the augmented time series trajectory deviates from the original time series trajectory (Theorem 3.1), and that there is a natural trade-off induced when our augmentation deviates considerably from the original time series (Theorem 3.2). We demonstrate that our learning bound depends on the dimension and properties of the time series, as well as on the neural network structure (Theorems 3.3 and 3.4). We believe that this work is the first to offer a theoretical ML framework for time series data augmentation with guarantees for variance reduction in the learned model (Theorem 3.5).
- We empirically demonstrate learning improvements using synthetic data as well as real-world time series datasets.

Outline of the paper. Section 2 presents the literature review. Section 3 defines the notation and the problem setting, and provides the main theoretical results. Section 4 describes the experimental results, and Section 5 concludes with a summary and a discussion of future work.

2 RELATED WORK

Augmentation for Computer Vision. In the computer vision context, there are multiple ways to augment image data, such as cropping, rotation, translation, flipping, noise injection, and so on. Among them, the mixup technique proposed in Zhang et al. (2018) is similar to our approach. They train a neural network on convex combinations of pairs of images and their labels. However, simply applying a static technique to dynamic time series data is not appropriate. Chen et al. (2020) showed that data augmentation has an effect similar to an averaging operation over the orbits of a certain group of transformations that keep the data distribution invariant.

Augmentation for Time Series. There is an exhaustive list of transformations applied to time series that are usually used as data augmentation Wen et al. (2020a). Fawaz et al. (2018) described transformations in the time domain such as time warping and time permutation.
There are methods that belong to the magnitude domain, such as magnitude warping, Gaussian noise injection, quantization, scaling, and rotation Wen & Keyes (2019). There exist other transformations on time series in the frequency and time-frequency domains that are based on the Discrete Fourier Transform (DFT). In this context, transformations are applied in the amplitude and phase spectra of the time series, and the inverse DFT is applied to generate a new time series signal Gao et al. (2020). Besides the transformations in different domains, there are also more advanced methods, including decomposition-based methods such as the Seasonal and Trend decomposition using Loess (STL) method and its variants Cleveland et al. (1990); Wen et al. (2020b), statistical generative models Kang et al. (2020), and learning-based methods. The learning-based methods can be further divided into embedding-space methods De Vries & Taylor (2017) and deep generative models (DGMs) Esteban et al. (2017); Yoon et al. (2019). These approaches are problem dependent and do not offer theoretically guaranteed learning improvement. In addition, the learning-based methods require large amounts of training data.

Augmentation for Reinforcement Learning (RL). Laskin et al. (2020) presented the RL with Augmented Data (RAD) module, which can augment most RL algorithms that use image data. They demonstrated that augmentations such as random translation, random convolutions, crop, patch cutout, amplitude scale, and color jitter can enable RL algorithms to outperform complex advanced methods on standard benchmarks. Kostrikov et al. (2021) presented a data augmentation method that can be applied to conventional model-free reinforcement learning (RL) algorithms, enabling learning directly from pixels without the need for pre-training or auxiliary losses. The inclusion of this augmentation method improves performance substantially, enabling a Soft Actor-Critic agent to reach advanced functioning capability on the DeepMind control suite, outperforming model-based methods and contrastive learning. Laskin et al. (2020) and Kostrikov et al. (2021) show the benefit of RL augmentation using convolutional neural networks (CNNs) for static data, but they do not handle dynamic data such as time series.

Ideally, we would like to have access to more data that is representative of the underlying dynamics of the system or of the regime under which we operate. However, we cannot randomly add more data, as there is a probability that the added data might not be representative of our stochastic dynamical system. To ensure that we are able to add meaningful data without disturbing the properties of the original data, we introduce a new approach called the Recursive Interpolation Method. Our paper proposes a recursive interpolation method for time series as a tool for generating data augmentations.

3 THEORETICAL FRAMEWORK FOR RECURSIVE INTERPOLATION METHOD

3.1 RECURSIVE INTERPOLATION METHOD (RIM)

In our setting, we consider each time series sample as one realization from the underlying dynamical system. The realization consists of features along the time axis. Let $d+1$ be the dimension of the time series sample and $\{0, 1, \dots, k\}$ be the label set for each sample. Then each sample belongs to $\mathbb{R}^{d+1} \times \{0, 1, \dots, k\}$. Let $S = \{s_0, s_1, \dots, s_N\}$ be the collection of the time series samples. Let $D$ be a distribution with support $[0, 1)$ and let $\lambda_i$ be drawn from $D$ independently, denoted by $\lambda_i \sim D$, for time $i \in [1:d]$. Let us denote by $\boldsymbol{\lambda} = (\lambda_1, \dots, \lambda_d)$ the vector of interpolations.
For the sake of notational simplicity, in the rest of the paper we write $\lambda$ for $\boldsymbol{\lambda}$, and $\lambda \sim D$ means that each component $\lambda_i$ of $\lambda$ is sampled independently from $D$. For a given vector $\lambda \in [0, 1)^d$ and a time series sample $s \in S$, we generate an augmented time series sample $s_\lambda$ as follows. Consider an original time series sample $s = (x_0, x_1, \dots, x_d, y)$, where $(x_0, x_1, \dots, x_d) \in \mathbb{R}^{d+1}$ are the time series features and $y \in \{0, 1, \dots, k\}$ is the label of the corresponding time series sample. For each $\lambda = (\lambda_1, \dots, \lambda_d) \in [0, 1)^d$ ($\lambda_0$ is considered to be a dummy value), we define an augmented sample $s_\lambda = (x_{0,\lambda_0}, x_{1,\lambda_1}, \dots, x_{d,\lambda_d}, y)$ such that

$$x_{i,\lambda_i} = (1 - \lambda_i)\,x_i + \lambda_i\,x_{i-1,\lambda_{i-1}} \quad \text{and} \quad x_{0,\lambda_0} = x_0 \tag{1}$$

Figure 1: Illustration of RIM (example with $\lambda_i = \lambda = 0.8$).

The newly generated augmented sample $s_\lambda$ has the same label $y$ as the original sample $s$. Our recursive methodology allows us to generate a new time series realization that preserves the trajectory of the original data within some bound (Theorem 3.1).

Notation. Let $l(s, \theta)$ be a loss function defined on a sample $s$ and a model parameter $\theta$; similarly, $l(s_\lambda, \theta)$ is the loss defined on the augmented sample using a model parameter $\theta$. We assume that the distribution $P$ on the sample space $S$ is parametrized by $\theta$. Given a few realizations $\{s_i\}_{i \in [0:N]}$ from the distribution $P$, we set:

$$\theta^* = \arg\min_\theta \mathbb{E}_{s \sim P}[l(s, \theta)], \qquad \hat\theta = \arg\min_\theta \frac{1}{N+1}\sum_{i=0}^{N} l(s_i, \theta),$$
$$\theta^*_{aug} = \arg\min_\theta \mathbb{E}_{s \sim P}\big[\mathbb{E}_{\lambda \sim D}[l(s_\lambda, \theta)]\big], \qquad \hat\theta_{aug} = \arg\min_\theta \frac{1}{N+1}\sum_{i=0}^{N} \mathbb{E}_{\lambda \sim D}[l(s_{i,\lambda}, \theta)], \tag{2}$$
$$R_N(l \circ \Theta) = \mathbb{E}_{\epsilon_i}\,\mathbb{E}\Big[\sup_{\theta \in \Theta}\Big|\frac{1}{N+1}\sum_{i=0}^{N} \epsilon_i\, l(s_i, \theta)\Big|\Big].$$

Using the notation above, our recursive time series data augmentation minimizes the augmented loss, in which we take the expectation over the augmented sample space. We call it the average augmented loss and denote it by $l_{aug}(s, \theta) = \mathbb{E}_{\lambda \sim D}[l(s_\lambda, \theta)]$. The three most important aspects of our theoretical framework are the characterization of our recursive time series augmentation: the trade-off that is induced in the learning parameter space (Section 3.2), the learning bound showing the impact of the structural properties of the time series and of the neural network on the learning parameters (Section 3.3), and better parameter learning with reduced variance when using augmented samples compared to non-augmented ones (Section 3.4).

3.2 LEARNING BOUND CONNECTING ORIGINAL TIME SERIES AND AUGMENTED TIME SERIES

We define the recursively interpolated time series and show that the augmented time series samples can deviate from the original ones. We measure this deviation using a norm distance between the augmented time series and the original one. We show that this distance is bounded, where the bound depends on the characteristics of the time series features (Theorem 3.1). Let

$$\operatorname{sign}(t) = \begin{cases} 0 & \text{if } t = 0 \\ 1 & \text{if } t > 0 \\ -1 & \text{if } t < 0 \end{cases}, \qquad \delta_{ab} = \begin{cases} 1 & \text{if } a = b \\ 0 & \text{if } a \neq b, \end{cases}$$

and let $D$ be a distribution with support $[0, 1)$.

Theorem 3.1. (Characterization of recursive augmentation) If $\lambda_n \sim D$ and $g(\lambda_n) = (1 - \lambda_n)(1 - \delta_{0n}) + (1 - \operatorname{sign}(n))$, then the following holds. Let $n \in [0:N]$.

(1) $x_{n,\lambda_n} = \sum_{k=0}^{n} \big(\prod_{i=k+1}^{n} \lambda_i\big)\, g(\lambda_k)\, x_k$, where $\lambda_j \sim D$ for $j \geq 1$ and $\lambda_0$ is a dummy value.

(2) Let $\|\cdot\|$ be a norm, $e = \mathbb{E}[D]$, $\bar m = \max_{i \in [1:N]} \|x_i - x_{i-1}\|$ and $m = \max_{i \in [0:N]} \|x_i\|$. Then

$$\big\|\mathbb{E}_{\lambda_1,\dots,\lambda_n}[(x_{n,\lambda_n} - x_n)]\big\| \leq \min\Big\{3em,\; \frac{e}{1-e}\,\bar m,\; Ne\,\bar m\Big\}, \tag{3}$$
$$\mathbb{E}_{\lambda_1,\dots,\lambda_n}\big[\|x_{n,\lambda_n} - x_n\|\big] \leq 2\lambda_n m. \tag{4}$$
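To make Eq. (1) and the closed form in Theorem 3.1(1) concrete, here is a minimal NumPy sketch of the recursion; it is our own illustration (the function name and the n = 2 sanity check are ours, not the authors' released code):

```python
import numpy as np

def rim_augment(x, lam):
    """Apply the RIM recursion of Eq. (1) to one feature sequence.

    x   : array of shape (d+1,), original features (x_0, ..., x_d)
    lam : array of shape (d,), interpolation weights (lambda_1, ..., lambda_d)
    """
    x = np.asarray(x, dtype=float)
    x_aug = np.empty_like(x)
    x_aug[0] = x[0]                      # x_{0,lambda_0} = x_0
    for i in range(1, len(x)):
        # x_{i,lambda_i} = (1 - lambda_i) * x_i + lambda_i * x_{i-1,lambda_{i-1}}
        x_aug[i] = (1.0 - lam[i - 1]) * x[i] + lam[i - 1] * x_aug[i - 1]
    return x_aug

rng = np.random.default_rng(0)
x = rng.normal(size=6)                   # d + 1 = 6 features
lam = rng.uniform(0.0, 1.0, size=5)      # lambda_i ~ U(0,1); use rng.beta(a, b, 5) for the Appendix C ablation
x_aug = rim_augment(x, lam)

# Sanity check of the closed form in Theorem 3.1(1) for n = 2:
# x_{2,l_2} = (1-l_2) x_2 + l_2 [(1-l_1) x_1 + l_1 x_0]
closed = (1 - lam[1]) * x[2] + lam[1] * ((1 - lam[0]) * x[1] + lam[0] * x[0])
assert np.isclose(x_aug[2], closed)
```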
Recall that $s = (x_0, x_1, \dots, x_d, y)$ and $s_\lambda = (x_{0,\lambda_0}, x_{1,\lambda_1}, \dots, x_{d,\lambda_d}, y)$; we have

$$\sqrt{\sum_{i=0}^{d} (x_i - x_{i,\lambda_i})^2} \;\leq\; \sum_{i=0}^{d} |x_i - x_{i,\lambda_i}| \;=\; \|s - s_\lambda\|_1.$$

Note that we used the $l_2$ norm for simplicity, but in general any norm satisfies the inequality. Hence, measuring the distance of each feature between the augmented time series sample and the original one gives us some information about how far they deviate from each other.

Theorem 3.2. (Learning bound using characterization of recursive augmentation) Without any loss of generality, let $l(\cdot, \cdot) \in [0, 1]$¹ be a loss function satisfying a Lipschitz condition and let $S = \{s_0, s_1, \dots, s_N\}$ be the collection of the time series samples. Then with probability at least $1 - \delta$ over the samples $\{s_i\}_{i \in [0:N]}$, we have

$$\mathbb{E}_{s \sim P}[l(s, \hat\theta_{aug})] - \mathbb{E}_{s \sim P}[l(s, \theta^*)] < 2R_N(l_{aug} \circ \Theta) + 5\sqrt{\frac{\log(1/\delta)}{2(N+1)}} + 2L_{Lip}\, \mathbb{E}_{s \sim P}\mathbb{E}_{\lambda \sim D}[\|s_\lambda - s\|]. \tag{5}$$

Moreover, we have

$$R_N(l_{aug} \circ \Theta) \leq R_N(l \circ \Theta) + \max_{i \in \{0,\dots,N\}} L_{Lip}\, \mathbb{E}_{\lambda \sim D}[\|s_{i,\lambda} - s_i\|], \tag{6}$$

with $\mathbb{E}_{\lambda \sim D}[\|s_{i,\lambda} - s_i\|] \leq 2mde$, where $e = \mathbb{E}[D]$, $d+1$ is the dimension of the time series sample, and $m = \max_{i \in [0:N]} \|x_i\|$.

Theorem 3.2 represents the standard argument for regret bounds using Rademacher complexity. The term on the left in (Eq. 5) is what we call the generalization error, or sometimes simply the risk, and the term on the right is called the regret (excess risk). Theorem 3.2 tells us how good our parameter estimation is compared to the optimal parameters. Our bound is governed by three terms: 1) the Rademacher complexity using the augmented data, 2) the sample size, and 3) the distance between the original time series samples and the augmented ones. If the distance in (Eq. 5) between time series samples before and after augmentation goes to zero, then the learning bound becomes tighter. Another implication of the distance being small is that our recursive augmentation method decreases the Rademacher complexity, as demonstrated by (Eq. 6). Decreasing the Rademacher complexity ends up producing a tighter bound on the parameter space. On the other hand, if the distance in (Eq. 5 and Eq. 6) becomes large (i.e., the augmented time series samples deviate considerably from the original ones), then we cannot guarantee with high probability that our method will always outperform the non-augmented case. This is a trade-off frequently observed in learning theory.

3.3 LEARNING BOUND CONNECTING THE STRUCTURAL PROPERTIES OF TIME SERIES AND NEURAL NETWORKS

Theorem 3.3. Let $f_\theta$ be a neural network with model parameter $\theta$, ReLU activations, and a sigmoid activation for the last layer, denoted by $f_\theta(x) = \sigma(g_\theta(x))$, where $g_\theta(x) = \nabla g_\theta^T x + b$ is the pre-activation signal for the last layer. Let $A$ and $B_i$ be the upper-triangular matrices of first- and second-order partial derivatives of the augmented features with respect to $\lambda$ at $\lambda = \vec 0 \in \mathbb{R}^d$, i.e. $A_{ij} = \frac{\partial x_{j,\lambda_j}}{\partial \lambda_i}\big|_{\lambda = \vec 0}$ and $(B_i)_{jk} = \frac{\partial^2 x_{k,\lambda_k}}{\partial \lambda_i \partial \lambda_j}\big|_{\lambda = \vec 0}$, displayed in full in (Eq. 18) and (Eq. 19) of the appendix. Then the cross-entropy loss function $l(\cdot, \cdot)$ has the error bound

$$l(s_\lambda, \theta) - l(s, \theta) \;\leq\; \sqrt d\,\|A\|_F\,\|\nabla g_\theta\| + \sqrt d\,\sum_{i=1}^{d}\|B_i\|_F\,\|\nabla g_\theta\| + O(\|\lambda\|^2), \tag{7}$$

where $\|\cdot\|_F$ is the Frobenius norm.

Theorem 3.3 shows how the structural properties of the time series ($A$ and $B$) and the neural network architecture ($\nabla g_\theta$) can affect the learning process. The interpolation vector $\lambda$ determines the velocity $A$ and the accelerations $\{B_i\}_{i \in [1:d]}$ of the features for the augmented sample $s_\lambda$.

¹We can choose the range of the loss function to be any compact and connected subset of $\mathbb{R}$ under the usual topology.
Theorem 3.4. (Learning bound with structural properties of time series and neural network) Let $l$ be the cross-entropy loss function with $l(\cdot, \cdot) \in [0, 1]$. Then with probability at least $1 - \delta$ over the samples $\{s_i\}_{i \in [0:N]}$, we have

$$\mathbb{E}_{s \sim P}[l(s, \hat\theta_{aug})] - \mathbb{E}_{s \sim P}[l(s, \theta^*)] < 2R_N(l_{aug} \circ \Theta) + 5\sqrt{\frac{\log(1/\delta)}{2(N+1)}} + 2\sqrt d\,\Big(A + \sum_{i=1}^{d} B_i\Big)\|\nabla g_\theta\|. \tag{8}$$

Moreover, we have

$$R_N(l_{aug} \circ \Theta) \leq R_N(l \circ \Theta) + \sqrt d\,\Big(A + \sum_{i=1}^{d} B_i\Big)\|\nabla g_\theta\|, \tag{9}$$

where $A = \sup_{s \in S}\|A\|_F$ and $B_i = \sup_{s \in S}\|B_i\|_F$ (see Appendix A.4).

Theorem 3.4 reveals that there is a trade-off between the dimension of the features of a time series sample, the interpolation vector $\lambda$, and the neural network architecture. $A$ and $\{B_i\}_{i \in [1:d]}$ are the bounds for the velocity and the accelerations, respectively, which are computed using the interpolation vector. The last term of (Eq. 8) is constructed from the gradient of the pre-activation $g_\theta$ with respect to an input, the dimension of the features of a time series sample, and the bounds for the velocity and accelerations induced by the interpolation vector. Theorem 3.4 tells us how good our parameter estimation is compared to the optimal parameters. Our bound is governed by three terms: 1) the Rademacher complexity using the augmented data, 2) the sample size, and 3) a term that depends on the dimension $d$ of the features of the time series sample, the structural properties of the time series ($A$ and $B_i$), and the neural network architecture ($\nabla g_\theta$). Because our RIM method uses a recursive interpolation between two consecutive features, the orders of magnitude of $A$ and $B$ do not change drastically and hence do not blow up the bound. Another implication of $A$ and $B$ being small is that our recursive augmentation method decreases the Rademacher complexity, as demonstrated by (Eq. 9). Decreasing the Rademacher complexity ends up producing a tighter bound on the parameter space.

3.4 VARIANCE REDUCTION

Suppose that we observe a set $\{s_0, s_1, \dots, s_N\}$ of $N+1$ samples from the underlying sample space $S$. Using our RIM method, we can augment the observed sample $s_i$ with a distribution $D$, which results in the set of augmented samples $\{s_{i,\lambda} \mid \lambda \sim D\}$ for $s_i$. Based on mild assumptions (please refer to Appendix A.5) on the regularity of the loss function and on the underlying sample space, we have the following results.

Theorem 3.5. (Asymptotic normality) Assume $\Theta$ is open. Then $\hat\theta$ and $\hat\theta_{aug}$ admit the following Bahadur representation:

$$\sqrt{N+1}\,(\hat\theta - \theta^*) = -V_{\theta^*}^{-1}\frac{1}{\sqrt{N+1}}\sum_{i=0}^{N} \nabla l(s_i, \theta^*) + o_P(1),$$
$$\sqrt{N+1}\,(\hat\theta_{aug} - \theta^*) = -V_{\theta^*}^{-1}\frac{1}{\sqrt{N+1}}\sum_{i=0}^{N} \nabla l_{aug}(s_i, \theta^*) + o_P(1). \tag{10}$$

Therefore, both $\hat\theta$ and $\hat\theta_{aug}$ are asymptotically normal:

$$\sqrt{N+1}\,(\hat\theta - \theta^*) \rightsquigarrow N(0, \Sigma_0) \quad \text{and} \quad \sqrt{N+1}\,(\hat\theta_{aug} - \theta^*) \rightsquigarrow N(0, \Sigma_{aug}), \tag{11}$$

where the covariances are given by

$$\Sigma_0 = V_{\theta^*}^{-1}\, \mathbb{E}_{s \sim P}[\nabla l(s, \theta^*)\, \nabla l(s, \theta^*)^T]\, V_{\theta^*}^{-1}, \qquad \Sigma_{aug} = \Sigma_0 - \mathbb{E}_{s \sim P}[XX^T], \tag{12}$$

where $X = \nabla l(s, \theta^*) - \nabla l_{aug}(s, \theta^*)$. As a consequence, the asymptotic relative efficiency of $\hat\theta_{aug}$ compared to $\hat\theta$ is $RE = \operatorname{tr}(\Sigma_0)/\operatorname{tr}(\Sigma_{aug}) \geq 1$.

(Eq. 11) describes the asymptotic behaviour of the learning parameters. (Eq. 12) shows that our recursive time series augmentation reduces the variance of the learning parameters.
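Before turning to the experiments, note that the average augmented loss $l_{aug}(s, \theta) = \mathbb{E}_{\lambda \sim D}[l(s_\lambda, \theta)]$ rarely has a closed form; in practice it can be approximated by Monte Carlo averaging over sampled $\lambda$'s. A hedged sketch, reusing rim_augment from the Section 3.1 sketch (the estimator signature and the sample count are our own choices):

```python
import numpy as np

def laug_mc(loss_fn, s, theta, num_mc=32, rng=None):
    """Monte Carlo estimate of l_aug(s, theta) = E_{lambda~D}[l(s_lambda, theta)].

    loss_fn : callable (features, label, theta) -> scalar loss
    s       : pair (x, y) with x the feature sequence and y the label
    """
    rng = rng or np.random.default_rng()
    x, y = s
    draws = []
    for _ in range(num_mc):
        lam = rng.uniform(0.0, 1.0, size=len(x) - 1)  # one lambda per step, D = U(0,1)
        draws.append(loss_fn(rim_augment(x, lam), y, theta))
    return float(np.mean(draws))
```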
4 EXPERIMENTS

We show our results on time series classification tasks with different time series augmentation methods. We compare the improvements in downstream task performance due to different data augmentation methods by enlarging the training set when we train the classifiers. To benchmark our RIM approach, we compare our results with those achieved by the TimeGAN approach Yoon et al. (2019). We specifically select TimeGAN because it is known to preserve the temporal dynamics, thereby maintaining the correlation between variables across time. Furthermore, with the flexibility of the unsupervised GAN framework and the control offered by the supervised training in autoregressive models, comparing our results against those of TimeGAN provides a rigorous benchmark against a tested state-of-the-art approach.

We use a relatively small training set with a large testing set, so that it is more challenging for classifiers to generalize and data augmentation is favorable. Note that we apply the data augmentation methods to the two classes separately, as data augmentation should only preserve the properties of a particular class. We use the different data augmentation methods on the two classes of time series in our training set as follows. RIM can be applied directly to the time series within each class to generate new time series for that class and enlarge the original training set, such that the generated series are close to the original series in that class according to Theorem 3.1. For the TimeGAN baseline, we train two TimeGANs separately using the time series from each class. Once these two TimeGANs are trained, they are used to generate time series for each class to enlarge the original training set.

We consider four tasks: the first two use synthetic datasets generated by solving 1-dimensional ODEs, and the last two use real-world datasets. We compare testing accuracy using the original data, augmented data with RIM, and augmented data with TimeGAN.

4.2 RESULTS

For task 1, we consider solutions to ODEs containing exponential functions, and the two classes in our binary classification correspond to two ODEs with different parameters. For task 2, we consider solutions to ODEs with trigonometric functions. In this setting, the ODEs can be thought of as generators that produce the time series on which we perform classification, and ODEs with different parameters induce different dynamical behaviors in their solutions (time series). For each class, we generate multiple solutions using the corresponding ODE with different initial values. To make the learning tasks harder, we add random noise to the solutions generated by these ODEs.

Figure 2: Time series from the two classes on the left plot; test accuracy for the exponential synthetic ODE system using a convolutional neural network with kernel size 3, 32 filters, batch size 16, batch normalization, and the Adam optimizer. The test accuracy plot indicates the mean ± standard deviation over 10 runs (No-Aug, RIM, GAN).

Task 1: Synthetic data - Exponential ODE solutions. The two ODEs (time series generators) containing exponential functions that we use are:

class 1: $\frac{dy}{dt} = 0.5y^2 + e^{-y}$; class 2: $\frac{dy}{dt} = 0.3y^2 + 1.5e^{-y}$.

Task 2: Synthetic data - Trigonometric ODE solutions. The two ODEs (time series generators) we use are:

class 1: $\frac{dy}{dt} = 0.6 + 0.5\sin(y)$; class 2: $\frac{dy}{dt} = 1 + \cos(y)$.

Figure 3: Time series from the two classes and test accuracy for the trigonometric synthetic ODE system using a convolutional neural network with kernel size 3, 32 filters, batch size 16, batch normalization, and the Adam optimizer. The test accuracy plot indicates the mean ± standard deviation over 10 runs.
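The synthetic tasks can be reproduced with a simple Euler integrator. A sketch for the trigonometric ODEs of Task 2, under our own assumptions about the step size, horizon, and noise scale (the paper specifies the ODEs and that noise is added, not these constants):

```python
import numpy as np

def generate_ode_series(dydt, y0, n_steps=200, dt=0.05, noise_std=0.1, rng=None):
    """Euler-integrate dy/dt = f(y) from y0, then add Gaussian observation noise."""
    rng = rng or np.random.default_rng()
    y = np.empty(n_steps)
    y[0] = y0
    for t in range(1, n_steps):
        y[t] = y[t - 1] + dt * dydt(y[t - 1])
    return y + rng.normal(0.0, noise_std, size=n_steps)

# Task 2 generators (trigonometric ODEs), one per class:
class1 = lambda y: 0.6 + 0.5 * np.sin(y)
class2 = lambda y: 1.0 + np.cos(y)

rng = np.random.default_rng(0)
# Multiple solutions per class, each from a different initial value:
X1 = [generate_ode_series(class1, y0, rng=rng) for y0 in rng.uniform(0, 1, size=20)]
X2 = [generate_ode_series(class2, y0, rng=rng) for y0 in rng.uniform(0, 1, size=20)]
```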
Task 3: Real dataset - Indoor User Movement from Radio Signal Strength (RSS) data. This binary classification task from Bacciu et al. (2014) is associated with predicting the pattern of user movements in real-world office environments from time series generated by a Wireless Sensor Network (WSN). The input data contains the RSS measured between the nodes of a WSN comprising 5 sensors: 4 in the environment and 1 carried by the user. The data were collected during movements of the users and labelled to indicate whether the user's trajectory will lead to a change of room or not. In the experiments, we use a subset of the data to form a small training set to challenge our algorithm. We achieve better and more robust test accuracy than TimeGAN and the non-augmented case when using augmented data, as reflected in Figure 4.

Since $\lambda$ is the only parameter used in our RIM augmentation technique, our ablation study paid very precise attention to the choice of the $\lambda$ parameter. To this end, we tested different $\lambda$ distributions. Given that we were interested in convex combinations between $x_i$ and $x_{i-1,\lambda_{i-1}}$, we had to restrict $\lambda$ between 0 and 1. Two ways to perform this are: (1) distributing the weights uniformly while sampling $\lambda$; (2) concentrating on a specific part of the $\lambda$ distribution. To address (1), we use $U(0, 1)$, which is the main test-bed for all the experiments in the current main text. To address (2), we perform studies using the beta distribution, varying its shape parameters to focus on specific parts of the density. We tested Beta(2, 2), Beta(0.5, 0.5), and Beta(2, 5), for which the resulting plots can be found in Appendix C. For all these cases, we observed improvements from using RIM compared to non-augmented training, both in terms of a higher final testing accuracy and in terms of fewer training iterations, thereby solidifying the effectiveness of RIM.

Figure 4: Time series from the two classes and test accuracy for the Indoor User Movement classification using a convolutional neural network with kernel size 3, 32 filters, batch size 16, batch normalization, and the Adam optimizer. The test accuracy plot indicates the mean ± standard deviation over 10 runs.

Task 4: Real dataset - Ford Engine Condition. We use a subset of the Ford A dataset from the 2008 WCCI Ford classification challenge Abou-Nasr & Feldkamp (2007). This dataset contains time series corresponding to measurements of engine noise captured by a motor sensor. The goal is to detect the presence of a specific issue with the engine by classifying each time series into issue/no-issue classes. We sample 100 time series from Ford A to form a small training set to challenge our algorithm and 100 time series for testing. As shown in Figure 5, RIM outperforms TimeGAN and the non-augmented case on test accuracy.

Figure 5: Time series from the two classes and test accuracy for the Ford Engine classification with a convolutional neural network with kernel size 3, 32 filters, batch size 16, batch normalization, and the Adam optimizer. The test accuracy plot indicates the mean ± standard deviation over 10 runs.
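Putting the pieces together, a sketch of the per-class training-set enlargement used in the classification tasks above, reusing rim_augment from the Section 3.1 sketch (the number of augmented copies per sample is our own placeholder):

```python
import numpy as np

def enlarge_training_set(X, y, per_sample=5, rng=None):
    """Enlarge a labelled training set with RIM, applied within each class.

    X : array of shape (n_samples, d+1) of time series features
    y : array of shape (n_samples,) of class labels
    Each augmented series keeps the label of its source series, so class-specific
    properties are preserved and the new series stay close to the originals
    (cf. Theorem 3.1).
    """
    rng = rng or np.random.default_rng()
    X_aug, y_aug = [], []
    for x_i, y_i in zip(X, y):
        for _ in range(per_sample):
            lam = rng.uniform(0.0, 1.0, size=len(x_i) - 1)
            X_aug.append(rim_augment(x_i, lam))
            y_aug.append(y_i)
    return np.vstack([X, np.array(X_aug)]), np.concatenate([y, np.array(y_aug)])
```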
From Figure 2 to Figure 5, we can see that RIM achieves superior performance over the non-augmented case. Furthermore, RIM is also able to achieve performance better than or comparable to TimeGAN on these tasks without going through the extensive training process associated with GANs. For our experiments, we train the TimeGAN for 2500 epochs (3 hours on a Xeon CPU) for the synthetic datasets and 5000 epochs (6 hours on a Xeon CPU) for the real datasets. Visual comparisons of the time series generated by RIM and TimeGAN with the original time series are shown in Appendix B. As expected from Theorem 3.5, RIM has smaller variance and better convergence compared to the non-augmented case across all experiments.

4.3 EXTENSION TO OTHER LEARNING TASKS

Section 4.2 demonstrates results for time series classification. In this section, we show that RIM can also be used in other learning tasks, including continuous time series forecasting and RL. In continuous time series forecasting, we generally have one historical realization, and we leverage it to form our training set by decomposing the time series into smaller components: pairs (x, y), where x is the data of the previous n time steps and y is the target of the next step. RIM can then be used to generate more time series from the unique realization, so that we can enlarge our training set by adding more (x, y) pairs from the original realization. We also used RIM to augment state trajectories in RL tasks (please refer to the pseudocode in Appendix E.2). Preliminary experiments for continuous time series forecasting can be found in Appendix D, and RL tasks can be found in Appendix E.

5 CONCLUSION

We developed a Recursive Interpolation Method (RIM) for time series as a data augmentation technique to learn models accurately with limited data. RIM is simple yet effective, and is supported by a theoretical analysis guaranteeing faster convergence. Theoretically, we proved that RIM guarantees better parameter convergence with reduced variance. Empirically, our methodology outperforms the current state-of-the-art approaches on different real-world problem domains and synthetic datasets by obtaining higher accuracy with reduced variance. Because our approach operates on the input time series data, it is invariant to the choice of the ML algorithm. The methodology described in this paper can be used to enhance ML solutions to a wide variety of time series learning problems.

ACKNOWLEDGEMENT

Amine is grateful to Hafida Ines Bouzaouia and Douglas Tweed for their help and constructive comments regarding this work. This research is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) [grant number 411296449], Canada's federal funding agency for university-based research.

REFERENCES

Mahmoud Abou-Nasr and Lee Feldkamp. Ford classification challenge. Zip archive, http://www.timeseriesclassification.com/description.php, 2007.

Davide Bacciu, Paolo Barsocchi, Stefano Chessa, Claudio Gallicchio, and Alessio Micheli. An experimental characterization of reservoir computing in ambient assisted living applications. Neural Computing and Applications, 24(6):1451-1464, 2014.

Shuxiao Chen, Edgar Dobriban, and Jane H. Lee. A group-theoretic framework for data augmentation. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020.
Robert B Cleveland, William S Cleveland, Jean E McRae, and Irma Terpenning. STL: A seasonal-trend decomposition procedure based on loess. Journal of Official Statistics, 6(1):3-73, 1990.

Saverio De Vito, Ettore Massera, Marco Piga, Luca Martinotto, and Girolamo Di Francia. On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sensors and Actuators B: Chemical, 129(2):750-757, 2008.

Terrance De Vries and Graham W Taylor. Dataset augmentation in feature space. ICLR Workshop, Toulon, pp. 1-12, 2017.

Cristóbal Esteban, Stephanie L Hyland, and Gunnar Rätsch. Real-valued (medical) time series generation with recurrent conditional GANs. arXiv preprint arXiv:1706.02633, 2017.

Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. Data augmentation using synthetic data for time series classification with deep residual networks. ECML/PKDD Workshop on AALTD, 2018.

UCI Machine Learning Repository, Center for Machine Learning and Intelligent Systems. Air quality data set. https://archive.ics.uci.edu/ml/datasets/air+quality.

Jingkun Gao, Xiaomin Song, Qingsong Wen, Pichao Wang, Liang Sun, and Huan Xu. RobustTAD: Robust time series anomaly detection via decomposition and convolutional neural networks. In MileTS'20: 6th KDD Workshop on Mining and Learning from Time Series, pp. 1-6, 2020.

Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pp. 507-523. Springer, 2011.

Yanfei Kang, Rob J Hyndman, and Feng Li. GRATIS: Generating time series with diverse and controllable characteristics. Statistical Analysis and Data Mining: The ASA Data Science Journal, 13(4):354-376, 2020.

Ilya Kostrikov, Denis Yarats, and Rob Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International Conference on Learning Representations, 2021.

Michael Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data. In Advances in Neural Information Processing Systems, 2020.

Raquel Prado. Latent structure in non-stationary time series. PhD thesis, Duke University, 1998.

Aad W Van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.

Qingsong Wen, Liang Sun, Fan Yang, Xiaomin Song, Jingkun Gao, Xue Wang, and Huan Xu. Time series data augmentation for deep learning: A survey. arXiv preprint arXiv:2002.12478, 2020a.

Qingsong Wen, Zhe Zhang, Yan Li, and Liang Sun. Fast RobustSTL: Efficient and robust seasonal-trend decomposition for time series with complex patterns. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2203-2213, 2020b.

Tailai Wen and Roy Keyes. Time series anomaly detection using convolutional neural networks and transfer learning. IJCAI Workshop on AI4IoT, 2019.

Jinsung Yoon, Daniel Jarrett, and Mihaela van der Schaar. Time-series generative adversarial networks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 5508-5518, 2019.

Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.

The code for all the experiments can be found at the following link.
A.1 PROOF OF THEOREM 3.1

Let

$$\operatorname{sign}(t) = \begin{cases} 0 & \text{if } t = 0 \\ 1 & \text{if } t > 0 \\ -1 & \text{if } t < 0 \end{cases}, \qquad \delta_{ab} = \begin{cases} 1 & \text{if } a = b \\ 0 & \text{if } a \neq b, \end{cases}$$

and let $D$ be a distribution with support $[0, 1)$.

Proof of Theorem 3.1. (1) Note that $\lambda_0$ is a dummy value for mathematical convenience and $\prod_{i=k+1}^{n} \lambda_i = 1$ if $n < k+1$. We prove this by induction on $n$. The case $n = 1$ gives $x_{1,\lambda_1} = g(\lambda_1)x_1 + \lambda_1 g(\lambda_0)x_0 = (1-\lambda_1)x_1 + \lambda_1 x_0$, since $g(\lambda_0) = 1$ and $g(\lambda_1) = 1 - \lambda_1$. We now assume that the identity holds for $n-1$ and prove it for $n$. By the construction of $x_{n,\lambda_n}$,

$$x_{n,\lambda_n} = (1-\lambda_n)x_n + \lambda_n x_{n-1,\lambda_{n-1}} = (1-\lambda_n)x_n + \lambda_n \sum_{k=0}^{n-1}\Big(\prod_{i=k+1}^{n-1}\lambda_i\Big)g(\lambda_k)x_k = \sum_{k=0}^{n}\Big(\prod_{i=k+1}^{n}\lambda_i\Big)g(\lambda_k)x_k.$$

Thus (1) holds by mathematical induction.

(2) Consider the quantity $x_{n,\lambda_n} - x_n$. By (1), the first bound follows from

$$\mathbb{E}[(x_{n,\lambda_n} - x_n)] = \mathbb{E}\Big[\sum_{k=0}^{n}\Big(\prod_{i=k+1}^{n}\lambda_i\Big)g(\lambda_k)x_k - x_n\Big] = \mathbb{E}\Big[\sum_{k=0}^{n-1}\Big(\prod_{i=k+1}^{n}\lambda_i\Big)g(\lambda_k)x_k - \lambda_n x_n\Big],$$

so that

$$\big\|\mathbb{E}[(x_{n,\lambda_n} - x_n)]\big\| \leq \sum_{k=1}^{n-1} e^{n-k}(1-e)\|x_k\| + e^{n-1}\|x_0\| + e\|x_n\| \leq \Big(\sum_{k=1}^{n-1} e^{n-k}(1-e) + e^{n-1} + e\Big)m = (e^{n-1} - e^n + 2e)\,m \leq 3em,$$

where $e = \mathbb{E}[D]$ and $m = \max_{i \in [0:n]}\|x_i\|$ (the sum telescopes to $e - e^n$).

Now we prove the second bound. Since

$$\|x_{n,\lambda_n} - x_n\| = \big\|x_n - \big((1-\lambda_n)x_n + \lambda_n x_{n-1,\lambda_{n-1}}\big)\big\| = \lambda_n\big\|x_n - x_{n-1} + x_{n-1} - x_{n-1,\lambda_{n-1}}\big\| \leq \lambda_n\big(\|x_n - x_{n-1}\| + \|x_{n-1} - x_{n-1,\lambda_{n-1}}\|\big), \tag{13}$$

by recursively applying (Eq. 13) we obtain

$$\|x_{n,\lambda_n} - x_n\| \leq \sum_{k=1}^{n}\Big(\prod_{i=k}^{n}\lambda_i\Big)\|x_k - x_{k-1}\|. \tag{14}$$

By Jensen's inequality and (Eq. 14), we obtain the second bound

$$\big\|\mathbb{E}[(x_{n,\lambda_n} - x_n)]\big\| \leq \mathbb{E}\big[\|x_{n,\lambda_n} - x_n\|\big] \leq \mathbb{E}\Big[\sum_{k=1}^{n}\Big(\prod_{i=k}^{n}\lambda_i\Big)\|x_k - x_{k-1}\|\Big] \leq \sum_{k=1}^{n} e^{n-k+1}\,\bar m \leq \min\Big\{\frac{e}{1-e}\,\bar m,\; ne\,\bar m\Big\}, \tag{15}$$

where $e = \mathbb{E}[D]$ and $\bar m = \max_{i \in [1:n]}\|x_i - x_{i-1}\|$.

Now we show that $\mathbb{E}_{\lambda_1,\dots,\lambda_n}[\|x_{n,\lambda_n} - x_n\|] \leq 2\lambda_n m$:

$$\mathbb{E}_{\lambda_1,\dots,\lambda_n}\big[\|x_{n,\lambda_n} - x_n\|\big] = \mathbb{E}\Big[\Big\|\sum_{k=0}^{n-1}\Big(\prod_{i=k+1}^{n}\lambda_i\Big)g(\lambda_k)x_k - \lambda_n x_n\Big\|\Big] \leq \mathbb{E}\Big[\sum_{k=0}^{n-1}\Big(\prod_{i=k+1}^{n}\lambda_i\Big)g(\lambda_k)\|x_k\| + \lambda_n\|x_n\|\Big] \leq \Big[\sum_{k=0}^{n-1}\Big(\prod_{i=k+1}^{n}\lambda_i\Big)g(\lambda_k) + \lambda_n\Big]m$$
$$= \big[\lambda_1\lambda_2\cdots\lambda_n + \lambda_2\lambda_3\cdots\lambda_n(1-\lambda_1) + \lambda_3\lambda_4\cdots\lambda_n(1-\lambda_2) + \cdots + \lambda_n(1-\lambda_{n-1}) + \lambda_n\big]m = 2\lambda_n m, \tag{16}$$

where $m = \max_{i \in [0:N]}\|x_i\|$; the bracketed sum telescopes to $\lambda_n + \lambda_n = 2\lambda_n$. The first inequality holds by the Minkowski inequality.

A.2 PROOF OF THEOREM 3.3

Neural networks with ReLU activations. Using ReLU activation functions, neural networks are piecewise linear functions of the input. Such a neural network $g_\theta$ can be written as $\nabla g_\theta^T x + b$, where $x$ is an input and $\nabla g_\theta^T$ is the gradient of $g_\theta(x)$ along $x$.

Lemma A.1. (Structure of the partial derivatives) Let $s = (x_0, \dots, x_d)$ be one sample drawn from a time series and $\lambda \in [0, 1]^d$. Then

$$\frac{\partial x_{i,\lambda_i}}{\partial \lambda_j} = \begin{cases} 0 & \text{if } i < j \\ x_{i-1,\lambda_{i-1}} - x_i & \text{if } i = j \\ \big(\prod_{k=j+1}^{i}\lambda_k\big)(x_{j-1,\lambda_{j-1}} - x_j) & \text{if } i > j. \end{cases} \tag{17}$$

Proof. If $i < j$, then $x_{i,\lambda_i}$ does not depend on $\lambda_j$; hence $\frac{\partial x_{i,\lambda_i}}{\partial \lambda_j} = 0$. If $i = j$, then $x_{i,\lambda_i} = (1-\lambda_i)x_i + \lambda_i x_{i-1,\lambda_{i-1}} = x_i + \lambda_i(x_{i-1,\lambda_{i-1}} - x_i)$; hence $\frac{\partial x_{i,\lambda_i}}{\partial \lambda_j} = x_{i-1,\lambda_{i-1}} - x_i$. If $i > j$, we prove the claim by induction on $i$. The case $i = j+1$ follows from $x_{j+1,\lambda_{j+1}} = (1-\lambda_{j+1})x_{j+1} + \lambda_{j+1}x_{j,\lambda_j} = x_{j+1} + \lambda_{j+1}(x_{j,\lambda_j} - x_{j+1})$: since $x_{j+1}$ does not depend on $\lambda_j$ and $\frac{\partial x_{j,\lambda_j}}{\partial \lambda_j} = x_{j-1,\lambda_{j-1}} - x_j$, we have $\frac{\partial x_{j+1,\lambda_{j+1}}}{\partial \lambda_j} = \lambda_{j+1}(x_{j-1,\lambda_{j-1}} - x_j)$. We now assume that the identity holds for $i$ and prove it for $i+1$. Since $x_{i+1,\lambda_{i+1}} = (1-\lambda_{i+1})x_{i+1} + \lambda_{i+1}x_{i,\lambda_i} = x_{i+1} + \lambda_{i+1}(x_{i,\lambda_i} - x_{i+1})$ and $\frac{\partial x_{i,\lambda_i}}{\partial \lambda_j} = \big(\prod_{k=j+1}^{i}\lambda_k\big)(x_{j-1,\lambda_{j-1}} - x_j)$, we have

$$\frac{\partial x_{i+1,\lambda_{i+1}}}{\partial \lambda_j} = \lambda_{i+1}\Big(\prod_{k=j+1}^{i}\lambda_k\Big)(x_{j-1,\lambda_{j-1}} - x_j) = \Big(\prod_{k=j+1}^{i+1}\lambda_k\Big)(x_{j-1,\lambda_{j-1}} - x_j).$$

As a consequence of mathematical induction, the conclusion holds.

We use neural networks with ReLU activations and a sigmoid function as the activation of the last layer, so the network is $f_\theta(x) = \sigma(g_\theta(x))$, where $g_\theta(x) = \nabla g_\theta^T x + b$ is the pre-activation signal of the last layer. Recall that we write $\lambda$ for $\boldsymbol{\lambda}$, and $\lambda \sim D$ means that each component $\lambda_i$ of $\lambda$ is sampled independently from $D$.

Proof of Theorem 3.3. Let $s_\lambda = (x_0, x_{1,\lambda_1}, \dots, x_{d,\lambda_d}, y)$ and let $s$ be one sample. We denote the features by $x_\lambda = (x_0, x_{1,\lambda_1}, \dots, x_{d,\lambda_d})$ and the label by $y$.
Define the matrices

$$A = \begin{pmatrix} \frac{\partial x_{1,\lambda_1}}{\partial \lambda_1} & \frac{\partial x_{2,\lambda_2}}{\partial \lambda_1} & \cdots & \frac{\partial x_{d,\lambda_d}}{\partial \lambda_1} \\ 0 & \frac{\partial x_{2,\lambda_2}}{\partial \lambda_2} & \cdots & \frac{\partial x_{d,\lambda_d}}{\partial \lambda_2} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{\partial x_{d,\lambda_d}}{\partial \lambda_d} \end{pmatrix}\Bigg|_{\lambda = \vec 0} \tag{18}$$

$$B_i = \begin{pmatrix} \frac{\partial^2 x_{1,\lambda_1}}{\partial \lambda_i \partial \lambda_1} & \frac{\partial^2 x_{2,\lambda_2}}{\partial \lambda_i \partial \lambda_1} & \cdots & \frac{\partial^2 x_{d,\lambda_d}}{\partial \lambda_i \partial \lambda_1} \\ 0 & \frac{\partial^2 x_{2,\lambda_2}}{\partial \lambda_i \partial \lambda_2} & \cdots & \frac{\partial^2 x_{d,\lambda_d}}{\partial \lambda_i \partial \lambda_2} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{\partial^2 x_{d,\lambda_d}}{\partial \lambda_i \partial \lambda_d} \end{pmatrix}\Bigg|_{\lambda = \vec 0} \tag{19}$$

where $\vec 0 \in \mathbb{R}^d$ is the zero column vector and $e_i$ is the vector whose $i$-th component is 1 and whose other components are 0. Let $l(s_\lambda, \theta) = -y\log(f_\theta(x_\lambda)) - (1-y)\log(1 - f_\theta(x_\lambda))$ and write $l_\theta(\lambda) = l(s_\lambda, \theta)$. Using a Taylor expansion of the loss $l$ around $\lambda = \vec 0$, we have

$$l_\theta(\lambda) = l_\theta(\vec 0) + \sum_{i=1}^{d}\frac{\partial l_\theta(\lambda)}{\partial \lambda_i}\Big|_{\lambda=\vec 0}\lambda_i + \frac{1}{2}\sum_{i,j=1}^{d}\frac{\partial^2 l_\theta(\lambda)}{\partial \lambda_i \partial \lambda_j}\Big|_{\lambda=\vec 0}\lambda_i\lambda_j + O(\|\lambda\|^2). \tag{20}$$

Note that

$$\frac{\partial l_\theta(\lambda)}{\partial x_{i,\lambda_i}} = -y\,\frac{\partial \log(f_\theta(x_\lambda))}{\partial x_{i,\lambda_i}} - (1-y)\,\frac{\partial \log(1-f_\theta(x_\lambda))}{\partial x_{i,\lambda_i}} = -\frac{y(1-f_\theta(x_\lambda)) - (1-y)f_\theta(x_\lambda)}{(1-f_\theta(x_\lambda))\,f_\theta(x_\lambda)}\,\frac{\partial f_\theta(x_\lambda)}{\partial x_{i,\lambda_i}} = -\frac{y - f_\theta(x_\lambda)}{(1-f_\theta(x_\lambda))\,f_\theta(x_\lambda)}\,\frac{\partial f_\theta(x_\lambda)}{\partial x_{i,\lambda_i}},$$

and, since $\frac{\partial f_\theta(x_\lambda)}{\partial x_{i,\lambda_i}} = \frac{\partial \sigma(g_\theta(x_\lambda))}{\partial x_{i,\lambda_i}} = \frac{\partial(\nabla g_\theta^T x_\lambda)}{\partial x_{i,\lambda_i}}\,\sigma(g_\theta(x_\lambda))\big(1-\sigma(g_\theta(x_\lambda))\big) = \nabla g_\theta^T e_i\, f_\theta(x_\lambda)\big(1-f_\theta(x_\lambda)\big)$, we obtain

$$\frac{\partial l_\theta(\lambda)}{\partial x_{i,\lambda_i}} = -\nabla g_\theta^T e_i\,\big(y - f_\theta(x_\lambda)\big).$$

Since the $i$-th feature $x_{i,\lambda_i}$ of $x_\lambda$ depends only on $\{\lambda_1, \dots, \lambda_i\}$, combining this with Lemma A.1 gives

$$\sum_{j=1}^{d}\frac{\partial l_\theta(\lambda)}{\partial \lambda_j}\Big|_{\lambda=\vec 0}\lambda_j = -\big(y - f_\theta(x)\big)\,(0, \lambda)^T A\,\nabla g_\theta. \tag{24}$$

Hence, noting that $|y - f_\theta(x)| \leq 1$ and $\|\lambda\| \leq \sqrt d$, we have

$$\Big|\sum_{j=1}^{d}\frac{\partial l_\theta(\lambda)}{\partial \lambda_j}\Big|_{\lambda=\vec 0}\lambda_j\Big| \leq \sqrt d\,\|A\|_F\,\|\nabla g_\theta\|. \tag{25}$$

Now consider the second partial derivatives of the loss function $l_\theta(\lambda)$:

$$\frac{\partial^2 l_\theta(\lambda)}{\partial \lambda_u \partial \lambda_j} = -\sum_{i \geq j}\nabla g_\theta^T e_i\,\frac{\partial^2 x_{i,\lambda_i}}{\partial \lambda_u \partial \lambda_j}\big(y - f_\theta(x_\lambda)\big) + \sum_{i \geq j}\sum_{k \geq u}\nabla g_\theta^T e_i\,\frac{\partial x_{i,\lambda_i}}{\partial \lambda_j}\,\nabla g_\theta^T e_k\,\frac{\partial x_{k,\lambda_k}}{\partial \lambda_u}\, f_\theta(x_\lambda)\big(1-f_\theta(x_\lambda)\big). \tag{26}$$

We substitute (Eq. 26) into (Eq. 20) to compute the Taylor expansion of the loss explicitly. For the first term, we have

$$\sum_{i,j}\lambda_i\lambda_j\Big(-\sum_{l \geq j}\nabla g_\theta^T e_l\,\frac{\partial^2 x_{l,\lambda_l}}{\partial \lambda_i \partial \lambda_j}\big(y - f_\theta(x_\lambda)\big)\Big)\Big|_{\lambda=\vec 0} = -\sum_{i=1}^{d}\lambda_i\,\big(y - f_\theta(x)\big)\,(0, \lambda)^T B_i\,\nabla g_\theta, \tag{27}$$

and for the second term, we have

$$\sum_{i,j}\lambda_i\lambda_j\Big(\sum_{l \geq j}\sum_{k \geq i}\nabla g_\theta^T e_l\,\frac{\partial x_{l,\lambda_l}}{\partial \lambda_j}\,\nabla g_\theta^T e_k\,\frac{\partial x_{k,\lambda_k}}{\partial \lambda_i}\, f_\theta(x_\lambda)\big(1-f_\theta(x_\lambda)\big)\Big)\Big|_{\lambda=\vec 0} = f_\theta(x)\big(1-f_\theta(x)\big)\,(0, \lambda)^T A\,\nabla g_\theta\,\nabla g_\theta^T A^T(0, \lambda). \tag{28}$$

By combining (Eq. 27 and 28), we have

$$\sum_{i,j=1}^{d}\lambda_i\lambda_j\,\frac{\partial^2 l_\theta(\lambda)}{\partial \lambda_i \partial \lambda_j}\Big|_{\lambda=\vec 0} = -\sum_{i=1}^{d}\lambda_i\big(y - f_\theta(x)\big)(0,\lambda)^T B_i\,\nabla g_\theta - f_\theta(x)\big(1-f_\theta(x)\big)(0,\lambda)^T A\,\nabla g_\theta\,\nabla g_\theta^T A^T(0,\lambda) \leq \sqrt d\,\sum_{i=1}^{d}\|B_i\|_F\,\|\nabla g_\theta\|. \tag{29}$$

The last inequality holds because

$$f_\theta(x)\big(1-f_\theta(x)\big)\,(0,\lambda)^T A\,\nabla g_\theta(x)\,\nabla g_\theta(x)^T A^T(0,\lambda) = f_\theta(x)\big(1-f_\theta(x)\big)\,\big((0,\lambda)^T A\,\nabla g_\theta(x)\big)^2 \geq 0. \tag{30}$$

By (Eq. 20, 25 and 29), the conclusion holds.

A.3 PROOF OF THEOREM 3.2

We start with a basic inequality frequently used in the proofs.

Lemma A.2. (Supremum inequality) Let $f$ and $g$ be functions with the same domain and range. Then

$$\sup_\theta |f(\theta)| - \sup_\theta |g(\theta)| \leq \sup_\theta |f(\theta) - g(\theta)|. \tag{31}$$

Proof.

$$\sup_\theta |f(\theta)| = \sup_\theta |f(\theta) - g(\theta) + g(\theta)| \leq \sup_\theta\big(|f(\theta) - g(\theta)| + |g(\theta)|\big) \leq \sup_\theta |f(\theta) - g(\theta)| + \sup_\theta |g(\theta)|.$$

Thus the conclusion holds.

Notation. We assume that the distribution $P$ on the sample space $S$ is parametrized by $\theta$. Let $\{s_i\}_{i \in [0:N]}$ be the collection of samples from the distribution $P$. We set:

$$\theta^* = \arg\min_\theta \mathbb{E}_{s \sim P}[l(s, \theta)], \qquad \hat\theta = \arg\min_\theta \frac{1}{N+1}\sum_{i=0}^{N} l(s_i, \theta),$$
$$\theta^*_{aug} = \arg\min_\theta \mathbb{E}_{s \sim P}\big[\mathbb{E}_{\lambda \sim D}[l(s_\lambda, \theta)]\big], \qquad \hat\theta_{aug} = \arg\min_\theta \frac{1}{N+1}\sum_{i=0}^{N}\mathbb{E}_{\lambda \sim D}[l(s_{i,\lambda}, \theta)], \tag{32}$$
$$R_N(l \circ \Theta) = \mathbb{E}_{\epsilon_i}\,\mathbb{E}\Big[\sup_{\theta \in \Theta}\Big|\frac{1}{N+1}\sum_{i=0}^{N}\epsilon_i\, l(s_i, \theta)\Big|\Big], \qquad l_{aug}(s, \theta) = \mathbb{E}_{\lambda \sim D}[l(s_\lambda, \theta)].$$

Assumption 1. Assume that the loss function $l$ satisfies a Lipschitz condition with respect to the norm $\|\cdot\|$.

Proof of Theorem 3.2. We decompose

$$\mathbb{E}_{s \sim P}[l(s, \hat\theta_{aug})] - \mathbb{E}_{s \sim P}[l(s, \theta^*)] = u_1 + u_2 + u_3 + u_4 + u_5, \tag{33}$$

where

$$u_1 = \mathbb{E}_{s \sim P}[l(s, \hat\theta_{aug})] - \mathbb{E}_{s \sim P}\big[\mathbb{E}_{\lambda \sim D}[l(s_\lambda, \hat\theta_{aug})]\big],$$
$$u_2 = \mathbb{E}_{s \sim P}\big[\mathbb{E}_{\lambda \sim D}[l(s_\lambda, \hat\theta_{aug})]\big] - \frac{1}{N+1}\sum_{i=0}^{N}\mathbb{E}_{\lambda \sim D}[l(s_{i,\lambda}, \hat\theta_{aug})],$$
$$u_3 = \frac{1}{N+1}\sum_{i=0}^{N}\mathbb{E}_{\lambda \sim D}[l(s_{i,\lambda}, \hat\theta_{aug})] - \frac{1}{N+1}\sum_{i=0}^{N}\mathbb{E}_{\lambda \sim D}[l(s_{i,\lambda}, \theta^*)],$$
$$u_4 = \frac{1}{N+1}\sum_{i=0}^{N}\mathbb{E}_{\lambda \sim D}[l(s_{i,\lambda}, \theta^*)] - \mathbb{E}_{s \sim P}\big[\mathbb{E}_{\lambda \sim D}[l(s_\lambda, \theta^*)]\big],$$
$$u_5 = \mathbb{E}_{s \sim P}\big[\mathbb{E}_{\lambda \sim D}[l(s_\lambda, \theta^*)]\big] - \mathbb{E}_{s \sim P}[l(s, \theta^*)].$$

First,

$$u_1 + u_5 \leq 2\sup_{\theta \in \Theta}\big|\mathbb{E}_{s \sim P}[l(s, \theta)] - \mathbb{E}_{s \sim P}[\mathbb{E}_{\lambda \sim D}[l(s_\lambda, \theta)]]\big|, \tag{34}$$

where

$$\big|\mathbb{E}_{s \sim P}[l(s, \theta)] - \mathbb{E}_{s \sim P}[\mathbb{E}_{\lambda \sim D}[l(s_\lambda, \theta)]]\big| = \big|\mathbb{E}_{s \sim P}[\mathbb{E}_{\lambda \sim D}[l(s, \theta) - l(s_\lambda, \theta)]]\big| \leq L_{Lip}\,\mathbb{E}_{s \sim P}\mathbb{E}_{\lambda \sim D}[\|s_\lambda - s\|]. \tag{35}$$

Hence, from (Eq. 34 and 35),

$$u_1 + u_5 \leq 2L_{Lip}\,\mathbb{E}_{s \sim P}\mathbb{E}_{\lambda \sim D}[\|s_\lambda - s\|]. \tag{36}$$
By McDiarmid's inequality, in terms of the probability,

$$\mathbb{P}\Big(\Big|\frac{1}{N+1}\sum_{i=0}^{N}\mathbb{E}_{\lambda \sim D}[l(s_{i,\lambda}, \theta^*)] - \mathbb{E}_{s \sim P}[\mathbb{E}_{\lambda \sim D}[l(s_\lambda, \theta^*)]]\Big| \geq t\Big) \leq \exp\Big(\frac{-2t^2}{\sum_{i=0}^{N}\big(\frac{1}{N+1}\big)^2}\Big), \tag{37}$$

so $u_4$ has the following bound with probability at least $1 - \delta$:

$$u_4 \leq \sqrt{\frac{\log(1/\delta)}{2(N+1)}}. \tag{38}$$

Moreover, a standard Rademacher-complexity bound holds for $u_2$, so we have

$$u_2 \leq 2R_N(l_{aug} \circ \Theta) + 4\sqrt{\frac{\log(1/\delta)}{2(N+1)}} \tag{39}$$

with probability at least $1 - \delta$. By (Eq. 38 and 39), we get

$$u_2 + u_4 \leq 2R_N(l_{aug} \circ \Theta) + 5\sqrt{\frac{\log(1/\delta)}{2(N+1)}} \tag{40}$$

with probability at least $1 - \delta$, where

$$R_N(l_{aug} \circ \Theta) = \mathbb{E}_{\epsilon_i}\,\mathbb{E}\Big[\sup_{\theta \in \Theta}\Big|\frac{1}{N+1}\sum_{i=0}^{N}\epsilon_i\,\mathbb{E}_{\lambda \sim D}[l(s_{i,\lambda}, \theta)]\Big|\Big]$$

and $\epsilon_i$ is a Rademacher variable for each $i \in [N]$. Since $\hat\theta_{aug}$ is an optimal parameter for $\frac{1}{N+1}\sum_{i=0}^{N}\mathbb{E}_{\lambda \sim D}[l(s_{i,\lambda}, \theta)]$, we have

$$u_3 \leq 0. \tag{41}$$

From (Eq. 36, 40 and 41), we conclude that

$$\mathbb{E}_{s \sim P}[l(s, \hat\theta_{aug})] - \mathbb{E}_{s \sim P}[l(s, \theta^*)] < 2R_N(l_{aug} \circ \Theta) + 5\sqrt{\frac{\log(1/\delta)}{2(N+1)}} + 2L_{Lip}\,\mathbb{E}_{s \sim P}\mathbb{E}_{\lambda \sim D}[\|s_\lambda - s\|]. \tag{42}$$

Now we prove that

$$R_N(l_{aug} \circ \Theta) \leq R_N(l \circ \Theta) + \max_{i \in \{0,\dots,N\}} L_{Lip}\,\mathbb{E}_{\lambda \sim D}[\|s_{i,\lambda} - s_i\|]. \tag{43}$$

Indeed,

$$R_N(l_{aug} \circ \Theta) - R_N(l \circ \Theta) = \mathbb{E}_{\epsilon_i}\mathbb{E}\Big[\sup_{\theta \in \Theta}\Big|\frac{1}{N+1}\sum_{i=0}^{N}\epsilon_i\, l_{aug}(s_i, \theta)\Big| - \sup_{\theta \in \Theta}\Big|\frac{1}{N+1}\sum_{i=0}^{N}\epsilon_i\, l(s_i, \theta)\Big|\Big]$$
$$\leq \mathbb{E}_{\epsilon_i}\mathbb{E}\Big[\sup_{\theta \in \Theta}\Big|\frac{1}{N+1}\sum_{i=0}^{N}\epsilon_i\big(l_{aug}(s_i, \theta) - l(s_i, \theta)\big)\Big|\Big] \leq \max_{i \in \{0,\dots,N\}}\sup_{\theta \in \Theta}\big|l_{aug}(s_i, \theta) - l(s_i, \theta)\big|$$
$$= \max_{i \in \{0,\dots,N\}}\sup_{\theta \in \Theta}\big|\mathbb{E}_{\lambda \sim D}[l(s_{i,\lambda}, \theta) - l(s_i, \theta)]\big| \leq \max_{i \in \{0,\dots,N\}} L_{Lip}\,\mathbb{E}_{\lambda \sim D}[\|s_{i,\lambda} - s_i\|], \tag{44}$$

where the first inequality uses Lemma A.2. Finally, we show $\mathbb{E}_{\lambda \sim D}[\|s_{i,\lambda} - s_i\|] \leq 2dem$. Each sample $s$ (or $s_i$) belongs to $\mathbb{R}^{d+1} \times \{0, 1, \dots, k\}$, so $s = (x_0, x_1, \dots, x_d, y)$. By the augmentation method, recursively applying convex combinations, we have $s_\lambda = (x_0, x_{1,\lambda_1}, x_{2,\lambda_2}, \dots, x_{d,\lambda_d}, y)$. By Theorem 3.1 (Eq. 4), $\mathbb{E}_{\lambda_1,\dots,\lambda_j}[\|x_{j,\lambda_j} - x_j\|] \leq 2\lambda_j m$ for each $j$. Hence

$$\mathbb{E}_{\lambda \sim D}[\|s_{i,\lambda} - s_i\|] \leq \sum_{j=1}^{d}\mathbb{E}_{\lambda \sim D}[\|x_{j,\lambda_j} - x_j\|] \leq 2\,\mathbb{E}_{\lambda \sim D}[\lambda_1 + \lambda_2 + \cdots + \lambda_d]\,m = 2mde,$$

where $e = \mathbb{E}[D]$, $d+1$ is the dimension of the time series sample, and $m = \max_{i \in [0:N]}\|x_i\|$.

A.4 PROOF OF THEOREM 3.4

Let $A = \sup_{s \in S}\|A\|_F$ and $B_i = \sup_{s \in S}\|B_i\|_F$. Then for each $s \sim P$ we have $\|A\|_F \leq A$ and $\|B_i\|_F \leq B_i$, where the matrices $A$ and $B_i$ depend on $s$ for $i \in [d]$; please refer to (Eq. 18) and (Eq. 19).

Proof of Theorem 3.4. The proof is almost the same as the proof of Theorem 3.2; we only describe the parts that need to be replaced. Similarly, we start with the decomposition
Consider the probability space is (S, F, P) where F is a sigma algebra and P the corresponding probability measure. Let µ be a measurable function from S to Rp for some p N. For each sample s = (x, y) S, define µ(s) = E[µ | Si] where s Si. Let s assume that we observe N + 1 samples {s0, s1, . . . , s N} from the underlying sample space. Then by Assumption 2, the underlying sample space is decomposed into i=1Si. i,e S = i=1Si. This is well-defined by Assumption 2, and µ is a measurable function. Let µi be the expectation value of µ over Si. Lemma A.3. (Effects of the average function) With notation as above, the following holds. 1. The law of total expectation: Es P[µ] = Es P[µ]. 2. The law of total covariance: Covs Pµ = Es P[Cov(µ|µ)] + Covs Pµ. Proof. From the disjointedness of the underlying sample space, for each s S, we have s Si for some i N Es P[µ] = Es P[Es P[µ | Si]] = Es P[Es P[µ | µ = µi]] = Es P[µ]. (49) Thus the law of total expectation and the law of total covariance naturally follow. Under mild assumptions for a given loss function, we show that the average loss function laug inherits the same properties from the non augmented loss function, where laug(s, θ) = Eλ D[l(sλ, θ)]. Published as a conference paper at ICLR 2023 Assumption 3. (Regularity of the loss function) For the loss function l( , θ), we assume that 1. For the minimizer θ of the population risk and any ϵ > 0, we have sup { θ θ ϵ | θ Θ} Es P[l(s, θ)] > Es P[l(s, θ )] 2. For every ϵ > 0, there exists a function l L2(P) such that for almost every s and for every θ1, θ2 N(θ0, ϵ), we have |l(s, θ1) l(s, θ2)| l (s) θ1 θ2 3. Uniform weak law of large number holds i=0 l(si, θ) Es P[l(s, θ)] 0 4. For each θ in Θ, the map s l(s, θ) is measurable 5. The map θ l(s, θ) is differentiable at θ for almost every s 6. The map θ Es P[l(s, θ)] admits a second-order Taylor expansion at θ with nonsingular second derivatives matrix Vθ Proposition A.4. For the pair (θaug, laug), the following property holds. For every ϵ > 0, there exists a function l aug L2(P) such that for almost every s and for every θ1, θ2 N(θ0, ϵ), we have |laug(s, θ1) laug(s, θ2)| l aug(s) θ1 θ2 Proof. Since we have |laug(s, θ1) laug(s, θ2)| = |Eλ D[l(sλ, θ1) l(sλ, θ2)]| Eλ D[|l(sλ, θ1) l(sλ, θ2)|] Eλ D[l (sλ) θ1 θ2 ] = Eλ D[l (sλ)] θ1 θ2 , the conclusion holds and l aug = Eλ D[l (sλ)]. Proposition A.5. For the pair (θaug, laug), the following property holds. i=0 laug(si, θ) Es P[laug(s, θ)] 0 Proof. We have i=0 laug(si, θ) Es P[laug(s, θ)]] i=0 Eλ D[l(si,λ, θ)] Es P[Eλ D[l(sλ, θ)]] where the last equality holds because the underlying sample space S is the disjoint countable union of all possible augmented sample spaces. Proposition A.6. For the pair (θaug, laug), the following property holds. For each θ in Θ, the map s laug(s, θ) is measurable Proof. laug is measurable since l is measurable and laug(s, θ) = Eλ D[l(sλ, θ)]. Published as a conference paper at ICLR 2023 Proposition A.7. For the pair (θaug, laug), the following property holds. For each θ in Θ, the map s laug(s, θ) is differentiable Proof. 
We have

$$\frac{\big|l_{aug}(s, \theta^* + \delta) - l_{aug}(s, \theta^*) - \delta^T \nabla l_{aug}(s, \theta^*)\big|}{\|\delta\|} = \frac{\big|\mathbb{E}_{\lambda \sim D}[l(s_\lambda, \theta^* + \delta) - l(s_\lambda, \theta^*)] - \delta^T \mathbb{E}_{\lambda \sim D}[\nabla l(s_\lambda, \theta^*)]\big|}{\|\delta\|} \leq \mathbb{E}_{\lambda \sim D}\Big[\frac{\big|l(s_\lambda, \theta^* + \delta) - l(s_\lambda, \theta^*) - \delta^T \nabla l(s_\lambda, \theta^*)\big|}{\|\delta\|}\Big].$$

Let us define, for all $s \in S$,

$$F_\delta(s) = \mathbb{E}_{\lambda \sim D}\Big[\frac{\big|l(s_\lambda, \theta^* + \delta) - l(s_\lambda, \theta^*) - \delta^T \nabla l(s_\lambda, \theta^*)\big|}{\|\delta\|}\Big], \qquad G(s) = \mathbb{E}_{\lambda \sim D}\big[|\dot l(s_\lambda)| + \|\nabla l(s_\lambda, \theta^*)\|\big].$$

Since $|F_\delta(s)| \leq G(s)$ for all $s \in S$, by Lebesgue's dominated convergence theorem,

$$\lim_{\delta \to 0}\frac{\big|l_{aug}(s, \theta^* + \delta) - l_{aug}(s, \theta^*) - \delta^T \nabla l_{aug}(s, \theta^*)\big|}{\|\delta\|} \leq \mathbb{E}_{\lambda \sim D}\Big[\lim_{\delta \to 0}\frac{\big|l(s_\lambda, \theta^* + \delta) - l(s_\lambda, \theta^*) - \delta^T \nabla l(s_\lambda, \theta^*)\big|}{\|\delta\|}\Big] = 0,$$

where the last equality holds because $l$ is differentiable at $\theta^*$.

Proposition A.8. The map $\theta \mapsto \mathbb{E}_{s \sim P}[l_{aug}(s, \theta)]$ admits a second-order Taylor expansion at $\theta^*$ with nonsingular second-derivative matrix $V_{\theta^*}$.

Proof. By the law of total expectation of Lemma A.3, we get $\mathbb{E}_{s \sim P}[l_{aug}(s, \theta)] = \mathbb{E}_{s \sim P}[l(s, \theta)]$. Hence the conclusion holds by item 6 of Assumption 3.

Combining all the Propositions [A.4 - A.8] with results from Van der Vaart (1998), we prove Theorem 3.5.

Proof of Theorem 3.5. The results for $\hat\theta$ have already been proven in (Van der Vaart, 1998, Theorem 5.23). Propositions [A.4 - A.8] guarantee that (Van der Vaart, 1998, Theorem 5.23) can be applied to the pairs $(\theta^*, l)$ and $(\theta^*_{aug}, l_{aug})$. Therefore $\hat\theta_{aug}$ is asymptotically normal and satisfies (Eq. 10). Let $X = \nabla l(s, \theta^*) - \nabla l_{aug}(s, \theta^*)$. Recall that $\operatorname{Cov}_{s \sim P}\mu = \mathbb{E}_{s \sim P}[\operatorname{Cov}(\mu \mid \bar\mu)] + \operatorname{Cov}_{s \sim P}\bar\mu$ by Lemma A.3. Let us take $\mu$ and $\bar\mu$ to be $\nabla l(s, \theta^*)$ and $\nabla l_{aug}(s, \theta^*)$, respectively. Then $\Sigma_0 = \operatorname{Cov}(\nabla l(s, \theta^*))$ and $\Sigma_{aug} = \operatorname{Cov}(\nabla l_{aug}(s, \theta^*))$. Hence we have

$$\Sigma_0 - \Sigma_{aug} = \operatorname{Cov}_{s \sim P}\mu - \operatorname{Cov}_{s \sim P}\bar\mu = \mathbb{E}_{s \sim P}[\operatorname{Cov}(\mu \mid \bar\mu)] = \mathbb{E}_{s \sim P}\big[\mathbb{E}_{s \sim P}[(\mu - \mathbb{E}[\mu \mid \bar\mu])(\mu - \mathbb{E}[\mu \mid \bar\mu])^T \mid \bar\mu]\big] = \mathbb{E}_{s \sim P}\big[\mathbb{E}_{s \sim P}[(\mu - \bar\mu)(\mu - \bar\mu)^T \mid \bar\mu]\big]$$
$$= \mathbb{E}_{s \sim P}\big[\mathbb{E}_{s \sim P}[(\nabla l(s, \theta^*) - \nabla l_{aug}(s, \theta^*))(\nabla l(s, \theta^*) - \nabla l_{aug}(s, \theta^*))^T \mid \bar\mu]\big] = \mathbb{E}_{s \sim P}\mathbb{E}_{s \sim P}[XX^T \mid \bar\mu] = \mathbb{E}_{s \sim P}[XX^T]. \tag{50}$$

Thus we get

$$\Sigma_{aug} = \Sigma_0 - \mathbb{E}_{s \sim P}[XX^T]. \tag{51}$$

Since $\operatorname{tr}(\mathbb{E}_{s \sim P}[XX^T]) \geq 0$, we have $RE = \operatorname{tr}(\Sigma_0)/\operatorname{tr}(\Sigma_{aug}) \geq 1$.

B VISUALIZATIONS OF RIM AND TIMEGAN GENERATED TIME SERIES

This section shows visualizations of time series generated by RIM and TimeGAN. We plot samples from the original time series, the RIM-generated time series, and the TimeGAN-generated time series from both classes for the synthetic exponential ODE classification in Figure 6.

Figure 6: Visualization of exponential ODE classification series generated by RIM and TimeGAN against the original series (5 each): sample paths for Class 1 (left) and Class 2 (right).

C ABLATION ANALYSIS

Figure 7: Test accuracy over epochs for the synthetic exponential ODE data with $\lambda$ sampled from beta distributions with different shapes: (left) Beta(0.5, 0.5), (middle) Beta(2, 2), and (right) Beta(2, 5); each panel compares No-Aug, RIM, and GAN.
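Since the deviation bounds of Theorem 3.1 scale with $e = \mathbb{E}[D]$, the choice of the $\lambda$-distribution in this ablation directly controls how far augmented series may drift from the originals. A small snippet (our own illustration) contrasting the means of the tested distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
dists = {
    "U(0,1)":        lambda n: rng.uniform(0.0, 1.0, n),
    "Beta(0.5,0.5)": lambda n: rng.beta(0.5, 0.5, n),
    "Beta(2,2)":     lambda n: rng.beta(2.0, 2.0, n),
    "Beta(2,5)":     lambda n: rng.beta(2.0, 5.0, n),
}
for name, sample in dists.items():
    e = sample(100_000).mean()   # e = E[D] enters the Theorem 3.1 bounds, e.g. 3*e*m
    print(f"{name}: e = E[D] ~ {e:.3f}")
# Beta(a, b) has mean a / (a + b): 0.5, 0.5, and 2/7 ~ 0.286 for the three beta choices
```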
Figure 8: Test accuracy over epochs for the Indoor dataset with $\lambda$ sampled from beta distributions with different shapes: (left) Beta(0.5, 0.5), (middle) Beta(2, 2), and (right) Beta(2, 5); each panel compares No-Aug, RIM, and GAN.

D TIME SERIES FORECASTING

In this section, we consider time series forecasting tasks where we use the data of the previous n time steps to predict the data of the next time step. We compare the performance of a regression model trained on the original dataset with regression models trained on datasets augmented using RIM.

Predicting Stock Price Movement. This regression task consists of predicting the next day's SPY500 index Open price from the historical SPY500 index, using data from July 2008 to December 2012 as training data and data from January 2013 to March 2014 as testing data. The input data contains the last 7 days' historical Open, Close, High, Low, and Volume of the SPY500 index. After predicting the next day's Open price, we take a long position if the predicted next-day Open is larger than today's Open, and a short position otherwise. Comparing the results on the test data for the augmented case and the original case, we observe that the proportion of profitable trading signals is higher for the augmentation-trained model, as shown in Figure 9. Using these trading signals, we also compute the trading system's CAGR (Compound Annual Growth Rate), which we observe to be higher for the augmentation-trained model. The test loss plot shows that the MSE of the augmentation-trained model is consistently lower than that of the non-augmentation-trained model.

Figure 9: Profitable trading signals (left), test set CAGR (middle), and test set MSE (right) for the SPY500 dataset using an LSTM model with 2 LSTM layers (200 neurons), 2 dense layers (100 neurons), lr=1e-4, and batch size=16. The plots indicate the mean ± standard deviation over 10 runs.

Predicting Air Quality. The restricted air quality dataset contains 1200 instances of hourly averaged responses from an array of 5 metal-oxide chemical sensors. This is a time series regression task where the target is the next time step's CO concentration. The input data contains the 10 features of the last six time steps, as used in De Vito et al. (2008) and in the UCI Machine Learning Repository air quality dataset. Figure 10 shows that the test MSE of the augmentation-trained model remains lower than that of the non-augmentation-trained model throughout the training epochs, validating our claims about the robustness of the approach. Accordingly, the proportion of correct predictions of CO up/down movement also remains higher for the augmented case.

Figure 10: Test accuracy (left) and test MSE (right) for the Air Quality dataset using an LSTM model with 2 LSTM layers (200 neurons), 2 dense layers (100 neurons), lr=1e-4, and batch size=16. The plots indicate the mean ± standard deviation over 10 runs.
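A sketch of how a single realization is decomposed into (x, y) pairs and enlarged with RIM for these forecasting tasks, reusing rim_augment from the Section 3.1 sketch (the augment-then-window order and the number of augmented copies are our own assumptions):

```python
import numpy as np

def make_forecast_pairs(series, n):
    """Slice one realization into (x, y) pairs: x = previous n steps, y = next step."""
    X = np.stack([series[i : i + n] for i in range(len(series) - n)])
    y = series[n:]
    return X, y

def augmented_forecast_pairs(series, n, num_aug=5, rng=None):
    """Enlarge the (x, y) set by windowing RIM-augmented copies of the realization."""
    rng = rng or np.random.default_rng()
    X0, y0 = make_forecast_pairs(series, n)
    Xs, ys = [X0], [y0]
    for _ in range(num_aug):
        lam = rng.uniform(0.0, 1.0, size=len(series) - 1)
        Xa, ya = make_forecast_pairs(rim_augment(series, lam), n)
        Xs.append(Xa)
        ys.append(ya)
    return np.concatenate(Xs), np.concatenate(ys)
```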
E TIME SERIES RL

In this section, we consider a portfolio management task using reinforcement learning. More specifically, we compare the performance of agents (DPG) trained with original state trajectories (price evolution) and with state trajectories augmented using RIM.

E.1 DATASET

Figure 11: Evolution of stock prices.

For RL, the data comes from the Quandl finance database, which has daily data. Our RL models are trained over 600 trading days from 2010-08-09 to 2012-12-25 and tested over 200 trading days from 2012-12-26 to 2013-10-11 on a portfolio consisting of ten selected stocks: American Tower Corp. (AMT), American Express Company (AXP), Boeing Company (BA), Chevron Corporation (CVX), Johnson & Johnson (JNJ), Coca-Cola Co (KO), McDonald's Corp. (MCD), Microsoft Corporation (MSFT), AT&T Inc. (T), and Walmart Inc (WMT). To promote the diversification of the portfolio, these stocks were selected from different sectors of the S&P 500, so that they are as uncorrelated as possible, as shown in Fig. 11.

E.2 RL PSEUDOCODE

Algorithm 1: RIM: RL Training
Input: observed time series data $S = \{s_0, s_1, \dots, s_T\}$, where $s_i = (x_i, y_i)$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$ for $i \in [0:T]$.
1. Initialize the policy network parameters $\theta$, the number of epochs $Y$, and a distribution $D$ with support $[0, 1)$.
2. For $e = 1$ to $Y$:
   a. Initialize $\lambda = (\lambda_1, \dots, \lambda_T)$ with $\lambda_i \sim D$. // initialize the interpolation coefficient vector
   b. Augmented path simulator: generate an augmented trajectory $S_\lambda = \{s_0, s_{1,\lambda_1}, \dots, s_{T,\lambda_T}\}$, where $s_{i,\lambda_i} = (x_{i,\lambda_i}, y_{i,\lambda_i})$.
   c. For $t = 1$ to $T$:
      i. $a_t \sim \pi(\cdot \mid s_{t,\lambda_t})$
      ii. $s'_t \sim p(\cdot \mid s_{t,\lambda_t}, a_t)$
      iii. $S_t \leftarrow S \cup S_\lambda$ // add the transition to the replay buffer
   d. UpdateCritic($S_t$); UpdateActor($S_t$) // data augmentation is applied to the samples used for actor training as well

E.3 POLICY DEPLOYMENT

In our RL experiments, the investment decisions to rebalance the portfolio are made daily, and each input signal is a multidimensional tensor that aggregates historical open, low, high, and close prices and volume. It should be noted that our training and testing include transaction costs (TC). We used a typical cost of 0.25% due to the bid-ask spread and market impact, which we believe is a reasonable transaction cost for the portfolio trades.

Figure 12: Training and testing cumulative returns for DPG (above) and DDPG (below), with and without augmentation. The plots indicate the mean ± standard deviation over 20 runs with different seeds.
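A compact Python rendering of the augmented path simulator and training loop of Algorithm 1, reusing rim_augment from the Section 3.1 sketch (the replay buffer and the update routines are placeholders, not the authors' implementation):

```python
import numpy as np

def rim_augment_trajectory(states, lam):
    """Generate the augmented trajectory {s_0, s_{1,l_1}, ..., s_{T,l_T}} of Algorithm 1.
    states: array of shape (T+1, d); the RIM recursion runs along the time axis,
    independently for each of the d state features."""
    return np.apply_along_axis(lambda col: rim_augment(col, lam), 0, states)

def train_rim_rl(states, num_epochs, update_critic, update_actor, replay_buffer, rng=None):
    rng = rng or np.random.default_rng()
    for _ in range(num_epochs):
        lam = rng.uniform(0.0, 1.0, size=len(states) - 1)  # lambda_t ~ D, support [0, 1)
        augmented = rim_augment_trajectory(states, lam)
        replay_buffer.extend([states, augmented])          # S_t <- S U S_lambda
        update_critic(replay_buffer)
        update_actor(replay_buffer)                        # augmentation also feeds actor updates
```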
F HYPERPARAMETERS

We note that the primary objective of the conducted experiments is to show that RIM can improve model performance. Therefore, in all experiments, instead of searching for the optimal hyperparameters of the augmentation-trained and non-augmentation-trained models separately, we compare the two models under the same hyperparameter configuration. We demonstrated in Section 4 that RIM indeed improves model performance for a fixed hyperparameter configuration. However, are these improvements robust to other hyperparameter choices? To answer this question, we conducted a sensitivity analysis for two supervised learning tasks (Indoor user movement classification and Air Quality regression) and one RL task (portfolio management).

F.1 HYPERPARAMETER SENSITIVITY FOR SUPERVISED TASKS

For the Indoor movement classification task, we repeated the experiment of Section 4 with 9 different hyperparameter configurations, as shown in Table 1, and observed that RIM outperforms the non-augmented case in every configuration (smaller mean test loss and higher mean test accuracy), which supports our claim that RIM improves model performance.

Table 1: Test loss and test accuracy, reported as mean (standard deviation) over 10 runs of 50 epochs, for varying numbers of filters and kernel sizes on the Indoor User Movement Classification task.

Filters  Kernel Size  Test Loss (RIM)  Test Loss (No Aug)  Test Acc (RIM)  Test Acc (No Aug)
16       3            1.09 (0.43)      2.72 (0.95)         0.71 (0.04)     0.63 (0.02)
16       4            0.90 (0.45)      1.70 (0.69)         0.76 (0.04)     0.68 (0.04)
16       5            0.86 (0.19)      1.00 (0.19)         0.72 (0.01)     0.60 (0.05)
32       3            1.37 (0.31)      3.07 (0.67)         0.72 (0.02)     0.62 (0.05)
32       4            1.75 (0.71)      2.70 (1.20)         0.74 (0.03)     0.68 (0.03)
32       5            1.70 (0.70)      2.57 (0.85)         0.66 (0.08)     0.64 (0.06)
64       3            2.99 (2.17)      3.31 (1.36)         0.68 (0.07)     0.68 (0.06)
64       4            2.60 (0.70)      4.80 (4.30)         0.73 (0.03)     0.68 (0.02)
64       5            3.30 (0.74)      4.90 (4.08)         0.66 (0.03)     0.60 (0.06)

For the Air Quality regression task, we again repeated the experiment of Section 4 with 8 different hyperparameter configurations, as shown in Table 2. Here too, we find that RIM outperforms the non-augmented case (smaller mean test MSE and higher mean test accuracy), confirming the improvement in model performance obtained with RIM.

Table 2: Test MSE and accuracy, reported as mean (standard deviation) over 10 runs of 50 epochs, for varying initial epoch, batch size, LSTM layer size, and dense layer size on the Air Quality Regression task.

Epoch Init  Batch Size  LSTM Layer  Dense Layer  MSE (RIM)    MSE (No Aug)  Acc (RIM)    Acc (No Aug)
5           16          100         50           5.34 (0.11)  5.43 (0.25)   0.63 (0.03)  0.47 (0.08)
1           32          80          40           5.40 (0.08)  5.44 (0.04)   0.63 (0.05)  0.51 (0.02)
5           16          150         50           5.59 (0.01)  5.74 (0.35)   0.66 (0.03)  0.53 (0.10)
6           32          200         100          5.23 (0.10)  5.55 (0.19)   0.64 (0.02)  0.51 (0.06)
10          16          150         100          5.32 (0.05)  5.48 (0.11)   0.57 (0.02)  0.51 (0.35)
1           64          200         100          5.59 (0.07)  5.67 (0.23)   0.65 (0.02)  0.52 (0.03)
10          16          100         50           5.44 (0.24)  5.49 (0.38)   0.59 (0.05)  0.56 (0.07)
15          16          100         50           5.56 (0.08)  5.60 (0.16)   0.65 (0.01)  0.56 (0.02)

F.2 HYPERPARAMETER SENSITIVITY FOR THE RL TASK

The hyperparameter space is a hypercube: the more values it contains, the harder it is to explore all possible combinations. To find a good set of hyperparameters efficiently, we explored the hyperparameter space using Bayesian optimization (BO) Hutter et al. (2011); a sketch of such a search is given below. Table 3 shows the ranges of hyperparameter values used during the training and validation phases. The learning rate controls the speed at which the neural network parameters are updated. The window allows the deep RL agents to use a range of historical data values to relax the Markov assumption; we allow the use of 2 up to 30 days of historical data. The number of filters and the kernel strides are hyperparameters of the convolutional neural networks; it is important to optimize them carefully in order to capture the best feature representations used by the policy networks. Finally, the training and testing sizes may also impact RL performance, so we also treat them as hyperparameters.
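The following is a minimal sketch of such a sequential model-based search over the Table 3 ranges using Optuna. Optuna's TPE sampler is a sequential model-based optimizer in the same family as the method of Hutter et al. (2011), though not the identical algorithm, and run_dpg is a hypothetical wrapper that trains a DPG agent with a given configuration and returns its cumulative validation return.

```python
import optuna

def run_dpg(**cfg) -> float:
    """Hypothetical stand-in: train a DPG agent with this configuration and
    return its cumulative validation return."""
    raise NotImplementedError

def objective(trial: optuna.Trial) -> float:
    # Search ranges follow Table 3.
    cfg = {
        "lr": trial.suggest_float("lr", 1e-5, 5e-1, log=True),  # learning rate
        "nf": trial.suggest_int("nf", 2, 52),                   # number of filters
        "ks": trial.suggest_int("ks", 2, 10),                   # kernel strides
        "window": trial.suggest_int("window", 2, 30),           # days of history
        "train_size": trial.suggest_int("train_size", 20, 500),
        "test_size": trial.suggest_int("test_size", 5, 100),
    }
    return run_dpg(**cfg)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
# study.optimize(objective, n_trials=40)  # enable once run_dpg is implemented
# print(study.best_params, study.best_value)
```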
Table 3: Hyperparameter ranges used by our RL algorithms.

Parameter                 Bounds          Type
Learning rate (lr)        1e-5 to 5e-1    Discrete
Trading cost (tc)         2 to 30         Discrete
Number of filters (nf)    2 to 52         Discrete
Kernel strides (ks)       2 to 10         Discrete
Window                    2 to 30         Discrete
Training size (train)     20 to 500       Discrete
Testing size (test)       5 to 100        Discrete

Figure 13: Hyperparameter sensitivity for DPG. The vertical axes list all evaluated models by their index; the detailed hyperparameter configuration each index refers to is listed in Tables 4 and 5. The horizontal axis shows the cumulative validation return. The blue line shows the validation performance of DPG without augmentation; the orange line shows the validation performance of DPG using RIM. Models are ordered from worst (bottom) to best (top).

Table 4: Sensitivity analysis for DPG configurations without augmentation.

lr      tc      nf              ks  window  train  test  validation cumulative return
0.001   0.0225  [48, 20, 1710]  9   19      50     10    1.1563
1.0     0.0125  [50, 38, 1340]  3   6       500    100   1.1509
0.0005  0.0125  [14, 6, 1300]   4   16      50     50    1.1600
0.01    0.0125  [28, 28, 210]   4   13      100    10    1.1380
0.01    0.0175  [16, 2, 1080]   3   9       50     5     1.1597
0.001   0.0125  [18, 12, 1290]  8   16      50     50    1.1542
1.0     0.0175  [20, 20, 1090]  6   16      50     10    1.1507
0.0005  0.0075  [18, 2, 1270]   3   16      50     50    1.1614
1e-5    0.0225  [10, 10, 1310]  8   9       50     50    1.1591
0.1     0.0225  [10, 10, 770]   2   17      200    20    1.15470
5e-5    0.0075  [12, 16, 1300]  7   17      50     50    1.1558
0.05    0.0125  [16, 10, 1090]  3   7       50     5     1.1551
0.0001  0.0075  [24, 12, 1080]  5   15      50     5     1.1544
0.1     0.0175  [44, 2, 1500]   2   11      20     10    1.16082
0.005   0.0025  [2, 42, 270]    8   16      500    50    1.1374
5e-5    0.0175  [24, 10, 1090]  2   11      50     5     1.1537
1.0     0.0175  [32, 12, 1490]  2   8       20     20    1.1574
0.5     0.0075  [46, 6, 950]    4   18      20     20    1.1589
0.0001  0.0025  [30, 26, 1190]  7   9       500    100   1.1510
0.01    0.0025  [50, 16, 940]   7   15      20     5     1.1501

Table 5: Sensitivity analysis for DPG configurations with augmentation.

lr      tc      nf              ks  window  train  test  validation cumulative return
0.005   0.0175  [36, 36, 1960]  4   9       20     10    1.1513
0.0001  0.0025  [22, 14, 1270]  9   10      50     50    1.1582
0.1     0.0125  [40, 22, 880]   3   6       20     10    1.1503
1e-5    0.0075  [8, 16, 760]    2   6       200    20    1.1521
0.005   0.0175  [44, 6, 1500]   5   6       20     20    1.1593
1e-5    0.0025  [2, 2, 1310]    2   19      50     50    1.1619
0.001   0.0125  [36, 6, 960]    3   17      20     20    1.1560
0.0001  0.0075  [22, 6, 770]    9   13      200    20    1.1573
1e-5    0.0025  [22, 2, 1320]   2   19      50     50    1.1619
0.5     0.0225  [28, 2, 970]    7   12      20     5     1.1581
0.5     0.0125  [40, 14, 960]   9   18      20     10    1.1547
0.01    0.0175  [32, 16, 760]   5   7       200    20    1.1478
0.01    0.0025  [24, 10, 780]   7   16      200    5     1.1517
0.005   0.0025  [46, 12, 770]   9   19      200    20    1.1502
0.01    0.0225  [48, 44, 1190]  4   10      100    50    1.1450
1.0     0.0175  [28, 8, 960]    6   13      20     5     1.1541
0.005   0.0125  [36, 2, 1500]   3   15      20     5     1.1607
0.005   0.0225  [18, 8, 1270]   5   9       50     50    1.1585
1.0     0.0125  [8, 6, 1260]    7   8       50     50    1.1604
1.0     0.0025  [28, 2, 1250]   2   7       50     50    1.1607