# DYNAMIC TIME LAG REGRESSION: PREDICTING WHAT & WHEN

Published as a conference paper at ICLR 2020

Mandar Chandorkar, Centrum Wiskunde & Informatica, Amsterdam 1098XG
Cyril Furtlehner, INRIA-Saclay
Bala Poduval, University of New Hampshire, Durham, NH 03824
Enrico Camporeale, CIRES, University of Colorado, Boulder, CO
Michèle Sebag, CNRS, Univ. Paris-Saclay

ABSTRACT

This paper tackles a new regression problem, called Dynamic Time-Lag Regression (DTLR), where a cause signal drives an effect signal with an unknown time delay. The motivating application, pertaining to space weather modelling, aims to predict the near-Earth solar wind speed based on estimates of the Sun's coronal magnetic field. DTLR differs from mainstream regression and from sequence-to-sequence learning in two respects: firstly, no ground truth (e.g., pairs of associated subsequences) is available; secondly, the cause signal contains much information irrelevant to the effect signal (the solar magnetic field governs the solar wind propagation in the heliosphere, of which the Earth's magnetosphere is but a minuscule region). A Bayesian approach is presented to tackle the specifics of the DTLR problem, with theoretical justifications based on a linear stability analysis. A proof of concept on synthetic problems is presented. Finally, the empirical results on the solar wind modelling task improve on the state of the art in solar wind forecasting.

1 INTRODUCTION

A significant body of work in machine learning concerns the modeling of spatio-temporal phenomena (Shi and Yeung, 2018; Rangapuram et al., 2018), including the causal analysis of time series (Peters et al., 2017), with applications ranging from markets (Pennacchioli et al., 2014) to bioinformatics (Brouard et al., 2016) to climate (Nooteboom et al., 2018). This paper focuses on the problem of modeling the temporal dependency between two spatio-temporal phenomena, where the latter is caused by the former (Granger, 1969; Runge, 2018) with a non-stationary time delay.

The motivating application domain is that of space weather. The Sun, a perennial source of charged energetic particles, is at the origin of geomagnetic phenomena within the Sun-Earth system. Specifically, the Sun ejects charged particles into the surrounding space in all directions, and some of these particle clouds, a.k.a. the solar wind, reach the Earth's vicinity. High-speed solar wind is a major threat to the modern world, causing severe damage to, e.g., satellites, telecommunication infrastructures, and undersea pipelines, among others.[1] A key prediction task thus is to forecast the speed of the solar wind in the vicinity of the Earth (Munteanu et al., 2013; Haaland et al., 2010; Reiss et al., 2019), sufficiently early to raise an alarm and mitigate the damage to the best possible extent. Formally, the goal is to model the dependency between heliospheric observations (available at light speed), referred to as the cause series, and the solar wind speed series recorded at the Lagrangian point L1 (a point on the Sun-Earth line, 1.5 million kilometers away from the Earth), referred to as the effect series.

[1] The adverse impact of space weather is estimated to cost 200 to 400 million USD per year, but can sporadically lead to much larger losses.
The key difficulty is that the time lag between an input and its effect, the solar wind recorded at L1, varies from circa 2 to 5 days, depending on, among many factors, the initial direction of the emitted particles and their energy. Were the lag constant, the solar wind prediction problem would boil down to a mainstream regression problem. The challenge here is to predict, from the solar image $x(t)$ at time $t$, the value $y(t + \Delta t)$ of the solar wind speed reaching the Earth at time $t + \Delta t$, where both the value $y(t + \Delta t)$ and the time lag $\Delta t$ depend on $x(t)$.

Related work. The modeling of dependencies among time series has been intensively tackled (see, e.g., Zhou and Sornette (2006); Runge (2018)). When considering a varying time lag, many approaches rely on dynamic time warping (DTW) (Sakoe and Chiba, 1978). For instance, DTW is used in Gaskell et al. (2015), taking a Bayesian approach to achieve the temporal alignment of both series under some restrictive assumptions (slowly varying time lags and linear relationships between the cause and effect time series). More generally, the use of DTW in time series analysis relies on simplifying assumptions on the cause and effect series (same dimensionality and structure) and builds upon available cost matrices for the temporal alignment.

Also related is sequence-to-sequence learning (Sutskever et al., 2014), primarily aimed at machine translation. While Seq2Seq modelling relaxes some former assumptions (such as the fixed or comparable sizes of the source and target series), it still relies on a known segmentation of the source series into disjoint units (the sentences), each one being mapped into a large fixed-size vector using an LSTM; this vector is then exploited by another LSTM to extract the output sequence. Attention-based mechanisms (Graves, 2013; Bahdanau et al., 2015) alleviate the need to encode the full source sentence into a fixed-size vector, by learning the alignment and allowing the model to search for the parts of the source sentence relevant to predicting a target part. More advanced attention mechanisms (Kim et al., 2017; Vaswani et al., 2017) refine the way the source information is leveraged to produce a target part. To the best of our knowledge, however, the end-to-end learning of sequence-to-sequence models relies on the segmentation of the source and target series, and on the definition of associated pairs of segments (e.g., the sentences).

Our claim is that the regression problem of predicting both what the effect is and when the effect is observed, called Dynamic Time Lag Regression (DTLR), constitutes a new ML problem:

- With respect to the modeling of dependencies among time series, it involves stochastic dependencies of arbitrary complexity; the relationship between the cause and the effect series can be non-linear (the "what" model). Furthermore, the time-lag phenomenon (the "when" model) can be non-smooth (as opposed to, e.g., Zhou and Sornette (2006)).
- With respect to sequence-to-sequence translation, a main difference is that the end-to-end training of the model cannot rely on pairs of associated units (the sentences), adversely affecting the alignment learning.
- Lastly, and most importantly, in the considered DTLR problem, even if the cause series has high information content, only a small portion of it is relevant to the prediction of the effect series.
On the one hand, the cause series might be high dimensional (images) whereas the effect series is scalar; on the other hand, the cause series governs the solar wind speed in the whole heliosphere and not just in near-Earth space. Inserting the time-lag inference explicitly in the model avoids expanding the dimension of an already large input signal by, typically, one or two orders of magnitude, and can also improve the interpretability of the model.

Organization of the paper. The Bayesian approach proposed to tackle the specifics of the DTLR problem is described in section 2; the associated learning equations are discussed, followed by a stability analysis and a proof of consistency (section 3). The algorithm is detailed in section 4. The experimental setting used to validate the approach is presented in section 5; the empirical validation on toy problems and on the real-world problem is discussed in section 6.

2 PROBABILISTIC DYNAMICALLY DELAYED REGRESSION

2.1 POSITION OF THE PROBLEM

Given two time series, the cause series $x(t)$ (with $x(t) \in \mathcal{X} \subset \mathbb{R}^D$) and the observed effect series $y(t)$, the sought model consists of a mapping $f(\cdot)$, which maps each input pattern $x(t)$ to an output $y(\varphi(t))$, and a mapping $g(\cdot)$, which determines the time delay $\varphi(t) - t$ between the input and output patterns:

$$y(\varphi(t)) = f[x(t)] \qquad (1)$$
$$\varphi(t) = t + g[x(t)] \qquad (2)$$

with $f : \mathcal{X} \to \mathbb{R}$ and $g : \mathcal{X} \to \mathbb{R}^+$, where $t \in \mathbb{R}^+$ represents the continuous temporal domain. The input signal $x(t)$ is possibly high dimensional and contains the hidden cause of the effect $y(t) \in \mathbb{R}$; $y(t)$ is assumed to be scalar in the remainder of the paper. Function $g$ represents the time delay between inputs and outputs. Vectors are written using bold fonts.

As said, Eqs. (1)-(2) define a regression problem that differs from standard regression in two ways: firstly, the time lag $g[x(t)]$ is non-stationary, as it depends on $x(t)$; secondly, $g[x(t)]$ is unknown, i.e., it is not recorded explicitly in the training data.

Assumption. For the sake of model identifiability and computational stability, the time warping function $\varphi(t) = t + g[x(t)]$ is assumed to be sufficiently regular w.r.t. $t$. Formally, $\varphi(\cdot)$ is assumed to be continuous.[2]

2.2 PROBABILISTIC DYNAMIC TIME-LAG REGRESSION

For practical reasons, cause and effect series are sampled at a constant rate, and thereafter noted $x_t$ and $y_t$ with $t \in \mathbb{N}$. Accordingly, mapping $g$ maps $x_t$ onto a finite set $\mathcal{T}$ of possible time lags, with $\mathcal{T} = \{t_{min}, \ldots, t_{max}\}$ an integer interval defined from domain knowledge. The unavoidable error due to the discretization of the continuous time lag is mitigated through a probabilistic model, associating to each cause $x$ a set of predictors $\hat{\mathbf{y}}(x) = \{\hat y_i(x), i \in \mathcal{T}\}$ and a probability distribution $\hat p(x)$ on $\mathcal{T}$, estimating the probability of delay of the effects of $x$. Overall, the DTLR solution is sought as a probability distribution conditioned on the cause $x$: a mixture of Gaussians[3] centered on the predictors $\hat y_i(x)$, where the mixture weights are defined from $\hat p(x)$. More formally, letting $\mathbf{y}_t$ denote the vector of random variables $\{y_{t+i}, i \in \mathcal{T}\}$:

$$P(\mathbf{y}_t \,|\, x_t = x) = \sum_{\{\tau_i \in \{0,1\},\; i \in \mathcal{T}\}} \hat p(\tau_1, \ldots, \tau_{|\mathcal{T}|} \,|\, x)\; \mathcal{N}\big(\hat{\mathbf{y}}(x), \Sigma(\tau)\big) \qquad (3)$$

with $\Sigma(\tau) = \mathrm{Diag}(\sigma_i(\tau)^2)$ the diagonal matrix of variance parameters attached to each time lag $i \in \mathcal{T}$.

Two simplifications are made for the sake of the analysis. Firstly, the stochastic time lag is modelled as the vector $(\tau_i)$, $i \in \mathcal{T}$, of binary latent variables, where $\tau_i$ indicates whether $x_t$ drives $y_{t+i}$ ($\tau_i = 1$) or not ($\tau_i = 0$).
The assumption that every cause has a single effect is modelled by imposing:[4]

$$\sum_{i \in \mathcal{T}} \tau_i = 1. \qquad (4)$$

From constraint (4), probability distribution $\hat p(x)$ is thus sought as the vector $(\hat p_i(x))$, $i \in \mathcal{T}$, summing to 1, such that $\hat p_i(x)$ stands for the probability of the effect of $x_t = x$ occurring with delay $i$. The second simplifying assumption is that the variance $\sigma_i^2(\tau)$ of predictor $\hat y_i$ does not depend on $x$; it is set via

$$\sigma_i(\tau)^{-2} = \Big(1 + \sum_{j \in \mathcal{T}} \alpha_{ij}\,\tau_j\Big)\,\sigma^{-2},$$

with $\sigma^2$ a default variance and $\alpha_{ij} \geq 0$ a matrix of non-negative real parameters. This particular formulation supports a tractable analysis of the posterior probability of $\tau_i$ (in supplementary material). The fact that $x$ can influence $y_i$ through predictor $\hat y_i(x)$ even when $\tau_i = 0$ reflects an indirect influence due to the auto-correlation of the $y$ series. This influence comes with a higher variance, enforced by making $\alpha_{ij}$ a decreasing function of $|i - j|$. More generally, a large value of $\alpha_{ii}$ compared to $\alpha_{ij}$ for $i \neq j$ corresponds to a small auto-correlation time of the effect series.

[2] For some authors (Zhou and Sornette, 2006), the monotonicity of $\varphi(\cdot)$ is additionally required and enforced through constraints: $\varphi(t_1) \leq \varphi(t_2)$ for all $t_1 \leq t_2$. This assumption is not retained here, as a similar effect can be achieved using regularization-based smoothness penalties.
[3] Note that pre-processing can be used if needed to map non-Gaussian data onto Gaussian data.
[4] Note however that the cause-effect correspondence might be many-to-one, with an effect depending on several causes.

2.3 LEARNING CRITERION

The joint distribution is classically learned by maximizing the log-likelihood of the data, which can here be expressed in closed form. Let us denote the dataset and the parameters respectively as $\{(x, y)\}_{data}$ and $\theta = (\hat{\mathbf{y}}, \hat p, \sigma, \alpha)$. From Eq. (3), the conditional probability $q_i(x, y) \stackrel{def}{=} P(\tau_i = 1 \,|\, x, y)$ reads:

$$q_i(x, y) = \frac{1}{Z(x, y|\theta)}\, \hat p_i(x)\, \exp\Big(-\frac{1}{2\sigma^2}\sum_{j \in \mathcal{T}} \alpha_{ji}\big(y_j - \hat y_j(x)\big)^2 + \frac{1}{2}\sum_{j \in \mathcal{T}} \log(1 + \alpha_{ji})\Big) \qquad (5)$$

with normalization constant

$$Z(x, y|\theta) = \sum_{i \in \mathcal{T}} \hat p_i(x)\, \exp\Big(-\frac{1}{2\sigma^2}\sum_{j \in \mathcal{T}} \alpha_{ji}\big(y_j - \hat y_j(x)\big)^2 + \frac{1}{2}\sum_{j \in \mathcal{T}} \log(1 + \alpha_{ji})\Big).$$

The log-likelihood then reads (intermediate calculations in supplementary material, Appendix A):

$$\mathcal{L}[\{(x, y)\}_{data}\,|\,\theta] = -|\mathcal{T}| \log(\sigma) - \mathbb{E}_{data}\Big[\sum_{i \in \mathcal{T}} \frac{1}{2\sigma^2}\big(y_i - \hat y_i(x)\big)^2 - \log Z(x, y|\theta)\Big] \qquad (6)$$

where $\mathbb{E}_{data}$ denotes the empirical average over the dataset. For notational simplicity, the time index $t$ is omitted in the following. The hyper-parameters $\sigma$ and $\alpha$ of the model are obtained by optimizing $\mathcal{L}$:

$$1 + \alpha_{ij} = \frac{\sigma^2\; \mathbb{E}_{data}\big[q_j(x, y)\big]}{\mathbb{E}_{data}\big[(y_i - \hat y_i(x))^2\, q_j(x, y)\big]}. \qquad (7)$$

In addition, the optimal $\hat y$ and $\hat p$ read:

$$\hat y_i(x) = \frac{\mathbb{E}_{data}\Big[y_i \big(1 + \sum_{j \in \mathcal{T}} \alpha_{ij}\, q_j(x, y)\big) \,\Big|\, x\Big]}{\mathbb{E}_{data}\Big[1 + \sum_{j \in \mathcal{T}} \alpha_{ij}\, q_j(x, y) \,\Big|\, x\Big]} \qquad (8)$$

$$\hat p_i(x) = \mathbb{E}_{data}\big[q_i(x, y) \,\big|\, x\big], \qquad (9)$$

where the conditional empirical averaging above operates as an averaging over samples close to $x$. These are self-consistent equations, since $q_i(x, y)$ depends on the parameters $\sigma^2$, $\alpha_{ij}$, $\hat y$ and $\hat p$. The proposed algorithm, detailed in section 4, implements the saddle point method defined from Eqs. (5, 7, 8, 9): alternately, the hyper-parameters $\sigma$ and $\alpha_{ij}$ are updated from Eq. (7) based on the current $\hat y_i$ and $\hat p_i$, and the predictors $\hat y_i$ and mixture weights $\hat p_i$ are updated according to Eqs. (8) and (9) respectively.
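To make the alternating scheme concrete, here is a minimal NumPy sketch (ours, with illustrative names, not the authors' released code) of the two quantities driving it: the responsibilities of Eq. (5) and the hyper-parameter update of Eq. (7). In the actual algorithm of section 4, the networks realizing $\hat y$ and $\hat p$ are trained by stochastic gradient steps on Eq. (6), whose stationary points satisfy Eqs. (8)-(9).

```python
import numpy as np

def responsibilities(y, y_hat, p_hat, alpha, sigma2):
    """q_i(x, y) of Eq. (5), vectorized over a batch.

    y, y_hat : (N, T) observed effects y_j and predictions y_hat_j(x)
    p_hat    : (N, T) mixture weights p_hat_i(x)
    alpha    : (T, T) non-negative matrix alpha_{ij}
    sigma2   : float, default variance sigma^2
    """
    err2 = (y - y_hat) ** 2                        # (y_j - y_hat_j(x))^2
    # Column i of (err2 @ alpha) is sum_j alpha_{ji} * err2_j.
    log_w = (np.log(p_hat)
             - err2 @ alpha / (2.0 * sigma2)
             + 0.5 * np.log1p(alpha).sum(axis=0))  # 1/2 sum_j log(1 + alpha_{ji})
    log_w -= log_w.max(axis=1, keepdims=True)      # numerical stabilization
    q = np.exp(log_w)
    return q / q.sum(axis=1, keepdims=True)        # normalization by Z(x, y | theta)

def update_alpha(y, y_hat, q, sigma2):
    """Saddle-point update of Eq. (7):
    1 + alpha_{ij} = sigma^2 E[q_j] / E[(y_i - y_hat_i)^2 q_j]."""
    err2 = (y - y_hat) ** 2
    num = sigma2 * q.mean(axis=0)                  # sigma^2 E_data[q_j], shape (T,)
    den = err2.T @ q / len(y)                      # E_data[err_i^2 q_j], shape (T, T)
    return np.maximum(num / den - 1.0, 0.0)        # project onto alpha_{ij} >= 0
```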
These are leveraged to achieve the prediction of the effect series. For any given input $x$, the sought eventual predictor is expressed as a pair $(\hat y(x), \hat I(x))$, where $\hat I(x)$ is the predicted time lag and $\hat y(x)$ the predicted value. The associated $L_2$ loss is:

$$L_2(\hat y, \hat I) = \mathbb{E}_{data}\Big[\big(y_{\hat I(x)} - \hat y(x)\big)^2\Big]. \qquad (10)$$

Then it comes:

Proposition 3.1. With the same notations as in Eq. (3), with $\alpha_{ij} = \alpha\delta_{ij}$, $\alpha > 0$, the optimal composite predictor $(y^*, I^*)$ is given by $y^*(x) = \hat y_{I^*(x)}(x)$ with

$$I^*(x) = \arg\max_{i \in \mathcal{T}} \hat p_i(x).$$

Proof. In supplementary material, Appendix B.

3.2 LINEAR STABILITY ANALYSIS

The saddle point equations (5, 7, 8, 9) admit, among others, a degenerate solution, corresponding to $\hat p_i(x) = 1/|\mathcal{T}|$, $\alpha_{ij} = 0$ for all pairs $(i, j)$, with $\sigma^2 = \sigma_0^2$. Informally, the model converges toward this degenerate trivial solution when there is not enough information to build specialized predictors $\hat y_i$. Let us denote by $\Delta y_i^2(x) = (y_i - \hat y_i(x))^2$ the square error made by predictor $\hat y_i$ at $x$, and by

$$\sigma_0^2 = \frac{1}{|\mathcal{T}|}\, \mathbb{E}_{data}\Big[\sum_{i \in \mathcal{T}} \Delta y_i^2(x)\Big]$$

the average of the MSE over the set of predictors $\hat y_i$, $i \in \mathcal{T}$. Let us investigate the conditions under which the degenerate solution may appear, by computing the Hessian of the log-likelihood and its eigenvalues. Under the simplifying assumption $\alpha_{ij} = \alpha\delta_{ij}$, the model involves $2|\mathcal{T}|$ functional parameters ($\hat y$ and $\hat p$) and two hyper-parameters, $\alpha$ and $r = \sigma^2/\sigma_0^2$. After the computation of the Hessian (in supplementary material, Appendix C), the system involves three key statistical quantities; two global ones,

$$C_1[q] = \frac{1}{\sigma_0^2}\, \mathbb{E}_{data}\Big[\sum_{i \in \mathcal{T}} q_i(x, y)\, \Delta y_i^2(x)\Big], \qquad (11)$$

$$C_2[q] = \frac{1}{\sigma_0^4}\, \mathbb{E}_{data}\Big[\sum_{i \in \mathcal{T}} q_i(x, y)\Big(\Delta y_i^2(x) - \sum_{j \in \mathcal{T}} q_j(x, y)\, \Delta y_j^2(x)\Big)^2\Big], \qquad (12)$$

and a local $|\mathcal{T}|$-vector of components

$$u_i[x, q] = \frac{1}{\sigma_0^2}\, \mathbb{E}_{data}\Big[q_i(x, y)\Big(\Delta y_i^2(x) - \sum_{j \in \mathcal{T}} q_j(x, y)\, \Delta y_j^2(x)\Big) \,\Big|\, x\Big].$$

Up to a constant, $C_1$ represents the covariance between the latent variables $\{\tau_i\}$ and the normalized predictor errors. $C_1$ smaller than one indicates a positive correlation between the latent variables and small errors: the smaller, the better. For the degenerate solution, i.e., $q = q_0$ uniform, $C_1[q_0] = 1$, and $C_2[q_0]$ represents the default variability among the prediction errors. $u_i[x, q]$ informally measures the quality of predictor $\hat y_i$ relative to the other ones at $x$; more precisely, a negative value of $u_i[x, q]$ indicates that $\hat y_i$ is doing better than average in the neighborhood of $x$. At a saddle point, the hyper-parameters are given by:

$$\frac{\sigma^2}{\sigma_0^2} = \frac{|\mathcal{T}| - C_1[q]}{|\mathcal{T}| - 1} \qquad \text{and} \qquad \alpha = \frac{|\mathcal{T}|}{|\mathcal{T}| - 1}\, \frac{1 - C_1[q]}{C_1[q]}.$$

The predictors $\hat y$ are decoupled from the rest whenever they are centered, which we assume; the analysis can thus focus on the other parameters. If $\hat p$ is fixed, a saddle point is stable iff

$$C_2[q] < 2\, C_1^2[q] + O\big(1/|\mathcal{T}|\big).$$

In particular, the degenerate solution is unstable if

$$C_2[q_0] > 2\Big(1 - \frac{1}{|\mathcal{T}|}\Big).$$

Note that for errors $y_i - \hat y_i(x)$ i.i.d. centered with variance $\sigma_0^2$ and relative kurtosis $\kappa$ (conditionally on $x$), one has $C_2 = (2 + \kappa)(1 - 1/|\mathcal{T}|)$. Therefore, whenever $\Delta y_i^2(x)$ fluctuates and the relative kurtosis is non-negative, the degenerate solution is unstable and will thus be avoided.

If $\hat p$ is allowed to evolve (after Eq. (9)), the degenerate trivial solution becomes unstable as soon as $C_2[q_0]$ is non-zero, due to the fact that the gradient points in the direction opposite to $u(x)$ (with $d\hat p(x) \propto -C_2[q_0]\, u(x)$), thus rewarding the predictors with lowest errors by increasing their weights. The system is then driven toward other solutions, among which the localized solutions of the form $\hat p_i(x) = \delta_{i, I(x)}$, with an input-dependent index $I(x) \in \mathcal{T}$.
As shown in supplementary material (Appendix C), the maximum likelihood localized solution also minimizes the loss function (Eq. 10). The stability of such localized solutions and the existence of other (non-localized) solutions are left for further work.

4 THE DTLR ALGORITHM

The DTLR algorithm learns both regression models $\hat{\mathbf{y}}(x)$ and $\hat p(x)$ from series $x_t$ and $y_t$, using an alternate optimization of the model parameters and of the model hyper-parameters $\alpha$ and $\sigma^2$, after Eqs. (5, 7, 8, 9). The model search space is that of neural nets, parameterized by their weight vector $\theta$. The inner optimization loop updates $\theta$ using mini-batch stochastic gradient descent. At the end of each epoch, after all mini-batches have been considered, the outer optimization loop computes hyper-parameters $\alpha$ and $\sigma^2$ on the whole data.

    Initialize α and σ;
    it <- 0;
    while it < max do
        for each mini-batch in the epoch do
            θ <- Optimize(L(θ, α, σ²));
        end
        σ² <- σ₀² (|T| − C₁[q]) / (|T| − 1);
        α <- (|T| / (|T| − 1)) (1 − C₁[q]) / C₁[q];
        it <- it + 1;
    end
    Result: model parameters θ = {ŷ, p̂}, hyper-parameters α, σ²

Algorithm 1: The DTLR algorithm

The algorithm code is available in supplementary material and will be made public after the reviewing period. The initialization of hyper-parameters $\alpha$ and $\sigma$ was set using preliminary experiments (same setting for all considered problems: $\alpha \sim U(0.75, 2)$; $\sigma^2 \sim U(10^{-5}, 5)$). The neural architecture implements the predictors $\hat{\mathbf{y}}(x)$ and the weights $\hat p(x)$ on top of a shared feature extractor of input $x$. In the experiments, the feature extractor is a 2-hidden-layer fully connected network. On top of the feature extractor sit the single-layer $\hat y$ and $\hat p$ models, each with $|\mathcal{T}|$ output neurons, with $|\mathcal{T}|$ the size of the chosen domain for the time lag.

5 EXPERIMENTAL SETTING

The goal of the experimental validation is threefold. A first issue regards the accuracy of DTLR, measured from the mean absolute error (MAE), root mean square error (RMSE) and Pearson correlation of the learned DTLR model $(y^*(x_t), I^*(x_t))$. DTLR is compared to the natural baseline defined as the regressor with constant time lag, set to the average of all possible time lags in $\mathcal{T}$. The predictions of DTLR and of the baseline are compared with the ground-truth value of the effect series. The predicted time lag $I^*(x_t)$, however, can only be assessed if the ground-truth time-lag relationship is known. In order to do so, four synthetic problems of increasing difficulty are defined below.

Secondly, the stability and non-degeneracy of the learned model are assessed from the statistical quantities $\sigma_0$ and $C_1$ (section 3.2), compared to the degenerate solution $\hat p_i(x) = 1/|\mathcal{T}|$. For $C_1 < 1$, the model accurately specializes the found predictors $\hat y_i$.

Lastly, and most importantly, DTLR is assessed on the solar wind prediction problem and compared to the best state of the art in space weather.

Synthetic Problems. Four synthetic problems of increasing difficulty are generated using stochastic Langevin dynamics. In all problems, the cause signal $x_t \in \mathbb{R}^{10}$ and the effect signal $y_t$ are generated as follows (with $\eta = 0.02$, $s^2 = 0.7$):

$$x_{t+1} = (1 - \eta)\, x_t + \mathcal{N}(0, s^2) \qquad (13)$$
$$v_t = k\, \|x_t\|_2 + c \qquad (14)$$
$$y_{t + g(x_t)} = f(v_t), \qquad (15)$$

where the time-lag mapping $g(x_t)$ ranges in a time interval of width 20 (except for problem I, where $|\mathcal{T}| = 15$).
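As an illustration, a minimal generator following Eqs. (13)-(15) might look as follows (our Python sketch, not the authors' code; $f$ and $g$ are passed as callables of the intermediate velocity $v_t$, with Problem-I-like defaults, and the per-problem choices are listed in the table below):

```python
import numpy as np

def generate_dtlr_data(n, d=10, eta=0.02, s2=0.7, k=10.0, c=0.0,
                       f=lambda v: v, g=lambda v: 5.0, seed=0):
    """Synthetic cause/effect series following Eqs. (13)-(15).

    The cause x_t follows the discretized Langevin recursion (13),
    v_t is the intermediate 'velocity' (14), and the effect f(v_t) is
    written at index t + g(v_t) of the effect series (15). Defaults
    mimic Problem I (f(v) = v, constant lag of 5); other problems swap
    in the f, g and constants of the table below.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros((n, d))
    for t in range(n - 1):
        x[t + 1] = (1.0 - eta) * x[t] + rng.normal(0.0, np.sqrt(s2), size=d)
    v = k * np.linalg.norm(x, axis=1) + c
    y = np.full(n, np.nan)              # indices never written stay NaN
    for t in range(n):
        idx = t + int(round(g(v[t])))   # delayed index t + g(x_t)
        if idx < n:
            y[idx] = f(v[t])
    return x, y, v
```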
The complexity of the synthetic problems is governed by the amplitude and time-lag functions $f$ and $g$ (more in appendix, Table 2):

| Problem | $f(v_t)$ | $g(x_t)$ | Other |
|---|---|---|---|
| I | $v_t$ | $5$ | $k = 10$, $c = 0$ |
| II | $v_t$ | $100/v_t$ | $k = 1$, $c = 10$ |
| III | $\sqrt{v_t^2 + 2ad}$ | $(\sqrt{v_t^2 + 2ad} - v_t)/a$ | $k = 5$, $a = 5$, $d = 1000$, $c = 100$ |
| IV | $v_t$ | $\exp(v_t)/(1 + \exp(v_t/20))$ | $k = 10$, $c = 40$ |

Solar Wind Speed Prediction. The challenge of predicting the solar wind speed from heliospheric data is due to the non-stationary propagation time of the solar plasma through the interplanetary medium. For the sake of a fair comparison with the best state of the art (Reiss et al., 2019), the same experimental setting is used. The cause series $x_t$ includes the solar magnetic flux tube expansion (FTE, the rate at which the magnetic flux tubes expand between the photosphere and a reference height in the corona at 2.5 solar radii) and the coronal magnetic field strength estimates produced by the Current Sheet Source Surface model (Zhao and Hoeksema, 1995; Poduval and Zhao, 2014; Poduval, 2016), exploiting the hourly magnetogram data recorded by the Global Oscillation Network Group from 2008 to 2016. The effect series, the hourly solar wind data, is available from the OMNI database of the Space Physics Data Facility.[5] After domain knowledge, the time lag ranges from 2 to 5 days, segmented into six-hour intervals (thus $|\mathcal{T}| = 12$). For the $i$-th segment, the "ground truth" solar wind $y_i$ is set to its median value over the 6 hours. DTLR is validated using a nine-fold cross-validation (Table 3 in appendix), where each fold is a continuous period corresponding to a solar rotation.[6]

[5] https://omniweb.gsfc.nasa.gov
[6] The Sun completes a rotation (a Carrington rotation) in approximately 27 days.

6 EMPIRICAL VALIDATION

Table 1 summarizes the DTLR performance on the synthetic and solar wind problems (detailed results are provided in the appendix).

Table 1: DTLR performance: accuracy (MAE and RMSE, the lower the better; Pearson, the higher the better) and stability ($\sigma_0$ and $C_1$, the lower the better). For each accuracy indicator, the DTLR value (9-fold CV), the baseline value and the time-lag error are reported (a dash marks a value that is not available).

| Problem | M.A.E. | R.M.S.E. | Pearson Corr. | $\sigma_0$ | $C_1$ |
|---|---|---|---|---|---|
| I | 8.82 / 21.79 / 0.021 | 12.35 / 28.79 / 0.26 | 0.98 / 0.87 / - | 29.8 | 0.14 |
| II | 10.15 / 27.40 / 0.4 | 13.70 / 35.11 / 0.67 | 0.95 / 0.73 / 0.70 | 26.83 | 0.16 |
| III | 3.17 / 11.01 / 0.17 | 4.63 / 14.99 / 0.42 | 0.98 / 0.79 / 0.84 | 11.84 | 0.09 |
| IV | 3.88 / 12.28 / 0.34 | 5.33 / 15.89 / 0.64 | 0.98 / 0.79 / 0.81 | 12.18 | 0.13 |
| Solar Wind | 56.35 / 66.45 / - | 74.20 / 84.53 / - | 0.6 / 0.41 / - | 76.46 | 0.89 |

6.1 SYNTHETIC PROBLEMS

On the easy Problem I, DTLR predicts the correct time lag for 97.93% of the samples. The higher value of $\sigma_0$ in problems I and II, compared to the other problems, is explained by the higher variance of the effect series $y(t)$. On Problem II, DTLR accurately learns, on average, the inverse relationship between $x_t$, $g(x_t)$ and $y_t$. The time lag is overestimated in the regions with low time lag (i.e., with high velocity), which is blamed on the low sample density in this region, due to the data generation process. Interestingly, Problems III and IV are better handled by DTLR, despite a more complex dynamic time-lag relationship. In both latter cases, however, the model tends to under-estimate the time lag in the high time-lag region and, conversely, to over-estimate it in the low time-lag region.
6.2 THE SOLAR WIND PROBLEM

DTLR finds an operational solar wind model (Table 1), though the significantly higher difficulty of the solar wind problem is witnessed by the $C_1$ value, close to the degenerate value 1. The detailed comparison with the state of the art (Reiss et al., 2019) (Fig. 1, left) shows that DTLR improves on the current best state of the art (on all variants, including ensemble approaches, and noting that median models are notoriously hard to beat). Fig. 1 (right) shows the good correlation between the predicted[7] and the measured solar wind.

| Model | M.A.E. | R.M.S.E. |
|---|---|---|
| WS | 74.09 | 85.27 |
| DCHB | 83.83 | 103.43 |
| WSA | 68.54 | 82.62 |
| Ensemble Median (WS) | 71.52 | 83.36 |
| Ensemble Median (DCHB) | 78.27 | 100.04 |
| Ensemble Median (WSA) | 62.24 | 74.86 |
| Persistence (4 days) | 130.48 | 161.99 |
| Persistence (27 days) | 66.54 | 78.86 |
| Fixed Lag Baseline | 67.33 | 80.39 |
| DTLR | 60.19 | 72.64 |

Figure 1: DTLR on the solar wind problem. Left (a): comparative quantitative assessment w.r.t. the state of the art (Reiss et al., 2019, Table 1), on Carrington rotation 2077. Right (b): qualitative assessment of the prediction, as a scatter chart of predicted vs. measured solar wind speed $v$ (km/s), 9-fold CV.

7 DISCUSSION AND PERSPECTIVES

The contribution of the paper is twofold. Firstly, a new ML setting, Dynamic Time Lag Regression, has been defined, aimed at modelling a varying time-lag dependency between time series. The introduction of this new setting is motivated by an important scientific and practical problem from the domain of space weather, an open problem for over two decades. Secondly, a Bayesian formalization has been proposed to tackle the DTLR problem, relying on a saddle point optimization process. A closed-form analysis of the stability of the training procedure, under simplifying assumptions, has been conducted, yielding a practical alternate optimization formulation implemented in the DTLR algorithm. This algorithm has been successfully validated on synthetic and real-world problems, although some bias toward the mean has been detected in some cases.

On the methodological side, this work opens a short-term perspective (handling the bias) and a longer-term one: extending the proposed nested inference procedure and integrating the model selection step within the inference architecture. The challenge is to provide the algorithm with the means of assessing online the stability and/or the degeneracy of the learning trajectory. Regarding the motivating solar wind prediction application, a next step consists of enriching the data sources and the description of the cause series $x_t$, typically by directly using the solar images. Another perspective is to consider other applications of the general DTLR setting, e.g., the fine-grained modelling of diffusion phenomena.

[7] The predicted values, produced every 6 hours, are interpolated for comparison with the hourly measured solar wind.

REFERENCES

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proc. ICLR 2015, 2015.

Céline Brouard, Huibin Shen, Kai Dührkop, Florence d'Alché-Buc, Sebastian Böcker, and Juho Rousu. Fast metabolite identification with input output kernel regression. Bioinformatics, 32(12):i28-i36, 2016. doi: 10.1093/bioinformatics/btw246. URL https://doi.org/10.1093/bioinformatics/btw246.

Paul Gaskell, Frank McGroarty, and Thanassis Tiropanis. Signal diffusion mapping: Optimal forecasting with time-varying lags. Journal of Forecasting, 35(1):70-85, 2015.
C. W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37(3):424-438, 1969.

Alex Graves. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013. URL http://arxiv.org/abs/1308.0850.

S. Haaland, C. Munteanu, and B. Mailyan. Solar wind propagation delay: Comment on "Minimum variance analysis-based propagation of the solar wind observations: Application to real-time global magnetohydrodynamic simulations" by A. Pulkkinen and L. Rastätter. Space Weather, 8(6), 2010.

Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured attention networks. In Proc. ICLR 2017, 2017.

C. Munteanu, S. Haaland, B. Mailyan, M. Echim, and K. Mursula. Propagation delay of solar wind discontinuities: Comparing different methods and evaluating the effect of wavelet denoising. Journal of Geophysical Research: Space Physics, 118(7):3985-3994, 2013.

P. D. Nooteboom, Q. Y. Feng, C. López, E. Hernández-García, and H. A. Dijkstra. Using network theory and machine learning to predict El Niño. Earth System Dynamics, 9(3):969-983, 2018. doi: 10.5194/esd-9-969-2018. URL https://www.earth-syst-dynam.net/9/969/2018/.

D. Odstrčil. Modeling 3-D solar wind structure. Advances in Space Research, 32(4):497-506, 2003. ISSN 0273-1177. doi: 10.1016/S0273-1177(03)00332-6. URL http://www.sciencedirect.com/science/article/pii/S0273117703003326.

D. Odstrčil and V. J. Pizzo. Three-dimensional propagation of coronal mass ejections (CMEs) in a structured solar wind flow: 1. CME launched within the streamer belt. Journal of Geophysical Research: Space Physics, 104(A1):483-492, 1999a. doi: 10.1029/1998JA900019. URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/1998JA900019.

D. Odstrčil and V. J. Pizzo. Three-dimensional propagation of coronal mass ejections (CMEs) in a structured solar wind flow: 2. CME launched adjacent to the streamer belt. Journal of Geophysical Research: Space Physics, 104(A1):493-503, 1999b. doi: 10.1029/1998JA900038. URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/1998JA900038.

D. Odstrčil, Z. Smith, and M. Dryer. Distortion of the heliospheric plasma sheet by interplanetary shocks. Geophysical Research Letters, 23(18):2521-2524, 1996. doi: 10.1029/96GL00159. URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/96GL00159.

D. Odstrčil, P. Riley, and X. P. Zhao. Numerical simulation of the 12 May 1997 interplanetary CME event. Journal of Geophysical Research: Space Physics, 109(A2), 2004. doi: 10.1029/2003JA010135. URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2003JA010135.

Diego Pennacchioli, Michele Coscia, Salvatore Rinzivillo, Fosca Giannotti, and Dino Pedreschi. The retail market as a complex system. EPJ Data Science, 3(1):33, 2014.

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, 2017.

B. Poduval. Controlling influence of magnetic field on solar wind outflow: An investigation using the Current Sheet Source Surface model. The Astrophysical Journal, 827(1):L6, 2016. doi: 10.3847/2041-8205/827/1/l6. URL https://iopscience.iop.org/article/10.3847/2041-8205/827/1/L6/meta.

B. Poduval and X. P. Zhao. Validating solar wind prediction using the Current Sheet Source Surface model. The Astrophysical Journal, 782(2):L22, 2014. doi: 10.1088/2041-8205/782/2/l22. URL https://iopscience.iop.org/article/10.1088/2041-8205/782/2/L22/meta.
Syama Sundar Rangapuram, Matthias W. Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep state space models for time series forecasting. In NeurIPS 2018, pages 7796-7805, 2018.

Martin A. Reiss, Peter J. MacNeice, Leila M. Mays, Charles N. Arge, Christian Möstl, Ljubomir Nikolic, and Tanja Amerstorfer. Forecasting the ambient solar wind with numerical models. I. On the implementation of an operational framework. The Astrophysical Journal Supplement Series, 240(2):35, 2019.

J. Runge. Causal network reconstruction from time series: From theoretical assumptions to practical estimation. Chaos: An Interdisciplinary Journal of Nonlinear Science, 28(7):075310, 2018. doi: 10.1063/1.5025050. URL https://doi.org/10.1063/1.5025050.

H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43-49, 1978.

Xingjian Shi and Dit-Yan Yeung. Machine learning for spatiotemporal sequence forecasting: A survey. arXiv, abs/1808.06865, 2018.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104-3112. Curran Associates, Inc., 2014.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998-6008. Curran Associates, Inc., 2017.

Y.-M. Wang and N. R. Sheeley, Jr. Solar wind speed and coronal flux-tube expansion. The Astrophysical Journal, 355:726, 1990. doi: 10.1086/168805.

X. Zhao and J. T. Hoeksema. Predicting the heliospheric magnetic field using the current sheet-source surface model. Advances in Space Research, 16(9):181-184, 1995.

Wei-Xing Zhou and Didier Sornette. Non-parametric determination of real-time lag structure between two time series: The optimal thermal causal path method with applications to economic data. Journal of Macroeconomics, 28(1):195-224, 2006.

APPENDIX A LOG-LIKELIHOOD OF THE LATENT MODEL (3)

A.1 DIRECT COMPUTATION

Due to the single-effect constraint (4), the mixture model (3) can be expressed simply as

$$P(\mathbf{y}|x) = \sum_{i \in \mathcal{T}} \hat p_i(x) \prod_{j \in \mathcal{T}} \sqrt{\frac{1 + \alpha_{ji}}{2\pi\sigma^2}}\; e^{-\frac{1 + \alpha_{ji}}{2\sigma^2}(y_j - \hat y_j(x))^2}$$

$$\phantom{P(\mathbf{y}|x)} = \prod_{j \in \mathcal{T}} \frac{e^{-\frac{1}{2\sigma^2}(y_j - \hat y_j(x))^2}}{\sqrt{2\pi\sigma^2}}\;\; \sum_{i \in \mathcal{T}} \hat p_i(x)\, \exp\Big(-\frac{1}{2\sigma^2}\sum_{j \in \mathcal{T}} \alpha_{ji}\big(y_j - \hat y_j(x)\big)^2 + \frac{1}{2}\sum_{j \in \mathcal{T}} \log(1 + \alpha_{ji})\Big).$$

Let $\theta \stackrel{def}{=} (\hat{\mathbf{y}}, \hat p, \sigma, \alpha)$ denote the parameters of the model, and consider the probability that predictor $\hat y_i$ is the correct one conditionally on a pair of observations $(x, y)$:

$$q_i(x, y) = P(\tau_i = 1|x, y) = \frac{1}{Z(x, y|\theta)}\, \hat p_i(x)\, \exp\Big(-\frac{1}{2\sigma^2}\sum_{j \in \mathcal{T}} \alpha_{ji}\big(y_j - \hat y_j(x)\big)^2 + \frac{1}{2}\sum_{j \in \mathcal{T}} \log(1 + \alpha_{ji})\Big),$$

$$Z(x, y|\theta) = \sum_{i \in \mathcal{T}} \hat p_i(x)\, \exp\Big(-\frac{1}{2\sigma^2}\sum_{j \in \mathcal{T}} \alpha_{ji}\big(y_j - \hat y_j(x)\big)^2 + \frac{1}{2}\sum_{j \in \mathcal{T}} \log(1 + \alpha_{ji})\Big).$$

This gives immediately

$$\mathcal{L}[\{(x, y)\}_{data}|\theta] = -|\mathcal{T}|\log(\sigma) - \mathbb{E}_{data}\Big[\sum_{i \in \mathcal{T}} \frac{1}{2\sigma^2}\big(y_i - \hat y_i(x)\big)^2 - \log Z(x, y|\theta)\Big].$$

A.2 LARGE DEVIATION ARGUMENT

Even though the log-likelihood can be obtained by direct summation, for the sake of generality we show how it results from a large deviation principle. Assume that the number of learning samples tends to infinity, so that in a small volume $dv = dx\, dy$ around a given joint configuration $(x, y)$, the number $N_{x,y}$ of data points becomes large.
Restricting the likelihood to this subset of the data yields:

$$L_{x,y} = \prod_{m=1}^{N_{x,y}} \sum_{\tau^{(m)}} \hat p(\tau^{(m)}|x) \prod_{i \in \mathcal{T}} \frac{1}{\sqrt{2\pi}\, \sigma_i(\tau^{(m)})}\, \exp\Big(-\frac{(y_i - \hat y_i(x))^2}{2\,\sigma_i(\tau^{(m)})^2}\Big).$$

Upon introducing the relative frequencies

$$q_i(x, y) = \frac{1}{N_{x,y}} \sum_{m=1}^{N_{x,y}} \tau_i^{(m)}, \qquad \text{satisfying} \quad \sum_{i \in \mathcal{T}} q_i(x, y) = 1,$$

the sum over the $\tau_i^{(m)}$ is replaced by a sum over these new variables, with the summand obeying a large deviation principle,

$$L_{x,y} \asymp \sum_q \exp\big(-N_{x,y}\, F_{x,y}[q]\big),$$

where the rate function reads

$$F_{x,y}[q] = |\mathcal{T}|\log(\sigma) + \sum_{i \in \mathcal{T}} \Big[\big(y_i - \hat y_i(x)\big)^2\, \frac{1 + \sum_{j \in \mathcal{T}} \alpha_{ij} q_j}{2\sigma^2} - \frac{1}{2}\sum_{j \in \mathcal{T}} q_j \log(1 + \alpha_{ij}) + q_i \log\frac{q_i}{\hat p_i(x)}\Big].$$

Taking the saddle point over the $q_i$ yields, as a function of $(x, y)$, expression (5). Inserting this into $F$ and taking the average over the data set yields the log-likelihood (6) with opposite sign:

$$\mathcal{L}[\{(x, y)\}_{data}|\theta] = -\mathbb{E}_{data}\big[F_{x,y}[q(x, y)]\big].$$

A.3 SADDLE POINT EQUATIONS

Now we turn to the self-consistent equations relating the parameters $\theta$ of the model at a saddle point of the log-likelihood function. First, the optimization of the predictors $\hat y$ yields:

$$\frac{\partial \mathcal{L}}{\partial \hat y_i(x)} = \frac{1}{\sigma^2}\, \mathbb{E}_{data}\Big[\big(y_i - \hat y_i(x)\big)\Big(1 + \sum_{j \in \mathcal{T}} \alpha_{ij}\, q_j(x, y)\Big) \,\Big|\, x\Big].$$

Then the optimization of $\hat p$ gives:

$$\frac{\partial \mathcal{L}}{\partial \hat p_i(x)} = \mathbb{E}_{data}\Big[\frac{q_i(x, y)}{\hat p_i(x)} \,\Big|\, x\Big] - \lambda(x) = \frac{1}{\hat p_i(x)}\, \mathbb{E}_{data}\big[q_i(x, y) \,\big|\, x\big] - \lambda(x),$$

with $\lambda(x)$ a Lagrange multiplier ensuring that $\sum_i \hat p_i(x) = 1$. This gives

$$\hat p_i(x) = \frac{1}{\lambda(x)}\, \mathbb{E}_{data}\big[q_i(x, y) \,\big|\, x\big], \qquad \sum_{i \in \mathcal{T}} \hat p_i(x) = 1 \;\Longrightarrow\; \lambda(x) = 1 \quad \forall x,$$

in order to fulfil the normalization constraint, yielding finally expression (9). Finally, the optimization of $\alpha$ reads:

$$\frac{\partial \mathcal{L}}{\partial \alpha_{ij}} = \frac{1}{2(1 + \alpha_{ij})}\, \mathbb{E}_{data}\big[q_j(x, y)\big] - \frac{1}{2\sigma^2}\, \mathbb{E}_{data}\big[\big(y_i - \hat y_i(x)\big)^2 q_j(x, y)\big].$$

APPENDIX B PROOF OF PROPOSITION 3.1

Given a candidate index function $I(x)$, we associate the point-like measure $p_i(x) = \delta_{i, I(x)}$. Written in terms of $p$, the loss function reads

$$L_2(\hat y, p) = \mathbb{E}_{x,y}\Big[\sum_{i \in \mathcal{T}} p_i(x)\big(y_i - \hat y(x)\big)^2\Big].$$

Under (3) (with $\alpha_{ij} = \alpha\delta_{ij}$), the loss is equal to

$$L_2(\hat y, p) = \mathbb{E}_x\Big[\sum_{i \in \mathcal{T}} p_i(x)\Big(\big(\hat y_i(x) - \hat y(x)\big)^2 - \hat p_i(x)\, \frac{\alpha\sigma^2}{1 + \alpha}\Big)\Big] + \sigma^2.$$

The minimization w.r.t. $\hat y$ yields

$$\hat y(x) = \sum_{i \in \mathcal{T}} p_i(x)\, \hat y_i(x). \qquad (16)$$

In turn, the loss being a convex combination as a function of the $p_i$, its minimization yields

$$p_i(x) = \delta_{i, I(x)}, \qquad (17)$$

$$I(x) = \arg\min_{i \in \mathcal{T}} \Big[\big(\hat y_i(x) - \hat y(x)\big)^2 - \hat p_i(x)\, \frac{\alpha\sigma^2}{1 + \alpha}\Big]. \qquad (18)$$

Combining equations (16, 17, 18), we get

$$I(x) = \arg\max_{i \in \mathcal{T}} \hat p_i(x),$$

which concludes the proof.

APPENDIX C STABILITY ANALYSIS

The analysis is restricted for simplicity to the case $\alpha_{ij} = \alpha\delta_{ij}$. The log-likelihood as a function of $r = \sigma^2/\sigma_0^2$ and $\beta = \alpha/r$, after inserting the optimal $q = q(x, y)$, reads in that case

$$L(r, \beta) = -\frac{|\mathcal{T}|}{2}\log(r) - \frac{|\mathcal{T}|}{2r} + \frac{1}{2}\log(1 + r\beta) + \mathbb{E}_{data}\Big[\log(Z) - \lambda(x)\sum_{i \in \mathcal{T}} \hat p_i(x)\Big],$$

with

$$Z = \sum_{i \in \mathcal{T}} \hat p_i(x)\, \exp\Big(-\frac{\beta}{2\sigma_0^2}\, \Delta y_i^2(x)\Big),$$

and where $\lambda(x)$ is a Lagrange multiplier which has been added to impose the normalization of $\hat p$. The gradient reads

$$\frac{\partial L}{\partial r} = \frac{1}{2r^2}\Big(|\mathcal{T}|(1 - r) + \frac{\beta r^2}{1 + r\beta}\Big), \qquad \frac{\partial L}{\partial \beta} = \frac{r}{2(1 + r\beta)} - \frac{1}{2}\, C_1[q],$$

$$\frac{\partial L}{\partial \hat y_i(x)} = \frac{1}{\sigma^2}\, \mathbb{E}_{data}\big[\big(y_i - \hat y_i(x)\big)\big(1 + \alpha q_i(x, y)\big) \,\big|\, x\big], \qquad \frac{\partial L}{\partial \hat p_i(x)} = \frac{\mathbb{E}_{data}\big[q_i(x, y)|x\big]}{\hat p_i(x)} - \lambda(x),$$

with

$$C_1[q] = \frac{1}{\sigma_0^2}\, \mathbb{E}_{data}\Big[\sum_{i \in \mathcal{T}} q_i(x, y)\, \Delta y_i^2(x)\Big].$$

This leads to the following relations at the saddle point:

$$r = \frac{|\mathcal{T}| - C_1[q]}{|\mathcal{T}| - 1}, \qquad \alpha = \frac{|\mathcal{T}|}{|\mathcal{T}| - 1}\, \frac{1 - C_1[q]}{C_1[q]},$$

$$\hat y_i(x) = \frac{\mathbb{E}_{data}\big[y_i\big(1 + \alpha q_i(x, y)\big)|x\big]}{\mathbb{E}_{data}\big[1 + \alpha q_i(x, y)|x\big]}, \qquad \hat p_i(x) = \mathbb{E}_{data}\big[q_i(x, y)|x\big].$$

Let us now compute the Hessian. It is easy to see that the block corresponding to the predictors $\hat y$ decouples from the rest as soon as these predictors are centered. With

$$C_2[q] = \frac{1}{\sigma_0^4}\, \mathbb{E}_{data}\Big[\sum_{i \in \mathcal{T}} q_i(x, y)\Big(\Delta y_i^2(x) - \sum_{j \in \mathcal{T}} q_j(x, y)\, \Delta y_j^2(x)\Big)^2\Big],$$

the remaining second derivatives read (evaluated at the saddle point, where $1 + r\beta = r/C_1[q]$):

$$\frac{\partial^2 L}{\partial r^2} = \frac{1}{2r^2}\Big(|\mathcal{T}| - \frac{2|\mathcal{T}|}{r} - \beta^2 C_1^2[q]\Big), \qquad \frac{\partial^2 L}{\partial r\, \partial \beta} = \frac{1}{2r^2}\, C_1^2[q], \qquad \frac{\partial^2 L}{\partial \beta^2} = \frac{1}{4}\big(C_2[q] - 2C_1^2[q]\big),$$

$$\frac{\partial^2 L}{\partial \hat p_i(x)\, \partial \hat p_j(x)} = -\frac{\mathbb{E}_{data}\big[q_i(x, y)\, q_j(x, y)|x\big]}{\hat p_i(x)\, \hat p_j(x)}, \qquad \frac{\partial^2 L}{\partial r\, \partial \hat p_i(x)} = 0, \qquad \frac{\partial^2 L}{\partial \beta\, \partial \hat p_i(x)} = -\frac{u_i[x, q]}{2\,\hat p_i(x)},$$

where

$$u_i[x, q] \stackrel{def}{=} \frac{1}{\sigma_0^2}\, \mathbb{E}_{data}\Big[q_i(x, y)\Big(\Delta y_i^2(x) - \sum_{j \in \mathcal{T}} q_j(x, y)\, \Delta y_j^2(x)\Big) \,\Big|\, x\Big].$$
There are two blocks in this Hessian: the one corresponding to $r$ and $\beta$, and the one corresponding to the derivatives with respect to the $\hat p_i$. The stability of the first one depends on the sign of $C_2[q] - 2C_1^2[q]$ for $|\mathcal{T}|$ large, while the second block is always stable, being minus the average of the exterior product of the vector $\big(q_1(x, y)/\hat p_1(x), \ldots, q_{|\mathcal{T}|}(x, y)/\hat p_{|\mathcal{T}|}(x)\big)$ with itself.

At the degenerate point $\alpha = 0$, $r = 1$, $\hat p_i = 1/|\mathcal{T}|$, the Hessian simplifies as follows. Denote by

$$d\eta = dr\, e_1 + d\beta\, e_2 + \int dx \sum_{i \in \mathcal{T}} d\hat p_i(x)\, e_{i+2}(x)$$

a given vector of perturbations, decomposed onto a set of unit tangent vectors, $e_1$ and $e_2$ being respectively associated to $r$ and $\beta$, while $e_{i+2}(x)$ is associated to $\hat p_i(x)$ for all $i \in \mathcal{T}$ and $x \in \mathcal{X}$. Denote

$$u = \int dx \sum_{i \in \mathcal{T}} u_i[x]\, e_{i+2}(x), \qquad C_2 = \frac{1}{|\mathcal{T}|\, \sigma_0^4}\, \mathbb{E}_{data}\Big[\sum_{j \in \mathcal{T}} \big(\Delta y_j^2(x) - \sigma_0^2\big)^2\Big], \qquad u_i[x] = \frac{1}{\sigma_0^2}\, \mathbb{E}_{data}\big[\Delta y_i^2(x) - \sigma_0^2 \,\big|\, x\big],$$

and $v(x) = \sum_{i \in \mathcal{T}} e_{i+2}(x)$. With these notations, the Hessian takes the form

$$H = -\frac{|\mathcal{T}|}{2}\, e_1 e_1^t + \frac{1}{2}\big(e_1 e_2^t + e_2 e_1^t\big) + \Big(\frac{C_2}{4} - \frac{1}{2}\Big) e_2 e_2^t - \frac{1}{2}\big(u\, e_2^t + e_2\, u^t\big) - \int dx\, v(x)\, v^t(x).$$

In fact we are interested in the eigenvalues of $H$ in the subspace of deformations which conserve the normalization of $\hat p$, i.e., orthogonal to $v(x)$, thereby given by $\eta = \eta_1 e_1 + \eta_2 e_2 + \eta_3 u$. In this subspace, $H$ reduces to a $3 \times 3$ matrix whose entries involve $|\mathcal{T}|$, $C_2$ and $M|\mathcal{T}| C_2$, where $M$ is the number of data points; this results from the fact that

$$\sum_{i \in \mathcal{T}} \int dx\, u_i[x]^2 = \frac{M}{\sigma_0^4}\, \mathbb{E}_{data}\Big[\sum_{i \in \mathcal{T}} \big(\Delta y_i^2(x) - \sigma_0^2\big)^2\Big] = M\, |\mathcal{T}|\, C_2,$$

because $\mathbb{E}_{data}(\cdot|x)$, as a function of $x$, is actually a point-wise function on the data. If $|u|^2 > 0$, or if $|u| = 0$ and $1 + |\mathcal{T}|(C_2/2 - 1) > 0$, there is at least one positive eigenvalue. Let $\Lambda$ be such an eigenvalue. After eliminating $dr$ and $d\beta$ from the eigenvalue equations in $d\eta$, the deformation along this mode verifies

$$d\eta \propto \Lambda\, e_1 + \Lambda\big(|\mathcal{T}| + \Lambda\big)\, e_2 - M|\mathcal{T}|\big(|\mathcal{T}| + \Lambda\big)\, C_2\, u,$$

which corresponds to increasing $r$ and $\alpha$ while decreasing, for each $x$, the $\hat p_i$ having the highest mean relative error $u_i[x]$.

Concerning solutions for which $\hat p_i(x) = \delta_{i, I(x)}$ is concentrated on some index $I(x)$, the analysis is more complex. In that case, $C_2[p] = 0$ and $C_1[p] > 0$. The $(r, \beta)$ sector has two negative eigenvalues, while the $\hat p$ block is (minus) a covariance matrix, so it has negative eigenvalues as well. The coupling between these two blocks could, however, in principle generate instabilities in some cases. Still, the log-likelihood of such solutions reads

$$L = -\frac{|\mathcal{T}|}{2}\log(\sigma^2) + \frac{1}{2}\log(1 + \alpha) - \frac{1}{2\sigma^2}\, \mathbb{E}_{data}\Big[\sum_{i \in \mathcal{T}} \Delta y_i^2(x)\Big] - \frac{\alpha}{2\sigma^2}\, \mathbb{E}_{data}\big[\Delta y_{I(x)}^2(x)\big],$$

so we get the following optimal solution:

$$\sigma^2 = \frac{1}{|\mathcal{T}|}\, \mathbb{E}_{data}\Big[\sum_{i \in \mathcal{T}} \Delta y_i^2(x)\Big], \qquad \frac{1}{1 + \alpha} = \frac{1}{\sigma^2}\, \mathbb{E}_{data}\big[\Delta y_{I(x)}^2(x)\big], \qquad I(x) = \arg\min_{i \in \mathcal{T}} \mathbb{E}_{data}\big[\Delta y_i^2(x) \,\big|\, x\big].$$

Figure 2: Geometry of the CSSS model (Zhao and Hoeksema, 1995). The spherical surface with radius $R_{cp}$ is the cusp surface, representing the locus of cusp points of helmet streamers and the location of the inner corona. The source surface (radius $R_{ss}$) is free to be placed anywhere outside the cusp surface. Figure reproduced from Zhao and Hoeksema (1995).

APPENDIX D EXPERIMENTS: ADDITIONAL DETAILS

Here we provide some additional details and context for the experimental validation of the DTLR methodology described in section 5. Table 2 provides information about the datasets used in the synthetic and solar wind prediction problems.[8] Sections D.1.1 and D.1.2 give additional plots for evaluating the experimental results. For the solar wind prediction task, the solar wind data was mapped into standardized Gaussian space using a quantile-quantile and inverse probit mapping. Nine-fold cross-validation was performed using the splits specified in Table 3.

[8] In the solar wind problem, the training and test data sizes correspond to one cross-validation split.
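A minimal sketch of such a Gaussianization step (our illustration, assuming scikit-learn's quantile transformer; the paper does not specify the implementation used) is:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Empirical quantile-quantile map composed with the inverse probit
# (inverse Gaussian CDF): output_distribution="normal" does exactly this.
qt = QuantileTransformer(output_distribution="normal", n_quantiles=1000)

y_raw = np.loadtxt("omni_solar_wind.txt")          # hypothetical input file
y_gauss = qt.fit_transform(y_raw.reshape(-1, 1))   # standardized Gaussian space
# ... train DTLR on y_gauss, then map predictions back to km/s ...
y_back = qt.inverse_transform(y_gauss)
```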
To compare the DTLR results with the state of the art in solar wind forecasting, we used the results from Reiss et al. (2019, Table 1). Since Reiss et al. (2019) compared the various forecasting methods on only one solar rotation (first row of Table 3), the comparison of these results with DTLR can be considered a preliminary examination. Nevertheless, the results presented in Table 1a show encouraging signs for the competitiveness and usefulness of the DTLR method.

It is well established that the solar wind speed observed near the Earth's orbit (at the L1 Lagrange point, about a million kilometers upstream in the solar wind) bears an inverse correlation with the rate (FTE) at which the magnetic flux tubes expand between the photosphere and a reference height called the source surface, typically placed at 2.5 solar radii in many coronal models (Wang and Sheeley, 1990). That is, regions of the Sun with smaller FTE give rise to fast wind, while regions with larger values of FTE emanate slow solar wind. This inverse correlation forms the basis of the current solar wind prediction techniques (Poduval and Zhao, 2014; Poduval, 2016), including the WSA-ENLIL model (Wang and Sheeley, 1990; Odstrčil et al., 1996; Odstrčil and Pizzo, 1999a;b; Odstrčil, 2003; Odstrčil et al., 2004) of NOAA/SWPC. Mathematically, the FTE is defined as:

$$\mathrm{FTE} = \Big(\frac{R}{R_{ss}}\Big)^2\, \frac{B_r(\mathrm{phot})}{B_r(\mathrm{ss})}, \qquad (19)$$

where $B_r(\mathrm{phot})$ and $B_r(\mathrm{ss})$ are the radial components of the magnetic field at the photosphere and at the source surface, and $R$ and $R_{ss}$ are the respective radii. The coronal magnetic field is computed using the CSSS model, the geometry of which is shown in Figure 2, by extrapolating the photospheric magnetic field measured by various ground-based observatories and spacecraft. For the present work, we used the photospheric field measured by the Global Oscillation Network Group (GONG), a network of ground-based solar observatories located at different places over the globe.
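Eq. (19) amounts to a one-line computation; the following sketch (ours, with illustrative names and values) makes the inverse-correlation intuition concrete:

```python
def flux_tube_expansion(br_phot, br_ss, r=1.0, r_ss=2.5):
    """Flux-tube expansion factor, Eq. (19).

    br_phot, br_ss : radial magnetic field at the photosphere and at the
                     source surface (same units).
    r, r_ss        : the respective radii, in solar radii (source surface
                     at 2.5 solar radii, as in the CSSS setup above).
    """
    return (r / r_ss) ** 2 * br_phot / br_ss

# Example: a flux tube whose field drops from 5 G to 0.1 G expands strongly,
fte = flux_tube_expansion(5.0, 0.1)   # (1/2.5)**2 * 50 = 8.0
# and would thus be associated with slower solar wind than one with small FTE.
```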
Table 2: Synthetic and real-world problems.

| Problem | # train | # test | d | \|T\| |
|---|---|---|---|---|
| I | 10,000 | 2,000 | 10 | 15 |
| II | 10,000 | 2,000 | 10 | 20 |
| III | 10,000 | 2,000 | 10 | 20 |
| IV | 10,000 | 2,000 | 10 | 20 |
| Solar Wind | 77,367 | 2,205 | 374 | 12 |

Table 3: Cross-validation splits used to evaluate DTLR on the solar wind forecasting task.

| Split Id | Carrington Rotation | Start | End |
|---|---|---|---|
| 1 | 2077 | 2008/11/20 07:00:04 | 2008/12/17 14:38:34 |
| 2 | 2090 | 2009/11/09 20:33:43 | 2009/12/07 04:03:59 |
| 3 | 2104 | 2010/11/26 17:32:44 | 2010/12/24 01:15:56 |
| 4 | 2117 | 2011/11/16 07:04:41 | 2011/12/13 14:39:28 |
| 5 | 2130 | 2012/11/04 20:39:43 | 2012/12/02 04:06:23 |
| 6 | 2143 | 2013/10/25 10:17:52 | 2013/11/21 17:36:35 |
| 7 | 2157 | 2014/11/11 07:09:56 | 2014/12/08 14:41:02 |
| 8 | 2171 | 2015/11/28 04:09:27 | 2015/12/25 11:53:33 |
| 9 | 2184 | 2016/11/16 17:41:04 | 2016/12/14 01:16:43 |

D.1 SUPPLEMENTARY PLOTS

D.1.1 SYNTHETIC PROBLEMS

Figure 3: Problem II results. (a) Goodness of fit of the output y(x) (predicted vs. actual output). (b) Goodness of fit of the time lag (predicted vs. actual time lag). (c) Error of the time-lag prediction. (d) Output vs. time-lag relationship (predicted and actual).

Figure 4: Problem III results, with the same four panels (a)-(d) as Figure 3.

Figure 5: Problem IV results, with the same four panels (a)-(d) as Figure 3.

D.1.2 SOLAR WIND PREDICTION

Figure 6: Solar wind prediction, reconstructed time series (actual vs. predicted hourly solar wind speed). (a) Hourly forecasts for the period 2008-11-20 07:00 to 2008-12-17 14:00. (b) Hourly forecasts for the period 2016-11-16 17:00 to 2016-12-14 01:00.

APPENDIX E NEURAL NETWORK ARCHITECTURE DETAILS
Table 4: Network architecture details.

| Problem | # Hidden layers | Layer sizes | Activations |
|---|---|---|---|
| I | 2 | [40, 40] | [ReLU, Sigmoid] |
| II | 2 | [40, 40] | [ReLU, Sigmoid] |
| III | 2 | [40, 40] | [ReLU, Sigmoid] |
| IV | 2 | [60, 40] | [ReLU, Sigmoid] |
| Solar Wind | 2 | [50, 50] | [ReLU, Sigmoid] |

Figure 7: Architecture of the neural network, specified by the number of units $(n_v, n_{h_1}, n_{h_2}, 2|\mathcal{T}|)$ in each layer: the input $x$ feeds two fully connected hidden layers, on top of which the $\hat y$ and $\hat p$ heads produce the composite prediction $(\hat y(x), \hat I(x))$.
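For concreteness, a PyTorch-style sketch of this two-headed architecture follows (our reading of Table 4 and Figure 7; in particular, the softmax normalization of the $\hat p$ head is our assumption, and the original implementation may differ):

```python
import torch
import torch.nn as nn

class DTLRNet(nn.Module):
    """Shared feature extractor with two |T|-dimensional heads:
    the predictors y_hat_i(x) and the time-lag weights p_hat_i(x)."""

    def __init__(self, n_in, n_lags, n_h1=50, n_h2=50):
        super().__init__()
        # Two fully connected hidden layers, ReLU then Sigmoid (Table 4).
        self.features = nn.Sequential(
            nn.Linear(n_in, n_h1), nn.ReLU(),
            nn.Linear(n_h1, n_h2), nn.Sigmoid(),
        )
        self.y_head = nn.Linear(n_h2, n_lags)   # y_hat_i(x), i in T
        self.p_head = nn.Linear(n_h2, n_lags)   # logits of p_hat_i(x)

    def forward(self, x):
        h = self.features(x)
        y_hat = self.y_head(h)
        p_hat = torch.softmax(self.p_head(h), dim=-1)  # sums to 1 over lags
        return y_hat, p_hat

# Composite prediction of Proposition 3.1: predict at the most probable lag.
net = DTLRNet(n_in=374, n_lags=12)        # solar wind sizes (Tables 2 and 4)
y_hat, p_hat = net(torch.randn(8, 374))   # a dummy batch
i_star = p_hat.argmax(dim=-1)             # I*(x) = argmax_i p_hat_i(x)
y_star = y_hat.gather(-1, i_star[:, None]).squeeze(-1)
```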