Published as a conference paper at ICLR 2025

IN-CONTEXT TIME SERIES PREDICTOR

Jiecheng Lu, Yan Sun, Shihao Yang
Georgia Institute of Technology
{jlu414,yansun}@gatech.edu, shihao.yang@isye.gatech.edu

ABSTRACT

Recent Transformer-based large language models (LLMs) demonstrate in-context learning ability to perform various functions based solely on the provided context, without updating model parameters. To fully utilize the in-context capabilities in time series forecasting (TSF) problems, unlike previous Transformer-based or LLM-based time series forecasting methods, we reformulate "time series forecasting tasks" as input tokens by constructing a series of (lookback, future) pairs within the tokens. This method aligns more closely with the inherent in-context mechanisms and is more parameter-efficient, without the need for pre-trained LLM parameters. Furthermore, it addresses issues such as overfitting in existing Transformer-based TSF models, consistently achieving better performance across full-data, few-shot, and zero-shot settings compared to previous architectures.¹

1 INTRODUCTION

Transformer-based large language models (LLMs) have significantly impacted various research and application areas (Brown et al., 2020). Their inherent in-context learning (ICL) capabilities, highlighted by previous studies (Müller et al., 2022; Min et al., 2022; Wei et al., 2023; Xie et al., 2022), allow them to adapt and generalize from context examples provided in input prompts without any parameter updates. This enables LLMs to effectively handle few-shot and zero-shot tasks.
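The reformulation described above, turning a raw series into a sequence of (lookback, future) pair tokens with the target task appended last, can be illustrated with a minimal NumPy sketch. The function name, flattened token layout, and zero-padding of the unknown future are our own illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def build_icl_tokens(series, lookback, horizon):
    """Slice a univariate series into (lookback, future) pair tokens.

    Each pair is flattened into one token vector of length
    lookback + horizon. The final token is the target forecasting
    task, whose future part is zero-padded (unknown at prediction
    time). Illustrative sketch, not the paper's code.
    """
    L = len(series)
    tokens = []
    # Context examples: windows where the ground-truth future is known.
    for start in range(L - lookback - horizon + 1):
        lb = series[start:start + lookback]
        fut = series[start + lookback:start + lookback + horizon]
        tokens.append(np.concatenate([lb, fut]))
    # Target task: the most recent lookback window, future filled with zeros.
    target = np.concatenate([series[L - lookback:], np.zeros(horizon)])
    tokens.append(target)
    return np.stack(tokens)  # shape: (num_tokens, lookback + horizon)

series = np.arange(20, dtype=float)
tokens = build_icl_tokens(series, lookback=8, horizon=4)
```

Under this layout, a Transformer sees a sequence of solved forecasting examples followed by the unsolved target, mirroring the (input, label) prompt structure of standard ICL.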
Recent research on ICL (Zhang et al., 2023; Garg et al., 2022; Akyürek et al., 2023; Li et al., 2023; Dai et al., 2023; Bai et al., 2023) has shown that Transformers can adaptively learn to perform various functions, including linear predictors, shallow MLPs, gradient descent, algorithm selection, etc., based on a series of (input, label) pairs as input tokens, as shown in Figure 1 (a).

Figure 1: Overview of in-context TSF learning in our setup. (a) In-context learning: context example tokens (ground truth) followed by target token(s). (b) In-context Time Series Predictor: context forecasting example tokens (ground truth) followed by target token(s).

Time series forecasting (TSF) is critical in fields like epidemiology, finance, and traffic (Kaastra & Boyd, 1996; Lana et al., 2018; Yang et al., 2015; Ma et al., 2022), predicting future values from historical data. Temporal-wise Transformers, which build input tokens from series values at each timestep, have been widely researched in TSF (Li et al., 2019; Zhou et al., 2021; Wu et al., 2021; Nie et al., 2022). Some studies have identified several issues with such Transformers, like timestep mixing and permutation invariance (Zeng et al., 2023), leading to overfitting and underperformance on real-world datasets compared to simpler models like linear predictors. Previously proposed solutions include channel independence (Nie et al., 2022), random channel dropout (Lu et al., 2024), and Series-wise Transformers (Liu et al., 2024a; Wang et al., 2024), which consider each series as a token. Yet, their underlying mechanisms are not well explained. Additionally, the fixed input series structure of these existing Transformers restricts their adaptability in few-shot and zero-shot learning for multivariate TSF. Moreover, recent research has expanded the application of Transformer-based LLMs to TSF (Zhang et al., 2024), achieving improvements in few-shot and zero-shot generalization.
They use methods like prompt engineering (Gruver et al., 2023), fine-tuning, and embedding inversion (Zhou et al., 2023; Jin et al., 2024) to integrate time series context into prompts, improving forecasting accuracy for new data. However, these methods are designed to adapt LLMs to TSF tasks rather than directly addressing the core aspects of TSF problems. This adaptation leads to inefficient use of LLM parameters and substantially increased computational costs.

¹Code implementation is available at: https://anonymous.4open.science/r/ICTSP-C995

In this study, we apply the most fundamental ICL settings to TSF problems to construct the In-context Time Series Predictor (ICTSP). This structure allows us to leverage the in-context generalization capabilities of Transformers efficiently without relying on large-scale LLM parameters. In ICTSP, we generate a sequence of (lookback, future) pairs, representing "time series forecasting tasks," from the original time series data, which are used as input tokens for the Transformer, as shown in Figure 1 (b). This setup enables the Transformer to adaptively learn the most effective predictor for the target tasks based on the ground truth forecasting examples as context. Additionally, ICTSP effectively resolves the aforementioned longstanding issues in previous TSF Transformers, significantly enhancing performance in few-shot and zero-shot learning scenarios in multivariate TSF settings.

The main contributions of this paper are summarized as follows: a) We innovatively use forecasting tasks instead of traditional timestep values or single series as input tokens to construct the ICTSP structure. By utilizing ground truth (lookback, future) pairs as context examples, we fundamentally and efficiently leverage ICL abilities for TSF tasks.
ICTSP outperforms previous methods across full-data, few-shot, and zero-shot scenarios, positioning it as a potential solution for building universal large TSF models. b) From an ICL perspective, we explain that issues in existing TSF Transformers, like timestep mixing, permutation invariance, and channel structure restriction, are caused by inappropriate token formulation. We show how previous solutions have partially addressed these issues and how ICTSP effectively solves them without the drawbacks of previous approaches. c) We show that the ICTSP structure encompasses several simpler models as special cases, allowing for a sequential adaptive reduction in complexity: i) predictors learned from context examples through ICL, ii) a series-wise Transformer without context examples (Liu et al., 2024a), and iii) univariate MLPs or linear predictors (Zeng et al., 2023). This connection ensures stable performance across time series datasets of varying complexity and prevents the significant overfitting that has previously hindered Transformers from consistently outperforming simpler models on real-world datasets.

2.1 PRELIMINARIES

Time Series Forecasting. Let $X \in \mathbb{R}^{C \times L}$ represent a multivariate time series, where $L$ is the total length and $C$ is the number of channels of the input series. $X$ is split into historical input $X_I \in \mathbb{R}^{C \times L_I}$ and future data $X_P \in \mathbb{R}^{C \times L_P}$, with $L = L_I + L_P$. Here, $L_I$ and $L_P$ represent the lengths of the historical and forecasting periods, respectively. The value at the $t$-th timestep in the $j$-th channel is $X_j^{(t)}$, where $t \in \{1, \dots, L\}$ and $j \in \{1, \dots, C\}$. The objective is to develop the best predictor $f: \mathbb{R}^{C \times L_I} \to \mathbb{R}^{C \times L_P}$ that maps historical inputs to future outputs, yielding $\hat{X}_P = f(X_I)$.

Transformer Architecture. In this study, we employ a Transformer architecture with pre-normalization settings, processing $D$-dimensional input tokens $\{z_i\}_{i=1}^{N}$ from the input matrix $Z = [z_1, \dots, z_N] \in \mathbb{R}^{D \times N}$. Each token $z_i$ passes through $K$ layers of the Transformer.
Each layer begins with layer normalization $\mathrm{LN}(\cdot)$, followed by self-attention $\mathrm{Attn}(\cdot)$ and a feed-forward network $\mathrm{FFN}(\cdot)$. The output $Z_k$ of each Transformer layer $\mathrm{TF}_k$ is computed as:

$$Z_k = \mathrm{TF}_k(Z_{k-1}) = Z_{k-1} + \mathrm{Attn}_k\left(\mathrm{LN}(Z_{k-1})\right) + \mathrm{FFN}_k\left(\mathrm{LN}\left(Z_{k-1} + \mathrm{Attn}_k\left(\mathrm{LN}(Z_{k-1})\right)\right)\right), \quad (1)$$

with $Z_0 = Z$ and the first addition of $Z_{k-1}$ providing a residual shortcut directly from input to output. The Transformer's final output is $\mathrm{TF}(Z_0) = Z_K$.

In-context Learning. ICL involves datapoints $(x_i, y_i) \in \mathbb{R}^a \times \mathbb{R}^b$ from a dataset $\{(x_i, y_i)\}_{i=1}^{N}$. Unlike traditional supervised learning, which learns direct mappings from $x_i$ to $y_i$, ICL predicts $y_i$ using both the historical observations $\{(x_j, y_j)\}_{j<i}$
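The pre-norm layer update of Eq. (1) can be sketched in NumPy as follows. This is a minimal illustration with single-head attention, a ReLU two-layer FFN, and random placeholder weights; all names and sizes here are our own assumptions, not the paper's implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token (column of Z) over the feature dimension.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

class PreNormLayer:
    """One pre-norm Transformer layer following Eq. (1).

    Z_k = Z + Attn(LN(Z)) + FFN(LN(Z + Attn(LN(Z)))), tokens as
    columns of a D x N matrix. Illustrative sketch only.
    """
    def __init__(self, d, d_ff, rng):
        self.Wq = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wk = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wv = rng.standard_normal((d, d)) / np.sqrt(d)
        self.W1 = rng.standard_normal((d_ff, d)) / np.sqrt(d)
        self.W2 = rng.standard_normal((d, d_ff)) / np.sqrt(d_ff)

    def attn(self, Z):
        Q, K, V = self.Wq @ Z, self.Wk @ Z, self.Wv @ Z
        # Scaled dot-product attention over tokens (columns).
        A = softmax((Q.T @ K) / np.sqrt(Z.shape[0]), axis=1)  # N x N
        return V @ A.T  # token i attends to all tokens j

    def ffn(self, Z):
        return self.W2 @ np.maximum(self.W1 @ Z, 0.0)  # ReLU MLP

    def __call__(self, Z):
        h = Z + self.attn(layer_norm(Z))       # attention sublayer
        return h + self.ffn(layer_norm(h))     # feed-forward sublayer

rng = np.random.default_rng(0)
Z0 = rng.standard_normal((8, 10))              # D=8 features, N=10 tokens
out = PreNormLayer(d=8, d_ff=16, rng=rng)(Z0)  # same shape as input
```

Stacking K such layers and reading off the target token's output gives the forecast; the residual path from $Z_{k-1}$ is what lets the model fall back to simpler predictors when the context adds little.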