Memory Consolidation Enables Long-Context Video Understanding

Ivana Balažević*1, Yuge Shi*1, Pinelopi Papalampidi*1, Rahma Chaabouni1, Skanda Koppula1, Olivier J. Hénaff1

*Equal contribution. 1Google DeepMind. Correspondence to: Ivana Balažević, Olivier J. Hénaff. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Most transformer-based video encoders are limited to short temporal contexts due to their quadratic complexity. While various attempts have been made to extend this context, this has often come at the cost of both conceptual and computational complexity. Instead, we propose to re-purpose existing pretrained video transformers by simply fine-tuning them to attend to memories derived non-parametrically from past activations. By leveraging redundancy reduction, our memory-consolidated vision transformer (MC-ViT) effortlessly extends its context far into the past and exhibits excellent scaling behavior when learning from longer videos. In doing so, MC-ViT sets a new state-of-the-art in long-context video understanding on EgoSchema, Perception Test, and Diving48, outperforming methods that benefit from orders of magnitude more parameters.

1. Introduction

Humans and animals reason about events extending over days, weeks, and years (Tulving, 1985), yet current artificial vision systems live largely in the present. While architectures that model the dynamics of natural videos have grown ever more sophisticated (Carreira & Zisserman, 2017; Feichtenhofer et al., 2019; Arnab et al., 2021), the temporal extent over which they reason has typically been limited to a small number of frames. In particular, transformer architectures (Vaswani et al., 2017), which power most applications in vision and language, do not scale to the vast number of tokens present in natural videos due to their quadratic complexity. For example, 30 minutes of video sampled at standard rates may contain half a million tokens, more than what current state-of-the-art architectures using optimized attention algorithms (e.g. Dao et al., 2022) can process.

Figure 1. Long-context video understanding on EgoSchema and Perception Test (VQA accuracy as a function of the number of parameters). The proposed Memory-Consolidated Vision Transformer (MC-ViT-{B,L}) surpasses both public and large-scale proprietary models, including ShortViViT, LongViViT, Bard + ImageViT, Bard + ShortViViT, and Bard + PALI, despite using orders of magnitude fewer parameters and requiring only short fine-tuning schedules on top of standard pretrained models.
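To make the quadratic-complexity argument concrete, the following is a back-of-the-envelope sketch (not taken from the paper; the patch size, resolution, and sampling rates are illustrative assumptions) of how quickly token counts and attention costs grow with video length.

```python
# Back-of-the-envelope token and attention-cost estimate for a ViT-style
# video encoder. All constants below are illustrative assumptions, not
# values reported in the paper.

def video_tokens(minutes: float, fps: float, frame_hw: int = 224,
                 patch: int = 16, tubelet_t: int = 1) -> int:
    """Number of spatio-temporal tokens for a video of the given length."""
    frames = int(minutes * 60 * fps)
    patches_per_frame = (frame_hw // patch) ** 2   # e.g. 14 x 14 = 196
    return (frames // tubelet_t) * patches_per_frame

def attention_pairs(num_tokens: int) -> int:
    """Token pairs scored by full joint space-time self-attention, O(N^2)."""
    return num_tokens ** 2

if __name__ == "__main__":
    # Sparse 1 FPS sampling of a 30-minute video already yields ~350K tokens,
    # on the order of the half-million quoted in the introduction.
    n_sparse = video_tokens(minutes=30, fps=1)
    # Dense 25 FPS sampling with 2-frame tubelets pushes this into the millions.
    n_dense = video_tokens(minutes=30, fps=25, tubelet_t=2)
    print(f"1 FPS:  {n_sparse:,} tokens, {attention_pairs(n_sparse):.3e} attention pairs")
    print(f"25 FPS: {n_dense:,} tokens, {attention_pairs(n_dense):.3e} attention pairs")
```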
Several attempts have been made to extend the temporal context of video transformers, including masking, attention approximations, and parametric memory modules (e.g. Wu et al., 2022; Piergiovanni et al., 2024). However, these approaches often introduce additional complexity, requiring specialized architectures and training paradigms. In this work, we question whether such modifications are indeed necessary to enable long-context modeling. Starting from standard pretrained video transformers (Arnab et al., 2021), we process videos in a streaming setting in order to bound their complexity by the length of short segments (Dai et al., 2019). Crucially, we process individual segments in relation to a memory bank which is populated non-parametrically with the consolidated activations from past segments. This allows us to re-purpose pretrained video transformers for long-context understanding without any architectural modification, by simply fine-tuning them to attend to this memory with short training schedules.

A central question we are faced with is therefore how to choose which of the quasi-infinite tokens from past frames to store in memory. Inspired by evidence from psychology and neuroscience which formulates memory as a reconstructive process (Bartlett, 1932; Marr, 1971; Spens & Burgess, 2024), we adopt simple non-parametric schemes that form memories that are maximally representative of the full set of past activations. We find these mechanisms to effectively compress memories by an order of magnitude, and allow our memory-consolidated vision transformer (MC-ViT) to extend its context to significantly longer videos while maintaining a bounded complexity. In particular,

1. MC-ViT strikes a favorable trade-off between computational complexity and expressivity, outperforming standard video transformers and efficient approximations thereof with 10× less memory and computation.

2. The non-parametric nature of MC-ViT allows us to straightforwardly re-purpose off-the-shelf pretrained video transformers by fine-tuning them to use their consolidated memory, yielding large efficiency gains by decreasing overall training time on long videos.

3. MC-ViT sets a new state-of-the-art on long-context video understanding tasks such as fine-grained action recognition (Diving48) and video question answering (EgoSchema and Perception Test), outperforming methods which benefit from orders of magnitude more parameters.

4. MC-ViT is competitive with large-scale proprietary systems such as GPT-4V and Bard, despite using a small, standard, and open architecture and training paradigm.

2. Related Work

Long-context architectures. Prior work has thoroughly explored approaches for handling long textual or visual inputs, by sparsifying either the input tokens or the attention applied over these tokens. In natural language processing, notable examples include BigBird (Zaheer et al., 2020) and Longformer (Beltagy et al., 2020), which employ local self-attention over restricted windows combined with global tokens that attend over the entire sequence. Alternative attention mechanisms in vision have utilized pooling (Wang et al., 2021; Li et al., 2022b), linear (Bolya et al., 2022) and windowed formulations (Dong et al., 2022; Li et al., 2022a; Ryali et al., 2023). Several works reduce the number of tokens via multi-resolution patchification, thus processing the input video at different granularities (Feichtenhofer et al., 2019; Yan et al., 2022a; Piergiovanni et al., 2023). Similarly, Papalampidi et al. (2024) showcase the benefits of this approach by training video encoders on long contexts with high ratios of input masking. Current state-of-the-art approaches for processing long videos consist of modular systems for captioning and extracting frame-level information, followed by a billion-scale LLM for aggregating this information (Zeng et al., 2022; Wang et al., 2022c; Li et al., 2023; Lin et al., 2023; Wang et al., 2023a; Zhang et al., 2023).
The approach proposed in this work is orthogonal to these, by re-purposing standard transformer architectures for long-context modeling, whose representations can be incorporated into LLMs.

Memory-augmented transformers. Since the introduction of transformers (Vaswani et al., 2017), several works have sought to give them additional context via auxiliary memory banks. In NLP, Transformer-XL does so by simply attending to recent activations in a streaming setting (Dai et al., 2019), whereas Retro (Borgeaud et al., 2022) does so by retrieving semantically related content. In vision, memory-augmented architectures have also been shown to enable video object segmentation (Oh et al., 2019), tracking (Lai et al., 2020), and action recognition (Wu et al., 2019).

Memory-compressing transformers. Several transformer-based architectures explored compressing past activations into a finite-length memory. In NLP, Neural Turing Machines (Graves et al., 2014) and Token Turing Machines (Ryoo et al., 2023) learn to read and write from a memory bank in an end-to-end manner. Similarly, Compressive Transformers (Rae et al., 2020), ∞-former (Martins et al., 2022) and, in vision, MemDPC (Han et al., 2020), LSTR (Xu et al., 2021b), MeMViT (Wu et al., 2022) and LongMem (Wang et al., 2023b) extend the effective context length by compressing prior activations with additional parametric modules. Concurrent work Mirasol3B (Piergiovanni et al., 2024) showcases the power of this approach by combining these memory modules with large language models and a bespoke pretraining protocol.

Our work differs from these in that we find that a simple, non-parametric mechanism followed by light-weight fine-tuning is sufficient to re-purpose standard pretrained video transformer architectures (e.g. ViViT, Arnab et al., 2021) to achieve strong long-context modeling. Closest to our approach is concurrent work MovieChat (Song et al., 2024) that also uses a non-parametric memory, but focuses on extending the visual context of an LLM for video-to-text generation while using fixed frame-level representations via a pre-trained image encoder. In contrast, we directly add memory consolidation to the visual backbone.

Figure 2. Visualization of the proposed method. Left: Streaming ViT processes each segment of the sequence independently by attending over activations within a segment. Middle: Memory-Augmented ViT, similar to Transformer-XL (Dai et al., 2019), attends to current activations (yellow blocks) and those in recent history (green blocks). Right: In Memory-Consolidated ViT, we consolidate the extended context into shorter memory and cross-attend over them, which enables us to effectively attend over longer sequences.

3.1. Overview of Video Vision Transformers (ViViT)

Video Vision Transformers (ViViT; Arnab et al. 2021) adapt Vision Transformers (Dosovitskiy et al., 2021) to straightforwardly process videos. Specifically, ViViT divides a video $V \in \mathbb{R}^{T \times H \times W}$ into $N_T$ non-overlapping spatio-temporal patches $x_i \in \mathbb{R}^{t \times h \times w}$ such that $N_T = \frac{T}{t} \cdot \frac{H}{h} \cdot \frac{W}{w}$, and linearly projects these patches into 1D embedding space:

$$z_i = E x_i + p_i, \qquad (1)$$

where $E$ denotes a learnable projection layer and $p_i \in \mathbb{R}^d$ additional positional embeddings.
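As a concrete illustration of Eq. (1), the following is a minimal NumPy sketch of tubelet extraction and linear projection with learned positional embeddings. The tubelet size, frame resolution, and embedding dimension are illustrative assumptions rather than the paper's settings, and a single channel is used for simplicity.

```python
import numpy as np

# Minimal sketch of ViViT-style tubelet embedding (Eq. 1). The tubelet size
# (t, h, w) and embedding dimension d below are illustrative assumptions.
T, H, W = 16, 224, 224          # frames, height, width (single channel for simplicity)
t, h, w, d = 2, 16, 16, 768     # tubelet size and embedding dimension

rng = np.random.default_rng(0)
video = rng.normal(size=(T, H, W))

# Number of non-overlapping spatio-temporal patches: N_T = (T/t)(H/h)(W/w).
n_t, n_h, n_w = T // t, H // h, W // w
N_T = n_t * n_h * n_w

# Extract tubelets and flatten each into a vector of size t*h*w.
patches = (video.reshape(n_t, t, n_h, h, n_w, w)
                .transpose(0, 2, 4, 1, 3, 5)
                .reshape(N_T, t * h * w))

# Learnable projection E and positional embeddings p_i (randomly initialized here).
E = rng.normal(size=(t * h * w, d)) * 0.02
pos = rng.normal(size=(N_T, d)) * 0.02

z0 = patches @ E + pos          # token sequence z^0 of shape [N_T, d]
print(z0.shape)                 # (1568, 768) for these settings
```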
The resulting token sequence $z^0 = [z_i,\ i \in [1, N_T]] \in \mathbb{R}^{N_T \times d}$ is then passed through a series of $L$ transformer layers, which alternate Multi-head Self-Attention (MSA; Vaswani et al. 2017), layer normalization (LN; Ba et al. 2016) and MLP blocks:

$$y^l = \mathrm{MSA}(\mathrm{LN}(z^l)) + z^l \qquad (2)$$
$$z^{l+1} = \mathrm{MLP}(\mathrm{LN}(y^l)) + y^l. \qquad (3)$$

While various schemes for factorizing the attention have been proposed for ViViT, we build our model upon the simplest joint space-time attention, which models dependencies between all tokens in the sequence. We leave exploring other factorization methods for future research. In contrast to ViViT's self-attention, which spans the entire video, our MC-ViT model uses self-attention within much shorter segments, and cross-attention across segments via a set of consolidated memories, which we detail below.

3.2. Memory-Consolidated Vision Transformers

In this section, we explore three successive modifications to the original ViViT architecture that enable efficient and expressive scaling to longer videos (see visualization in Figure 2). The culmination of these modifications represents our proposed method: Memory-Consolidated ViT (MC-ViT). We apply consistent pre- and post-processing steps across all three approaches: we divide the video $V$ into $s$ temporal segments $v_\tau \in \mathbb{R}^{S \times H \times W}$, where $S = T/s$ is the number of frames per segment and $S = 16$ in our experiments. We then process each segment (either individually or jointly, see below), yielding a list of $s$ representations $\{z_1, \ldots, z_s\}$, one for each segment. All of these are then concatenated as the final representation of the video.

Streaming ViT (ST-ViT). Since the computational complexity of transformers scales quadratically with the number of tokens, full joint space-time attention becomes intractable for video lengths that exceed even small numbers of frames. To counteract this, we start with a simple streaming-based extension of ViViT, which processes each segment $v_\tau$, $\tau \in [1, s]$ independently, as described in Section 3.1, with positional embeddings spanning the entire video. Crucially, the number of tokens processed by the ViViT encoder at a given time is instead $N = \frac{S}{t} \cdot \frac{H}{h} \cdot \frac{W}{w}$, bounding the quadratic complexity by the segment length $S$ rather than the total video length $T$. We include the pseudocode for the streaming ViT implementation in Appendix A, Algorithm 2.

Memory-Augmented ViT (MA-ViT). While more scalable, the streaming setting limits the encoder's ability to reason over events which span multiple segments. Hence, as in Dai et al. (2019), we augment the self-attention module with an additional set of memories $m^l_\tau = [z^l_0; z^l_1; \ldots; z^l_{\tau-1}] \in \mathbb{R}^{M \times d}$ consisting of concatenated activations of previous segments at each layer $l$:

$$y^l_\tau = \mathrm{MCA}\big(\underbrace{\mathrm{LN}(z^l_\tau)}_{\text{query}},\ \underbrace{[\mathrm{LN}(z^l_\tau); \mathrm{LN}(m^l_\tau)]}_{\text{key-value}}\big) + z^l_\tau, \qquad (4)$$

where $[\cdot\,;\cdot]$ denotes the concatenation operation and Multi-head Cross-Attention (MCA; Dai et al. 2019) generalizes MSA by decoupling the inputs to the query and key/value heads. Specifically, the MCA operation allows activations from the current segment $z^l_\tau$ to attend both to themselves (as in MSA) and to memories of all past activations $m^l_\tau$, while keeping the quadratic complexity limited to $N + M$. We include the pseudocode for Memory-Augmented ViT in Appendix A, Algorithm 3.

Memory-Consolidated ViT (MC-ViT). Given the memory-augmented vision transformer architecture, a central question is how to consolidate the (potentially infinite) activations of previous segments into a finite (and ideally small) set of memories. We consider three simple instances of memory consolidation that model memory through a non-parametric reconstructive process.
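Before detailing the consolidation variants, a minimal single-head sketch of the MCA operation in Eq. (4) may help fix ideas: the current segment provides the queries, while keys and values come from the concatenation of the current segment and the memory. This simplified NumPy version omits multiple heads, per-head projections, and the MLP block; all shapes and weights are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mca(z, m, Wq, Wk, Wv):
    """Single-head sketch of Eq. (4): queries come from the current segment z,
    keys/values from the concatenation of z and the memory m."""
    q = layer_norm(z) @ Wq                                          # [N, d]
    kv_in = np.concatenate([layer_norm(z), layer_norm(m)], axis=0)  # [N + M, d]
    k, v = kv_in @ Wk, kv_in @ Wv                                   # [N + M, d]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))                  # [N, N + M]
    return attn @ v + z                                             # residual, [N, d]

# Toy usage: N = 8 current tokens attend over themselves and M = 4 memories.
rng = np.random.default_rng(0)
d = 16
z, m = rng.normal(size=(8, d)), rng.normal(size=(4, d))
Wq = Wk = Wv = rng.normal(size=(d, d)) * 0.1
print(mca(z, m, Wq, Wk, Wv).shape)   # (8, 16)
```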
Algorithm 1. Memory-Consolidated ViT.

```python
import numpy as np

# Pseudocode: linear_proj, layer_norm, self_attention, cross_attention, mlp and
# memory_consolidation stand for the corresponding (pretrained) modules.
def mc_vit(video, n_chunks, n_layers, pos_emb, mc_method, num_mem):
    emb = linear_proj(video) + pos_emb                    # [B, N, D]
    chunked_video = np.split(emb, n_chunks, axis=1)
    memory = {layer: None for layer in range(n_layers)}
    zs = []
    for z in chunked_video:
        for layer in range(n_layers):
            z_norm = layer_norm(z)
            if memory[layer] is None:
                y = self_attention(z_norm) + z
            else:
                kv = np.concatenate([z_norm, memory[layer]], axis=1)
                y = cross_attention(q=z_norm, kv=kv) + z
            y_norm = layer_norm(y)
            z = mlp(y_norm) + y
            memory[layer] = memory_consolidation(
                memory[layer], z, num_mem, mc_method)
            memory[layer] = layer_norm(memory[layer])
        zs.append(z)
    return np.concatenate(zs, axis=1)
```

To produce a new consolidated memory $m_\tau$ for the current segment (dropping the layer index $l$ for concision), we consolidate the set of activations from the preceding segment $z_{\tau-1} \in \mathbb{R}^{N \times d}$ into $\hat{z}_{\tau-1} \in \mathbb{R}^{K \times d}$ ($K \leq N$) and concatenate them to the memories consolidated from all prior segments: $m_\tau = [m_{\tau-1}, \hat{z}_{\tau-1}] \in \mathbb{R}^{(M+K) \times d}$. The proposed instances of non-parametric memory consolidation differ in their way of computing $\hat{z}_{\tau-1}$, which we detail below.

MC-ViT-R (random) is the simplest non-parametric baseline, which randomly selects a set of $K$ activations from $z_{\tau-1}$ and uses them as the consolidated memory for the preceding segment:

$$\hat{z}^{R}_{\tau-1} = \{z_{\tau-1,k} \mid k \in \mathcal{I}\} \in \mathbb{R}^{K \times d}, \qquad (5)$$

where $\mathcal{I} \in [1, N]^K$ is a set of $K$ randomly selected indices.

MC-ViT-CS (coreset) constructs a maximally representative set of memories by applying the greedy coreset selection algorithm (Agarwal et al., 2005) to the activations of the preceding segment $z_{\tau-1}$, iteratively adding the most distant activations to the ones already included in the consolidated memory for that segment. One iteration of the algorithm is defined as:

$$k^* = \arg\max_{k \in [1, N]}\ \min_{j \in \mathcal{M}} \|z_{\tau-1,k} - z_{\tau-1,j}\|^2_2, \qquad (6)$$
$$\mathcal{M} \leftarrow \mathcal{M} \cup \{k^*\}, \qquad (7)$$

where $\mathcal{M}$ is the set of activation indices chosen to be added to the consolidated memory $\hat{z}^{CS}_{\tau-1}$. The greedy coreset selection algorithm is run for $K$ iterations to produce the consolidated memory $\hat{z}^{CS}_{\tau-1} \in \mathbb{R}^{K \times d}$ for the segment $v_{\tau-1}$. Due to its iterative nature, the coreset selection algorithm becomes increasingly computationally expensive as the size of the segment memory $K = |\mathcal{M}|$ increases.

MC-ViT-KM (k-means) randomly initializes $K$ cluster centroids as $\hat{z}^{R}_{\tau-1}$ (see Equation 5) and then performs 5 iterations of k-means clustering on all activations of the previous segment $z_{\tau-1}$ to compute the updated cluster centroids, which we use as the consolidated memory $\hat{z}^{KM}_{\tau-1} \in \mathbb{R}^{K \times d}$ for the segment $v_{\tau-1}$.

We include the pseudocode for MC-ViT in Algorithm 1. The newly consolidated memory $m_\tau$ is then jointly processed with the current segment activations $z_\tau$ via MCA, analogously to MA-ViT (see Equation 4). We compare these different consolidation methods in Section 4.4 and find that MC-ViT-KM performs better than the others. Therefore, unless specified otherwise, MC-ViT refers to MC-ViT-KM in the following sections.
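The three consolidation variants can be sketched in a few lines of NumPy. The following is a minimal, illustrative implementation of the random, coreset, and k-means selection rules (Eqs. 5-7) operating on one segment's activations; it is not the authors' code, and the coreset seed token is an assumption since the paper does not specify the initialization.

```python
import numpy as np

def consolidate_random(z, K, rng):
    """MC-ViT-R: keep K randomly chosen activations (Eq. 5)."""
    idx = rng.choice(len(z), size=K, replace=False)
    return z[idx]

def consolidate_coreset(z, K):
    """MC-ViT-CS: greedy coreset selection (Eqs. 6-7).
    Iteratively add the activation farthest from those already selected."""
    selected = [0]                                   # seed with an arbitrary token (assumption)
    d2 = ((z - z[0]) ** 2).sum(-1)                   # squared distance to the selected set
    for _ in range(K - 1):
        k_star = int(np.argmax(d2))                  # farthest remaining activation
        selected.append(k_star)
        d2 = np.minimum(d2, ((z - z[k_star]) ** 2).sum(-1))
    return z[selected]

def consolidate_kmeans(z, K, rng, iters=5):
    """MC-ViT-KM: run a few k-means iterations, keep the centroids."""
    centroids = consolidate_random(z, K, rng)        # random initialization (Eq. 5)
    for _ in range(iters):
        # Assign each activation to its nearest centroid.
        d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # [N, K]
        assign = d2.argmin(-1)
        # Update centroids as cluster means (keep old centroid if a cluster is empty).
        for k in range(K):
            members = z[assign == k]
            if len(members):
                centroids[k] = members.mean(0)
    return centroids

# Toy usage: compress N = 1024 segment tokens into K = 128 memories.
rng = np.random.default_rng(0)
z_prev = rng.normal(size=(1024, 64))
print(consolidate_random(z_prev, 128, rng).shape)    # (128, 64)
print(consolidate_coreset(z_prev, 128).shape)        # (128, 64)
print(consolidate_kmeans(z_prev, 128, rng).shape)    # (128, 64)
```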
3.3. Training and Evaluation

Initialization. Since the parameters of MC-ViT are almost identical to those of ViViT (Arnab et al., 2021), we initialize most parameters from a ViViT encoder pretrained on short (16-frame) video clips using multimodal contrastive learning (Xu et al., 2021a; Papalampidi et al., 2024), see Appendix B.1. The only parameters which differ are the positional embeddings, as we fine-tune MC-ViT on significantly longer videos (e.g. up to 128 frames) than the short clips used for pretraining. We therefore initialize the positional embeddings by linearly upsampling ViViT's positional embeddings through interpolation along the time dimension (note that we experimented with both interpolation and extrapolation through repetition and did not find it to have a big impact on performance). Similarly, we re-use and fine-tune a BERT-style language encoder pretrained in the same setup.

Fine-tuning. For each evaluation, we fine-tune on a dataset mixture that enables a like-for-like comparison with the previous state-of-the-art. All datasets are composed of video-text pairs, and we therefore simply fine-tune the model with noise contrastive estimation. Given the video and text embeddings $z^v_i$ and $z^t_i$ of an example $i$, we minimize

$$\ell_i = -\log \frac{\exp(z^v_i \cdot z^t_i)}{\sum_j \exp(z^v_i \cdot z^t_j)} - \log \frac{\exp(z^t_i \cdot z^v_i)}{\sum_j \exp(z^t_i \cdot z^v_j)}, \qquad (8)$$

where the negative embeddings $z^v_j$ and $z^t_j$ are the in-batch examples unless otherwise specified. We provide further training details in Appendix B.2.

Evaluation. We employ the standard zero-shot transfer paradigm from CLIP (Radford et al., 2021) to perform all downstream tasks. In all cases, a test video is equipped with multiple possible captions, only one of which is correct. For action recognition, these captions are simply the class names. For video question answering, captions are question-answer pairs constructed from the set of multiple-choice answers. We utilize the language model to compute caption embeddings $z^t_i$, and compare them to the video embedding $z^v_i$. The model's prediction $\hat{i} = \arg\max_i z^v_i \cdot z^t_i$ is simply the caption with the highest similarity to the video embedding.
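A minimal sketch of the symmetric contrastive objective in Eq. (8) and of the zero-shot scoring rule used at evaluation time follows. Shapes are illustrative, embeddings are assumed to already live in a shared space, and any temperature scaling or normalization details are omitted.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    logits = logits - logits.max(axis=axis, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=axis, keepdims=True))

def contrastive_loss(z_v, z_t):
    """Symmetric noise-contrastive loss of Eq. (8) over a batch of B pairs.
    z_v, z_t: [B, d] video and text embeddings; matching pairs share an index."""
    logits = z_v @ z_t.T                      # [B, B] similarity matrix
    labels = np.arange(len(z_v))
    loss_v2t = -log_softmax(logits, axis=1)[labels, labels].mean()
    loss_t2v = -log_softmax(logits.T, axis=1)[labels, labels].mean()
    return loss_v2t + loss_t2v

def zero_shot_predict(z_v, caption_embs):
    """Zero-shot multiple choice: pick the caption most similar to the video."""
    return int(np.argmax(caption_embs @ z_v))

# Toy usage.
rng = np.random.default_rng(0)
z_v, z_t = rng.normal(size=(4, 32)), rng.normal(size=(4, 32))
print(contrastive_loss(z_v, z_t))
print(zero_shot_predict(z_v[0], caption_embs=rng.normal(size=(5, 32))))
```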
4. Experiments

4.1. Datasets

We evaluate our method on four challenging datasets for long-context video understanding, namely Diving48, EgoSchema, Next-QA, and Perception Test.

Diving48 (Li et al., 2018) was specifically designed to assess the importance of dynamic and long-term temporal reasoning in action recognition. Video lengths vary between 24 and 822 frames, with 158 frames on average. Each video is categorized into 48 fine-grained classes based on the specific dive type it depicts. Consequently, correct classification requires dense video sampling and fine-grained understanding in addition to retaining information over a long temporal extent, which necessitates reasoning over a large number of frames. To align with prior methods, we fine-tune on the Diving48 training set and re-initialize the language encoder randomly with a linear embedding function.

EgoSchema (Mangalam et al., 2023) is a long-form multiple-choice video question answering dataset derived from Ego4D (Grauman et al., 2022). The task involves selecting the correct answer out of five options based on a three-minute-long video clip. This task is particularly interesting for evaluating long-context understanding, as it benefits from long temporal certificate lengths, i.e. the minimum video duration a human needs to answer the question accurately. The model is fine-tuned on a mixture of HowTo100M and Ego4D, and we ensure that there is no overlap between Ego4D training and EgoSchema examples.

Next-QA (Xiao et al., 2021) emphasizes testing causal and temporal reasoning with open- and close-ended (multiple-choice) QA tasks. Videos in this dataset have an average duration of 44 seconds but can be as long as 2 minutes. We use the close-ended version for both fine-tuning and inference. Since the training set is fairly small, and in order to avoid over-fitting on this domain, we add and only tune low-rank adapters (LoRA; Hu et al. 2021) at the self-attention and feed-forward blocks of every layer, which account for 12% of model parameters. For fine-tuning on this multiple-choice QA dataset, we use the four incorrect answers to the given question as hard negatives in Equation (8).

Perception Test (Pătrăucean et al., 2023) is inspired by assessment in developmental psychology and features a collection of games or daily activities that evaluate a model's grasp of physics, reasoning, memory, and semantic extraction. Although videos in this dataset are short, with an average duration of 30 seconds, accurate localization and recognition of actions and objects require a higher FPS rate (we use an FPS of 4), resulting in sequences of hundreds of frames. We evaluate on the multiple-choice video question answering task by selecting one out of three possible answers, while training on Next-QA for zero-shot evaluation on this benchmark.

Figure 3. MC-ViT effectively learns from long videos. Diving48 Top-1 accuracy as a function of the number of test frames (16 to 256), for models fine-tuned on 16, 32, 64 or 128 frames. Left: MC-ViT (streaming, with memory) scales to long Diving48 videos at both training and inference time, and benefits from fine-tuning on longer videos. Middle: Joint space-time attention benefits from fine-tuning on longer videos, but cannot learn from long (128-frame) videos due to its large complexity and memory footprint (out of memory at 128 training frames). Right: ST-ViT (streaming, no memory) scales to longer videos but does not benefit from training on them.

4.2. MC-ViT Effectively Learns from Long Videos

We start by assessing the ability of MC-ViT to model videos of increasing lengths. For this we fine-tune MC-ViT on videos with different numbers of frames (16, 32, 64, or 128) by varying the FPS rate. At inference time, we also apply the model to videos with 16 to 256 frames. Figure 3 (left) shows that MC-ViT's performance improves with more, densely sampled frames at both training and inference time on Diving48 fine-grained action recognition. In particular, training with longer contexts allows MC-ViT to benefit from more frames at inference time, with the optimal inference-time video length being twice that of the train-time video length, demonstrating reasonable generalization of the consolidated cross-attention mechanism.
In contrast, neither joint space-time attention (Figure 3, middle) nor a memory-less streaming ST-ViT architecture (Figure 3, right) effectively learns from long videos. While joint space-time attention benefits from training on more frames in terms of performance, its memory footprint prevents it from training or evaluating on the longest videos. ST-ViT, on the other hand, scales to more frames but does not benefit from them, since it lacks the ability to reason over events that span multiple segments.

Figure 4. MC-ViT efficiently models long videos. Fine-grained video understanding on Diving48 as a function of number of test frames (left), memory consumption (middle), and computational complexity (FLOPS, right), for joint space-time attention w/ and w/o masking (yellow and red respectively), the memory-less streaming setting (green), the late temporal fusion baseline (purple) and our proposed method MC-ViT (blue). MC-ViT reaches the highest accuracy with 10× less memory and FLOPS than the joint space-time attention method.

4.3. MC-ViT Efficiently Models Long Videos

Inference-time efficiency. We evaluate the performance of joint space-time attention, ST-ViT, and MC-ViT in relation to their memory and computational complexity, by varying the number of frames at inference time (all models are trained with 64 frames). MC-ViT's memory consumption is bounded by the number of tokens within a segment, similar to memory-less ST-ViT, whereas that of joint space-time attention increases with video length (Figure 4, middle). Similarly, while the computational complexity of joint space-time attention is quadratic in the video length, it is linear for both ST-ViT and MC-ViT (Figure 4, right).

In terms of performance, Figure 4 demonstrates that MC-ViT remarkably outperforms joint space-time attention with a 10× smaller memory footprint (middle) and FLOPS (right). We additionally test other scalable baselines, such as applying 25% input token masking to joint space-time attention (Papalampidi et al., 2024), and late temporal fusion (Alayrac et al., 2022; Yan et al., 2022b), where we add a learnable module on top of ST-ViT for contextualizing information across segments (see Appendix C). Not only does MC-ViT display a better scaling behavior than these baselines (Figure 4, left), but it does so with robust improvements in memory footprint and computational complexity.

Figure 5. MC-ViT makes efficient use of finite-length context. Diving48 Top-1 accuracy as a function of the number of test memories per segment. We show three MC-ViT instances (k-means, random, coreset) and compare them to relevant baselines (dashed horizontal lines: joint space-time attention and ST-ViT). K-means (red) and coreset (orange) surpass all methods at a 16× compression rate with 128 memories per segment, demonstrating the efficiency of our approach. Surprisingly, even random memory selection (blue) achieves impressive performance on this task, outperforming all baselines at a 4× compression rate with 512 memories, which further showcases the efficiency and robustness of the MC-ViT framework.

Training-time efficiency. We additionally assess whether these inference-time gains in efficiency translate into similar gains when fine-tuning. Specifically, we vary the number of frames used for fine-tuning, and measure the resulting performance when evaluating the model with twice the number of frames (as suggested in Figure 3), as well as the training-time memory footprint and computational complexity. Figure A.1, left, shows that MC-ViT quickly surpasses joint space-time attention and the streaming memory architecture when provided with longer videos for training. In particular, MC-ViT surpasses the performance of joint space-time attention with a 3.5× smaller memory footprint and 3.2× fewer FLOPS (Figure A.1, middle and right).
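The scaling behaviour discussed above can be summarized with a small cost model. The sketch below is an illustration rather than the paper's FLOP-counting methodology: it only counts query-key attention scores per layer for full joint space-time attention, memory-less streaming, and streaming with a bounded consolidated memory, using illustrative token counts.

```python
def attention_scores(total_tokens, segment_tokens=None, memory_tokens=0):
    """Count query-key scores per layer for three attention regimes:
    - joint space-time: every token attends to every token, O(N_total^2)
    - streaming (ST-ViT): each segment of N tokens attends within itself
    - streaming + memory (MA/MC-ViT): each segment attends to itself plus M memories
    """
    if segment_tokens is None:                     # joint space-time attention
        return total_tokens ** 2
    n_segments = total_tokens // segment_tokens
    return n_segments * segment_tokens * (segment_tokens + memory_tokens)

# Illustrative numbers: 128 frames split into 16-frame segments,
# 196 spatial tokens per frame, and K = 128 consolidated memories per segment.
tokens_per_frame, frames, seg_frames, K = 196, 128, 16, 128
N_total = tokens_per_frame * frames                # 25,088 tokens in total
N_seg = tokens_per_frame * seg_frames              # 3,136 tokens per segment
n_seg = frames // seg_frames                       # 8 segments
M_max = (n_seg - 1) * K                            # at most 896 consolidated memories

print("joint:              %.2e" % attention_scores(N_total))
print("streaming:          %.2e" % attention_scores(N_total, N_seg))
print("streaming + memory: %.2e" % attention_scores(N_total, N_seg, M_max))
```

Under these assumptions, attending to the consolidated memory adds only a modest overhead on top of memory-less streaming, while full joint attention costs several times more and grows quadratically with video length.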
4.4. Memory Consolidation Makes Efficient Use of a Finite Context Window

Consolidation mechanisms. We now analyze the computational efficiency and expressiveness of MC-ViT's consolidation methods. We compare our methods to three baselines: (1) joint space-time attention, (2) ST-ViT, and (3) MeMViT (Wu et al., 2022). Notably, MeMViT employs a parametric approach to memory compression, requiring a convolutional module to be trained alongside the network, where we experimented with different convolutional kernel sizes and found the kernel size 4×2×2 (equivalent to MC-ViT's K = 128) to perform the best (see Appendix C for details).

Figure 5 illustrates the performance of these methods on Diving48 as a function of the number of memories K per segment. Given K = 128 memories obtained through k-means consolidation (i.e. a 16× compression compared to MA-ViT; red curve), MC-ViT-KM outperforms all baselines. Remarkably, even random selection of K = 128 memories (with MC-ViT-R) is sufficient to surpass ViViT and ST-ViT. Finally, consolidating past activations with MC-ViT-CS (coreset, orange curve) performs similarly to MC-ViT-KM, highlighting the robustness of MC-ViT to the particular choice of memory consolidation algorithm. K-means consolidation is used as the default method given its greater computational efficiency and slightly higher performance for larger sets of memories.

Finite memory size. Further, to adapt MC-ViT to very long videos and decouple memory growth from video length, we compare two simple modifications to MC-ViT in Table 1 on the Diving48 dataset: (1) keep all tokens from the last N segments of length K (MC-ViT-last) or (2) randomly select a fixed number of N·K tokens across all past segments (MC-ViT-global). The results show that 5 segments (or 2560 tokens) should be saved in memory with no loss in performance. Further, we see that selecting N·K random tokens across segments (with K = 512 for all experiments) is more beneficial than keeping the last N·K tokens, which shows that relevant information for the current segment is contained throughout the whole video and not just in the last several segments. This also emphasizes the importance of redundancy reduction in the consolidation scheme, as uniformly sampled memories from the entire video are less redundant than all tokens from the most recent segments.

4.5. MC-ViT Achieves State-of-the-Art Long-Context Video Understanding

Fine-grained action recognition. In Table 3, we compare MC-ViT to prior methods on Diving48, and find that it delivers state-of-the-art results. Unlike previous methods that require object tracking models (Herzig et al., 2022) or additional modeling components, MC-ViT achieves strong performance by simply re-purposing a general transformer architecture for long-context modeling: while previous methods are limited to 32 frames of video, the efficient scaling properties of MC-ViT allow it to process 128 frames. Further, MC-ViT does not require multiple spatial crops at inference time to achieve state-of-the-art results.

Long video question answering. We compare MC-ViT to prior methods on long video question answering in Table 2. We find that our approach outperforms prior works that use up to 10× more parameters. Most notably, even our smaller model version (MC-ViT-B, with 200M parameters in total) is able to achieve a 10% improvement on EgoSchema in comparison to much larger models (up to 5B parameters).
This demonstrates the importance of processing more frames, which our straightforward memory consolidation method enables, as well as the effectiveness of fine-tuning MC-Vi T from standard pretrained video encoders. It is particularly notable that MC-Vi T is competitive with models such as Flamingo (Alayrac et al., 2022) and Se Vi LA (Yu et al., 2023), which boast billion-scale LLM decoders. Such methods benefit from the language bias in VQA which allows for some questions to be trivially answered without any visual input and extensive textual training data. While MC-Vi T surpasses these models on Ego Schema and Perception Test, Se Vi La maintains stronger performance on Next-QA. We hypothesize that this benchmark is not challenging enough for long video understanding and relies heavily on language-only reasoning, since Yu et al. (2023) achieve their results while using a single input frame. Thus, frame-level models with strong decoders, such as Se Vi LA, may be sufficient for benchmarks requiring language-only reasoning and localization (Next-QA, Perception Test), but fail to capture a summary representation of the entire video (Ego Schema). In contrast, our method, despite lacking large language decoders, performs competitively across the board, demonstrating strong localization and long-context modeling capabilities. Finally, MC-Vi T requires minimal architectural changes and training overhead for adapting to long-context understanding, in contrast to modular methods (e.g., Yu et al., 2023) which involve multiple modules and complex training regimes. MC-Vi T vs. large-scale proprietary models. Additionally, in Table 4 we compare our method to large-scale proprietary models such as GPT-4V (Achiam et al., 2023), Gemini (Anil Memory Consolidation Enables Long-Context Video Understanding Table 1. MC-Vi T with memory size decoupled from video length. For MC-Vi T-last, we store the last N segments (of length K =512) into memory. The equivalent for MC-Vi T-global is to store randomly selected N K tokens across segments. Method 2 / 1024 3 / 1536 4 / 2048 5 / 2560 6 / 3072 7 / 3584 MC-Vi T-B-last 81.5 85.0 87.0 87.8 87.8 87.8 MC-Vi T-B-global 84.8 86.8 87.5 87.8 87.8 87.8 Table 2. Long video question answering, compared to public models. Performance is calculated as percentage correct on multiplechoice video question answering on Ego Schema, Perception Test and Next-QA. By scaling to significantly longer videos, MC-Vi T outperforms models that benefit from an order of magnitude more parameters. We highlight the best and second-best methods per dataset. 128+ stands for 128 or more frames, where we use 128 frames for Ego Schema and Next-QA and 256 frames for Perception Test. Method Params Frames Ego Schema Perception Test Next-QA Subset Full Co VGT (Xiao et al., 2023) 149M 32 60.0 Se Vi TFi D (Kim et al., 2023) 215M 10 60.6 Hi Te A (Ye et al., 2023) 297M 16 63.1 Intern Video (Wang et al., 2022b) 478M 90 32.1 63.2 Image Vi T (Papalampidi et al., 2024) 1B 16 40.8 30.9 39.1 Short Vi Vi T (Papalampidi et al., 2024) 1B 16 47.9 31.0 41.9 Flamingo (Alayrac et al., 2022) 3B 32 43.6 Se Vi LA Localizer + Short Vi Vi T (Papalampidi et al., 2024) 5B 32 49.6 31.3 Long Vi Vi T (Papalampidi et al., 2024) 1B 256 56.8 33.3 45.7 Se Vi LA (Yu et al., 2023) 4B 32 25.7 22.7 46.2 73.8 MC-Vi T-B 203M 128+ 61.2 42.3 47.0 60.6 MC-Vi T-L 424M 128+ 62.6 44.4 48.1 65.0 Table 3. Fine-grained action classification on Diving48. 
Prior methods use 3 more spatial crops at inference time (SC) and/or bounding box information (BB), which MC-Vi T does not require. Method Params Extra Top-1 Time S-L (Bertasius et al., 2021) 121M SC 81.0 Video Swin-B (Liu et al., 2022) 88M SC 81.9 BEVT (Wang et al., 2022a) 88M SC 86.7 SIFAR-B-14 (Fan et al., 2021) 87M SC 87.3 ORVi T (Herzig et al., 2022) 160M SC+BB 88.0 AIM Vi T-B (Yang et al., 2023) 97M SC 88.9 AIM Vi T-L (Yang et al., 2023) 341M SC 90.6 MC-Vi T-B 99M 89.7 MC-Vi T-L 313M 91.0 et al., 2023) and Bard1 + PALI (Google AI, 2023; Chen et al., 2023). While their exact implementation details are not publicly available, these models are thought to contain hundreds of billions to trillions of parameters, i.e. 1000 more than MC-Vi T. It is also important to note that these proprietary models are trained on massive amounts of data from the internet, resulting in potential data contamination, which we proactively avoid in our training pipeline. We showcase the performance of all models in the raw columns. Additionally, we also present (in grey font) the 1Release of September 2023. blind model performance where the model is trained only on question-answer pairs without the visual modality. In order to isolate the visual perception capabilities from the natural language reasoning of the model, we add the visual column in Table 4 which is computed as the difference between raw and blind . Examining the visual-only capabilities, we conclude that our small-scale model is competitive against the large proprietary ones and even surpasses GPT-4V performance on the subset of Ego Schema. Despite using a fraction of the parameters and training data, our method remains competitive and, in some cases, outperforms these models. In particular, MC-Vi T achieves 5% improvements on Ego Schema and Perception Test against the sophisticated Bard + PALI modular system used for information aggregation and frame captioning, respectively. 5. Discussion In this work, we introduced the Memory-Consolidated Vision Transformer (MC-Vi T), which efficiently models longrange dependencies in videos by consolidating past activations into a compact memory bank. MC-Vi T achieves state-of-the-art performance on multiple long video benchmarks by repurposing existing video architectures without the need for specialized architectures and training regimes. Our small-scale model outperforms approaches that bene- Memory Consolidation Enables Long-Context Video Understanding Table 4. Long video question answering on Ego Schema and Perception Test, compared to large-scale proprietary models. Performance is evaluated on the original ( raw ) dataset, as well as on the visual subset of questions that cannot be answered by a blind language model and on Perception Test for the validation set. For each model, we compute the performance of a blind variant on Ego Schema that only has access to question-answer pairs. The performance of the blind model is subtracted from that of the full model to compute visual performance. We underline the top 2 performing models for each benchmark and subset. 
Ego Schema Raw Ego Schema Visual Perception Test Raw Perception Test Visual Method Subset Full Subset Full Random chance 20.0 20.0 33.3 Bard only (blind) 27.0 33.2 0.0 0.0 36.8 0.0 Bard + Image Vi T (Papalampidi et al., 2024) 35.0 35.0 8.0 1.8 37.8 1.0 Bard + Short Vi Vi T (Papalampidi et al., 2024) 42.0 36.2 15.0 3.0 38.8 2.0 Bard + PALI (Papalampidi et al., 2024) 44.8 39.2 17.8 6.0 42.4 5.6 GPT-4 Turbo (blind) 31.0 30.8 0.0 0.0 GPT-4V 63.5 55.6 32.5 24.8 Gemini Ultra (Anil et al., 2023) 54.7 MC-Vi T-B (blind) 18.2 23.4 0.0 0.0 37.6 0.0 MC-Vi T-B 61.2 42.3 43.0 18.9 47.1 9.5 MC-Vi T-L (blind) 15.0 22.7 0.0 0.0 35.1 0.0 MC-Vi T-L 62.6 44.0 47.6 21.3 47.6 12.5 fit from orders of magnitude more parameters, and is even competitive with large-scale proprietary systems such as GPT-4V and Bard, demonstrating the importance of strong compressed video representations. As an extension, these representations could be fed into large language models to augment their long-range temporal reasoning capabilities. We showcased the effectiveness of non-parametric memory consolidation techniques as a simple means of extending long video contexts, and future work could straightforwardly build on MC-Vi T by exploring alternative consolidation strategies. For instance, incorporating insights from cognitive models of memory, such as the role of episodic and semantic memory systems, as well as theories of efficient coding (Barlow, 1961), could inspire new consolidation techniques. Furthermore, the concept of memory consolidation could be applied to other domains involving sequential data, such as natural language and audio processing, laying the foundation for personalized assistant technologies that jointly reason over multiple modalities. Impact Statement By adapting standard video architectures to the long-context setup, this work could potentially equip general-purpose assistant models with the ability to efficiently process long videos. These models will likely suffer from similar biases and potential harms associated with visual language models and large language models more generally. Further, since this work focuses on efficient processing of long sequences without sacrificing performance, the corresponding methods and findings from this work could poten- tially be applied to other domains, such as NLP or audio, allowing for faster processing of large amounts of data and thus making long-context model training more readily available for widespread use. Acknowledgements We thank Andrew Zisserman, Jo ao Carreira, Carl Allen, and Nikhil Parthasarathy for their thoughtful feedback, Relja Arandjelovi c for fruitful discussions at the inception of this project, and Oliver Vikbladh, Eleanor Spens, and Neil Burgess for sharing their insights into memory consolidation in the human mind. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. ar Xiv preprint ar Xiv:2303.08774, 2023. Agarwal, P. K., Har-Peled, S., Varadarajan, K. R., et al. Geometric approximation via coresets. Combinatorial and Computational Geometry, 52(1):1 30, 2005. Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 2022. Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. 
Gem- Memory Consolidation Enables Long-Context Video Understanding ini: A family of highly capable multimodal models. ar Xiv preprint ar Xiv:2312.11805, 2023. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Luˇci c, M., and Schmid, C. Vi Vi T: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021. Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. ar Xiv preprint ar Xiv:1607.06450, 2016. Barlow, H. B. Possible principles underlying the transformation of sensory messages. Sensory communication, 1 (01):217 233, 1961. Bartlett, F. C. Remembering: A study in experimental and social psychology. Cambridge University Press, 1932. Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. ar Xiv:2004.05150, 2020. Bertasius, G., Wang, H., and Torresani, L. Is space-time attention all you need for video understanding? In International Conference on Machine Learning, 2021. Bolya, D., Fu, C.-Y., Dai, X., Zhang, P., and Hoffman, J. Hydra attention: Efficient attention with many heads. In European Conference on Computer Vision, 2022. Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., Van Den Driessche, G. B., Lespiau, J.-B., Damoc, B., Clark, A., et al. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, 2022. Carreira, J. and Zisserman, A. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al. Pa LI: A jointly-scaled multilingual language-image model. In International Conference on Learning Representations, 2023. Dai, Z., Yang, Z., Yang, F., Cohen, W. W., and Salakhutdinov, R. R. Good semi-supervised learning that requires a bad gan. In Advances in Neural Information Processing Systems, 2017. Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the Association for Computational Linguistics, 2019. Dao, T., Fu, D., Ermon, S., Rudra, A., and R e, C. Flash Attention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 2022. Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., and Guo, B. CSWin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. Fan, Q., Panda, R., et al. Can an image classifier suffice for action recognition? In International Conference on Learning Representations, 2021. Feichtenhofer, C., Fan, H., Malik, J., and He, K. Slow Fast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019. Google AI. Bard [large language model], 2023. Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., et al. Ego4D: Around the world in 3,000 hours of egocentric video. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. Graves, A., Wayne, G., and Danihelka, I. Neural Turing machines. ar Xiv preprint ar Xiv:1410.5401, 2014. Han, T., Xie, W., and Zisserman, A. Memory-augmented dense predictive coding for video representation learning. In Proceedings of the European Conference on Computer Vision, 2020. Herzig, R., Ben-Avraham, E., Mangalam, K., Bar, A., Chechik, G., Rohrbach, A., Darrell, T., and Globerson, A. Object-region video transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. Hu, E. J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021. Jaegle, A., Borgeaud, S., Alayrac, J.-B., Doersch, C., Ionescu, C., Ding, D., Koppula, S., Zoran, D., Brock, A., Shelhamer, E., et al. Perceiver IO: A general architecture for structured inputs & outputs. In International Conference on Learning Representations, 2022. Jia, M., Tang, L., Chen, B.-C., Cardie, C., Belongie, S., Hariharan, B., and Lim, S.-N. Visual prompt tuning. In Proceedings of the European Conference on Computer Vision, 2022. Memory Consolidation Enables Long-Context Video Understanding Kim, S., Kim, J.-H., Lee, J., and Seo, M. Semiparametric video-grounded text generation. ar Xiv preprint ar Xiv:2301.11507, 2023. Lai, Z., Lu, E., and Xie, W. MAST: A memory-augmented self-supervised tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., and Qiao, Y. Video Chat: Chat-centric video understanding. ar Xiv preprint ar Xiv:2305.06355, 2023. Li, Y., Li, Y., and Vasconcelos, N. RESOUND: Towards action recognition without representation bias. In Proceedings of the European Conference on Computer Vision, 2018. Li, Y., Mao, H., Girshick, R., and He, K. Exploring plain vision transformer backbones for object detection. In Proceedings of the European Conference on Computer Vision, 2022a. Li, Y., Wu, C.-Y., Fan, H., Mangalam, K., Xiong, B., Malik, J., and Feichtenhofer, C. MVi Tv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022b. Lin, K., Ahmed, F., Li, L., Lin, C.-C., Azarnasab, E., Yang, Z., Wang, J., Liang, L., Liu, Z., Lu, Y., et al. MM-VID: Advancing video understanding with GPT4V(ision). ar Xiv preprint ar Xiv:2310.19773, 2023. Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., and Hu, H. Video Swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. Mangalam, K., Akshulakov, R., and Malik, J. Ego Schema: A diagnostic benchmark for very long-form video language understanding. In Advances in Neural Information Processing Systems, 2023. Marr, D. Simple memory: a theory for archicortex. Philosophical Transactions of the Royal Society of London, 1971. Martins, P. H., Marinho, Z., and Martins, A. F. -former: Infinite memory transformer. In Proceedings of the Association for Computational Linguistics, 2022. Miech, A., Zhukov, D., Alayrac, J.-B., Tapaswi, M., Laptev, I., and Sivic, J. Howto100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019. Oh, S. W., Lee, J.-Y., Xu, N., and Kim, S. J. 
Video object segmentation using space-time memory networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019. Papalampidi, P., Koppula, S., Pathak, S., Chiu, J., Heyward, J., Patraucean, V., Shen, J., Miech, A., Zisserman, A., and Nematzdeh, A. A simple recipe for contrastively pre-training video-first encoders beyond 16 frames. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. P atr aucean, V., Smaira, L., Gupta, A., Continente, A. R., Markeeva, L., Banarse, D., Koppula, S., Heyward, J., Malinowski, M., Yang, Y., et al. Perception Test: A diagnostic benchmark for multimodal video models. In Advances in Neural Information Processing Systems, 2023. Piergiovanni, A., Kuo, W., and Angelova, A. Rethinking video Vi Ts: Sparse video tubes for joint image and video learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. Piergiovanni, A., Nobel, I., Kim, D., Ryoo, M. S., Gomes, V., and Angelova, A. Mirasol3B: A multimodal autoregressive model for time-aligned and contextual modalities. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognitio, 2024. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021. Rae, J. W., Potapenko, A., Jayakumar, S. M., and Lillicrap, T. P. Compressive transformers for long-range sequence modelling. International Conference on Learning Representations, 2020. Ryali, C., Hu, Y.-T., Bolya, D., Wei, C., Fan, H., Huang, P.-Y., Aggarwal, V., Chowdhury, A., Poursaeed, O., Hoffman, J., et al. Hiera: A hierarchical vision transformer without the bells-and-whistles. In International Conference on Machine learning, 2023. Ryoo, M. S., Gopalakrishnan, K., Kahatapitiya, K., Xiao, T., Rao, K., Stone, A., Lu, Y., Ibarz, J., and Arnab, A. Token Turing machines. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. Song, E., Chai, W., Wang, G., Zhang, Y., Zhou, H., Wu, F., Guo, X., Ye, T., Lu, Y., Hwang, J.-N., et al. Movie Chat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. Memory Consolidation Enables Long-Context Video Understanding Spens, E. and Burgess, N. A generative model of memory construction and consolidation. Nature Human Behaviour, pp. 1 18, 2024. Sun, C., Shrivastava, A., Singh, S., and Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, 2017. Tulving, E. Memory and consciousness. Canadian Psychology/Psychologie canadienne, 26(1):1, 1985. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, 2017. Wang, R., Chen, D., Wu, Z., Chen, Y., Dai, X., Liu, M., Jiang, Y.-G., Zhou, L., and Yuan, L. BEVT: BERT pretraining of video transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022a. Wang, S., Zhao, Q., Do, M. Q., Agarwal, N., Lee, K., and Sun, C. Vamos: Versatile action models for video understanding. ar Xiv preprint ar Xiv:2311.13627, 2023a. 
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., and Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021. Wang, W., Dong, L., Cheng, H., Liu, X., Yan, X., Gao, J., and Wei, F. Augmenting language models with long-term memory. Advances in Neural Information Processing Systems, 2023b. Wang, Y., Li, K., Li, Y., He, Y., Huang, B., Zhao, Z., Zhang, H., Xu, J., Liu, Y., Wang, Z., et al. Intern Video: General video foundation models via generative and discriminative learning. ar Xiv preprint ar Xiv:2212.03191, 2022b. Wang, Z., Li, M., Xu, R., Zhou, L., Lei, J., Lin, X., Wang, S., Yang, Z., Zhu, C., Hoiem, D., et al. Language models with image descriptors are strong few-shot videolanguage learners. Advances in Neural Information Processing Systems, 2022c. Wu, C.-Y., Feichtenhofer, C., Fan, H., He, K., Krahenbuhl, P., and Girshick, R. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. Wu, C.-Y., Li, Y., Mangalam, K., Fan, H., Xiong, B., Malik, J., and Feichtenhofer, C. Me MVi T: Memoryaugmented multiscale vision transformer for efficient long-term video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. Xiao, J., Shang, X., Yao, A., and Chua, T.-S. Next-QA: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021. Xiao, J., Zhou, P., Yao, A., Li, Y., Hong, R., Yan, S., and Chua, T.-S. Contrastive video question answering via video graph transformer. ar Xiv preprint ar Xiv:2302.13668, 2023. Xu, H., Ghosh, G., Huang, P.-Y., Okhonko, D., Aghajanyan, A., Metze, F., Zettlemoyer, L., and Feichtenhofer, C. Video CLIP: Contrastive pre-training for zero-shot videotext understanding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2021a. Xu, M., Xiong, Y., Chen, H., Li, X., Xia, W., Tu, Z., and Soatto, S. Long short-term transformer for online action detection. Advances in Neural Information Processing Systems, 2021b. Yan, S., Xiong, X., Arnab, A., Lu, Z., Zhang, M., Sun, C., and Schmid, C. Multiview transformers for video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022a. Yan, S., Zhu, T., Wang, Z., Cao, Y., Zhang, M., Ghosh, S., Wu, Y., and Yu, J. Video-text modeling with zeroshot transfer from contrastive captioners. ar Xiv preprint ar Xiv:2212.04979, 2022b. Yang, T., Zhu, Y., Xie, Y., Zhang, A., Chen, C., and Li, M. AIM: Adapting image models for efficient video action recognition. In International Conference on Learning Representations, 2023. Ye, Q., Xu, G., Yan, M., Xu, H., Qian, Q., Zhang, J., and Huang, F. Hi Te A: Hierarchical temporal-aware videolanguage pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023. Yu, S., Cho, J., Yadav, P., and Bansal, M. Self-chained image-language model for video localization and question answering. In Advances in Neural Information Processing Systems, 2023. Zaheer, M., Guruganesh, G., Dubey, K. A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 2020. 
Memory Consolidation Enables Long-Context Video Understanding Zeng, A., Attarian, M., Choromanski, K. M., Wong, A., Welker, S., Tombari, F., Purohit, A., Ryoo, M. S., Sindhwani, V., Lee, J., et al. Socratic models: Composing zero-shot multimodal reasoning with language. In International Conference on Learning Representations, 2022. Zhang, C., Lu, T., Islam, M. M., Wang, Z., Yu, S., Bansal, M., and Bertasius, G. A simple LLM framework for long-range video question-answering. ar Xiv preprint ar Xiv:2312.17235, 2023. Memory Consolidation Enables Long-Context Video Understanding A. ST-Vi T and MA-Vi T Pseudocode We include pseudocode for streaming Vi T and memoryaugmented Vi T in Algorithm 2 and Algorithm 3. B. Training Details B.1. Pretraining Short-Video Vi Vi T Encoders In this work we re-purpose a standard Vi Vi T (Arnab et al., 2021) encoder into MC-Vi T and fine-tune it for long-context understanding as described in Section 3.3. We use a pretrained Vi Vi T from Papalampidi et al. (2024) (referred to as Short Vi Vi T therein), whose training details follow those of Xu et al. (2021a), and which we reproduce here for completeness. Note however that similar pretrained Vi Vi T s are available from public repositories2. Phase 1: Image-text pretraining. A standard Vi T image encoder and an associated BERT text encoder are pretrained with multimodal contrastive learning (Radford et al., 2021) on a mixture of paired image-text datasets (ALIGN, Jia et al. (2022); LTIP, Alayrac et al. (2022); and JFT, Sun et al. (2017)). We utilize two variants of the underlying vision and text encoder: Vi T-B with BERT-M for MC-Vi T-B, as well as Vi T-L and BERT-B for MC-Vi T-L. Phase 2: Short-video pretraining. Initializing from the phase 1 image/text encoders, Papalampidi et al. (2024) train a Vi Vi T video encoder and associated text encoder on short (16-frame) videos sampled from the How To100M (Miech et al., 2019) and VTP (Alayrac et al., 2022) datasets, together with the image datasets (treated as single-frame videos) from phase 1, again using multimodal contrastive learning (Equation 8). The parameters of the phase 1 Vi T and phase 2 Vi Vi T are identical except for the patchembedding projection and positional embeddings, which they extend temporally at initialization time by simply replicating them as in Arnab et al. (2021). B.2. Implementation Details We provide all experimental settings for training our MCVi T model variants in Table 5, categorized by dataset. We report our experimental setup for training the large model variants, but we follow the same settings when training the base versions, except for Diving48, where we use 128 instead of 64 frames. For training on Next-QA, we additionally use low-rank adaptation (Lo RA) to avoid overfitting. Given each Vi Vi T layer, we decompose the linear QKV input projection, the output 2https://huggingface.co/docs/ transformers/model_doc/vivit Algorithm 2 Streaming Vi T. def streaming_vit(video, n_chunks, n_layers, pos_emb): emb = linear_proj(video) + pos_emb chunked_video = np.split(emb, n_chunks, axis=1) zs = [] memory = None for z in chunked_video: z_norm = layer_norm(z) for _ in range(n_layers): y = self_attention(z_norm) + z y_norm = layer_norm(y) z = mlp(y_norm) + y zs.append(z) return np.concatenate(zs, axis=1) Algorithm 3 Memory-augmented Vi T. 
```python
import numpy as np

# Pseudocode: linear_proj, layer_norm, self_attention, cross_attention, mlp and
# memory_concatenation stand for the corresponding (pretrained) modules.
def ma_vit(video, n_chunks, n_layers, pos_emb):
    emb = linear_proj(video) + pos_emb
    chunked_video = np.split(emb, n_chunks, axis=1)
    memory = {layer: None for layer in range(n_layers)}
    zs = []
    for z in chunked_video:
        for layer in range(n_layers):
            z_norm = layer_norm(z)
            if memory[layer] is None:
                y = self_attention(z_norm) + z
            else:
                kv = np.concatenate([z_norm, memory[layer]], axis=1)
                y = cross_attention(q=z_norm, kv=kv) + z
            y_norm = layer_norm(y)
            z = mlp(y_norm) + y
            memory[layer] = memory_concatenation(memory[layer], z)
            memory[layer] = layer_norm(memory[layer])
        zs.append(z)
    return np.concatenate(zs, axis=1)
```

self-attention projection, and the dense feed-forward block:

$$h = W_o x + \frac{\alpha}{r} BAx,$$

where $W_o$ is the frozen pretrained weight matrix, $B$ and $A$ are zero-initialized low-rank matrices, $B \in \mathbb{R}^{d_m \times r}$, $A \in \mathbb{R}^{r \times d_m}$, $r \ll d_m$ is the rank of the decomposition matrices, and $\alpha$ is a hyperparameter for easier tuning of the model, as recommended by Hu et al. (2021). We use $r = 128$, resulting in 12% of model parameters for the large model version.

C. Additional Baselines

Joint space-time attention with masking. To implement joint space-time attention with masking, we follow Papalampidi et al. (2024) and randomly mask 25% of the input video tokens before feeding them into the transformer. To better align training with evaluation, we perform masking both at fine-tuning and inference time. In particular, this allows the masked model to be more memory- and compute-efficient than full space-time attention.

Late temporal fusion. A common approach for efficiently processing videos is applying a learnable late temporal fusion module for aggregating information over frame-level (Alayrac et al., 2022; Yan et al., 2022b) or short-video representations (Piergiovanni et al., 2024). We follow Alayrac et al. (2022) and Yan et al. (2022b) and use

Figure A.1. MC-ViT efficiently trains on long videos. Fine-grained video understanding on Diving48 as a function of number of training frames (left), train-time memory consumption per example (middle), and training complexity (FLOPS per example, right), for joint space-time attention (red), the memory-less streaming setting (green), and our proposed method MC-ViT (blue). MC-ViT reaches the highest accuracy with 3.5× less training memory and 3.2× fewer training FLOPS than joint space-time attention.

Algorithm 4. GPT-4V Zero-Shot Prompt.

You are a helpful assistant, an expert in answering questions about videos. You will be given a question about a video and five possible answer options. You will be provided frames from the video, sampled evenly across the video. You are very capable, think step-by-step when answering the question.
Question:
Possible answer choices:
A.
B.
C.
D.
E.
ONLY output A, B, C, D, or E to answer. DO NOT OUTPUT with the full answer text or any other words, output only the single letter indicating the correct choice: one of A, B, C, D, or E.
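Tying back to the LoRA decomposition described in Appendix B.2 above, the following is a minimal sketch of a low-rank adapted linear layer. It illustrates the general technique in the x @ W convention rather than the authors' implementation: the frozen weight, rank, and scaling values are placeholders, and only one factor is zero-initialized (so the update starts at zero), as in standard LoRA.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer W_o with a trainable low-rank update scaled by alpha / r."""

    def __init__(self, W_o, r=128, alpha=128, rng=None):
        rng = rng or np.random.default_rng(0)
        d_in, d_out = W_o.shape
        self.W_o = W_o                                   # frozen pretrained weights
        self.A = rng.normal(scale=0.02, size=(d_in, r))  # low-rank factors; only A and B
        self.B = np.zeros((r, d_out))                    # would be updated during fine-tuning
        self.scale = alpha / r

    def __call__(self, x):
        # h = x W_o + (alpha / r) * x A B   (the update is zero at initialization)
        return x @ self.W_o + self.scale * (x @ self.A) @ self.B

# Toy usage: adapt a 768 -> 768 projection with rank-128 factors.
rng = np.random.default_rng(0)
layer = LoRALinear(W_o=rng.normal(scale=0.02, size=(768, 768)), r=128, alpha=128, rng=rng)
print(layer(rng.normal(size=(4, 768))).shape)            # (4, 768)
```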