Published as a conference paper at ICLR 2024

TREE CROSS ATTENTION

Leo Feng, Mila, Université de Montréal & Borealis AI (leo.feng@mila.quebec)
Frederick Tung, Borealis AI (frederick.tung@borealisai.com)
Hossein Hajimirsadeghi, Borealis AI (hossein.hajimirsadeghi@borealisai.com)
Yoshua Bengio, Mila, Université de Montréal (yoshua.bengio@mila.quebec)
Mohamed Osama Ahmed, Borealis AI (mohamed.o.ahmed@borealisai.com)

ABSTRACT

Cross Attention is a popular method for retrieving information from a set of context tokens for making predictions. At inference time, for each prediction, Cross Attention scans the full set of O(N) tokens. In practice, however, often only a small subset of the tokens is required for good performance. Methods such as Perceiver IO are cheap at inference because they distill the information into a smaller set of L < N latent tokens on which cross attention is then applied, resulting in only O(L) complexity. However, in practice, as the number of input tokens and the amount of information to distill increase, the number of latent tokens needed also increases significantly. In this work, we propose Tree Cross Attention (TCA), a module based on Cross Attention that retrieves information from only a logarithmic O(log(N)) number of tokens when performing inference. TCA organizes the data in a tree structure and performs a tree search at inference time to retrieve the relevant tokens for prediction. Leveraging TCA, we introduce ReTreever, a flexible architecture for token-efficient inference. We show empirically that Tree Cross Attention (TCA) performs comparably to Cross Attention across various classification and uncertainty regression tasks while being significantly more token-efficient. Furthermore, we compare ReTreever against Perceiver IO, showing significant gains while using the same number of tokens for inference.

1 INTRODUCTION

With the rapid growth in applications of machine learning, an important objective is to make inference efficient both in terms of compute and memory. NVIDIA (Leopold, 2019) and Amazon (Barr, 2019) estimate that 80-90% of the ML workload is from performing inference. Furthermore, with the rapid growth in low-memory/compute domains (e.g., IoT devices) and the popularity of attention mechanisms in recent years, there is a strong incentive to design more efficient attention mechanisms for performing inference.

Cross Attention (CA) is a popular method at inference time for retrieving relevant information from a set of context tokens. CA scales linearly with the number of context tokens, O(N). However, in practice many of the context tokens are not needed, making CA unnecessarily expensive. General-purpose architectures such as Perceiver IO (Jaegle et al., 2021) perform inference cheaply by first distilling the contextual information down into a smaller fixed-size set of L < N latent tokens. When performing inference, information is instead retrieved from this fixed-size set of latent tokens in O(L). Methods that achieve efficient inference via distillation are problematic since (1) problems with a high intrinsic dimensionality naturally require a large number of latents, and (2) the number of latents (the capacity of the inference model) is a hyperparameter that must be specified before training. However, in many practical problems, the required model's capacity may not be known beforehand.
For example, in settings where the amount of data increases over time (e.g., Bayesian Optimization, Contextual Bandits, Active Learning), the number of latents needed at the beginning and after many data acquisition steps can be vastly different.

In this work, we propose (1) Tree Cross Attention (TCA), a replacement for Cross Attention that performs retrieval while scaling logarithmically, O(log(N)), with the number of tokens. TCA organizes the tokens into a tree structure and then performs retrieval via a tree search starting from the root, selectively choosing which information to retrieve from the tree depending on a query feature vector. TCA leverages Reinforcement Learning (RL) to learn good representations for the internal nodes of the tree. Building on TCA, we also propose (2) ReTreever, a flexible architecture that achieves token-efficient inference.

In our experiments, we show that (1) TCA achieves results competitive with Cross Attention while only requiring a logarithmic number of tokens, (2) ReTreever outperforms Perceiver IO on various classification and uncertainty estimation tasks while using the same number of tokens, (3) ReTreever's optimization objective can leverage non-differentiable objectives such as classification accuracy, and (4) TCA's memory usage scales logarithmically with the number of tokens, unlike Cross Attention, which scales linearly.

2 BACKGROUND

2.1 ATTENTION

Attention retrieves information from a context set X_c as follows:

Attention(Q, K, V) = softmax(QK^T / √d) V

where Q is the query matrix, K is the key matrix, V is the value matrix, and d is the dimension of the key and query vectors. In this work, we focus on Cross Attention, where the objective is to retrieve information from the set of context tokens X_c given query feature vectors, i.e., K and V are embeddings of the context tokens X_c while Q is the embeddings of a batch of query feature vectors. In contrast, in Self Attention, K, V, and Q are embeddings of the same set of tokens X_c and the objective is to compute higher-order information for downstream calculations.

2.1.1 PERCEIVER IO

Perceiver IO (Jaegle et al., 2021) is a general attention-based neural network architecture applicable to various tasks. Perceiver IO is composed of a stacked iterative attention encoder (R^{N×D} → R^{L×D}) and a Cross Attention module, where N is the number of context tokens and L is a hyperparameter. Perceiver IO's encoder aims to compress the information of the context tokens into a smaller, constant number L of latent tokens (Latents) whose initialization is learned. More specifically, each of the encoder's stacked iterative attention blocks works as follows:

Latents ← SelfAttention(CrossAttention(Latents, D_C))

When performing inference, Perceiver IO simply retrieves information from the set of latents to make predictions via cross attention. As a result, all the information needed for performing inference must be compressed into its small fixed number of latents. However, this is not practical for problems with high intrinsic dimensionality.

2.2 REINFORCEMENT LEARNING

Markov Decision Processes (MDPs) are defined as a tuple M = (S, A, T, R, ρ_0, γ, H), where S and A denote the state and action spaces respectively, T(s_{t+1}|s_t, a_t) the transition dynamics, R(r_{t+1}|s_t, a_t, s_{t+1}) the reward function, ρ_0 the initial state distribution, γ ∈ (0, 1] the discount factor, and H the horizon. In the standard Reinforcement Learning setting, the objective is to optimize a policy π(a|s) that maximizes the expected discounted return E_{π,T,ρ_0}[ Σ_{t=0}^{H−1} γ^t R(r_{t+1}|s_t, a_t, s_{t+1}) ].
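To make the retrieval operation concrete, the snippet below sketches single-head cross attention over a set of context tokens in PyTorch. The shapes, variable names, and the absence of learned projections are simplifications of ours, not the paper's implementation.

```python
# Minimal single-head cross attention: M queries retrieve from N context tokens.
# Illustrative sketch only; not the authors' code.
import torch

def cross_attention(q, k, v):
    """q: (M, d) query embeddings; k, v: (N, d) context embeddings.
    Returns (M, d): softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (M, N): similarity of each query to each context token
    weights = torch.softmax(scores, dim=-1)       # attend over all N context tokens -> O(N) per query
    return weights @ v                            # weighted sum of value vectors

M, N, d = 4, 256, 64
context = torch.randn(N, d)                       # embeddings of the context tokens X_c
queries = torch.randn(M, d)                       # embeddings of a batch of query feature vectors
out = cross_attention(queries, context, context)  # (4, 64)
```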
Figure 1: Architecture diagram of ReTreever. The Input Array comprises a set of N context tokens, which are fed through an encoder to compute a set of context encodings. The Query Array denotes a batch of M query feature vectors. Tree Cross Attention organizes the encodings and constructs a tree T. At inference time, given a query feature vector, a logarithmic-sized subset of nodes (encodings) is retrieved from the tree T. The query feature vector retrieves information from the subset of encodings via Cross Attention and makes a prediction.

Figure 2: Diagram of the aggregation procedure performed during the Tree Construction phase. The aggregation procedure is performed bottom-up, beginning from the parents of the leaves and ending at the root of the tree. The complexity of this procedure is O(N), but it only needs to be performed once for a set of context tokens. Compared to the cost of performing multiple predictions, the one-time cost of the aggregation process is minor.

3 TREE CROSS ATTENTION

We propose Tree Cross Attention (TCA), a token-efficient variant of Cross Attention. Tree Cross Attention is composed of three phases: (1) Tree Construction, (2) Retrieval, and (3) Cross Attention.

In the Tree Construction phase (R^{N×D} → T), TCA organizes the context tokens (R^{N×D}) into a tree structure T such that the context tokens are the leaves of the tree. The internal (i.e., non-leaf) nodes of the tree summarise the information in their subtree. Notably, this phase only needs to be performed once for a set of context tokens. Retrieval and Cross Attention are performed multiple times at inference time. During the Retrieval phase (T × R^D → R^{O(log(N))×D}), the model retrieves a logarithmic-sized selected subset of nodes S from the tree using a query feature vector q ∈ R^D (or a batch of query feature vectors in R^{M×D}). Afterwards, Cross Attention (R^{O(log(N))×D} × R^D → R^D) is performed, retrieving the information from the subset of nodes with the query feature vector. The overall complexity of inference is O(log(N)) per query feature vector. We detail these phases fully in the subsections below. After describing the phases of TCA, we introduce ReTreever, a general architecture for token-efficient inference. Figure 1 illustrates the architecture of Tree Cross Attention and ReTreever.

Figure 3: Example result of a Retrieval phase. The policy creates a path from the tree's root to its leaves, selecting a subset of nodes: the terminal leaf of the path and the highest-level unexplored ancestors of the other leaves. The green arrows represent the path (actions) chosen by the policy π. The red arrows represent the actions rejected by the policy. The green nodes denote the subset of selected nodes, i.e., S = {h_2, h_3, h_9, h_10}. The grey nodes denote nodes that were explored at some point but not selected. The red nodes denote nodes that were neither explored nor selected.

3.1 TREE CONSTRUCTION

The tokens are organized in a tree T such that the leaves of the tree consist of all the tokens. The internal nodes (i.e., non-leaf nodes) summarise the information of the nodes in their subtree.
The information stored in a node has two specific desiderata: summarising the information in its subtree needed for performing (1) predictions and (2) retrieval (i.e., finding the specific nodes in its subtree relevant to the query feature vector). The method of organizing the data in the tree is flexible and can be done either with prior knowledge of the structure of the data, with simple heuristics, or potentially learned. For example, heuristics used to organize the data in traditional tree algorithms can be used, e.g., a ball tree (Omohundro, 1989) or a k-d tree (Cormen et al., 2006).

After organizing the data in the tree, an aggregator function Agg : P(R^d) → R^d (where P denotes the power set) is used to aggregate the information of the tokens. Starting from the parent nodes of the leaves of the tree, the following aggregation procedure is performed bottom-up until the root of the tree:

h_v = Agg({h_u | u ∈ C_v})

where C_v denotes the set of children of a node v and h_v denotes the vector representing the node. Figure 2 illustrates the aggregation process.

In our experiments, we consider a balanced binary tree. To organize the data, we use a k-d tree. During their construction, k-d trees select an axis and split the data evenly according to the median, grouping similar data together. For sequence data such as time series, the data is split according to the x-axis (e.g., time). For images, this means that the pixels are split according to the x- or y-axis, i.e., vertically or horizontally. To ensure the tree is perfectly balanced, padding tokens are added; their values are masked during computation. Notably, the number of padding nodes is less than the number of tokens, so it does not affect the big-O complexity.

3.2 RETRIEVAL

To learn informative internal node embeddings for retrieval, we leverage a policy π learned via Reinforcement Learning for retrieving a subset of nodes from the tree. Algorithm 1 describes the process of retrieving the set of selected nodes S from the tree T. Figure 3 illustrates an example of the result of this phase. Figures 7 to 11 in the Appendix illustrate a step-by-step example of the retrieval described in Algorithm 1.

v denotes the node of the tree currently being explored. At the start of the search, it is initialized as v ← Root(T). The policy's state is the set of children of the node currently being explored, C_v, and the query feature vector q, i.e., s = (C_v, q). Starting from the root, the policy samples one of the children, i.e., v′ ∼ π_θ(a | C_v, q) where a ∈ C_v, to explore further for more detailed information retrieval. The nodes that are not explored further are added to the set of selected nodes, S ← S ∪ (C_v \ {v′}). Afterwards, v is updated (v ← v′) with the new node and the process is repeated until v is a leaf node (i.e., the search is complete).

Algorithm 1 Retrieval
Input: tree structure T, policy π_θ, and query feature vector q
Output: set of selected nodes S
  v ← Root(T)
  S ← ∅
  while v is not a leaf do
      v′ ∼ π_θ(a | C_v, q) where a ∈ C_v    (C_v denotes the children of v)
      S ← S ∪ (C_v \ {v′})
      v ← v′
  end while
  S ← S ∪ {v}

Since the height of a balanced tree is O(log(N)), the number of nodes that are explored and retrieved is logarithmic: |S| = O(log(N)).
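The snippet below sketches both phases on a balanced binary tree: tokens are aggregated bottom-up into internal nodes, and a query then walks the tree as in Algorithm 1. The mean aggregator and the greedy dot-product scoring are stand-ins for the learned self-attention aggregator and attention-based policy used in the paper; the rest of the structure follows the description above.

```python
# Sketch of Tree Construction + Retrieval (Algorithm 1). Mean aggregation and a
# dot-product "policy" are simplifications; the paper learns both.
import torch

class Node:
    def __init__(self, h, children=()):
        self.h = h                      # node embedding (leaf token or aggregated summary)
        self.children = list(children)

def build_tree(tokens):
    """tokens: (N, d) leaf embeddings, already ordered (e.g., along a k-d split axis).
    Returns the root of a balanced binary tree whose internal nodes summarise their subtree."""
    nodes = [Node(t) for t in tokens]
    while len(nodes) > 1:
        paired = []
        for i in range(0, len(nodes), 2):
            pair = nodes[i:i + 2]
            h = torch.stack([n.h for n in pair]).mean(dim=0)   # Agg: mean as a stand-in
            paired.append(Node(h, pair))
        nodes = paired                  # bottom-up aggregation toward the root
    return nodes[0]

def retrieve(root, q):
    """Algorithm 1 with a greedy dot-product policy: returns the selected set S,
    roughly log2(N) + 1 node embeddings with a full receptive field."""
    v, S = root, []
    while v.children:
        scores = torch.stack([c.h @ q for c in v.children])    # policy score per child
        chosen = int(scores.argmax())                          # explore the best child
        S += [c.h for i, c in enumerate(v.children) if i != chosen]  # keep the rejected siblings
        v = v.children[chosen]
    S.append(v.h)                                              # terminal leaf
    return torch.stack(S)

N, d = 16, 8                            # N is a power of two here, so no padding is needed
tokens = torch.randn(N, d)
root = build_tree(tokens)
q = torch.randn(d)
S = retrieve(root, q)
print(S.shape)                          # (log2(16) + 1, 8) = (5, 8)
```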
3.3 CROSS ATTENTION

The retrieved set of nodes S is used as input for Cross Attention alongside the query feature vector q. Since |S| = O(log(N)), this results in Cross Attention with an overall logarithmic complexity, O(log(N)), per query feature vector. In contrast, applying Cross Attention to the full set of tokens has linear complexity, O(N). Notably, the set of nodes S has a full receptive field over the entire set of tokens, i.e., every token is either part of the selected set of nodes or a descendant (in the tree) of one of the selected nodes.

3.4 RETREEVER: RETRIEVAL VIA TREE CROSS ATTENTION

In this section, we propose ReTreever (Figure 1), a general-purpose model that achieves token-efficient inference by leveraging Tree Cross Attention. The architecture is similar in style to Perceiver IO's (Figure 5 in the Appendix). In the case of Perceiver IO, the model is composed of (1) an iterative attention encoder (R^{N×D} → R^{L×D}) which compresses the information into a smaller fixed-size set of L latent tokens and (2) a Cross Attention module used during inference for performing information retrieval from the set of latent tokens. As a result, Perceiver IO's inference complexity is O(L). In contrast, ReTreever is composed of (1) an encoder (R^{N×D} → R^{N×D}) and (2) a Tree Cross Attention (TCA) module used during inference for performing information retrieval from a tree structure. Unlike Perceiver IO, which compresses information via a specialized encoder to achieve efficient inference, ReTreever's inference is token-efficient irrespective of the encoder, scaling logarithmically O(log(N)) with the number of tokens. As such, the choice of encoder is flexible and can be, for example, a Transformer encoder or efficient versions such as Linformer, ChordMixer, etc.

3.5 TRAINING OBJECTIVE

The objective ReTreever optimises consists of three components with hyperparameters λ_RL and λ_CA denoting the weights of the terms:

L_ReTreever = L_TCA + λ_RL L_RL + λ_CA L_CA

L_TCA aims to tackle the first desideratum, i.e., learning node representations in the tree that summarise the relevant information in their subtree for making good predictions:

L_TCA = Loss(CrossAttention(x, S), y)

L_RL tackles the second desideratum, i.e., learning internal node representations for retrieval (the RL policy π) within the tree structure:¹

L_RL = − Σ_{t=0}^{H−1} [ R log π_θ(a_t|s_t) + α H[π_θ(·|s_t)] ]

The horizon is the height of the tree, log(N); H denotes entropy, and R denotes the reward/objective we want ReTreever to maximise.² Typically, this reward corresponds to the negative TCA loss, R = −L_TCA, e.g., negative MSE, log-likelihood, and negative cross entropy for regression, uncertainty estimation, and classification tasks, respectively. Crucially, however, the reward does not need to be differentiable. As such, R can also be an objective we are typically not able to optimize directly via gradient descent, e.g., accuracy for classification tasks (R = 1 for a correct classification and R = 0 for an incorrect classification).

Furthermore, at inference time, the objectives of the policy π and the Cross Attention module are intertwined in that the policy aims to select relevant nodes for the Cross Attention. As such, to improve training, the weights are shared between the policy and the Cross Attention module. More specifically, the policy is parameterized as an attention module whose action probabilities are the attention weights of the context (i.e., the child nodes C_v) given the query (i.e., q).

Lastly, to improve the early stages of training and encourage TCA to learn good node representations, L_CA is included:

L_CA = Loss(CrossAttention(x, Leaves(T)), y)

¹The described objective is the standard REINFORCE loss (without discounting, γ = 1) using an entropy bonus for a sparse reward environment.
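As a rough sketch of how the three terms could be combined in code, consider the following. The helper functions (`retrieve_with_logprobs`, `cross_attention_predict`, `loss_fn`), the accumulation of log-probabilities along the search path, and the use of the detached negative TCA loss as the terminal reward are our assumptions based on the description above, not the released implementation.

```python
# Schematic of L = L_TCA + lambda_RL * L_RL + lambda_CA * L_CA for a single query.
# The three callables are assumed helpers supplied by the user.
import torch

def retreever_loss(tree, leaves, q, y, loss_fn,
                   retrieve_with_logprobs, cross_attention_predict,
                   lam_rl=1.0, lam_ca=1.0, alpha=0.01):
    # (1) L_TCA: predict from the selected subset S of tree nodes.
    S, log_probs, entropies = retrieve_with_logprobs(tree, q)  # per-step log pi(a_t|s_t) and H[pi(.|s_t)]
    loss_tca = loss_fn(cross_attention_predict(q, S), y)

    # (2) L_RL: REINFORCE with an entropy bonus. The only reward is terminal, so the
    #     undiscounted return at every timestep is R = -L_TCA (detached, need not be differentiable).
    reward = -loss_tca.detach()
    loss_rl = -(reward * log_probs.sum() + alpha * entropies.sum())

    # (3) L_CA: auxiliary cross attention over all leaves to stabilise early training.
    loss_ca = loss_fn(cross_attention_predict(q, leaves), y)

    return loss_tca + lam_rl * loss_rl + lam_ca * loss_ca
```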
4 EXPERIMENTS

Our objectives in the experiments are: (1) Compare Tree Cross Attention (TCA) with Cross Attention in terms of the number of tokens used and the performance. We answer this in two ways: (i) by comparing Tree Cross Attention and Cross Attention directly on a memory retrieval task, and (ii) by comparing our proposed model ReTreever, which uses TCA, with models that use Cross Attention. (2) Compare the performance of ReTreever and Perceiver IO while using the same number of tokens. In terms of analyses, we aim to: (3) highlight how ReTreever and TCA can optimize for non-differentiable objectives using L_RL, improving performance; (4) show the importance of the different loss terms in the training objective; and (5) empirically show the rate at which Cross Attention and TCA memory grow with respect to the number of tokens.

To tackle these objectives, we consider benchmarks for classification (Copy Task, Human Activity) and uncertainty estimation settings (GP Regression and Image Completion). Our focus is on (1) comparing Tree Cross Attention with Cross Attention, and (2) comparing ReTreever with general-purpose models: (i) Perceiver IO with the same number of latents and (ii) Transformer (Encoder) + Cross Attention. Full details regarding the hyperparameters are included in the appendix.³

For completeness, in the appendix, we include the results for many other baselines (including recent state-of-the-art methods) for the problem settings considered in this paper. The results show that the baseline Transformer + Cross Attention achieves results competitive with prior state-of-the-art. Specifically, we compared against the following baselines for GP Regression and Image Completion: LBANPs (Feng et al., 2023a), TNPs (Nguyen & Grover, 2022), NPs (Garnelo et al., 2018b), BNPs (Lee et al., 2020), CNPs (Garnelo et al., 2018a), CANPs (Kim et al., 2019), ANPs (Kim et al., 2019), and BANPs (Lee et al., 2020). We compared against the following baselines for Human Activity: SeFT (Horn et al., 2020), RNN-Decay (Che et al., 2018), IP-Nets (Shukla & Marlin, 2019), L-ODE-RNN (Chen et al., 2018), L-ODE-ODE (Rubanova et al., 2019), and mTAND-Enc (Shukla & Marlin, 2021). Additionally, we include results in the Appendix for Perceiver IO with various numbers of latent tokens, showing that Perceiver IO requires several times more tokens to achieve performance comparable to ReTreever.

4.1 COPY TASK

We first verify the ability of Cross Attention and Tree Cross Attention (TCA) to perform retrieval. The models are provided with a sequence of length N = 2^k, beginning with a [BOS] (Beginning of Sequence) token, followed by a randomly generated palindrome comprising 2^k − 2 digits, and ending with an [EOS] (End of Sequence) token. The objective of the task is to predict the second half (2^{k−1} tokens) of the sequence given the first half (2^{k−1} tokens) of the sequence as context. To make a correct prediction for index 2^k − i (where 0 < i ≤ 2^{k−1}), the model must retrieve information from its input/context sequence at index i + 1. The model is evaluated on its accuracy on 3,200 randomly generated test sequences. As a point of comparison, we include Random as a baseline, which makes random predictions sampled from the set of digits (0-9), [EOS], and [BOS] tokens.

²The only reward R is given at the end of the episode. As such, the undiscounted return for each timestep is R.
³The code is available at https://github.com/BorealisAI/tree-cross-attention.
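Under our reading of the task description, a batch of copy-task sequences could be generated roughly as follows; the token ids and function names are illustrative, not from the released code.

```python
# Sketch of the copy task: [BOS], a palindrome of 2^k - 2 random digits, [EOS].
# The first half of the sequence is the context; the second half is the target.
import torch

BOS, EOS, NUM_DIGITS = 10, 11, 10       # digits 0-9 plus two special tokens (ids are illustrative)

def make_copy_batch(batch_size, k):
    n = 2 ** k
    half = (n - 2) // 2
    left = torch.randint(0, NUM_DIGITS, (batch_size, half))
    digits = torch.cat([left, left.flip(dims=[1])], dim=1)          # palindrome of 2^k - 2 digits
    seq = torch.cat([torch.full((batch_size, 1), BOS), digits,
                     torch.full((batch_size, 1), EOS)], dim=1)      # total length 2^k
    context, target = seq[:, :n // 2], seq[:, n // 2:]              # predict second half from first half
    return context, target

ctx, tgt = make_copy_batch(batch_size=4, k=8)                       # N = 256
print(ctx.shape, tgt.shape)                                         # torch.Size([4, 128]) torch.Size([4, 128])
```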
Method            N = 256                 N = 512                 N = 1024
                  % Tokens   Accuracy     % Tokens   Accuracy     % Tokens   Accuracy
Cross Attention   100.0%     100.0 ± 0.0  100.0%     100.0 ± 0.0  100.0%     99.9 ± 0.2
Random            --         8.3 ± 0.0    --         8.3 ± 0.0    --         8.3 ± 0.0
Perceiver IO      6.3%       15.2 ± 0.0   3.5%       13.4 ± 0.2   2.0%       11.6 ± 0.4
TCA               6.3%       100.0 ± 0.0  3.5%       100.0 ± 0.0  2.0%       99.6 ± 0.6

Table 1: Copy Task results with accuracy (higher is better) and % tokens (lower is better) metrics.

Method                          % Tokens   RBF           Matern 5/2
Transformer + Cross Attention   100%       1.35 ± 0.02   0.91 ± 0.02
Perceiver IO                    14.9%      1.06 ± 0.05   0.58 ± 0.05
Perceiver IO (L = 32)           68.0%      1.25 ± 0.04   0.78 ± 0.05
ReTreever                       14.9%      1.25 ± 0.02   0.81 ± 0.02

Table 2: 1-D Meta-Regression experiments with log-likelihood (higher is better) and % tokens (lower is better) metrics.

Results. We found that both Cross Attention and Tree Cross Attention were able to solve this task perfectly (Table 1). In comparison, TCA requires 50× fewer tokens than Cross Attention. Furthermore, we found that Perceiver IO's performance for the same number of tokens was dismal (15.2 ± 0.0% accuracy at N = 256), further dropping in performance as the length of the sequence increased. This result is expected as Perceiver IO aims to compress all the relevant information for any predictions into a small fixed-sized set of latents. However, all the tokens are relevant depending on the queried index. As such, methods which distill information to a lower-dimensional space are insufficient. In contrast, TCA performs tree-based memory retrieval, selectively retrieving a small subset of tokens for predictions. As such, TCA is able to solve the task perfectly while using the same number of tokens as Perceiver IO.

4.2 UNCERTAINTY ESTIMATION: GP REGRESSION AND IMAGE COMPLETION

We evaluate ReTreever on popular uncertainty estimation settings used in the (Conditional) Neural Processes literature and which have been benchmarked extensively (Table 13 in the Appendix) (Garnelo et al., 2018a;b; Kim et al., 2019; Lee et al., 2020; Nguyen & Grover, 2022; Feng et al., 2023a;b).

4.2.1 GP REGRESSION

The goal of the GP Regression task is to model an unknown function f given N points. During training, the functions are sampled from a GP prior with an RBF kernel, f_i ∼ GP(m, k) where m(x) = 0 and k(x, x′) = σ_f² exp(−(x − x′)² / (2l²)). The hyperparameters of the kernel are randomly sampled according to l ∼ U[0.6, 1.0), σ_f ∼ U[0.1, 1.0), N ∼ U[3, 47), and M ∼ U[3, 50 − N). After training, the models are evaluated according to the log-likelihood of functions sampled from GPs with RBF and Matern 5/2 kernels.

Results. In Table 2, we see that ReTreever (1.25 ± 0.02) outperforms Perceiver IO (1.06 ± 0.05) by a large margin while using the same number of tokens for inference. To see how many latents Perceiver IO would need to achieve performance comparable to ReTreever, we varied the number of latents (see Appendix Table 13 for full results). We found that Perceiver IO needed 4.6× the number of tokens to achieve comparable performance.
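For concreteness, the following NumPy sketch draws a training function from the GP prior described above. The hyperparameter and size ranges follow the text; the input range and the diagonal jitter are our assumptions.

```python
# Sketch of sampling one GP Regression task from the RBF prior described above.
import numpy as np

def rbf_kernel(x1, x2, sigma_f, length):
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f ** 2 * np.exp(-d2 / (2 * length ** 2))

def sample_gp_task(rng):
    length = rng.uniform(0.6, 1.0)                    # l ~ U[0.6, 1.0)
    sigma_f = rng.uniform(0.1, 1.0)                   # sigma_f ~ U[0.1, 1.0)
    n_ctx = rng.integers(3, 47)                       # N ~ U[3, 47)
    n_tgt = rng.integers(3, 50 - n_ctx)               # M ~ U[3, 50 - N)
    x = rng.uniform(-2.0, 2.0, size=n_ctx + n_tgt)    # input range is an assumption
    cov = rbf_kernel(x, x, sigma_f, length) + 1e-6 * np.eye(len(x))  # jitter for stability
    y = rng.multivariate_normal(np.zeros(len(x)), cov)
    return (x[:n_ctx], y[:n_ctx]), (x[n_ctx:], y[n_ctx:])            # context set, target set

rng = np.random.default_rng(0)
(ctx_x, ctx_y), (tgt_x, tgt_y) = sample_gp_task(rng)
```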
4.2.2 IMAGE COMPLETION

The goal of the Image Completion task is to make predictions for the pixels of an image given a random subset of the image's pixels. The CelebA dataset comprises coloured images of celebrity faces; the images are downsized to 32 × 32. The EMNIST dataset comprises black-and-white images of handwritten letters with a resolution of 32 × 32; we consider 10 classes. Following prior works, the x values of the images are rescaled to be between [−1, 1], the y values to be between [−0.5, 0.5], and the number of pixels is sampled according to N ∼ U[3, 197) and M ∼ U[3, 200 − N).

Method                          CelebA                       EMNIST
                                % Tokens   Log-Likelihood    % Tokens   Log-Likelihood
Transformer + Cross Attention   100%       3.88 ± 0.01       100%       1.41 ± 0.00
Perceiver IO                    4.6%       3.20 ± 0.01       4.6%       1.25 ± 0.01
ReTreever                       4.6%       3.52 ± 0.01       4.6%       1.30 ± 0.01

Table 3: Image Completion experiments. The methods are evaluated according to the log-likelihood (higher is better) and % tokens (lower is better) metrics.

Results. In Table 3, we see that ReTreever outperforms Perceiver IO significantly on both CelebA and EMNIST while using the same number of tokens. Compared with Transformer + Cross Attention, ReTreever uses 21× fewer tokens, making it significantly more token-efficient.

4.3 TIME SERIES: HUMAN ACTIVITY

The Human Activity dataset consists of 3D positions of the waist, chest, and ankles (12 features) collected from five individuals performing various activities such as walking, lying, and standing. We follow prior works' preprocessing steps, constructing a dataset of 6,554 sequences with 50 time points each. The objective of this task is to classify each time point into one of eleven types of activities.

Model                           % Tokens   Accuracy
Transformer + Cross Attention   100%       89.1 ± 1.3
Perceiver IO                    14%        87.6 ± 0.3
ReTreever                       14%        88.9 ± 0.4

Table 4: Human Activity experiments with accuracy (higher is better) and % tokens (lower is better) metrics.

Results. Table 4 shows conclusions similar to our prior experiments: (1) ReTreever performs comparably to Transformer + Cross Attention while requiring far fewer tokens (86% fewer), and (2) ReTreever outperforms Perceiver IO significantly while using the same number of tokens.

4.4 ANALYSES

Optimising non-differentiable objectives using L_RL. In classification, accuracy is typically the metric we truly care about. However, accuracy is a non-differentiable objective, so in deep learning we often use cross entropy instead, so that the model's parameters can be optimized via gradient descent. In the case of RL, however, the reward does not need to be differentiable. As such, we compare the performance of ReTreever when optimizing for accuracy and for the negative cross entropy. Table 5 shows that accuracy as the RL reward improves upon using negative cross entropy. This is expected since (1) the metric we evaluate our model on is accuracy and not cross entropy, and (2) accuracy is a simpler reward than negative cross entropy, making it easier for a policy to optimize. More specifically, a reward using accuracy as the metric is either R = 0 for an incorrect prediction or R = 1 for a correct prediction, whereas a reward based on cross entropy can have vastly different values for incorrect predictions.

Method          N = 256                 N = 512                 N = 1024
                % Tokens   Accuracy     % Tokens   Accuracy     % Tokens   Accuracy
TCA (Acc)       6.3%       100.0 ± 0.0  3.5%       100.0 ± 0.0  2.0%       99.6 ± 0.6
TCA (Neg. CE)   6.3%       99.8 ± 0.3   3.5%       95.5 ± 3.8   2.0%       80.3 ± 14.8

Table 5: Comparison of accuracy and negative cross entropy as the reward.
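The two reward choices compared in Table 5 amount to roughly the following sketch; the logits, labels, and per-example reduction are placeholders of ours.

```python
# Sketch of the two terminal rewards compared above: binary accuracy vs.
# real-valued negative cross entropy, computed per prediction.
import torch
import torch.nn.functional as F

def reward(logits, labels, kind="accuracy"):
    if kind == "accuracy":                      # non-differentiable, but valid as an RL reward
        return (logits.argmax(dim=-1) == labels).float()          # R = 1 if correct, else 0
    return -F.cross_entropy(logits, labels, reduction="none")     # negative CE per example

logits = torch.randn(8, 12)                     # e.g., 10 digits + [BOS] + [EOS] classes (illustrative)
labels = torch.randint(0, 12, (8,))
print(reward(logits, labels, "accuracy"), reward(logits, labels, "neg_ce"))
```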
Figure 4: Analysis plots. (Left) Memory usage, comparing the rate at which memory usage grows at inference time relative to the number of tokens. (Middle) Training curves with varying weights (λ_CA) for the Cross Attention loss term. (Right) Training curves with varying weights (λ_RL) for the RL retrieval loss term.

Memory usage. Tree Cross Attention's memory usage (Figure 4, left) is significantly more efficient than Cross Attention's, growing logarithmically in the number of tokens rather than linearly.

Importance of the different loss terms λ_CA and λ_RL. Figure 4 (right) shows that the weight of the RL loss term is crucial to achieving good performance. λ_RL = 1.0 performed the best. If the weight of the term is too small (λ_RL ∈ {0.0, 0.1}) or too large (λ_RL = 10.0), the model is unable to solve the task. Figure 4 (middle) shows that the weight of the Cross Attention loss term improves the stability of training, particularly in the early stages. Setting λ_CA = 1.0 performed the best. Setting too small a value (λ_CA ∈ {0.0, 0.1}) causes some erratic behaviour in the early stages of training. Setting too large a value (λ_CA = 10.0) slows down training.

5 RELATED WORK

A few works have proposed leveraging a tree-based architecture for attention (Tai et al., 2015; Nguyen et al., 2020; Madaan et al., 2023; Wang et al., 2019), the closest of which to our work is Treeformer (Madaan et al., 2023). Unlike prior works (including Treeformer), which focused on replacing self-attention in transformers with tree-based attention, we focus on replacing cross attention with a tree-based cross attention mechanism for efficient memory retrieval at inference time. Furthermore, Treeformer (1) uses a decision tree to find the leaf that most closely resembles the query feature vector, (2) TF-A has a partial receptive field comprising only the tokens in the selected leaf, and (3) TF-A has a worst-case linear complexity in the number of tokens per query. In contrast, TCA (1) learns a policy to select a subset of nodes, (2) retrieves a subset of nodes with a full receptive field, and (3) has a guaranteed logarithmic complexity.

As trees are a form of graph, Tree Cross Attention bears a resemblance to Graph Neural Networks (GNNs). The objectives of GNNs and TCA are, however, different. In GNNs, the objective is to perform edge, node, or graph predictions. The goal of TCA, in contrast, is to search the tree for a subset of nodes that is relevant to a query. Furthermore, unlike GNNs, which typically consider the tokens to correspond one-to-one with nodes of the graph, TCA only considers the leaves of the tree to be tokens. We refer the reader to surveys on GNNs (Wu et al., 2020; Thomas et al., 2023).

6 CONCLUSION

In this work, we proposed Tree Cross Attention (TCA), a variant of Cross Attention that only requires a logarithmic O(log(N)) number of tokens when performing inference. By leveraging RL, TCA can optimize non-differentiable objectives such as accuracy. Building on TCA, we introduced ReTreever, a flexible architecture for token-efficient inference. We evaluate across various classification and uncertainty prediction tasks, showing that (1) TCA achieves performance comparable to Cross Attention while being significantly more token-efficient and (2) ReTreever outperforms Perceiver IO while using the same number of tokens for inference.

REFERENCES

Jeff Barr.
Amazon ec2 update inf1 instances with aws inferentia chips for high performance cost-effective inferencing, Dec 2019. URL https://aws.amazon.com/blogs/aws/ amazon-ec2-update-inf1-instances-with-aws-inferentia-chips-forhigh-performance-cost-effective-inferencing/. Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values. Scientific reports, 8(1):6085, 2018. Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018. Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms, Third Edition. 2006. Leo Feng, Hossein Hajimirsadeghi, Yoshua Bengio, and Mohamed Osama Ahmed. Latent bottlenecked attentive neural processes. In International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=y Ixteviz EA. Leo Feng, Frederick Tung, Hossein Hajimirsadeghi, Yoshua Bengio, and Mohamed Osama Ahmed. Constant memory attentive neural processes. ar Xiv preprint ar Xiv:2305.14567, 2023b. Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, and SM Ali Eslami. Conditional neural processes. In International Conference on Machine Learning, pp. 1704 1713. PMLR, 2018a. Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. ar Xiv preprint ar Xiv:1807.01622, 2018b. Max Horn, Michael Moor, Christian Bock, Bastian Rieck, and Karsten Borgwardt. Set functions for time series. In International Conference on Machine Learning, pp. 4353 4363. PMLR, 2020. Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver io: A general architecture for structured inputs & outputs. In International Conference on Learning Representations, 2021. Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes. 2019. Juho Lee, Yoonho Lee, Jungtaek Kim, Eunho Yang, Sung Ju Hwang, and Yee Whye Teh. Bootstrapping neural processes. Advances in neural information processing systems, 33:6606 6615, 2020. George Leopold. Aws to offer nvidia s t4 gpus for ai inferencing, Mar 2019. URL https://www.hpcwire.com/2019/03/19/aws-upgrades-its-gpu-backedai-inference-platform/. Lovish Madaan, Srinadh Bhojanapalli, Himanshu Jain, and Prateek Jain. Treeformer: Dense gradient trees for efficient attention computation. In International Conference on Learning Representations, 2023. Tung Nguyen and Aditya Grover. Transformer neural processes: Uncertainty-aware meta learning via sequence modeling. In International Conference on Machine Learning, pp. 16569 16594. PMLR, 2022. Xuan-Phi Nguyen, Shafiq Joty, Steven Hoi, and Richard Socher. Tree-structured attention with hierarchical accumulation. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJx K5p EYvr. Stephen M Omohundro. Five balltree construction algorithms. International Computer Science Institute Berkeley, 1989. Published as a conference paper at ICLR 2024 Yulia Rubanova, Ricky TQ Chen, and David K Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. 
Advances in neural information processing systems, 32, 2019. Satya Narayan Shukla and Benjamin Marlin. Interpolation-prediction networks for irregularly sampled time series. In International Conference on Learning Representations, 2019. Satya Narayan Shukla and Benjamin Marlin. Multi-time attention networks for irregularly sampled time series. In International Conference on Learning Representations, 2021. Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. ar Xiv preprint ar Xiv:1503.00075, 2015. Josephine Thomas, Alice Moallemy-Oureh, Silvia Beddar-Wiesing, and Clara Holzh uter. Graph neural networks designed for different graph types: A survey. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id= h4BYt Z79uy. Yau-Shian Wang, Hung-Yi Lee, and Yun-Nung Chen. Tree transformer: Integrating tree structures into self-attention. ar Xiv preprint ar Xiv:1909.06639, 2019. Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4 24, 2020. Published as a conference paper at ICLR 2024 A APPENDIX: ILLUSTRATIONS A.1 BASELINE ARCHITECTURES Figures 5 and 6 illustrate the difference in architectures between the main baselines considered in the paper. Transformer + Cross Attention is composed of a Transformer encoder (RN D RN D) and a Cross Attention module (RM D RN D RM D). Perceiver IO is composed of an iterative attention encoder (RN D RL D) and a Cross Attention module (RM D RL D RM D). In contrast, Re Treever (Figure 1) is composed of a flexible encoder (RN D RN D) and a Tree Cross Attention module. Empirically, we showed that Transformer + Cross Attention achieves good performance. However, its Cross Attention is inefficient due to retrieving from the full set of tokens. Notably, Perceiver IO achieves its inference-time efficiency by performing a compression via an iterative attention encoder. To achieve token-efficient inferences, the number of latents needs to be significantly less than the number of tokens originally, i.e., L << N. However, this is not practical in hard problems since setting low values for L results in significant information loss due to the latents being a bottleneck. In contrast, Re Treever is able to perform token-efficient inference while achieving better performance than Perceiver IO for the same number of tokens. Re Treever does this by using Tree Cross Attention to retrieve the necessary tokens, only needing a logarithmic number of tokens log(N) << N, making it efficient regardless of the encoder used. Crucially, this also means that Re Treever is more flexible with its encoder than Perceiver IO, allowing for customizing the kind of encoder depending on the setting. Figure 5: Architecture Diagram of Perceiver IO. The model is composed of an iterative attention encoder (RN D RL D) and a Cross Attention module (RM D RL D RM D). Figure 6: Architecture Diagram of Transformer + Cross Attention. The model is composed of a Transformer encoder (RN D RN D) and a Cross Attention module (RM D RN D RM D). A.2 EXAMPLE RETRIEVAL In Figures 7 to 11, we illustrate an example of the retrieval process described in Algorithm 1. Published as a conference paper at ICLR 2024 Figure 7: Example Retrieval Step 0 Illustration. The diamond icon on a node indicates the node that will be explored. 
The retrieval process begins at the root, i.e., v 0 where 0 is the index of the root node. Yellow nodes denote the nodes that have yet to be explored and may be explored in the future. Grey nodes denote the nodes that are or have been explored but have not been added to the selected set S. Red nodes denote the nodes that have not been explored and will not be explored in the future. Green nodes denote the subset of nodes selected, i.e., S. Figure 8: Example Retrieval Step 1 Illustration. Green arrows represent the path (actions) chosen by the policy π for a query feature vector. Red arrows represent the actions rejected by the policy. The right child of v = 0 was rejected, so it is added to the selected set S. As a consequence, descendant nodes of v s right child will never be explored, so they are coloured red in the diagram. Tentatively, S = {h2}. Because the policy selected the left child of v = 0, we thus have v 1 Published as a conference paper at ICLR 2024 Figure 9: Example Retrieval Step 2 Illustration. The left child of v = 1 is rejected this time, so it is added to the selected set S (green). Tentatively, S = {h2, h3}. Consequently, the descendant nodes of v s left child will never be explored, so they are coloured red. Because the policy selected the right child of v = 1, we thus have v 4. Figure 10: Example Retrieval Step 3 Illustration. The right child of v = 4 is rejected this time, so it is added to the selected set S. Tentatively, S = {h2, h3, h10}. The descendant nodes of v s right child are to be coloured red, but there are none. Because the policy selected the left child of v = 4, we thus have v 9 Figure 11: Example Retrieval Step 4 Illustration. v = 9 is a leaf, so the tree search has reached its end. At the end of the tree search, we simply add the node to the selected set, resulting in S = {h2, h3, h9, h10}. Published as a conference paper at ICLR 2024 Method % Tokens Celeb A EMNIST Transformer + Cross Attention 100% 3.88 0.01 1.41 0.00 Perceiver IO 4.6% 3.20 0.01 1.25 0.00 Re Treever-Random 4.6% 3.12 0.02 1.26 0.01 Re Treever-KD-x0 4.6% 3.50 0.02 1.30 0.01 Re Treever-KD-x1 4.6% 3.52 0.01 1.26 0.02 Re Treever-KD-RP 4.6% 3.51 0.01 1.29 0.02 Table 6: Tree Design Analyses Experiments. We compare the performance of Re Treever with various heuristics for structuring the tree (1) organized randomly (Re Treever-Random), (2) sorting via the first index x0 of the image data (Re Treever-KD-x0), and (3) sorting via the second index x1 of the image data (Re Treever-KD-x1), and (4) sorting the value after a Random Projection (Re Treever KD-RP). Method GP Regression Image Completion % Tokens RBF Matern 5/2 % Tokens Celeb A Re Treever 14.9% 1.25 0.02 0.81 0.02 4.6% 3.52 0.01 Re Treever-Full 100% 1.33 0.02 0.90 0.01 100% 3.77 0.01 Table 7: Comparison of Re Treever with Re Treever-Full which naively selects all the leaves of the tree instead of leveraging the learned policy to select a subset of the nodes. B APPENDIX: ADDITIONAL EXPERIMENTS AND ANALYSES B.1 ADDITIONAL ANALYSES B.1.1 IMPORTANCE OF TREE DESIGN Unlike sequential data where sorting according to its index is a natural way to organize the data, it is less clear how to organize other data modalities (e.g., pixels) as leaves in a balanced tree. 
As such, in this experiment (Table 6), we compare the performance of Re Treever with various heuristics for structuring the tree (1) organized randomly (Re Treever-Random), (2) sorting via the first index x0 of the image data (Re Treever-KD-x0), and (3) sorting via the second index x1 of the image data (Re Treever-KD-x1), and (4) sorting the value after a Random Projection (Re Treever-KD-RP)4. Re Treever-KD-RP generates a random matrix w Rdim(x) 1 that is fixed during training and orders the context tokens in the tree according to w T x instead. In Table 6, we see that Re Treever outperforms Perceiver IO significantly when splitting according to either of the axes. In contrast, organizing the tree randomly causes Re Treever to perform worse than Perceiver IO. There is, however, a minor difference between KD-x0 and KD-x1. We hypothesize that this is in part due to image data being multi-dimensional and not all dimensions are equal. B.1.2 RETREEVER-FULL Re Treever leverages RL to learn good node representations for informative retrieval. However, Re Treever is not limited to only using the learned policy for retrieval at inference time. For example, instead of retrieving a subset of the nodes according to a policy, we can (as a proof of concept) choose to retrieve all the leaves of the tree Re Treever-Full (see Table 7). Results. On Celeb A and GP Regression, we see that Re Treever-Full achieves performance close to state-of-the-art. For example, on Celeb A, Re Treever-Full achieves 3.77 0.01, and Transformer + Cross Attention achieves 3.88 0.01. On GP, Re Treever-Full achieves 1.33 0.02, and Transformer + Cross Attention achieves 1.35 0.02. In contrast, Re Treever achieves significantly lower results. The performance gap between Re Treever and Re Treever-Full suggests that the performance of Re Treever can vary depending on the number of nodes (tokens) selected from the tree. As such, an interesting future direction is the design of alternative methods for selecting subsets of nodes. 4Random Projection is popular in dimensionality reduction literature as it preserves partial information regarding the distances between the points. Published as a conference paper at ICLR 2024 Method N = 256 N = 512 N = 1024 % Tokens Accuracy % Tokens Accuracy % Tokens Accuracy Cross Attention 100.0% 100.0 0.0 100.0% 100.0 0.0 100.0% 99.9 0.2 Random 8.3 0.0 8.3 0.0 8.3 0.0 Perceiver IO 6.3% 15.2 0.0 3.5% 13.4 0.2 2.0% 11.6 0.4 Perceiver IO 100.0% 63.6 45.4 100.0% 14.1 2.3 100.0% OOM TCA 6.3% 100.0 0.0 3.5% 100.0 0.0 2.0% 99.6 0.6 Table 8: Copy Task Results with accuracy (higher is better) and % tokens (lower is better) metrics. Perceiver IO Num. Latents (L) Copy Task N = 128 N = 256 N = 512 L = 16 39.83 40.21 15.16 0.03 13.41 0.15 L = 32 58.92 47.44 34.75 39.14 19.41 12.07 L = 64 79.45 41.11 27.10 21.23 14.00 0.72 L = 128 100.00 0.00 63.50 39.37 13.23 0.42 Table 9: Analysis of Perceiver IO s performance relative to the number of latents (L) and the sequence length (N) on the Copy Task B.1.3 PERCEIVER IO S COPY TASK PERFORMANCE In Table 8, we include results evaluating Perceiver IO with increased number of latent tokens on varying lengths of the copy task. Specifically, we tested using 100% tokens to evaluate its performance. Perceiver IO s performance degraded sharply as the difficulty of the task increased. 
Furthermore, due to Perceiver IO s encoder which aims to map the context tokens to a set of latent tokens, we ran out of memory (on a Nvidia P100 GPU (16 GB)) trying to train the model on a sequence length of 1024. To analyze the reason behind the degradation in performance of Perceiver IO, we evaluated Perceiver IO (Table 9) with a varying number of latent tokens L {16, 32, 64, 128} and varying sequence lengths N {128, 256, 512}. Generally, we found that performance improved as the number of latents increased. However, the results were relatively unstable. For some runs, Perceiver IO was able to solve the task completely. Other times, Perceiver IO got stuck in lower accuracy local optimas. For longer sequences N = 512, increasing the number of latents did not make a difference as the model simply got stuck in poor local optimas. We hypothesize that the poor performance is due to the incompatibility between the nature of the Copy Task and Perceiver IO s model. The copy task (1) has high intrinsic dimensionality that scales with the number of tokens and (2) does not require computing higher-order information between the tokens. In contrast, Perceiver IO (1) is designed for tasks with lower intrinsic dimensionality by compressing information via its latent tokens and iterative attention encoder, and (2) computes higher-order information via a transformer in the latent space. Naturally, these factors make learning the task difficult for Perceiver IO. B.1.4 RUNTIME ANALYSES Runtime of Tree Construction and Aggregation. For 256 tokens, the wall-clock time for constructing a tree with our implementation was 0.076 0.022 milliseconds. The wall-clock time for the aggregation is 12.864 2.792 milliseconds. Runtime of Tree Cross Attention-based models vs. vanilla Cross Attention-based models. In table 10, we measured the wall-clock time of the modules used during inference time, i.e., Re Treever s Tree Cross Attention module (RM D T RM D), Transformer+Cross Attention s Cross Attention module (RM D RN D RM D), and Perceiver IO s Cross Attention module (RM D RL D RM D). We evaluated the modules on the Copy Task for both a GPU and CPU. Although Tree Cross Attention is slower, all methods only require milliseconds to perform inference. Both Tree Cross Attention and Cross Attention learn to solve the task perfectly. However, Tree Cross Attention uses Published as a conference paper at ICLR 2024 Model % Tokens Accuracy CPU time GPU Time Cross Attention 100.0% 100.0 0.0 12.05 1.61 Tree Cross Attention 3.5% 100.0 0.0 19.31 9.09 Perceiver IO s CA 3.5% 13.4 0.2 10.98 1.51 Table 10: Comparison of runtime (in milliseconds) for the inference modules of Re Treever, Transformer + Cross Attention, and Perceiver IO, i.e., Tree Cross Attention, Cross Attention, and Perceiver IO s CA respectively. Runtime is reported as the average of 10 runs. Model Tree Height (H) % Tokens CPU Time (in ms) GPU Time (in ms) Cross Attention N/A 100.0 % 12.05 1.61 TCA (bf = 256) 1 100.0 % 16.21 2.05 TCA (bf = 32) 2 15.2 % 14.05 3.99 TCA (bf = 16) 2 12.1 % 13.25 4.13 TCA (bf = 8) 3 7.0 % 15.53 5.73 TCA (bf = 4) 3 3.9 % 16.30 6.45 TCA (bf = 2) 8 3.5 % 21.92 9.83 Table 11: Memory-Runtime trade-off with respect to the tree height (H). only 3.5% of the tokens. As such, Tree Cross Attention tradesoff between the number of tokens and computational time. Perceiver IO, however, fails to solve the task for the same number of tokens. The effect of tree design on Memory-Runtime trade-off. 
Tree Cross Attention sequentially searches down a tree for the relevant tokens. As such, the factor which primarily affects the overall runtime is the height of tree (H) which is dependent on the branching factor bf 2. The relationship between the height of the tree, the branching factor, and the number of tokens is as follows: H = logbf (N) . In our results, we showed the performance by using a binary tree (bf = 2) since this corresponds to a tree design that uses few tokens. By selecting a higher branching factor, the runtime can be decreased at the expense of requiring more tokens (see Table 11). This feature is not available in standard cross attention. As such, TCA enables more control over the memory and computation time trade-off. B.2 ADDITIONAL EXPERIMENTS B.2.1 COPY TASK Method N = 128 % Tokens Accuracy Cross Attention 100.0% 100.0 0.0 Random 8.3 0.0 Perceiver IO 10.9% 17.8 0.0 Perceiver IO 100.0% 100.0 0.0 TCA (Acc) 10.9% 100.0 0.0 TCA (CE) 10.9% 100.0 0.0 Table 12: Copy Task Results with accuracy (higher is better) and % tokens (lower is better) metrics. Results. In Table 12, we include additional results for the Copy Task where N = 128. Similar to previous results, we see that Cross Attention and TCA are able to solve this task perfectly. In contrast, Perceiver IO is not able to solve this task for the same number of tokens. We also see both TCA using accuracy and TCA using negative cross entropy as the reward were able to solve the task perfectly. Published as a conference paper at ICLR 2024 Method % Tokens RBF Matern 5/2 CNP (Garnelo et al., 2018a) 0.26 0.02 0.04 0.02 CANP (Kim et al., 2019) 0.79 0.00 0.62 0.00 NP (Garnelo et al., 2018b) 0.27 0.01 0.07 0.01 ANP (Kim et al., 2019) 0.81 0.00 0.63 0.00 BNP (Lee et al., 2020) 0.38 0.02 0.18 0.02 BANP (Lee et al., 2020) 0.82 0.01 0.66 0.00 TNP-D (Nguyen & Grover, 2022) 1.39 0.00 0.95 0.01 LBANP (Feng et al., 2023a) 1.27 0.02 0.85 0.02 CMANP (Feng et al., 2023b) 1.24 0.02 0.80 0.01 Perceiver IO (L = 4) 8.5% 1.02 0.03 0.56 0.03 Perceiver IO (L = 8) 17.0% 1.13 0.03 0.65 0.03 Perceiver IO (L = 16) 34.0% 1.22 0.05 0.75 0.06 Perceiver IO (L = 32) 68.0% 1.25 0.04 0.78 0.05 Perceiver IO (L = 64) 136.0% 1.30 0.03 0.85 0.02 Perceiver IO (L = 128) 272.0% 1.29 0.04 0.84 0.04 Transformer + Cross Attention 100% 1.35 0.02 0.91 0.02 Perceiver IO 14.9% 1.06 0.05 0.58 0.05 Re Treever 14.9% 1.25 0.02 0.81 0.02 Table 13: 1-D Meta-Regression Experiments with log-likelihood metric (higher is better). Figure 12: Visualizations of GP Regression Published as a conference paper at ICLR 2024 Method % Tokens Celeb A CNP 2.15 0.01 CANP 2.66 0.01 NP 2.48 0.02 ANP 2.90 0.00 BNP 2.76 0.01 BANP 3.09 0.00 TNP-D 3.89 0.01 LBANP (8) 3.50 0.05 LBANP (128) 3.97 0.02 CMANP 3.93 0.05 Perceiver IO (L = 8) 4.1% 3.18 0.03 Perceiver IO (L = 16) 8.1% 3.35 0.02 Perceiver IO (L = 32) 16.2% 3.50 0.02 Perceiver IO (L = 64) 32.5% 3.61 0.03 Perceiver IO (L = 128) 65.0% 3.74 0.02 Transformer + Cross Attention 100% 3.88 0.01 Perceiver IO 4.6% 3.20 0.01 Re Treever 4.6% 3.52 0.01 Table 14: Celeb A Image Completion Experiments. The methods are evaluated according to the loglikelihood (higher is better). B.2.2 GP REGRESSION Results. In Table 13, we show results for all baselines on GP Regression. We see that Transformer + Cross Attention performs comparable to the state-of-the-art, outperforming all Neural Process baselines except for TNP-D by a significant margin. We evaluated Perceiver IO with varying number of latents L. 
We see that Perceiver IO (L = 32) uses 4.6 the number of tokens to achieve performance comparable to Re Treever. Re Treever by itself already outperforms all NP baselines except for LBANP and TNP-D. Visualizations for different kinds of test functions are included in Figure 12. B.2.3 IMAGE COMPLETION (CELEBA) Results. In Table 14, we show results for all baselines on Celeb A. We see that Transformer + Cross Attention achieves results competitive with the state-of-the-art. We evaluated Perceiver IO with varying numbers of latents L. We see that Perceiver IO (L = 32) uses 3.5 the number of tokens to achieve performance comparable to the Re Treever. Visualizations are available in Figure 13. B.2.4 IMAGE COMPLETION (EMNIST) Results. In Table 15, we see that Transformer + Cross Attention achieves results competitive with state-of-the-art, outperforming all baselines except TNP-D. B.2.5 TIME SERIES: HUMAN ACTIVITY Results. In Table 16, we show results on the Human Activity task comparing with several time series baselines. We see that Transformer + Cross Attention outperforms all baselines except for m TAND-Enc. The confidence intervals are, however, very close to overlapping. We would like to note, however, that m TAND-Enc was carefully designed for time series, combining attention, bidirectional RNNs, and reference points. In contrast, Transformer + Cross Attention is a generalpurpose model made by just leveraging simple attention modules. We hypothesize the performance could be further improved by leveraging some features of m TAND-Enc, but this is outside the scope of our work. Notably, Re Treever also outperforms all baselines except for m TAND-Enc. Comparing, Transformer + Cross Attention and Re Treever, we see that Re Treever achieves comparable performance Published as a conference paper at ICLR 2024 Figure 13: Visualizations of Celeb A32 Method % Tokens EMNIST CNP 0.73 0.00 CANP 0.94 0.01 NP 0.79 0.01 ANP 0.98 0.00 BNP 0.88 0.01 BANP 1.01 0.00 TNP-D 1.46 0.01 LBANP (8) 1.34 0.01 LBANP (128) 1.39 0.01 CMANP 1.36 0.01 Transformer + Cross Attention 100% 1.41 0.00 Perceiver IO 4.6% 1.25 0.00 Re Treever 4.6% 1.30 0.01 Table 15: EMNIST Image Completion Experiments. The methods are evaluated according to the log-likelihood (higher is better). Published as a conference paper at ICLR 2024 Model % Tokens Accuracy RNN-Impute (Che et al., 2018) 85.9 0.4 RNN-Decay (Che et al., 2018) 86.0 0.5 RNN GRU-D (Che et al., 2018) 86.2 0.5 IP-Nets (Shukla & Marlin, 2019) 86.9 0.7 Se FT (Horn et al., 2020) 81.5 0.2 ODE-RNN (Rubanova et al., 2019) 88.5 0.8 L-ODE-RNN (Chen et al., 2018) 83.8 0.4 L-ODE-ODE (Rubanova et al., 2019) 87.0 2.8 m TAND-Enc (Shukla & Marlin, 2021) 90.7 0.2 Transformer + Cross Attention 100% 89.1 1.3 Perceiver IO 14% 87.6 0.4 Re Treever 14% 88.9 0.4 Table 16: Human Activity Experiments with Accuracy metric. while using far fewer tokens. Furthermore, we see that Re Treever outperforms Perceiver IO significantly while using the same number of tokens. C APPENDIX: IMPLEMENTATION AND HYPERPARAMETERS DETAILS C.1 IMPLEMENTATION For the experiments on uncertainty estimation, we use the official repositories for TNPs (https://github.com/tung-nd/TNP-pytorch) and LBANPs (https://github.com/Borealis AI/latentbottlenecked-anp). The GP regression and Image Classification (Celeb A and EMNIST) datasets are available in the same links. For the Human Activity dataset, we use the official repository for m TAN (Multi-Time Attention Networks) https://github.com/reml-lab/m TAN. 
The Human Activity dataset is available in the same link. Our Perceiver IO baseline is based on the popular Perceiver (IO) repository (https://github.com/lucidrains/perceiverpytorch/blob/main/perceiver pytorch/perceiver pytorch.py). We report the baseline results listed in Nguyen & Grover (2022) and Shukla & Marlin (2021). When comparing methods such as Perceiver, Re Treever, Transformer + Cross Attention, LBANP, and TNP-D, we use the same number of blocks (i.e., CMABs, iterative attention, and transformer) in the encoder. C.2 HYPERPARAMETERS Following previous works on Neural Processes (LBANPs and TNPs), Re Treever uses 6 layers in the encoder for all experiments. Our aggregator function is a Self Attention (Transformer) module whose output is averaged. Following standard RL practices, at test time, TCA and Re Treever s policy selects the most confident actions: u = argmaxa Cv πθ(a|s) We tuned λRL (RL loss term weight) and λCA (CA loss term weight) between {0.0, 0.1, 1.0, 10.0} on the Copy Task (see analysis in main paper) with N = 256. We also tuned α (entropy bonus weight) between {0.0, 0.01, 0.1, 1.0}. We found that simply setting λRL = λCA = 1.0 and α = 0.01 performed well in practice on the Copy Task where N = 256. As such, for the purpose of consistency, we set λRL = λCA = 1.0 and α = 0.01 was set for all tasks. We used an ADAM optimizer with a standard learning rate of 5e 4. All experiments were run with 3 seeds. Cross Attention and Tree Cross Attention tended to get stuck in a local optimum on the Copy Task, so to prevent this, dropout was set to 0.1 for the Copy Task for all methods. For image data, we selected a single axis to split the data according to the k-d tree. In the case of EMNIST, the data was split according to the first axis. In the case of Celeb A, the data was split according to the second axis. C.3 COMPUTE Our experiments were run using a mix of Nvidia GTX 1080 Ti (12 GB) or Nvidia Tesla P100 (16 GB) GPUs. GP regression experiments took 2.5 hours. EMNIST experiments took 1.75 hours. Celeb A experiments took 15 hours. Human Activity experiments took 0.75 hours.