# Trajectory Prediction Using Equivariant Continuous Convolution

Published as a conference paper at ICLR 2021

Robin Walters* (Northeastern University, r.walters@northeastern.edu) · Jinxi Li* (Northeastern University, li.jinxi1@northeastern.edu) · Rose Yu (University of California, San Diego, roseyu@ucsd.edu)

*Equal contribution.

**Abstract.** Trajectory prediction is a critical part of many AI applications, for example, the safe operation of autonomous vehicles. However, current methods are prone to making inconsistent and physically unrealistic predictions. We leverage insights from fluid dynamics to overcome this limitation by considering internal symmetry in real-world trajectories. We propose a novel model, Equivariant Continuous COnvolution (ECCO), for improved trajectory prediction. ECCO uses rotationally-equivariant continuous convolutions to embed the symmetries of the system. On both vehicle and pedestrian trajectory datasets, ECCO attains competitive accuracy with significantly fewer parameters. It is also more sample efficient, generalizing automatically from few data points in any orientation. Lastly, ECCO improves generalization with equivariance, resulting in more physically consistent predictions. Our method provides a fresh perspective towards increasing trust and transparency in deep learning models. Our code and data can be found at https://github.com/Rose-STL-Lab/ECCO.

## 1 INTRODUCTION

Trajectory prediction is one of the core tasks in AI, arising in settings from the movement of basketball players to fluid particles to car traffic (Sanchez-Gonzalez et al., 2020; Gao et al., 2020; Shah & Romijnders, 2016). A common abstraction underlying these tasks is the movement of many interacting agents, analogous to a many-particle system. Therefore, understanding the states of these particles, their dynamics, and hidden interactions is critical to accurate and robust trajectory forecasting.

*Figure 1: Car trajectories in two scenes. Though the entire scenes are not related by a rotation, the circled areas are. ECCO exploits this symmetry to improve generalization and sample efficiency.*

Even for purely physical systems such as in particle physics, the complex interactions among a large number of particles make this a difficult problem. For vehicle or pedestrian trajectories, this challenge is further compounded by latent factors such as human psychology. Given these difficulties, current approaches require large amounts of training data and many model parameters. State-of-the-art methods in this domain, such as Gao et al. (2020), are based on graph neural networks. They do not exploit the physical properties of the system and often make predictions which are not self-consistent or physically meaningful. Furthermore, they predict a single agent trajectory at a time instead of multiple agents simultaneously.

Our model is built upon a key insight about many-particle systems: they possess intricate internal symmetry. Consider a model which predicts the trajectory of cars on a road. To be successful, such a model must understand the physical behavior of vehicles together with human psychology. It should distinguish left from right turns, and give consistent outputs for intersections rotated into different orientations. As shown in Figure 1, a driver's velocity rotates with the entire scene, whereas vehicle interactions are invariant to such a rotation.
Likewise, psychological factors such as reaction speed or attention may be considered vectors with prescribed transformation properties. Data augmentation is a common practice for dealing with rotational invariance, but it cannot guarantee invariance and requires longer training. Since rotation is a continuous group, augmentation requires sampling from infinitely many possible angles.

In this paper, we propose an equivariant continuous convolutional model, ECCO, for trajectory forecasting. Continuous convolution generalizes discrete convolution and is adapted to data in many-particle systems with complex local interactions. Ummenhofer et al. (2019) designed a model using continuous convolutions for particle-based fluid simulations. Meanwhile, equivariance to group symmetries has proven to be a powerful tool for integrating physical intuition in physical science applications (Wang et al., 2020; Brown & Lunter, 2019; Kanwar et al., 2020). Here, we test the hypothesis that an equivariant model can also capture internal symmetry in non-physical human behavior. Our model utilizes a novel weight sharing scheme, torus kernels, and is rotationally equivariant.

We evaluate our model on two real-world trajectory datasets: the Argoverse autonomous vehicle dataset (Chang et al., 2019) and the TrajNet++ pedestrian trajectory forecasting challenge (Kothari et al., 2020). We demonstrate prediction accuracy on par with or better than baseline models and data augmentation, with fewer parameters, better sample efficiency, and stronger generalization properties. Lastly, we demonstrate theoretically and experimentally that our polar coordinate-indexed filters have lower equivariance discretization error due to being better adapted to the symmetry group. Our main contributions are as follows:

- We propose Equivariant Continuous COnvolution (ECCO), a rotationally equivariant deep neural network that can capture internal symmetry in trajectories.
- We design ECCO using a novel weight sharing scheme based on orbit decomposition and polar coordinate-indexed filters. We implement equivariance for both the standard representation and the regular representation $L^2(SO(2))$.
- On the benchmark Argoverse and TrajNet++ datasets, ECCO demonstrates comparable accuracy while enjoying better generalization, fewer parameters, and better sample complexity.

## 2 RELATED WORK

**Trajectory Forecasting.** For vehicle trajectories, classic models in transportation include the Car Following model (Pipes, 1966) and the Intelligent Driver model (Kesting et al., 2010). Deep learning has also received considerable attention; for example, Liang et al. (2020) and Gao et al. (2020) use graph neural networks to predict vehicle trajectories. Djuric et al. (2018) use rasterizations of the scene with a CNN. See the review paper by Veres & Moussa (2019) for deep learning in transportation. For human trajectory modeling, Alahi et al. (2016) propose Social LSTM to learn human-human interactions. TrajNet (Sadeghian et al., 2018) and TrajNet++ (Kothari et al., 2020) introduce benchmarks for human trajectory forecasting. We refer readers to Rudenko et al. (2020) for a comprehensive survey. Nevertheless, many deep learning models are purely data-driven: they require large amounts of data, have many parameters, and can generate physically inconsistent predictions.
**Continuous Convolution.** Continuous convolutions over point clouds (CtsConv) have been successfully applied to classification and segmentation tasks (Wang et al., 2018; Lei et al., 2019; Xu et al., 2018; Wu et al., 2019; Su et al., 2018; Li et al., 2018; Hermosilla et al., 2018; Atzmon et al., 2018; Hua et al., 2018). More recently, a few works have used continuous convolution for modeling trajectories or flows. For instance, Wang et al. (2018) use CtsConv for inferring flow on LIDAR data. Schenck & Fox (2018) and Ummenhofer et al. (2019) model fluid simulation using CtsConv. Closely related to our work is Ummenhofer et al. (2019), who design a continuous convolution network for particle-based fluid simulations. However, they use a ball-to-sphere mapping which is not well adapted to rotational equivariance, and they encode only 3 frames of input. Graph neural networks (GNNs) are a related strategy which have been used for modeling particle system dynamics (Sanchez-Gonzalez et al., 2020). GNNs are also permutation invariant, but they do not natively encode relative positions and local interactions as a CtsConv-based network does.

**Equivariant and Invariant Deep Learning.** Developing neural nets that preserve symmetries has been a fundamental task in image recognition (Cohen et al., 2019b; Weiler & Cesa, 2019; Cohen & Welling, 2016a; Chidester et al., 2018; Lenc & Vedaldi, 2015; Kondor & Trivedi, 2018; Bao & Song, 2019; Worrall et al., 2017; Cohen & Welling, 2016b; Weiler et al., 2018; Dieleman et al., 2016; Maron et al., 2020). Equivariant networks have also been used to predict dynamics: for example, Wang et al. (2020) predict fluid flow using Galilean equivariance, but only for gridded data. Fuchs et al. (2020) use SE(3)-equivariant transformers to predict trajectories for a small number of particles as a regression task. As in this paper, both Bekkers (2020) and Finzi et al. (2020) address the challenge of parameterizing a kernel over continuous Lie groups. Finzi et al. (2020) apply their method to trajectory prediction on point clouds using a small number of points following strict physical laws. Worrall et al. (2017) also parameterize convolutional kernels using polar coordinates, but map them onto a rectilinear grid for application to image data. Weng et al. (2018) address rotational equivariance by inferring a global canonicalization of the input. Similar to our work, Esteves et al. (2018) use functions evenly sampled on the circle; however, their features live at a single point, whereas we assign feature vectors to each point in a point cloud. Thomas et al. (2018) introduce Tensor Field Networks, which are SO(3)-equivariant continuous convolutions. Unlike our work, both Worrall et al. (2017) and Thomas et al. (2018) define their kernels using harmonic functions. Our weight sharing method using orbits and stabilizers is simpler, as it does not require harmonic functions or Clebsch-Gordan coefficients. Unlike previous work, we implement a regular representation for the continuous rotation group SO(2) which is compatible with pointwise nonlinearities and enjoys an empirical advantage over irreducible representations.

## 3 BACKGROUND

We first review the necessary background on continuous convolution and rotational equivariance.

### 3.1 CONTINUOUS CONVOLUTION

Continuous convolution (CtsConv) generalizes discrete convolution to point clouds. It provides an efficient and spatially aware way to model the interactions of nearby particles.
Let $f^{(i)} \in \mathbb{R}^{c_{in}}$ denote the feature vector of particle $i$. Thus $f$ is a vector field which assigns to each point $x^{(i)}$ a vector in $\mathbb{R}^{c_{in}}$. The kernel of the convolution $K \colon \mathbb{R}^2 \to \mathbb{R}^{c_{out} \times c_{in}}$ is a matrix field: for each point $x \in \mathbb{R}^2$, $K(x)$ is a $c_{out} \times c_{in}$ matrix. Let $a$ be a radial local attention map with $a(r) = 0$ for $r > R$. The output feature vector $g^{(i)}$ of particle $i$ from the continuous convolution is given by

$$g^{(i)} = \mathrm{CtsConv}_{K,R}(x, f; x^{(i)}) = \sum_j a\big(\lVert x^{(j)} - x^{(i)} \rVert\big)\, K\big(x^{(j)} - x^{(i)}\big)\, f^{(j)}. \tag{1}$$

CtsConv is naturally equivariant to permutations of labels and is translation invariant. Equation 1 is closely related to graph neural networks (GNNs) (Kipf & Welling, 2017; Battaglia et al., 2018), which are also permutation invariant. Here the graph is dynamic and implicit, with nodes $x^{(i)}$ and edges $e_{ij}$ whenever $\lVert x^{(i)} - x^{(j)} \rVert < R$. Unlike a GNN, which applies the same weights to all neighbours, the kernel $K$ depends on the relative position vector $x^{(i)} - x^{(j)}$.

### 3.2 ROTATIONAL EQUIVARIANCE

Continuous convolution is not naturally rotationally equivariant. Fortunately, we can translate the techniques for rotational equivariance in CNNs to continuous convolutions. We use the language of Lie groups and their representations; for more background, see Hall (2015) and Knapp (2013).

More precisely, we denote the symmetry group of 2D rotations by $SO(2) = \{\mathrm{Rot}_\theta : 0 \le \theta < 2\pi\}$. As a Lie group, it has a group structure, $\mathrm{Rot}_{\theta_1} \mathrm{Rot}_{\theta_2} = \mathrm{Rot}_{(\theta_1 + \theta_2) \bmod 2\pi}$, which is continuous with respect to the topological structure. As a manifold, $SO(2)$ is homeomorphic to the circle $S^1 = \{x \in \mathbb{R}^2 : \lVert x \rVert = 1\}$. The group $SO(2)$ can act on a vector space $\mathbb{R}^c$ by specifying a representation map $\rho \colon SO(2) \to GL(\mathbb{R}^c)$ which assigns to each element of $SO(2)$ an element of the set of invertible $c \times c$ matrices $GL(\mathbb{R}^c)$. The map $\rho$ must be a homomorphism: $\rho(\mathrm{Rot}_{\theta_1})\rho(\mathrm{Rot}_{\theta_2}) = \rho(\mathrm{Rot}_{\theta_1} \mathrm{Rot}_{\theta_2})$. For example, the standard representation $\rho_1$ on $\mathbb{R}^2$ is by $2 \times 2$ rotation matrices. The regular representation $\rho_{reg}$ on $L^2(SO(2)) = \{\phi \colon SO(2) \to \mathbb{R} : |\phi|^2 \text{ is integrable}\}$ is $\rho_{reg}(\mathrm{Rot}_\varphi)(\phi) = \phi \circ \mathrm{Rot}_{-\varphi}$. Given input $f$ with representation $\rho_{in}$ and output with representation $\rho_{out}$, a map $F$ is $SO(2)$-equivariant if $F(\rho_{in}(\mathrm{Rot}_\theta) f) = \rho_{out}(\mathrm{Rot}_\theta) F(f)$.

Discrete CNNs are equivariant to a group $G$ if the input, output, and hidden layers carry a $G$-action and the linear layers and activation functions are all equivariant (Kondor & Trivedi, 2018). One method for constructing equivariant discrete CNNs is steerable CNNs (Cohen & Welling, 2016b). Cohen et al. (2019a) derive a general constraint for when a convolutional kernel $K \colon \mathbb{R}^b \to \mathbb{R}^{c_{out} \times c_{in}}$ is $G$-equivariant. Assume $G$ acts on $\mathbb{R}^b$ and that $\mathbb{R}^{c_{out}}$ and $\mathbb{R}^{c_{in}}$ are $G$-representations $\rho_{out}$ and $\rho_{in}$ respectively; then $K$ is $G$-equivariant if for all $g \in G$ and $x \in \mathbb{R}^b$,

$$K(gx) = \rho_{out}(g)\, K(x)\, \rho_{in}(g^{-1}). \tag{2}$$

For the group $SO(2)$, Weiler & Cesa (2019) solve this constraint using circular harmonic functions to give a basis of discrete equivariant kernels. In contrast, our method is much simpler and uses orbits and stabilizers to create continuous convolution kernels.
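Before turning to ECCO, a minimal NumPy sketch may help make Equation 1 concrete. This is our own illustration, not the released implementation; `cts_conv`, `kernel_fn`, and `attention_fn` are hypothetical names standing in for a learned kernel field and the radial window $a$:

```python
import numpy as np

def cts_conv(points, feats, query, kernel_fn, attention_fn, radius, c_out):
    """Continuous convolution (Equation 1) evaluated at one query particle.

    points       : (n, 2) array of particle positions x^(j)
    feats        : (n, c_in) array of feature vectors f^(j)
    query        : (2,) position x^(i) of the output particle
    kernel_fn    : maps a relative position (2,) to a (c_out, c_in) matrix K(x)
    attention_fn : radial window a(r) with a(r) = 0 for r > radius
    """
    out = np.zeros(c_out)
    for x_j, f_j in zip(points, feats):
        rel = x_j - query                  # relative position x^(j) - x^(i)
        r = np.linalg.norm(rel)
        if r <= radius:                    # only nearby particles contribute
            out += attention_fn(r) * (kernel_fn(rel) @ f_j)
    return out
```

For example, `attention_fn = lambda r: max(0.0, 1.0 - r / radius)` gives a linear falloff; Section 4 constrains the kernel so that this map also becomes rotation-equivariant.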
## 4 ECCO: TRAJECTORY PREDICTION USING ROTATIONALLY EQUIVARIANT CONTINUOUS CONVOLUTION

In trajectory prediction, given historical position and velocity data of $n$ particles over $t_{in}$ timesteps, we want to predict their positions over the next $t_{out}$ timesteps. Denote the ground truth dynamics as $\xi$, which maps $\xi(x_{t-t_{in}:t}, v_{t-t_{in}:t}) = x_{t:t+t_{out}}$.

Motivated by the observation in Figure 1, we wish to learn a model $f$ that approximates the underlying dynamics while preserving the internal symmetry in the data, specifically rotational equivariance. We introduce ECCO, a model for trajectory prediction based on rotationally equivariant continuous convolution. We implement rotationally equivariant continuous convolutions using a weight sharing scheme based on orbit decomposition. We also describe equivariant per-particle linear layers, which are a special case of continuous convolution with radius $R = 0$, analogous to 1×1 convolutions in CNNs. Such layers are useful for passing information between layers from each particle to itself.

### 4.1 ECCO MODEL OVERVIEW

*Figure 2: Overview of the model architecture. Past velocities are aggregated by an encoder Enc. Together with map information, this is then encoded by 3 CtsConvs into $\rho_{reg}$ features. Then $l + 1$ CtsConv layers are used to predict $\Delta x$. The predicted position is $\hat{x}_{t+1} = \tilde{x} + \Delta x$, where $\tilde{x}$ is numerically extrapolated using velocity and acceleration. Since $\Delta x$ is translation invariant, $\hat{x}$ is equivariant.*

The high-level architecture of ECCO is illustrated in Figure 2. It is important to remember that the input, output, and hidden layers are all vector fields over the particles. Oftentimes, environmental information is also available in the form of road lane markers. Denote marker positions by $x_{map}$ and direction vectors by $v_{map}$. This data is thus also a particle field, but a static one.

To design an equivariant network, one must choose the group representation. This choice plays an important role in shaping the learned hidden states. We focus on two representations of $SO(2)$: $\rho_1$ and $\rho_{reg}$. The representation $\rho_1$ is that of our input features, and $\rho_{reg}$ is used for the hidden layers. For $\rho_1$, we constrain the kernel in Equation 1. For $\rho_{reg}$, we further introduce a new operator: convolution with torus kernels.

In order to make continuous convolution rotationally equivariant, we translate the general condition for discrete CNNs developed in Weiler & Cesa (2019) to continuous convolution. We define the convolution kernel $K$ in polar coordinates, $K(\theta, r)$. Let $\mathbb{R}^{c_{out}}$ and $\mathbb{R}^{c_{in}}$ be $SO(2)$-representations $\rho_{out}$ and $\rho_{in}$ respectively; then the equivariance condition requires the kernel to satisfy

$$K(\theta + \varphi, r) = \rho_{out}(\mathrm{Rot}_\theta)\, K(\varphi, r)\, \rho_{in}(\mathrm{Rot}_\theta^{-1}). \tag{3}$$

Imposing such a constraint for continuous convolution requires us to develop an efficient weight sharing scheme for the kernels which solves Equation 3.

### 4.2 WEIGHT SHARING BY ORBITS AND STABILIZERS

Given a point $x \in \mathbb{R}^2$ and a group $G$, the set $O_x = \{gx : g \in G\}$ is the orbit of the point $x$. The set of orbits gives a partition of $\mathbb{R}^2$ into the origin and circles of radius $r > 0$. The set of group elements $G_x = \{g : gx = x\}$ fixing $x$ is called the stabilizer of the point $x$. We use the orbits and stabilizers to constrain the weights of $K$. Simply put, we share weights across orbits and constrain weights according to stabilizers, as shown in Figure 3 (left).

The ray $D = \{(0, r) : r \ge 0\}$ is a fundamental domain for the action of $G = SO(2)$ on the base space $\mathbb{R}^2$; that is, $D$ contains exactly one point from each orbit. We first define $K(0, r)$ for each $(0, r) \in D$. Then we compute $K(\theta, r)$ from $K(0, r)$ by setting $\varphi = 0$ in Equation 3:

$$K(\theta, r) = \rho_{out}(\mathrm{Rot}_\theta)\, K(0, r)\, \rho_{in}(\mathrm{Rot}_\theta^{-1}). \tag{4}$$

For $r > 0$, the group acts freely on $(0, r)$, i.e. the stabilizer contains only the identity. This means that Equation 3 imposes no additional constraints on $K(0, r)$. Thus $K(0, r) \in \mathbb{R}^{c_{out} \times c_{in}}$ is a matrix of freely learnable weights.
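As an illustration, the following is a minimal NumPy sketch (ours, not the paper's released code) of this weight sharing for the special case $\rho_{in} = \rho_{out} = \rho_1$, where Equation 4 reduces to conjugation by rotation matrices; the assertion at the end numerically verifies the kernel constraint of Equation 3:

```python
import numpy as np

def rot(theta):
    """Standard representation rho_1(Rot_theta): a 2x2 rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def shared_kernel(K0, theta):
    """Equation 4 for rho_in = rho_out = rho_1: recover K(theta, r) on the
    whole orbit from the freely learnable matrix K(0, r) on the ray D.
    Note rot(-theta) == rot(theta).T for rotation matrices."""
    return rot(theta) @ K0 @ rot(theta).T

# Numerical check of the kernel constraint (Equation 3):
# K(theta + phi, r) == rho_out(Rot_theta) K(phi, r) rho_in(Rot_theta^{-1})
K0 = np.random.randn(2, 2)        # learnable weights at (0, r) for some r > 0
theta, phi = 0.7, 1.3
lhs = shared_kernel(K0, theta + phi)
rhs = rot(theta) @ shared_kernel(K0, phi) @ rot(theta).T
assert np.allclose(lhs, rhs)
```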
For $r = 0$, however, the orbit $O_{(0,0)}$ consists of only one point. The stabilizer of $(0, 0)$ is all of $G$, which requires

$$K(0, 0) = \rho_{out}(\mathrm{Rot}_\theta)\, K(0, 0)\, \rho_{in}(\mathrm{Rot}_\theta^{-1}) \quad \text{for all } \theta. \tag{5}$$

Thus $K(0, 0)$ is an equivariant per-particle linear map $\rho_{in} \to \rho_{out}$.

*Table 1: Equivariant linear maps for $K(0, 0)$. Trainable weights are $c \in \mathbb{R}$ and $\kappa \colon S^1 \to \mathbb{R}$, where $S^1$ is the manifold underlying $SO(2)$.*

| $\rho_{in}$ | $\rho_{out} = \rho_1$ | $\rho_{out} = \rho_{reg}$ |
|---|---|---|
| $\rho_1$ | $(a, b) \mapsto (ca, cb)$ | $(a, b) \mapsto ca\cos(\theta) + cb\sin(\theta)$ |
| $\rho_{reg}$ | $f \mapsto c\big(\int_{S^1} f(\theta)\cos(\theta)\,d\theta,\ \int_{S^1} f(\theta)\sin(\theta)\,d\theta\big)$ | $f \mapsto \int_{S^1} \kappa(\theta - \varphi) f(\varphi)\,d\varphi$ |

We can analytically solve Equation 5 for $K(0, 0)$ using representation theory. Table 1 shows the unique solutions for the different combinations of $\rho_1$ and $\rho_{reg}$. For details see subsection A.3.

Note that 2D and 3D rotation-equivariant continuous convolutions are implemented in Worrall et al. (2017) and Thomas et al. (2018), respectively. They both use harmonic functions, which require expensive evaluation of analytic functions at each point. Instead, we provide a simpler solution: we require only knowledge of the orbits, stabilizers, and input/output representations. Additionally, we bypass the Clebsch-Gordan decomposition used in Thomas et al. (2018) by mapping directly between the representations in our network. Next, we describe an efficient implementation of equivariant continuous convolution.

### 4.3 POLAR COORDINATE KERNELS

Rotational equivariance informs our kernel discretization and implementation. We store the kernel $K$ of the continuous convolution as a 4-dimensional tensor by discretizing the domain. Specifically, we discretize $\mathbb{R}^2$ using polar coordinates with $k_\theta$ angular slices and $k_r$ radial steps. We then evaluate $K$ at any $(\theta, r)$ using bilinear interpolation from the four closest polar grid points. This method accelerates computation since we do not need to use Equation 4 to repeatedly compute $K(\theta, r)$ from $K(0, r)$. The special case of $K(0, 0)$ results in a polar grid with a bullseye at the center (see Figure 3, left).

*Figure 3: Left: A torus kernel field $K$ from a $\rho_{reg}$-field to a $\rho_{reg}$-field. The kernel is itself a field: at each point $x$ in space, the kernel $K(x)$ yields a different matrix. We denote the $(\varphi_2, \varphi_1)$ entry of the matrix at $x = (\theta, r)$ by $K(\theta, r)(\varphi_2, \varphi_1)$. The matrices along the red sector are freely trainable. The matrices at all white sectors are determined by those in the red sector according to the circular shifting rule illustrated above. The matrix at the red bullseye is trainable but constrained to be circulant, i.e. preserved by the circular shifting rule. Right: The torus kernel acts on features which are functions on the circle. By cutting open the torus and the features along the red and orange lines, we can identify the operation at each point with matrix multiplication.*

We discretize angles finely and radii more coarsely. This choice is inspired by the real-world observation that drivers tend to be more sensitive to the angle of an incoming car than to its exact distance. Our equivariant kernels are computationally efficient and have very few parameters. Moreover, as we discuss later in Section 4.5, despite discretization, the use of polar coordinates allows for very low equivariance error.
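To illustrate the lookup just described, here is a minimal NumPy sketch of bilinear interpolation on a polar kernel grid; the function name and the exact grid alignment are our assumptions, and the key detail is that the angular axis wraps around the $0/2\pi$ seam:

```python
import numpy as np

def lookup_kernel(W, theta, r, r_max):
    """Bilinearly interpolate a polar-grid kernel tensor at (theta, r).

    W     : (k_theta, k_r, c_out, c_in) kernel values on the polar grid,
            assuming W[i, j] = K(2*pi*i/k_theta, j*r_max/(k_r - 1))
    theta : query angle in [0, 2*pi); r : query radius in [0, r_max]
    """
    k_theta, k_r = W.shape[0], W.shape[1]
    t = (theta / (2.0 * np.pi)) * k_theta      # fractional angular index
    s = (r / r_max) * (k_r - 1)                # fractional radial index
    t0 = int(np.floor(t)) % k_theta
    t1 = (t0 + 1) % k_theta                    # the angular axis wraps around
    s0 = min(int(np.floor(s)), k_r - 2)        # clamp the radial axis
    s1 = s0 + 1
    wt, ws = t - np.floor(t), s - s0
    return ((1 - wt) * (1 - ws) * W[t0, s0] + wt * (1 - ws) * W[t1, s0] +
            (1 - wt) * ws * W[t0, s1] + wt * ws * W[t1, s1])
```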
### 4.4 HIDDEN LAYERS AS REGULAR REPRESENTATIONS

The regular representation $\rho_{reg}$ has shown better performance than $\rho_1$ for finite groups (Cohen et al., 2019a; Weiler & Cesa, 2019). But the naive $\rho_{reg} = \{\phi \colon G \to \mathbb{R}\}$ for an infinite group $G$ is too large to work with. We choose the space of square-integrable functions $L^2(G)$. It contains all irreducible representations of $G$ and is compatible with pointwise non-linearities.

**Discretization.** However, $L^2(SO(2))$ is still infinite-dimensional. We resolve this by discretizing the manifold $S^1$ underlying $SO(2)$ into $k_{reg}$ even intervals. We represent a function $f \in L^2(SO(2))$ by its vector of sampled values $[f(\mathrm{Rot}_{2\pi i / k_{reg}})]_{0 \le i < k_{reg}}$.
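A minimal sketch of this discretization, under our own indexing convention: a $\rho_{reg}$ feature is a length-$k_{reg}$ vector of samples on the circle, and rotation by a grid angle $2\pi i / k_{reg}$ acts on it by a circular shift:

```python
import numpy as np

k_reg = 16                              # number of samples on the circle S^1
f = np.random.randn(k_reg)              # f[i] samples f(Rot_{2*pi*i/k_reg})

def rho_reg(f, i):
    """Discretized regular representation: rotation by the grid angle
    2*pi*i/k_reg acts on a sampled function by a circular index shift,
    matching rho_reg(Rot_phi)(f) = f o Rot_{-phi}."""
    return np.roll(f, i)

# The action composes like the group: rotating by i and then by j is i + j
i, j = 3, 7
assert np.allclose(rho_reg(rho_reg(f, i), j), rho_reg(f, i + j))
```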