Published as a conference paper at ICLR 2021

SPATIO-TEMPORAL GRAPH SCATTERING TRANSFORM

Chao Pan, University of Illinois at Urbana-Champaign, Champaign, IL, USA. chaopan2@illinois.edu
Siheng Chen, Shanghai Jiao Tong University, Shanghai, China. sihengc@sjtu.edu.cn
Antonio Ortega, University of Southern California, Los Angeles, CA, USA. antonio.ortega@ee.usc.edu

ABSTRACT

Although spatio-temporal graph neural networks have achieved great empirical success in handling multiple correlated time series, they may be impractical in some real-world scenarios due to a lack of sufficient high-quality training data. Furthermore, spatio-temporal graph neural networks lack theoretical interpretation. To address these issues, we put forth a novel mathematically designed framework to analyze spatio-temporal data. Our proposed spatio-temporal graph scattering transform (ST-GST) extends traditional scattering transforms to the spatio-temporal domain. It performs iterative applications of spatio-temporal graph wavelets and nonlinear activation functions, which can be viewed as a forward pass of spatio-temporal graph convolutional networks without training. Since all the filter coefficients in ST-GST are mathematically designed, it is promising for real-world scenarios with limited training data, and it also allows for a theoretical analysis, which shows that the proposed ST-GST is stable to small perturbations of input signals and structures. Finally, our experiments show that: i) ST-GST outperforms spatio-temporal graph convolutional networks by an increase of 35% in accuracy on the MSR Action3D dataset; ii) it is better and computationally more efficient to design the transform based on separable spatio-temporal graphs than on joint ones; and iii) the nonlinearity in ST-GST is critical to empirical performance.

1 INTRODUCTION

Processing and learning from spatio-temporal data have received increasing attention recently.
Examples include: i) skeleton-based human action recognition based on a sequence of human poses (Liu et al. (2019)), which is critical to human behavior understanding (Borges et al. (2013)); and ii) multi-agent trajectory prediction (Hu et al. (2020)), which is critical to robotics and autonomous driving (Shalev-Shwartz et al. (2016)). A common pattern across these applications is that data evolves in both the spatial and temporal domains. This paper aims to analyze this type of data by developing novel spatio-temporal graph-based data modeling and operations.

Spatio-temporal graph-based data modeling. Graphs are often used to model data where irregularly spaced samples are observed over time. Good spatio-temporal graphs can provide informative priors that reflect the internal relationships within data. For example, in skeleton-based human action recognition, we can model a sequence of 3D joint locations as data supported on skeleton graphs across time, which reflects both the human physical constraints and temporal consistency (Yan et al. (2018)). Recent studies on modeling spatio-temporal graphs have followed either joint or separable processing frameworks. Joint processing is based on constructing a single spatio-temporal graph and processing (e.g., filtering) via operations on this graph (Kao et al. (2019); Liu et al. (2020)). In contrast, a separable processing approach works separately, and possibly with different operators, across the space and time dimensions. In this case, independent graphs are used for space and time (Yan et al. (2018); Cheng et al. (2020)). However, no previous work thoroughly analyzes and compares these two constructions.

(Footnote: This work was mainly done while Chao Pan and Siheng Chen were working at Mitsubishi Electric Research Laboratories (MERL).)
In this work, we mathematically study these two types of graphs and justify the benefits of separable processing from both theoretical and empirical aspects.

Spatio-temporal graph-based operations. Graph operations can be performed once the graph structure is given. Some commonly used graph operations include the graph Fourier transform (Shuman et al. (2013)) and graph wavelets (Hammond et al. (2011)). It is possible to extend those operations to the spatio-temporal graph domain. For example, Grassi et al. (2017) developed the short time-vertex Fourier transform and the spectrum-based time-vertex wavelet transform. However, those mathematically designed, linear operations show some limitations in terms of empirical performance. In comparison, many recent deep neural networks adopt trainable graph convolution operations to analyze spatio-temporal data (Yan et al. (2018); Liu et al. (2020)). However, most networks are designed through trial and error. It is thus hard to explain the rationale behind their empirical success and further improve the designs (Monga et al. (2019)).

In this work, to fill the gap between mathematically designed linear transforms and trainable spatio-temporal graph neural networks, we propose a novel spatio-temporal graph scattering transform (ST-GST), which is a mathematically designed, nonlinear operation. Specifically, to characterize the spatial and temporal dependencies, we present two types of graphs corresponding to joint and separable designs. We then construct spatio-temporal graph wavelets based on each of these types of graphs. We next propose the framework of ST-GST, which adopts spatio-temporal graph wavelets followed by a nonlinear activation function as a single scattering layer. All the filter coefficients in ST-GST are mathematically designed beforehand and no training is required.
We further show that i) a design based on separable spatio-temporal graphs is more flexible and computationally efficient than a joint design; and ii) ST-GST is stable to small perturbations of both input spatio-temporal graph signals and structures. Finally, our experiments on skeleton-based human action recognition show that the proposed ST-GST outperforms spatio-temporal graph convolutional networks by 35% in accuracy on the MSR Action3D dataset. We summarize the main contributions of this work as follows:

- We propose wavelets for both separable and joint spatio-temporal graphs. We show that it is more flexible and computationally efficient to design wavelets based on separable spatio-temporal graphs;
- We propose a novel spatio-temporal graph scattering transform (ST-GST), which is a non-trainable counterpart of spatio-temporal graph convolutional networks and a nonlinear version of spatio-temporal graph wavelets. We also theoretically show that ST-GST is robust and stable in the presence of small perturbations of both input spatio-temporal graph signals and structures;
- For skeleton-based human action recognition, our experiments show that: i) ST-GST can achieve similar or better performance than spatio-temporal graph convolutional networks and other non-deep-learning approaches on small-scale datasets; ii) separable spatio-temporal scattering works significantly better than joint spatio-temporal scattering; and iii) ST-GST significantly outperforms spatio-temporal graph wavelets because of the nonlinear activation function.

2 RELATED WORK

Scattering transforms. Convolutional neural networks (CNNs) use nonlinearities coupled with trained filter coefficients and are well known to be hard to analyze theoretically (Anthony & Bartlett (2009)). As an alternative, Mallat (2012); Bruna & Mallat (2013) propose scattering transforms, which are non-trainable versions of CNNs.
Under admissible conditions, the resulting transform enjoys both great performance in image classification and appealing theoretical properties. These ideas have been extended to the graph domain (Gama et al. (2019a); Zou & Lerman (2020); Gao et al. (2019); Ioannidis et al. (2020)). Specifically, the graph scattering transform (GST) proposed in Gama et al. (2019a) iteratively applies predefined graph filter banks and an element-wise nonlinear activation function. In this work, we extend the classical scattering transform to the spatio-temporal domain and provide a new mathematically designed transform to handle spatio-temporal data. The key difference between GST and our proposed ST-GST lies in the graph filter bank design, where ST-GST needs to handle both the spatial and temporal domains.

Spatio-temporal neural networks. Deep neural networks have been adapted to operate on the spatio-temporal domain. For example, Liu et al. (2019) uses LSTM to process time-series information, while ST-GCN (Yan et al. (2018)) combines a graph convolution layer and a temporal convolution layer as a unit computational block in the network architecture. However, those networks all require a huge amount of high-quality labeled data, and training them is computationally expensive, which may make them impractical for many real-world scenarios. Furthermore, many architectures are designed through trial and error, making it hard to justify the design choices and further improve them. In this work, the proposed ST-GST is a nonlinear transform with a forward procedure similar to that of ST-GCN. However, ST-GST does not require any training, which is useful in many applications where only limited training data is available. Furthermore, since all filter coefficients in ST-GST are predefined, it allows us to perform theoretical analysis, and the related conclusions potentially shed some light on the design of spatio-temporal networks.
Skeleton-based human action recognition. Conventional skeleton-based action recognition models learn semantics based on hand-crafted features (Wang et al. (2012)). To handle time-series information, some recurrent-neural-network-based models have been proposed to capture the temporal dependencies between consecutive frames (Kim & Reiter (2017)). Recently, graph-based approaches have gained in popularity while achieving excellent performance (Yan et al., 2018; Li et al., 2019). In this work, our experiments focus on this task and show that ST-GST outperforms state-of-the-art spatio-temporal graph neural networks, like MS-G3D (Liu et al., 2020), on small-scale datasets.

3 SPATIO-TEMPORAL GRAPH SCATTERING TRANSFORM

In this section, we first define spatio-temporal graph structures and signals. We next design our spatio-temporal graph wavelets. Finally, we present ST-GST.

Figure 1: Visualization of a spatial graph, a temporal graph and three commonly used product graphs.

3.1 SPATIO-TEMPORAL GRAPH STRUCTURES AND SIGNALS

Spatio-temporal data can be represented as a matrix $X \in \mathbb{R}^{N \times T}$, where $N$ is the number of spatial positions and $T$ is the number of time stamps. In this matrix, each row is a time series for a spatial node, and each column is a spatial signal at a certain time stamp. Note that the index of spatial information can be arbitrary: we will associate each spatial location with a vertex on the spatial graph, and the edges will provide information about the relative position of the nodes. We can reshape the matrix to form a vector $x$ of length $NT$, where the element $x_{(s,t)} := x_{(s-1)T+t}$ is the feature value corresponding to the $s$-th vertex at time $t$. To construct a spatio-temporal graph, we create connections based on physical constraints. For example, for skeleton-based action recognition, the spatial graph is the human skeleton graph, reflecting bone connections (see Fig. 1(a)), and the temporal graph is a line graph connecting consecutive time stamps (see Fig. 1(b)).
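As a minimal concrete sketch of this setup, the two graphs and the row-major vectorization $x_{(s,t)} = x_{(s-1)T+t}$ can be written down directly; the sizes and the 3-node path standing in for a skeleton fragment are illustrative assumptions:

```python
import numpy as np

# Toy sizes: N spatial nodes (a 3-node path as a stand-in for a skeleton
# fragment) and T time stamps; both are hypothetical illustration values.
N, T = 3, 4

# Spatial graph: path over joints; temporal graph: line graph over frames.
A_s = np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)
A_t = np.diag(np.ones(T - 1), 1) + np.diag(np.ones(T - 1), -1)

# Spatio-temporal signal: row s is the time series of spatial node s.
X = np.arange(N * T, dtype=float).reshape(N, T)

# Row-major flattening realizes x_{(s,t)} = x_{(s-1)T + t} (1-based s, t).
x = X.reshape(-1)
s, t = 2, 3                      # e.g., the 2nd joint at the 3rd time stamp
assert x[(s - 1) * T + (t - 1)] == X[s - 1, t - 1]
```

The point of the check is that NumPy's default (row-major) flattening matches the paper's vector indexing, so no explicit bookkeeping is needed when moving between $X$ and $x$.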
As a starting point, we choose a spatial graph $G_s = (V_s, E_s, A_s)$ with $|V_s| = N$, reflecting the graph structure of each column of $X$, and a temporal graph $G_t = (V_t, E_t, A_t)$ with $|V_t| = T$, reflecting the graph structure of each row of $X$. The separable spatio-temporal design is achieved by processing the columns and rows of $X$ separately based on their respective graphs. As an alternative, a product graph, denoted as $G = G_s \diamond G_t = (V, E, A)$, can be constructed to unify the relations in both the spatial and temporal domains, allowing us to process data jointly across space and time. The product graph $G_s \diamond G_t$ has $|V| = NT$ nodes and an appropriately defined $NT \times NT$ adjacency matrix $A$. The operation $\diamond$ interweaves two graphs to form a unifying graph structure. The edge weight $A_{(s_1,t_1),(s_2,t_2)} := A_{(s_1-1)T+t_1,\,(s_2-1)T+t_2}$ characterizes the relation, such as similarity or dependency, between the $s_1$-th spatial node at the $t_1$-th time stamp and the $s_2$-th spatial node at the $t_2$-th time stamp. There are three commonly used product graphs (Sandryhaila & Moura, 2014): i) Kronecker product: $G = G_s \otimes G_t$ with graph adjacency matrix $A = A_s \otimes A_t$, where $\otimes$ represents the Kronecker product of matrices (see Fig. 1(c)); ii) Cartesian product: $G = G_s \times G_t$ with $A = A_s \otimes I_T + I_N \otimes A_t$ (see Fig. 1(d)); and iii) strong product: $G = G_s \boxtimes G_t$ with $A = A_s \otimes A_t + A_s \otimes I_T + I_N \otimes A_t$, which can be viewed as a combination of the Kronecker and Cartesian products (see Fig. 1(e)). The joint spatio-temporal design is achieved based on a product graph. In this paper, we consider designs based on both separable graphs and product graphs.

3.2 SPATIO-TEMPORAL GRAPH FILTERING

We now show two graph filter designs, separable and joint filtering, based on the corresponding spatio-temporal graphs we just described. For each design, we first define the spatio-temporal graph shift, which is the most elementary graph filter and defines how information should propagate in a spatio-temporal graph.
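For reference, the three product adjacency matrices of Section 3.1, any of which can serve as a joint spatio-temporal shift, can be assembled in a few lines; the toy path graphs below mirror Fig. 1(a)-(b) and are illustrative only:

```python
import numpy as np

# Toy spatial (N=3 path) and temporal (T=4 line) graphs, as in Fig. 1(a)-(b).
N, T = 3, 4
A_s = np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)
A_t = np.diag(np.ones(T - 1), 1) + np.diag(np.ones(T - 1), -1)

A_kron = np.kron(A_s, A_t)                                   # Kronecker, Fig. 1(c)
A_cart = np.kron(A_s, np.eye(T)) + np.kron(np.eye(N), A_t)   # Cartesian, Fig. 1(d)
A_strong = A_kron + A_cart                                    # strong, Fig. 1(e)

# All three live on the same NT-node vertex set; only the edge sets differ.
assert A_strong.shape == (N * T, N * T)
assert np.allclose(A_strong, A_strong.T)
```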
We then propose spatio-temporal graph filtering in both the graph vertex and graph spectral domains.

Separable graph filters. Given the spatial graph $G_s = (V_s, E_s, A_s)$ and the temporal graph $G_t = (V_t, E_t, A_t)$, let the spatial graph shift be $S_s = A_s$ and the temporal graph shift be $S_t = A_t$.¹ For simplicity, we focus on symmetric graph shifts. For a spatio-temporal graph signal, spatial and temporal graph filtering work as

$$H(S_s)\,X = \sum_{p=0}^{P-1} h_p S_s^p X \quad \text{and} \quad X\,G^\top(S_t) = X \Big( \sum_{q=0}^{Q-1} g_q S_t^q \Big)^{\!\top},$$

where $h_p$ and $g_q$ are spatial and temporal filter coefficients, respectively. In each modality, the graph filter is a polynomial of the graph shift. The polynomial orders $P$ and $Q$ control the lengths of the filters in the spatial and temporal modalities, respectively. Note that these two values can be chosen to be different, which provides additional design flexibility. Then, a separable spatio-temporal graph filtering operation can be defined as

$$H(S_s)\,X\,G^\top(S_t) = \Big( \sum_{p=0}^{P-1} h_p S_s^p \Big) X \Big( \sum_{q=0}^{Q-1} g_q S_t^q \Big)^{\!\top} = \big(H(S_s) \otimes G(S_t)\big)\, x, \qquad (1)$$

where the second equality follows from the property $\mathrm{vec}(M_1 X M_2^\top) = (M_1 \otimes M_2)\, x$. We can also represent the filtering process in the graph spectral domain. Let the eigen-decompositions of the spatial and temporal graph shifts be $S_s = V_s \Lambda_s V_s^\top$ and $S_t = V_t \Lambda_t V_t^\top$, respectively, where $V_s \in \mathbb{R}^{N \times N}$ and $V_t \in \mathbb{R}^{T \times T}$ form the spatial and temporal graph Fourier bases. The elements along the diagonals of $\Lambda_s, \Lambda_t$ represent the spatial and temporal graph frequencies. We have

$$H(S_s)\,X = V_s \Big( \sum_{p=0}^{P-1} h_p \Lambda_s^p \Big) V_s^\top X = V_s H(\Lambda_s) V_s^\top X, \qquad X\,G^\top(S_t) = X\, V_t \Big( \sum_{q=0}^{Q-1} g_q \Lambda_t^q \Big) V_t^\top = X\, V_t G(\Lambda_t) V_t^\top.$$

Letting $V = V_s \otimes V_t$, the spectral representation of the separable spatio-temporal graph filtering is then $(V_s H(\Lambda_s) V_s^\top)\, X\, (V_t G(\Lambda_t) V_t^\top) = V \big(H(\Lambda_s) \otimes G(\Lambda_t)\big) V^\top x$.

Joint graph filters. Given the joint graph structure $G = G_s \diamond G_t = (V, E, A)$, let the spatio-temporal graph shift be $S = A$. Then, a joint spatio-temporal graph filtering operation can be defined as

$$H(S)\,x = \sum_{k=0}^{K-1} h_k S^k x = V H(\Lambda) V^\top x, \qquad (2)$$

where $h_k$ is the filter coefficient.
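Both filtering forms can be checked numerically on small random symmetric shifts (the sizes and filter taps below are arbitrary illustrative choices): the separable form (1) against the vectorization identity, and the joint form (2) against its spectral representation on a Cartesian-product shift. A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, P, Q, K = 3, 4, 2, 3, 3   # hypothetical sizes and filter lengths

# Random symmetric spatial/temporal shifts and filter taps h_p, g_q, h_k.
S_s = rng.standard_normal((N, N)); S_s = (S_s + S_s.T) / 2
S_t = rng.standard_normal((T, T)); S_t = (S_t + S_t.T) / 2
h, g, hk = rng.standard_normal(P), rng.standard_normal(Q), rng.standard_normal(K)
X = rng.standard_normal((N, T))

# Separable filtering (1): H(S_s) X G^T(S_t) equals (H ⊗ G) x for row-major x.
H = sum(h[p] * np.linalg.matrix_power(S_s, p) for p in range(P))
G = sum(g[q] * np.linalg.matrix_power(S_t, q) for q in range(Q))
assert np.allclose((H @ X @ G.T).reshape(-1), np.kron(H, G) @ X.reshape(-1))

# Joint filtering (2) on the Cartesian product: vertex- and spectral-domain
# computations agree, with the shared Fourier basis V = V_s ⊗ V_t.
S = np.kron(S_s, np.eye(T)) + np.kron(np.eye(N), S_t)
lam_s, V_s = np.linalg.eigh(S_s)
lam_t, V_t = np.linalg.eigh(S_t)
V = np.kron(V_s, V_t)
lam = np.kron(lam_s, np.ones(T)) + np.kron(np.ones(N), lam_t)  # joint frequencies
x = rng.standard_normal(N * T)
y_vertex = sum(hk[k] * np.linalg.matrix_power(S, k) @ x for k in range(K))
y_spectral = V @ (np.polyval(hk[::-1], lam) * (V.T @ x))
assert np.allclose(y_vertex, y_spectral)
```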
The kernel function $h(\lambda) = \sum_{k=0}^{K-1} h_k \lambda^k$ is applied to each diagonal element of $\Lambda$ to obtain $H(\Lambda)$. Here $h(\lambda)$ is independent of any specific graph structure, and characterizes the filter response in the graph spectral domain. Note that $h = (h_0, \ldots, h_{K-1})$, $h(\lambda)$ and $H(\cdot)$ are essentially the same thing, and are used interchangeably in this paper. It is worth pointing out that all three product graphs share the same form of joint spatio-temporal graph filtering (2) as well as the same graph Fourier basis $V = V_s \otimes V_t$. Following from (2), the spectral representation of the joint spatio-temporal graph filtering can be formulated as

Kronecker product: $H(S) = V\, H(\Lambda_s \otimes \Lambda_t)\, V^\top$;
Cartesian product: $H(S) = V\, H(\Lambda_s \otimes I_T + I_N \otimes \Lambda_t)\, V^\top$;
Strong product: $H(S) = V\, H(\Lambda_s \otimes \Lambda_t + \Lambda_s \otimes I_T + I_N \otimes \Lambda_t)\, V^\top$.

¹ Some other choices of a graph shift include the graph Laplacian matrix, the graph transition matrix and their normalized counterparts. The adjacency matrix is considered here for notational simplicity.

Comparison between separable and joint graph filters. First of all, we stress that both separable and joint spatio-temporal graph filtering share the same Fourier basis, meaning that they share the same frequency space and their difference only comes from the frequency responses. Second, designing filters based on separable spatio-temporal graphs provides additional flexibility. Although it is conceptually simple to design graph filters directly on product graphs, the eigenvalues along the spatial and temporal domains are tied together, making it difficult to adjust the frequency responses independently for the two modalities. Moreover, the two domains are forced to share the same set of filter coefficients and the same filter length. Take a filter defined on the Kronecker product graph as an example. By expanding the term $H(\Lambda_s \otimes \Lambda_t)$ we have

$$H(\Lambda_s \otimes \Lambda_t) = \sum_{k=0}^{K-1} h_k (\Lambda_s \otimes \Lambda_t)^k = \sum_{k=0}^{K-1} h_k (\Lambda_s^k \otimes \Lambda_t^k).$$
This shows that the filter coefficients are applied to the product of the spatial and temporal eigenvalues, making it hard to decompose and interpret the functionality of the filter in a single modality. Such limitations make joint filters less practical for spatio-temporal signals, which might have distinct patterns in each of the two modalities. This problem is overcome by separable graph filtering, where different filters are applied to each modality. The flexibility of separable graph filters means that one can design different filters ($h$ and $g$) with independent filter lengths ($P$ and $Q$) in the spatial and temporal domains. However, it is worth pointing out that the representation powers of these two formulations do not have a clear relationship in which one is a subset of the other. Consider a joint graph filter designed on a strong product graph with length $K = 3$. The filter kernel is defined as $H(\Lambda_s \otimes \Lambda_t + \Lambda_s \otimes I_T + I_N \otimes \Lambda_t) = \sum_{k=0}^{2} h_k (\Lambda_s \otimes \Lambda_t + \Lambda_s \otimes I_T + I_N \otimes \Lambda_t)^k$. Similarly, the kernel of a separable graph filter with $P = Q = 3$ can be written as $H(\Lambda_s) \otimes G(\Lambda_t) = (\sum_{p=0}^{2} h_p \Lambda_s^p) \otimes (\sum_{q=0}^{2} g_q \Lambda_t^q)$. By expanding the expressions and rearranging the coefficients, one can obtain the coefficient matrices for the joint graph filter and the separable graph filter, $C_1$ and $C_2$, respectively; that is,

$$C_1 = \begin{pmatrix} h_0 & h_1 & h_2 \\ h_1 & h_1 + 2h_2 & 2h_2 \\ h_2 & 2h_2 & h_2 \end{pmatrix}, \qquad C_2 = \begin{pmatrix} h_0 g_0 & h_0 g_1 & h_0 g_2 \\ h_1 g_0 & h_1 g_1 & h_1 g_2 \\ h_2 g_0 & h_2 g_1 & h_2 g_2 \end{pmatrix} = \begin{pmatrix} h_0 \\ h_1 \\ h_2 \end{pmatrix} \begin{pmatrix} g_0 & g_1 & g_2 \end{pmatrix},$$

where the $(i, j)$-th element is the coefficient of the term $\Lambda_s^{i-1} \otimes \Lambda_t^{j-1}$. On one hand, it is obvious that $C_2$ is always a rank-1 matrix, while $C_1$ could have rank 1, 2, or 3. So $C_1$ is not a special case of $C_2$. On the other hand, $C_1$ is always a symmetric matrix, but $C_2$ can be either symmetric or non-symmetric, depending on the choices of $h$ and $g$. So $C_2$ is also not a special case of $C_1$. Therefore, the families spanned by the two designs do not have a simple relationship in which one is a subset of the other. Similar conclusions hold for the Kronecker and Cartesian products.
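This rank argument is easy to confirm numerically; with arbitrary illustrative coefficients, $C_1$ is symmetric (and here full rank) while $C_2$ is exactly the rank-1 outer product $h\,g^\top$:

```python
import numpy as np

# Coefficient matrices from Section 3.2; the (i, j) entry multiplies
# Λ_s^{i-1} ⊗ Λ_t^{j-1}. Coefficient values are arbitrary illustrations.
h0, h1, h2 = 1.0, -0.5, 0.25
g = np.array([0.3, 0.6, -0.2])

# Joint strong-product filter, K = 3: always symmetric, rank can reach 3.
C1 = np.array([[h0, h1,          h2    ],
               [h1, h1 + 2 * h2, 2 * h2],
               [h2, 2 * h2,      h2    ]])

# Separable filter, P = Q = 3: outer product h g^T, hence always rank 1.
C2 = np.outer([h0, h1, h2], g)

assert np.allclose(C1, C1.T)
assert np.linalg.matrix_rank(C2) == 1
assert np.linalg.matrix_rank(C1) == 3   # full rank for these particular values
```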
Third, designing based on separable spatio-temporal graphs is computationally more efficient. In a separable graph filtering process, we only need to deal with two small matrix multiplications (1), instead of one large matrix-vector multiplication (2), reducing the computational cost from $O(N^2 T^2)$ to $O(NT(N + T))$. In short, the joint and separable graph filters are two different design methods for spatio-temporal graphs. Though the representation power of separable graph filters is not necessarily stronger than that of joint ones, the separable design enjoys flexibility, computational efficiency and straightforward interpretation. Empirical performance also shows that the separable design outperforms the joint one; see Section 5. Note that this separable design coincides with the basic module widely used in spatio-temporal graph convolutional networks (Li et al. (2019)), which consists of one graph convolution layer followed by one temporal convolution layer.

3.3 SPATIO-TEMPORAL GRAPH WAVELETS

In time-series analysis and image processing, wavelets are one of the best tools for designing filter banks, allowing us to trade off time-frequency resolutions and to approach the lower bound of the uncertainty principle of time-frequency representations (Akansu & Haddad, 2000). Inspired by this, we propose spatio-temporal graph wavelets, which comprise a series of mathematically designed graph filters that provide multiresolution analysis for spatio-temporal graph signals. The proposed spatio-temporal graph wavelets are later used at each layer of the proposed ST-GST framework. Based on the two types of graph structures, we consider two designs: separable and joint wavelets.

Separable graph wavelets. Based on separable spatio-temporal graph filtering (1), we are able to design spatial graph wavelets, $\{H_{j_1}(S_s) = \sum_{p=0}^{P-1} h_p^{(j_1)} S_s^p\}_{j_1=1}^{J_s}$, and temporal graph wavelets, $\{G_{j_2}(S_t) = \sum_{q=0}^{Q-1} g_q^{(j_2)} S_t^q\}_{j_2=1}^{J_t}$, separately.
For each modality, the filter at scale $j$ is defined as $H_j(S) = S^{2^{j-1}} - S^{2^j} = S^{2^{j-1}} (I - S^{2^{j-1}})$. There are also many other off-the-shelf graph wavelets we can choose from. More discussion about wavelets and their properties can be found in Appendix A. Since the two modalities are designed individually, the number of wavelet scales for each modality can be different. This is important in practice because the number of time samples $T$ is normally larger than the number of spatial nodes $N$. For each node in the spatio-temporal graph, using different wavelet scales in the two domains allows more flexibility in diffusing the signal to its neighbors. Based on this construction, when we choose $J_s$ and $J_t$ scales for the spatial and temporal domains, respectively, the overall number of scales for the spatio-temporal wavelets is $J = J_s J_t$.

Joint graph wavelets. When the joint filtering (2) is chosen, we can directly apply existing graph wavelet designs, such as the spectral graph wavelet transform (Hammond et al., 2011).

3.4 SPATIO-TEMPORAL GRAPH SCATTERING TRANSFORM

The proposed ST-GST is a nonlinear version of spatio-temporal graph wavelets, which iteratively uses wavelets followed by a nonlinear activation function. ST-GST includes three components: (i) spatio-temporal graph wavelets, (ii) a pointwise nonlinear activation function $\sigma(\cdot)$, and (iii) a low-pass pooling operator $U$. These operations are performed sequentially to extract representative features from the input spatio-temporal graph signal $X$. The main difference between ST-GST and spatio-temporal graph wavelets is the application of a nonlinear activation at each layer. The nonlinear transformation disperses signals through the graph spectrum, producing more patterns in the spectrum.

Separable ST-GST. Let $Z \in \mathbb{R}^{N \times T}$ be a spatio-temporal graph signal. At each scattering layer, we sequentially use spatial graph wavelets $\{H_{j_1}\}_{j_1=1}^{J_s}$ and temporal wavelets $\{G_{j_2}\}_{j_2=1}^{J_t}$ to convolve with $Z$.
Since each graph filter generates a new spatio-temporal graph signal, separable spatio-temporal graph filtering generates $J = J_s J_t$ spatio-temporal graph signals. Then, the nonlinear activation is applied to each spatio-temporal graph signal. For example, the $(j_1, j_2)$-th signal is $Z^{(j_1, j_2)} = \sigma(H_{j_1}(S_s)\, Z\, G_{j_2}^\top(S_t))$. We can treat each filtered spatio-temporal graph signal as one tree node. Given $Z$ as the parent node, a scattering layer produces $J$ children nodes. To construct ST-GST, we first initialize the input data $Z^{(0)} = X$ as the root of the scattering tree; we then recursively apply scattering layers at each node to produce children nodes, growing a scattering tree; see Fig. 2. We can index all the nodes in this scattering tree by a unique path from the root to each node. For example, $p(\ell) = ((j_1^{(1)}, j_2^{(1)}), \ldots, (j_1^{(\ell)}, j_2^{(\ell)}))$ is the path from the root to one tree node in the $\ell$-th layer, and the signal associated with it is $Z^{(p(\ell))}$. The data matrix $Z^{(p(\ell))}$ is then summarized by a pooling operator $U(\cdot)$ to obtain a lower-dimensional vector $\phi^{(p(\ell))} = U(Z^{(p(\ell))})$. Various pooling methods lead to different dimensions of the scattering features. Common choices for $U(\cdot)$ include averaging in the spatial domain ($U = \frac{1}{N}\mathbf{1}_N^\top$, $\phi = UZ \in \mathbb{R}^T$), averaging in the temporal domain ($U = \frac{1}{T}\mathbf{1}_T$, $\phi = ZU \in \mathbb{R}^N$), and averaging in both modalities ($U = \frac{1}{NT}\mathbf{1}_{N \times T}$, $\phi = \sum_{i,j} (U \circ Z)_{ij} \in \mathbb{R}$, where $\circ$ represents the Hadamard product). Finally, all scattering features $\phi^{(p(\ell))}$ are concatenated to construct a scattering feature map $\Phi(S_s, S_t, X) := \{\{\phi^{(p(\ell))}\}_{\text{all } p(\ell)}\}_{\ell=0}^{L-1}$.

Joint ST-GST. Since we deal with a unifying graph, we can use the spatio-temporal product graph directly in combination with the ordinary graph scattering transform (Gama et al. (2019b)).

Comparison with ST-GCNs.
One distinct difference between ST-GST and ST-GCNs lies in the fact that the trainable graph convolution in each layer of an ST-GCN performs the multiplication between a spatial graph shift and the feature matrix, which only extracts low-frequency information over the graph; ST-GST instead leverages multiple spatio-temporal graph filters to cover multiple frequency bands. Furthermore, the predefined filter coefficients form a frame (3) in each layer of ST-GST, which is crucial for establishing the stability of ST-GST, as shown in the next section.

4 THEORETICAL ANALYSIS

Stability is the key to designing robust and reliable algorithms. However, since the training process of ST-GCNs is data-driven, they might be vulnerable to small perturbations added to the training data, which may lead to significant degradation in practice. Trainable parameters make it hard to develop a theoretical analysis for ST-GCNs. In contrast, here we show that the proposed separable ST-GST is stable to perturbations of both spatio-temporal graph signals and structures. All proofs of the statements in this section are explained thoroughly in Appendix B. Unless specified otherwise, $\|x\|$ is the $\ell_2$ norm of a vector $x$, while $\|X\|$ and $\|X\|_2$ are the Frobenius and spectral norms of a matrix $X$, respectively. Here we show the results for the separable spatio-temporal graph scattering transform, but all the results can be extended to the joint version.

Figure 2: Scattering tree of separable ST-GST with $L = 3$, $J_s = J_t = 2$.

Before introducing perturbations, we first show that separable spatio-temporal graph wavelets also satisfy certain frame bounds. Thus, with a separable construction, we can control the bound constants for spatio-temporal wavelets and build tight frames.

Lemma 1. Let $\{H_{j_1}\}_{j_1=1}^{J_s}$ and $\{G_{j_2}\}_{j_2=1}^{J_t}$ be the wavelet filter banks for the spatial domain and the temporal domain, respectively.
Both satisfy frame properties such that for any $x \in \mathbb{R}^N$ and $y \in \mathbb{R}^T$,

$$A_1^2 \|x\|^2 \le \sum_{j_1=1}^{J_s} \|H_{j_1}(S_s)\,x\|^2 \le B_1^2 \|x\|^2, \qquad A_2^2 \|y\|^2 \le \sum_{j_2=1}^{J_t} \|G_{j_2}(S_t)\,y\|^2 \le B_2^2 \|y\|^2. \qquad (3)$$

Then, for any $Z \in \mathbb{R}^{N \times T}$ and its corresponding reshaped vector $z \in \mathbb{R}^{NT}$, it holds that

$$A_1^2 A_2^2 \|Z\|^2 \le \sum_{j_1=1}^{J_s} \sum_{j_2=1}^{J_t} \big\|\big(H_{j_1}(S_s) \otimes G_{j_2}(S_t)\big)\,z\big\|^2 = \sum_{j_1=1}^{J_s} \sum_{j_2=1}^{J_t} \big\|H_{j_1}(S_s)\,Z\,G_{j_2}^\top(S_t)\big\|^2 \le B_1^2 B_2^2 \|Z\|^2.$$

Lemma 1 guarantees that the separable design also leads to valid wavelets. Furthermore, when we choose both the spatial $\{H_{j_1}\}_{j_1=1}^{J_s}$ and temporal $\{G_{j_2}\}_{j_2=1}^{J_t}$ banks to be tight frames with $A_1 = B_1 = A_2 = B_2 = 1$ (Shuman et al., 2015), the resulting separable wavelet also forms a tight frame. In what follows, denote by $B = B_1 B_2$ the frame bound constant for the separable spatio-temporal graph wavelet, and assume separable ST-GST is configured with $L$ layers and $J = J_s J_t$ scales at each layer.

4.1 STABILITY TO PERTURBATION OF SPATIO-TEMPORAL GRAPH SIGNALS

Consider the perturbed spatio-temporal graph signal $\tilde{X} = X + \Delta \in \mathbb{R}^{N \times T}$, where $\Delta \in \mathbb{R}^{N \times T}$ is the perturbation matrix. Such an additive model can represent measurement noise caused by devices or adversarial perturbations added manually. Theorem 1 shows that the feature map of the perturbed signal does not deviate much from the original feature map under small input perturbations.

Theorem 1. Consider the additive noise model for the input data $X$. It then holds that

$$\frac{1}{\sqrt{T \sum_{\ell=0}^{L-1} J^\ell}} \left\| \Phi(S_s, S_t, X) - \Phi(S_s, S_t, \tilde{X}) \right\| \le \sqrt{\frac{\sum_{\ell=0}^{L-1} B^{2\ell}}{\sum_{\ell=0}^{L-1} J^\ell}}\; \|\Delta\|. \qquad (4)$$

The difference of the outputs is normalized by the square root of the dimension of the final feature map. Note that we can easily construct spatio-temporal wavelets with $B = 1$ when the spatial and temporal wavelets are both tight frames; the normalized bound in (4) then indicates that the transform is insensitive to perturbations of the input signals, as the factor is much smaller than 1.

Table 1: Classification accuracy (MSR Action3D with 288 training and 269 testing samples).

Method                              Accuracy (%)
GFT+TPM                             74.0
HDM                                 81.8
Temporal Conv.                      72.1
ST-GCN (fixed topology)             52.0
MS-G3D (learnable topology)         82.2
Separable ST-GST (5, 10, 3)         81.4
Separable ST-GST (5, 20, 3)         87.0
Joint Kronecker ST-GST (15, 3)      61.0
Joint Cartesian ST-GST (15, 3)      59.1
Joint Strong ST-GST (15, 3)         61.7

Table 2: Classification accuracy (NTU-RGB+D with 40,320 training and 16,560 testing samples).

Method                              Accuracy (%)
Deep LSTM                           60.7
PA-LSTM                             62.9
ST-LSTM+TG                          69.2
Temporal Conv.                      74.3
ST-GCN (fixed topology)             75.8
Separable ST-GST (5, 20, 2)         68.7
Separable ST-GST (5, 20, 3)         73.1
Joint Kronecker ST-GST (15, 3)      55.7
Joint Cartesian ST-GST (15, 3)      56.2
Joint Strong ST-GST (15, 3)         57.1

Figure 3: Performance comparisons with various settings for training ratio, time and nonlinearity. (a) Limited training data. (b) Performance vs. time. (c) Nonlinear vs. linear.

4.2 STABILITY TO PERTURBATION OF SPATIO-TEMPORAL GRAPH STRUCTURES

Perturbations of the underlying graph usually happen when the graph is unknown or when the graph changes over time (Segarra et al., 2017). Since this kind of perturbation usually happens in the spatial domain, here we consider structure perturbations of the spatial graph only. Specifically, we consider the perturbed spatial graph $\hat{S}_s = S_s + E S_s + S_s E$, where $E$ is the perturbation matrix, while the temporal graph $S_t$ is unchanged. Detailed descriptions are given in Appendix B.

Theorem 2. Suppose the eigenvalues $\{m_i\}_{i=1}^N$ of $E \in \mathbb{R}^{N \times N}$ are organized in order such that $|m_1| \le |m_2| \le \cdots \le |m_N|$, satisfying $|m_N| \le \epsilon/2$ and $|m_i / m_N - 1| \le \epsilon$ for some $\epsilon > 0$ and all $i$. Suppose the spatial filter bank $\{H_{j_1}\}_{j_1=1}^{J_s}$ satisfies $\max_{i,\lambda} |\lambda\, h_i'(\lambda)| \le C$ and the temporal filter bank $\{G_{j_2}\}_{j_2=1}^{J_t}$ satisfies the limited spectral response $\max_{i,\lambda} |g_i(\lambda)| \le D$. It then holds that

$$\frac{1}{\sqrt{T \sum_{\ell=0}^{L-1} J^\ell}} \left\| \Phi(S_s, S_t, X) - \Phi(\hat{S}_s, S_t, X) \right\| \le \frac{\epsilon C D}{B} \sqrt{\frac{\sum_{\ell=0}^{L-1} \ell^2 (B^2 J)^\ell}{\sum_{\ell=0}^{L-1} J^\ell}}\; \|X\|, \qquad (5)$$

where $\epsilon$ characterizes the perturbation level.
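Both stability bounds rest on the frame property of Lemma 1, which is easy to sanity-check numerically. The sketch below builds dyadic wavelets (Section 3.3) on lazy diffusion shifts over toy path graphs (the shift choice, graphs and sizes are illustrative assumptions) and verifies that the separable bank's energy stays within the product bounds $A_1^2 A_2^2$ and $B_1^2 B_2^2$ computed from each modality:

```python
import numpy as np

def lazy_diffusion(A):
    # Symmetric lazy diffusion shift S = (I + D^{-1/2} A D^{-1/2}) / 2.
    d = A.sum(1)
    return 0.5 * (np.eye(len(A)) + A / np.sqrt(np.outer(d, d)))

def dyadic_wavelets(S, J):
    # H_j(S) = S^{2^{j-1}} - S^{2^j}, j = 1..J (Section 3.3).
    return [np.linalg.matrix_power(S, 2 ** (j - 1))
            - np.linalg.matrix_power(S, 2 ** j) for j in range(1, J + 1)]

def frame_constants(filters):
    # A^2 and B^2 are the extreme eigenvalues of sum_j H_j^T H_j.
    w = np.linalg.eigvalsh(sum(F.T @ F for F in filters))
    return w[0], w[-1]

rng = np.random.default_rng(0)
path = lambda n: np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
S_s, S_t = lazy_diffusion(path(5)), lazy_diffusion(path(8))   # toy N=5, T=8
Hs, Gt = dyadic_wavelets(S_s, J=2), dyadic_wavelets(S_t, J=3)

A1s, B1s = frame_constants(Hs)          # spatial A_1^2, B_1^2
A2s, B2s = frame_constants(Gt)          # temporal A_2^2, B_2^2
Z = rng.standard_normal((5, 8))
energy = sum(np.linalg.norm(H @ Z @ G.T) ** 2 for H in Hs for G in Gt)
nZ = np.linalg.norm(Z) ** 2
# Lemma 1: A_1^2 A_2^2 ||Z||^2 <= sum ||H Z G^T||^2 <= B_1^2 B_2^2 ||Z||^2
assert A1s * A2s * nZ - 1e-9 <= energy <= B1s * B2s * nZ + 1e-9
```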
Theorem 2 shows that ST-GST is a stable transform also under structure deformations, as the norm of the change of the feature map is linear in $\epsilon$. It is worth pointing out that the upper bounds in both Theorems 1 and 2 depend only on the choice of filter banks and the structure of the scattering tree, instead of quantities related to the specific graph supports $S_s$ and $S_t$ that appear in previous works (Gama et al., 2019b; Zou & Lerman, 2020; Levie et al., 2019).

5 EXPERIMENTAL RESULTS

We now evaluate the performance of the proposed ST-GST on the skeleton-based action recognition task.

Experimental setup. The number of layers, $L$, the number of spatial wavelet scales, $J_s$, and the number of temporal wavelet scales, $J_t$, are represented by $(J_s, J_t, L)$ for separable ST-GST, and $(J, L)$ for joint ST-GST. Training ratio means the fraction of data used for training in the training set. For the spatial domain, we use the skeleton graph; for the temporal domain, we use a line graph connecting consecutive time stamps; see Fig. 1(a)(b). Geometric scattering wavelets are used in both domains, and the nonlinear activation $\sigma(\cdot)$ is the absolute value function, which is energy-preserving. The features output by ST-GST are then fed to a random forest classifier with 300 decision trees for classification.

Comparison with state-of-the-art methods. We consider two datasets, MSR Action3D and NTU-RGB+D (cross-subject). For MSR Action3D, the proposed ST-GST is compared with GFT facilitated by temporal pyramid matching (GFT+TPM) (Kao et al., 2019), a Bayesian hierarchical dynamic model (HDM) (Zhao et al., 2019), and a few deep learning approaches, including temporal convolutional neural networks (Kim & Reiter, 2017), ST-GCN (Yan et al., 2018), and MS-G3D (Liu et al., 2020). For NTU-RGB+D, Deep LSTM (Liu et al., 2019), part-aware LSTM (PA-LSTM) (Liu et al., 2019) and spatio-temporal LSTM with trust gates (ST-LSTM+TG) (Liu et al., 2016) are included in the comparison.
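Putting the pieces together, the feature-extraction pipeline used here (separable dyadic wavelets, absolute-value nonlinearity, average pooling over both modalities) can be sketched end-to-end in a few lines; the graph shifts, sizes and scale counts below are illustrative placeholders, not the experimental settings, and the classifier on top would be the random forest mentioned above:

```python
import numpy as np

def dyadic_wavelets(S, J):
    # H_j(S) = S^{2^{j-1}} (I - S^{2^{j-1}}), j = 1..J (Section 3.3).
    return [np.linalg.matrix_power(S, 2 ** (j - 1))
            @ (np.eye(len(S)) - np.linalg.matrix_power(S, 2 ** (j - 1)))
            for j in range(1, J + 1)]

def st_gst_features(X, Hs, Gt, L):
    # Separable ST-GST: grow the scattering tree layer by layer; pool every
    # node with the both-modality average U and apply sigma = |.| to children.
    feats, layer = [], [X]
    for _ in range(L):
        feats.extend(Z.mean() for Z in layer)              # phi = mean(Z)
        layer = [np.abs(H @ Z @ G.T) for Z in layer
                 for H in Hs for G in Gt]                  # children Z^{(j1,j2)}
    return np.array(feats)

rng = np.random.default_rng(0)
N, T, Js, Jt, L = 5, 8, 2, 2, 3        # toy sizes, not the paper's settings
S_s = rng.random((N, N)); S_s = (S_s + S_s.T) / (2 * N)    # placeholder shifts
S_t = rng.random((T, T)); S_t = (S_t + S_t.T) / (2 * T)
phi = st_gst_features(rng.standard_normal((N, T)),
                      dyadic_wavelets(S_s, Js), dyadic_wavelets(S_t, Jt), L)
# The tree has 1 + J + J^2 nodes with J = Js*Jt = 4, i.e. 21 scattering
# features per sample; these vectors are what the classifier consumes.
assert phi.shape == (21,)
```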
Methods labeled fixed topology are modified so as not to use adaptive training of the adjacency matrix, to make the comparison with ST-GST fair. Tables 1 and 2 compare the classification accuracies on MSR Action3D and NTU-RGB+D, respectively. We see that, even without any training, ST-GST performs better than the other non-deep-learning and LSTM-based methods, and is comparable with state-of-the-art GCN-based methods on the large-scale dataset. Further, ST-GST outperforms all other methods when the training set is small. Fig. 3(a) shows the classification accuracy as a function of the training ratio. When the training ratio is less than 20%, ST-GST significantly outperforms ST-GCN. Fig. 3(b) shows accuracy against running time, reflecting that ST-GST is much faster than ST-GCN with similar classification performance.

ST-GST works well in the small-scale-data regime. Table 1 and Fig. 3(a) show that ST-GST outperforms the deep learning methods in the small-scale-data regime, which can be explained as follows. The good performance of spatio-temporal graph neural networks relies heavily on the assumption that training data is abundant. When the training set is limited, most of them are easily trapped in bad local optima due to overfitting, resulting in a significant drop in classification accuracy. In practice, however, obtaining a huge amount of training data with high-quality labels can be extremely expensive. Since ST-GST is a non-trainable framework, its filter coefficients are mathematically designed rather than learned from data, which avoids overfitting when the training ratio is low. Another advantage of ST-GST over ST-GCN is that it requires less computation, because no training process is involved.

Separable design is better than joint design.
Tables 1 and 2 also show that separable spatio-temporal graph wavelets work much better than joint ones, achieving a 25% increase in classification accuracy on the MSR Action3D dataset. This result is consistent with our analysis in Section 3.2. The intuition is that, when dealing with spatio-temporal data generated from complex structures such as skeleton sequences, the fixed dependencies induced by product graphs highly restrict the way signals can diffuse over the spatio-temporal graph and thus limit the efficiency of feature extraction.

Nonlinearity is beneficial. Fig. 3(c) compares ST-GST with and without nonlinearity and shows that the nonlinearity is critical to ST-GST, also reflecting its potential effect in ST-GCNs.

6 CONCLUSIONS

In this work we propose a novel spatio-temporal graph scattering transform (ST-GST), which can be viewed as one forward pass of a spatio-temporal graph convolutional network (ST-GCN) without any training. ST-GST is stable to small perturbations of both input signals and structures. Our experiments show that: i) ST-GST can achieve better performance than both non-deep-learning and ST-GCN-based methods when the number of training samples is limited; ii) designing spatial and temporal graph filters separately is more flexible and computationally efficient than designing them jointly; and iii) the nonlinearity is critical to performance.

7 ACKNOWLEDGEMENT

This work is fully supported by Mitsubishi Electric Research Laboratories (MERL), where Chao Pan was a research intern, Siheng Chen was a research scientist and Antonio Ortega is a consultant.

REFERENCES

Ali N. Akansu and Richard A. Haddad. Multiresolution signal decomposition: transforms, subbands, and wavelets. Academic Press, 2nd edition, 2000.

Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.

Mustafa Bilgic, Lilyana Mihalkova, and Lise Getoor. Active learning for networked data.
In Proceedings of the 27th International Conference on Machine Learning, pp. 79–86, 2010.

Paulo Vinicius Koerich Borges, Nicola Conci, and Andrea Cavallaro. Video-based human behavior understanding: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 23(11):1993–2008, 2013.

Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.

Ke Cheng, Yifan Zhang, Xiangyu He, Weihan Chen, Jian Cheng, and Hanqing Lu. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 183–192, 2020.

Fernando Gama, Alejandro Ribeiro, and Joan Bruna. Diffusion scattering transforms on graphs. In International Conference on Learning Representations, 2019a.

Fernando Gama, Alejandro Ribeiro, and Joan Bruna. Stability of graph scattering transforms. In Advances in Neural Information Processing Systems, pp. 8038–8048, 2019b.

Feng Gao, Guy Wolf, and Matthew Hirn. Geometric scattering for graph data analysis. In International Conference on Machine Learning, pp. 2122–2131, 2019.

Francesco Grassi, Andreas Loukas, Nathanaël Perraudin, and Benjamin Ricaud. A time-vertex signal processing framework: Scalable processing and meaningful representations for time-series on graphs. IEEE Transactions on Signal Processing, 66(3):817–829, 2017.

David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.

Yue Hu, Siheng Chen, Ya Zhang, and Xiao Gu. Collaborative motion prediction via neural motion message passing. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 6318–6327. IEEE, 2020.

Vassilis N Ioannidis, Siheng Chen, and Georgios B Giannakis. Pruned graph scattering transforms.
In International Conference on Learning Representations, 2020.

Jiun-Yu Kao, Antonio Ortega, Dong Tian, Hassan Mansour, and Anthony Vetro. Graph based skeleton modeling for human activity analysis. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 2025–2029. IEEE, 2019.

Tae Soo Kim and Austin Reiter. Interpretable 3d human action analysis with temporal convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1623–1631. IEEE, 2017.

Ron Levie, Elvin Isufi, and Gitta Kutyniok. On the transferability of spectral graph filters. In 2019 13th International Conference on Sampling Theory and Applications (SampTA), pp. 1–5. IEEE, 2019.

Maosen Li, Siheng Chen, Xu Chen, Ya Zhang, Yanfeng Wang, and Qi Tian. Actional-structural graph convolutional networks for skeleton-based action recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 3595–3603. Computer Vision Foundation / IEEE, 2019.

Wanqing Li, Zhengyou Zhang, and Zicheng Liu. Action recognition based on a bag of 3d points. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 9–14. IEEE, 2010.

Jun Liu, Amir Shahroudy, Dong Xu, and Gang Wang. Spatio-temporal lstm with trust gates for 3d human action recognition. In European Conference on Computer Vision, pp. 816–833. Springer, 2016.

Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C. Kot. NTU RGB+D 120: A large-scale benchmark for 3d human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. doi: 10.1109/TPAMI.2019.2916873.

Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, and Wanli Ouyang. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
143–152, 2020.

Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.

Vishal Monga, Yuelong Li, and Yonina C Eldar. Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing. arXiv preprint arXiv:1912.10557, 2019.

Aliaksei Sandryhaila and Jose MF Moura. Big data analysis with signal processing on graphs: Representation and processing of massive data sets with irregular structure. IEEE Signal Processing Magazine, 31(5):80–90, 2014.

Santiago Segarra, Antonio G Marques, Gonzalo Mateos, and Alejandro Ribeiro. Network topology inference from spectral templates. IEEE Transactions on Signal and Information Processing over Networks, 3(3):467–483, 2017.

Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.

David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.

David I Shuman, Christoph Wiesmeyr, Nicki Holighaus, and Pierre Vandergheynst. Spectrum-adapted tight graph wavelet and vertex-frequency frames. IEEE Transactions on Signal Processing, 63(16):4223–4235, 2015.

Jiang Wang, Zicheng Liu, Ying Wu, and Junsong Yuan. Mining actionlet ensemble for action recognition with depth cameras. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1290–1297. IEEE, 2012.

Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 7444–7452. AAAI Press, 2018.

Rui Zhao, Wanru Xu, Hui Su, and Qiang Ji. Bayesian hierarchical dynamic model for human action recognition.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7733–7742, 2019.

Dongmian Zou and Gilad Lerman. Graph convolutional neural networks via scattering. Applied and Computational Harmonic Analysis, 49(3):1046–1074, 2020.

A DIFFERENT DESIGN OF GRAPH WAVELETS

There are many off-the-shelf, well-developed graph wavelets to choose from. They mainly focus on extracting features from multiple frequency bands of the input signal spectrum. Some of them are as follows.

Monic Cubic wavelets. Monic Cubic wavelets (Hammond et al., 2011) define the kernel function $h(\lambda)$ as

$$h(\lambda) = \begin{cases} \lambda, & \lambda < 1; \\ -5 + 11\lambda - 6\lambda^2 + \lambda^3, & 1 \le \lambda \le 2; \\ 2/\lambda, & \lambda > 2. \end{cases}$$

Different scales of filters are implemented by scaling and translating the above kernel function.

Itersine wavelets. Itersine wavelets define the kernel function at scale $j$ as

$$h_j(\lambda) = \sin\left( \frac{\pi}{2} \cos^2\left( \pi \left( \lambda - \frac{j-1}{2} \right) \right) \right).$$

Itersine wavelets form tight frames.

Geometric scattering wavelets. The geometric scattering wavelet filter bank (Gao et al., 2019) contains a set of filters based on the lazy random walk matrix. The filter at scale $j$ is defined as

$$H_j(S) = S^{2^{j-1}} - S^{2^j} = S^{2^{j-1}} \left( I - S^{2^{j-1}} \right),$$

where $S = \frac{1}{2}(I + A D^{-1})$ is the lazy random walk matrix and $D$ is the degree matrix.

Note that one is also allowed to customize either spatial or temporal graph wavelets, as long as they form a frame and satisfy the integral Lipschitz constraint shown as follows:

$$A^2 \|x\|^2 \le \sum_{j=1}^{J} \|H_j x\|^2 \le B^2 \|x\|^2, \qquad |\lambda h'(\lambda)| \le \text{const}, \ \forall \lambda,$$

where $A$, $B$ are scalar constants and $h'(\cdot)$ is the derivative of the kernel function.

B.1 PROOF OF LEMMA 1

By reshaping the signal from $Z$ to $z$ with $Z_{s,t} = z_{(s-1)T+t}$, we have that

$$\sum_{j_1=1}^{J_s} \sum_{j_2=1}^{J_t} \left\| (H_{j_1}(S_s) \otimes G_{j_2}(S_t)) z \right\|^2 = \sum_{j_1=1}^{J_s} \sum_{j_2=1}^{J_t} \left\| H_{j_1}(S_s) Z G_{j_2}^\top(S_t) \right\|^2.$$

Since $S_s$ and $S_t$ do not change over the computation, we simply write $H_{j_1}$ and $G_{j_2}$ for $H_{j_1}(S_s)$ and $G_{j_2}(S_t)$, respectively. Writing $H_{j_1} = (h_{ik}) \in \mathbb{R}^{N \times N}$, the Kronecker product is

$$H_{j_1} \otimes G_{j_2} = \begin{pmatrix} h_{11} G_{j_2} & \cdots & h_{1N} G_{j_2} \\ \vdots & \ddots & \vdots \\ h_{N1} G_{j_2} & \cdots & h_{NN} G_{j_2} \end{pmatrix}.$$
Applying it to the vector $z$, we obtain the filtered signal $y_{j_1,j_2} = (H_{j_1} \otimes G_{j_2}) z \in \mathbb{R}^{NT}$. The first $T$ elements of $y_{j_1,j_2}$ can be written as

$$y_{j_1,j_2}(1:T) = \sum_{i=1}^{N} h_{1i} G_{j_2} \left( Z_{i,1}, Z_{i,2}, \ldots, Z_{i,T} \right)^\top.$$

Therefore, applying the frame bounds $A_2^2$, $B_2^2$ of the temporal filter bank $\{G_{j_2}\}$, the quantity $\sum_{j_2} \|y_{j_1,j_2}\|^2$ can be sandwiched as

$$A_2^2 \sum_{t=1}^{T} \left\| H_{j_1} Z_{:,t} \right\|^2 \le \sum_{j_2=1}^{J_t} \left\| y_{j_1,j_2} \right\|^2 \le B_2^2 \sum_{t=1}^{T} \left\| H_{j_1} Z_{:,t} \right\|^2.$$

Summing the above quantity over $j_1$ and applying the frame bounds $A_1^2$, $B_1^2$ of the spatial filter bank $\{H_{j_1}\}$ gives

$$A_1^2 A_2^2 \|Z\|^2 \le \sum_{j_1=1}^{J_s} \sum_{j_2=1}^{J_t} \left\| y_{j_1,j_2} \right\|^2 \le B_1^2 B_2^2 \|Z\|^2,$$

which completes the proof.

Lemma 1 is a very handy result. It shows that we can construct new spatio-temporal wavelets simply by combining spatial and temporal ones. Moreover, the frame-bound constants for the new wavelets are easily obtained once the characteristics of the wavelets in each domain are known. In particular, it also provides a convenient way to build tight frames ($A = B$) for spatio-temporal data analysis, because we only need to choose tight frames for the spatial and temporal domains separately, without considering possible correlations.

B.2 PROOF OF THEOREM 1

In this proof we take the pooling operator $U(\cdot)$ to be the average over the spatial domain, so $U = \frac{1}{N} \mathbf{1}_N^\top$ and $\phi = U Z \in \mathbb{R}^T$. The proof techniques generalize easily to other forms of $U(\cdot)$. When reshaping $Z \in \mathbb{R}^{N \times T}$ to $z \in \mathbb{R}^{NT}$, the pooling operator can simply be represented as

$$\bar{U} = \frac{1}{N} \left( I_T, I_T, \ldots, I_T \right) \in \mathbb{R}^{T \times NT}, \qquad \phi = \bar{U} z.$$

Note that $\|\bar{U}\|_2 = 1/\sqrt{N}$. Consider the scattering tree nodes at the last layer $L-1$. Suppose they are indexed from $1$ to $J^{L-1}$ and associated with signals $a_1, \ldots, a_{J^{L-1}}$, and their parent nodes are indexed from $1$ to $J^{L-2}$ and associated with signals $b_1, \ldots, b_{J^{L-2}}$. When the input data $X$ is perturbed, all signals in the scattering tree change correspondingly; we denote the perturbed versions by $\widetilde{a}, \widetilde{b}$.
Then, for the change of the feature vector located at the node associated with $a_1$, it holds that

$$\|\phi_{a_1} - \phi_{\widetilde{a}_1}\| = \|\bar{U}(a_1 - \widetilde{a}_1)\| \le \|\bar{U}\|_2 \|a_1 - \widetilde{a}_1\| \le \frac{1}{\sqrt{N}} \left\| (H_{j_1} \otimes G_{j_2})(b_1 - \widetilde{b}_1) \right\|, \quad (6)$$

where $j_1 = j_2 = 1$. The last inequality holds because we use the absolute value function as the nonlinear activation, which is non-expansive. Summing the above quantity over $j_1, j_2$ and using the frame bound proved in Lemma 1, we have

$$\sum_{i=1}^{J^{L-1}} \|\phi_{a_i} - \phi_{\widetilde{a}_i}\|^2 \le \frac{B^2}{N} \sum_{i=1}^{J^{L-2}} \|b_i - \widetilde{b}_i\|^2. \quad (7)$$

Note that the sum of the squared norms of the changes at layer $L-2$ satisfies

$$\sum_{i=1}^{J^{L-2}} \|\phi_{b_i} - \phi_{\widetilde{b}_i}\|^2 \le \frac{1}{N} \sum_{i=1}^{J^{L-2}} \|b_i - \widetilde{b}_i\|^2. \quad (8)$$

Comparing Eq. (7) and (8), the upper bounds differ only by a factor $B^2$. By induction we then have

$$\|\Phi(S_s, S_t, X) - \Phi(S_s, S_t, \widetilde{X})\|^2 \le \frac{1}{N} \sum_{\ell=0}^{L-1} B^{2\ell} \|x - \widetilde{x}\|^2.$$

Normalizing by the dimension of the final feature map, we have

$$\frac{\|\Phi(S_s, S_t, X) - \Phi(S_s, S_t, \widetilde{X})\|}{\sqrt{T \sum_{\ell=0}^{L-1} J^\ell}} \le \frac{1}{\sqrt{NT}} \sqrt{\frac{\sum_{\ell=0}^{L-1} B^{2\ell}}{\sum_{\ell=0}^{L-1} J^\ell}} \, \|x - \widetilde{x}\|. \quad (9)$$

B.3 PROOF OF THEOREM 2

Perturbations of the underlying graph usually happen when the graph is unknown or when the graph changes over time (Segarra et al., 2017). Take skeleton-based action recognition as an example: some joints may be confused with others due to measurement noise of the devices during certain frames, so the location signals of those joints are interchanged. This leads to different spatial graph structures at those time stamps. Since such perturbations usually occur in the spatial domain, here we consider structure perturbations on the spatial graph only; the results can be extended to more general cases. Consider the original spatio-temporal graph $G$ with spatial graph shift matrix $S_s$ and temporal one $S_t$, and the perturbed graph $\widehat{G}$ with $\widehat{S}_s$ and $S_t$. We first show that ST-GST is invariant to node permutations in the spatial domain, where the set of permutation matrices is defined as

$$\mathcal{P} = \left\{ P \in \{0,1\}^{N \times N} : P \mathbf{1} = \mathbf{1}, \ P^\top \mathbf{1} = \mathbf{1}, \ P P^\top = I_N \right\}.$$

Note that we take $U(\cdot)$ to be the average over the spatial domain, so $U = \frac{1}{N} \mathbf{1}_N^\top$, $\phi = U Z \in \mathbb{R}^T$, and $\widehat{U} = U P$.

Lemma 2.
Consider the spatially permuted graph $\widehat{S}_s = P^\top S_s P$ and input data $\widehat{X} = P^\top X$. Then, it holds that

$$\Phi(S_s, S_t, X) = \Phi(\widehat{S}_s, S_t, \widehat{X}). \quad (10)$$

Proof. Note that the permutation holds for all signals computed in the scattering tree; that is, $\widehat{Z}^{(p(\ell))} = P^\top Z^{(p(\ell))}$. Suppose that for path $p(\ell)$ the last two filters chosen are $H(\widehat{S}_s)$ and $G(S_t)$. Then the feature vector after pooling, with respect to the new graph support and data, can be written as

$$\phi^{(p(\ell))}(\widehat{S}_s, S_t, \widehat{Z}^{(p(\ell))}) = \widehat{U} \sigma\big( H(\widehat{S}_s) \widehat{Z}^{(p(\ell))} G^\top(S_t) \big) = U P \sigma\big( P^\top H(S_s) P P^\top Z^{(p(\ell))} G^\top(S_t) \big).$$

The last equation holds by the definition of $H(S)$. Since the nonlinear activation is applied elementwise, we can rewrite it as

$$\phi^{(p(\ell))}(\widehat{S}_s, S_t, \widehat{Z}^{(p(\ell))}) = U \sigma\big( P P^\top H(S_s) P P^\top Z^{(p(\ell))} G^\top(S_t) \big) = U \sigma\big( H(S_s) Z^{(p(\ell))} G^\top(S_t) \big) = \phi^{(p(\ell))}(S_s, S_t, Z^{(p(\ell))}).$$

This conclusion holds independently of the specific path $p(\ell)$, so it holds for every feature vector after pooling in the scattering tree. Since the final feature map is just a concatenation of all feature vectors, the proof is complete.

Lemma 2 shows that the output of ST-GST is essentially independent of the node ordering in the spatial domain, as long as the permutation is consistent across all time stamps. This result is intuitive because the output of a graph convolution should depend only on the relative neighborhood structure of each node; since node reordering does not alter the neighborhood topology, the output remains unchanged. Based on Lemma 2, we use a relative perturbation model for structure modifications (Gama et al., 2019b), which focuses on the change of neighborhood topology, as opposed to the absolute perturbations adopted in Levie et al. (2019). Define the set of permutations that make $S_s$ and $\widehat{S}_s$ the closest as

$$\mathcal{P}_s := \arg\min_{P \in \mathcal{P}} \left\| P^\top \widehat{S}_s P - S_s \right\|_2.$$

Consider the set of perturbation matrices

$$\mathcal{E}(S_s, \widehat{S}_s) = \left\{ E \mid P^\top \widehat{S}_s P = S_s + E S_s + S_s E, \ P \in \mathcal{P}_s, \ E \in \mathbb{R}^{N \times N} \right\}.$$
Then the relative distance measuring structure perturbations can be defined as

$$d(S_s, \widehat{S}_s) = \min_{E \in \mathcal{E}(S_s, \widehat{S}_s)} \|E\|_2.$$

Note that if $\widehat{S}_s = P^\top S_s P$, meaning that the structure perturbation is purely a permutation, then the relative distance $d(S_s, \widehat{S}_s) = 0$, which is consistent with Lemma 2. Therefore, without loss of generality, we can assume that $P = I_N$ and $\widehat{S}_s = S_s + E S_s + S_s E$ in what follows. With this formulation, we are ready to prove Lemma 3.

Lemma 3. Suppose the eigenvalues $\{m_i\}_{i=1}^N$ of $E$ are ordered such that $|m_1| \le |m_2| \le \cdots \le |m_N|$, satisfying $|m_N| \le \epsilon/2$ and $|m_i/m_N - 1| \le \epsilon$ for $\epsilon > 0$. For the spatial graph filter $H(S_s)$ and the temporal graph filter $G(S_t)$, denote their kernel functions by $h(\lambda)$ and $g(\lambda)$, respectively. If, for all $\lambda$, $h(\lambda)$ satisfies the integral Lipschitz constraint $|\lambda h'(\lambda)| \le C$ and $g(\lambda)$ has bounded spectral response $|g(\lambda)| \le D$, then it holds that

$$\left\| H(S_s) \otimes G(S_t) - H(\widehat{S}_s) \otimes G(S_t) \right\|_2 \le \epsilon C D + O(\epsilon^2). \quad (11)$$

Proof. From Proposition 2 in Gama et al. (2019b), when $E$ satisfies the above conditions,

$$\left\| H(S_s) - H(\widehat{S}_s) \right\|_2 \le \epsilon C + O(\epsilon^2).$$

Hence

$$\left\| H(S_s) \otimes G(S_t) - H(\widehat{S}_s) \otimes G(S_t) \right\|_2 = \left\| (H(S_s) - H(\widehat{S}_s)) \otimes G(S_t) \right\|_2 \le \left\| H(S_s) - H(\widehat{S}_s) \right\|_2 \left\| G(S_t) \right\|_2 \le \epsilon C D + O(\epsilon^2).$$

The second step holds because $H(S_s) - H(\widehat{S}_s)$ is a symmetric matrix, which admits an eigendecomposition $F \Omega F^\top$, and $(F \Omega F^\top) \otimes (V \Lambda V^\top) = (F \otimes V)(\Omega \otimes \Lambda)(F \otimes V)^\top$ holds, which finishes the proof.

For general structural perturbations, where we want to bound $\| H(S_s) \otimes G(S_t) - H(\widehat{S}_s) \otimes G(\widehat{S}_t) \|_2$, we can add and subtract the term $H(\widehat{S}_s) \otimes G(S_t)$, use the triangle inequality, and further bound the two resulting terms under additional assumptions on $h(\lambda)$ and $g(\lambda)$.

The bound in Lemma 3 indicates that the change of output caused by replacing the spatial graph support $S_s$ with $\widehat{S}_s$ is proportional to $\epsilon$, a scalar characterizing the level of the perturbation. The constraints on the eigenvalues of $E$ limit the change of graph structure; a more detailed explanation of why such constraints are necessary can be found in Gama et al. (2019b).
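The key Kronecker-product step above (singular values of $A \otimes B$ are the pairwise products of the singular values of $A$ and $B$, so $\|A \otimes B\|_2 = \|A\|_2 \|B\|_2$) is easy to check numerically. The matrices below are random symmetric stand-ins for $H(S_s) - H(\widehat{S}_s)$ and $G(S_t)$, not the paper's actual filters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric stand-ins for H(Ss) - H(Ss_hat) and G(St).
A = rng.standard_normal((4, 4)); A = (A + A.T) / 2
B = rng.standard_normal((3, 3)); B = (B + B.T) / 2

# ||A (x) B||_2 equals ||A||_2 * ||B||_2: the spectral norms multiply,
# which turns the eps*C bound on the spatial factor into eps*C*D overall.
lhs = np.linalg.norm(np.kron(A, B), 2)
rhs = np.linalg.norm(A, 2) * np.linalg.norm(B, 2)
assert np.isclose(lhs, rhs)
```

Here `np.linalg.norm(., 2)` computes the spectral norm (largest singular value) of a matrix, matching the operator norm used in Lemma 3.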
With Lemma 3 in hand, we can bound the change of the feature vector after pooling at each node of the scattering tree under such structure perturbations.

Lemma 4. Consider an ST-GST with $L$ layers and $J = J_s \cdot J_t$ scales at each layer. Suppose that the graph filter bank forms a frame with upper bound $B = B_1 B_2$, where $B_1, B_2$ are the frame bounds for the spatial and temporal domains, respectively. Suppose that, for all $\lambda$, the spatial wavelet filter bank $\{H_{j_1}\}_{j_1=1}^{J_s}$ satisfies $\max_i |\lambda h_i'(\lambda)| \le C$ and the temporal wavelet filter bank $\{G_{j_2}\}_{j_2=1}^{J_t}$ satisfies $\max_i |g_i(\lambda)| \le D$, with the other conditions as in Lemma 3. Then, for the change of the feature vector $\phi_{p(\ell)}$ associated with path $p(\ell)$, it holds that

$$\left\| \phi_{p(\ell)}(S_s, S_t, X) - \phi_{p(\ell)}(\widehat{S}_s, S_t, X) \right\| \le \frac{1}{\sqrt{N}} \, \epsilon \, \ell \, C D B^{\ell-1} \|X\|. \quad (12)$$

Proof. Expand $\| \phi_{p(\ell)}(S_s, S_t, X) - \phi_{p(\ell)}(\widehat{S}_s, S_t, X) \|$ as

$$\left\| \bar{U} \big( \sigma(H_{j_1^{(\ell)}}(S_s) \otimes G_{j_2^{(\ell)}}(S_t)) \big)_{p(\ell)} x - \bar{U} \big( \sigma(H_{j_1^{(\ell)}}(\widehat{S}_s) \otimes G_{j_2^{(\ell)}}(S_t)) \big)_{p(\ell)} x \right\| \le \frac{1}{\sqrt{N}} \left\| \big( \sigma(H_{j_1^{(\ell)}}(S_s) \otimes G_{j_2^{(\ell)}}(S_t)) \big)_{p(\ell)} x - \big( \sigma(H_{j_1^{(\ell)}}(\widehat{S}_s) \otimes G_{j_2^{(\ell)}}(S_t)) \big)_{p(\ell)} x \right\|,$$

where $\|\bar{U}\|_2 = 1/\sqrt{N}$ and $\big( \sigma(H_{j_1^{(\ell)}}(S_s) \otimes G_{j_2^{(\ell)}}(S_t)) \big)_{p(\ell)} x$ is shorthand for applying the spatio-temporal filters and the nonlinear activation to the input data $\ell$ times, in order, along the path $p(\ell)$. Adding and subtracting the mixed term

$$\sigma\big( H_{j_1^{(\ell)}}(S_s) \otimes G_{j_2^{(\ell)}}(S_t) \big) \, \sigma\big( H_{j_1^{(\ell-1)}}(\widehat{S}_s) \otimes G_{j_2^{(\ell-1)}}(S_t) \big) \cdots \sigma\big( H_{j_1^{(1)}}(\widehat{S}_s) \otimes G_{j_2^{(1)}}(S_t) \big) x$$

and applying the triangle inequality, we have

$$\begin{aligned}
&\left\| \big( \sigma(H_{j_1^{(\ell)}}(S_s) \otimes G_{j_2^{(\ell)}}(S_t)) \big)_{p(\ell)} x - \big( \sigma(H_{j_1^{(\ell)}}(\widehat{S}_s) \otimes G_{j_2^{(\ell)}}(S_t)) \big)_{p(\ell)} x \right\| \\
&\le \left\| \sigma(H_{j_1^{(\ell)}}(S_s) \otimes G_{j_2^{(\ell)}}(S_t)) \right\| \left\| \big( \sigma(H_{j_1^{(\ell-1)}}(S_s) \otimes G_{j_2^{(\ell-1)}}(S_t)) \big)_{p(\ell-1)} x - \big( \sigma(H_{j_1^{(\ell-1)}}(\widehat{S}_s) \otimes G_{j_2^{(\ell-1)}}(S_t)) \big)_{p(\ell-1)} x \right\| \\
&\quad + \left\| \sigma(H_{j_1^{(\ell)}}(S_s) \otimes G_{j_2^{(\ell)}}(S_t)) - \sigma(H_{j_1^{(\ell)}}(\widehat{S}_s) \otimes G_{j_2^{(\ell)}}(S_t)) \right\| \left\| \big( \sigma(H_{j_1^{(\ell-1)}}(\widehat{S}_s) \otimes G_{j_2^{(\ell-1)}}(S_t)) \big)_{p(\ell-1)} x \right\|.
\end{aligned}$$

This recursion can be solved explicitly (Gama et al., 2019b). By induction and the conclusion of Lemma 3, we obtain

$$\left\| \big( \sigma(H_{j_1^{(\ell)}}(S_s) \otimes G_{j_2^{(\ell)}}(S_t)) \big)_{p(\ell)} x - \big( \sigma(H_{j_1^{(\ell)}}(\widehat{S}_s) \otimes G_{j_2^{(\ell)}}(S_t)) \big)_{p(\ell)} x \right\| \le \ell \, \epsilon \, C D B^{\ell-1} \|x\|.$$

Multiplying by the factor $1/\sqrt{N}$ incurred by pooling gives the final result.
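Two facts the recursion above leans on can be checked directly: the absolute-value activation is non-expansive, so a single filtering-plus-activation step inflates a perturbation by at most the operator norm of the filter. A minimal numeric check, with a random matrix standing in for the true spatio-temporal wavelet:

```python
import numpy as np

rng = np.random.default_rng(1)
u, v = rng.standard_normal(50), rng.standard_normal(50)

# Non-expansiveness of sigma = |.| : || |u| - |v| || <= ||u - v||,
# since ||u_i| - |v_i|| <= |u_i - v_i| entrywise (reverse triangle inequality).
assert np.linalg.norm(np.abs(u) - np.abs(v)) <= np.linalg.norm(u - v)

# One filtering step: applying W and then sigma changes the output by at most
# ||W||_2 * ||u - v||, the elementary step iterated in the induction above.
W = rng.standard_normal((50, 50))
lhs = np.linalg.norm(np.abs(W @ u) - np.abs(W @ v))
assert lhs <= np.linalg.norm(W, 2) * np.linalg.norm(u - v) + 1e-9
```

Both inequalities hold for any inputs, which is exactly why the per-layer bound compounds into the $\ell \epsilon C D B^{\ell-1}$ factor of Lemma 4.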
Note that the upper bound in Lemma 4 holds for every path of length $\ell$. Thus the squared norm of the change in the final feature map can be bounded by summing the squared norms of the changes at each layer, which finishes the proof of Theorem 2.

C ADDITIONAL EXPERIMENTS

C.1 DATASET

The MSR Action3D dataset (Li et al., 2010) is a small dataset capturing indoor human actions. It covers 20 action types and 10 subjects, with each subject repeating each action 2 or 3 times. The dataset contains 567 action clips with at most 76 frames each; however, 10 of them are discarded because the skeleton information is either missing or too noisy (Wang et al., 2012). For each clip, the locations of 20 joints are recorded, and only one subject is present. The training and testing sets are determined by a cross-subject split, with 288 samples for training and 269 for testing.

NTU-RGB+D (Liu et al., 2019) is currently the largest dataset with 3D joint annotations for the human action recognition task. It covers 60 action types and 40 subjects. The dataset contains 56,880 action clips with at most 300 frames each, and there are 25 joints for each subject in a clip. Each clip is guaranteed to have at most 2 subjects. The cross-subject benchmark of NTU-RGB+D includes 40,320 clips for training and 16,560 for testing.

Full table of performance on MSR Action3D dataset. The table compares the performance of different algorithms with different parameter settings on the MSR Action3D dataset. Note that the triple shown after ST-GST gives the values of (Js, Jt, L). Methods labeled fixed topology are modified so as not to use adaptive training of the adjacency matrix, to make the comparison with ST-GST fair. Methods labeled learnable topology use adaptive training of the adjacency matrix, to further validate our claim. Other configurations of the compared methods are set to their defaults.
From the table we can see that ST-GST outperforms all other methods even when the graph topology can be learned by the neural networks. The intuition is that deep learning methods need a large amount of training data due to their complex structures, and can easily be trapped in bad local optima due to overfitting when the training set is limited, which is common in practice. The good performance of ST-GST in the sparse-label regime could also inspire active learning for processing spatio-temporal data (Bilgic et al., 2010).

Method                                     Accuracy (%)
GFT+TPM                                    74.0
HDM                                        81.8
ST-GCN (fixed topology)                    52.0
ST-GCN (learnable topology)                56.0
Temporal Conv. (resnet)                    67.3
Temporal Conv. (resnet-v3-gap)             69.9
Temporal Conv. (resnet-v4-gap)             72.1
MS-G3D (GCN scales=10, G3D scales=6)       80.3
MS-G3D (GCN scales=5, G3D scales=5)        81.4
MS-G3D (GCN scales=8, G3D scales=5)        82.2
Separable ST-GST (5, 5, 3)                 73.6
Separable ST-GST (5, 5, 4)                 72.9
Separable ST-GST (5, 10, 3)                81.4
Separable ST-GST (5, 15, 3)                85.9
Separable ST-GST (5, 20, 3)                87.0
Joint Kronecker ST-GST (15, 3)             61.0
Joint Cartesian ST-GST (15, 3)             59.1
Joint Strong ST-GST (15, 3)                61.7

Table 3: Full comparison of classification accuracy (MSR Action3D with 288 training and 269 testing samples).

Method                                     Accuracy (%)
Separable ST-GST (5, 5, 3)                 73.4 ± 0.8
Separable ST-GST (5, 20, 3)                86.7 ± 0.4
Joint Kronecker ST-GST (5, 3)              46.3 ± 1.2
Joint Cartesian ST-GST (5, 3)              42.2 ± 1.1
Joint Strong ST-GST (5, 3)                 45.0 ± 1.2
Joint Kronecker ST-GST (15, 3)             59.6 ± 0.5
Joint Cartesian ST-GST (15, 3)             58.6 ± 1.0
Joint Strong ST-GST (15, 3)                60.0 ± 1.0

Table 4: Performance for different methods on MSR Action3D with standard deviations.

Performance on MSR Action3D dataset with standard deviations. We repeat part of our experiments 20 times on the MSR Action3D dataset, in particular for the joint approaches, to obtain the standard deviations of the classification accuracy. The results are shown in Table 4.
Note that since ST-GST is a mathematically designed transform, its output features are the same across trials; the randomness comes from the classifier used afterwards (a random forest in this case). The standard deviations are comparable across all these methods, so the conclusion that separable ST-GST consistently outperforms joint ST-GST still holds.

Comparison between different choices of wavelets. In practice we find that using geometric scattering wavelets (Gao et al., 2019) in both the spatial and temporal domains achieves the best performance, which is what we report in the main text. The classification accuracy using other types of wavelets is shown here. All experiments in this comparison use separable ST-GST with Js = 5, Jt = 15, L = 3 on the MSR Action3D dataset. An interesting observation is that there is a significant drop in accuracy when the temporal wavelet is changed from a diffusion-based one (Geometric) to a spectrum-based one (Monic Cubic or Itersine). This may be caused by the different designs of the wavelets.

Spatial wavelet      Temporal wavelet      Accuracy (%)
Geometric            Geometric             85.9
Geometric            Monic Cubic           76.6
Geometric            Itersine              73.6
Monic Cubic          Geometric             82.9
Itersine             Geometric             82.5
Monic Cubic          Monic Cubic           80.7
Monic Cubic          Itersine              78.4
Itersine             Monic Cubic           76.2
Itersine             Itersine              80.7

Table 5: Performance for different choices of spatial and temporal wavelets (MSR Action3D) with setting (5, 15, 3).

Figure 4: Comparisons of performance under different levels of perturbations. (a) Signal perturbations. (b) Structure perturbations.

Stability of ST-GST. We also show the classification accuracy under different levels of perturbations of the spatio-temporal signals and the spatial graph structures in Fig. 4. The experiments are conducted on the MSR Action3D dataset. For signal perturbations, the signal-to-noise ratio (SNR) is defined as $10 \log \left( \|X\|_2^2 / \|\Delta X\|_2^2 \right)$, where $\Delta X$ denotes the additive noise.
For structure perturbations, $E$ is set to be a diagonal matrix whose diagonal elements satisfy the corresponding constraints on $\epsilon$. From both Fig. 4(a) and (b) we can see that ST-GST is stable and does not deviate much from the original output when the perturbations are small.
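A diagonal perturbation of this kind can be generated as follows. The helper `perturb_spatial_graph` and its sampling scheme are our own illustrative choices, constrained only by the eigenvalue conditions of Theorem 2 and the relative perturbation model of Appendix B.3:

```python
import numpy as np

def perturb_spatial_graph(S_s, eps, rng):
    """Draw a diagonal E whose eigenvalues m_1, ..., m_N satisfy
    |m_N| <= eps/2 and |m_i / m_N - 1| <= eps (Theorem 2), then apply the
    relative perturbation model  S_s_hat = S_s + E S_s + S_s E."""
    N = S_s.shape[0]
    m_N = eps / 2.0                           # largest-magnitude eigenvalue
    # Remaining eigenvalues drawn from [m_N (1 - eps), m_N], so that both
    # the magnitude ordering and the relative-distance condition hold.
    m = m_N * (1.0 - eps * rng.random(N - 1))
    E = np.diag(np.append(m, m_N))
    return S_s + E @ S_s + S_s @ E, E
```

Since $E$ is diagonal (hence symmetric) and $S_s$ is symmetric, the perturbed support $\widehat{S}_s = S_s + E S_s + S_s E$ stays symmetric, so the same wavelet constructions apply to it unchanged.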