# DEEPSPHERE: A GRAPH-BASED SPHERICAL CNN

Published as a conference paper at ICLR 2020

Michaël Defferrard, Martino Milani & Frédérick Gusset
École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
{michael.defferrard,martino.milani,frederick.gusset}@epfl.ch

Nathanaël Perraudin
Swiss Data Science Center (SDSC), Switzerland
nathanael.perraudin@sdsc.ethz.ch

ABSTRACT

Designing a convolution for a spherical neural network requires a delicate tradeoff between efficiency and rotation equivariance. DeepSphere, a method based on a graph representation of the sampled sphere, strikes a controllable balance between these two desiderata. This contribution is twofold. First, we study both theoretically and empirically how equivariance is affected by the underlying graph with respect to the number of vertices and neighbors. Second, we evaluate DeepSphere on relevant problems. Experiments show state-of-the-art performance and demonstrate the efficiency and flexibility of this formulation. Perhaps surprisingly, comparison with previous work suggests that anisotropic filters might be an unnecessary price to pay. Our code is available at https://github.com/deepsphere.

1 INTRODUCTION

Spherical data is found in many applications (figure 1). Planetary data (such as meteorological or geological measurements) and brain activity are examples of intrinsically spherical data. The observation of the universe, LIDAR scans, and the digitalization of 3D objects are examples of projections due to observation. Labels or variables often have to be inferred from them. Examples are the inference of cosmological parameters from the distribution of mass in the universe (Perraudin et al., 2019), the segmentation of omnidirectional images (Khasanova & Frossard, 2017), and the segmentation of cyclones from Earth observation (Mudigonda et al., 2017).

Figure 1: Examples of spherical data: (a) brain activity recorded through magnetoencephalography (MEG),¹ (b) the cosmic microwave background (CMB) temperature from Planck Collaboration (2016), (c) hourly precipitation from a climate simulation (Jiang et al., 2019), (d) daily maximum temperature from the Global Historical Climatology Network (GHCN).² A rigid full-sphere sampling is not ideal: brain activity is only measured on the scalp, the Milky Way's galactic plane masks observations, climate scientists desire a variable resolution, and the position of weather stations is arbitrary and changes over time. (e) Graphs can faithfully and efficiently represent sampled spherical data by placing vertices where it matters.

As neural networks (NNs) have proved to be great tools for inference, variants have been developed to handle spherical data. Exploiting the locally Euclidean property of the sphere, early attempts used standard 2D convolutions on a grid sampling of the sphere (Boomsma & Frellsen, 2017; Su & Grauman, 2017; Coors et al., 2018). While simple and efficient, those convolutions are not equivariant to rotations. On the other side of this tradeoff, Cohen et al. (2018) and Esteves et al. (2018) proposed to perform proper spherical convolutions through the spherical harmonic transform. While equivariant to rotations, those convolutions are expensive (section 2).
As a lack of equivariance can penalize performance (section 4.2) and expensive convolutions prohibit their application to some real-world problems, methods standing between these two extremes are desired. Cohen et al. (2019) proposed to reduce costs by limiting the size of the representation of the symmetry group by projecting the data from the sphere to the icosahedron. The distortions introduced by this projection might however hinder performance (section 4.3). Another approach is to represent the sampled sphere as a graph connecting pixels according to the distance between them (Bruna et al., 2013; Khasanova & Frossard, 2017; Perraudin et al., 2019). While Laplacian-based graph convolutions are more efficient than spherical convolutions, they are not exactly equivariant (Defferrard et al., 2019).

In this work, we argue that graph-based spherical CNNs strike an interesting balance, with a controllable tradeoff between cost and equivariance (which is linked to performance). Experiments on multiple problems of practical interest show the competitiveness and flexibility of this approach.

2 METHOD

DeepSphere leverages graph convolutions to achieve the following properties: (i) computational efficiency, (ii) sampling flexibility, and (iii) rotation equivariance (section 3). The main idea is to model the sampled sphere as a graph of connected pixels: the length of the shortest path between two pixels is an approximation of the geodesic distance between them. We use the graph CNN formulation introduced in (Defferrard et al., 2016) and a pooling strategy that exploits hierarchical samplings of the sphere.

Sampling. A sampling scheme V = {x_i ∈ S²}_{i=1}^n is defined to be the discrete subset of the sphere containing the n points where the values of the signals that we want to analyse are known. For a given continuous signal f, we represent such values in a vector f ∈ R^n. As there is no analogue of uniform sampling on the sphere, many samplings have been proposed with different tradeoffs. In this work, depending on the considered application, we use the equiangular (Driscoll & Healy, 1994), HEALPix (Gorski et al., 2005), and icosahedral (Baumgardner & Frederickson, 1985) samplings.

Graph. From V, we construct a weighted undirected graph G = (V, w), where the elements of V are the vertices and the weight w_ij = w_ji is a similarity measure between vertices x_i and x_j. The combinatorial graph Laplacian L ∈ R^{n×n} is defined as L = D − A, where A = (w_ij) is the weighted adjacency matrix, D = (d_ii) is the diagonal degree matrix, and d_ii = ∑_j w_ij is the weighted degree of vertex x_i. Given a sampling V, usually fixed by the application or the available measurements, the freedom in constructing G lies in setting w. Section 3 shows how to set w to minimize the equivariance error.

Convolution. On Euclidean domains, convolutions are efficiently implemented by sliding a window in the signal domain. On the sphere however, there is no straightforward way to implement a convolution in the signal domain due to non-uniform samplings. Convolutions are most often performed in the spectral domain through a spherical harmonic transform (SHT). That is the approach taken by Cohen et al. (2018) and Esteves et al. (2018), which has a computational cost of O(n^{3/2}) on isolatitude samplings (such as the HEALPix and equiangular samplings) and O(n²) in general.
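The graph construction just described can be made concrete with a few lines of SciPy. The sketch below is illustrative only (the helper names and defaults are ours, not the released DeepSphere code): it builds a k-nearest-neighbor graph with the Gaussian weights of equation (5), introduced in section 3.2, assembles the combinatorial Laplacian L = D - A, and applies the polynomial filter of equation (1) below through repeated sparse matrix-vector products.

```python
# Illustrative sketch (not the reference implementation): k-NN graph with
# Gaussian weights, combinatorial Laplacian L = D - A, and polynomial filtering.
import numpy as np
from scipy import sparse
from scipy.spatial import cKDTree

def build_laplacian(vertices, k=8, t=None):
    """vertices: (n, 3) unit vectors on the sphere; k: neighbors; t: kernel width."""
    n = len(vertices)
    distances, neighbors = cKDTree(vertices).query(vertices, k=k + 1)
    distances, neighbors = distances[:, 1:], neighbors[:, 1:]   # drop self-matches
    if t is None:
        t = np.mean(distances**2) / 2            # heuristic of Perraudin et al. (2019)
    weights = np.exp(-distances.ravel()**2 / (4 * t))            # weighting scheme (5)
    rows = np.repeat(np.arange(n), k)
    A = sparse.csr_matrix((weights, (rows, neighbors.ravel())), shape=(n, n))
    A = A.maximum(A.T)                            # symmetrize the k-NN relation
    D = sparse.diags(np.asarray(A.sum(axis=1)).squeeze())
    return D - A

def graph_conv(L, f, alpha):
    """Polynomial filter (1): sum_i alpha_i L^i f, computed with P sparse mat-vecs."""
    out, Lif = alpha[0] * f, f
    for a in alpha[1:]:
        Lif = L @ Lif                             # recursive application of L
        out = out + a * Lif
    return out
```

Filtering a signal then only involves sparse matrix-vector products with the Laplacian, which is the source of the O(n) cost of the graph convolution discussed next.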
¹ https://martinos.org/mne/stable/auto_tutorials/plot_visualize_evoked.html
² https://www.ncdc.noaa.gov/ghcn-daily-description

On the other hand, following Defferrard et al. (2016), graph convolutions can be defined as

h(L) f = ( ∑_{i=0}^{P} α_i L^i ) f,    (1)

where P is the polynomial order (which corresponds to the filter's size) and the α_i are the coefficients to be optimized during training.³ Those convolutions are used by Khasanova & Frossard (2017) and Perraudin et al. (2019) and cost O(n) operations through a recursive application of L.⁴

Pooling. Down- and up-sampling is natural for hierarchical samplings,⁵ where each subdivision divides a pixel into (an equal number of) child sub-pixels. To pool (down-sample), the data supported on the sub-pixels is summarized by a permutation-invariant function such as the maximum or the average. To unpool (up-sample), the data supported on a pixel is copied to all its sub-pixels.

Architecture. All our NNs are fully convolutional, and employ a global average pooling (GAP) for rotation-invariant tasks. Graph convolutional layers are always followed by batch normalization and ReLU activation, except in the last layer. Note that batch normalization and activation act on the elements of f independently, and hence don't depend on the domain of f.

3 GRAPH CONVOLUTION AND EQUIVARIANCE

While the graph framework offers great flexibility, its ability to faithfully represent the underlying sphere, so that graph convolutions are rotation equivariant, highly depends on the sampling locations and the graph construction.

3.1 PROBLEM FORMULATION

A continuous function f ∈ F_V ⊂ C(S²) is sampled as T_V(f) = f by the sampling operator T_V : F_V → R^n defined as (T_V f)_i = f(x_i). We require F_V to be a suitable subspace of continuous functions such that T_V is invertible, i.e., the function f ∈ F_V can be unambiguously reconstructed from its sampled values f. The existence of such a subspace depends on the sampling V, and its characterization is a common problem in signal processing (Driscoll & Healy, 1994). For most samplings, it is not known if F_V exists and hence if T_V is invertible. A special case is the equiangular sampling, where a sampling theorem holds and thus a closed form of T_V^{-1} is known. For samplings where no such sampling formula is available, we leverage the discrete SHT to reconstruct f from f = T_V f, thus approximating T_V^{-1}. For all theoretical considerations, we assume that F_V exists and f ∈ F_V.

By definition, the (spherical) graph convolution is rotation equivariant if and only if it commutes with the rotation operator R(g), g ∈ SO(3), defined as R(g)f(x) = f(g^{-1}x). In the context of this work, graph convolution is performed by recursive applications of the graph Laplacian (1). Hence, if R(g) commutes with L, then, by recursion, it will also commute with the convolution h(L). As a result, h(L) is rotation equivariant if and only if

R_V(g) L f = L R_V(g) f,  for all f ∈ F_V and g ∈ SO(3),

where R_V(g) = T_V R(g) T_V^{-1}. For an empirical evaluation of equivariance, we define the normalized equivariance error for a signal f and a rotation g as

E_L(f, g) = ‖R_V(g) L f − L R_V(g) f‖ / ‖L f‖.    (2)

More generally, for a class of signals C ⊂ F_V, the mean equivariance error

E_{L,C} = E_{f∈C, g∈SO(3)} [ E_L(f, g) ]    (3)

represents the overall equivariance error. The expected value is obtained by averaging over a finite number of random functions and random rotations.

³ In practice, training with Chebyshev polynomials (instead of monomials) is slightly more stable.
We believe it to be due to their orthogonality and uniformity.
⁴ As long as the graph is sparsified such that the number of edges, i.e., the number of non-zeros in A, is proportional to the number of vertices n. This can always be done as most weights are very small.
⁵ The equiangular, HEALPix, and icosahedral samplings are of this kind.

Figure 2: Mean equivariance error (3) as a function of the spherical harmonic degree ℓ, for the graphs of Khasanova & Frossard (k = 4), Perraudin et al. (k = 8), and k-NN graphs with k = 8, 20, and 40 neighbors. There is a clear tradeoff between equivariance and computational cost, governed by the number of vertices n and edges kn.

Figure 3: Kernel widths t as a function of the number of pixels (10⁴ to 10⁷), for k-NN graphs with k = 8, 20, 40, 60 and the heuristic of Perraudin et al. (k = 8).

Figure 4: 3D object represented as a spherical depth map.

Figure 5: Power spectral densities, per spherical harmonic degree, of the SHREC'17 (depth and normal), cosmology (convergence map), and climate (16 variables) data.

3.2 FINDING THE OPTIMAL WEIGHTING SCHEME

Considering the equiangular sampling and graphs where each vertex is connected to 4 neighbors (north, south, east, west), Khasanova & Frossard (2017) designed a weighting scheme to minimize (3) for longitudinal and latitudinal rotations.⁶ Their solution gives weights inversely proportional to Euclidean distances:

w_ij = 1 / ‖x_i − x_j‖.    (4)

While the resulting convolution is not equivariant to the whole of SO(3) (figure 2), it is enough for omnidirectional imaging because, as gravity consistently orients the sphere, objects only rotate longitudinally or latitudinally.

To achieve equivariance to all rotations, we take inspiration from Belkin & Niyogi (2008). They prove that, for a random uniform sampling, the graph Laplacian L built from the weights

w_ij = exp( −‖x_i − x_j‖² / (4t) )    (5)

converges to the Laplace-Beltrami operator Δ_{S²} as the number of samples grows to infinity. This result is a good starting point as Δ_{S²} commutes with rotation, i.e., Δ_{S²} R(g) = R(g) Δ_{S²}. While the weighting scheme is full (i.e., every vertex is connected to every other vertex), most weights are small due to the exponential. We hence make an approximation to limit the cost of the convolution (1) by only considering the k nearest neighbors (k-NN) of each vertex. Given k, the optimal kernel width t is found by searching for the minimizer of (3). Figure 3 shows the optimal kernel widths found for various resolutions of the HEALPix sampling. As predicted by the theory, t_n ∝ n^β, β ∈ R. Importantly however, the optimal t also depends on the number of neighbors k.

Considering the HEALPix sampling, Perraudin et al. (2019) connected each vertex to its 8 adjacent vertices in the tiling of the sphere, computed the weights with (5), and heuristically set t to half the average squared Euclidean distance between connected vertices. This heuristic however overestimates t (figure 3) and leads to an increased equivariance error (figure 2).

⁶ Equivariance to longitudinal rotation is essentially given by the equiangular sampling.
3.3 ANALYSIS OF THE PROPOSED WEIGHTING SCHEME

We analyze the proposed weighting scheme both theoretically and empirically.

Figure 6: Patch.

Theoretical convergence. We extend the work of Belkin & Niyogi (2008) to a sufficiently regular, deterministic sampling. Following their setting, we work with the extended graph Laplacian operator, the linear operator L^t_n : L²(S²) → L²(S²) such that

L^t_n f(y) := (1/n) ∑_{i=1}^{n} e^{−‖x_i − y‖²/(4t)} ( f(y) − f(x_i) ).    (6)

This operator extends the graph Laplacian with the weighting scheme (5) to each point of the sphere (i.e., sampling L^t_n f on V recovers the action of the graph Laplacian on f). As the radius of the kernel t will be adapted to the number of samples, we scale the operator as

L̂^t_n := |S²| (4πt²)^{−1} L^t_n.

Given a sampling V, we define σ_i to be the patch of the surface of the sphere corresponding to x_i, A_i its corresponding area, and d_i the largest distance between the center x_i and any point on the surface σ_i. Define d(n) := max_{i=1,...,n} d_i and A(n) := max_{i=1,...,n} A_i.

Theorem 3.1. For a sampling V of the sphere that is equi-area and such that d(n) ≤ C/n^α, α ∈ (0, 1/2], for all f : S² → R Lipschitz with respect to the Euclidean distance in R³ and for all y ∈ S², there exists a sequence t_n = n^β, β ∈ R, such that

lim_{n→∞} L̂^{t_n}_n f(y) = Δ_{S²} f(y).

This is a major step towards equivariance, as the Laplace-Beltrami operator commutes with rotation. Based on this property, we show the equivariance of the scaled extended graph Laplacian.

Theorem 3.2. Under the hypotheses of Theorem 3.1, the scaled graph Laplacian commutes with any rotation in the limit of infinite sampling, i.e., for all y ∈ S²,

| R(g) L̂^{t_n}_n f(y) − L̂^{t_n}_n R(g) f(y) | → 0 as n → ∞.

From this theorem, it follows that the discrete graph Laplacian will be equivariant in the limit of n → ∞, as by construction it is the sampled version of L^t_n and as the scaling does not affect the equivariance property of L^t_n. Importantly, the proof of Theorem 3.1 (in Appendix A) inspires our construction of the graph Laplacian. In particular, it tells us that t should scale as n^β, which has been verified empirically (figure 3).

Nevertheless, it is important to keep in mind the limits of Theorems 3.1 and 3.2. Both theorems present asymptotic results, but in practice we will always work with finite samplings. Furthermore, since this method is based on the capability of the eigenvectors of the graph Laplacian to approximate the spherical harmonics, a stronger type of convergence of the graph Laplacian would be preferable, i.e., spectral convergence (which is proved for a full graph in the case of random sampling for a class of Lipschitz functions in Belkin & Niyogi (2007)). Finally, while we do not have a formal proof of it, we strongly believe that the HEALPix sampling does satisfy the hypothesis d(n) ≤ C/n^α, α ∈ (0, 1/2], with α very close or equal to 1/2. The empirical results discussed in the next paragraph also point in this direction. This is further discussed in Appendix A.

Empirical convergence. Figure 2 shows the equivariance error (3) for different parameter sets of DeepSphere for the HEALPix sampling, as well as for the graph construction of Khasanova & Frossard (2017) for the equiangular sampling. The error is estimated as a function of the sampling resolution and signal frequency. The resolution is controlled by the number of pixels, n = 12 N²_side for HEALPix and n = 4b² for the equiangular sampling. The frequency is controlled by setting the class C to functions f made of spherical harmonics of a single degree ℓ. To allow for an almost perfect implementation (up to numerical errors) of the operator R_V, the degree ℓ was chosen in the range (0, 3N_side − 1) for HEALPix and (0, b) for the equiangular sampling (Gorski et al., 1999). Using these parameters, the measured error is mostly due to imperfections in the empirical approximation of the Laplace-Beltrami operator and not to the sampling.
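The convergence discussed in this section can also be probed numerically. The sketch below is an illustrative check, assuming the healpy package and the `build_laplacian` helper sketched in section 2 (neither is the paper's own evaluation code): if the graph Laplacian approximates the Laplace-Beltrami operator well, its smallest eigenvalues should cluster into near-degenerate groups of size 2ℓ + 1, mirroring the multiplicities of the spherical harmonics of degree ℓ.

```python
# Illustrative check: the low end of the graph-Laplacian spectrum should form
# multiplets of size 2l+1 (1, 3, 5, 7, ...) when the graph approximates the
# Laplace-Beltrami operator of the sphere well. Assumes healpy is installed and
# reuses the `build_laplacian` sketch from section 2 (an assumption, not the
# released code).
import healpy as hp
import numpy as np
from scipy.sparse.linalg import eigsh

nside = 16
vertices = np.stack(hp.pix2vec(nside, np.arange(hp.nside2npix(nside))), axis=1)
L = build_laplacian(vertices, k=20)
eigenvalues = eigsh(L, k=16, which='SM', return_eigenvectors=False)
print(np.sort(eigenvalues))  # expect groups of 1, 3, 5, and 7 similar values
```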
Figure 2 shows that the weighting scheme (4) from Khasanova & Frossard (2017) does indeed not lead to a convolution that is equivariant to all rotations g ∈ SO(3).⁷ For k = 8 neighbors, selecting the optimal kernel width t improves on Perraudin et al. (2019) at no cost, highlighting the importance of this parameter. Increasing the resolution decreases the equivariance error in the high frequencies, an effect most probably due to the sampling. Most importantly, the equivariance error decreases when connecting more neighbors. Hence, the number of neighbors k gives us precise control of the tradeoff between cost and equivariance.

| | performance | | size | speed | |
|---|---|---|---|---|---|
| | F1 | mAP | params | inference | training |
| Cohen et al. (2018) (b = 128) | - | 67.6 | 1400 k | 38.0 ms | 50 h |
| Cohen et al. (2018) (simplified,⁹ b = 64) | 78.9 | 66.5 | 400 k | 12.0 ms | 32 h |
| Esteves et al. (2018) (b = 64) | 79.4 | 68.5 | 500 k | 9.8 ms | 3 h |
| DeepSphere (equiangular, b = 64) | 79.4 | 66.5 | 190 k | 0.9 ms | 50 m |
| DeepSphere (HEALPix, Nside = 32) | 80.7 | 68.6 | 190 k | 0.9 ms | 50 m |

Table 1: Results on SHREC'17 (3D shapes). DeepSphere achieves similar performance at a much lower cost, suggesting that anisotropic filters are an unnecessary price to pay.

4 EXPERIMENTS

4.1 3D OBJECTS RECOGNITION

The recognition of 3D shapes is a rotation-invariant task: rotating an object doesn't change its nature. While 3D shapes are usually represented as meshes or point clouds, representing them as spherical maps (figure 4) naturally allows a rotation-invariant treatment. The SHREC'17 shape retrieval contest (Savva et al., 2017) contains 51,300 randomly oriented 3D models from ShapeNet (Chang et al., 2015), to be classified in 55 categories (tables, lamps, airplanes, etc.).

As in Cohen et al. (2018), objects are represented by 6 spherical maps. At each pixel, a ray is traced towards the center of the sphere. The distance from the sphere to the object forms a depth map. The cos and sin of the surface angle form two normal maps. The same is done for the object's convex hull.⁸ The maps are sampled by an equiangular sampling with bandwidth b = 64 (n = 4b² = 16,384 pixels) or a HEALPix sampling with N_side = 32 (n = 12N²_side = 12,288 pixels). The equiangular graph is built with (4) and k = 4 neighbors (following Khasanova & Frossard, 2017). The HEALPix graph is built with (5), k = 8, and a kernel width t set to the average of the distances (following Perraudin et al., 2019).

The NN is made of 5 graph convolutional layers, each followed by a max pooling layer which down-samples by 4. A GAP and a fully connected layer with softmax follow. The polynomials are all of order P = 3 and the number of channels per layer is 16, 32, 64, 128, 256, respectively. Following Esteves et al. (2018), the cross-entropy plus a triplet loss is optimized with Adam for 30 epochs on the dataset augmented by 3 random translations. The learning rate is 5 × 10⁻² and the batch size is 32.

Results are shown in table 1. As the network is trained for shape classification rather than retrieval, we report the classification F1 alongside the mAP used in the retrieval contest.¹⁰ DeepSphere achieves the same performance as Cohen et al. (2018) and Esteves et al. (2018) at a much lower cost, suggesting that anisotropic filters are an unnecessary price to pay. As the information in those spherical maps resides in the low frequencies (figure 5), reducing the equivariance error didn't translate into improved performance. For the same reason, using the more uniform HEALPix sampling or lowering the resolution down to N_side = 8 (n = 768 pixels) didn't impact performance either.
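The down-sampling by 4 in the architecture above relies on the hierarchical structure of the sampling (section 2). For HEALPix maps in NESTED ordering, the four sub-pixels of a coarser pixel are stored contiguously, so pooling amounts to a reshape followed by a reduction. A minimal sketch, assuming NumPy arrays whose last axis indexes NESTED-ordered pixels:

```python
# Minimal sketch of hierarchical pooling on HEALPix maps in NESTED ordering,
# where the 4 sub-pixels of each coarser pixel occupy consecutive indices.
import numpy as np

def healpix_pool(maps, reduce=np.max):
    """Down-sample maps of shape (..., n_pixels) by a factor of 4."""
    return reduce(maps.reshape(*maps.shape[:-1], -1, 4), axis=-1)

def healpix_unpool(maps):
    """Up-sample by copying each pixel value to its 4 sub-pixels."""
    return np.repeat(maps, 4, axis=-1)
```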
⁷ We however verified that the convolution is equivariant to longitudinal and latitudinal rotations, as intended.
⁸ Albeit we didn't observe much improvement by using the convex hull.
⁹ As implemented in https://github.com/jonas-koehler/s2cnn.
¹⁰ We omit the F1 for Cohen et al. (2018) as we didn't get the mAP reported in the paper when running it.

| | accuracy | time |
|---|---|---|
| Perraudin et al. (2019), 2D CNN baseline | 54.2 | 104 ms |
| Perraudin et al. (2019), CNN variant, k = 8 | 62.1 | 185 ms |
| Perraudin et al. (2019), FCN variant, k = 8 | 83.8 | 185 ms |
| k = 8 neighbors, t from section 3.2 | 87.1 | 185 ms |
| k = 20 neighbors, t from section 3.2 | 91.3 | 250 ms |
| k = 40 neighbors, t from section 3.2 | 92.5 | 363 ms |

Table 2: Results on the classification of partial convergence maps. Lower equivariance error translates to higher performance.

Figure 7: Tradeoff between cost and accuracy (inference time in ms versus accuracy in %).

4.2 COSMOLOGICAL MODEL CLASSIFICATION

Given observations, cosmologists estimate the posterior probability of cosmological parameters, such as the matter density Ω_m and the normalization of the matter power spectrum σ_8. Those parameters are typically estimated by likelihood-free inference, which requires a function to predict the parameters from simulations. As that is complicated to set up, prediction methods are typically benchmarked on the classification of spherical maps instead (Schmelzle et al., 2017). We used the same task, data, and setup as Perraudin et al. (2019): the classification of 720 partial convergence maps made of n ≈ 10⁶ pixels (1/12 ≈ 8% of a sphere at N_side = 1024) from two ΛCDM cosmological models, (Ω_m = 0.31, σ_8 = 0.82) and (Ω_m = 0.26, σ_8 = 0.91), at a relative noise level of 3.5 (i.e., the signal is hidden in noise of 3.5 times higher standard deviation). Convergence maps represent the distribution of over- and under-densities of mass in the universe (see Bartelmann, 2010, for a review of gravitational lensing).

Graphs are built with (5), k = 8, 20, 40 neighbors, and the corresponding optimal kernel widths t given in section 3.2. Following Perraudin et al. (2019), the NN is made of 5 graph convolutional layers, each followed by a max pooling layer which down-samples by 4. A GAP and a fully connected layer with softmax follow. The polynomials are all of order P = 4 and the number of channels per layer is 16, 32, 64, 64, 64, respectively. The cross-entropy loss is optimized with Adam for 80 epochs. The learning rate is 2 × 10⁻⁴ × 0.999^step and the batch size is 8.

Unlike on SHREC'17, results (table 2) show that a lower equivariance error on the convolutions translates to higher performance. That is probably due to the high-frequency content of those maps (figure 5). There is a clear cost-accuracy tradeoff, controlled by the number of neighbors k (figure 7). This experiment moreover demonstrates DeepSphere's flexibility (using partial spherical maps) and scalability (competing spherical CNNs were tested on maps of at most 10,000 pixels).

4.3 CLIMATE EVENT SEGMENTATION

We evaluate our method on a task proposed by Mudigonda et al. (2017): the segmentation of extreme climate events, tropical cyclones (TC) and atmospheric rivers (AR), in global climate simulations (figure 1c).
The data was produced by a 20-year run of the Community Atmospheric Model v5 (CAM5) and consists of 16 channels such as temperature, wind, humidity, and pressure at multiple altitudes. We used the pre-processed dataset from Jiang et al. (2019).¹¹ There are 1,072,805 spherical maps, down-sampled to a level-5 icosahedral sampling (n = 10 · 4^l + 2 = 10,242 pixels). The labels are heavily unbalanced, with 0.1% TC, 2.2% AR, and 97.7% background (BG) pixels.

The graph is built with (5), k = 6 neighbors, and a kernel width t set to the average of the distances. Following Jiang et al. (2019), the NN is an encoder-decoder with skip connections (details in section C.3). The polynomials are all of order P = 3. The cross-entropy loss (weighted or non-weighted) is optimized with Adam for 30 epochs. The learning rate is 1 × 10⁻³ and the batch size is 64.

Results are shown in table 3 (details in tables 6, 7 and 8). The mean and standard deviation are computed over 5 runs. Note that while Jiang et al. (2019) and Cohen et al. (2019) use a weighted cross-entropy loss, that is a suboptimal proxy for the mAP metric. DeepSphere achieves state-of-the-art performance, suggesting again that anisotropic filters are unnecessary. Note that results from Mudigonda et al. (2017) cannot be directly compared as they don't use the same input channels.

¹¹ Available at http://island.me.berkeley.edu/ugscnn/data.

| | accuracy | mAP |
|---|---|---|
| Jiang et al. (2019) (rerun) | 94.95 | 38.41 |
| Cohen et al. (2019) (S2R) | 97.5 | 68.6 |
| Cohen et al. (2019) (R2R) | 97.7 | 75.9 |
| DeepSphere (weighted loss) | 97.8 ± 0.3 | 77.15 ± 1.94 |
| DeepSphere (non-weighted loss) | 87.8 ± 0.5 | 89.16 ± 1.37 |

Table 3: Results on climate event segmentation: mean accuracy (over TC, AR, BG) and mean average precision (over TC and AR). DeepSphere achieves state-of-the-art performance.

| | temp. (from past temp.) | | | day (from temperature) | | | day (from precipitations) | | |
|---|---|---|---|---|---|---|---|---|---|
| order P | MSE | MAE | R² | MSE | MAE | R² | MSE | MAE | R² |
| 0 | 10.88 | 2.42 | 0.896 | 0.10 | 0.10 | 0.882 | 0.58 | 0.42 | 0.980 |
| 4 | 8.20 | 2.11 | 0.919 | 0.05 | 0.05 | 0.969 | 0.50 | 0.18 | 0.597 |

Table 4: Prediction results on data from weather stations. Structure always improves performance.

Compared to Cohen et al. (2019)'s conclusion, it is surprising that S2R does worse than DeepSphere (which is limited to S2S). Potential explanations are (i) that their icosahedral projection introduces harmful distortions, or (ii) that a larger architecture can compensate for the lack of generality. We indeed observed that more feature maps and depth led to higher performance (section C.3).

4.4 UNEVEN SAMPLING

To demonstrate the flexibility of modeling the sampled sphere by a graph, we collected historical measurements from n ≈ 10,000 weather stations scattered across the Earth.¹² The spherical data is heavily non-uniformly sampled, with a much higher density of weather stations over North America than the Pacific (figure 1d). For illustration, we devised two artificial tasks. A dense regression: predict the temperature on a given day knowing the temperature on the previous 5 days. A global regression: predict the day (represented as one period of a sine over the year) from temperature or precipitations. Predicting from temperature is much easier as it has a clear yearly pattern.

The graph is built with (5), k = 5 neighbors, and a kernel width t set to the average of the distances. The equivariance property of the resulting graph has not been tested, and we don't expect it to be good due to the heavily non-uniform sampling.
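The graph construction above needs nothing more than the station coordinates. A short sketch, assuming latitudes and longitudes in degrees and reusing the illustrative `build_laplacian` helper from section 2 (the variable names are hypothetical, not from the released code):

```python
# Illustrative sketch: graph Laplacian of irregularly placed weather stations,
# built directly from their coordinates (no interpolation to a regular sampling).
import numpy as np

def stations_to_vertices(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to unit vectors in R^3."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=1)

# station_lat and station_lon are assumed inputs (one entry per station).
vertices = stations_to_vertices(station_lat, station_lon)
L = build_laplacian(vertices, k=5)  # k = 5 neighbors, as in this experiment
```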
The NN is made of 3 graph convolutional layers. The polynomials are all of order P = 0 or 4 and the number of channels per layer is 50, 100, 100, respectively. For the global regression, a GAP and a fully connected layer follow. For the dense regression, a graph convolutional layer follows instead. The MSE loss is optimized with RMSprop for 250 epochs. The learning rate is 1 × 10⁻³ and the batch size is 64.

Results are shown in table 4. While using a polynomial order P = 0 is like modeling each time series independently with an MLP, orders P > 0 integrate neighborhood information. Results show that using the structure induced by the spherical geometry always yields better performance.

¹² https://www.ncdc.noaa.gov/ghcn-daily-description

5 CONCLUSION

This work showed that DeepSphere strikes an interesting, and we think currently optimal, balance between desiderata for a spherical CNN. A single parameter, the number of neighbors k a pixel is connected to in the graph, controls the tradeoff between cost and equivariance (which is linked to performance). As computational cost and memory consumption scale linearly with the number of pixels, DeepSphere scales to spherical maps made of millions of pixels, a resolution required to faithfully represent cosmological and climate data. Also relevant in scientific applications is the flexibility offered by a graph representation (for partial coverage, missing data, and non-uniform samplings). Finally, the implementation of the graph convolution is straightforward, and the ubiquity of graph neural networks, pushing for their first-class support in DL frameworks, will make implementations even easier and more efficient.

A potential drawback of graph-Laplacian-based approaches is the isotropy of graph filters, reducing in principle the expressive power of the NN. Experiments from Cohen et al. (2019) and Boscaini et al. (2016) indeed suggest that more general convolutions achieve better performance. Our experiments on 3D shapes (section 4.1) and climate (section 4.3) however show that DeepSphere's isotropic filters do not hinder performance. Possible explanations for this discrepancy are that NNs somehow compensate for the lack of anisotropic filters, or that some tasks can be solved with isotropic filters. The distortions induced by the icosahedral projection in (Cohen et al., 2019) or the leakage of curvature information in (Boscaini et al., 2016) might also alter performance.

Developing graph convolutions on irregular samplings that respect the geometry of the sphere is another research direction of importance. Practitioners currently interpolate their measurements (coming from arbitrarily positioned weather stations, satellites or telescopes) to regular samplings. This practice results in a waste of either resolution or computational and storage resources. Our ultimate goal is for practitioners to be able to work directly on their measurements, however distributed.

ACKNOWLEDGMENTS

We thank Pierre Vandergheynst for advice, and Taco Cohen for his inputs on the intriguing results of our comparison with Cohen et al. (2019). We thank the anonymous reviewers for their constructive feedback. The following software packages were used for computation and plotting: PyGSP (Defferrard et al.), healpy (Zonca et al., 2019), matplotlib (Hunter, 2007), SciPy (Virtanen et al., 2020), NumPy (Walt et al., 2011), TensorFlow (Abadi et al., 2015).

REFERENCES

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S.
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

M. Bartelmann. Gravitational lensing. Classical and Quantum Gravity, 2010.

John R. Baumgardner and Paul O. Frederickson. Icosahedral discretization of the two-sphere. SIAM Journal on Numerical Analysis, 1985.

Mikhail Belkin and Partha Niyogi. Convergence of Laplacian eigenmaps. In Advances in Neural Information Processing Systems, 2007.

Mikhail Belkin and Partha Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. Journal of Computer and System Sciences, 2008.

Wouter Boomsma and Jes Frellsen. Spherical convolutions and their application in molecular modelling. In Advances in Neural Information Processing Systems, 2017.

Davide Boscaini, Jonathan Masci, Emanuele Rodolà, and Michael Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In Advances in Neural Information Processing Systems, 2016.

Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv:1312.6203, 2013. URL https://arxiv.org/abs/1312.6203.

Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv:1512.03012, 2015.

Taco S. Cohen, Mario Geiger, Jonas Koehler, and Max Welling. Spherical CNNs. In International Conference on Learning Representations (ICLR), 2018. URL https://arxiv.org/abs/1801.10130.

Taco S. Cohen, Maurice Weiler, Berkay Kicanaoglu, and Max Welling. Gauge equivariant convolutional networks and the icosahedral CNN. In International Conference on Machine Learning (ICML), 2019. URL http://arxiv.org/abs/1902.04615.

Benjamin Coors, Alexandru Paul Condurache, and Andreas Geiger. SphereNet: Learning spherical representations for detection and classification in omnidirectional images. In European Conference on Computer Vision, 2018.

Michaël Defferrard, Lionel Martin, Rodrigo Pena, and Nathanaël Perraudin. PyGSP: Graph signal processing in Python. URL https://github.com/epfl-lts2/pygsp/.

Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, 2016. URL https://arxiv.org/abs/1606.09375.

Michaël Defferrard, Nathanaël Perraudin, Tomasz Kacprzak, and Raphael Sgier. DeepSphere: towards an equivariant graph-based spherical CNN. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019. URL https://arxiv.org/abs/1904.05146.

J. R. Driscoll and D. M. Healy. Computing Fourier transforms and convolutions on the 2-sphere. Adv. Appl. Math., 1994. URL http://dx.doi.org/10.1006/aama.1994.1008.

Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis.
Learning SO(3) equivariant representations with spherical CNNs. In Proceedings of the European Conference on Computer Vision (ECCV), 2018. URL https://arxiv.org/abs/1711.06721.

Krzysztof M. Gorski, Benjamin D. Wandelt, Frode K. Hansen, Eric Hivon, and Anthony J. Banday. The HEALPix primer. arXiv preprint astro-ph/9905275, 1999.

Krzysztof M. Gorski, Eric Hivon, A. J. Banday, Benjamin D. Wandelt, Frode K. Hansen, M. Reinecke, and Matthias Bartelmann. HEALPix: a framework for high-resolution discretization and fast analysis of data distributed on the sphere. The Astrophysical Journal, 2005.

J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90-95, 2007. doi: 10.1109/MCSE.2007.55.

Chiyu Max Jiang, Jingwei Huang, Karthik Kashinath, Prabhat, Philip Marcus, and Matthias Niessner. Spherical CNNs on unstructured grids. In International Conference on Learning Representations (ICLR), 2019. URL https://arxiv.org/abs/1901.02039.

Renata Khasanova and Pascal Frossard. Graph-based classification of omnidirectional images. In Proceedings of the IEEE International Conference on Computer Vision, 2017. URL https://arxiv.org/abs/1707.08301.

Mayur Mudigonda, Sookyung Kim, Ankur Mahesh, Samira Kahou, Karthik Kashinath, Dean Williams, Vincent Michalski, Travis O'Brien, and Prabhat. Segmenting and tracking extreme climate events using neural networks. In Deep Learning for Physical Sciences (DLPS) Workshop, held with NIPS Conference, 2017. URL https://dl4physicalsciences.github.io/files/nips_dlps_2017_20.pdf.

Nathanaël Perraudin, Michaël Defferrard, Tomasz Kacprzak, and Raphael Sgier. DeepSphere: Efficient spherical convolutional neural network with HEALPix sampling for cosmological applications. Astronomy and Computing, 2019. URL https://arxiv.org/abs/1810.12186.

Planck Collaboration. Planck 2015 results. I. Overview of products and scientific results. Astronomy & Astrophysics, 2016.

Manolis Savva, Fisher Yu, Hao Su, Asako Kanezaki, Takahiko Furuya, Ryutarou Ohbuchi, Zhichao Zhou, Rui Yu, Song Bai, Xiang Bai, et al. Large-scale 3D shape retrieval from ShapeNet Core55: SHREC'17 track. In Eurographics Workshop on 3D Object Retrieval, 2017.

J. Schmelzle, A. Lucchi, T. Kacprzak, A. Amara, R. Sgier, A. Réfrégier, and T. Hofmann. Cosmological model discrimination with deep learning. arXiv:1707.05167, 2017.

Yu-Chuan Su and Kristen Grauman. Learning spherical convolution for fast features from 360° imagery. In Advances in Neural Information Processing Systems, 2017.

Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, Ilhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 2020. doi: 10.1038/s41592-019-0686-2.

Stéfan van der Walt, S. Chris Colbert, and Gaël Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22-30, 2011.
Andrea Zonca, Leo Singer, Daniel Lenz, Martin Reinecke, Cyrille Rosset, Eric Hivon, and Krzysztof Gorski. healpy: equal area pixelization and spherical harmonics transforms for data on the sphere in Python. Journal of Open Source Software, 4(35):1298, March 2019. doi: 10.21105/joss.01298. URL https://doi.org/10.21105/joss.01298.

SUPPLEMENTARY MATERIAL

A PROOF OF THEOREM 3.1

Preliminaries. The proof of Theorem 3.1 is inspired from the work of Belkin & Niyogi (2008). As a result, we start by restating some of their results. Consider a sampling V = {x_i ∈ M}_{i=1}^n of a closed, compact and infinitely differentiable manifold M, a smooth (C^∞(M)) function f : M → R, and the vector f of samples of f defined as T_V f = f ∈ R^n, f_i = f(x_i). The proof is constructed by leveraging 3 different operators:

- The extended graph Laplacian operator, already presented in (6), is a linear operator L^t_n : L²(M) → L²(M) defined as

  L^t_n f(y) := (1/n) ∑_{i=1}^{n} e^{−‖x_i − y‖²/(4t)} ( f(y) − f(x_i) ).    (7)

  Note that sampling L^t_n f at the vertices V recovers the action of the graph Laplacian on f.

- The functional approximation to the Laplace-Beltrami operator is a linear operator L^t : L²(M) → L²(M) defined as

  L^t f(y) := ∫_M e^{−‖x − y‖²/(4t)} ( f(y) − f(x) ) dμ(x),    (8)

  where μ is the uniform probability measure on the manifold M, and vol(M) is the volume of M.

- The Laplace-Beltrami operator Δ_M is defined as the divergence of the gradient,

  Δ_M f(y) := div(∇_M f)(y),    (9)

  of a differentiable function f : M → R. The gradient ∇_M f : M → T_p M is a vector field defined on the manifold pointing towards the direction of steepest ascent of f, where T_p M is the affine space of all vectors tangent to M at p.

Leveraging these three operators, Belkin & Niyogi (2008; 2007) have built proofs of both pointwise and spectral convergence of the extended graph Laplacian towards the Laplace-Beltrami operator in the general setting of any compact, closed and infinitely differentiable manifold M, where the sampling V is drawn randomly on the manifold. For this reason, their results are all to be interpreted in a probabilistic sense. Their proofs consist in establishing that (6) converges in probability towards (8) as n → ∞ and that (8) converges towards (9) as t → 0. In particular, this second step is given by the following:

Proposition 1 (Belkin & Niyogi (2008), Proposition 4.4). Let M be a k-dimensional compact smooth manifold embedded in some Euclidean space R^N, and fix y ∈ M. Let f ∈ C^∞(M). Then

(1/t) (4πt)^{−k/2} L^t f(y) → (1/vol(M)) Δ_M f(y)  as t → 0.    (10)

Building the proof. As the sphere is a compact smooth manifold embedded in R³, we can reuse Proposition 1. Thus, our strategy to prove Theorem 3.1 is to (i) show that

lim_{n→∞} L^t_n f(y) = L^t f(y)    (11)

for a particular class of deterministic samplings, and (ii) apply Proposition 1. We start by proving that, for smooth functions and any fixed t, the extended graph Laplacian L^t_n converges towards its continuous counterpart L^t as the sampling increases in size.

Proposition 2. For an equal-area sampling {x_i ∈ S²}_{i=1}^n (A_i = A_j for all i, j) of the sphere, it is true that, for all f : S² → R Lipschitz with respect to the Euclidean distance with Lipschitz constant C_f,

| ∫_{S²} f(x) dμ(x) − (1/n) ∑_{i=1}^{n} f(x_i) | ≤ C_f d(n).

Furthermore, for all y ∈ S², the heat kernel graph Laplacian operator L^t_n converges pointwise to the functional approximation of the Laplace-Beltrami operator L^t:

L^t_n f(y) → L^t f(y)  as n → ∞.

Proof. Assuming f : S² → R is Lipschitz with Lipschitz constant C_f, we have

| ∫_{σ_i} f(x) dμ(x) − (1/n) f(x_i) | ≤ C_f d(n) (1/n),

where σ_i ⊂ S² is the subset of the sphere corresponding to the patch around x_i.
Remember that the sampling is equal-area, so that μ(σ_i) = 1/n. Hence, using the triangular inequality and summing the contributions of the n patches, we obtain

| ∫_{S²} f(x) dμ(x) − (1/n) ∑_{i=1}^{n} f(x_i) | ≤ ∑_{i=1}^{n} | ∫_{σ_i} f(x) dμ(x) − (1/n) f(x_i) | ≤ n C_f d(n) (1/n) = C_f d(n).

A direct application of this result leads to the following pointwise convergences: for f Lipschitz and y ∈ S²,

(1/n) ∑_{i} e^{−‖x_i − y‖²/(4t)} → ∫_{S²} e^{−‖x − y‖²/(4t)} dμ(x),
(1/n) ∑_{i} e^{−‖x_i − y‖²/(4t)} f(x_i) → ∫_{S²} e^{−‖x − y‖²/(4t)} f(x) dμ(x).

Definitions (6) and (8) end the proof.

The last proposition shows that, for a fixed t, L^t_n f(x) → L^t f(x). To utilize Proposition 1 and complete the proof, we need to find a sequence t_n → 0 for which this convergence still holds. Furthermore, it should be faster than the growth of the scaling factor 1/(4πt_n²).

Proposition 3. Given a sampling regular enough, i.e., for which we assume A_i = A_j for all i, j and d(n) ≤ C/n^α, α ∈ (0, 1/2], a Lipschitz function f and a point y ∈ S², there exists a sequence t_n = n^β, β < 0, such that, for all f Lipschitz and x ∈ S²,

(1/(4πt_n²)) | L^{t_n}_n f(x) − L^{t_n} f(x) | → 0  as n → ∞.

Proof. To ease the notation, we define

K_t(x; y) := e^{−‖x − y‖²/(4t)},  φ_t(x; y) := e^{−‖x − y‖²/(4t)} ( f(y) − f(x) ).    (13)

We start with the following inequality:

‖L^t_n f − L^t f‖_∞ = max_{y∈S²} | L^t_n f(y) − L^t f(y) |
= max_{y∈S²} | (1/n) ∑_{i=1}^{n} φ_t(x_i; y) − ∫_{S²} φ_t(x; y) dμ(x) |
≤ max_{y∈S²} ∑_{i=1}^{n} | (1/n) φ_t(x_i; y) − ∫_{σ_i} φ_t(x; y) dμ(x) |
≤ d(n) max_{y∈S²} C_{φ_t^y},    (14)

where C_{φ_t^y} is the Lipschitz constant of x ↦ φ_t(x; y) and the last inequality follows from Proposition 2. Using the assumption d(n) ≤ C/n^α, we find

‖L^t_n f − L^t f‖_∞ ≤ (C/n^α) max_{y∈S²} C_{φ_t^y}.

We now find the explicit dependence between t and C_{φ_t^y}:

C_{φ_t^y} = ‖∇_x φ_t( · ; y)‖_∞ ≤ ‖∇_x K_t( · ; y)‖_∞ ‖f‖_∞ + ‖K_t( · ; y)‖_∞ ‖∇f‖_∞ ≤ C_{K_t^y} ‖f‖_∞ + C_f,

where C_{K_t^y} is the Lipschitz constant of the function x ↦ K_t(x; y). We note that this constant does not depend on y:

C_{K_t^y} = ‖∇_x e^{−‖x‖²/(4t)}‖_∞ = (2et)^{−1/2}.

Hence we have

(C/n^α) max_{y∈S²} C_{φ_t^y} ≤ (C/n^α) ( ‖f‖_∞ (2et)^{−1/2} + C_f ).

Including this result in (14) and rescaling by 1/(4πt²), we obtain

(1/(4πt²)) ‖L^t_n f − L^t f‖_∞ ≤ (C/(4π)) ( ‖f‖_∞ / (√(2e) n^α t^{5/2}) + C_f / (n^α t²) ).

For this bound to vanish as n → ∞, we need n^α t^{5/2} → ∞ and n^α t² → ∞. With t(n) = n^β, we indeed have n^α t^{5/2} = n^{(5/2)β + α} → ∞ since (5/2)β + α > 0 is equivalent to β > −2α/5, and n^α t² = n^{2β + α} → ∞ since 2β + α > 0 is equivalent to β > −α/2. As a result, for t_n = n^β with β ∈ (−2α/5, 0), we have t_n → 0 and (1/(4πt_n²)) ‖L^{t_n}_n f − L^{t_n} f‖_∞ → 0, which concludes the proof.

Theorem 3.1 is then an immediate consequence of Propositions 3 and 1.

Proof of Theorem 3.1. Thanks to Proposition 3 and Proposition 1, we conclude that, for all y ∈ S²,

lim_{n→∞} (1/(4πt_n²)) L^{t_n}_n f(y) = lim_{n→∞} (1/(4πt_n²)) L^{t_n} f(y) = (1/|S²|) Δ_{S²} f(y).

In (Belkin & Niyogi, 2008), the sampling is drawn from a uniform random distribution on the sphere, and their proof heavily relies on the uniformity properties of the distribution from which the sampling is drawn. In our case the sampling is deterministic, and this is indeed a problem that we need to overcome by imposing the regularity conditions above.

| | micro (label average) | | | | macro (instance average) | | | |
|---|---|---|---|---|---|---|---|---|
| | P@N | R@N | F1@N | mAP | P@N | R@N | F1@N | mAP |
| Cohen et al. (2018) (b = 128) | 0.701 | 0.711 | 0.699 | 0.676 | - | - | - | - |
| Cohen et al. (2018) (simplified, b = 64) | 0.704 | 0.701 | 0.696 | 0.665 | 0.430 | 0.480 | 0.429 | 0.385 |
| Esteves et al. (2018) (b = 64) | 0.717 | 0.737 | - | 0.685 | 0.450 | 0.550 | - | 0.444 |
| DeepSphere (equiangular, b = 64) | 0.709 | 0.700 | 0.698 | 0.665 | 0.439 | 0.489 | 0.439 | 0.403 |
| DeepSphere (HEALPix, Nside = 32) | 0.725 | 0.717 | 0.715 | 0.686 | 0.475 | 0.508 | 0.468 | 0.428 |

Table 5: Official metrics from the SHREC'17 object retrieval competition.
To conclude, we see that the result obtained is of similar form to the result obtained in (Belkin & Niyogi, 2008). Given the kernel width t(n) = n^β, Belkin & Niyogi (2008) proved convergence in the random case for β ∈ (−1/4, 0) and we proved convergence in the deterministic case for β ∈ (−2α/5, 0), where α ∈ (0, 1/2] (for the spherical manifold).

B PROOF OF THEOREM 3.2

Proof. Fix x ∈ S². Since any rotation R(g) is an isometry, and the Laplacian commutes with all isometries of a Riemannian manifold, and defining f' := R(g)f for ease of notation, we can write

| R(g) L̂^{t_n}_n f(x) − L̂^{t_n}_n R(g) f(x) |
≤ | R(g) L̂^{t_n}_n f(x) − R(g) Δ_{S²} f(x) | + | R(g) Δ_{S²} f(x) − L̂^{t_n}_n R(g) f(x) |
= | R(g) ( L̂^{t_n}_n f − Δ_{S²} f )(x) | + | Δ_{S²} f'(x) − L̂^{t_n}_n f'(x) |
= | ( L̂^{t_n}_n f − Δ_{S²} f )(g^{-1} x) | + | Δ_{S²} f'(x) − L̂^{t_n}_n f'(x) |.

Since g^{-1}(x) ∈ S² and f' still satisfies the hypotheses, we can apply Theorem 3.1 to say that

| ( L̂^{t_n}_n f − Δ_{S²} f )(g^{-1} x) | → 0  and  | Δ_{S²} f'(x) − L̂^{t_n}_n f'(x) | → 0  as n → ∞,

and conclude that, for all x ∈ S²,

| R(g) L̂^{t_n}_n f(x) − L̂^{t_n}_n R(g) f(x) | → 0  as n → ∞.

C EXPERIMENTAL DETAILS

C.1 3D OBJECTS RECOGNITION

Table 5 shows the results obtained from the SHREC'17 competition's official evaluation script. The architecture is:

(15) [GC16 + BN + ReLU]_nside32 + Pool + [GC32 + BN + ReLU]_nside16 + Pool + [GC64 + BN + ReLU]_nside8 + Pool + [GC128 + BN + ReLU]_nside4 + Pool + [GC256 + BN + ReLU]_nside2 + Pool + GAP + FCN + softmax

C.2 COSMOLOGICAL MODEL CLASSIFICATION

The architecture is:

(16) [GC16 + BN + ReLU]_nside1024 + Pool + [GC32 + BN + ReLU]_nside512 + Pool + [GC64 + BN + ReLU]_nside256 + Pool + [GC64 + BN + ReLU]_nside128 + Pool + [GC64 + BN + ReLU]_nside64 + Pool + [GC2]_nside32 + GAP + softmax

| | TC | AR | BG | mean |
|---|---|---|---|---|
| Mudigonda et al. (2017) | 74 | 65 | 97 | 78.67 |
| Jiang et al. (2019) (paper) | 94 | 93 | 97 | 94.67 |
| Jiang et al. (2019) (rerun) | 93.9 | 95.7 | 95.2 | 94.95 |
| Cohen et al. (2019) (S2R) | 97.8 | 97.3 | 97.3 | 97.5 |
| Cohen et al. (2019) (R2R) | 97.9 | 97.8 | 97.4 | 97.7 |
| DS (Jiang architecture, weighted loss) | 97.1 | 97.6 | 96.5 | 97.1 |
| DS (weighted loss) | 97.4 ± 1.1 | 97.7 ± 0.7 | 98.2 ± 0.5 | 97.8 ± 0.3 |
| DS (wider architecture, weighted loss) | 91.5 | 93.4 | 99.0 | 94.6 |
| DS (Jiang architecture, non-weighted loss) | 33.6 | 93.6 | 99.3 | 75.5 |
| DS (non-weighted loss) | 69.2 ± 3.7 | 94.5 ± 2.9 | 99.7 ± 0.1 | 87.8 ± 0.5 |
| DS (wider architecture, non-weighted loss) | 73.4 | 92.7 | 99.8 | 88.7 |

Table 6: Results on climate event segmentation: accuracy. Tropical cyclones (TC) and atmospheric rivers (AR) are the two positive classes, against the background (BG). Mudigonda et al. (2017) is not directly comparable as they don't use the same input feature maps. Note that a non-weighted cross-entropy loss is not optimal for the accuracy metric.

| | TC | AR | mean |
|---|---|---|---|
| Jiang et al. (2019) (rerun) | 11.08 | 65.21 | 38.41 |
| Cohen et al. (2019) (S2R) | - | - | 68.6 |
| Cohen et al. (2019) (R2R) | - | - | 75.9 |
| DS (Jiang architecture, non-weighted loss) | 46.2 | 93.9 | 70.0 |
| DS (non-weighted loss) | 80.86 ± 2.42 | 97.45 ± 0.38 | 89.16 ± 1.37 |
| DS (wider architecture, non-weighted loss) | 84.71 | 98.05 | 91.38 |
| DS (Jiang architecture, weighted loss) | 49.7 | 89.2 | 69.5 |
| DS (weighted loss) | 58.88 ± 3.17 | 95.41 ± 1.51 | 77.15 ± 1.94 |
| DS (wider architecture, weighted loss) | 52.80 | 94.78 | 73.79 |

Table 7: Results on climate event segmentation: average precision. Tropical cyclones (TC) and atmospheric rivers (AR) are the two positive classes. Note that a weighted cross-entropy loss is not optimal for the average precision metric.

C.3 CLIMATE EVENT SEGMENTATION

Tables 6, 7, and 8 show the accuracy, mAP, and efficiency of all the NNs we ran. The experiment with the model from Jiang et al. (2019) was rerun in order to obtain the AP metrics, but with a batch size of 64 instead of 256 due to GPU memory limits.
Several experiments were run with different architectures for DeepSphere (DS). The "Jiang architecture" uses an architecture similar to Jiang et al. (2019), with only the convolutional operators replaced. "DeepSphere" alone is the original architecture giving the best results, deeper and with four times more feature maps than the Jiang architecture. The "wider architecture" is the same as the previous one with twice the number of feature maps. Regarding the weighted loss, the weights are chosen with the scikit-learn function compute_class_weight on the training set.

| | params | inference | training |
|---|---|---|---|
| Jiang et al. (2019) | 330 k | 10 ms | 10 h |
| DeepSphere (Jiang architecture) | 590 k | 5 ms | 3 h |
| DeepSphere | 13 M | 33 ms | 13 h |
| DeepSphere (wider architecture) | 52 M | 50 ms | 20 h |

Table 8: Results on climate event segmentation: size and speed.

DeepSphere with Jiang architecture, encoder:

(17) [GC8 + BN + ReLU]_L5 + Pool + [GC16 + BN + ReLU]_L4 + Pool + [GC32 + BN + ReLU]_L3 + Pool + [GC64 + BN + ReLU]_L2 + Pool + [GC128 + BN + ReLU]_L1 + Pool + [GC128 + BN + ReLU]_L0

Decoder:

(18) Unpool + [GC128 + BN + ReLU]_L1 + concat + [GC128 + BN + ReLU]_L1 + Unpool + [GC64 + BN + ReLU]_L2 + concat + [GC64 + BN + ReLU]_L2 + Unpool + [GC32 + BN + ReLU]_L3 + concat + [GC32 + BN + ReLU]_L3 + Unpool + [GC16 + BN + ReLU]_L4 + concat + [GC16 + BN + ReLU]_L4 + Unpool + [GC8 + BN + ReLU]_L5 + concat + [GC8 + BN + ReLU]_L5 + [GC3]_L5

Concat is the operation that concatenates the result of the corresponding encoder layer.

Original DeepSphere architecture (encoder-decoder), encoder:

(19) [GC32 + BN + ReLU]_L5 + [GC64 + BN + ReLU]_L5 + Pool + [GC128 + BN + ReLU]_L4 + Pool + [GC256 + BN + ReLU]_L3 + Pool + [GC512 + BN + ReLU]_L2 + Pool + [GC512 + BN + ReLU]_L1 + Pool + [GC512]_L0

Decoder:

(20) Unpool + [GC512 + BN + ReLU]_L1 + concat + [GC512 + BN + ReLU]_L1 + Unpool + [GC256 + BN + ReLU]_L2 + concat + [GC256 + BN + ReLU]_L2 + Unpool + [GC128 + BN + ReLU]_L3 + concat + [GC128 + BN + ReLU]_L3 + Unpool + [GC64 + BN + ReLU]_L4 + concat + [GC64 + BN + ReLU]_L4 + Unpool + [GC32 + BN + ReLU]_L5 + [GC3]_L5

C.4 UNEVEN SAMPLING

Architecture for dense regression:

(21) [GC50 + BN + ReLU] + [GC100 + BN + ReLU] + [GC100 + BN + ReLU] + [GC1]

Architecture for global regression:

(22) [GC50 + BN + ReLU] + [GC100 + BN + ReLU] + [GC100 + BN + ReLU] + GAP + FCN
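For completeness, the class weighting mentioned in section C.3 can be computed with scikit-learn as follows. This is only an illustrative sketch of the call; the label encoding is an assumption, not taken from the released code.

```python
# Illustrative sketch: balanced class weights for the weighted cross-entropy,
# as mentioned in section C.3. `train_labels` is an assumed 1-D array of
# per-pixel labels on the training set (e.g., 0: BG, 1: TC, 2: AR).
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(train_labels)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=train_labels)
```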