# Deep Diffusion-Invariant Wasserstein Distributional Classification

Sung Woo Park, Dong Wook Shu, Junseok Kwon
School of Computer Science and Engineering, Chung-Ang University, Seoul, Korea
pswkiki@gmail.com, seowok@naver.com, jskwon@cau.ac.kr

## Abstract

In this paper, we present a novel classification method called deep diffusion-invariant Wasserstein distributional classification (DeepWDC). DeepWDC represents input data and labels as probability measures to address severe perturbations in input data. It can output the optimal label measure in terms of diffusion invariance, where the label measure is stationary over time and becomes equivalent to a Gaussian measure. Furthermore, DeepWDC minimizes the 2-Wasserstein distance between the optimal label measure and the Gaussian measure, which reduces the Wasserstein uncertainty. Experimental results demonstrate that DeepWDC can substantially enhance the accuracy of several baseline deterministic classification methods and outperforms state-of-the-art methods on 2D and 3D data containing various types of perturbations (e.g., rotations, impulse noise, and down-scaling).

## 1 Introduction

The Wasserstein space has been widely used for various machine learning tasks, including generative adversarial learning (2), policy optimization (30), Gaussian processes (17), statistical learning (14), data embedding (6; 19), topic modeling (29), Bayesian inference (1), Gaussian mixture modeling (11), and optimal transport (13; 20; 23). However, the Wasserstein space has not been actively studied for developing novel classification models. In conventional classification problems, $N$ pairs of input data $x_n$ and corresponding target labels $\hat{y}_n$ (i.e., $\{x_n, \hat{y}_n\}_{n=1}^N$) are used for training, where the input data and target labels are considered vector-valued points in the Euclidean space, and an objective function is formulated to obtain an inference network $f$ based on the Euclidean distance as follows: $\min_f \sum_n d_E\big(y_n = f(x_n), \hat{y}_n\big)$.

We aim to answer the following questions in this paper.

**How can the stochastic properties of input data and labels be appropriately captured to handle severe perturbations?** To answer this question, we represent both input data and target labels as probability measures (i.e., probability densities), denoted as $\mu_n$ and $\hat{\nu}_n$, respectively, in the Wasserstein space and solve a distance-based classification problem, $\min_f \sum_n W_2\big(\nu_n = f_\#[\mu_n], \hat{\nu}_n\big)$, based on the 2-Wasserstein distance $W_2$. Specifically, the Euclidean vectors $\{x_n, \hat{y}_n\}$ and the inference network $f$ are replaced with probability measures $\{\mu_n, \hat{\nu}_n\}_{n=1}^N$ and the push-forward operation $f_\#[\cdot]$, respectively. Because probability measures can be spread out and multi-modal (19), they provide excellent flexibility for handling stochastic perturbations in data.

Figure 1: Randomly perturbed data.

Each vector-valued point $x$ is regarded as a sample from the object measure $\mu_n$, i.e., $x \sim \mu_n$. Suppose that perturbations are randomly added to data points at test and training time, i.e., $(x + \varepsilon) \sim \mu_n^\varepsilon$ for an unknown random perturbation vector $\varepsilon$ and a perturbed probability measure $\mu_n^\varepsilon$. Then, it is reasonable to represent the data using a probability measure $\mu_n^\varepsilon$ rather than as an individual observed data point $x + \varepsilon$, because the data can occupy multiple locations at every observation (e.g., the red points in Fig. 1).
In this circumstance, we must minimize the Wasserstein uncertainty of perturbed data, represented as probability measures, so that the data become predictable. Therefore, the proposed classification method represents each estimated label stochastically as a probability measure $f_\#[\mu_n]$ to account for various types of random perturbations. Such an approach (i.e., treating stochastic data as probability measures in the Wasserstein space) has been studied for stochastic deep neural networks (DNNs) (5). However, stochastic DNNs still represent labels as Euclidean vectors and tend to focus on constructing neural networks rather than on theoretical analysis. By contrast, our method represents labels as probability measures and theoretically analyzes classification problems in terms of these label measures.

**How can classification tasks be defined in the Wasserstein space in an efficient and optimal manner?** It is nontrivial to employ the Wasserstein space for distance-based classification because computing Wasserstein distances directly is intractable. There are two representative approaches for computing Wasserstein distances, based on the primal and dual formulations of optimal transport. One approach discretizes the data space and finds the discrete optimal coupling of densities that yields the lowest transport cost (7); linear programming has been adopted to solve this primal problem directly (21; 25). However, the total computational cost grows quadratically as the dimensionality of the space and the cardinality of the data increase, so this approach cannot be employed for complex data representations. Another approach, based on the Kantorovich duality problem, has been widely used in various tasks (2). This approach can represent complex data, but it addresses the most general case, in which probability measures are arbitrary elements of the Wasserstein space. By contrast, in our method the Wasserstein distance is defined between a label measure $\nu_n$ and a corresponding fixed Gaussian measure $\mathcal{N}_\Sigma$, which enables us to use a fundamental property of diffusion semi-groups, called hypercontractivity, to obtain an explicit upper bound on the 2-Wasserstein distance (i.e., the Wasserstein ambiguity) and the exponential decay of this distance.

Figure 2: Deep diffusion-invariant Wasserstein distributional classification (DeepWDC).

**Proposed method.** Fig. 2 presents a conceptual illustration of the proposed method. An input measure $\mu_n$ is fed into the proposed push-forward operator $f_\#^\theta$, parameterized by the neural network $f^\theta$, to produce a label measure $\nu_n$. The test function $g^\vartheta$ is then adversarially trained to make $\nu_n$ invariant to a pre-defined diffusion operator, which is equivalent to minimizing the Wasserstein uncertainty by tightening the upper bound of $W_2$. If $W_2 = 0$, we obtain the optimal label measure, which is the Gaussian measure $\mathcal{N}_\Sigma$. The label of $\mu_n$ is then equivalent to that of $\mathcal{N}_{\Sigma_m}$ if the intrinsic distance $W_{2,g}$ between $\nu_n = f_\#^\theta[\mu_n]$ and $\mathcal{N}_{\Sigma_m}$ is minimized in the Wasserstein Gaussian subspace. Thus, our method aims to find the optimal parameters $\theta$ and $\vartheta$ of the two DNNs.

Our contributions can be summarized as follows:

- We introduce a novel distance-based distributional classification method (DeepWDC), where both input data and target labels are probability measures in the 2-Wasserstein space.
- To make classification problems computationally tractable, we derive 2-Wasserstein distances indirectly by determining their upper bounds (i.e., the Wasserstein ambiguity). We present theoretical evidence supporting the capability of our method to rapidly reduce this Wasserstein ambiguity.
- We prove that minimizing the Wasserstein ambiguity is equivalent to making $\nu_n$ (i.e., the estimated label measure) diffusion-invariant. If $\nu_n$ is diffusion-invariant, then the density of $\nu_n$ is stationary over time. In our theoretical analysis, we introduce the concept of hypercontractivity of a diffusion semi-group to relate diffusion invariance to the Wasserstein distance.
- We experimentally demonstrate the robustness of our method against severe random perturbations (e.g., rotations, downscaling, and non-homogeneous local noise) and verify that it can substantially outperform state-of-the-art deterministic methods.

**Wasserstein Gaussian embedding.** Muzellec and Cuturi explored the Gaussian 2-Wasserstein space (19), where embedded points are represented as non-degenerate Gaussian measures. Because the Bures metric is explicitly defined in this space, no sub-routines for computing the metric are required. However, this method focuses on point embedding in the Gaussian Wasserstein subspace rather than on a distributional realization of the feature space. Solving classification problems in this space is therefore problematic, because the actual density of objects cannot be represented as a unique elliptical distribution. By contrast, our method imposes an explicit constraint that necessarily transforms inference measures into Gaussian measures, which are uniquely defined.

**Wasserstein distributional learning.** In recent studies, data uncertainty has been represented as a distance in the Wasserstein space (8; 21; 25; 26; 9). To this end, the Wasserstein distance was employed to define the uncertainty of data, referred to as the Wasserstein ambiguity set (mathematically equivalent to a Wasserstein ball). The uncertainty of the data was measured by the closeness of the induced probability density of the data to a prescribed target density in terms of the distance metric. Our method can be interpreted within this framework, where the radius of the Wasserstein ambiguity set is proportional to the square-root term in (2). Our method minimizes this Wasserstein ambiguity for classification tasks to handle severe perturbations from a distributional perspective.

## 2 The Proposed Method

In this section, we describe the two fundamental concepts of our method, namely, diffusion invariance and Wasserstein Gaussian subspaces. We then define an objective function based on these concepts for classification tasks in the Wasserstein space. Finally, we present a Wasserstein-distance-based classifier.

As mentioned in Section 1, the $n$-th input data sample and its target label are represented as probability measures $\mu_n$ and $\hat{\nu}_n$, respectively. The label measure $\nu_n$ is estimated using a push-forward operation, $\nu_n = f_\#[\mu_n]$. An input vector $x$ can thus be sampled from $\mu_n$ (i.e., $x \sim \mu_n$), and the estimated label measures are denoted as $\nu_n$. We define the test function $g$ as an element of the class of smooth real-valued functions with compact support (i.e., $g \in C_0^\infty$), where $g$ maps $y$ to a real value (i.e., $g(y) \in \mathbb{R}$). In this paper, $f_\#[\mu_n]$ and $g(y)$ are implemented as DNNs with parameters $\theta$ and $\vartheta$, denoted $f_\#^\theta[\mu_n]$ and $g^\vartheta(y)$, respectively; a minimal sketch of this sample-based realization is given below.
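The paper defers the actual network architectures to the Supplementary Materials. The following PyTorch sketch only illustrates how the push-forward measure and the test function can be realized from samples: an input measure $\mu_n$ is represented by a minibatch of samples, $f_\#^\theta[\mu_n]$ by the mapped samples, and $g^\vartheta$ by a scalar-valued network. The class names, layer sizes, and dimensions below are our own illustrative assumptions, not the authors' architectures.

```python
import torch
import torch.nn as nn

# Hypothetical sample-based realization: mu_n is represented by a minibatch of samples
# x ~ mu_n, and the push-forward measure f_#^theta[mu_n] by the mapped samples f(x).

class PushForward(nn.Module):
    """f^theta: maps samples of mu_n to samples of the label measure nu_n = f_#[mu_n]."""
    def __init__(self, in_dim=3, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, x):              # x: (num_samples, in_dim), rows drawn from mu_n
        return self.net(x)             # samples y ~ f_#^theta[mu_n], shape (num_samples, feat_dim)

class TestFunction(nn.Module):
    """g^vartheta: a smooth scalar-valued test function on the label space."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.Tanh(),
            nn.Linear(256, 1),
        )

    def forward(self, y):              # y: (num_samples, feat_dim)
        return self.net(y).squeeze(-1)  # scalar value g(y) per sample
```

In practice, the paper uses convolutional (2D) and point-cloud (3D) backbones for $f$; the linear layers here merely stand in for whichever feature extractor is used.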
### 2.1 Diffusion-Invariant Measure

Our method imposes a diffusion operator on probability measures to derive diffusion-invariant measures for the estimated label $\nu_n$ and computes the corresponding Wasserstein ambiguity (i.e., the upper bound of the 2-Wasserstein distance $W_2$).

**Definition 1 (Diffusion Operator).** Given a Markov semi-group $P_t$ at time $t$, the diffusion operator (i.e., the infinitesimal generator) $\mathcal{L}$ of $P_t$ is defined as

$$\mathcal{L}g(y) = \lim_{t \to 0} \frac{1}{t}\big(P_t g(y) - g(y)\big) = \sum_{i,j} B_{ij}(y)\,\partial_{y_i}\partial_{y_j} g(y) - \sum_i A_i(y)\,\partial_{y_i} g(y), \qquad (1)$$

where $B$ and $A$ are matrix- and vector-valued measurable functions, respectively, $B_{ij}$ denotes the $(i,j)$-th component function of $B$, and $A_i$ denotes the $i$-th component function of $A$.

The diffusion operator $\mathcal{L}g$ can be interpreted as the average change in $g$ with respect to an infinitesimal change in time under the Markov semi-group. If the expectation of this average change in $g$ is zero, then the data sampled from the probability measure, $y \sim \nu_n$, are stationary over time. In this case, $\nu_n$ becomes invariant to $\mathcal{L}$ (i.e., diffusion-invariant); please refer to Section 3.3 for details regarding the perturbation analysis.

**Definition 2 (Diffusion-Invariant Measure).** Given the diffusion operator $\mathcal{L}$, the probability measure $\nu_n$ is invariant to $\mathcal{L}$ when $\mathbb{E}_{y \sim \nu_n}[\mathcal{L}g(y)] = 0$ for any $g \in C_0^\infty$.

We set the target label measure $\hat{\nu}_n$ to $\mathcal{N}_{\Sigma_n}$ (i.e., a centered non-degenerate Gaussian measure with covariance $\Sigma_n$). This setting is possible because $\mathcal{N}_{\Sigma_n}$ is an element of the 2-Wasserstein space. Additionally, we set $B_{ij}$ and $A_i$ in (1) to $\Sigma_{ij}$ (i.e., the $(i,j)$-th entry of $\Sigma$) and $y_i$ (i.e., the $i$-th component of $y$), respectively. Then, the Wasserstein ambiguity is expressed in terms of $\mathcal{L}$ as follows:

**Proposition 1.** Let $\mathcal{L}g^\vartheta(y)$ be the diffusion operator defined in Definition 1. Then,

$$W_2\big(f_\#^\theta[\mu_n], \hat{\nu}_n\big) = W_2\big(f_\#^\theta[\mu_n], \mathcal{N}_{\Sigma_n}\big) \le \sqrt{\sup_\vartheta \mathbb{E}_{y \sim f_\#^\theta[\mu_n]}\big[\,|\mathcal{L}g^\vartheta(y)|\,\big]}. \qquad (2)$$

As shown in (2), we can minimize the 2-Wasserstein distance (i.e., drive $W_2(f_\#^\theta[\mu_n], \hat{\nu}_n)$ to zero) by minimizing its upper bound (i.e., making $\sup_\vartheta \mathbb{E}_{y \sim \nu_n}[\,|\mathcal{L}g^\vartheta(y)|\,] = 0$), which means that the estimated label measure $\nu_n$ becomes diffusion-invariant, as defined in Definition 2, and is hardly affected by perturbations. We can make $\sup_\vartheta \mathbb{E}_{y \sim \nu_n}[\,|\mathcal{L}g^\vartheta(y)|\,]$ equal to zero if and only if $f_\#^\theta[\mu_n] = \mathcal{N}_{\Sigma_n}$. Thus, the goal of our method is to make $\nu_n = f_\#^\theta[\mu_n]$ identical to $\mathcal{N}_{\Sigma_n}$ by updating the parameter $\theta$. In (2), the upper bound decreases at a square-root rate and can rapidly converge to zero. Section 3.1 presents the proof of Proposition 1 and further investigations.

The diffusion operator $\mathcal{L}$ in (1) contains a second-order derivative term, so it would be inefficient for a neural network $g$ to compute the Hessian matrix $\nabla_y^2 g(y)$ during training. However, our method can evaluate the diffusion operator without any derivatives. For the proposed diffusion operator $\mathcal{L}$ with the specific settings $B_{ij} = \Sigma_{ij}$ and $A_i = y_i$, the Markov semi-group $P_t g$ has an explicit form known as Mehler's formula: $P_t g^\vartheta(y) = \mathbb{E}_{Z \sim \mathcal{N}_I}\big[g^\vartheta\big(e^{-t}y + \sqrt{1 - e^{-2t}}\,Z\big)\big]$, where $\mathcal{N}_I$ denotes the standard Gaussian measure. The diffusion term is then calculated as

$$\mathbb{E}_{y \sim \nu_n}[\mathcal{L}g^\vartheta(y)] = \lim_{t \to 0} \frac{1}{t}\,\mathbb{E}_{y \sim f_\#^\theta[\mu_n]}\big[P_t g^\vartheta(y) - g^\vartheta(y)\big] = \lim_{t \to 0} \frac{1}{t}\,\mathbb{E}_{y \sim f_\#^\theta[\mu_n]}\,\mathbb{E}_{Z \sim \mathcal{N}_I}\Big[g^\vartheta\big(e^{-t}y + \sqrt{1 - e^{-2t}}\,Z\big) - g^\vartheta(y)\Big]. \qquad (3)$$

Our method has several mathematical advantages over existing methods. For example, the WGAN-GP method (2) adopts a gradient penalty term for its test function to induce 1-Lipschitzness. However, this imposes a strong global penalty on the test function, which is inevitable under the assumptions of the Kantorovich-Rubinstein formula. By contrast, the test function $g^\vartheta$ in the proposed method relies only on the local properties of Markov semi-groups, with no prior assumptions. A sampling-based sketch of the diffusion term in (3) is given below.
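As a concrete illustration of (3), the diffusion term can be estimated purely from samples of $f_\#^\theta[\mu_n]$ and standard Gaussian noise, without differentiating $g$. The sketch below assumes the hypothetical `PushForward`/`TestFunction` modules above and replaces the $t \to 0$ limit with a randomly sampled finite $t$ (Algorithm 1 in Section 4.1 likewise samples $t \sim U[0,1]$); the lower cutoff `t_min` and the number of noise samples `num_z` are our own choices.

```python
import torch

def diffusion_invariance_term(f, g, x, num_z=8, t_min=1e-2):
    """Monte Carlo sketch of the diffusion term in (3) via Mehler's formula.

    f: push-forward network f^theta (maps samples of mu_n to samples of nu_n).
    g: scalar-valued test function g^vartheta.
    x: (batch, in_dim) tensor of samples drawn from the input measure mu_n.
    The finite time t, the cutoff t_min, and num_z are illustrative choices only.
    """
    y = f(x)                                               # samples y ~ f_#^theta[mu_n]
    t = torch.empty(1, device=y.device).uniform_(t_min, 1.0)
    scale = torch.sqrt(1.0 - torch.exp(-2.0 * t))
    diffs = []
    for _ in range(num_z):
        z = torch.randn_like(y)                            # Z ~ N(0, I)
        y_t = torch.exp(-t) * y + scale * z                # Mehler's formula for P_t
        diffs.append(g(y_t) - g(y))                        # one-sample estimate of P_t g(y) - g(y)
    per_sample = torch.stack(diffs, dim=0).mean(dim=0) / t  # approx. L g^vartheta(y) for each y
    return per_sample.abs().mean()                         # approx. E_y[ |L g^vartheta(y)| ]
```

During training (see the objective (4) and Algorithm 1 in Section 4.1), $\theta$ descends and $\vartheta$ ascends gradients of this quantity, realizing the adversarial min-max formulation.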
### 2.2 Wasserstein Gaussian Subspaces

We define Wasserstein Gaussian subspaces to compute the intrinsic distance between the estimated label measure $\nu_n$ and the target label measure $\hat{\nu}_n$.

**Definition 3.** The Wasserstein Gaussian subspace $\mathcal{P}_{2,g}$ is the subspace of the 2-Wasserstein space $\mathcal{P}_2$ consisting of centered non-degenerate Gaussian measures, and $W_{2,g}$ is the distance metric on this subspace.

As an analogy, suppose that each estimated label vector is mapped to the hypersphere, $y_n = f(x_n)$ and $\hat{y}_n \in \mathbb{S}^{d-1} \subset \mathbb{R}^d$. In this case, the Euclidean distance $d_{\mathbb{R}^d}$ differs from the spherical distance $d_{\mathbb{S}^{d-1}}$, and we cannot use $d_{\mathbb{R}^d}$ as a true metric for minimizing the distance between the estimated and target labels. Instead, the intrinsic distance on $\mathbb{S}^{d-1}$ must be used to derive accurate objective functions for classification tasks. To define the corresponding intrinsic distance in our setting, we introduce the Wasserstein Gaussian subspace (playing the role of the hypersphere above), in which the target label measure $\hat{\nu}_n$ is represented as a Gaussian measure $\mathcal{N}_{\Sigma_n}$ and the estimated label measure $\nu_n$ also lies. The intrinsic distance is then defined as $\sum_{n=1}^N W_{2,g}(\nu_n, \hat{\nu}_n) = \sum_{n=1}^N W_{2,g}(f_\#[\mu_n], \mathcal{N}_{\Sigma_n})$. In Section 3.2, we analyze the geometric characteristics of the Wasserstein Gaussian subspace.

### 2.3 Objective Function

The proposed objective function is defined as follows:

$$\min_\theta \max_\vartheta \frac{1}{N} \sum_{n=1}^N \underbrace{\mathbb{E}_{y_n \sim f_\#^\theta[\mu_n]}\big[\,|\mathcal{L}g^\vartheta(y_n)|\,\big]}_{\text{Diffusion invariance term}} \;+\; \underbrace{W_{2,g}\big(f_\#^\theta[\mu_n], \mathcal{N}_{\Sigma_n}\big)}_{\text{Intrinsic distance term}}. \qquad (4)$$

The first term of (4), which is defined in (3), ensures the diffusion invariance of the estimated label measure $\nu_n$, i.e., that $\nu_n$ is invariant to the diffusion operator (Section 2.1). This term also determines the Wasserstein ambiguity and yields the upper bound on the 2-Wasserstein distance in (2). The second term of (4), which is defined in (6), minimizes the intrinsic distance between the estimated label measure $\nu_n$ and the target label measure $\mathcal{N}_{\Sigma_n}$ in the Wasserstein Gaussian subspace (Section 2.2). This term reduces infinite-dimensional problems in the Wasserstein space to tractable finite-dimensional problems, which can be easily optimized with respect to $f$ and $g$. Our method seeks the optimal parameters $\theta$ and $\vartheta$ of the two neural networks $f$ and $g$, respectively, that minimize the objective function in (4). As a result, the estimated label measures are close to the target label measures and are robust to stochastic perturbations. The Supplementary Materials provide algorithms that outline the entire procedure of the proposed method.

### 2.4 Evaluation Metric

To evaluate the classification performance, we propose the following classifier based on the 2-Wasserstein distance. We calculate the Top-1 average accuracy over $N$ objects as

$$\widehat{\mathrm{cls}}\big(\{\mu_n\}_{n=1}^N\big) = \frac{1}{N} \sum_{n=1}^N \mathrm{cls}\Big[\mu_n,\ \arg\min_{\mathcal{N}_{\Sigma_m}} W_{2,g}\big(f_\#^\theta[\mu_n], \mathcal{N}_{\Sigma_m}\big)\Big], \qquad (5)$$

where $\mathrm{cls}[\mu_n, \mathcal{N}_{\Sigma_m}] = 1$ if $\mathcal{N}_{\Sigma_m}$ and $\mu_n$ share the same label information. That is, we search for the target label measure $\mathcal{N}_{\Sigma_m}$ that is closest to the estimated label measure $f_\#^\theta[\mu_n]$, and the input data $\mu_n$ is classified into the label of $\mathcal{N}_{\Sigma_m}$. If we can find the optimal parameters $(\theta^*, \vartheta^*)$ of the neural networks $(f, g)$, then $\widehat{\mathrm{cls}}(\{\mu_n\}_{n=1}^N)$ in (5) equals 1. Thus, we can solve the metric-based classification problem in the Wasserstein space.
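As an illustration, the nearest-Gaussian decision rule in (5) can be sketched as follows, under the assumption that each estimated label measure is summarized by the sample covariance of $f^\theta(x)$, $x \sim \mu_n$ (as in step 2 of Algorithm 1 in Section 4.1), so that $W_{2,g}$ has the closed Bures form introduced later in Section 3.2. The eigendecomposition-based matrix square root is a stand-in for the Newton-Schulz iteration the paper actually uses (Section 4.1), and `class_covs` denotes the list of per-class target covariances $\Sigma_m$.

```python
import torch

def sqrtm_psd(mat, eps=1e-8):
    """Symmetric PSD matrix square root via eigendecomposition.
    (The paper uses a GPU-friendly Newton-Schulz iteration instead; see Section 4.1.)"""
    vals, vecs = torch.linalg.eigh(mat)
    vals = torch.clamp(vals, min=0.0)
    return (vecs * torch.sqrt(vals + eps)) @ vecs.T

def bures_w2(sigma_a, sigma_b):
    """2-Wasserstein (Bures) distance between centered Gaussians N(0, sigma_a) and N(0, sigma_b)."""
    root_b = sqrtm_psd(sigma_b)
    inner = root_b @ sigma_a @ root_b
    inner = 0.5 * (inner + inner.T)                 # symmetrize against numerical error
    cross = sqrtm_psd(inner)
    sq = torch.trace(sigma_a) + torch.trace(sigma_b) - 2.0 * torch.trace(cross)
    return torch.sqrt(torch.clamp(sq, min=0.0))

def classify(f, x, class_covs):
    """Decision rule of (5): assign mu_n to the class whose target covariance minimizes
    the Bures distance to the sample covariance of f(x), x ~ mu_n."""
    y = f(x)                                        # samples of nu_n = f_#^theta[mu_n]
    y = y - y.mean(dim=0, keepdim=True)             # center (label measures are centered)
    sigma_n = y.T @ y / (y.shape[0] - 1)            # sample covariance of the label measure
    dists = torch.stack([bures_w2(sigma_n, cov) for cov in class_covs])
    return torch.argmin(dists).item()               # predicted class index
```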
## 3 Theoretical Analysis

Section 2 presented the basic concepts of the proposed method. In this section, we decompose each term in (4) by examining the connection between the diffusion operator and the 2-Wasserstein distance, and we detail the theoretical advantages of the proposed method. We first introduce the main assumptions used in the theoretical analysis. Because our method aims to make each push-forward measure $\nu_n$ diffusion-invariant with respect to $\mathcal{L}$, we assume that there is a sequence $\{\delta_{k,n}\}_{k=0}^\infty$ such that $\delta_{k,n} = \sup_{g \in C_0^\infty} \int |\mathcal{L}g(y)|\, q_{k,n}(y)\, d\mathcal{N}_{\Sigma_n}(y) < \infty$, where $q_{k,n}$ denotes the density of $\nu_{k,n}$ (with respect to $\mathcal{N}_{\Sigma_n}$), and that there is a sufficiently large $K_0 \in \mathbb{N}_+$ such that $\delta_{k,n} = 0$ if $k \ge K_0$, for $n = 0, \ldots, N$. Intuitively, $\delta_{k,n}$ can be understood as an indicator of how strongly the $n$-th label measure $\nu_{k,n}$ is diffused by $\mathcal{L}$ at learning iteration $k$. The Supplementary Materials include the remaining minor assumptions, the notation, and the details of the entire theoretical analysis, which cannot be presented here owing to space constraints.

### 3.1 Hypercontractivity

We examine the relationship between the proposed diffusion operator $\mathcal{L}$ (defined in Section 2.1) and the 2-Wasserstein distance $W_2$. For simplicity, the subscript $n$ (e.g., in $\nu_n$) is omitted in this section.

**Proposition 2 (Descent of the Wasserstein Ambiguity Set).** Let $\{\nu_k\}_{k=0}^\infty$ be a sequence of probability measures satisfying the assumptions above. Then, $\nu_k \in B_{W_2}(\mathcal{N}_\Sigma, \delta_k)$; in other words, $W_2(\nu_k, \mathcal{N}_\Sigma) \le \delta_k$. Furthermore, $W_2(\nu_k, \mathcal{N}_\Sigma) \to 0$ as $k \to \infty$.

Figure 3: Descent of the Wasserstein ambiguity set. The radius of the ambiguity set converges to zero (i.e., $r = \delta_k \to 0$).

This proposition is inspired by the Bakry-Émery inequality (3) and the HWI inequality (22), which describe the fundamental behavior of Markov diffusion semi-groups. To prove Proposition 2, the Fisher information is replaced with the proposed diffusion term $\sup_\vartheta \mathbb{E}_{y \sim \nu_k}[\,|\mathcal{L}g^\vartheta(y)|\,]$ in (2), which clearly highlights the connection between the diffusion operator $\mathcal{L}$ and the 2-Wasserstein distance. Specifically, the 2-Wasserstein distance is bounded above by the square root of the diffusion term. Fig. 3 presents a conceptual illustration of Proposition 2. Note that the inequality in Proposition 1 is different from that in Proposition 2.

**Proposition 3 (Exponential Decay of the Wasserstein Distance).** Define a sub-sequence $\tau(k) \subset \mathbb{N}_+$ such that $\tau(k) = \big\{ k' \le k \;;\ \tfrac{1}{k'} \int \zeta\, d[\nu_{k'} - \nu_0] + \epsilon(k') \le \delta_0 \big\}$, where $\zeta \in C_b$, and let $|\epsilon(k)|$ be a dual error satisfying $|\epsilon(k)| \to 0$ as $k \to \infty$. Then the following inequality holds: $W_2(\nu_{\tau_k}, \mathcal{N}_\Sigma) \le \sqrt{\delta_0\, e^{-2\tau_k}} + \epsilon(\tau_k)$, where $\tau_k$ denotes an element of $\tau(k)$.

By setting $\zeta(y)$ to a 1-Lipschitz function, the condition in Proposition 3 can be rewritten, using its definition and the Kantorovich-Rubinstein formula, as $\frac{1}{k}\big(W_1(\nu_k, \nu_0) + \varepsilon(k)\big) \le \delta_0$. Therefore, if the average difference, in terms of the 1-Wasserstein distance, between the initial and updated measures (i.e., $\nu_{\tau_k} = f_\#^{\theta_{\tau_k}}[\mu]$) is bounded above by $\delta_0$, then the Wasserstein ambiguity (i.e., the radius of the 2-Wasserstein ball) decays exponentially to zero. In other words, with a small error $\epsilon$ and a moderate update rate of $f^\theta$, the Wasserstein ambiguity decays exponentially.

### 3.2 Riemannian Geometry of $W_{2,g}$

We interpret the proposed Wasserstein Gaussian subspace $(\mathcal{P}_{2,g}, W_{2,g})$ (defined in Section 2.2) as a totally geodesic submanifold of the 2-Wasserstein space, analogous to a convex set (e.g., a unit cube with one-hot vectors as elements) in Euclidean space. In a totally geodesic submanifold, any geodesic computed with the induced Riemannian metric is also a geodesic of the ambient manifold.
Therefore, if the Wasserstein Gaussian subspace is a totally geodesic submanifold of the 2-Wasserstein space, then $W_2$ and $W_{2,g}$ coincide and admit an explicit form (i.e., the Bures metric), defined as follows:

**Definition 4.** Let $\Sigma_n = \mathrm{Cov}(X_n)$ with $X_n \sim \nu_n$, and similarly $\Sigma_{n'}$ for $\nu_{n'}$. Then, the Bures metric between the probability measures $\nu_n$ and $\nu_{n'}$ is defined as

$$d_B(\nu_n, \nu_{n'}) = \sqrt{\mathrm{Tr}\Big(\Sigma_n + \Sigma_{n'} - 2\big(\Sigma_{n'}^{1/2}\, \Sigma_n\, \Sigma_{n'}^{1/2}\big)^{1/2}\Big)}.$$

**Proposition 4 (4; 18).** The Wasserstein Gaussian subspace $\mathcal{P}_{2,g}(\mathbb{R}^d)$ is a totally geodesic submanifold of the 2-Wasserstein space $\mathcal{P}_2(\mathbb{R}^d)$. Thus, the following distances are equivalent: $d_B(\nu_n, \nu_{n'}) = W_{2,g}(\nu_n, \nu_{n'}) = W_2(\nu_n, \nu_{n'})$ for any $\nu_n, \nu_{n'} \in \mathcal{P}_{2,g}(\mathbb{R}^d)$.

In (4), we assume that the label measure $\nu_n$ is Gaussian, which is generally not true during training, and we approximate the intrinsic distance $W_{2,g}$ by $d_B$. However, according to Propositions 2 and 4, together with (2), $d_B$ is an exact estimate of $W_{2,g}$ once the optimal parameters $(\theta, \vartheta)$ of the neural networks $(f, g)$ are found and $\nu_n$ lies in the Wasserstein Gaussian subspace. Using the explicit form in Definition 4, we can also compute approximate 2-Wasserstein distances during testing.

### 3.3 Perturbation Analysis

Let $\mu$ be an input data measure and $\mu^\varepsilon$ be a perturbed measure induced by unknown random perturbations. The corresponding label measures for $\mu$ and $\mu^\varepsilon$ are defined as $\nu = f_\#^\theta[\mu]$ and $\nu^\varepsilon = f_\#^\theta[\mu^\varepsilon]$, respectively. It is nontrivial to construct an explicit form of the perturbation function owing to the complexity of the neural network. Therefore, we assume that the mass of $\nu$ is transported along $(\mathrm{Id} + \varepsilon h)$, so that the perturbed measure $\nu^\varepsilon$ can be written as

$$\nu^\varepsilon = (\mathrm{Id} + \varepsilon h)_\#[\nu], \qquad \varepsilon > 0,\ \ \varepsilon \sim p_\varepsilon, \qquad (7)$$

where $h$ is a perturbation function defined on the feature space with magnitude $\varepsilon \sim p_\varepsilon$, and both are assumed to be unknown at training and testing time. In this context, we answer two crucial questions. First, how is the average perturbed distance $\mathbb{E}_\varepsilon W_{2,g}(\nu_k^\varepsilon, \mathcal{N}_\Sigma)$ related to the diffusion operator $\mathcal{L}$ (or $\delta_k$)? Second, what are the theoretical advantages of the proposed method over conventional deterministic models? The following propositions answer these questions.

**Proposition 5 (Wasserstein Perturbation).** Let $\nu_k^\varepsilon = (\mathrm{Id} + \varepsilon h)_\#[\nu_k]$ be the measure perturbed by $\varepsilon h$. Then, there exist numerical constants $0 \le \kappa_1, \kappa_2 < \infty$ such that the mean radius of the perturbed Wasserstein ambiguity set ($\nu_k^\varepsilon \in B_{W_2}(\mathcal{N}_\Sigma, r^*)$) is bounded as

$$\mathbb{E}_\varepsilon W_2(\nu_k^\varepsilon, \mathcal{N}_\Sigma) \le \sqrt{d \kappa_1 \kappa_2\, \mathbb{E}[\varepsilon]} + \delta_k. \qquad (8)$$

Consider a deterministic model in which label measures are Dirac-delta measures in the feature space, so that $\nu = \delta_y$, $\nu^\varepsilon = \delta_{y + \varepsilon h}$, and the target label measure is $\hat{\nu} = \delta_z$. In this case, the Wasserstein distance is equivalent to the Euclidean distance and has a deterministic upper bound: $\mathbb{E}_\varepsilon W_2(\nu^\varepsilon, \delta_z) = \mathbb{E}_\varepsilon\, d_E(y + \varepsilon h, z) \le \|h\|_2\, \mathbb{E}[\varepsilon] + d_E(y, z)$. For simplicity, assume that $d\kappa_1\kappa_2 = 1$ in (8) and $\|h\|_2 = 1$. Then, the Wasserstein uncertainties of the deterministic and stochastic models are given by

$$\underbrace{\mathbb{E}[\varepsilon] + d_E(y, z)}_{\text{Deterministic model}} \qquad \text{vs.} \qquad \underbrace{\sqrt{\mathbb{E}[\varepsilon]} + \delta_k}_{\text{Stochastic model}}. \qquad (9)$$

There are two major implications of (9) that verify the theoretical advantages of our method over conventional deterministic models. First, the proposed method efficiently reduces the randomness of the label information through $\delta_k$ in (9), whereas deterministic models are incapable of handling the stochastic properties of perturbations.
Second, the Wasserstein ambiguity is proportional to the square root of $\mathbb{E}[\varepsilon]$ in our stochastic model, which ensures a minor impact on the model in the presence of large perturbations (i.e., $\mathbb{E}[\varepsilon] \gg 1$).

To show the effect of the diffusion term on classification accuracy, we consider a binary classification task with a perturbed label measure $\nu^\varepsilon$, where the positive and negative target label measures are denoted by $\mathcal{N}_{\Sigma^+}$ and $\mathcal{N}_{\Sigma^-}$, respectively.

**Corollary 1 (Perturbed Binary Classification).** Let $\Sigma^+$ and $\Sigma^-$ be rank-$r$ SPD matrices, and let $\varepsilon \sim p_\varepsilon = \exp(b)$ be an exponential distribution with parameter $b$. Then, the probability that $\nu^\varepsilon$ is classified as positive is bounded as

$$P[\mathrm{cls}(\nu^\varepsilon) = 1] \ge 1 - e^{-\frac{b r (\lambda^+_{\max} + \lambda^-_{\max}) - 4 b \delta}{4 d \kappa_1 \kappa_2}}, \qquad (10)$$

where $\lambda^+_{\max}$ and $\lambda^-_{\max}$ denote the maximum eigenvalues of the matrices $\Sigma^+$ and $\Sigma^-$, respectively.

Corollary 1 shows that the probability of correct classification is maximized at an exponential rate if we can find optimal parameters $(\theta, \vartheta)$ attaining $\delta_k \to 0$. In Proposition 6, we consider an extreme case in which $Y_{t=k}$ is a stochastic process that follows the path induced by the Markov semi-group $P_k$ (defined in Section 2.1), whose law is the label measure $\nu_k = f_\#^{\theta_k}[\mu]$. In other words, each particle exactly follows the Ornstein-Uhlenbeck process, which is known to have an explicit path. This proposition verifies that the probability that the average squared norm of the perturbation function is large, $\nu_k(\mathbb{E}\|h\|_2^2 \ge a)$, can be efficiently minimized.

**Proposition 6 (Markov Inequality for the Perturbation Function).** Let $Y_k \sim \nu_k$ denote the Markov process related to the Markov semi-group, with corresponding law $\nu_k$. For the $l$-th component of the perturbation function, $h_l \in L^1(\nu_k)$, denote $T(y) = \|h(y)\|_2^2 < \infty$. Then, there are numerical constants $0 \le \kappa_3, \kappa_4 < \infty$ such that

$$\nu_k\Big(\mathbb{E}_y[T(Y_k)] \ge a\Big) \le \frac{\kappa_3}{a}\left(\frac{1}{2(e^{2k} - 1)} + 2\varepsilon^2\big(d\kappa_4 + k\delta_k\big)\right), \qquad (11)$$

for $y \in \mathbb{R}^d$. Furthermore, $\lim_{k \to \infty} \nu_k\big(\mathbb{E}[T(Y_k)] \ge a\big) \le \frac{2 e^2 \varepsilon^2 d \kappa_3 \kappa_4}{a}$.

## 4 Experiments

For empirical validation, we applied our method to two classification problems: 3D point cloud classification on the ModelNet10 dataset (28) and image classification on the CIFAR10 dataset (12), where the data suffer from various perturbations. Our experimental setting differs from previous work in two respects. First, we generated severe structural perturbations, including random rotations, random resizing, and non-homogeneous local noise. These perturbations differ from those considered in previous works, such as generalizations of adversarial examples, which simply add noise to the original images under a maximum bound in terms of the $L_p$-distance in pixel space. Second, we randomly changed the data representation at both training and testing time. This setting aims to model real-world situations in which unknown, severe random perturbations may exist in the data. The Supplementary Materials provide the specifications of the network architectures, the perturbation setup and samples, additional experiments, and further ablation studies.

### 4.1 Implementation Details

**Covariance Matrices.** Each target Gaussian measure is represented by a rank-$r$ degenerate covariance matrix, $\Sigma_c = M_c^T M_c$, where $M_c$ denotes an $(r \times d)$ random matrix for the $c$-th class whose entries are i.i.d. uniform random variables. For the 2D image classification experiments, we used rank-32 covariance matrices in $\mathrm{Sym}^+_{128}$ to represent the centered Gaussian measures. For the 3D point cloud classification experiments, we used rank-3 covariance matrices in $\mathrm{Sym}^+_{128}$. A sketch of this construction is given below.
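A possible realization of the target covariances $\Sigma_c = M_c^T M_c$ described above is sketched here; the uniform sampling range and the seed are not specified in the paper and are our own assumptions.

```python
import torch

def make_class_covariances(num_classes, d=128, rank=32, seed=0):
    """Illustrative construction of the fixed target covariances Sigma_c = M_c^T M_c,
    where M_c is a (rank x d) matrix with i.i.d. uniform entries (one per class).
    The uniform range [0, 1) and the seed are assumptions, not the paper's values."""
    gen = torch.Generator().manual_seed(seed)
    covs = []
    for _ in range(num_classes):
        m = torch.rand(rank, d, generator=gen)   # i.i.d. uniform entries
        covs.append(m.T @ m)                     # rank-`rank` PSD matrix in Sym^+_d
    return covs

# e.g., the CIFAR10 setting of Section 4.1: d = 128, rank-32 target measures
class_covs = make_class_covariances(num_classes=10, d=128, rank=32)
```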
To calculate the square roots of the covariance matrices efficiently, we used the GPU-friendly Newton-Schulz algorithm presented in (15). Because the computational complexity still grows quadratically even with this algorithm, we limited the maximum dimensionality of the Gaussian measures to $\dim(\mathcal{P}_{2,g}) = d(d+1)/2 \approx 8\mathrm{K}$, where $d = 128$.

**Hyperparameters and Training Setup.** We used the ADAM optimizer with a learning rate of $10^{-5}$ for the network $g$ and a learning rate of $10^{-3}$ for the network $f$ as well as for the baseline networks. All experiments were executed on a single RTX 2080 Ti GPU. Our method was implemented using PyTorch 1.4.0 and Python 3.6.

**Algorithm.** Algorithm 1 outlines the entire procedure of the proposed method.

**Algorithm 1: DeepWDC**

Require: neural networks $f$, $g$ with initial weights $\theta$, $\vartheta$; covariance matrices $\{\Sigma_n\}_{n=1}^N$ initialized with rank $s$; samples $x_n \sim \mu_n$.

For $k = 1$ to $K$ (the total number of training iterations):

1. Optimize the diffusion invariance term in (4):
   $\nabla^1_{\theta_k} \leftarrow \nabla_{\theta_k}\, \mathbb{E}_{x_n,\, Z \sim \mathcal{N}_I,\, t \sim U[0,1]} \frac{1}{t}\Big[ g\big(e^{-t} f^{\theta_k}(x_n) + \sqrt{1 - e^{-2t}}\, Z\big) - g\big(f^{\theta_k}(x_n)\big) \Big]$.
2. Estimate the sample covariance matrices:
   $\Sigma_n(\theta_k) \leftarrow \frac{1}{d-1}\big(f^{\theta_k}(x_n) - \mathbb{E}f^{\theta_k}(x_n)\big)^T \big(f^{\theta_k}(x_n) - \mathbb{E}f^{\theta_k}(x_n)\big)$.
3. Optimize the intrinsic distance term in (4):
   $\nabla^2_{\theta_k} \leftarrow \nabla_{\theta_k}\, \mathrm{Tr}\Big(\Sigma_n + \Sigma_n(\theta_k) - 2\big(\Sigma_n^{1/2}\, \Sigma_n(\theta_k)\, \Sigma_n^{1/2}\big)^{1/2}\Big)$.
4. Update $f$ by descending its stochastic gradients $\nabla^1_{\theta_k} + \nabla^2_{\theta_k}$.
5. Update $g$ by ascending its stochastic gradients
   $\nabla_{\vartheta_k} \leftarrow \nabla_{\vartheta_k}\, \mathbb{E}_{x_n,\, Z \sim \mathcal{N}_I,\, t \sim U[0,1]} \frac{1}{t}\Big[ g^{\vartheta_k}\big(e^{-t} f(x_n) + \sqrt{1 - e^{-2t}}\, Z\big) - g^{\vartheta_k}\big(f(x_n)\big) \Big]$.

### 4.2 Comparison with Deterministic Models

**2D Image Classification.** If the spatial information of 2D images is highly distorted by severe perturbations, the accuracy of conventional classification networks decreases significantly. In contrast, our method delivers accurate classification results under these conditions. 2D convolutions are highly vulnerable to rotations and local perturbations because their information integration depends on the spatial structure of 2D images; for example, pixels are integrated by convolutional kernels whose connectivity is conditioned on the 2D grid. Our method does not require such pre-conditioning because it uses the holistic (distributional) information of a group of pixels in the feature space, treated as a high-dimensional push-forward probability density. Therefore, our method can generate robust representations in convolutional feature spaces even in the presence of severe structural perturbations.

Table 1: 2D image classification accuracy (in %) on the CIFAR10 dataset with perturbations. For the perturbation types and amounts, we set parameters for impulse noise (ϵ), rotation (θ, θ2), random scaling (sc), random shearing (sh), and random crop (cc). For the deterministic perturbation, we used the CIFAR10-C benchmark dataset (10). "#Param." denotes the number of parameters in the network. The best results are presented in bold.

| Methods (#Param.) | {ϵ} | {θ, sc, sh, ϵ} | {θ2, sc, sh, ϵ} | {θ2, cc} | {θ2, cc2} | CIFAR10-C |
|---|---|---|---|---|---|---|
| ResNet (11.1M) | 89.4 | 73.6 | 73.4 | 76.9 | 77.9 | - |
| DenseNet (6.8M) | 86.6 | 76.3 | 75.9 | 78.9 | 74.8 | 92.6 |
| DeepWDC (6.7M) | **95.9** | **88.6** | **93.0** | **92.7** | **87.7** | **95.8** |

**3D Point Cloud Classification.** For 3D point cloud classification, we considered DGCNN (27) and PointNet++ (24) as baselines; the comparison results are summarized in Table 2. As shown in the table, the conventional networks are highly vulnerable to geometric perturbations and to jitter with high variance, because such perturbations induce drastic changes in the features.
Consequently, their accuracy decreases significantly as additional perturbation types are applied. By contrast, our method exhibits much smaller drops in accuracy in the presence of severe perturbations.

Table 2: 3D point cloud classification accuracy (in %) on the ModelNet10 dataset with perturbations. For the perturbations, we set parameters for random sampling (T), random scaling (s), random jitter (ϵ), and random rotation (θ). The best results are presented in bold.

| Methods (#Param.) | {T} | {T, s, ϵ} | {T, s, ϵ, m} | {T, s, ϵ2} | {T, s, ϵ3} | {T, s, ϵ3, θ} |
|---|---|---|---|---|---|---|
| PointNet++ (1.7M) | 96.6 | 71.7 | 67.3 | 47.2 | 35.7 | 25.8 |
| DGCNN (1.9M) | **97.1** | 83.6 | 79.1 | 68.0 | 54.8 | 35.4 |
| DeepWDC (0.96M) | 96.9 | **94.8** | **91.0** | **85.3** | **71.9** | **55.2** |

### 4.3 Ablation Study

We analyzed DeepWDC by evaluating each of its components. First, we examined the role of the diffusion invariance term in (4). As shown in Fig. 4(a), DeepWDC without the diffusion invariance term delivers low accuracy with increased variance, whereas using both $\mathbb{E}[\mathcal{L}g]$ and $W_{2,g}$ yields stable and accurate results. We can interpret the proposed diffusion invariance term as a distributional regularizer, in which a group of pixels in the feature space is forced to match a Gaussian density.

Figure 4: Effectiveness of the proposed DeepWDC. (a) Effectiveness of the diffusion invariance term: the blue curve represents the results of using both the diffusion invariance term $\mathbb{E}[\mathcal{L}g]$ and the intrinsic distance term $W_{2,g}$ in (4), whereas the red curve represents using the intrinsic distance term alone; the plot depicts the average accuracy with $m \pm 0.2\sigma$. (b) Classification results with different dimensionality of the proposed DeepWDC. Both experiments were conducted on 2D image classification with CIFAR10.

There is a canonical isomorphism between the Wasserstein Gaussian subspace and the space of positive semi-definite matrices (16) (i.e., $\mathcal{P}_{2,g}(\mathbb{R}^d) \cong \mathrm{Sym}^+_d$). Thus, the dimensionality of the covariance matrices is another key factor for accurate classification. We tested different dimensions $d = 16, 64, 128$ and ranks $r = 8, 32$. As shown in Fig. 4(b), setting the dimension to $\dim(\mathcal{P}_{2,g}) = d(d+1)/2 \approx 8\mathrm{K}$ with $d = 128$ and rank-32 label measures produced the best results.

## 5 Conclusion

In this paper, we proposed a novel classification method, called DeepWDC, in which input data and labels are represented as probability measures to address severe perturbations in the input data. DeepWDC outputs the optimal label measure in terms of diffusion invariance, where the label measures are stationary and become equivalent to Gaussian measures. Experimental results verified that DeepWDC significantly outperforms state-of-the-art classification methods on both 2D and 3D data in the presence of various types of perturbations.

## Acknowledgements

This work was supported by the Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIP) (No. 2017-0-01780).

## Broader Impact

The proposed framework can considerably enhance conventional classification methods, whose performance is very sensitive to various types of perturbations (e.g., rotations, impulse noise, and down-scaling). The proposed Wasserstein distributional classifier represents both input data and target labels as probability measures, and its diffusion-invariance property prevents the classifier from being affected by severe perturbations.
Hence, various research fields operating in real-world environments can benefit from our framework to obtain accurate classification results.

## References

[1] Luca Ambrogioni, Umut Güçlü, Yağmur Güçlütürk, Max Hinne, Marcel A. J. van Gerven, and Eric Maris. Wasserstein variational inference. In NeurIPS, 2018.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.0787, 2017.
[3] Dominique Bakry and Michel Émery. Diffusions hypercontractives. In Séminaire de Probabilités XIX 1983/84, pages 177-206. Springer, 1985.
[4] Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures-Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 37(2):165-191, 2019.
[5] Gwendoline De Bie, Gabriel Peyré, and Marco Cuturi. Stochastic deep networks. arXiv preprint arXiv:1811.07429, 2018.
[6] Charlie Frogner, Farzaneh Mirzazadeh, and Justin Solomon. Learning entropic Wasserstein embeddings. In ICLR, 2019.
[7] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A. Poggio. Learning with a Wasserstein loss. In NIPS, 2015.
[8] Rui Gao, Xi Chen, and Anton J. Kleywegt. Distributional robustness and regularization in statistical learning. arXiv preprint arXiv:1712.06050, 2017.
[9] Rui Gao, Liyan Xie, Yao Xie, and Huan Xu. Robust hypothesis testing using Wasserstein uncertainty sets. In NeurIPS, 2018.
[10] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019.
[11] Soheil Kolouri, Gustavo K. Rohde, and Heiko Hoffmann. Sliced Wasserstein distance for learning Gaussian mixture models. In CVPR, 2018.
[12] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[13] Tam Le, Makoto Yamada, Kenji Fukumizu, and Marco Cuturi. Tree-sliced variants of Wasserstein distances. In NeurIPS, 2019.
[14] Jaeho Lee and Maxim Raginsky. Minimax statistical learning with Wasserstein distances. In NeurIPS, 2018.
[15] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNNs for fine-grained visual recognition. In ICCV, 2017.
[16] Luigi Malagò, Luigi Montrucchio, and Giovanni Pistone. Wasserstein Riemannian geometry of Gaussian densities. Information Geometry, 1(2):137-179, 2018.
[17] Anton Mallasto and Aasa Feragen. Learning from uncertain curves: The 2-Wasserstein metric for Gaussian processes. In NIPS, 2017.
[18] Robert J. McCann. A convexity principle for interacting gases. Advances in Mathematics, 128(1):153-179, 1997.
[19] Boris Muzellec and Marco Cuturi. Generalizing point embeddings using the Wasserstein space of elliptical distributions. In NeurIPS, 2018.
[20] Boris Muzellec and Marco Cuturi. Subspace detours: Building transport plans that are optimal on subspace projections. In NeurIPS, 2019.
[21] Viet Anh Nguyen, Soroosh Shafieezadeh Abadeh, Man-Chung Yue, Daniel Kuhn, and Wolfram Wiesemann. Optimistic distributionally robust optimization for nonparametric likelihood approximation. In NeurIPS, 2019.
[22] Felix Otto and Cédric Villani. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173(2):361-400, 2000.
[23] François-Pierre Paty and Marco Cuturi. Subspace robust Wasserstein distances. In ICML, 2019.
[24] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017.
[25] Soroosh Shafieezadeh Abadeh, Viet Anh Nguyen, Daniel Kuhn, and Peyman Mohajerin Esfahani. Wasserstein distributionally robust Kalman filtering. In NeurIPS, 2018.
[26] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. In ICLR, 2018.
[27] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics, 38(5):146, 2019.
[28] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In CVPR, 2015.
[29] Hongteng Xu, Wenlin Wang, Wei Liu, and Lawrence Carin. Distilled Wasserstein learning for word embedding and topic modeling. In NeurIPS, 2018.
[30] Ruiyi Zhang, Changyou Chen, Chunyuan Li, and Lawrence Carin. Policy optimization as Wasserstein gradient flows. In ICML, 2018.