# robust_hypothesis_testing_using_wasserstein_uncertainty_sets__a20c3463.pdf

Robust Hypothesis Testing Using Wasserstein Uncertainty Sets

Rui Gao School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, GA 30332 rgao32@gatech.edu

Liyan Xie School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, GA 30332 lxie49@gatech.edu

Yao Xie School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, GA 30332 yao.xie@isye.gatech.edu

Huan Xu School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, GA 30332 huan.xu@isye.gatech.edu

We develop a novel computationally efﬁcient and general framework for robust hypothesis testing. The new framework features a new way to construct uncertainty sets under the null and the alternative distributions, which are sets centered around the empirical distribution deﬁned via Wasserstein metric, thus our approach is data-driven and free of distributional assumptions. We develop a convex safe approximation of the minimax formulation and show that such approximation renders a nearly-optimal detector among the family of all possible tests. By exploiting the structure of the least favorable distribution, we also develop a tractable reformulation of such approximation, with complexity independent of the dimension of observation space and can be nearly sample-size-independent in general. Real-data example using human activity data demonstrated the excellent performance of the new robust detector.

1 Introduction

Hypothesis testing is a fundamental problem in statistics and an essential building block for scientiﬁc discovery and many machine learning problems such as anomaly detection. The goal is to develop a decision rule or a detector which can discriminate between two (or multiple) hypotheses based on data and achieve small error probability. For simple hypothesis test, it is well-known from the Neyman-Pearson Lemma that the likelihood ratio between the distributions of the two hypotheses is optimal. However, in practice, when the true distribution deviates from the assumed nominal distribution, the performance of the likelihood ratio detector is no longer optimal and it may perform poorly.

Various robust hypothesis testing frameworks have been developed, to address the issue with distribution misspeciﬁcation and outliers. The robust detectors are constructed by introducing various uncertainty sets for the distributions under the null and the alternative hypotheses. In non-parametric setting, Huber s original work [13] considers the so-called ϵ-contamination sets, which contain distributions that are close to the nominal distributions in terms of total variation metric. The more recent works [17, 9] consider uncertainty set induced by Kullback-Leibler divergence around a nominal distribution. Based on this, robust detectors usually depend on the so-called least-favorable distributions (LFD). Although there has been much success in theoretical results, computation remains a

32nd Conference on Neural Information Processing Systems (Neur IPS 2018), Montréal, Canada.

major challenge in ﬁnding robust detectors and ﬁnding LFD in general. Existing results are usually only for the one-dimensional setting. In multi-dimensional setting, ﬁnding LFD remains an open question in the literature. In parametric setting, [1] provides a computationally efﬁcient and provably near-optimal framework for robust hypothesis testing based on convex optimization.

In this paper, we present a novel computationally efﬁcient framework for developing data-driven robust minimax detectors for non-parametric hypothesis testing based on the Wasserstein distance, in which the robust uncertainty set is chosen as all distributions that are close to the empirical distributions in Wasserstein distance. This is very practical since we do not assume any parametric form for the distribution, but rather let the data speak for itself . Moreover, the Wasserstein distance is a more ﬂexible measure of closeness between two distributions. The distance measures used in other non-parametric frameworks [13, 17, 9] are not well-deﬁned for distributions with non-overlapping support, which occurs often in (1) data-driven problems, in which we often want to measure the closeness between an empirical distribution and some continuous underlying true distribution, and (2) high-dimensional problems, in which we may want to compare two distributions that are of high dimensions but supported on two low-dimensional manifolds with measure-zero intersection.

To solve the minimax robust detector problem, we face at least three difﬁculties: (i) The hypothesis testing error probability is a nonconvex function of the decision variable; (ii) The optimization over all possible detectors is hard in general since we consider any inﬁnite-dimensional detector with nonlinear dependence on data; (iii) The worst-case distribution over the uncertainty sets is also an inﬁnite dimensional optimization problem in general. To tackle these difﬁculties, in Section 3, we develop a safe approximation of the minimax formulation by considering a family of tests with a special form that facilitates a convex approximation. We show that such approximation renders a nearly-optimal detector among the family of all possible tests (Theorem 1), and the risk of the optimal detector is closely related to divergence measures (Theorem 2). In Section 4, exploiting the structure of the least favorable distributions yielding from Wasserstein uncertainty sets, we derive a tractable and scalable convex programming reformulation of the safe approximation based on strong duality (Theorem 3). Finally, Section 5 demonstrates the excellent performance of our robust detectors using real-data for human activity detection.

2 Problem Set-up and Related Work

Let Ω Rd be the observation space where the observed random variable takes its values. Denote by P(Ω) be the set of all probability distributions on Ω. Let P1, P2 P(Ω) be our uncertainty sets associated with hypothesis H1 and H2. The uncertainty sets are two families of probability distributions on Ω. We assume that the true probability distribution of the observed random variable belongs to either P1 or P2. Given an observation ω of the random variable, we would like to decide which one of the following hypotheses is true

H1 : ω P1, P1 P1, H2 : ω P2, P2 P2.

A test for this testing problem is a (Lebesgue) measurable function T : Ω {1, 2}. Given an observation ω Ω, the test accepts hypotheses HT (ω) and rejects the other. A test is called simple, if P1, P2 are singletons.

The worst-case risk of a test is deﬁned as the maximum of the worst-case type-I and type-II errors

ϵ(T|P1, P2) := max sup P1 P1 P1{ω : T(ω) = 2}, sup P2 P2 P2{ω : T(ω) = 1} .

Here, without loss of generality, we deﬁne the risk to be the maximum of the two types of errors. Our framework can extend directly to the case where the risk is deﬁned as a linear combination of the Type-I and Type-II errors (as usually considered in statistics).

We consider the minimax robust hypothesis test formulation, where the goal is to ﬁnd a test that minimizes the worst-case risk. More speciﬁcally, given P1, P2 and ϵ > 0, we would like to ﬁnd an ϵ-optimal solution of the following problem

inf T :Ω {1,2} ϵ(T|P1, P2). (1)

We construct our uncertainty sets P1, P2 to be centered around two empirical distributions and deﬁned using the Wasserstein metric. Given two empirical distributions Qk = (1/nk) Pnk i=1 δbωk i , which are based on samples drawn from two underlying distributions respectively, where δω denotes the Dirac measure on ω. Deﬁne the sets using Wasserstein metric (of order 1):

Pk = {P P(Ω) : W(P, Qk) θk}, k = 1, 2, (2)

where θk > 0 speciﬁes the radius of the set, and W(P, Q) denotes the Wasserstein metric of order 1:

W(P, Q) := min γ P(Ω2)

n E(ω,ω ) γ [ ω ω ] : γ has mariginal distributions P and Q o ,

where is an arbitrary norm on Rn. We consider Wasserstein metric of order 1 for the ease of exposition. Intuitively, the joint distribution γ on the right-hand side of the above equation can be viewed as a transportation plan which transports probability mass from P to Q. Thus, the Wasserstein metric between two distributions equals the cheapest cost (measured in some norm ) of transporting probability mass from one distribution to the other. In particular, if both P and Q are ﬁnite-supported, the above minimization problem reduces to the transportation problem in linear programming. Wasserstein metric has recently become popular in machine learning as a way to measuring the distance between probability distributions, and has been applied to a variety of areas including computer vision [25, 16, 23], generative adversarial networks [2, 10], and distributionally robust optimization [6, 7, 4, 27, 26].

2.1 Related Work

We present a brief review on robust hypothesis test and related work. The most commonly seen form of hypothesis test in statistics is simple hypothesis. The so-called simple hypothesis test assuming that the null and the alternative distributions are two singleton sets. Suppose one is interested in discriminating between H0 : θ = θ0 and H1 : θ = θ1, when the data x is assumed to follow a distribution fθ with parameter θ. The likelihood ratio test rejects H0 when fθ1(x)/fθ0(x) exceeds a threshold. The celebrated Neyman-Pearson lemma says that the likelihood ratio is the most powerful test given a signiﬁcance level. In other words, the likelihood ratio test achieves the minimum Type-II error given any Type-I error. In practice, when the true distributions deviate from the two assumed distributions, especially in the presence of outliers, the likelihood ratio test is no longer optimal. The so-called robust detector aims to extend the simple hypothesis test to composite test, where the null and the alternative hypotheses include a family of distributions. There are two main approaches to the minimax robust hypothesis testing, one dates back to Huber s seminal work [13], and one is attributed to [17]. Huber considers composite hypotheses over the so-called ϵ-contamination sets which are deﬁned as total variation classes of distributions around nominal distribution, while the more recent work [17, 9] considers uncertainty sets deﬁned using the Kullback-Leibler (KL) divergence, and demonstrated various closed-form LFDs for one-dimensional setting. However, in the multi-dimensional setting, there remains the computational challenge to establish robust sequential detection procedures or to ﬁnd the LFD. Indeed, closed-form LFDs are found only for one-dimensional case (e.g, [12, 18, 9]). Moreover, classic hypothesis test is usually parametric in that the distribution functions under the null and the alternative are assumed to be belong to a family of distributions with certain parameters.

Recent works [8, 14] take a different approach from the classic statistical approach for hypothesis testing. Although robust hypothesis test are not mentioned, the formulation therein is essentially minimax robust hypothesis test, when the null and the alternative distributions are parametric with the parameters belong to certain convex sets. They show that when exponential function is used as a convex relaxation, the optimal detector corresponds to the likelihood ratio test between the two LFDs that are solved from a convex programming. Our work is inspired by [8, 14] and extends the state-of-the-art in several ways. First, we consider more general classes of convex relaxations, and show that they can produce a nearly-optimal detector for the original problem and admits an exact tractable reformulation for common convex surrogate loss functions. In contrast, the tractability of the framework in [8] relies heavily on the particular choice of the convex loss, because their parametric framework has stringent convexity requirement in distribution parameters which fails to hold for general convex loss even for Gaussian case, while our non-parametric framework only requires convexity in distribution which holds for general convex surrogates and imposes no conditions on the considered distributions. In addition, certain convex loss functions render a tighter nearly-optimal

detector than the one considered in [8]. Furthermore, the tractability of our framework is due to novel strong duality results Proposition 1 and Theorem 3. They are nontrivial, and to the best of our knowledge, cannot be obtained from extending strong duality results on robust hypothesis testing [8] and distributionally robust optimization (DRO) [4, 6, 7], as will be elaborated later. We ﬁnally remark that [24] also considered using Wasserstein metric for hypothesis testing and drew connections between different test statistics. Our focus is different from theirs as we consider Wasserstein metric mainly for the minimax robust formulation.

3 Optimal Detector

We consider a family of tests with a special form, which is referred as a detector. A detector φ : Ω R is a measurable function associated with a test Tφ which, for a given observation ω Ω, accepts H1 and rejects H2 whenever φ(ω) 0, otherwise accepts H2 and rejects H1. The restriction of problem (1) on the sets of all detectors is

inf φ:Ω R max sup P1 P1 P1{ω : φ(ω) < 0}, sup P2 P1 P2{ω : φ(ω) 0} . (3)

We next develop a safe approximation of problem (3) that provides an upper bound via convex approximations of the indicator function [22]. We introduce a notion called generating function.

Deﬁnition 1 (Generating function). A generating function ℓ: R R+ { } is a nonnegative valued, nondecreasing, convex function satisfying ℓ(0) = 1 and limt ℓ(t) = 0.

For any probability distribution P, it holds that P{ω : φ(ω) < 0} EP [ℓ( φ(ω))}]. Set

Φ(φ; P1, P2) := EP1[ℓ ( φ)(ω))] + EP2[ℓ φ(ω)].

We deﬁne the risk of a detector for a test (P1, P2) by

ϵ(φ|P1, P2) := sup P1 P1,P2 P2 Φ(φ; P1, P2).

It follows that the following problem provides an upper bound of problem (3):

inf φ:Ω R sup P1 P1,P2 P2 Φ(φ; P1, P2). (4)

We next bound the gap between (4) and (1). To facilitate discussion, we introduce an auxiliary function ψ, which is well-deﬁned due to the assumptions on ℓ:

ψ(p) := min t R [pℓ(t) + (1 p)ℓ( t)], 0 p 1.

For various generating functions ℓ, ψ admits a close-form expression. Table 1 lists some choices of generating function ℓand their corresponding auxiliary function ψ. Note that the Hinge loss (last row in the table) leads to the smallest relaxation gap. As we shall see, ψ plays an important role in our analysis, and is closely related to the divergence measure between probability distributions.

Table 1: Generating function (ﬁrst column) and its corresponding auxiliary function (second column), optimal detector (third column), and detector risk (fourth column).

ℓ(t) ψ(p) φ 1 1/2 infφ Φ(φ; P1, P2)

p(1 p) log p

p1/p2 H2(P1, P2) log(1 + exp(t))/ log 2 (p log p + (1 p) log(1 p))/ log 2 log(p1/p2) JS(P1, P2)/ log 2 (t + 1)2 + 4p(1 p) 1 2 p1 p1+p2 χ2(P1, P2) (t + 1)+ 2 min(p, 1 p) sgn(p1 p2) TV (P1, P2)

Theorem 1 (Near-optimality of (4)). For any distributions Q1 and Q2, and any non-empty uncertainty sets P1 and P2, whenever there exists a feasible solution T of problem (1) with objective value less than ϵ (0, 1/2), there exists a feasible solution φ of problem (4) with objective value less than ψ(ϵ).

Theorem 1 guarantees that the approximation (4) of problem (1) is nearly optimal, in the sense that whenever the hypotheses H1, H2 can be decided upon by a test T with risk less than ϵ, there exists a detector φ with risk less than ψ(ϵ). It holds regardless the speciﬁcation of P1 and P2. Figure 1 illustrates the value of ψ(ϵ) when ϵ (0, 1/2).

The next proposition shows that we can interchange the inf and sup operators. Hence, in order to solve (4), we can ﬁrst solve the problem of ﬁnding the best detector for a simple test (P1, P2), and then ﬁnding the least favorable distribution that maximizes the risk among those best detectors.

0 0.1 0.2 0.3 0.4 0.5 0

Figure 1: ψ(ϵ) as a function of ϵ Proposition 1. For the Wasserstein uncertainty sets P1, P2 speciﬁed in (2), we have

inf φ:Ω R sup P1 P1,P2 P2 Φ(φ; P1, P2) = sup P1 P1,P2 P2 inf φ:Ω R Φ(φ; P1, P2).

To establish Proposition 1, observe that the sets under inf and sup are: (i) both inﬁnitely dimensional, (ii) the Wasserstein ball is not compact in the space of probability measures, and (iii) the space of all tests φ is not endowed with a linear topological structure, so there is no readily applicable tools (such as Sion s minimax theorem used in [8]) to justify the interchange of inf and sup. Our proof strategy is to identify an approximate optimal detector for the sup inf problem on the left side of (5) using Theorem 3 (whose proof does not depend on the result in Proposition 1), and then verify it is also an approximate optimal solution for the original inf sup problem (4). We also note that such issue does not occur in the distributionally robust optimization setting, as their focus is to study only the inner supremum, while the outer inﬁmum in those problems are already ﬁnite-dimensional by deﬁnition (in fact it corresponds to decision variables in Rn).

The next theorem provides an expression of the optimal detector and its risk.

Theorem 2 (Optimal detector). For any distributions P1 and P2, let d Pk d(P1+P2) be the Radon-Nikodym derivative of Pk with respect to P1 + P2, k = 1, 2. Then

inf φ:Ω R Φ(φ; P1, P2) = Z

Ω ψ d P1 d(P1+P2) d(P1 + P2).

Deﬁne Ω0(P1, P2) := ω Ω: 0 < d Pk d(P1+P2)(ω) < 1, k = 1, 2 . Suppose there exists a well-deﬁned map t : Ω R such that

t (ω) arg mint R h d P1 d(P1+P2)(ω)ℓ( t) + d P2 d(P1+P2)(ω)ℓ(t) i .

Then φ (ω) := t (ω) is an optimal detector for the simple test.

Remark 1. By deﬁnition, ψ(0) = ψ(1) = 0. Then the inﬁmum depends only on the value of P1, P2 on Ω0, the subset of Ωon which P1 and P2 are absolutely continuous with respect to each other:

inf φ:Ω R Φ(φ; P1, P2) = Z

Ω0 ψ d P1 d(P1+P2) d(P1 + P2).

This is intuitive as we can always ﬁnd a detector φ such that its value is arbitrarily close to zero on Ω\ Ω0. In particular, if P1 and P2 have measure-zero overlap, then infφ Φ(φ; P1, P2) equals zero, that is, the optimal test for the simple test (P1, P2) has zero risk.

Optimal detector φ . Set pk := d Pk/(d(P1 +P2)) on Ω0, k = 1, 2. For the four choices of ψ listed in Table 1, the optimal detectors φ on Ω0 are listed in the third column, where sgn denotes the sign function. The ﬁrst one has been considered in [1].

Relation between divergence measures and the risk of the optimal detector. The term R

Ωψ d P1 d(P1+P2) d(P1 + P2) can be viewed as a measure of closeness between probability distributions. Indeed, in the fourth column of Table 1 we show that the smallest detector risk for a simple test P1 vs. P2 equals the negative of some divergence between P1 and P2 up to a constant, where H, JS, , and TV represent respectively the Hellinger distance [11], Jensen-Shannon divergence [20],

triangle discrimination (symmetric χ2-divergence) [28], and Total Variation metric [28]. It follows from Theorem 2 that

sup P1 P1,P2 P2 inf φ:Ω R Φ(φ; P1, P2) = sup P1 P1,P2 P2

Ω ψ d P1 d(P1+P2) d(P1 + P2). (5)

The objective on the right-hand side is concave in (P1, P2) since by Theorem 2, it equals to the inﬁmum of linear functions Φ(φ; P1, P2) of (P1, P2). Problem (5) can be interpreted as ﬁnding two distributions P 1 P1 and P 2 P2 such that the divergence between P 1 and P 2 is minimized. This makes sense in that the least favorable distribution (P 1 , P 2 ) should be as close to each other as possible for the worst-case hypothesis test scenario.

4 Tractable Reformulation

In this section, we provide a tractable reformulation of (5) by deriving a novel strong duality result. Recall in our setup, we are given two empirical distributions Qk = 1 nk Pnk i=1 δbωi k, k = 1, 2. To unify notation, for l = 1, . . . , n1 + n2, we set

ωl = bωl 1, 1 l n1, bωl n1 2 , n1 + 1 l n1 + n2,

and set bΩ:= {ωl : l = 1, . . . , n1 + n2}. Theorem 3 (Convex equivalent reformulation). Problem (5) with P1, P2 speciﬁed in (2) can be equivalently reformulated as a ﬁnite-dimensional convex program

max p1,p2 Rn1+n2 + γ1,γ2 R(n1+n2) + R(n1+n2) +

l=1 (pl 1 + pl 2)ψ pl 1 pl 1+pl 2

m=1 γlm k ωl ωm θk, k = 1, 2,

m=1 γlm 1 = 1

n1 , 1 l n1,

m=1 γlm 1 = 0, n1 + 1 l n1 + n2,

m=1 γlm 2 = 0, 1 l n1,

m=1 γlm 2 = 1

n2 , n1 + 1 l n1 + n2,

l=1 γlm k = pm k , 1 m n1 + n2, k = 1, 2.

Theorem 3, combining with Proposition 1, indicates that problem (4) is equivalent to problem (6). We next explain various elements in problem (6).

Decision variables. pk can be identiﬁed with a probability distribution on bΩ, because P

lm γlm k = 1, and γk can be viewed as a joint probability distribution on bΩ2 with marginal distributions Qk and pk. We can eliminate variables p1, p2 by substituting pk with γk using the last constraint, so that γ1, γ2 are the only decision variables.

Objective. The objective function is identical to the objective function of (5), and thus we are maximizing a concave function of (p1, p2). If we substitute pk with γk, then the objective function is also concave in (γ1, γ2).

Constraints. The constraints are all linear. Note that ωl are parameters, but not decision variables, thus ωl ωm can be computed before solving the program. The constraints all together are equivalent to the Wasserstein metric constraints W(Qk, pk) θk.

Strong duality. Problem (6) is a restriction of problem (4) in the sense that they have the same objective but (4) restricts the feasible region to the subset of distributions that are supported on a

subset bΩ. Nevertheless, Theorem 3 guarantees that the two problems has the same optimal value, because there exists a least favorable distribution supported on bΩ, as explained below.

Intuition on the reformulation. We here provide insights on the structural properties of the least favorable distribution that explain why the reduction in Theorem 3 holds. The complete proof of Theorem 3 can be found in Appendix. Suppose Qk = δbωk, k = 1, 2, Ω= Rd and ψ(p) = 2 p

p(1 p). Note that Wasserstein metric measures the cheapest cost (measured in ) of transporting probability mass from one distribution to the other. Thus, based on the discussion in Section 3, the goal of problem (5) is to move (part of) the probability mass on ˆω1 and ˆω2 such that the negative divergence between the resulting distributions is maximized. The following three key observations demonstrate how to move the probability mass in a least favorable way.

Figure 2: Illustration of the least favorable distribution: it is always better off to move the probability mass from bω1 and bω2 to an identical point ω on the line segment connecting bω1, bω2.

(i) Consider feasible solutions of the form

(P1, P2) = (1 p1)δbω1 + p1δω1, (1 p2)δbω2 + p2δω2 , ω1, ω2 Ω\ {bω1, bω2}.

Namely, (P1, P2) is obtained by moving out probability mass pk > 0 from bωk to ωk, k = 1, 2 (see Figure 2). It follows that the objective value Z

Ω ψ( d P1 d(P1+P2))d(P1 + P2) = 2 p1p2, if ω1 = ω2, 0, o.w.

This is consistent with Remark 1 in that the objective value vanishes if the supports of P1, P2 do not overlap. Moreover, when ω1 = ω2, the objective value is independent of their common value ω = ω1 = ω2. Therefore, we should move probability mass out of resources bω1, bω2 to some common region, which contain points that receive probability mass from both resources.

(ii) Motivated by (i), we consider solutions of the following form

(P1, P2) = (1 p1)δbω1 + p1δω, (1 p2)δbω2 + p2δω , ω Ω\ {bω1, bω2},

which has the same objective value 2 p1p2. In order to save the budget for the Wasserstein metric constraint, i.e., to minimize the transport distance

p1 ω1 bω1 + p2 ω2 bω2 ,

by triangle inequality we should choose ω1 = ω2 = ω to be on the line segment connecting bω1 and bω2 (see Figure 2).

(iii) Motivated by (ii), we consider solutions of the following form

P k = (1 pk p k)δbωk + p1δωk + p 1δω k, k = 1, 2,

where ωk, ω k / Ω\ {bωk} are on the line segment connecting bω1 and bω2, k = 1, 2. Then the objective value is maximized at ω1 = ω 1 = bω2, ω2 = ω 2 = bω1, and equals 2 p

(p1 + p 1)(p2 + p 2) + 2 p

(1 p1 p 1)(1 p2 p 2). Hence it is better off to move out probability mass from bω1 to bω2 and from bω2 to bω1.

Therefore, we conclude that there exist a least favorable distribution supported on bΩ. The argument above utilizes Theorem 2, the triangle inequality of a norm and the concavity of the auxiliary function ψ. The compete proof can be viewed as a generalization to the inﬁnitesimal setting.

Complexity. Problem (6) is a convex program which maximizes a concave function subject to linear constraints. We brieﬂy comment on the complexity of solving (6) in terms of the dimension of the observation space and the sample sizes:

(i) The complexity of (6) is independent of the dimension d of Ω, since we only need to compute pairwise distances ωl ωm as an input to the convex program.

(ii) The complexity in terms of the sample sizes n1, n2 depends on the objective function and can be nearly sample size-independent when the objective function is Lipschitz in ℓ1 norm (equivalently,

the ℓ norm of the partial derivative is bounded). The reasons are as follows. In this case, after eliminating variables p1, p2, we end up with a convex program involving only γ1, γ2, and the Lipschitz constant of the objective with respect to γ is identical to that with respect to p. Observe that the feasible region of each γk is a subset of the ℓ1-ball in R(n1+n2) + . Then according to the complexity theory of the ﬁrst order method for convex optimization [3], when the objective function is Lipschitz in ℓ1 norm, the complexity is O(ln(n1) + ln(n2)). Notice that this is true for all except for the ﬁrst case in Table 1. Hence, this is a quite general.

We ﬁnally remark that extending previous strong duality results on DRO [4, 6, 7] from one Wasserstein ball to two Wasserstein balls does not lead to an immediately tractable (convex) reformulation. For one thing, simply applying those previous results on the inner supremum in (4) does not work, because after doing so we are left with the outer inﬁmum that is still intractable. For another thing, applying the previous methodology onto problem (5) does not lead to an tractable reformulation either, mainly because the objective function R

Ωψ( d P1 d(P1+P2))d(P1 + P2) is not separable in P1 and P2, but depends on the density on the common support of P1 and P2. Thus, as argued in Section 4, in the least-favorable distribution the probability mass of the two distributions cannot be transported arbitrarily, but are linked via their common support. In contrast, the problems in DRO [4, 6, 7] have no such linking constraints, which makes it hard to extend the previous methodology. Instead, we prove the strong duality from scratch and provide new insights on the structural properties of the least-favorable distribution that are different in nature from that in DRO settings.

5 Numerical Experiments

In this section, we demonstrate the performance of our robust detector using real data for human activity detection. We adopt a dataset released by the Wireless Sensor Data Mining (WISDM) Lab in October 2013. The data in this set were collected with the Actitracker system, which is described in [19, 29, 15]. A large number of users carried an Android-based smartphone while performing various everyday activities. These subjects carried the Android phone in their pocket and were asked to walk, jog, ascend stairs, descend stairs, sit, and stand for speciﬁc periods of time.

The data collection was controlled by an application executed on the phone. This application is able to record the user s name, start and stop the data collection, and label the activity being performed. In all cases, the accelerometer data is collected every 50ms, so there are 20 samples per second. There are 2,980,765 recorded time-series in total. The activity recognition task involves mapping time-series accelerometer data to a single physical user activity [29]. Our goal is to detect the change of activity in real-time from sequential observations. Since it is hard to model distributions for various activities, traditional parametric methods do not work well in this case.

For each person, the recorded time-series contains the acceleration of the sensor in three directions. In this setting, every ωl is a three-dimensional vector (al x, al y, al z). We set θ1 = θ2 = θ as the sample sizes are identical, and θ is chosen such that the quantity 1 1/2 infφ Φ(φ; P 1 , P 2 ) in Table 1, or equivalently, the divergence between P 1 and P 2 , is close to zero with high probability if Q1 and Q2 are bootstrapped from the data before change, where P 1 , P 1 is the LFD yielding from (6). The intuition is that we want the Wasserstein ball to be large enough to avoid false detection while still have separable hypotheses (so the problem is well-deﬁned).

We compare our robust detector, when coupled with CUSUM detector using a scheme similar to [5], with the Hotelling T2 control chart, which is a traditional way to detect the mean and covariance change for the multivariate case. The Hotelling control chart plots the following quantity [21]:

T 2 = (x µ) Σ 1(x µ),

where µ and Σ are the sample mean and sample covariance obtained from training data.

As shown in Fig. 3 (a), in many cases, Hotelling T2 fails to detect the change successfully and our method performs pretty well. This is as expected since the change is hard to capture via mean and covariance as Hotelling does.

Moreover, we further test the proposed robust detector, φ = 1

2 ln(p 1/p 2) and φ = sgn(p 1 p 2), on 100 sequences of data. Here p 1 and p 2 are the LFD computed from the optimization problem (6). For each sequence, we choose the threshold for detection by controlling the type-I error. Then we compare the average detection delay of the robust detector and the Hotelling T2 control chart, as

4520 4554 4589 4619 0

optimal detector

4520 4554 4589 4619 0

Hotelling control chart

4520 4554 4589 4619 sample index

Post-change

Post-change

Post-change

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 type-I error

average detection delay

Hotelling control chart detector ?$ = sgn(p$

2) detector ?$ = 1

Figure 3: Comparison of the detector φ = 1

2 ln(p 1/p 2) with Hotelling control chart: (a): Upper: the proposed optimal detector; Middle: the Hotelling T2 control chart; Lower: the raw data, here we plot (a2 x + a2 y + a2 z)1/2 for simple illustration. The dataset is a portion of full observations from the person indexed by 1679, with the pre-change activity jogging and post-change activity walking. The black dotted line at index 4589 indicates the boundary between the pre-change and post-change regimes. (b): The average detection delay v.s. type-I error. The average is taken over 100 sequences of data.

shown in 3 (b). The robust detector has a clear advantage, and the sgn(p 1 p 2) indeed has better performance than 1

2 ln(p 1/p 2), consistent with our theoretical ﬁnding.

6 Conclusion

In this paper, we propose a data-driven, distribution-free framework for robust hypothesis testing based on Wasserstein metric. We develop a computationally efﬁcient reformulation of the minimax problem which renders a nearly-optimal detector. The framework is readily extended to multiple hypotheses and sequential settings. The approach can also be extended to other settings, such as constraining the Type-I error to be below certain threshold (as the typical statistical test of choosing the size or signiﬁcance level of the test), or considering minimizing a weighed combination of the Type-I and Type-II errors. In the future, we will study the optimal selection of the size of the uncertainty sets leveraging tools from distributionally robust optimization, and test the performance of our framework on large-scale instances.

[1] Anatoli Juditsky Alexander Goldenshluger and Arkadi Nemirovski. Hypothesis testing by convex optimization. Electron. J. Statist., 9(2):1645 1712, 2015.

[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. ar Xiv preprint ar Xiv:1701.07875, 2017.

[3] Ahron Ben-Tal and Arkadi Nemirovski. Lectures on modern convex optimization: analysis, algorithms, and engineering applications, volume 2. Siam, 2001.

[4] Jose Blanchet and Karthyek RA Murthy. Quantifying distributional model risk via optimal transport. ar Xiv preprint ar Xiv:1604.01446, 2016.

[5] Yang Cao and Yao Xie. Robust sequential change-point detection by convex optimization. In International Symposium on Information Theory (ISIT), 2017.

[6] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, pages 1 52, 2015.

[7] Rui Gao and Anton J Kleywegt. Distributionally robust stochastic optimization with wasserstein distance. ar Xiv preprint ar Xiv:1604.02199, 2016.

[8] Alexander Goldenshluger, Anatoli Juditsky, Arkadi Nemirovski, et al. Hypothesis testing by convex optimization. Electronic Journal of Statistics, 9(2):1645 1712, 2015.

[9] Gokhan Gul and Abdelhak M. Zoubir. Minimax robust hypothesis testing. IEEE Transactions on Information Theory, 63(9):5572 5587, 2017.

[10] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5769 5779, 2017.

[11] Ernst Hellinger. Neue begründung der theorie quadratischer formen von unendlichvielen veränderlichen. Journal für die reine und angewandte Mathematik, 136:210 271, 1909.

[12] Peter J Huber. Robust statistics. 1981.

[13] Peter J Huber. A robust version of the probability ratio test. The Annals of Mathematical Statistics, 36(6):1753 1758, 1965.

[14] Anatoli Juditsky and Arkadi Nemirovski. Hypothesis testing via afﬁne detectors. Electron. J. Statist., 10(2):2204 2242, 2016.

[15] Jennifer R Kwapisz, Gary M Weiss, and Samuel A Moore. Activity recognition using cell phone accelerometers. ACM Sig KDD Explorations Newsletter, 12(2):74 82, 2011.

[16] Elizaveta Levina and Peter Bickel. The earth mover s distance is the mallows distance: Some insights from statistics. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 2, pages 251 256. IEEE, 2001.

[17] B. C. Levy. Robust hypothesis testing with a relative entropy tolerance. IEEE Transactions on Information Theory, 55(1):413 421, 2009.

[18] Bernard C Levy. Principles of signal detection and parameter estimation. Springer Science & Business Media, 2008.

[19] Jeffrey W Lockhart, Gary M Weiss, Jack C Xue, Shaun T Gallagher, Andrew B Grosner, and Tony T Pulickal. Design considerations for the wisdm smart phone-based sensor mining architecture. In Proceedings of the Fifth International Workshop on Knowledge Discovery from Sensor Data, pages 25 33. ACM, 2011.

[20] Christopher D Manning, Christopher D Manning, and Hinrich Schütze. Foundations of statistical natural language processing. MIT press, 1999.

[21] Douglas C Montgomery. Introduction to statistical quality control. John Wiley & Sons (New York), 2009.

[22] Arkadi Nemirovski and Alexander Shapiro. Convex approximations of chance constrained programs. SIAM Journal on Optimization, 17(4):969 996, 2006.

[23] Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435 446. Springer, 2011.

[24] Aaditya Ramdas, Nicolás García Trillos, and Marco Cuturi. On wasserstein two-sample testing and related families of nonparametric tests. Entropy, 19(2):47, 2017.

[25] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover s distance as a metric for image retrieval. International journal of computer vision, 40(2):99 121, 2000.

[26] Soroosh Shaﬁeezadeh-Abadeh, Peyman Mohajerin Esfahani, and Daniel Kuhn. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems, pages 1576 1584, 2015.

[27] Aman Sinha, Hongseok Namkoong, and John Duchi. Certiﬁable distributional robustness with principled adversarial training. ar Xiv preprint ar Xiv:1710.10571, 2017.

[28] Flemming Topsoe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on information theory, 46(4):1602 1609, 2000.

[29] Gary M Weiss and Jeffrey W Lockhart. The impact of personalization on smartphone-based activity recognition. In AAAI Workshop on Activity Context Representation: Techniques and Languages, pages 98 104, 2012.