# Representation Learning via Adversarially-Contrastive Optimal Transport

Anoop Cherian (Mitsubishi Electric Research Labs, Cambridge, MA) and Shuchin Aeron (Tufts University, Medford, MA). Correspondence to: Anoop Cherian.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020.

Abstract

In this paper, we study the problem of learning compact (low-dimensional) representations of sequential data that capture its implicit spatio-temporal cues. To maximize the extraction of such informative cues from the data, we set the problem within the context of contrastive representation learning and to that end propose a novel objective via optimal transport. Specifically, our formulation seeks a low-dimensional subspace representation of the data that jointly (i) maximizes the distance of the data (embedded in this subspace) from an adversarial data distribution under the optimal transport, a.k.a. the Wasserstein distance, (ii) captures the temporal order, and (iii) minimizes the data distortion. To generate the adversarial distribution, we propose a novel framework connecting Wasserstein GANs with a classifier, allowing a principled mechanism for producing good negative distributions for contrastive learning, which is currently a challenging problem. Our full objective is cast as a subspace learning problem on the Grassmann manifold and solved via Riemannian optimization. To empirically study our formulation, we provide experiments on the task of human action recognition in video sequences. Our results demonstrate competitive performance against challenging baselines.

1. Introduction

Recent advancements in deep neural network architectures (Greff et al., 2016; Merity et al., 2018; Sutskever et al., 2014; Zilly et al., 2017; Varol et al., 2017; Chung et al., 2015; Kim et al., 2019) have resulted in significant progress towards our ability to model and reason over sequential data. However, this problem is far from solved and continues to be challenging, especially in high-dimensional spatio-temporal settings. There are several practical issues that lead to this difficulty, notably (i) most of the highly successful neural network models operate on data of a fixed length (such as images), whereas temporally-evolving data, such as the frame-level features in a video recognition task, can be of arbitrary temporal length, and (ii) the data may be entangled with nuisance factors, for example features corresponding to temporally-evolving background clutter, which may make inference difficult. While it may be possible to extend popular deep architectures to address these challenges, they may be computationally heavy or require large-scale annotated data, which may be difficult to gather. We refer the reader to several papers (Dollar et al., 2005; Varol et al., 2016; Wang & Gupta, 2015; Tran et al., 2017; Wu et al., 2017; Girdhar et al., 2019) illustrating the wide range of approaches undertaken to tackle this challenging problem. In this paper, we address these issues by learning a compact representation of a given video sequence of arbitrary length that maximally captures the spatio-temporal information, while at the same time can be effectively fed to a lightweight classifier for action recognition.
We approach this representation learning problem from the perspective of contrastive learning (Saunshi et al., 2019; van den Oord et al., 2018; Bose et al., 2018; Tschannen et al., 2019; Wang & Cherian, 2018), which has recently emerged as a flexible yet powerful tool for learning generic representations of high-dimensional data using positive and negative examples. The key idea in contrastive learning is to produce a representation that is closer to the positive (given data) examples and farther from the negatives. Usually, the negative examples are randomly picked. However, the performance of these algorithms is often seen to depend heavily on the choice of the negatives and their pairings with the positives (Arora et al., 2019; Bose et al., 2018).

Suppose we are given a set of negative examples to work with. As the performance of contrastive learning depends on the coupling between positive and negative examples, we propose a novel contrastive learning objective that simultaneously optimizes for learning a representation via a tightly coupled interplay between four key components, namely (i) generating a (probabilistic/soft) coupling with the negative data by minimizing the optimal transport (Wasserstein) distance (Santambrogio, 2015) between the positive and negative examples, (ii) a representation learning cost that seeks a low-dimensional data subspace such that the projections of the data onto this subspace maximize their distance from the coupled negative examples, (iii) a distortion penalty that prevents the subspace projections from diverging too much from the input data, and (iv) an ordering constraint on the subspace projections that captures the sequentiality of the input.

Figure 1. Our overall architecture. The input frames are first encoded into a set X of ordered feature vectors x_t using a (pre-trained) neural network. These features are then used in an adversarial noise generator (implemented using a Wasserstein GAN trained for adversarial losses) to generate a set Y of adversarial noise samples. Next, we use X and Y in a joint optimal transport and representation learning formulation that tries to (i) minimize the optimal coupling π between the two sets, while (ii) also learning a subspace U that maximizes the distance between the projections of X onto U and the adversarial noise Y. This latter cost also includes distortion and ordering penalties. Illustratively, we depict the useful dimensions of the input in red (or its variants); the idea is that the representation U can filter this dimension via contrasting against the noise.

Having outlined our novel approach for simultaneously learning the coupling between the positive and negative examples and the (contrastive) representation, we turn towards addressing the generation of good negative examples. As pointed out in (Tschannen et al., 2019), the design of a good negative distribution and its impact on contrastive learning is not yet fully addressed. In contrast to existing works such as (Hénaff et al., 2019; Oord et al., 2018; Sun et al., 2019; Wang & Gupta, 2015), we show that, when coupled with the objective to learn a contrastive representation, one must allow a high discrimination, with respect to the intended task, from the positive distribution while maintaining good correlation.
We resolve these conflicting objectives in learning the contrastive distribution by trading off conditional distributional discrepancy against a (given) classifier's accuracy on the contrastive distribution. Specifically, we use a Wasserstein GAN that takes as input positive examples from the dataset and learns to generate new samples (conditioned on the positives), where these generated samples, when added to their respective positive examples, ensure two properties: (i) the modified positives must be as close in distribution as possible to the true positives in the dataset, and (ii) a classifier that has good performance on the true positives must have a high misclassification rate on these modified positives. Figure 1 pictorially depicts our complete framework. We summarize our novel contributions below.

1. We propose a representation learning cost that naturally blends contrastive learning, adversarial learning, optimal transport, and Riemannian geometry into one framework.

2. We show that our objective for learning a contrastive representation, while completely differing in its aims, is related to the subspace robust optimal transport distances proposed in (Paty & Cuturi, 2019). We characterize this relation in Theorem 1, thereby making a novel connection between contrastive learning and robust optimal transport.

3. Further, we present an adversarial distribution learning setup within which it is possible to optimize over and regulate the choice of distribution for negative samples, which is known to be critical for contrastive representation learning.

4. We apply this framework to the problem of learning representations for classifying action video sequences and obtain promising classification results.

2. Related Work

Our approach has several parallels and similarities with existing works, which we systematically outline below. Contrastive estimation (Smith & Eisner, 2005) and noise-contrastive estimation (Gutmann & Hyvärinen, 2010) are popular representation learning methods that learn via an objective that contrasts data likelihood under the model against the likelihood of noise or implicitly-constructed contrastive (i.e., negative) examples. Popular deep metric learning and triplet losses (Hoffer & Ailon, 2015) can also be considered variants of contrastive learning when explicit pairwise relationships between data samples are available. As empirically shown in many works and recently theoretically argued in (Arora et al., 2019), contrastive learning can reduce the sample complexity of downstream tasks. There have been recent efforts at unifying general representation learning methods and contrastive methods (Hoffer & Ailon, 2015; Arora et al., 2019), such as via the maximum InfoNCE principle (Tschannen et al., 2019; Oord et al., 2018), in which the main idea is to implicitly maximize the mutual information (Cover & Thomas, 2006) between the learned representation and the original data. However, these formulations assume a good approximation of mutual information from the given samples, a problem that is typically hard. Recently, using different forms of variational characterizations of mutual information (Poole et al., 2019), efficient estimators have been proposed. Nevertheless, the resulting optimization and training may become unstable (Song & Ermon, 2019).
Among works based on maximizing InfoNCE, one similar to ours is (Oord et al., 2018), which proposes contrastive predictive coding through a density ratio capturing the mutual information between the representation and future samples of the sequence. They use noise-contrastive estimation on their representation learning cost, where the noise distribution is chosen to be plausibly unrelated to the input. Our proposed framework is different from theirs in that we explicitly seek a negative distribution that can potentially increase the contrastiveness of useful cues via learning to generate these hard negative samples. Attempts towards improving the negative sampler in contrastive learning have been made in (Bose et al., 2018). There, the authors propose to use a mixture of unconditional and conditional negative distributions, conditioned on given data and parametrized via an implicit generative model, however with a totally different loss function from ours. In contrast, we achieve the required discrimination using a classifier, while our proposed variant of WGAN captures the similarity of the negatives to the given data. Our work is also related to discriminative pooling (Wang & Cherian, 2019), which proposes to generate negative samples by passing random noise through an image-trained CNN, but is not adversarial. In (Wang & Cherian, 2018), the authors propose an adversarial setup in a discriminative representation learning framework, but use a deterministic deep model to learn a single adversarial sample per data point. Instead, our proposed framework can generate distributions of adversarial samples. Another paper related to ours is (Cherian et al., 2017), which proposes to use subspace representations for video sequences. While we also use their temporal learning constraints, our optimal transport and adversarial distribution learning offer a richer representation learning setup via suppressing perhaps false-positive temporal features, such as temporally-evolving noise.

3. Problem Formulation

In this section, we describe our problem setup, introduce our notation, and review some prior work on which we build our proposed algorithms. Following standard notation, we use uppercase boldface letters X to denote matrices and lowercase boldface x to denote vectors. Refer to Figure 1 for contextualizing our notation and variables in the sequel.

Suppose we are given a set of N data sequences D = {X_1, X_2, ..., X_N}, where each X_i = <x_1^i, x_2^i, ..., x_{n_i}^i> is a sequence of n_i ordered feature vectors and each x_t ∈ R^d. Further, we assume X_i is associated with a ground-truth class label ℓ_i ∈ L, with L denoting a given set of labels. We also assume each x ∈ X is an independent sample from a data distribution P_D(X̄) conditioned on the mean X̄ of the sequence X.¹ Note that we do not make any explicit assumption on what these sequences represent. For example, in a video recognition application, each x_t could be the output of a frame-level deep neural network that is trained on individual frames against their respective video label (Simonyan & Zisserman, 2014). As each feature x_t in a sequence X is assumed to be generated independently without accessing the rest of the sequence, these individual features could be noisy; for example, they could be entangled with irrelevant features from the background. Our key assumption is that the useful temporally-evolving features belong to subspaces of this d-dimensional feature space.
Thus, our main idea is to design an objective that can extract these subspaces in a compact form, denoted U(X), such that by using this representation some suitable empirical sequence recognition loss L_D is minimized, where:

$$\mathcal{L}_D := \sum_{i=1}^{N} \mathcal{L}_C(U_i, \ell_i), \quad \text{with} \quad U_i = \arg\min_{U} \mathcal{L}_R(U(X_i)). \qquad (1)$$

The loss L_D aggregates the error L_C in training a classifier C on the representations U_i for each sequence against its ground-truth label ℓ_i. Further, the U_i's are obtained via optimizing a sequence-level representation learning objective captured by the loss L_R. In a classical feature learning setup (Simonyan & Zisserman, 2014; Carreira & Zisserman, 2017), L_R finds a vector U that minimizes, say, the mean-squared error to the data samples, which boils down to the average feature; in that case, the arg min optimization is merely the average pooling scheme. In this paper, we generalize this pooling for richer and better representation learning. While we could easily train for the two losses L_C and L_R jointly in an end-to-end manner (Wang & Cherian, 2019), in this work we deal with them separately so that we have better control over each of them. In the next few sections, we look deeper into the representation loss using a contrastive learning framework. We describe the classifier loss L_C in Sec. 3.6.

¹We may use any other central tendency of the sequence to define this distribution.

3.1. Contrastive Learning via Optimal Transport

Suppose we treat a sequence X_i as a set of positive examples (ignoring the order within), and assume that we have access to a set of negative examples, denoted Y_i = <y_1^i, y_2^i, ..., y_m^i>, with each y ∈ R^d drawn from P_Y. In noise contrastive estimation, P_Y is typically assumed to be either uniform or Gaussian noise. The goal of contrastive learning in this setup is to find a suitable representation of X that is maximally distant from Y. How should we characterize this contrastiveness? Given that we are working with sets of data points under the assumption that they are random samples from underlying probability distributions, a natural possibility is to consider the Optimal Transport (OT), also known as the Wasserstein distance, between the two distributions (Santambrogio, 2015). Recall that the Wasserstein distance, denoted W_c(µ, ν), between two probability measures µ and ν that are both supported on R^d, with respect to a ground cost c(x, y), x, y ∈ R^d, is given by (Santambrogio, 2015):

$$W_c(\mu, \nu) := \inf_{\pi \in \Pi(\mu,\nu)} \mathbb{E}_{(x,y) \sim \pi}\, c(x, y), \qquad (2)$$

where Π(µ, ν) denotes the set of all couplings (joint probability distributions) with marginals µ and ν. Typically, in the absence of any other information and when the measures are discrete, µ and ν are chosen to be uniform distributions over the support. Specific to our case, given positive examples X, we let µ_X be the empirical distribution (with equal, i.e., uniform probability) over x_t ∈ X, i.e., $\mu_X = \sum_{t=1}^{n} \frac{1}{n}\delta(x_t)$, where δ(x_t) denotes the Dirac measure at x_t. Similarly, let ν_Y be the empirical distribution of the negative samples Y, also uniform over the samples. Next, let f_U : R^d → R^d be a mapping parameterized by the to-be-learned representation U. Then, we formulate our contrastive optimal transport problem for representation learning as:

$$\max_{U} \; \mathcal{L}_{OT}(U) := W_c(f_U \# \mu_X, \nu_Y). \qquad (3)$$

The notation f_U#ρ denotes the push-forward measure of ρ under the mapping f_U.
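For discrete measures such as µ_X and ν_Y, the Wasserstein distance in (2) reduces to a finite linear program. The snippet below is a minimal illustration of that computation on toy data; it is our own sketch (not the authors' implementation) and uses scipy's LP solver with variable names of our choosing.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein1(X, Y):
    """Exact W_1 between uniform empirical measures over the rows of X (n x d) and Y (m x d)."""
    n, m = X.shape[0], Y.shape[0]
    # Ground cost c(x, y) = ||x - y||_2; flattened row-major to match pi.ravel().
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    # Marginal constraints: each row of pi sums to 1/n, each column sums to 1/m.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun  # optimal transport cost, i.e., W_c(mu_X, nu_Y)

# Toy usage: two random point clouds in R^5.
rng = np.random.default_rng(0)
print(wasserstein1(rng.normal(size=(6, 5)), rng.normal(size=(4, 5))))
```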
In this paper, we assume the useful features belong to linear feature subspaces² and thus use the mapping f_U = UU^⊤ for an orthonormal U ∈ R^{d×k}, i.e., U^⊤U = I_k, where I_k denotes the k×k identity matrix and k ≤ d. Rewriting the Wasserstein distance (2) in empirical form, combining it with the definition of f_U in (3), and using the ℓ2-norm for the OT cost c, we can write the contrastive representation learning objective as:

$$\max_{U \in \mathcal{G}(d,k)} \; \mathcal{L}_{OT}(U) := \inf_{\pi \in \Pi(\mu_X, \nu_Y)} \sum_{i,j} \pi_{ij}\, \| f_U(x_i) - y_j \|. \qquad (4)$$

In the formulation (4), we assume U ∈ G(d, k), the Grassmann manifold of all k-dimensional subspaces of R^d. Recall that G(d, k) denotes the quotient space S(d, k)/O(k) of all d×k orthonormal matrices S(d, k) that are invariant to right rotations. Given that our loss satisfies L_OT(U) = L_OT(UR) for any k×k orthogonal matrix R, casting the learning objective on the Grassmann manifold is a natural choice.

²This linearity choice is inspired by the observation that such deep features are usually extracted from the penultimate layers of a CNN and subsequently classified using a linear classifier.

A question with regard to (4) is why f_U acts only on the positive samples. This is because we assume that the negative samples (as described in Sec. 3.2) share all the subspaces of the positives, except for those subspaces containing relevant cues for classification (e.g., action-related), which are present only in the positives. Thus, asking for a maximal common projection subspace on the positives and negatives may lead the optimizer away from finding the useful contrastive subspaces in the positives.

3.1.1. CONNECTIONS TO SUBSPACE ROBUST OT

We note that (4) is itself novel for contrastive learning in that, instead of looking for pairs of positive and negative examples as is common in the contrastive learning setup, it implicitly learns the best coupling (via the optimal transport plan) and enforces this coupling to be maximal. Our formulation (4) is related to the recently proposed subspace robust Wasserstein distances (Paty & Cuturi, 2019), which can be defined in our problem setup as:

$$P_k^2 \;\triangleq\; \max_{U \in \mathcal{S}(d,k)} \; \min_{\pi \in \Pi(\mu_X, \nu_Y)} \mathbb{E}_{\pi}\, \| U^\top x - U^\top y \|^2, \qquad \text{and} \qquad S_k^2 \;\triangleq\; \min_{\pi \in \Pi(\mu_X, \nu_Y)} \; \max_{U \in \mathcal{G}(d,k)} \mathbb{E}_{\pi}\, \| U^\top x - U^\top y \|^2.$$

Suppose C_k^2 denotes the Wasserstein-2 variant of our objective in (4), i.e.,

$$C_k^2 = \max_{U \in \mathcal{G}(d,k)} \; \min_{\pi \in \Pi(\mu_X, \nu_Y)} \sum_{i,j} \pi_{ij}\, \| f_U(x_i) - y_j \|^2. \qquad (5)$$

It is clear that our subspaces U act only on the positive examples and seek maximal robustness against the chosen negative distribution. The following theorem characterizes the connection between the solutions to the two objectives. The proof can be found in the supplementary material.

Theorem 1. Assuming n_Y negative samples,

$$P_k^2 \;\le\; C_k^2 \;\le\; S_k^2 + \max_{U \in \mathcal{G}(d,k)} \frac{1}{n_Y} \sum_{j=1}^{n_Y} \| (I_d - UU^\top)\, y_j \|^2,$$

with equality iff k = d. The variational term on the RHS is equal to the sum of the d−k largest eigenvalues of the Gram matrix $\Sigma_Y = \frac{1}{n_Y} \sum_j y_j y_j^\top$.
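The variational term in Theorem 1 can be checked numerically on toy data: choosing U to span the k least-energetic eigenvectors of Σ_Y leaves exactly the d−k largest eigenvalues in the residual. The short sketch below is our own illustration with arbitrary dimensions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n_neg = 8, 3, 200
Y = rng.normal(size=(d, n_neg))            # columns are negative samples y_j
Sigma_Y = (Y @ Y.T) / n_neg                # Gram matrix from Theorem 1

evals, evecs = np.linalg.eigh(Sigma_Y)     # eigenvalues in ascending order
# The maximizing U spans the k eigenvectors with the *smallest* eigenvalues,
# so that (I - U U^T) keeps the d-k most energetic directions of the noise.
U = evecs[:, :k]
residual = np.mean(np.sum(((np.eye(d) - U @ U.T) @ Y) ** 2, axis=0))
print(residual, evals[k:].sum())           # both equal the sum of the d-k largest eigenvalues
```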
3.2. Generating Noise Distributions

As alluded to above, in contrastive learning the noise distribution is typically assumed to be uncorrelated with the data. However, it is often observed that when the noise is closer to the data (hard negatives), the quality of the learned representation improves. In this regard, there are two difficulties to address with respect to noise generation: (i) how to generate a noise distribution ν_Y close to the data distribution µ_X, and (ii) how to ensure the generated noise will not impact the useful features in X?

We resolve this dilemma via generative adversarial networks with some new regularizations. Concretely, suppose our measure on the negative samples ν_Y is defined via an implicit function g_θ : R^d → R^d, where θ denotes its parameters to be learned. That is, we assume y = g_θ(z), where z ∼ N(X̄, σ²I). Specifically, we assume the input to our implicit generator g_θ comes from normally distributed data with mean X̄ and standard deviation σ. Note that X̄ denotes the average of the samples in the respective sequence, i.e., X̄ = (1/n) Σ_t x_t. Recall that we originally assumed each x ∈ X ∼ P_D(X̄) (in Sec. 3). Now, our goal is to learn θ such that it emulates P_D(X̄), X ∈ D, while also capturing the other desirable properties listed above. Substituting the definition of y in (4), we have a modified OT problem:

$$\mathcal{L}'_{OT} := \min_{\theta} \; \min_{\pi \in \Pi(\mu_X,\, g_\theta(\bar{X}, \sigma))} \mathbb{E}_{\pi}\, \| f_U(x) - y \|, \qquad (6)$$

where we use the succinct notation g_θ(X̄, σ) to explicitly show the relation of the noise distribution ν_Y to the input sequence X. Using the Wasserstein-1 distance, we can invoke Kantorovich duality to derive a dual form of this objective via learning a 1-Lipschitz function h, rewriting (6) as:

$$\mathcal{L}'_{OT} := \min_{\theta} \; \max_{h \in L_1} \; \mathbb{E}_{x \sim \mu_X} \left[ h(f_U(x)) \right] - \mathbb{E}_{y \sim g_\theta(\bar{X}, \sigma)} \left[ h(y) \right]. \qquad (7)$$

The formulation in (7) is a variant of the popular Wasserstein GAN (Arjovsky et al., 2017), except that the noise z is conditioned on the input sequence, and the input data x is projected onto a subspace U to be learned. Note that while it may seem we can optimize all the parameters together alongside learning the representation U using (7), this poses some technical hurdles. For example, unless we know how to generate the negative distribution via learning θ, we cannot produce the representation U, and without this representation, we cannot use f_U. This poses a challenge for learning the two jointly, especially when considering highly non-linear neural networks to characterize h. A second problem is that the data from a single sequence may be insufficient to learn θ. We circumvent both problems by separating the representation learning objective from the noise generation objective; the latter we re-formulate as:

$$\mathcal{L}_G(\theta) := \min_{\theta} \; \max_{h \in L_1} \; \mathbb{E}_{x \in X \sim \mathcal{D}} \left[ h(x) \right] - \mathbb{E}_{y \sim g_\theta(\bar{X}, \sigma)} \left[ h(y) \right], \qquad (8)$$

where we have now removed f_U and posed the parameter learning problem over the full dataset D, instead of a single sequence X.
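To make the conditional objective (8) concrete, here is a minimal PyTorch sketch of a generator g_θ fed with z ∼ N(X̄, σ²I) and a weight-clipped 1-Lipschitz critic h. The architectures, layer sizes, and names are our own assumptions (not taken from the paper's code); the learning rate and clipping value follow standard WGAN defaults (Arjovsky et al., 2017), and σ follows the value reported in Sec. 4.

```python
import torch
import torch.nn as nn

d, sigma = 512, 0.01  # feature dimension (arbitrary here) and noise std (Sec. 4)

class Generator(nn.Module):
    """g_theta: maps z ~ N(X_bar, sigma^2 I) to a noise sample y in R^d."""
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
    def forward(self, x_bar):
        z = x_bar + sigma * torch.randn_like(x_bar)   # condition the noise on the sequence mean
        return self.net(z)

class Critic(nn.Module):
    """h: kept approximately 1-Lipschitz via weight clipping, as in the original WGAN."""
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))
    def forward(self, v):
        return self.net(v)

g, h = Generator(d), Critic(d)
opt_g = torch.optim.RMSprop(g.parameters(), lr=5e-5)
opt_h = torch.optim.RMSprop(h.parameters(), lr=5e-5)

def wgan_step(x, x_bar):
    """One alternating update of (8): critic ascends E[h(x)] - E[h(g(z))], generator descends it."""
    y = g(x_bar)
    loss_h = -(h(x).mean() - h(y.detach()).mean())
    opt_h.zero_grad(); loss_h.backward(); opt_h.step()
    for p in h.parameters():                           # crude Lipschitz constraint via clipping
        p.data.clamp_(-0.01, 0.01)
    loss_g = -h(g(x_bar)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```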
3.3. Adversarial Noise Generation

While our formulation in (8) does allow learning noise distributions that are similar in distribution to the data samples x, it is not adversarial in the sense that there is nothing preventing the generated noise from being adequately discriminative, i.e., from mimicking spatio-temporal features. For example, Y may have features that are useful for recognition, whereas we desire our final representation to be maximally different from this noise. To account for these requirements, we propose to learn to generate adversarial noise. We need some new notation to present our technique. Suppose we have a pre-trained (frame-level) classifier ζ : R^d → L that takes each x ∈ X and returns the class label ℓ_X associated with X. Different from (7), our objective now is not just to produce noise, but to make this noise adversarial to the useful components in the input data.

That is, for a given feature x ∈ X, we seek to generate a noise sample x̂ from g_θ(X̄, σ) such that when this sample is subtracted from the input x, a classifier ζ trained to produce ℓ_X will misclassify; i.e., if ζ(x) = ℓ_X, then ζ(γ(x − x̂)) = ℓ'_X for x̂ ∼ g_θ(X̄, σ), where ℓ'_X ∈ L is any class label³ other than ℓ_X, and γ is a suitable operator, e.g., a ReLU, ensuring γ(x − x̂) remains in the same feature space as x. Incorporating these requirements into our GAN objective in (7), we have our new adversarial WGAN objective:

$$\mathcal{L}_A(\theta) = \min_{\theta} \; \max_{h \in L_1} \; \mathbb{E}_{x \in X \sim \mathcal{D}} \left[ h(x) \right] - \mathbb{E}_{y = \gamma(x - \hat{x}),\; \hat{x} \sim g_\theta(\bar{X}, \sigma)} \left[ h(y) \right] + \lambda_1 \left( \zeta(x, \ell_X) - \zeta(\gamma(x - \hat{x}), \ell_X) \right) + \lambda_2\, \mathbb{E}\!\left[ \| \hat{x} \|^2 \right]. \qquad (9)$$

The regularization E[||x̂||²] encourages g_θ to produce noise of small strength that can still misclassify the input (similar to standard adversarial models (Moosavi-Dezfooli et al., 2016)). The positive weights λ balance the losses during training. Note that we need to input X̄ to the generator g_θ so that the generator learns to know the noisy features in its input (at the frame level). Our full objective in (9) is trained end-to-end for the WGAN.

³In practice, a WGAN usually learns to generate random noise of arbitrary strength that could misclassify the input, without accounting for useful subspaces. To circumvent this issue, our objective demands that the generated noise, when combined with the input data, removes the useful data properties. We do this by asking the classifier to produce a logit vector whose softmin equals ℓ, while for the non-perturbed input we ask that the softmax equal ℓ.

3.4. Capturing Sequentiality and Distortion

Now that we have a concrete setup for sampling from adversarial noise distributions to generate the negative samples, let us look back at the representation learning objective in (4). Recall that using the noise distributions, we have accounted for finding samples Y that do not have the useful data properties present in the input data X. While using Y in OT allows for a high and relevant discrimination from X, the contrastive representation should also account for the sequentiality in the data. To this end, we include temporal ordering constraints on the learned subspaces. Specifically, we ask the subspace to be learned in such a way that when the data points are projected onto it, a monotonicity property is satisfied. That is, for some η > 0, the projections of each x_t onto the learned subspace U satisfy the ordering constraints:

$$\| U^\top x_t \|^2 + \eta \;\le\; \| U^\top x_{t+1} \|^2, \qquad t = 1, 2, \dots, n-1.$$

We also ensure that the representation does not diverge too much from the input sequence by including a distortion penalty in our objective. Note that these constraints have been used previously, such as in (Cherian et al., 2017). With these additional constraints, we rewrite our full representation learning objective in (4) as:

$$\max_{U \in \mathcal{G}(d,k)} \; \mathcal{L}_R(U) := \mathcal{L}_{OT}(U) - \beta_1 \sum_{x \in X} \| f_U(x) - x \|^2 - \beta_2 \sum_{t=1}^{n-1} \left[ \| U^\top x_t \|^2 + \eta - \| U^\top x_{t+1} \|^2 \right]_+, \qquad (10)$$

where the β's are constants and [·]_+ = max(0, ·).

3.5. Representation Learning

For a given sequence X ∈ R^{d×n} and its adversarially sampled noise matrix Y ∈ R^{d×m}, we can rewrite our full objective in (10) as:

$$\max_{U \in \mathcal{G}(d,k)} \; \min_{\pi \in \Pi(n,m)} \; \langle \pi, \operatorname{dist}(f_U(X), Y) \rangle - \Omega(X, U), \qquad (11)$$

where Π(n, m) is a set of linear constraints capturing the marginals, and dist produces an n×m distance matrix. Further, with a slight abuse of notation, we assume f_U(X) = UU^⊤X, and Ω(X, U) captures the distortion and ordering constraints on U.
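Before turning to the solver, the sketch below shows one way the objective in (10)-(11) could be evaluated for a fixed coupling π, following the shapes stated above (X is d×n, Y is d×m, U is d×k). This is our own illustration, not the authors' code; the default weights mirror the values reported later in Sec. 4.

```python
import numpy as np

def representation_objective(U, X, Y, pi, beta1=1.0, beta2=10.0, eta=0.01):
    """Evaluate (10)-(11) for fixed U (d x k), X (d x n), Y (d x m), and coupling pi (n x m)."""
    P = U @ U.T                                                  # f_U(x) = U U^T x
    PX = P @ X
    # OT term: sum_ij pi_ij ||f_U(x_i) - y_j||.
    D = np.linalg.norm(PX[:, :, None] - Y[:, None, :], axis=0)  # n x m distance matrix
    ot_term = np.sum(pi * D)
    # Distortion penalty: keep the projections close to the inputs.
    distortion = np.sum((PX - X) ** 2)
    # Temporal ordering hinge: ||U^T x_t||^2 + eta <= ||U^T x_{t+1}||^2.
    proj_norms = np.sum((U.T @ X) ** 2, axis=0)
    ordering = np.sum(np.maximum(0.0, proj_norms[:-1] + eta - proj_norms[1:]))
    return ot_term - beta1 * distortion - beta2 * ordering
```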
We propose to use alternating minimization to solve this objective, where we alternately solve the following sub-problems while keeping the other optimization variable fixed. Specifically, fixing U (initially assuming UU^⊤ = I), we solve for π as:

$$\min_{\pi \in \Pi(n,m)} \; \langle \pi, \operatorname{dist}(f_U(X), Y) \rangle, \qquad (12)$$

which we solve efficiently using the inexact proximal point optimal transport (IPOT) solver proposed in (Xie et al., 2019). In contrast to entropy-regularization-based methods (Cuturi, 2013), IPOT has better convergence and stability properties. Fixing the coupling π, we solve for the subspace U by casting the optimization on the Grassmann manifold. That is, we solve:

$$\min_{U \in \mathcal{G}(d,k)} \; -\| UU^\top X - Y\pi \|_F^2 + \Omega(X, U). \qquad (13)$$

We use the Riemannian conjugate gradient algorithm (Absil et al., 2009) to optimize this sub-problem, with the Fletcher-Reeves step size selection (Fletcher & Reeves, 1964) and retractions via the QR decomposition for the iterations. As each of our sub-problems is non-convex, it is not easy to guarantee convergence. However, empirically, we see that the IPOT solver finds the coupling in about 1000 iterations (note that each iteration is very cheap) and the Riemannian conjugate gradient converges in about 5 iterations.
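For completeness, here is a minimal sketch of the IPOT iteration of Xie et al. (2019) as we understand it, applied to the coupling sub-problem (12); the proximal weight, iteration counts, and names are our own choices and may differ from the authors' settings.

```python
import numpy as np

def ipot(C, mu, nu, beta=1.0, n_outer=1000, n_inner=1):
    """Inexact proximal point iterations for min_pi <pi, C> with marginals mu (n,) and nu (m,)."""
    n, m = C.shape
    pi = np.ones((n, m)) / (n * m)
    G = np.exp(-C / beta)                 # proximal (Gibbs) kernel
    b = np.ones(m)
    for _ in range(n_outer):
        Q = G * pi                        # entrywise product: proximal step around the current plan
        for _ in range(n_inner):          # Sinkhorn-style scaling updates
            a = mu / (Q @ b)
            b = nu / (Q.T @ a)
        pi = a[:, None] * Q * b[None, :]
    return pi

# Usage with the n x m cost matrix D = dist(f_U(X), Y) from (12):
# pi = ipot(D, np.full(D.shape[0], 1 / D.shape[0]), np.full(D.shape[1], 1 / D.shape[1]))
```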
3.6. Representation Classifier

The missing piece in our framework is the classifier C used on the subspace representations, defined in (1). A tricky problem with subspaces is that there is no control over the sign of their basis. While we could use a deep neural network classifier to this end, in this work we use a standard kernelized SVM, with a kernel K defined for two points U_1, U_2 ∈ G(d, k) as (Harandi et al., 2014):

$$K(U_1, U_2) = \exp\!\left( \gamma\, \| U_1^\top U_2 \|_F^2 \right),$$

for a bandwidth parameter γ > 0. A computationally cheaper alternative to using a kernel, which we found empirically to produce similar results, is to use average pooling after computing UU^⊤X. This results in a representation in R^d, and is especially desirable when using a linear sequence classifier.

| Ablation | JHMDB (vgg) RGB | JHMDB (vgg) FLOW | JHMDB (vgg) R+F | JHMDB (I3D) RGB | JHMDB (I3D) FLOW | JHMDB (I3D) R+F | HMDB (I3D) RGB | HMDB (I3D) FLOW | HMDB (I3D) R+F |
|---|---|---|---|---|---|---|---|---|---|
| Avg. Pool | 47.0 | 63.0 | 73.1 | 77.5 | 81.0 | 85.0 | 68.2 | 69.5 | 76.5 |
| COT + Random | 48.0 | 63.9 | 77.9 | 62.2 | 77.2 | 79.4 | 68.5 | 71.1 | 72.5 |
| ACOT | 49.3 | 65.0 | 75.0 | 76.1 | 81.2 | 90.0 | 69.5 | 74.6 | 76.4 |
| ACOT + PCA | 49.5 | 65.7 | 75.6 | 77.6 | 82.8 | 90.6 | 69.8 | 74.9 | 76.6 |
| AC + PCA + order (No OT) | 49.0 | 66.1 | 75.8 | 75.2 | 80.0 | 89.8 | 70.2 | 74.8 | 76.3 |
| ACOT + PCA + order | 50.3 | 69.2 | 79.8 | 78.1 | 82.9 | 91.5 | 73.2 | 75.5 | 79.4 |

Table 1. Ablative study of the various modules in our framework and their benefits on the JHMDB dataset (with two different types of frame-level features, i.e., VGG and I3D) and the HMDB dataset. We report performances on these datasets for the RGB stream, the optical flow stream, and their combination in a two-stream action recognition setup. The results are based on split-1 of the respective datasets. The acronyms are as follows: A = Adversarial, C = Contrastive, OT = optimal transport, PCA = principal components analysis, Random = using random noise instead of adversarially generated noise, and order = temporal ordering. Note that for this experiment, we did not fine-tune the I3D features on the datasets; instead we use the pre-trained ImageNet+Kinetics model (Carreira & Zisserman, 2017).

Figure 2. Sensitivity analysis of various choices in our representation learning setup: (a) increasing number of subspaces k, (b) distortion regularization β1, (c) number of negatives, and (d) temporal regularization β2; each panel plots accuracy (%) for the RGB, FLOW, and RGB+FLOW streams. See text for details. The left two plots are on the HMDB dataset, the right two are on the JHMDB dataset.

4. Experiments

In this section, we present experiments demonstrating the effectiveness of our approach for representation learning. To this end, we use two standard action recognition datasets, namely (i) the small-scale JHMDB dataset (Jhuang et al., 2013) and (ii) the larger HMDB dataset (Kuehne et al., 2011). We also present some qualitative results on the CIFAR dataset. More experiments and results are provided in the supplementary materials, due to lack of space.

JHMDB dataset: consists of 928 video sequences, each about 15-40 frames long. There are 21 actions defined on the clips, and each clip has only one action. We use this dataset as a test bed to explore the performance of the various modules in our setup. To this end, we use a standard two-stream neural network (Simonyan & Zisserman, 2014) for extracting video features at the frame level and from short clips of 20 optical flow frames. To make our results comparable to prior works, we use the vgg-16 features made available as part of (Cherian et al., 2017). We also evaluate using 3D CNN features produced by a pre-trained I3D network (Carreira & Zisserman, 2017) in a two-stream setup.

HMDB dataset: is a super-set of the JHMDB dataset and consists of about 6700 video clips, each with 50-400 frames, covering 51 actions. We use a pre-trained I3D network (trained on Kinetics) in a two-stream framework for evaluating the performance of our model on this dataset.

Hyperparameters: Our entire implementation is in PyTorch. For our adversarial module, we modified the public WGAN code associated with (Arjovsky et al., 2017). We used a noise variance σ = 0.01, which resulted in an average classifier fooling rate of 60% on the training set of both datasets. See the Appendix for more experiments in this regard. Further, we used λ1 = 0.1 and λ2 = 1 in (9). We used PyManOpt as our Riemannian optimization framework.⁴ As for the regularization constants on the distortion and ordering constraints in (10), we set β1 = 1 and β2 = 10, and we used η = 0.01 for the temporal margin. We also assume that all features from the two datasets are normalized to unit ℓ2 norm. We will be making our code publicly available at https://www.merl.com/research/license/.

⁴https://www.pymanopt.org/

Figure 3. Qualitative results showing (left) the original CIFAR image, (middle) the image with sampled noise from an adversarial distribution added, and (right) the respective sampled noise. See text for details.

Figure 4. T-SNE plots on the JHMDB dataset using I3D features: (a) adversarial samples, (b) average pooling, (c) random contrastive, (d) adversarial contrastive. Figure 4(a) shows features (circles) and their respective adversarial negative samples as ×'s (shown for 5 classes). Each color corresponds to a distinct class. Note that the negative samples (×'s) for each class are very close in embedding to the (positive) data samples (circles).
In Figure 4(b), we plot the embeddings of sequence representations produced via average pooling of frame-level features (for all 21 classes). In Figure 4(c), we use our contrastive optimal transport for representation learning, but contrast against random noise. In Figure 4(d), we embed the representations produced via adversarially-contrastive OT (ACOT). Our scheme results in well-separated clusters, justifying its superior performance.

4.1. Ablative Studies

In Table 1, we provide a thorough ablative study of the various modules in our framework. Specifically, we compare (i) average pooling, which provides the baseline performance on the dataset, (ii) contrastive optimal transport (COT) without the adversarial noise, instead using Gaussian noise, (iii) adversarially-contrastive optimal transport (ACOT), (iv) ACOT with the distortion penalty (PCA), as characterized via a data reconstruction loss, (v) adversarially-contrastive estimation with the PCA and temporal ordering constraints, but without the optimal transport cost, and (vi) our full model (ACOT + PCA + temporal order). For (ii), we found that using random noise that is completely uncorrelated with the data resulted in very poor performance (COT+Random). For this experiment, we produced the negative samples by multiplying random noise by the maximum feature value and subtracting it from the feature. That is, for a feature x, we generate z ∼ N(0, max(x)) and compute the negative example y = γ(x − z), where max(x) computes the maximum value over the dimensions of x. We repeated this ablative study on the two datasets over vgg and I3D features.

As is clear from Table 1, using COT leads to better performance than average pooling in most cases, with ACOT demonstrating better performance than COT+Random in most cases. We found that there is significant variance (about 2%) for COT+Random, and thus our numbers are averaged over 5 trials. For ACOT, we sampled twice as many negative samples as positives. The combination of PCA and order also demonstrates benefits overall, showing the effectiveness of our proposed method in extracting weak temporal signals from the sequential data. On the JHMDB dataset, our full architecture is significantly better than average pooling, by nearly 7% on both vgg and I3D features. A similar observation holds on the HMDB dataset as well. Our full model is about 2-5% better on the RGB and FLOW streams separately and about 2% better in combination. This clearly demonstrates the generalizability of our method to different datasets, types of features, and data modalities (RGB and optical flow).

Increasing Number of Subspaces: In Figure 2(a), we keep the distortion β1 = 0.1 and the number of negatives at 50, and plot the performance of ACOT against an increasing number of subspaces in U. As we see, there is an increasing trend in performance (on the HMDB dataset). However, as the number of subspaces increases, the representation learning time also increases (about k times slower for k subspaces). Beyond 8 subspaces, we did not find any improvements.

Increasing Distortion Penalty: Keeping the number of subspaces at 1 and the number of negatives at 50, we increased β1 from 0 to 2. As seen in Figure 2(b), there is a marginal improvement in performance with increasing β1. We believe the contrastive formulation already captures the useful properties of the input sequences, so that the contribution of an explicit distortion penalty is incremental.
Increasing Number of Negatives: An advantage of our setup is the possibility of generating an unlimited number of negatives. In Figure 2(c), we increased the number of negative samples from 20 to 200. While it is very expensive for the OT to solve for large negative sets, we found that using about 40-100 samples is adequate and demonstrates performance improvements. However, with a large number of negatives (such as 200), counter-intuitively, we find that the accuracy drops significantly. This is perhaps because our noise generator model does not fool the classifier perfectly; as a result, overabundant negatives may overlap in distribution with the positives, diminishing the contrastiveness.

Increasing Temporal Ranking Regularization: In this experiment, we kept β1 = 0.1, the number of negatives at 50, and the number of subspaces at 1, and changed the temporal regularization β2 (in (10)). Figure 2(d) plots this sensitivity. As is clear from the figure, smaller regularization is ineffective.

Running Time: On average, excluding the time to extract CNN features, our representation learning setup processes about 30 frames per second using 5 iterations of conjugate gradient and 1000 iterations of inexact proximal optimal transport.

Qualitative Visualizations: To gain insight into the kind of perturbations our adversarial network generates, we trained this sub-module on the CIFAR10 dataset. Specifically, we use an auto-encoder on the CIFAR images, where the latent vector of each image forms the feature. Next, we trained our adversarial network, similar to our setup for sequences, but assuming only a single frame (the encoded CIFAR image). We combined the generated noise with the latent feature and decoded the image. The goal of the discriminator in our framework is to classify this generated image as "fake", while the generator is trained to produce better noise such that the discriminator is fooled, and yet a pre-trained CIFAR classifier (trained on the 10 CIFAR classes) shows low confidence for the true image class (i.e., the noise needs to be adversarial). In Figure 3, we show a few qualitative CIFAR images. The figure shows that the noise impacts image regions that are likely to be useful for recognition (e.g., the top-right cat image and its respective perturbation, which corrupts mainly the face region). Figure 4 shows T-SNE embeddings of representations produced by our scheme on JHMDB features.

| JHMDB using VGG features | Accuracy |
|---|---|
| GRP (Cherian et al., 2017) | 70.6 |
| P-CNN (Chéron et al., 2015) | 72.2 |
| Kernelized Pooling (Cherian et al., 2018) | 73.8 |
| Ours (full model) | 75.7 |

| JHMDB using 3D-CNNs | Accuracy |
|---|---|
| Chained (Zolfaghari et al., 2017) | 76.1 |
| I3D + PoTion (Choutas et al., 2018) | 85.5 |
| I3D + Ours (full model) | 87.5 |

Table 2. Comparisons on the JHMDB dataset (3 splits).

| Method | Acc. (%) |
|---|---|
| I3D (Carreira & Zisserman, 2017) | 80.9 |
| Disc. Pool (Wang & Cherian, 2019) | 81.3 |
| DSP (Wang & Cherian, 2018) | 81.5 |
| Ours (I3D + full model) | 81.8 |

Table 3. Comparisons on the HMDB dataset (3 splits) using the two-stream I3D model fine-tuned on the dataset.

4.2. Comparisons to the State of the Art

In Tables 2 and 3, we compare the performance of our method against prior works on the respective datasets, such as (i) generalized rank pooling (GRP) (Cherian et al., 2017), (ii) kernelized rank pooling (KRP) (Cherian et al., 2018), and (iii) P-CNN (Chéron et al., 2015). Our method performs better by about 2% on three-fold cross-validation. We repeated this experiment using I3D features and compare against similar prior methods.
Our results demonstrate superior performance. For experiments on the HMDB dataset, we fine-tuned the I3D model on this dataset (in contrast to the results reported in Table 1, which used a pre-trained I3D model). For our ACOT, we used a single hyperplane subspace. Our re-trained model produces a baseline 3-split accuracy of 80.6% (we report the numbers from (Carreira & Zisserman, 2017) in Table 3), and our proposed approach improves it to 81.8%. We also compare to the results in (Wang & Cherian, 2018), which uses a similar pooling setup via adversarial perturbations for generating the negative set; our proposed model shows benefits.

5. Conclusions

We presented a novel framework for producing data representations of sequences via contrastive learning. Our key insight emerged from the observation that each item in a (video) sequence is often encoded using a model that does not access the full sequence. As a result, the cues for sequence-level inference within such encodings might be weak. To amplify such cues, we resorted to contrastive learning, where we contrast the features against adversarially-learned features and learn subspaces, as a representation, that capture these weak cues via optimal transport. We presented experiments on two datasets, demonstrating empirical benefits against recent methods.

Acknowledgements

Anoop Cherian thanks Matthew Brand, Anshul Shah, and Jue Wang for helpful suggestions. Shuchin Aeron acknowledges the support of NSF CCF:1553075 and support by MERL/Tufts University during the sabbatical period January 2019 - August 2019.

References

Absil, P.-A., Mahony, R., and Sepulchre, R. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O., and Saunshi, N. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.

Bose, A. J., Ling, H., and Cao, Y. Adversarial contrastive estimation. arXiv preprint arXiv:1805.03642, 2018.

Carreira, J. and Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.

Cherian, A., Fernando, B., Harandi, M., and Gould, S. Generalized rank pooling for activity recognition. In CVPR, 2017.

Cherian, A., Sra, S., Gould, S., and Hartley, R. Non-linear temporal subspace representations for activity recognition. In CVPR, 2018.

Chéron, G., Laptev, I., and Schmid, C. P-CNN: Pose-based CNN features for action recognition. In ICCV, 2015.

Choutas, V., Weinzaepfel, P., Revaud, J., and Schmid, C. PoTion: Pose motion representation for action recognition. In CVPR, 2018.

Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A., and Bengio, Y. A recurrent latent variable model for sequential data. In NIPS, 2015.

Cover, T. and Thomas, J. Elements of Information Theory. John Wiley and Sons, 2006.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, 2013.

Dollar, P., Rabaud, V., Cottrell, G., and Belongie, S. Behavior recognition via sparse spatio-temporal features. In Intl. Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.

Fletcher, R. and Reeves, C. M. Function minimization by conjugate gradients. The Computer Journal, 7(2):149-154, 1964.

Girdhar, R., Tran, D., Torresani, L., and Ramanan, D. DistInit: Learning video representations without a single labeled video. CoRR, abs/1901.09244, 2019. URL http://arxiv.org/abs/1901.09244.
Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., and Schmidhuber, J. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222-2232, 2016.

Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.

Harandi, M. T., Salzmann, M., Jayasumana, S., Hartley, R., and Li, H. Expanding the family of Grassmannian kernels: An embedding perspective. In ECCV, 2014.

Hénaff, O. J., Razavi, A., Doersch, C., Eslami, S. M. A., and van den Oord, A. Data-efficient image recognition with contrastive predictive coding. CoRR, abs/1905.09272, 2019. URL http://arxiv.org/abs/1905.09272.

Hoffer, E. and Ailon, N. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, 2015.

Jhuang, H., Gall, J., Zuffi, S., Schmid, C., and Black, M. J. Towards understanding action recognition. In ICCV, 2013.

Kim, T., Ahn, S., and Bengio, Y. Variational temporal abstraction. In NeurIPS, 2019.

Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. HMDB: A large video database for human motion recognition. In ICCV, 2011.

Merity, S., Keskar, N. S., and Socher, R. Regularizing and optimizing LSTM language models. In ICLR, 2018.

Moosavi-Dezfooli, S.-M., Fawzi, A., and Frossard, P. DeepFool: A simple and accurate method to fool deep neural networks. In CVPR, 2016.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Paty, F. and Cuturi, M. Subspace robust Wasserstein distances. CoRR, abs/1901.08949, 2019. URL http://arxiv.org/abs/1901.08949.

Poole, B., Ozair, S., Van Den Oord, A., Alemi, A., and Tucker, G. On variational bounds of mutual information. In ICML, 2019.

Santambrogio, F. Optimal transport for applied mathematicians. Birkhäuser, NY, 55(58-63):94, 2015.

Saunshi, N., Plevrakis, O., Arora, S., Khodak, M., and Khandeparkar, H. A theoretical analysis of contrastive unsupervised representation learning. In ICML, 2019.

Simonyan, K. and Zisserman, A. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.

Smith, N. A. and Eisner, J. Contrastive estimation: Training log-linear models on unlabeled data. In ACL, 2005.

Song, J. and Ermon, S. Understanding the limitations of variational mutual information estimators. arXiv preprint arXiv:1910.06222, 2019.

Sun, C., Baradel, F., Murphy, K., and Schmid, C. Contrastive bidirectional transformer for temporal representation learning. CoRR, abs/1906.05743, 2019. URL http://arxiv.org/abs/1906.05743.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In NIPS, 2014.

Tran, D., Ray, J., Shou, Z., Chang, S., and Paluri, M. ConvNet architecture search for spatiotemporal feature learning. CoRR, abs/1708.05038, 2017. URL http://arxiv.org/abs/1708.05038.

Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S., and Lucic, M. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625, 2019.

van den Oord, A., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018.

Varol, G., Laptev, I., and Schmid, C. Long-term temporal convolutions for action recognition. CoRR, abs/1604.04494, 2016.
Varol, G., Laptev, I., and Schmid, C. Long-term temporal convolutions for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1510-1517, 2017.

Wang, J. and Cherian, A. Learning discriminative video representations using adversarial perturbations. CoRR, abs/1807.09380, 2018. URL http://arxiv.org/abs/1807.09380.

Wang, J. and Cherian, A. Discriminative video representation learning using support vector classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-1, 2019. doi: 10.1109/TPAMI.2019.2937292.

Wang, X. and Gupta, A. Unsupervised learning of visual representations using videos. CoRR, abs/1505.00687, 2015.

Wu, C., Zaheer, M., Hu, H., Manmatha, R., Smola, A. J., and Krähenbühl, P. Compressed video action recognition. CoRR, abs/1712.00636, 2017.

Xie, Y., Wang, X., Wang, R., and Zha, H. A fast proximal point method for computing exact Wasserstein distance. In UAI, 2019.

Zilly, J. G., Srivastava, R. K., Koutník, J., and Schmidhuber, J. Recurrent highway networks. In ICML, 2017.

Zolfaghari, M., Oliveira, G. L., Sedaghat, N., and Brox, T. Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection. In ICCV, 2017.