# Supervised Training of Conditional Monge Maps

Charlotte Bunne (ETH Zurich, bunnec@ethz.ch), Andreas Krause (ETH Zurich, krausea@ethz.ch), Marco Cuturi (Apple, cuturi@apple.com). Work done during an internship at Apple.

Optimal transport (OT) theory describes general principles to define and select, among many possible choices, the most efficient way to map a probability measure onto another. That theory has been mostly used to estimate, given a pair of source and target probability measures (µ, ν), a parameterized map T that can efficiently map µ onto ν. In many applications, such as predicting cell responses to treatments, pairs of input/output data measures (µ, ν) that define optimal transport problems do not arise in isolation but are associated with a context c, for instance a treatment when comparing populations of untreated and treated cells. To account for that context in OT estimation, we introduce CONDOT, a multi-task approach to estimate a family of OT maps conditioned on a context variable, using several pairs of measures (µi, νi) tagged with a context label ci. CONDOT learns a global map Tθ conditioned on context that is not only expected to fit all labeled pairs in the dataset {(ci, (µi, νi))}i, i.e., Tθ(ci)♯µi ≈ νi, but should also generalize to produce meaningful maps Tθ(cnew) when conditioned on unseen contexts cnew. Our approach harnesses and provides a novel usage for partially input convex neural networks, for which we introduce a robust and efficient initialization strategy inspired by Gaussian approximations. We demonstrate the ability of CONDOT to infer the effect of an arbitrary combination of genetic or therapeutic perturbations on single cells, using only observations of the effects of said perturbations separately.

1 Introduction

A key challenge in the treatment of cancer is to predict the effect of drugs, or a combination thereof, on the cells of a particular patient. To achieve that goal, single-cell sequencing can now provide measurements for individual cells, in treated and untreated conditions, but these measurements are, however, not in correspondence. Given such examples of untreated and treated cells under different drugs, can we predict the effect of new drug combinations? We develop a general approach motivated by this and related problems, through the lens of optimal transport (OT) theory, and, in that process, develop tools that might be of interest for other application domains of OT. Given a collection of N pairs of measures (µi, νi) over Rd (cell measurements), tagged with a context ci (encoding the treatment), we seek to learn a context-dependent, parameterized transport map Tθ such that, on training data, the map Tθ(ci): Rd → Rd fits the dataset, in the sense that Tθ(ci)♯µi ≈ νi. Additionally, we expect this parameterized map to generalize to unseen contexts and patients, to predict, given a patient's cells described in µnew, the effect of applying context cnew on these cells as Tθ(cnew)♯µnew.

Learning Mappings Between Measures. From generative adversarial networks to normalizing flows and diffusion models, the problem of learning maps that move points from a source to a target distribution is central to machine learning. OT theory (Santambrogio, 2015) has emerged as a principled approach to carry out that task: For a pair of measures µ, ν supported on Rd, OT suggests that, among all maps T such that ν can be reconstructed by applying T to every point in the support
of µ (abbreviated with the push-forward notation as T♯µ = ν), one should favor so-called Monge maps, which minimize the average squared length of displacements ‖x − T(x)‖². A rich literature, covered in Peyré and Cuturi (2019), addresses the computational challenges of estimating such maps, with impactful applications to various areas of science (cf. Hashimoto et al., 2016; Schmitz et al., 2018; Schiebinger et al., 2019; Yang et al., 2020; Janati et al., 2020; Bunne et al., 2022a).

Figure 1: The evolution from a source µ to a target measure ν can depend on context variables c of various nature. This comprises a. scalars such as time or dosage t, which determine the magnitude of an optimal transport, b. flows of a measure into another one based on additional information (possibly different between µ and ν) stored in vectors c_{l,l′}, or c. discrete and complex actions ai, possibly in combination aij. We seek a unified framework to produce a map Tθ(c) from any type of condition c.

Neural OT. We focus in this work on neural approaches that parameterize the optimal maps T as neural networks. An early approach is the work on Wasserstein GANs (Arjovsky et al., 2017), albeit the transport map is not explicitly estimated.
Several recent results have exploited a more explicit connection between OT and neural networks, derived from the celebrated Brenier theorem (1987), which states that Monge maps are necessarily gradients of convex functions. Such convex functions can be represented using input convex neural networks (ICNNs) (Amos et al., 2017), to parameterize either the Monge map (Jacob et al., 2018; Yang and Uhler, 2019; Bunne et al., 2021, 2022b) or a dual potential (Makkuva et al., 2020; Korotin et al., 2020) as, respectively, the gradient of an ICNN or an ICNN itself. In this paper, we build on this line of work, but substantially generalize it, to learn a parametric family of context-aware transport maps, using a collection of labeled pairs of measures.

Contributions. We propose a framework that can leverage labeled pairs of measures {(ci, (µi, νi))}i to infer a global parameterized map Tθ. Here, the context ci belongs to an arbitrary set C. We construct Tθ so that, given a possibly unseen context label c ∈ C, it outputs a map Tθ(c): Rd → Rd that is itself the gradient of a convex function. To that end, we propose to learn these parameterized Monge maps Tθ as the gradients of partially input convex neural networks (PICNNs), which we borrow from the foundational work of Amos et al. (2017). Our framework can also be interpreted as a hypernetwork (Ha et al., 2016): The PICNN architecture can be seen as an ICNN whose weights and biases are modulated by the context vector c, which thus parameterizes a family of convex potentials on Rd. Because both ICNNs and, to a greater extent, PICNNs are notoriously difficult to train (Richter-Powell et al., 2021; Korotin et al., 2020, 2021), we use closed-form solutions between Gaussian approximations to derive relevant parameter initializations for (P)ICNNs: These choices ensure that, upon initialization, the gradient of the (P)ICNN mimics the affine Monge map obtained in closed form between Gaussian approximations of the measures µi, νi (Gelbrich, 1990). Our framework is applied to three scenarios: parameterization of transport through a real variable (time or drug dosage), through an auxiliary informative variable (cell covariates), and through action variables (genetic perturbations in combination) (see Fig. 1). Our results demonstrate the ability of our architectures to better capture, on out-of-sample observations, the effects of these variables in various settings, even when considering never-seen, composite context labels. These results suggest potential applications of conditional OT to model personalized medicine outcomes, or to guide novel experiments, where OT could serve as a predictor for never-tested context labels.

2 Background on Neural Solvers for the 2-Wasserstein Problem

Optimal Transport. The Monge problem between two measures µ, ν ∈ P(Rd), here restricted to measures supported on Rd and compared with the squared Euclidean metric, reads

T⋆ := arg inf_{T: T♯µ=ν} ∫_{Rd} ‖x − T(x)‖² dµ(x). (1)

The existence of T⋆ is guaranteed under fairly general conditions (Santambrogio, 2015, Theorem 1.22), which require that µ and ν have finite second moments, and that µ puts no mass on (d−1)-dimensional surfaces of class C². This can be proved with the celebrated Brenier theorem (1987), which states that there must exist a unique (up to the addition of a constant) convex potential f⋆: Rd → R such that T⋆ = ∇f⋆.
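For intuition on (1) and Brenier's theorem, the one case with a simple closed form is the Gaussian one: between two Gaussians the Brenier potential is a quadratic and the Monge map is affine (Gelbrich, 1990), a fact reused in §4 to initialize the networks. Below is a minimal NumPy sketch of that closed-form map; the helper names are ours and the final check is only a finite-sample sanity test, not part of the paper's method.

```python
import numpy as np

def sqrtm(S):
    # symmetric PSD matrix square root via eigendecomposition
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def gaussian_monge_map(m1, S1, m2, S2):
    """Closed-form Monge map between N(m1, S1) and N(m2, S2) (Gelbrich, 1990):
    T(x) = m2 + M (x - m1), with M = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2}."""
    S1h = sqrtm(S1)
    S1h_inv = np.linalg.inv(S1h)
    M = S1h_inv @ sqrtm(S1h @ S2 @ S1h) @ S1h_inv
    return lambda x: m2 + (x - m1) @ M.T

# sanity check: the map pushes samples of the first Gaussian onto the second
rng = np.random.default_rng(0)
m1, m2 = np.zeros(2), np.array([1.0, -2.0])
S1, S2 = np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])
T = gaussian_monge_map(m1, S1, m2, S2)
y = T(rng.multivariate_normal(m1, S1, size=20_000))
print(np.allclose(y.mean(0), m2, atol=0.05), np.allclose(np.cov(y.T), S2, atol=0.1))
```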
This theorem has far-reaching implications: It is sufficient, when seeking optimal transport maps, to restrict the computational effort to the search for a good convex potential, such that its gradient pushes µ towards ν. This result has been exploited to propose OT solvers that rely on input convex neural networks (ICNNs) (Amos et al., 2017), introduced below. In practice, Monge maps can be estimated using a dual formulation (Makkuva et al., 2020; Korotin et al., 2020; Bunne et al., 2022b; Alvarez-Melis et al., 2021; Mokrov et al., 2021): writing f∗ for the Legendre transform of a convex function f, the Brenier potential can be characterized as

f⋆ := arg inf_{f convex} ∫_{Rd} f dµ + ∫_{Rd} f∗ dν, (2)

and T⋆ in (1) is then recovered as ∇f⋆.

Convex Neural Architectures. Input convex neural networks (ICNNs) are neural networks that admit certain constraints on their architecture and parameters θ, such that their output Φθ(x) is a convex function of their input x (Amos et al., 2017). As a result, they have been increasingly used as drop-in replacements for the set of admissible functions in (2). Practically speaking, an ICNN is a K-layer, fully connected network such that, at each layer index k from 0 to K−1, a hidden state vector z_k is defined recursively as in (3),

z_{k+1} = σ_k(W^z_k z_k + W^x_k x + b_k), (3)

and Φθ(x) = z_K, where, by convention, z_0 and W^z_0 are 0; the σ_k are convex non-decreasing activation functions; and θ = {W^x_k, W^z_k, b_k}_{k=0}^{K−1} are the weights and biases of the neural network. While ample flexibility is provided to choose dimensions for the intermediate hidden states z_k, the last layer must necessarily produce a scalar, hence W^x_{K−1} and W^z_{K−1} are row vectors and b_{K−1} ∈ R. ICNNs are characterized by the fact that all weight matrices W^z_k associated with latent representations z must have non-negative entries. This, along with the specific activation functions, ensures the convexity of Φθ. We encode this constraint by identifying these matrices as the elementwise softplus or ReLU of other matrices of the same size, or, alternatively, by using a regularizer that penalizes the negative entries of these matrices. Since the work by Amos et al. (2017), convex neural architectures have been used within the context of OT to model convex dual functions (Makkuva et al., 2020), or normalizing flows derived from convex potentials (Huang et al., 2021). Their expressivity and universal approximation properties have been studied by Chen et al. (2019), who show that any convex function over a compact convex domain can be approximated in sup norm by an ICNN.

3 Supervised Training of Conditional Monge Maps

We are given a dataset of N pairs of measures, each endowed with a label, (ci, (µi, νi)) ∈ C × P(Rd)². Our framework builds upon two pillars: (i.) we formulate the hypothesis that an optimal transport map T⋆_i (or, equivalently, the gradient of a convex potential f⋆_i) explains how measure µi was mapped to νi, given context ci; (ii.) we build on the multi-task hypothesis (Caruana, 1997) that all of the N maps T⋆_i between µi and νi share a common set of parameters, which are modulated by the context information ci. These ideas are summarized in an abstract regression model described below.

3.1 A Regression Formulation for Conditional OT Estimation

Given parameters θ ∈ Rr, Tθ describes a function that takes an input vector c ∈ C and outputs a map Tθ(c): Rd → Rd, as a hypernetwork would (Ha et al., 2016). Assume momentarily that we are given ground-truth maps Ti that describe the effect of context ci on any measure, rather than only pairs of measures (µi, νi).
This is of course a major leap of faith, since even recovering an OT map T⋆ from two measures is in itself very challenging (Hütter and Rigollet, 2021; Rigollet and Stromme, 2022; Pooladian and Niles-Weed, 2021). If such maps were available, a direct supervised approach to learn a unique θ could hypothetically involve minimizing a fit function composed of losses between maps,

∑_{i=1}^N ∫_{Rd} ‖Tθ(ci)(x) − Ti(x)‖² dµi(x). (4)

Unfortunately, such maps Ti are not given, since we are only provided unpaired samples, before (µi) and after (νi) that map's application. By Brenier's theorem, we know, however, that such an OT map T⋆_i exists, and that it is necessarily the gradient of a convex potential function that solves (2). As a result, we propose to modify (4) to (i.) parameterize, for any c, the map Tθ(c) as the gradient w.r.t. x of a function fθ(x, c): Rd × C → R that is convex w.r.t. x, namely Tθ(c) := x ↦ ∇1 fθ(x, c); and (ii.) estimate θ by optimizing jointly the dual objectives (2) simultaneously for all N pairs of measures, in order to ensure that the maps are close to optimal, to form the aggregate problem

θ⋆ ∈ arg min_θ ∑_{i=1}^N E_{µi,νi}(fθ(·, ci)), where E_{µ,ν}(f) := ∫_{Rd} f dµ + ∫_{Rd} f∗ dν. (5)

We detail in App. B how the Legendre transforms that appear in the energy terms E_{µi,νi} are handled with an auxiliary function.

3.2 Integrating Context in Convex Architectures

We propose to incorporate context variables, in order to modulate a family of convex functions fθ(x, c), using partially input convex neural networks (PICNNs). PICNNs are neural networks that can be evaluated over a pair of inputs (x, c), but which are only required to be convex w.r.t. x. Given an input vector x and a context vector c, a K-layer PICNN is defined as Φθ(x, c) = z_K, where, recursively for 0 ≤ k ≤ K−1, one has

u_{k+1} = κ_k(V_k u_k + v_k),
z_{k+1} = σ_k(W^z_k (z_k ⊙ [W^{zu}_k u_k + b^z_k]_+) + W^x_k (x ⊙ (W^{xu}_k u_k + b^x_k)) + W^u_k u_k + b_k),

where the PICNN is initialized as u_0 = c and z_0 = 0, ⊙ denotes the Hadamard elementwise product, [·]_+ the positive part, and κ_k is any activation function. The parameters of the PICNN are then given by θ = {V_k, v_k, W^z_k, W^{zu}_k, b^z_k, W^x_k, W^{xu}_k, b^x_k, W^u_k, b_k}. Similar to ICNNs, the convexity w.r.t. the input variable x is guaranteed as long as the activation functions σ_k are convex and non-decreasing, and the weight matrices W^z_k have non-negative entries. We parameterize this by storing them as elementwise applications of softplus operations on precursor matrices of the same size, or, alternatively, by regularizing their negative part. Finally, much like ICNNs, the matrices acting on z and x at the last layer are row vectors, and their biases scalars. Such networks were proposed by Amos et al. (2017, Eq. 3) to address a problem that is somewhat symmetric to ours: Their inputs were labeled as (y, x), where y is a label vector, typically of much smaller dimension than x. Their PICNN is convex w.r.t. y, in order to easily recover, given a datapoint x (e.g., an image), the best label y that corresponds to x using gradient descent as a subroutine, i.e., y⋆(x) = arg min_y PICNNθ(x, y). PICNNs were therefore originally proposed to learn a parameterized, implicit classification layer, amortized over samples, whose motivation rests on the property that it is convex w.r.t. the label variable y. By contrast, we use PICNNs that are convex w.r.t. data points x. In addition to that swap, we do not use the convexity of the PICNN to define an implicit layer (or to carry out gradient descent). Indeed, it does not make sense in our setting to minimize Φθ(x, c) as a function of x, since x is an observation. Instead, our work rests on the property that ∇1 Φθ(x, c) describes a parameterized family of OT maps. We note that PICNNs were considered within the context of OT in (Fan et al., 2021, Appendix B).
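To make the recursion above concrete, here is a minimal NumPy sketch of a PICNN forward pass; dropping the context path u recovers the ICNN recursion (3). The layer sizes, the tanh activation on the context path, and the softplus reparameterization of the W^z matrices are illustrative choices, not the exact configuration used in the paper.

```python
import numpy as np

def softplus(u):
    return np.logaddexp(u, 0.0)

def init_picnn(dim_x, dim_c, hidden=(16, 16, 1), seed=0):
    """Random parameters for a small PICNN; the last hidden size must be 1."""
    rng = np.random.default_rng(seed)
    params, d_z, d_u = [], 1, dim_c
    for d_out in hidden:
        params.append(dict(
            Wz=rng.normal(size=(d_out, d_z)) * 0.1,           # precursor, softplus-ed below
            Wzu=rng.normal(size=(d_z, d_u)) * 0.1, bz=np.zeros(d_z),
            Wx=rng.normal(size=(d_out, dim_x)) * 0.1,
            Wxu=rng.normal(size=(dim_x, d_u)) * 0.1, bx=np.ones(dim_x),
            Wu=rng.normal(size=(d_out, d_u)) * 0.1, b=np.zeros(d_out),
            V=rng.normal(size=(d_u, d_u)) * 0.1, v=np.zeros(d_u)))
        d_z = d_out
    return params

def picnn(params, x, c):
    """Phi_theta(x, c), convex in x (cf. Amos et al., 2017, Eq. 3): the matrices
    multiplying z are kept non-negative via softplus, and the z-path activations
    are convex and non-decreasing, which guarantees convexity in x for any c."""
    u, z = c, np.zeros(params[0]["Wz"].shape[1])              # u0 = c, z0 = 0
    for p in params:
        gate_z = np.maximum(p["Wzu"] @ u + p["bz"], 0.0)      # non-negative gate on z
        gate_x = p["Wxu"] @ u + p["bx"]                       # unconstrained gate on x
        z = softplus(softplus(p["Wz"]) @ (z * gate_z)
                     + p["Wx"] @ (x * gate_x) + p["Wu"] @ u + p["b"])
        u = np.tanh(p["V"] @ u + p["v"])                      # context path, unconstrained
    return z.item()                                           # last layer outputs a scalar

params = init_picnn(dim_x=3, dim_c=2)
print(picnn(params, x=np.ones(3), c=np.array([0.5, -0.5])))
```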
In that work (Fan et al., 2021), PICNNs provide an elegant reformulation for neural Wasserstein barycenters, with a context vector c restricted to be a small vector of probabilities.

3.3 Conditional Monge Map Architecture

Using PICNNs as a base module, the CONDOT architecture integrates operations on the contexts C. As seen in Figure 1, context values c may take various forms:

1. A scalar t denoting a strength or a temporal effect. For instance, McCann's interpolation and its time parameterization, ρ_t = ((1−t)Id + tT)♯ρ_0 (McCann, 1997), can be interpreted as a trivial conditional OT model that creates, from an OT map ∇φ, a set of maps parameterized by t, T(t) := x ↦ ∇_x [(1−t)‖x‖²/2 + t φ(x)].
2. A covariate vector influencing the nature of the effect that led µi to νi (capturing, e.g., patient feature vectors).
3. One or multiple actions, possibly discrete, representing decisions or perturbations applied onto µi.

To provide a flexible architecture capable of modeling different types of conditions, as well as conditions appearing in combination, the more general CONDOT architecture consists of the hypernetwork Tθ that is fed a context vector through embedding and combinator modules. This generic architecture provides a one-size-fits-all approach to integrate all types of contexts c.

Embedding Module. To give greater flexibility when setting the context variable c, CONDOT contains an embedding module E that translates arbitrary contexts into real-valued vectors. Besides simple scalars t (Fig. 1a), for which no embedding is required, discrete contexts can be handled with an embedding module Eφ. When the set C is small, this can be done effectively using one-hot embeddings Eohe. For more complicated actions a, such as treatments, there is no simple way to vectorize a context c. Similarly to action embeddings in reinforcement learning (Chandak et al., 2019; Tennenholtz and Mannor, 2019), we can map discrete actions into a learned continuous representation. This often requires domain knowledge on the context values. For molecular drugs, for example, we can learn molecular representations Emol such as chemical, motif-based (Rogers and Hahn, 2010) or neural fingerprints (Rong et al., 2020; Schwaller et al., 2022). Often, however, this domain knowledge is not available. In this work, we thus construct so-called mode-of-action embeddings, by computing an embedding Emoa that encourages actions a with a similar effect on the target population to have similar representations. In §5, we analyze several embedding types for different use cases.

Figure 2: CONDOT Architecture and Modules. The embedding module Eφ embeds arbitrary conditions c, which are then combined via the module CΦ. Using the processed contexts c, the map Tθ(c) acts on µ to predict the target measure ν.

Combinator Module. While we often have access to contexts c in isolation, it is crucial to infer the effect of contexts applied in combination. Prominent examples are cancer combination therapies, in which multiple treatment modalities are administered in combination to enhance treatment efficacy (Kummar et al., 2010). In these settings, the mode of operation between individual contexts c is often not known, and can thus not be directly modeled via simple arithmetic operations such as min, max, sum, or mean.
While we test as a baseline the case, applicable to one-hot embeddings, where simple additions are used to model these combinations, we propose to augment the CONDOT architecture with a parameterized combinator module CΦ. If the order in which the actions are applied is irrelevant or unknown, the corresponding network CΦ needs to be permutation-invariant, which can be achieved by using a deep set architecture (Zaheer et al., 2017); a minimal sketch of such a combinator is given at the end of this section. Receiving a flexible number of inputs from the embedding module Eφ, CONDOT allows for a joint training of the PICNN parameters θ, the embedding parameters φ, and the combinator parameters Φ in a single, end-to-end differentiable architecture.

Training Procedure. Given a dataset D = {(ci, (µi, νi))}i of N pairs of populations before (µi) and after (νi) transport, each connected to a context ci, we detail in Algorithm 1, provided in App. B, a training loop that incorporates all of the architecture proposals described above. The training loss aims at making sure that the map Tθ(ci) is an OT map from µi to νi, where ci may either be the original label itself or its embedded/combined formulation in more advanced tasks. To handle the Legendre transform in (2), we use the proxy dual objective defined in (Makkuva et al., 2020, Eq. 6), restated in (15)-(16), in place of (2) in our overall loss (5). This involves training the CONDOT architecture using two PICNNs, i.e., PICNN fθ and PICNN gθ, that share the same embedding/combinator module, with a regularization (14) promoting that, for any c, the PICNN gθ(·, c) resembles the Legendre transform of the other, fθ(·, c).
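As referenced above, a permutation-invariant combinator can be built as a deep set (Zaheer et al., 2017): embed each action, pool with a symmetric operation, then map the pooled vector to the context fed to the PICNN. The sketch below is a minimal NumPy illustration; the layer sizes, the mean pooling, and the random initialization are our own choices, and in CONDOT the parameters Φ would be trained jointly with the rest of the architecture.

```python
import numpy as np

def mlp(params, h):
    """Tiny fully connected network with ReLU hidden activations."""
    *hidden, (W_last, b_last) = params
    for W, b in hidden:
        h = np.maximum(W @ h + b, 0.0)
    return W_last @ h + b_last

def init_mlp(sizes, seed):
    rng = np.random.default_rng(seed)
    return [(rng.normal(size=(m, n)) * 0.1, np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

def deep_set_combinator(phi_params, rho_params, embeddings):
    """Permutation-invariant combination of a variable number of embeddings:
    c = rho( mean_j phi(e_j) )  (Zaheer et al., 2017)."""
    pooled = np.mean([mlp(phi_params, e) for e in embeddings], axis=0)
    return mlp(rho_params, pooled)

# two single-perturbation embeddings combined into one context vector for the PICNN
e_a, e_b = np.ones(8), -np.ones(8)
phi_params, rho_params = init_mlp([8, 32, 32], seed=0), init_mlp([32, 32, 16], seed=1)
c_ab = deep_set_combinator(phi_params, rho_params, [e_a, e_b])
c_ba = deep_set_combinator(phi_params, rho_params, [e_b, e_a])
assert np.allclose(c_ab, c_ba)   # invariant to the order in which actions are listed
```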
4 Initialization Strategies for Neural Convex Architectures

We address the problem of initializing the parameters of (P)ICNNs to ensure that their gradient, evaluated at any point, is (initially) meaningful in the context of OT, namely that it is able to map the first and second moments of a measure µ onto those of a target measure ν. The initializers we propose build heavily on the quadratic layers proposed in the seminal reference (Korotin et al., 2020, Appendix B.2), notably the DenseQuad layer, as well as on closed-form solutions available for Gaussian approximations of measures (Gelbrich, 1990).

Figure 3: a. From a measure µ to several target measures ν1, ν2, ν3 provided with labels c1, c2, c3, we can extract three Gaussian (quadratic) potentials in closed form, b. whose gradients transport, on a first approximation, µ to areas in space that cover the three targets. c. Given a new label vector c, we compare it to known labels to modulate the magnitude of each of the three potentials.

Closed-Form Potentials for Gaussians. Given two Gaussian distributions N1, N2 with means m1, m2 and covariance matrices Σ1, Σ2 respectively (where Σ1 is assumed to be full rank), the Brenier potential solving the OT problem from the first to the second Gaussian reads:

f⋆_{N1,N2}(x) = ½ xᵀ AᵀA x + bᵀx + t(A, b) = ½ ‖Ax‖² + bᵀx + t(A, b), where (7)
AᵀA := Σ1^{−1/2} (Σ1^{1/2} Σ2 Σ1^{1/2})^{1/2} Σ1^{−1/2} and b := m2 − AᵀA m1

define both quadratic and linear terms, and t(A, b) can be any constant. Importantly, note that we write the quadratic term in the factorized form AᵀA to enforce psd-ness, as done by Korotin et al. (2020), not, as usually done, with a single psd matrix (Peyré and Cuturi, 2019, Remark 2.31). Our quadratic potentials are only injected in the first hidden state z0, to populate it with a collection of relevant full-rank quadratic convex functions, with the goal of recovering an affine OT map from the start, as illustrated in the experiments of App. C.1.

Quadratic Potentials Lower Bounded by 0. Naturally, for any choice of t(A, b) one recovers the property that (∇f⋆_{N1,N2})♯N1 = N2. When used in deep architectures, the level of that constant does, however, play a role, since convex functions in ICNNs are typically thresholded or modulated using rectifying functions. To remove this ambiguity, we settle on a choice of t(A, b) such that the lowest value reached by f⋆_{N1,N2} is 0. This can be obtained by setting

t(A, b) := ½ bᵀ (AᵀA)^{−1} b, (8)

which results in the following expression, writing ω := m1 − (AᵀA)^{−1} m2,

f⋆_{N1,N2}(x) = ½ ‖A (x + (AᵀA)^{−1} b)‖² = ½ ‖A (x − ω)‖². (9)

To mimic these potential functions, we introduce a quadratic layer parameterized by a weight matrix M and a bias vector m, defined as q_{M,m}(x) := ½ ‖M(x − m)‖²₂. By design, q_{M,m} is a convex, non-negative quadratic layer. Finally, one has the following relationships,

∇q_{I,0_d} = Id, (∇q_{A,ω})♯N1 = N2. (10)

ICNN Initialization. We explore two possible initializers for the ICNN (3) in the context of OT.

Identity Initialization. The first approach ensures that, upon initialization, the ICNN's gradient mimics the identity map, i.e., ∇Φθ(x) = x for any x. We do so by injecting in the initial hidden state z0 the halved squared norm of the input vector, ½‖x‖², cast as a trainable layer q_{M,m} initialized with M = I and m = 0_d, see (10). The remaining parameters are chosen to propagate that norm throughout layers using averages. This amounts to the following choices:

1. Set all σ_k to be activations such that σ′_k(u) ≈ 1 for u large enough, e.g., (leaky) ReLU or softplus.
2. Introduce an initialization layer, z0 = q_{M,m}(x)·1, itself initialized with M = I and m = 0_d.
3. Initialize all matrices W^z_k to 1_{d2,d1}/d1, where d1, d2 are the dimensions of these matrices.
4. Initialize all matrices W^x_k to 0.
5. Initialize the biases b_k to s·1, where s is a value large enough that σ′_k(s) ≈ 1.

Gaussian Initialization. The second approach can be used to initialize an ICNN so that its gradient mimics the affine transport map between the Gaussian approximations of µ and ν. To this end, we follow all of the steps outlined above, except for step 2, where the quadratic layer q_{M,m} is instead initialized with M = A and m = m1 − (AᵀA)^{−1} m2 = ω, using the notations in (7), (8), (9), where m1, m2, Σ1, Σ2 are replaced by the empirical means and covariances of µ and ν. Throughout the experiments, we use the Gaussian and identity initializations. Further comparisons between the vanilla initialization and those introduced in this work can be found in App. C.1 (Fig. 8).
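The following NumPy sketch checks the two relationships in (10) numerically: the gradient of q_{I,0} is the identity map, and the gradient of q_{A,ω}, with A and ω built from Gaussian moments as in (7)-(9), pushes N1 onto N2. The moments, sample size, and rounding are arbitrary choices made for the illustration.

```python
import numpy as np

def sqrtm(S):
    # symmetric PSD matrix square root
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def quad_layer(M, m):
    """q_{M,m}(x) = 0.5 * ||M (x - m)||^2, with gradient M^T M (x - m)."""
    q = lambda x: 0.5 * np.sum((M @ (x - m)) ** 2)
    grad = lambda x: M.T @ M @ (x - m)
    return q, grad

# Gaussian choice (Eq. 7-9): A^T A is the affine Monge matrix from N(m1,S1) to N(m2,S2)
m1, m2 = np.zeros(2), np.array([1.0, -2.0])
S1, S2 = np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])
S1h = sqrtm(S1); S1h_inv = np.linalg.inv(S1h)
AtA = S1h_inv @ sqrtm(S1h @ S2 @ S1h) @ S1h_inv
A = sqrtm(AtA)                                   # any factor with A^T A = AtA works
omega = m1 - np.linalg.inv(AtA) @ m2
_, grad_q = quad_layer(A, omega)

x = np.random.default_rng(0).multivariate_normal(m1, S1, size=20_000)
y = np.apply_along_axis(grad_q, 1, x)
print(np.round(y.mean(0), 2), np.round(np.cov(y.T), 2))   # ~ m2 and ~ S2, cf. Eq. (10)

# identity choice: the gradient of q_{I, 0} is the identity map
_, grad_id = quad_layer(np.eye(2), np.zeros(2))
assert np.allclose(grad_id(np.array([0.3, -1.2])), [0.3, -1.2])
```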
Table 1: Evaluation of drug effect predictions from control cells to cells treated with the drug Givinostat, when conditioning on various covariates influencing cellular responses, such as drug dosage and cell type. Results are reported based on MMD and the ℓ2 distance between perturbation signatures (ℓ2(PS)) of marker genes in the 1,000-dimensional gene expression space.

| Method | Dosage, In-Sample MMD | Dosage, In-Sample ℓ2(PS) | Dosage, Out-of-Sample MMD | Dosage, Out-of-Sample ℓ2(PS) | Cell Line, In-Sample MMD | Cell Line, In-Sample ℓ2(PS) |
|---|---|---|---|---|---|---|
| CPA (Lotfollahi et al., 2021) | 0.1502 ± 0.0769 | 2.47 ± 2.89 | 0.1568 ± 0.0729 | 2.65 ± 2.75 | 0.2551 ± 0.006 | 2.71 ± 1.51 |
| ICNN OT (Makkuva et al., 2020) | 0.0365 ± 0.0473 | 2.37 ± 2.15 | 0.0466 ± 0.0479 | 2.24 ± 2.39 | 0.0206 ± 0.0109 | 1.16 ± 0.75 |
| CONDOT (Identity initialization) | 0.0111 ± 0.0055 | 0.63 ± 0.09 | 0.0374 ± 0.0052 | 2.02 ± 0.10 | 0.0148 ± 0.0078 | 0.39 ± 0.06 |
| CONDOT (Gaussian initialization) | 0.0128 ± 0.0081 | 0.60 ± 0.11 | 0.0325 ± 0.0062 | 1.84 ± 0.14 | 0.0146 ± 0.0074 | 0.41 ± 0.07 |

Figure 4: a. Predictive performance of CONDOT and baselines w.r.t. the entropy-regularized Wasserstein distance on drug dosages in-sample, i.e., seen during training, and out-of-sample, i.e., unseen during training. b. Marginal distributions of observed source and target distributions, as well as predictions on perturbed distributions by CONDOT and baselines, for an exemplary gene across different cell lines. Predicted marginals of each method should match the marginal of the target population.

PICNN Initialization. Recall for convenience that a K-layer PICNN architecture reads:

u_{k+1} = κ_k(V_k u_k + v_k),
z_{k+1} = σ_k(W^z_k (z_k ⊙ [W^{zu}_k u_k + b^z_k]_+) + W^x_k (x ⊙ (W^{xu}_k u_k + b^x_k)) + W^u_k u_k + b_k),
Φθ(x, c) = z_K.

In their original form (Amos et al., 2017, Eq. 3), PICNNs are initialized by setting u0 = c and z0 to a zero vector of suitable size. Intuitively, the hidden states u_k act as context-dependent modulators, whereas the vectors z_k propagate, layer after layer, a collection of convex functions in x that are iteratively refined, while retaining the property that they are each convex in x. A reasonable initialization for a PICNN that is provided a context vector c is that, if c ≈ cj (where cj is in the training set), then ∇1 Φθ(·, c) maps approximately µj to νj, which can be obtained by having Φθ(·, c) mimic the closed-form Brenier potential between the Gaussian approximations of µj, νj. Alternatively, one may also default to an identity initialization as discussed above. To obtain either behavior, we make the following modifications, and refer to the illustration in Fig. 3:

1. The modulator is set to u0(c) = softmax(cᵀM), where M is initialized as M = [cj]j, and V_k = I, v_k = 0.
2. z0 = [q_{Mj,mj}(x)]j, where the weight matrices and biases (Mj, mj) are either initialized to (I, 0) or to (Aj, ωj), recovered by solving the Gaussian affine map from µj to νj using (9).
3. The modulator u0 is passed directly to the hidden state upon the first iteration, with W^{zu}_0 = 0.
4. All subsequent matrices W^z_k are initialized to 1_{d2,d1}/d1, where d1, d2 are their dimensions.
5. The matrices W^x_k are initialized to 0, and the biases b^z_k are set as in step 5 of the ICNN initialization.
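To convey the intuition behind this initialization (and behind Fig. 3c), the sketch below forms a softmax-weighted combination of per-context quadratic potentials and checks that, when the query context is close to a training label cj, the gradient of the resulting potential is close to the corresponding affine map. This is a deliberately simplified stand-in for the actual PICNN computation at initialization; the contexts and quadratic parameters are made up for the example.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def init_potential(contexts, quads):
    """Simplified picture of the initialized PICNN: the context c softmax-weights
    per-context quadratic potentials q_j(x) = 0.5 * ||M_j (x - m_j)||^2."""
    C = np.stack(contexts)                      # rows are the training contexts c_j
    def phi0(x, c):
        w = softmax(C @ c)                      # u0(c): compare c to the known labels
        return sum(wj * 0.5 * np.sum((Mj @ (x - mj)) ** 2)
                   for wj, (Mj, mj) in zip(w, quads))
    return phi0

def grad_x(f, x, c, eps=1e-5):
    # finite-difference gradient in x, used only for the check below
    return np.array([(f(x + eps * e, c) - f(x - eps * e, c)) / (2 * eps)
                     for e in np.eye(x.size)])

# two contexts, each paired with (M_j, m_j) as would be obtained from Gaussian fits via (9)
contexts = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
quads = [(np.eye(2), np.zeros(2)), (2.0 * np.eye(2), np.ones(2))]
phi0 = init_potential(contexts, quads)

x = np.array([0.4, -0.7])
print(grad_x(phi0, x, 10.0 * contexts[0]))   # ~ grad q_1(x) = x
print(grad_x(phi0, x, 10.0 * contexts[1]))   # ~ grad q_2(x) = 4 (x - 1)
```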
5 Evaluation

Biological cells undergo changes in their molecular profiles upon chemical, genetic, or mechanical perturbations. These changes can be measured using recent technological advancements in high-resolution, multivariate single-cell biology. Measuring single cells in their unperturbed or perturbed state requires, however, destroying them, resulting in populations µ and ν that are unpaired. The relevance of OT to this problem comes from its ability to resolve such ambiguities through OT maps, holding promises of a better understanding of health and disease. We consider various high-dimensional problems arising from this scenario to evaluate the performance of CONDOT (§3) versus other baselines.

Figure 5: Comparison between a. CONDOT and ICNN OT (Makkuva et al., 2020) based on the embedding Emoa, b. as well as Eohe, and c. CONDOT and CPA (Lotfollahi et al., 2021) based on the embedding Eohe, on known and unknown perturbations or combinations. Results above the diagonal suggest higher predictive performance of CONDOT.

5.1 Population Dynamics Conditioned on Scalars

Upon application of a molecular drug, the state of each cell xi of the unperturbed population µ is altered, and observed in population ν. Molecular drugs are often applied at different dosage levels t, and the magnitude of the changes in the gene expression profiles of single cells highly correlates with that dosage. We seek to learn a global, parameterized transport map Tθ sensitive to that dosage. We evaluate our method on the task of inferring single-cell perturbation responses to the cancer drug Givinostat, a histone deacetylase inhibitor with potential anti-inflammatory, anti-angiogenic, and antineoplastic activities (Srivatsan et al., 2020), applied at different dosage levels, i.e., t ∈ {10 nM, 100 nM, 1,000 nM, 10,000 nM}. The dataset contains 3,541 cells described with the gene expression levels of 1,000 highly variable genes. In a first experiment, we measure how well CONDOT captures the drug effects at different dosage levels via distributional distances such as MMD (Gretton et al., 2012) and the ℓ2-norm between the corresponding perturbation signatures (PS), as well as the entropy-regularized Wasserstein distance (Cuturi, 2013). We compute the metrics on 50 marker genes, i.e., the genes most affected upon perturbation. For more details on the evaluation metrics, see App. E.2. To put CONDOT's performance into perspective, we compare it to current state-of-the-art baselines (Lotfollahi et al., 2021) as well as to parameterized Monge maps without context variables (Bunne et al., 2021; Makkuva et al., 2020, ICNN OT), see App. E.1. As visible in Table 1 and Fig. 4a, CONDOT achieves consistently more accurate predictions of the target cell populations at different dosage levels than OT approaches that cannot utilize context information, demonstrated through a lower average loss and a smaller variance. This becomes even more evident when moving to the setting where the model has been trained only on a subset of dosages and we test CONDOT on out-of-sample dosages. Table 1 and Fig. 4a demonstrate that CONDOT is able to generalize to previously unknown dosages, thus learning to interpolate the perturbation effects from the dosages seen during training. For further analysis, we refer the reader to App. E (see Fig. 9 and 10).
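For reference, distributional metrics of the kind reported in Table 1 can be computed directly from the two sets of cell profiles. Below is a sketch of the (biased) RBF-kernel MMD estimator (Gretton et al., 2012) between a predicted and an observed population; the median-heuristic bandwidth and the toy data are our own choices, and the exact estimator and kernel parameters used in App. E.2 may differ.

```python
import numpy as np

def mmd_rbf(X, Y, bandwidth=None):
    """Biased MMD^2 estimate between samples X (n, d) and Y (m, d) with an RBF kernel.
    If no bandwidth is given, the median heuristic on pairwise distances is used."""
    Z = np.vstack([X, Y])
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)   # pairwise squared dists
    if bandwidth is None:
        bandwidth = np.sqrt(0.5 * np.median(d2[d2 > 0]))
    K = np.exp(-d2 / (2.0 * bandwidth ** 2))
    n = len(X)
    Kxx, Kyy, Kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
target = rng.normal(loc=1.0, size=(500, 50))       # observed perturbed population
good_pred = rng.normal(loc=1.0, size=(500, 50))    # prediction matching the target
bad_pred = rng.normal(loc=0.0, size=(500, 50))     # prediction missing the shift
print(mmd_rbf(good_pred, target), "<", mmd_rbf(bad_pred, target))
```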
We further provide an additional comparison of CONDOT, operating in the multi-task setting, to the single-task performance of optimal-transport-based methods in App. C.4. While the single-task setting of course fails to generalize to new contexts and requires all contexts to be distinctly known, it provides us with a pseudo lower bound, which CONDOT is able to reach (see Table 2).

5.2 Population Dynamics Conditioned on Covariates

Molecular processes are often highly dependent on additional covariates that steer experimental conditions, and which are not present in the features measured in the populations µ or ν. These can be, for instance, factors such as different cell types clustered within the populations. When the model can only be conditioned w.r.t. a small and fixed set of metadata information, such as cell types, it is sufficient to encode these contexts using a one-hot embedding module Eohe.

Figure 6: Predictive performance for a. known perturbations and b. unknown perturbations in combination w.r.t. the regularized Wasserstein distance, and c. MMD, over different train / test splits of increasing difficulty, for the baseline CPA as well as CONDOT with different combinators C^ohe_+ and C^moa_Φ. For more details on the dataset splits, see App. D.2.

To illustrate this problem, we consider cell populations comprising three different cell lines (A549, MCF7, and K562). As visible in Table 1, CONDOT outperforms current baselines which equally condition on covariate information, such as CPA (Lotfollahi et al., 2021), assessed through various evaluation metrics. Figure 4b displays a gene showing markedly different responses to the drug Givinostat depending on the cell line. CONDOT captures the distribution shift from control to target populations consistently across the different cell lines.

5.3 Population Dynamics Conditioned on Actions

To recommend personalized medical procedures for patients, or to improve our understanding of genetic circuits, it is key to be able to predict the outcomes of novel perturbations, arising from combinations of drugs or of genetic perturbations. Rather than learning individual maps T^a predicting the effect of individual treatments, we aim at learning a global map Tθ which, given as input the unperturbed population µ as well as the action a of interest, predicts the cell state perturbed by a. Thanks to its modularity, CONDOT can not only learn a map Tθ for all actions known during training, but also generalize to unknown actions, as well as to potential combinations of actions. We discuss all three scenarios below.

5.3.1 Known Actions

In the following, we analyze CONDOT's ability to accurately predict phenotypes of genetic perturbations based on single-cell RNA-sequencing pooled CRISPR screens (Norman et al., 2019; Dixit et al., 2016), comprising 98,419 single-cell gene expression profiles with 92 different genetic perturbations, each cell described by a 1,500-dimensional vector of highly variable gene expression levels. As, in a first step, we do not aim at generalizing beyond perturbations encountered during training, we again utilize a one-hot embedding Eohe to condition Tθ on each perturbation a.
We compare our method to other baselines capable of modeling the effects of a large set of perturbations, such as CPA (Lotfollahi et al., 2021). Often, the effects of genetic perturbations are subtle in the high-dimensional gene expression profile of single cells. Using ICNN-parameterized OT maps without context information, we can thus assess the gain in accuracy in predicting the perturbed target population obtained by incorporating context-awareness over simply predicting an average perturbation effect. Figures 5a and b demonstrate this comparison to OT maps without context information; ablation studies and the comparison to the current state-of-the-art method CPA (Lotfollahi et al., 2021) are shown in Fig. 5c and Fig. 6a. Compared to both, CONDOT captures the perturbation responses more accurately w.r.t. the Wasserstein distance.

5.3.2 Unknown Actions

With the emergence of new perturbations or drugs, we aim at inferring cellular responses in settings not explored during training. One-hot embeddings, however, do not allow us to model unknown perturbations. This requires us to use an embedding E which can provide us with a representation of an unknown action a′. As genetic perturbations further have no readily meaningful embeddings such as, for example, molecular fingerprints for drugs, we resort to the mode-of-action embeddings introduced in §3.3.

Figure 7: UMAP embeddings of cells perturbed by the combination KLF1+MAP2K6 (gray) and predictions by CONDOT (ours), ICNN OT (Makkuva et al., 2020), and CPA (blue). While CONDOT aligns well with the observed perturbed cells, the baselines fail to capture certain subpopulations.

Assuming marginal sample access to all individual perturbations, we compute a multidimensional scaling (MDS)-based embedding from pairwise Wasserstein distances between the individual target populations, such that perturbations with similar effects are represented close to each other. For details, see App. E. As current state-of-the-art methods are restricted to modeling perturbations via one-hot encodings, we compare our method to ICNN OT only. As displayed in Fig. 5a, CONDOT accurately captures the response to unknown actions (BAK1, FOXF1, MAP2K6, MAP4K3), which were not seen during training, at a Wasserstein loss similar to that of perturbation effects seen during training. For more details, see App. E.
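A sketch of how such a mode-of-action embedding can be assembled is given below. For brevity, the pairwise Wasserstein distances are computed in closed form between Gaussian approximations of the target populations (the Bures-Wasserstein distance of Gelbrich, 1990), rather than between the empirical populations as done in App. E, and scikit-learn's MDS is used for the embedding step; dimensions and toy populations are arbitrary.

```python
import numpy as np
from sklearn.manifold import MDS

def sqrtm(S):
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def gaussian_w2(X, Y):
    """2-Wasserstein distance between Gaussian fits of two populations (Gelbrich, 1990)."""
    m1, m2 = X.mean(0), Y.mean(0)
    S1, S2 = np.cov(X.T), np.cov(Y.T)
    S1h = sqrtm(S1)
    return np.sqrt(np.sum((m1 - m2) ** 2)
                   + np.trace(S1 + S2 - 2.0 * sqrtm(S1h @ S2 @ S1h)))

def moa_embedding(populations, dim=8, seed=0):
    """Mode-of-action embedding: MDS on the pairwise distance matrix, so that
    perturbations with similar effects on the target population end up close."""
    n = len(populations)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = gaussian_w2(populations[i], populations[j])
    mds = MDS(n_components=dim, dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(D)          # one embedding vector per perturbation

rng = np.random.default_rng(0)
pops = [rng.normal(loc=mu, size=(300, 20)) for mu in (0.0, 0.1, 2.0)]
E = moa_embedding(pops, dim=2)
print(np.linalg.norm(E[0] - E[1]), "<", np.linalg.norm(E[0] - E[2]))
```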
5.3.3 Actions in Combination

While experimental studies can often measure perturbation effects in biological systems in isolation, the combinatorial space of perturbations in composition is too large to capture experimentally. Often, however, combination therapies are cornerstones of cancer therapy (Mokhtari et al., 2017). In the following, we test different combinator architectures to predict genetic perturbations in combination from single targets. Similarly to Lotfollahi et al. (2021), we can embed combinations by adding the individual one-hot encodings of single perturbations (i.e., C^ohe_+). In addition, we parameterize a combinator via a permutation-invariant deep set, as introduced in §3.3, based on mode-of-action embeddings of individual perturbations (i.e., C^moa_Φ). We split the dataset into train / test splits of increasing difficulty (details on the dataset splits in App. D.2). Initially containing all individual perturbations as well as some combinations, the number of perturbations seen in combination during training decreases over each split. For more details, see App. E. We compare the different combinators to ICNN OT (Fig. 5b) and CPA (Lotfollahi et al., 2021) (Fig. 5c, Fig. 6b, c). While the performance drops compared to inference on known perturbations (Fig. 6a) and decreases with the increasing difficulty of the train / test split, CONDOT outperforms all baselines. When embedding these high-dimensional populations in a low-dimensional UMAP space (McInnes et al., 2018), one can see that CONDOT captures the entire perturbed population, while ICNN OT and CPA fail to capture certain subpopulations in the perturbed state (see Fig. 7 and 11).

6 Conclusion

We have developed the CONDOT framework, which is able to infer OT maps from not only one pair of measures, but many pairs that come labeled with a context value. To ensure that CONDOT encodes optimal transports, we parameterize it as a PICNN, an input-convex NN that modulates the values of its weight matrices according to a sequence of feature representations of that context vector. We showcased the generalization abilities of CONDOT in the extremely challenging task of predicting outcomes for unseen combinations of treatments. These abilities, and PICNNs more generally, hold several promises, both as an augmentation of the OTT toolbox (Cuturi et al., 2022) and for future applications of OT to single-cell genomics.

Acknowledgments and Disclosure of Funding

This publication was supported by the NCCR Catalysis (grant number 180544), a National Centre of Competence in Research funded by the Swiss National Science Foundation. We thank Stefan Stark and Gabriele Gut for helpful discussions and the reviewers for their thoughtful comments and efforts towards improving our manuscript.

References

D. Alvarez-Melis, Y. Schiff, and Y. Mroueh. Optimizing Functionals on the Space of Probabilities with Input Convex Neural Networks. arXiv preprint arXiv:2106.00774, 2021.
B. Amos, L. Xu, and J. Z. Kolter. Input Convex Neural Networks. In International Conference on Machine Learning (ICML), volume 34, 2017.
M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein Generative Adversarial Networks. In International Conference on Machine Learning (ICML). PMLR, 2017.
Y. Brenier. Décomposition polaire et réarrangement monotone des champs de vecteurs. CR Acad. Sci. Paris Sér. I Math., 305, 1987.
C. Bunne, S. G. Stark, G. Gut, J. S. del Castillo, K.-V. Lehmann, L. Pelkmans, A. Krause, and G. Rätsch. Learning Single-Cell Perturbation Responses using Neural Optimal Transport. bioRxiv, 2021.
C. Bunne, Y.-P. Hsieh, M. Cuturi, and A. Krause. Recovering Stochastic Dynamics via Gaussian Schrödinger Bridges. arXiv preprint arXiv:2202.05722, 2022a.
C. Bunne, L. Meng-Papaxanthos, A. Krause, and M. Cuturi. Proximal Optimal Transport Modeling of Population Dynamics. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 25, 2022b.
R. Caruana. Multitask Learning. Machine Learning, 28(1), 1997.
Y. Chandak, G. Theocharous, J. Kostas, S. Jordan, and P. Thomas. Learning Action Representations for Reinforcement Learning. In International Conference on Machine Learning (ICML), 2019.
Y. Chen, Y. Shi, and B. Zhang. Optimal Control Via Neural Networks: A Convex Approach. In International Conference on Learning Representations (ICLR), 2019.
S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1. IEEE, 2005.
M. Cuturi. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. In Advances in Neural Information Processing Systems (NeurIPS), volume 26, 2013.
M. Cuturi and G. Peyré.
Semidual Regularized Optimal Transport. SIAM Review, 60(4):941-965, 2018.
M. Cuturi, L. Meng-Papaxanthos, Y. Tian, C. Bunne, G. Davis, and O. Teboul. Optimal Transport Tools (OTT): A JAX Toolbox for all things Wasserstein. arXiv preprint arXiv:2201.12324, 2022.
J. De Leeuw and P. Mair. Multidimensional Scaling Using Majorization: SMACOF in R. Journal of Statistical Software, 31, 2009.
A. Dixit, O. Parnas, B. Li, J. Chen, C. P. Fulco, L. Jerby-Arnon, N. D. Marjanovic, D. Dionne, T. Burks, R. Raychowdhury, et al. Perturb-Seq: Dissecting Molecular Circuits with Scalable Single-Cell RNA Profiling of Pooled Genetic Screens. Cell, 167(7), 2016.
J. Fan, A. Taghvaei, and Y. Chen. Scalable Computations of Wasserstein Barycenter via Input Convex Neural Networks. In International Conference on Machine Learning (ICML), 2021.
M. Gelbrich. On a Formula for the L2 Wasserstein Metric between Measures on Euclidean and Hilbert Spaces. Mathematische Nachrichten, 147(1), 1990.
A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré. Sample Complexity of Sinkhorn Divergences. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 22, 2019.
A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A Kernel Two-Sample Test. Journal of Machine Learning Research, 13, 2012.
D. Ha, A. Dai, and Q. V. Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
T. Hashimoto, D. Gifford, and T. Jaakkola. Learning Population-Level Diffusions with Generative Recurrent Networks. In International Conference on Machine Learning (ICML), volume 33, 2016.
C.-W. Huang, R. T. Q. Chen, C. Tsirigotis, and A. Courville. Convex Potential Flows: Universal Probability Distributions with Optimal Transport and Convex Optimization. In International Conference on Learning Representations (ICLR), 2021.
J.-C. Hütter and P. Rigollet. Minimax estimation of smooth optimal transport maps. The Annals of Statistics, 49(2), 2021.
L. Jacob, J. She, A. Almahairi, S. Rajeswar, and A. Courville. W2GAN: Recovering an Optimal Transport Map with a GAN. arXiv preprint, 2018.
H. Janati, T. Bazeille, B. Thirion, M. Cuturi, and A. Gramfort. Multi-subject MEG/EEG source imaging with sparse multi-task regression. NeuroImage, 220, 2020.
L. Kantorovich. On the transfer of masses (in Russian). In Doklady Akademii Nauk, volume 37, 1942.
D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR), 2014.
A. Korotin, V. Egiazarian, A. Asadulaev, A. Safin, and E. Burnaev. Wasserstein-2 Generative Networks. In International Conference on Learning Representations (ICLR), 2020.
A. Korotin, L. Li, A. Genevay, J. M. Solomon, A. Filippov, and E. Burnaev. Do Neural Optimal Transport Solvers Work? A Continuous Wasserstein-2 Benchmark. Advances in Neural Information Processing Systems (NeurIPS), 34, 2021.
S. Kummar, H. X. Chen, J. Wright, S. Holbeck, M. D. Millin, J. Tomaszewski, J. Zweibel, J. Collins, and J. H. Doroshow. Utilizing targeted cancer therapeutic agents in combination: novel approaches and urgent requirements. Nature Reviews Drug Discovery, 9(11), 2010.
M. Lotfollahi, F. A. Wolf, and F. J. Theis. scGen predicts single-cell perturbation responses. Nature Methods, 16(8), 2019.
M. Lotfollahi, M. Naghipourfar, F. J. Theis, and F. A. Wolf. Conditional out-of-distribution generation for unpaired data using transfer VAE. Bioinformatics, 36, 2020.
M. Lotfollahi, A. K. Susmelj, C. De Donno, Y. Ji, I. L. Ibarra, F. A. Wolf, N. Yakubova, F. J. Theis, and D.
Lopez-Paz. Compositional perturbation autoencoder for single-cell response modeling. bioRxiv, 2021.
R. K. Mahabadi, S. Ruder, M. Dehghani, and J. Henderson. Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2021.
A. Makkuva, A. Taghvaei, S. Oh, and J. Lee. Optimal transport mapping via input convex neural networks. In International Conference on Machine Learning (ICML), volume 37, 2020.
R. J. McCann. A Convexity Principle for Interacting Gases. Advances in Mathematics, 128(1), 1997.
L. McInnes, J. Healy, and J. Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint, 2018.
A. Mead. Review of the Development of Multidimensional Scaling Methods. Journal of the Royal Statistical Society: Series D (The Statistician), 41(1), 1992.
T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient Estimation of Word Representations in Vector Space. International Conference on Learning Representations (ICLR), Workshop Track, 2013a.
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems (NeurIPS), 26, 2013b.
T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic Regularities in Continuous Space Word Representations. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), 2013c.
R. B. Mokhtari, T. S. Homayouni, N. Baluch, E. Morgatskaya, S. Kumar, B. Das, and H. Yeger. Combination therapy in combating cancer. Oncotarget, 8(23):38022, 2017.
P. Mokrov, A. Korotin, L. Li, A. Genevay, J. Solomon, and E. Burnaev. Large-Scale Wasserstein Gradient Flows. Advances in Neural Information Processing Systems (NeurIPS), 2021.
G. Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie Royale des Sciences, pages 666-704, 1781.
T. M. Norman, M. A. Horlbeck, J. M. Replogle, A. Y. Ge, A. Xu, M. Jost, L. A. Gilbert, and J. S. Weissman. Exploring genetic interaction manifolds constructed from rich single-cell phenotypes. Science, 365(6455), 2019.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2011.
G. Peyré and M. Cuturi. Computational Optimal Transport. Foundations and Trends in Machine Learning, 11(5-6), 2019. ISSN 1935-8245.
A.-A. Pooladian and J. Niles-Weed. Entropic estimation of optimal transport maps. arXiv preprint arXiv:2109.12004, 2021.
N. Prasad, K. Yang, and C. Uhler. Optimal Transport using GANs for Lineage Tracing. arXiv preprint arXiv:2007.12098, 2020.
A. Rambaldi, C. M. Dellacasa, G. Finazzi, A. Carobbio, M. L. Ferrari, P. Guglielmelli, E. Gattoni, S. Salmoiraghi, M. C. Finazzi, S. Di Tollo, et al. A pilot study of the Histone-Deacetylase inhibitor Givinostat in patients with JAK2V617F positive chronic myeloproliferative neoplasms. British Journal of Haematology, 150(4):446-455, 2010.
D. Rezende and S. Mohamed. Variational Inference with Normalizing Flows. In International Conference on Machine Learning (ICML), 2015.
J. Richter-Powell, J. Lorraine, and B. Amos. Input Convex Gradient Networks. arXiv preprint arXiv:2111.12187, 2021.
P. Rigollet and A. J. Stromme.
On the sample complexity of entropic optimal transport. arXiv preprint arXiv:2206.13472, 2022.
D. Rogers and M. Hahn. Extended-Connectivity Fingerprints. Journal of Chemical Information and Modeling, 50(5), 2010.
Y. Rong, Y. Bian, T. Xu, W. Xie, Y. Wei, W. Huang, and J. Huang. Self-Supervised Graph Transformer on Large-Scale Molecular Data. Advances in Neural Information Processing Systems (NeurIPS), 2020.
F. Santambrogio. Optimal Transport for Applied Mathematicians. Birkhäuser, NY, 55(58-63):94, 2015.
G. Schiebinger, J. Shu, M. Tabaka, B. Cleary, V. Subramanian, A. Solomon, J. Gould, S. Liu, S. Lin, P. Berube, et al. Optimal-Transport Analysis of Single-Cell Gene Expression Identifies Developmental Trajectories in Reprogramming. Cell, 176(4), 2019.
M. A. Schmitz, M. Heitz, N. Bonneel, F. Ngole, D. Coeurjolly, M. Cuturi, G. Peyré, and J.-L. Starck. Wasserstein dictionary learning: Optimal transport-based unsupervised nonlinear dictionary learning. SIAM Journal on Imaging Sciences, 11(1):643-678, 2018.
P. Schwaller, A. C. Vaucher, R. Laplaza, C. Bunne, A. Krause, C. Corminboeuf, and T. Laino. Machine intelligence for chemical reaction space. Wiley Interdisciplinary Reviews: Computational Molecular Science, page e1604, 2022.
S. R. Srivatsan, J. L. McFaline-Figueroa, V. Ramani, L. Saunders, J. Cao, J. Packer, H. A. Pliner, D. L. Jackson, R. M. Daza, L. Christiansen, et al. Massively multiplex chemical transcriptomics at single-cell resolution. Science, 367(6473), 2020.
V. Stathias, A. M. Jermakowicz, M. E. Maloof, M. Forlin, W. Walters, R. K. Suter, M. A. Durante, S. L. Williams, J. W. Harbour, C.-H. Volmar, et al. Drug and disease signature integration identifies synergistic combinations in glioblastoma. Nature Communications, 9(1), 2018.
G. Tennenholtz and S. Mannor. The Natural Language of Actions. In International Conference on Machine Learning (ICML), 2019.
C. Villani. Topics in Optimal Transportation, volume 58. American Mathematical Society, 2003.
F. A. Wolf, P. Angerer, and F. J. Theis. SCANPY: large-scale single-cell gene expression data analysis. Genome Biology, 19(1), 2018.
K. D. Yang and C. Uhler. Scalable Unbalanced Optimal Transport using Generative Adversarial Networks. International Conference on Learning Representations (ICLR), 2019.
K. D. Yang, K. Damodaran, S. Venkatachalapathy, A. C. Soylemezoglu, G. Shivashankar, and C. Uhler. Predicting cell lineages using autoencoders and optimal transport. PLoS Computational Biology, 16(4), 2020.
M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep Sets. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017.