# unlinked_monotone_regression__f337f08f.pdf

Journal of Machine Learning Research 22 (2021) 1-60 Submitted 7/20; Revised 4/21; Published 7/21

Unlinked Monotone Regression

Fadoua Balabdaoui fadoua.balabdaoui@stat.math.ethz.ch Seminar f ur Statistik ETH, Zurich R amistrasse 101 8092, Zurich, Switzerland

Charles R. Doss cdoss@umn.edu School of Statistics University of Minnesota Minneapolis, MN 55455, USA

C ecile Durot cecile.durot@gmail.com Modal x Universit e Paris Nanterre F92000, Nanterre, France

Editor: Gabor Lugosi

We consider so-called univariate unlinked (sometimes decoupled, or shuﬄed ) regression when the unknown regression curve is monotone. In standard monotone regression, one observes a pair (X, Y ) where a response Y is linked to a covariate X through the model Y = m0(X) + ϵ, with m0 the (unknown) monotone regression function and ϵ the unobserved error (assumed to be independent of X). In the unlinked regression setting one gets only to observe a vector of realizations from both the response Y and from the covariate X where now Y d= m0(X) + ϵ. There is no (observed) pairing of X and Y . Despite this, it is actually still possible to derive a consistent non-parametric estimator of m0 under the assumption of monotonicity of m0 and knowledge of the distribution of the noise ϵ. In this paper, we establish an upper bound on the rate of convergence of such an estimator under minimal assumption on the distribution of the covariate X. We discuss extensions to the case in which the distribution of the noise is unknown. We develop a second order algorithm for its computation, and we demonstrate its use on synthetic data. Finally, we apply our method (in a fully data driven way, without knowledge of the error distribution) on longitudinal data from the US Consumer Expenditure Survey. Keywords: deconvolution, quantile, monotone regression, rates, shuﬄed, uncoupled, unlinked

1 Introduction 2

2 The Minimum Contrast Estimator: Existence 6 2.1 Setup, Terminology, and Notation . . . . . . . . . . . . . . . . . . . . . 6 2.2 Existence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Convergence Rates of the Minimum Contrast Estimator 9 3.1 Convergence Rate Under Ordinary Smooth Noise . . . . . . . . . . . . . 9

c 2021 Fadoua Balabdaoui, Charles R. Doss, C ecile Durot.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-689.html.

Balabdaoui, Doss, and Durot

3.2 Convergence Rate Under Supersmooth Noise . . . . . . . . . . . . . . . 14 3.3 Convergence Rate Under a Discrete Noise Distribution . . . . . . . . . . 14 3.4 Convergence Rate in the Case of Diﬀerent Sample Sizes . . . . . . . . . 15 3.5 Uniform Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4 Estimation of Moments of m0(X) 16

5 Fenchel Optimality Conditions 17

6 Computation 18

7 Extension to the Case of Unknown Noise Distribution 20 7.1 Estimation of the Noise Distribution in the Semi-supervised Setting . . 20 7.2 Estimation of the Noise Distribution with Longitudinal Responses . . . 20

8 Demonstrations on Synthetic and Real Data 21 8.1 Computations on Synthetic Data . . . . . . . . . . . . . . . . . . . . . . 21 8.2 Computations on CEX Data . . . . . . . . . . . . . . . . . . . . . . . . 22

9 Conclusions and Directions for Future Research 25

A Bounding the Integrals in (56) and (57). 26

B Basic Empirical Process Theory Deﬁnitions and a Fundamental Preservation Result 27

C Wasserstein Distance Lemmas 28

D Gradient, Curvature, and Other Algorithmic Computations 28 D.1 Proof of (24) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 D.2 Curvature Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 D.3 Computations for Laplace Distribution . . . . . . . . . . . . . . . . . . . 30 D.4 Computations for Gaussian Errors . . . . . . . . . . . . . . . . . . . . . 30 D.5 Mixtures of Gaussian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 D.6 On Implementation of Algorithm 1 . . . . . . . . . . . . . . . . . . . . . 31

E Proofs 31 E.1 A Preparatory Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 E.2 Proofs for Section 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 E.3 Proofs for Section 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 E.4 Proofs for Section 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

F Proof of Proposition 7 53

1. Introduction

An important part of data science is the construction of a data set; nowadays, because there are so many diﬀerent entities collecting increasing amounts of data, data sets are often constructed by combining separate sub-data sets or datastreams. Also, data sets sometimes undergo some form of anonymization: this can be due to the increasing prevalence of privacy concerns, or in some cases due to concerns about having limited

Unlinked Monotone Regression

data-transmission bandwidth when many separate sensors are streaming data to a central server (Pananjady et al., 2016). Thus, it is increasingly common for data scientists/analysts to want to relate variables in one data set to variables in another data set when the two data sets are unlinked. In this paper, we consider the problem of unlinked regression, speciﬁcally when the regression function is assumed to satisfy a monotonicity constraint. In the standard regression setting, we have

Yi = m0(Xi) + ϵi, E(ϵi) = 0, i = 1, . . . , n, (1)

for a random noise variable ϵi that is independent of Xi. The most basic assumption of this model is that for each index i = 1, . . . , n, the pair (Xi, Yi) is observed. For now, we assume that the covariates Xi, i = 1, . . . , n are univariate random variables. A more general model than the above standard regression model is the shuﬄed regression, in which we do not get to see the pairs (Xi, Yi), i = 1, . . . , n; rather, we only observe (X1, . . . , Xn) and (Y1, . . . , Yn), without knowing which Xi is paired with which Yi. Thus, we have the same model as (38) except that the ﬁrst equality is now only an equality in distribution and there is an unknown permutation π on {1, . . . , n} such that Yi = m0(Xπ(i)) + ϵi for all i. This happens for instance in case of anonymized data. An even more general model is the unlinked regression model that we consider in this paper, where again, the ﬁrst equality only holds in distribution but where in addition, the Xi s could be observed on diﬀerent individuals from the Yi s so that the two samples are not necessarily connected through a permutation π. This happens for instance if the two samples have been collected independently (by separate entities). The number of observed Xi s may even diﬀer from the number of observed Yi s so we observe independent and identically distributed (i.i.d.) variables Y1, . . . , Yny and i.i.d. variables X1, . . . , Xnx such that

Y d= m0(X) + ϵ (2)

where Y Yi, X Xi, and ϵ is independent of X with distribution function Φϵ. In shuﬄed or unlinked regression models, it may seem hopeless to even try to learn the regression function m0 since m0 is in general not even identiﬁable (see Section 2.1 below); however, it turns out that if m0 is assumed to be monotonically increasing, as we will do in this paper, and if the error distribution is known a priori, then in fact one can construct consistent estimators of m0. (We will return in Section 7 to discuss the case where the error distribution is unknown.) Unlinked estimation can be considered in a variety of settings (e.g., De Groot and Goel (1980)). It appears that unlinked monotone regression was recently introduced by Carpentier and Schl uter (2016). One motivating example discussed in Carpentier and Schl uter (2016) is about expenditure on goods or services, such as housing: the price an individual is willing to pay for housing is expected to be monotonically increasing (at least on average, if not individually) as a function of the individual s salary. However, estimating the monotonic relationship is hindered by the simple fact that the data on wages and housing transactions are often gathered by diﬀerent agents. There are many other motivating examples for unlinked or shuﬄed regression, besides the ones already discussed. (In some examples, there may be information allowing partial matching; see our discussion point 4 in Section 9.) In ﬂow cytometry, cells suspended in a ﬂuid ﬂow past a laser, and the response (scattering of light) reveals information

Balabdaoui, Doss, and Durot

about the cell, which may be explained by its features (e.g., gene expression). However, the order of the cells as they pass the laser is unknown, so we are in a shuﬄed regression setting (Abid et al., 2017). In image stitching (related to the so-called pose-correspondence problem) one wants to ﬁnd the unknown correspondence between point clouds constructed from multiple camera angles of the same image (Pananjady et al., 2018). Several other motivating examples are discussed by Pananjady et al. (2018) (in the context of permuted/shuﬄed linear regression). The method of Carpentier and Schl uter (2016) is based on the fact that monotone unlinked regression can be rephrased as the so-called deconvolution problem, as is apparent from (2). Below, we will discuss links to this problem in more detail. Note that, as is also true in the deconvolution setting, m0 is not identiﬁable if we do not know the distribution of the noise ϵ (or have at least an estimate thereof). The following simple example explains why. Suppose that m0(x) = 2x and X N(0, 1) is independent of ϵ N(0, 1). Then, this model is the same as m0(x) = x and X N(0, 1) independent of ϵ N(0, 4). Let F0 be the cumulative distribution function (CDF) of X and L0 the CDF of m0(X). If both m0 and F0 are assumed to be one-to-one, then L0 satisﬁes

L0(w) = P(m0(X) w) = F0 m 1 0 (w) (3)

for all w in the range of m0, and we have

m0 = L 1 0 F0. (4)

Thus, the estimator constructed by Carpentier and Schl uter (2016) takes the form

em(x) = e L 1 ny Fnx(x) (5)

where e Lny is an estimator of L0 obtained by deconvolution methods and is based on the sample (Y1, . . . , Yny) and knowledge of the distribution of ϵ, and where Fnx can be taken for example to be the empirical distribution function of F0 based on the sample X1, . . . , Xnx. Thus, the estimator in (5) at a point x is equal to the deconvolution estimator of the quantile of m0(X) corresponding to the random level Fnx(x). In the presence of contextual variables (i.e., covariates that are paired with both X and Y ), Carpentier and Schl uter (2016) gave some consistency and rate of convergence results in their Theorem 3.2 under the assumption that m0 and F 1 0 belong to H older classes. Carpentier and Schl uter (2016) do not discuss how to choose the optimal bandwidth, as the main focus in that paper is to show how their estimation approach can be easily implemented for real data sets. Beside Carpentier and Schl uter (2016), the only other work of which we are aware on a similar problem is the very recent article of Rigollet and Weed (2019). In their setting, the authors consider a shuﬄed monotone regression model in a ﬁxed design setting but use the term uncoupled to describe the problem. In fact, the authors do not make any attempt to recover the unknown permutation, and focus entirely, as we do in this paper, on estimating the unknown regression function. Rigollet and Weed (2019) assume that the known distribution of the noise is sub-exponential and the true monotone function is bounded by some known constant, but they do not make any smoothness assumption on that function. Using the Wasserstein s distance and arguments from optimal transport, they showed that their estimator converges to the truth at a rate no slower than log log n/ log n and that this rate is minimax when the distribution of the noise is Gaussian. Although Rigollet and Weed (2019) describe in

Unlinked Monotone Regression

their Section 2.2 an algorithm for computing their monotone estimator, the authors do not present simulation results.

It is worth mentioning that more research seems to have been done in the shuﬄed (permuted) linear regression model than in the monotone regression model. We refer here to the work of Abid et al. (2017), Pananjady et al. (2017), Pananjady et al. (2018), and Unnikrishnan et al. (2018). The main focus in the former three papers is to ﬁnd conditions on the signal-to-noise ratio that guarantee recovery of the unknown permutation. Abid et al. (2017) show that the least-squares estimator in this model is inconsistent in general, but they construct a method-of-moments type estimator and prove that this estimator is consistent assuming that E(Xi) = 0. A common feature of these works is to restrict attention to the case of Gaussian noise. As we show in the present paper, the noise distribution may be of fundamental importance in these problems.

As announced above, we give now some more detail to the existing link between unlinked monotone regression and estimation in deconvolution problems. As noted earlier, (4) shows clearly that there is a tight connection to quantile estimation in a deconvolution setting. In fact, consider the deconvolution setting in which one observes n i.i.d. copies of Y where Y = X + ϵ for independent random variables X and ϵ. In this problem, the goal is learn the distribution of X under the assumption that the distribution of ϵ is known (such an assumption can be relaxed if this distribution can be estimated). Estimation of the distribution and quantile functions of X has been considered in Hall and Lahiri (2008), and revisited in Dattner et al. (2011, 2016) under slightly diﬀerent assumptions. In particular, contrary to Hall and Lahiri (2008), no moment assumptions are made about the covariate X or ϵ in Dattner et al. (2011, 2016). There, the smoothness of the density of X is measured in terms of belonging to Sobolev or H older balls. In the case where the error is ordinary smooth of order larger than 1/2, Dattner et al. (2011) recover the rates of convergence that Hall and Lahiri (2008) obtained for the integrated risk when estimating the distribution function, and moreover provide new rates of convergence for the case of smoother error distributions; the square-root rate is shown to be achieved for smooth enough distribution of X. The convergence results obtained in these previous papers do not apply directly in this present paper, as we do not assume that the covariate X (from (2)) admits even a Lebesgue density, which also means that m0(X) is not assumed to have a density.

While Carpentier and Schl uter (2016) and Rigollet and Weed (2019) are the only other articles of which we are aware on unlinked monotone regression besides the present paper, the classical isotonic regression model given in (1) is a very well-known estimation problem with a vast literature. The most known estimator in this problem is certainly the Grenander-type estimator, obtained by taking the right derivative of the greatest convex minorant of the cumulative sum diagram associated with the data (Xi, Yi), i = 1, . . . , n (Barlow et al., 1972; Robertson et al., 1988; Groeneboom and Jongbloed, 2014). The pointwise non-standard rate of convergence of the Grenander estimator is n 1/3 if m0 is continuously diﬀerentiable with a non-vanishing derivative and n in case m0 is locally ﬂat (Groeneboom, 1983; Carolan and Dykstra, 1999; Zhang, 2002; Cator, 2011; Chatterjee et al., 2015). Asymptotic properties, including the pointwise limit distribution and convergence in the Lp-norms for p [1, 5/2) { } have been fully described in Brunk (1970), Durot (2002), Durot (2007) and Durot et al. (2012); see also Groeneboom (1985, 1989). One can also combine kernel estimation

Balabdaoui, Doss, and Durot

with the monotonicity constraint to improve rates of convergence if m0 has higher orders of smoothness (Mammen, 1991; Durot and Lopuha a, 2014).

In this paper, we investigate the unlinked monotone regression following the method introduced by Edelman (1988) for estimating the mixing distribution in a mixture problem with Normal noise. Let (X1, θ1), . . . , (Xn, θn) be independent random vectors such that Xi|θi N(θi, 1), that is, conditionally on θ1, . . . , θn the random variables X1, X2, . . . , Xn are generated from the unknown distributions Φ(. θ1), Φ(. θ2), . . . , Φ( θn), where Φ denotes the cumulative distribution function of N(0, 1). The approach of Edelman (1988) consists of ﬁnding the vector ( θ1, . . . , θn) which minimizes the integrated diﬀerence between n 1 Pn i=1 Φ( θi) and the empirical distribution Fn based on the observations Xi, i = 1, . . . , n among all vectors (θ1, . . . , θn). As Edelman (1988) already noted, the normal distribution can be replaced by other distributions, which is exactly what we do here. The merits of this approach are the facts that it does not depend on some bandwidth, and that it is easily implementable. The link between our problem and the work Edelman (1988) is made clear in Section 2. There, we introduce our estimator, which we call the minimum contrast estimator, and we establish its existence and some of its important features in the case of equal sample sizes for the responses and covariates. In Section 3, we establish rates of convergence under some fairly general conditions on the distribution function of the covariates. In that section, we assume that the noise distribution is known and distinguish between three cases for its smoothness: (1) ordinary smooth, (2) supersmooth and (3) discrete with ﬁnite number of support points. The convergence rates in cases (1) and (2) are derived using some classical Fourier transform techniques, while in case (3) a very diﬀerent approach is employed, which makes use of a recent result in Meis and Mammen (2020). In the proofs, we use a conversion device which allows us to link the convergence rate of the estimator to that of its generalized inverse, an interesting result in its own right. Since our method allows for diﬀerent sample sizes for the responses and covariates, we extend the construction of the estimator to this case and derive the corresponding convergence rate in the three aforementioned smoothness cases. In Section 4, we show that the estimator achieves the parametric rate in estimating the moments of m0(X) in case (3) and also when the noise is known to be uniformly distributed on a compact. Although our estimator cannot be shown to be unique, we prove in Section 5 that any solution has to satisfy a necessary optimizing condition. This condition is used to develop a gradient-descent algorithm to compute the estimator, see Section 6. In Section 7, we discuss how our method can be extended to the more realistic situation where the noise distribution is unknown. In Section 8, we illustrate our approach through synthetic and real data. We ﬁnish this manuscript with some concluding remarks and future research directions; see Section 9. Technical proofs are deferred to an Appendix.

2. The Minimum Contrast Estimator: Existence

2.1 Setup, Terminology, and Notation

Let m0 be the monotone function appearing in the model in (2), and in which we are interested. In the model that we consider, the response Y has the same distribution as the convolution of m0(X) and the noise ϵ with m0 monotone non-decreasing. Note for the case where m0 is non-increasing it is enough to consider Y instead of Y and all our results will still apply. Denote by M the set of all bounded non-decreasing and

Unlinked Monotone Regression

right continuous functions deﬁned on [0, 1]. This class accomodates for the assumption made in the sequel that the covariate X [0, 1] almost surely. It is worth mentioning that without the monotonicity assumption, the function m0 is not identiﬁable in general, even in the simple case where m0(X) is observed without noise, that is in the case where Y = m0(X) is observed. Precisely, the distribution of X together with the distribution of m0(X) does not pin down the function m0. A simple counterexample (that was suggested by a referee) is where X has the uniform distribution on [0, 1]:

m1(x) = x and m2(x) = 1 2 x 1

Then, it is easy to see that both m1(X) and m2(X) have the uniform distribution on [0, 1]. On the other hand, if m0 is monotone, then it can be determined on the support of X using knowledge about the distribution of X and m0(X). This identiﬁability property is proved in the following proposition.

Proposition 1. Let X be a real valued random variable with a continuous distribution on [0, 1], and let m1 and m2 be non-decreasing and right-continuous functions on the support of X. If m1(X) has the same distribution as m2(X), then m1 = m2 on the support of X.

Here, we describe the working framework of our estimation approach. We observe two independent samples (X1, . . . , Xn) and (Y1, . . . , Yn) such that the following holds.

Assumption A0. X1, . . . , Xn iid F0, 0 Xi 1 almost surely, and Y1, . . . , Yn iid H0. The unobserved error ϵ is independent of X, satisﬁes E(ϵ) = 0, and has CDF Φϵ. Moreover, m0 M.

Note that there are no restrictions on the relationships between the Xi s and the Yj s, other than the equality in distribution in (2). Then, F0, H0, and Φϵ are the true CDF s of Xi, Yi, and ϵi, respectively. Let Fn(x) = n 1 Pn i=1 1[Xi, )(x) and Hn(x) = n 1 Pn i=1 1[Yi, )(x) where x 7 1A(x) is the indicator function for the set A. Also, let be the supremum norm, i.e. m = supt [0,1] |m(t)|. For K > 0, let MK be the set of functions m M such that m K. For m M, m 1 denotes the generalized inverse of m, i.e.

m 1(y) := inf {x [0, 1] : m(x) y} (6)

where the inﬁmum of an empty set is deﬁned to be 1. Hence, we have m 1(y) = 1 for all y > m(1). In this section, we assume that Φϵ is known. This assumption will be relaxed in Section 7. Also, although we take the respective sizes of the samples of covariates and responses to be equal, our method can be easily adapted to the case where these sizes are diﬀerent. In that case, the convergence rate of our estimator is driven by the minimum of the sample sizes; see Subsection 3.4. Let F be the set of all distribution functions on R. A contrast function C deﬁned on the Cartesian product F F is any non-negative function such that C(F1, F2) = 0 if and only if F1 = F2. Consider the estimator

bmn = argminm M C Hn, n 1 n X

i=1 Φϵ( m(Xi))

Balabdaoui, Doss, and Durot

provided that a minimizer exists (in which case it is not necessarily unique). Since the criterion depends on m only through its values at the observations Xi, i = 1, . . . , n, it follows that any candidate for the minimization problem in (7), m in M, can be well replaced by the non-decreasing function em such that

em(t) = m(X(1)), for t [0, X(1)] m(X(n)), for t [X(n), 1]

and em is constant between the remaining order statistics such that m is right continuous (by deﬁnition of M) and coincides with m at every X(i), i = 1, . . . , n. Here as is customary, X(1) X(n) denote the order statistics corresponding to X1, . . . , Xn. In addition, note that bmn does not have to be unique at the data points and thus bmn denotes any solution of the minimization problem.

2.2 Existence

In the sequel, we consider the following contrast function

C(F1, F2) = Z

n F1(y) F2(y) o2 dy

whenever this integral is ﬁnite, which is the case when F1 and F2 are distribution functions of random variables admitting ﬁnite expectations. The choice of such contrast function is mainly motivated by application of the Parseval-Plancherel s Theorem. The estimator we consider here is reminiscent of the one studied in Edelman (1988) for deconvoluting a distribution function from a Gaussian noise. However, our goal here is diﬀerent since the main target in our problem is the monotone transformation m0. Before starting the analysis of the estimator, we establish ﬁrst its existence. Denote

n Hn(y) n 1 n X

i=1 Φϵ(y m(Xi)) o2 dy

R Φϵ(y m(x))d F0(x) 2 dy

be its deterministic counterpart. Then the minimizer bmn in (7) (if it exists) can also be written as bmn argmin m M Mn(m). (8)

The following assumptions will be needed.

Assumption A1. For some K0 [0, ), we have m0 = K0.

Assumption A2. The distribution function Φϵ is continuous on R.

Proposition 2. Let Assumption A0 hold. Then,

1. Mn(m) is ﬁnite for any m M for all n 1;

2. If Assumption A1 also holds then M(m) is ﬁnite for any m M;

Unlinked Monotone Regression

3. If Assumptions A1 and A2 also hold, then there exists at least a solution to (8) that is piecewise constant and right-continuous, for all n 1.

4. If Assumptions A1 and A2 also hold, and ϵ has a bounded support, then with probability 1, there exists at least a solution to (8) that is piecewise constant, right-continuous and bounded in the sup-norm by a deterministic constant for all n 1. This deterministic constant can be taken to be equal to K0 + 2 in case ϵ 1 with probability 1.

Note that parts 3 and 4 of the proposition give existence but not necessarily uniqueness of bmn. In fact, it can be seen from the proof of Proposition 2 that if bmn is a solution to (8), then any monotone function that coincides with bmn at the observed covariates X1, . . . , Xn gives another solution to (8). In what follows, we will consider a solution bmn that takes constant values between successive covariates and that is right continuous. This choice is consistent with the way the Grenander-type estimator is deﬁned, that is, the estimator in the classical monotone regression estimation problem.

3. Convergence Rates of the Minimum Contrast Estimator

In this section, we will give upper bounds on the convergence rate of the minimum contrast estimator deﬁned above. Not surprisingly, this rate depends on the smoothness of the noise distribution. More speciﬁcally, we will consider the following cases for the smoothness: (1) ordinary smooth, (2) supersmooth and (3) discrete with a ﬁnite support. In the whole section, we assume that the Assumptions A0, A1 and A2 hold. By Proposition 2, this guarantees that the minimization problem in (8) admits a piecewise constant and right-continuous solution. We will denote bmn any such solution. Also, we will require the following assumption about the distribution of the design points.

Assumption A3. The common distribution function F0 of the covariates X1, . . . , Xn is continuous.

For the smoothness cases (1) and (2), we will need the following notation:

R eitxd F(t) (9)

for any distribution function F on R, and

R eitxg(t)dt (10)

denotes the Fourier transform of g, whenever g is integrable. Note that when F is absolutely continuous with density f, then it follows immediately that ψF = φf. .

3.1 Convergence Rate Under Ordinary Smooth Noise

We start with the ordinary smooth case for the noise distribution described in the next assumption.

Assumption A4. The distribution function Φϵ is absolutely continuous with a 0-mean ordinary smooth density fϵ in the sense that

d0 |t|β |φfϵ(t)| d1

as |t| , for some β > 0 and constants d0 > 0, d1 > 0.

Balabdaoui, Doss, and Durot

Some comments are in order. Assumption A4 is common in deconvolution problems, see for instance Dattner et al. (2011). The positive real number β is usually referred to as the order of smoothness. Known examples include the Exponential distribution with any scale parameter for which β = 1, Gamma distribution with shape parameter α > 0 and scale γ > 0 for which β = α, the Laplace distribution with β = 2, and more generally symmetric Gamma distributions (the distribution of X X where X and X are i.i.d. Gamma(α, γ)) in which case we have β = 2α. See for example the examples given in Fan (1991) after (1.4). We plot in Figure 3 in the appendix several gamma densities with a variety of shape parameters, to show the decreasing smoothness: as the shape parameter goes to 0, the density has an increasing spike at 0. Our main theorem below provides the rate of convergence of the L1-error with respect to the distribution of the design points, F0, over a given interval [a, b] [0, 1] provided that the estimator is stochastically bounded on that interval. Suﬃcient conditions for this boundedness are given below. The following assumption is crucial for deriving the convergence rate of bmn. It will be also needed below in the supersmooth case.

Assumption A5. Assume that there exists T such that |φfϵ(t)| |φfϵ(T)| > 0 for all T > T and |t| T.

Now, we we are able to state the main result of this subsection.

Theorem 1. 1. Suppose that Assumptions A0 to A5 hold. Let [a, b] [0, 1] be a ﬁxed interval such that

bmn(a) = OP (1) and bmn(b) = OP (1). (12)

Then, it holds that Z b

a | bmn(x) m0(x)| d F0(x) = OP (n 1/(2(2β+1))). (13)

2. If the Assumptions A0 to A5 hold, then the claims in (12) hold for all a and b such that 0 < F0(a) F0(b) < 1.

We give below the main steps of the proof and conclude this subsection with some comments about the rates in Theorem 1. Details of the proof are postponed to Section E.3. Let us deﬁne

b Ln(w) := 1

i=1 1{ bmn(Xi) w} and Ln(w) := 1

i=1 1{m0(Xi) w} (14)

for all w R. Using deconvolution arguments we show closeness of the two processes in the following proposition.

Proposition 3. Assume that Assumptions A4 and A5 hold. Then,

b Ln(w) Ln(w) 2 dw = O(n 1/(2β+1)).

Unlinked Monotone Regression

Next, using entropy arguments from empirical process theory, we show in the following proposition that on intervals [A, B] that are possibly random, the empirical processes b Ln and Ln are close to their population counterparts b L0 n and L0 respectively, where for all w R,

b L0 n(w) = Z 1{ bmn(x) w}d F0(x) and L0(w) = Z 1{m0(x) w}d F0(x). (15)

More precisely, we derive the convergence rate of the associated L2-error integrated on such interval [A, B]. Note that the ﬁrst two claims in the proposition hold under the only assumptions that X1, . . . , Xn are i.i.d. and bmn is taken from (8). In fact, it can be seen from the proof that these two claims continue to hold with bmn replaced by any monotone estimator.

Proposition 4. If X1, . . . , Xn are i.i.d., then for all random variables A < B (that may depend on n) it holds that Z B

b Ln(w) b L0 n(w) 2 dw (B A)OP (1/n),

Ln(w) L0(w) 2 dw (B A)OP (1/n),

where OP (1/n) is uniform in A and B. Moreover, if B A = OP (n2β/(2β+1)) and Assumptions A4 and A5 holds, then Z B

b L0 n(w) L0(w) 2 dw = OP (n 1/(2β+1)). (16)

The following proposition makes the connection between the above error and a squared distance between the inverse functions of bmn and m0 composed with the distribution function F0 of the Xi s; a rate of convergence of that squared distance is derived. We recall that for m M, the inverse of m is deﬁned by (6), where the inﬁmum of an empty set is deﬁned to be 1.

Proposition 5. Under Assumptions A3, A4, and A5 for all random variables A < B such that B A = OP (1) it holds that Z B

F0 bm 1 n (w) F0 m 1 0 (w) 2 dw = Z B

b L0 n(w) L0(w) 2 dw

= OP (n 1/(2β+1)).

The last step in the proof makes the connection between the above squared distance and the L1-error of bmn.

Proposition 6. Suppose that Assumption A3 holds, and let [a, b] [0, 1] be a ﬁxed interval. Then it holds that Z b

a | bmn(x) m0(x)| d F0(x) (Bn An) Z Bn

An (F0 bm 1 n (x) F0 m 1 0 (x))2dx 1/2

where An = m0(a) bmn(a) and Bn = m0(b) bmn(b).

Balabdaoui, Doss, and Durot

We conclude the subsection with some comments about the convergence rate obtained in Theorem 1.

When β = 1/4, then the L1-rate of convergence obtained for our estimator matches the well-known n1/3-rate of the Grenander estimator in the classical isotonic regression (where the link between the responses and covariates is known). If β < 1/4, then the rate is strictly better than the cubic rate, which may appear as a contradiction. However, there is one major diﬀerence between the regular isotonic regression model and the one we consider here: we assume the error distribution to be known, which is not the case in regular isotonic regression. Thus, it seems that a transitional regime occurs in the rate of convergence in case the noise distribution is known. In the unlinked regression setting which we study in this paper, if the error distribution is known to be a centered (or symmetric) Gamma with shape and scale parameters α (0, 1/4) (or α (0, 1/8)) and λ > 0 then the smoothness parameter is β = α (or β = 2α) belongs to (0, 1/4) and the rate of convergence of our estimator will be faster than n1/3.

Note that if we let β 0, then the centered or symmetric Gamma will converge to a Dirac at 0 and the rate of convergence will approach the parametric rate n. In fact, it can be shown that the rate of convergence is precisely the parametric rate if the error distribution is a Dirac at 0, i.e. if ϵ = 0 with probablity one. Indeed, when ϵ = 0 with probability 1, we have Φϵ(y bmn(Xi)) = 1 bmn(Xi) y. If bmn is taken to be a minimum contrast estimator that is bounded in the sup-norm by K0 + 2 (such a solution is known to exist with probability 1 by Proposition 2), then we have by deﬁnition of bmn

Hn(y) n 1 n X

i=1 1 bmn(Xi) y 2dy Z

Hn(y) n 1 n X

i=1 1m0(Xi) y 2dy

implying that

i=1 1 bmn(Xi) y n 1 n X

i=1 1m0(Xi) y

Hn(y) n 1 n X

i=1 1m0(Xi) y 2dy

Hn(y) H0(y) 2dy + 8 Z

H0(y) n 1 n X

i=1 1m0(Xi) y 2dy.

The ﬁrst term was already shown to be OP (n 1); see the proof of Proposition 3. Moreover, the true distribution H0 of the Yi s is that of m0(X), which implies that the second term is also OP (n 1) since its expectation is equal to n 1 R V ar(1m0(X) y)dy = n 1 R H0(y)(1 H0(y))dy where the integral R H0(y)(1 H0(y))dy is known to be ﬁnite; see Appendix A. Hence,

i=1 1 bmn(Xi) y n 1 n X

i=1 1m0(Xi) y

2 dy = OP (n 1). (17)

Unlinked Monotone Regression

If follows from the rate obtained in (17) and the empirical process arguments as in the proof of Proposition 4 that Z K0+2

b L0 n(w) L0(w) 2 dw = Z K0+2

F0 bm 1 n (w) F0 m 1 0 (w) 2 dw

= OP (n 1).

This in turn implies by Proposition 6 (with a = 0, b = 1, and [An, Bn] [ K0 2, K0 + 2]) that Z 1

0 | bmn(x) m0(x)|d F0(x) = OP (n 1/2).

The parametric rate obtained for a noise that is Dirac at 0 can be generalized to the case where ϵ is is supported on a ﬁnite set of points. This is proved in Theorem 3.

It may be considered as unsatisfactory that the conclusions were stated with OP notation, so we provide now more precise bounds. The bounds can be obtained by closely inspecting the proof of Theorem 1; details are omited. It can be seen from the proofs that, with An and Bn the stochastically bounded random variables taken from Proposition 6, we have Z b

a | bmn(x) m0(x)| d F0(x)

(Bn An) 3 Z Bn

b Ln(w) Ln(w) 2 dw + 6(Bn An)

b Ln(w) Ln(w) 2 dw 1/2

+(Bn An) 6 Gn 2 I n

where Gn 2 I is a random variable with ﬁnite expectation. The expectation is bounded above by an unknown absolute constant that is connected to the entropy measure of the set of monotone functions on R [0, 1], as well as to the absolute constants that emerge from the empirical process theory. Hence, the second summand in the right-hand term is of the parametric order n 1/2, and can be seen as a systematic error that is typically negligable as compared to the ﬁrst summand. This means that the rate of convergence of the estimator is driven by the integral in the ﬁrst summand. For this integral, we have

b Ln(w) Ln(w) 2 dw Kn 1/(2β+1),

where K depends on the parameters of the model. One can choose for instance

K := (2β + 1) 24

d0 (2/(βπ))2β(2K0 + E|ϵ|) 1/(2β+1) .

It is worth mentioning that the decomposition in (18) still holds with bmn replaced (in the deﬁnitions of b Ln, An and Bn) by any estimator in M.

Balabdaoui, Doss, and Durot

3.2 Convergence Rate Under Supersmooth Noise

The main arguments used above for an ordinary smooth noise continue to apply to the supersmooth case. In this setting, Assumption A4 should be replaced by the following one.

Assumption A4 . The distribution function Φϵ is absolutely continuous with a 0mean supersmooth density fϵ in the sense that

d0|t|α exp( |t|β/γ) |φfϵ(t)| d1|t|α exp( |t|β/γ)

as |t| , for some α > 0, β > 0 and constants d0 > 0, d1 > 0.

In the above deﬁnition we provide for supersmoothness we deviate from the one given by Fan (1991) in (1.3) by taking the same exponent α in the lower and upper bound, for simplicity.

Theorem 2. Under the Assumptions A0 A3, A4 , and A5, we have for any [a, b] (0, 1) such that 0 < F0(a) F0(b) < 1 that

a | bmn(x) m0(x)| d F0(x) = OP ((log n) 1/(2β)).

For instance, in the case of a Gaussian noise, in which case β = 2, the rate is 1/(log n)1/4, which matches the conclusion of Edelman (1988) right after the proof of his Theorem 1. On the other hand, this rate is slower than the minimax rate, log log n/ log n, obtained by Rigollet and Weed (2019) in the shuﬄed regression problem. Note however that the setting studied by Rigollet and Weed (2019) is the shuﬄed regression problem, whereas we consider the unlinked regression problem (recall we explain the diﬀerence in our introduction). It is not known whether the two problems share the same minimax rates; or, rather, it is not known if the minimax lower bound derived in Rigollet and Weed (2019) applies to the unlinked problem, since the unlinked problem is statistically more diﬃcult than the shuﬄed problem. Therefore, it is currently unknown if the rate we derived is in fact minimax suboptimal or not.

3.3 Convergence Rate Under a Discrete Noise Distribution

In this section, we consider the case where ϵ is supported on a ﬁnite set of points. We prove that in that case, the minimum contrast estimator achieves the parametric rate in the L1-loss.

Theorem 3. Suppose that Assumptions A0 to A2 hold, and that ϵ is supported on a ﬁnite set of points. If bmn is a solution to (8) that is bounded in the sup-norm by a deterministic constant (which exists with probability 1 in view of Proposition 2), then

0 | bmn(x) m0(x)| d F0(x) = OP (n 1/2).

Unlinked Monotone Regression

3.4 Convergence Rate in the Case of Diﬀerent Sample Sizes

In this section we brieﬂy consider the case where one observes Y1, . . . , Yny and X1, . . . , Xnx with possibly diﬀerent sample sizes nx = ny. In that case, the estimator is deﬁned as

bmnx,ny = argminm M

n Hny(y) n 1 x

i=1 Φϵ(y m(Xi)) o2 dy

where Hny denotes the empirical distribution function corresponding to the sample Y1, . . . , Yny. The asymptotic here has to be understood in the sense that both sample sizes nx and ny go to inﬁnity. This means that nx ny , where nx ny denotes the inﬁmum between nx and ny. In the following theorem, we give the upper bound on the convergence rate of our minimum contrast estimator in the three regimes (1) ordinary smooth, (2) supersmooth and (3) discrete with ﬁnite number of support points.

Theorem 4. 1. Suppose that the Assumptions A0, A1, A2, A3, and A5 hold true. Let [a, b] (0, 1) be any ﬁxed interval such that 0 < F0(a) F0(b) < 1.

If Assumption A4 holds true, then as nx, ny , we have

bmnx,ny(a) = OP (1) and bmnx,ny(b) = OP (1) (19)

bmnx,ny(x) m0(x) d F0(x) = OP ((nx ny) 1/(2(2β+1))).

If Assumption A4 holds true, then as nx, ny , (19) holds, and Z b

bmnx,ny(x) m0(x) d F0(x) = OP (log(nx ny) 1/(2β)).

2. Suppose that the Assumptions A0, A1, A2 hold true. If ϵ is supported on a ﬁnite number of points, then with probability 1 there exists a solution bmnx,ny which is bounded by a deterministic constant. Furthermore, Z 1

bmnx,ny(x) m0(x) d F0(x) = OP ((nx ny) 1/2)

as nx, ny .

3.5 Uniform Consistency

The convergence rates for the minimum contrast estimator we derived above are obtained for the L1-norm. Thus, a natural question is whether these global rates also hold pointwise. Although the exact answer to this question is still unknown we can provide at least an intermediate result which shows that the estimator is pointwise consistent, even uniformly provided that the true monotone regression function, m0, is continuous. For simplicity, we assume that nx = ny = n.

Theorem 5. Suppose that the assumptions of Theorem 1, or 2 or 3 hold, and let S0 denote the support of F0. Let a, b [0, 1] be any points such that bmn(a) and bmn(b) are each OP (1). Let C be any compact set in the interior of [a, b] S0. Assume m0 is continuous on (0, 1). Then it holds that

sup x C | bmn(x) m0(x)| = o P (1).

Balabdaoui, Doss, and Durot

Note that in the case where ϵ is compactly supported, the result above implies that if m0 is continuous on [0, 1]

sup x [0,1] S0 | bmn(x) m0(x)| = o P (1)

for any estimator bmn that is bounded in the sup-norm by a deterministic constant. Recall such an estimator exists with probability 1 (see Claim 4 of Proposition 2). Finally, note that we cannot hope to extend the convergence outside S0 since m0 is only identiﬁable on this set; see Proposition 1.

4. Estimation of Moments of m0(X)

In this section, we showcase that our minimum contrast estimator can achieve the nrate for estimating certain smooth functionals of m0. In doing so, we restrict attention to estimating moments of m0(X) and the cases where either ϵ is discrete with a ﬁnite number of points in the support or when it is uniformly distributed over a compact. Modulo some scaling, we can assume without loss of generality that the support is a subset of [ 1, 1] in both cases. For simplicity, we assume that the sample sizes of the covariates and responses are equal. Note that boundedness of m0 implies that m0(X) admits ﬁnite moments of any order. Furthermore, in case the noise distribution is compactly supported, as assumed here in this section, all moments of the response Y are ﬁnite and we have

E[mj 0(X)]E(ϵk j).

Since the distribution of ϵ is known, the moments of the error can be exactly computed. Then, replacing the moments of Y by the corresponding empirical estimators in the above formula yields a n-consistent estimator of E[mk 0(X)]. The minimum contrast estimator oﬀers an alternative (and also a direct) way for estimating the moments since one can simply take the natural choice R bmk nd Fn, that is

i=1 bmk n(Xi).

Our result below shows that the latter estimator is converging at the parametric rate.

Theorem 6. Assume that Assumptions A0 to A2 hold, and that either ϵ is supported on a ﬁnite set of points or uniformly distributed on some compact. Let bmn denote a solution to (8) which is piecewise constant, right-continuous and bounded in the supnorm by a deterministic constant, for all n 1 (which exists with probability 1), and Fn the empirical CDF based on X1, ..., Xn. The following holds true.

If ϵ is supported on a ﬁnite set of points, then for all integers k 1

0 bmk n(x)d Fn(x) Z 1

0 mk 0(x)d F0(x) = OP (n 1/2), (20)

If ϵ is uniformly distributed on a compact, then

0 bmn(x)d Fn(x) Z 1

0 m0(x)d F0(x) = OP (n 1/2).

Unlinked Monotone Regression

Note that when ϵ is a uniformly distributed noise (the second case in the above theorem), the convergence rate of bmn cannot be obtained from the results obtained in Section 3. Indeed, considering without loss of generality the case where the support is [ 1, 1], a uniform distribution does not belong to the ordinary smooth nor to supersmooth categories since in this case

|φϵ(t)| = sin(t)

, for t = 0

and hence |φϵ(t)| cannot be bounded below by d0/|t|β for some d0, β > 0 nor by d0|t|α exp( |t|β/γ) for some d0, α, β, γ > 0. Thus, the convergence rate of bmn for such a noise distribution is still an open problem. Nevertheless, Theorem 6 shows clearly that our estimator behaves reasonably well when the goal is estimation of the ﬁrst moment of m0(X).

5. Fenchel Optimality Conditions

In view of the computational section below, we derive in this section the optimality conditions related to the optimization problem deﬁning the estimator bmn. Recall that by Assumption A0, the covariates X1, . . . , Xn are assumed to belong to [0, 1]. In the following, we denote by bmn a piecewise constant and right-continuous solution to (8), see Proposition 2. For some 1 p n, we write bmn,(1) < . . . < bmn,(p) for the distinct values taken by bmn on [0, 1]. We will use the following assumption on the density fϵ.

Assumption A6. The density fϵ is continuously diﬀerentiable such that

sup t R |f ϵ(t)| D,

for some constant D > 0, and R

R |f ϵ(t)|dt < .

Proposition 7. Let bmn,(1) < . . . < bmn,(p) be the distinct values of the estimator bmn. Let Assumption A6 hold. Then, for any k {1, . . . , p} Z

Hn(y) n 1 n X

i=1 Φϵ(y bmn(Xi)) fϵ(y bmn,(k))dy = 0. (21)

Furthermore, this condition can be equivalently re-written as

i=1 Φϵ(Yi bmn,(k)) = 1

R Φϵ(y bmn(Xi))fϵ(y bmk)dy (22)

for k = 1, . . . , p.

Remark 1. The alternative form in (22) gives a useful way of verifying numerically the second equality condition via numerical integration. Indeed, with m = bmn(Xi) and m = bmk, the integral on the right side of (22) takes the form Z

R Φϵ(y m)fϵ(y m ) dy = B(m m),

where for all m R, B(m) = EΦϵ(ϵ + m) = R Φϵ(y)fϵ(y m)dy. Explicit formulas of B(m), for m R, can be even found for some distributions such as Laplace or Gaussian; see Subsections D.3 and D.4.

Balabdaoui, Doss, and Durot

Remark 2. Consider the case where bmn takes one unique value, denoted bm1,(1). Then the right side of (22) equals R

R Φϵ(y bm1,(1))fϵ(y bm1,(1))dy = R

R Φϵ(y)fϵ(y)dy which equals E(Φϵ(ϵ)) = E(U) = 1/2 where U is Uniform(0, 1).

6. Computation

Recall that our goal is to minimize Mn. Write m := (m1, . . . , mn), so that the objective function can be written as

i=1 Φϵ( mi)

(by a slight abuse of notation). To minimize Mn, we can compute an unconstrained minimizer m of Mn (i.e., we do not force mi mi+1 for i = 1, . . . , n 1), and then the overall solution would be given by reordering the entries of m so that it is nondecreasing; i.e., bmn(X(i)) := m(i) (where m(1) m(n)). The gradients mi Mn(m) can be computed using that

mi Mn(m) = 2n 1 2n 2 n X

α=1 {Φϵ(Yα mi) + B(mi mα)} (24)

where for all m R, B(m) = EΦϵ(ϵ + m), see Appendix D.1. In Appendix D we show how to derive the gradient of Mn when Φϵ is either a Gaussian or a Laplace distribution. Thus, we can consider using a (ﬁrst order) gradient descent algorithm for computation. However, because the objective function is symmetric in all its mi components, at any stationary point at which some of the mi s are equal, no gradient-based method can tell if we could improve the objective function by allowing the equal mi s to take separate values. Thus, we will consider a second order method to solve this problem: we will compute a second derivative in a direction related to separating mi into two distinct values. Let Mn,p denote the objective function parameterized such that it takes 2p arguments and the regression function m is represented by p values m := (m1, . . . , mp) and p weights π := (π1, . . . , πp) (Pp α=1 πα = 1), so that Mn,p(π, m) = R

R Hn(y) Pp α=1 παΦϵ(y mα) 2dy. The following algorithm is an active set type of algorithm. At a point (eπp, f mp) R2p, deﬁne

eπp+1 := (eπ1, . . . , λeπi, (1 λ)eπi, eπi+1, . . . , eπp), and f mp+1 := ( em1, . . . , emi, emi, . . . , emp),

and consider Mn,p+1(eπp+1, f mp+1), where λ [0, 1]. Let θ := (mi mi+1)/2 and, for 1 i p, let Ci,p := 2/ θ2Mn,p+1(eπp+1, f mp+1). (25)

This is the curvature in the direction in which we separate out the ith component into two separate components. Note that Ci,p Ci,p(eπp, f mp) depends on (eπp, f mp) but we will suppress that dependence below for succinctness. Also, a priori, Ci,p depends on λ. However we can see from (29) that in fact Ci,p is minimized by taking λ = 1/2 always, so we do this from now on. We will use Ci,p when (eπp, f mp) is a stationary point.

Unlinked Monotone Regression

With the above setup, we describe our active set type of algorithm in Algorithm 1 below. Algorithm 1 includes a call to a generic subroutine to ﬁnd the optimum value of m given a ﬁxed length p and a ﬁxed value of π above (corresponding to counts in the algorithm). This is referred to as the ﬁxed-p-subroutine in the algorithm. Our suggested implementation is to use a trust-region second-order method for this generic subroutine, see (Fletcher, 1987, Section 5.1) or (Nocedal and Wright, 1999, Section 4.2) for details of a trust region method. Algorithm 1 also includes a call to a subroutine, for collapsing (approximately) unique entries, described in Algorithm 2. To optimize a function, the method requires to be able to compute the function, its gradient, and its Hessian. The latter two we have derived closed forms for, for error distributions that are either Gaussian, Laplace, or mixtures of Gaussians. We use numerical integration to compute the objective function itself at this point. In the algorithms, we use the notation mi:j to denote the subvector of m given by the indices {i, . . . , j}.

Algorithm 1: Active set algorithm

input : p(0) R, m(0) := (m(0) 1 , . . . , m(0) p ), counts(0) := (counts(0) 1 , . . . ,counts(0) p ) where p = p(0), and Tolerance parameter eps, Stepsize η output: bmn(X(1)), . . . , bmn(X(n)) while end criterion not met do

Do ﬁxed-p-subroutine(m(i 1), counts(i 1)): Find ﬁxed-p(i) and ﬁxed-counts optimal m, and assign to m(i); Do activate-constraints-subroutine(m(i), counts(i 1), eps): run Algorithm 2 which collapses the non-unique entries in m(i), and store the output in m(i) and counts(i), and let p(i) be the new (smaller) number of unique entries; Compute Cj,p (see (25)) for each j = 1, . . . , p; if minj Cj,p 0 then

End algorithm; else

k argmini Ci,p; p(i) p(i) + 1;

counts(i) (counts(i) 1 , . . . , counts(i) k /2, counts(i) k /2, . . ., counts(i) p );

m(i) (m(i) 1 , . . . , m(i) k η, m(i) k + η, . . . , m(i) p ); i i + 1; end /* Reconstruct full length solution */

The solution vector is given by the (unique, sorted) elements m(K) i , i = 1, . . . , p K, each repeated counts(K) i times, respectively, where K is the number of iterations run.

In Appendix D.6 we provide a few comments about practical implementation of the algorithm.

Balabdaoui, Doss, and Durot

Algorithm 2: Activate constraints: group (approximately) non-unique entries in m

input : m := (m1, . . . , mp), counts:= (counts1, . . . ,countsp), eps (tolerance parameter) output: mnew Rep, countsnew Rep, where 1 ep p newidx 1, begidx 1; for j 2 to p + 1 do

if (j == p + 1) OR (mj mbegidx > eps) then

mnew newidx mean(mbegidx:(j 1)); countsnew newidx sum of countsbegidx:(j 1); begidx j; newidx newidx +1; end ep length of mnew; end

7. Extension to the Case of Unknown Noise Distribution

7.1 Estimation of the Noise Distribution in the Semi-supervised Setting

In general, full knowledge of the distribution of ϵ might not be available which means that one needs to estimate it. In this case, it may be possible to collect a sample of ϵ s, ϵ 1, . . . , ϵ N, from a separate data source. These can be used to construct an estimate of Φϵ which can be then plugged into the objective function. The sample of ϵ s does not necessarily need to be independent of the Y or X samples (note, for instance, Dattner et al. 2016). There are a variety of ways one may arrive at the sample of ϵ s. In some cases, the main data set may consist of unlinked covariates and responses, but there may be a smaller (or sub) data set of linked/paired covariates and responses, (X 1, Y 1 ), . . . , (X N, Y N). In this case, one may run the traditional monotone regression on this subset to obtain a monotone estimator bm N from which one can compute the estimated residuals by bϵ i := Y i bm N(X i ), i = 1, . . . , N. In general, the previouslydescribed setting might be considered to be one of semi-supervised learning, where only a part of the data is unlinked. It would be useful with such data to learn from all of it simultaneously. This may be possible using the M-estimation framework we have proposed in this paper, but we leave an investigation of that question for future research.

7.2 Estimation of the Noise Distribution with Longitudinal Responses

Another common framework in which we may want to estimate Φϵ from data is the one where we have repeated (or longitudinal) observations on the response Y . Assume we observe X1, . . . , Xnx as before and also Y1,j, . . . , Yny,j, where we take j {1, 2} (for simplicity). We will impose the assumption that the distribution of ϵ is symmetric around 0. We also assume that Yi,j = m0( Xi) + ϵi,j (for some Xi which does not belong to our data set and need not be observed), where ϵi,1 is independent of ϵi,2, and both are independent of all other error terms and all X variables. Then, as in Carroll et al. (2006) and Dattner et al. (2016), we can let Y i := (Yi,1 + Yi,2)/2 = m0( Xi) + ϵ i where ϵ i := (ϵi,1 + ϵi,2)/2. If we let ϵ i := (Yi,1 Yi,2)/2 = (ϵi,1 ϵi,2)/2, then it follows

Unlinked Monotone Regression

from the assumption of 0-symmetry that ϵ i ϵ i. Then, we can use X1, . . . , Xnx and Y 1 , . . . Y ny as our unlinked data and ϵ 1, . . . ϵ ny to estimate Φϵ . Note that computing the estimator of m in practice as described in Algorithm 1 generally requires computation of derivatives of Mn, e.g., the gradients mi Mn(m), where we use the same slight abuse of notation as in (23). The gradient depends on Φϵ and is given by (24). Hence, in the case where Φϵ is unknown, the gradient cannot be computed directly and has to be replaced by an appropriate estimator. In the setting of longitudinal responses, we have

mi Mn(m) = 2n 1 2n 2 n X

α=1 {Φϵ (Y α mi) + B(mi mα)}

where Φϵ denotes the common distribution function of ϵ 1, . . . , ϵ n and where for all m R, B(m) = E(Φϵ (ϵ + m)). With bΦ the empirical distribution function based on the sample ϵ 1, . . . ϵ ny, the gradient can be estimated by

2n 1 2n 2 n X

n bΦ(Y α mi) + b B(mi mα) o

where b B(m) = n 1 y Pny i=1 bΦ(ϵ i + m). The advantage of this estimator is that it does not require any choice of tuning parameters.

8. Demonstrations on Synthetic and Real Data

8.1 Computations on Synthetic Data

In this subsection we present simulation studies for our method and compare our minimum contrast estimator to the deconvolution method of Carpentier and Schl uter (2016). We also compare to classical/linked (oracle) isotonic regression (which uses matching information that the other estimators do not use). We will use mean-squared errors (MSE s) for comparison: for an estimator bm, we report n 1 Pn i=1( bm(X(i)) m0(X(i)))2. There are 2 output tables, Tables 2 and 3, containing Monte Carlo estimates of MSE s. The sample sizes are taken to be n = 100 and n = 1000 for the ﬁrst and second tables respectively. In both tables, we used 1000 Monte Carlo replications. We used 5 diﬀerent true mean functions m0. They are (up to translation and additive constants), together with the abbreviations that denote them, gathered in Table 1. In the tables, our estimator is UL BDD (where UL stands for unlinked ), Car-

m0 Abbreviation x lin 0 const 21[0,5)(x) + 81[5,10](x) step2 51[10/3,20/3)(x) + 101[20/3,10](x) step3 (x41(0,5](x) x41[ 5,0)(x))/120 power

Table 1: The true monotone regression function used in the simulations and their abbreviations.

Balabdaoui, Doss, and Durot

UL BDD UL CS L mono lin, Laplace 0.31 0.27 0.16 const, Laplace 0.18 0.25 0.05 step2, Laplace 0.33 2.84 0.09 step3, Laplace 0.43 2.74 0.12 power, Laplace 0.29 0.47 0.13 lin, Gauss 0.48 0.78 0.16 const, Gauss 0.10 1.23 0.05 step2, Gauss 0.19 2.48 0.09 step3, Gauss 0.32 2.59 0.12 power, Gauss 0.43 0.59 0.14

Table 2: Monte Carlo d MSE s, n = 100.

pentier and Schl uter (2016) s is UL CS , and isotonic regression is L mono (for linked regression, that is, based on the classical case where all covariates and responses are perfectly linked). Our simulations were performed taking both Laplace and Gaussian errors with standard deviation 1 (and both unlinked methods are well-speciﬁed). In Figure 1 we present the output from a single Monte Carlo run, so that the true functions along with sample data and estimates can be visualized. The 10 plots in the ﬁgure are all based on n = 100 samples for each of 10 settings: the 5 monotone regression functions m0 of Table 1 with Laplace distributed errors (left column) and Gaussian errors (right column). To implement the deconvolution method of Carpentier and Schl uter (2016), we needed to monotonize the estimated CDF so that we could compute its generalized inverse. Carpentier and Schl uter (2016) do not mention a speciﬁc method for doing this; we chose to replace b F(x) with max( b F(y) : y x). The deconvolution estimator at X(1) or X(n) was occasionally unstable because of the steepness of the deconvolutionestimated quantile function. In that case, we have dropped those values out without a noticeable eﬀect. Furthermore, the bandwidth for the deconvolution estimator of the CDF was chosen using the bootstrap method of Delaigle and Gijbels (2004). From the output one can see that it is not always the case that the Gaussian noise is harder for our method than Laplace is. This is an interesting ﬁnding because it is known that deconvolution is harder with Gaussian noise than it is with Laplace; see e.g. Fan (1991). This suggests that, although unlinked regression is tightly connected to deconvolution, considering the problem from this point of view may not be the most eﬃcient approach. The output also shows that the CS estimator performs poorly especially when m0 has discontinuities. In general our estimator is competitive with the CS deconvolution estimator and in some cases is signiﬁcantly better.

8.2 Computations on CEX Data

Figure 2 shows plots based on the United States Consumer Expenditure Survey (CEX). The CEX survey has detailed data on both income and the expenditure patterns of so-called U.S. consumer units (roughly, households), see Ruggles et al. (2020). The CEX consists of two surveys, the Interview and the Diary ; the data we use here come from the former. Since the CEX survey has data on both income and expenditures, we can use regular (matched / standard) regression techniques. We

Unlinked Monotone Regression

0 2 4 6 8 10

m0 UL BDD UL CS L mono

0 2 4 6 8 10

0 2 4 6 8 10

0 2 4 6 8 10

0 2 4 6 8 10

0 2 4 6 8 10

0 2 4 6 8 10

0 2 4 6 8 10

0 2 4 6 8 10

0 5 10 0 2 4 6 8

Figure 1: Output from a single Monte Carlo simulation, with n = 100, and Xi i.i.d. uniform on [0, 10]. The left column has Laplace errors and the right has Gaussian errors, both with standard deviation 1. The dotted gray line is the true m0, the red line is our minimum contrast estimator, the blue line is the deconvolution estimator of Carpentier and Schl uter (2016), and the green line is a classical/linked isotonic regression.

Balabdaoui, Doss, and Durot

MSE s UL BDD UL CS L mono lin, Laplace 0.10 0.10 0.03 const, Laplace 0.06 0.13 0.01 step2, Laplace 0.12 3.58 0.01 step3, Laplace 0.15 3.34 0.02 power, Laplace 0.09 0.36 0.03 lin, Gauss 0.29 0.18 0.03 const, Gauss 0.03 0.70 0.01 step2, Gauss 0.07 1.04 0.01 step3, Gauss 0.13 1.40 0.02 power, Gauss 0.25 0.25 0.03

Table 3: Monte Carlo d MSE s, n = 1000.

compare a U.S. consumer unit s food expenditure to its income by regressing the former on the latter. Moreover, by simply ignoring the X-Y pairing information we can also use our approach detailed above for the unlinked setting, and then compare the results obtained with both approaches. Note that we prefer here to use data that are matched, for the sake of being able to validate our results, but there are many settings where matching naturally lacks; for instance a ﬁrm may be able to gather information on an individual consumer s expenditures on the ﬁrm s products, but the ﬁrm would not be able to know the individual s income information. They would be able to access that expenditure information (at least nationally) through the CEX, which creates a data set with unlinked covariates and responses. We consider the interview data only from the second quarter of 2018, for which there are approximately 6000 respondents. We narrow this down to 2164 respondents who provided the relevant information, had income no larger than $250, 000, and reported a non-negative response for both income and food expenditure. The survey actually follows each individual for four consecutive quarters, but we only included those who were surveyed in both quarter 2 and quarter 3. The residuals were computed as described in the previous section: for each individual i, we computed ϵi := (Yi,1 Yi,2)/2 where Yi,1 is the quarter 2 response and Yi,2 is the quarter 3 response. The error distribution is assumed to be Laplace distributed with λ = p

bσ2/2 where bσ2 is the empirical variance of ϵ1, . . . , ϵ2164. We chose to assume that the errors follow the Laplace distribution rather than to use the full method detailed in Subsection 7.2 because the former is much faster to run. Choosing the Laplace distribution was based on observing that it ﬁts much better the distribution of the residuals than the Gaussian one. In Figure 2, the UL BDD line is the unlinked monotone minimum contrast estimator proposed in this paper. This estimate is fully data driven, using no oracle (matching) information. The L mono line corresponds to monotone regression estimator based on the matched data; i.e., the Grenander-type estimator. Similarly the L linear line is a linear regression estimator based on the matched data. The UL CS line corresponds to the deconvolution estimator of Carpentier and Schl uter (2016) (using the same choice of λ we used in our method). We also implemented a type of unlinked oracle estimator in which we used a Gaussian-mixture as the error distribution, labeled UL-oracle BDD : For this estimator we used the residuals from the matched monotone regression to estimate Φϵ (so this is oracle information which would not be available in a true unmatched problem). We used a four component

Unlinked Monotone Regression

0 50000 100000 150000 200000

0 500 1000 1500 2000 2500

FINCBTAX (annual family income before tax, US$)

FOODCQ (current quarter food expenditure, US$)

UL BDD UL quantile UL CS L mono L linear UL oracle BDD

Figure 2: CEX Interview Survey: Family income vs. Food expenditure

Gaussian mixture (with no variance constraint) to approximate the error distribution, which was ﬁt using an EM algorithm to converge to a local maximum (it is well known that the global maximum is inﬁnite). The estimate has weights (.56, .05, .33, .06), means/locations ( 345, 416, 326, 1710) and standard deviations (322, 75, 562, 1286). The mixture distribution is a much better ﬁt to the residual distribution than either a a single Gaussian or Laplace distribution, since the residual distribution is multimodal and heavy tailed. This estimator of Φϵ is much more dispersed than the fully data driven one; for instance, the former has standard deviation 743 instead of 266 for the latter one. This leads to diﬀerences in these two estimators. Finally, the UL quantile line is based on matching the empirical quantiles of the Y and X samples: it is simply given by (connecting linearly) the points (X(1), Y(1)), . . . , (X(n), Y(n)). Our UL BDD estimator is somewhat accurate although it does diﬀer noticeably from the oracle isotonic regression as well as the UL-oracle BDD estimator. The estimator of Carpentier and Schl uter (2016) does very poorly on this data set. We suspect that inaccuracy in choice of the error distribution causes diﬃculty for both of the unlinked estimators, which is of course expected.

9. Conclusions and Directions for Future Research

In this paper, we have presented a general method for unlinked regression with a monotonic regression function, and developed basic theory for the resulting estimator. We believe our approach will generalize to other (identiﬁable) unlinked regression settings. We have introduced a variant of an active set algorithm for computing the estimator, and demonstrated its use on a real data example in a fully data driven way in which we estimated the unknown error distribution. There are many remaining questions about this problem and about our method that future work could answer.

1. Our current study is restricted to the case of a univariate predictor. Studying both theory and practice when dimension is larger than 1 will be an important avenue for future work. In the case of linear regression, several works (Abid et al. (2017);

Balabdaoui, Doss, and Durot

Pananjady et al. (2017, 2018); Unnikrishnan et al. (2018)) have already begun this study, although those works focus mostly on the case of Gaussian noise.

2. Finding (minimax) lower bounds for the rate of convergence seems to be hard to obtain in our setting. Such bounds are needed for a more complete theoretical understanding of the problem setting. In earlier literature, the closest result we are aware of is the one obtained in Rigollet and Weed (2019), who provide a minimax lower bound in the case of a Gaussian noise in the shuﬄed monotone regression. However, it is not known if the minimax lower bound derived by these authors applies to our problem, since the unlinked problem is statistically more diﬃcult than the shuﬄed problem.

3. One of the major diﬀerences in unlinked regression from linked regression is that in the former the speciﬁcation of the error distribution is crucially important. As shown in the introduction, if the error distribution is unknown then the model is not even identiﬁable. It would be helpful to understand the general properties of unlinked regression models and of our method in particular when one has partial but incomplete knowledge of the error distribution (e.g., some moment parameters can be estimated well but the full distribution is not known precisely).

4. Carpentier and Schl uter (2016) allowed for so-called contextual variables ; for instance, if the unit i of observation is an individual, both Yi and Xi may be paired with a contextual variable Zi,Y and Zi,X such as the individual s age. One may match Y s and X s with equal (or similar) ages, and then one may consider unmatched regression on these partially matched data sets. This is what Carpentier and Schl uter (2016) proposed in the case of discrete, perfectly (noiselessly) observed contextual variables. More broadly, one may use so-called linkage methods (Herzog et al., 2007) to partially link Y and X (eﬀectively reducing the noise level) when the contextual variables are not as idealized, and then perform linked regression. This methodology could be broadly useful in the linkage literature and warrants further study.

Acknowledgements

The second author is supported in part by NSF Grant DMS-1712664. The third author is supported in part by MME-DII (ANR11-LBX-0023-01) and by the FP2M federation (CNRS FR 2036).

Appendix A. Bounding the Integrals in (56) and (57).

Using Assumptions A0 and A1 we can write that

H0(y)dy = Z 0

R Φϵ(y m0(x))d F0(x)dy

Φϵ(y + K0)dy < ,

(note that this also follows from E(|Y |) < and integration by parts.) Similarly it can be shown that R 0 (1 H0(y))dy < . Hence, I1 R 0 (1 H0(y))dy + R 0 H0(y)dy <

Unlinked Monotone Regression

R (Φϵ(y m0(x)) H0(y))2 dy 2 Z 0

Φϵ(y m0(x))2 + H0(y)2 dy

0 (1 H0(y))2dy

Φϵ(y + K0)dy + 2 Z 0

0 (1 H0(y))dy

as shown above; this implies, by Fubini s Theorem, that I2 < .

Appendix B. Basic Empirical Process Theory Deﬁnitions and a Fundamental Preservation Result

For a (possibly random) signed measure Q on a (measurable) space X and a measurable function f on X, we denote Qf := R

X fd Q. For some class of functions G, we can deﬁne

its ϵ-covering number N(ϵ, G, ) with respect to some norm is deﬁned as the smallest integer N > 0 such that there exists g1, . . . , g N satisfying that for any g G, there exists i {1, . . . , N} such that

its ϵ-bracketing number NB(ϵ, G, ) with respect to some norm is deﬁned as the smallest integer N > 0 such that there exist pairs (h1, k1), . . . , (h N, k N) satisfying that for any g G, there exist i {1, . . . , N} such that hi g ki and ki hi < ϵ.

From the deﬁnition of the covering and bracketing numbers it can be easily shown (van der Vaart and Wellner, 1996, pp. 83 84) that if is an Lp norm, for some 1 p , then for any δ > 0,

N(δ, G, ) NB(2δ, G, ). (26)

Also, if the class G admits an envelope F, then deﬁne for η > 0 the number

J(η, G) = sup Q

1 + log N(δ F Q,2, G, L2(Q))dδ (27)

where the supremum is taken over all discrete probability measures Q such that

F Q,2 := R |F(x)|2d Q(x) 1/2 < .

We ﬁnish this section by the following preservation result.

Proposition 8. Let be some norm, and G1 and G2 two classes of functions. For ﬁxed λ1, λ2 such that (λ1, λ2) = (0, 0), deﬁne the class

λ1G1 + λ2G2 = {h = λ1g1 + λ2g2 : (g1, g2) G1 G2} .

Balabdaoui, Doss, and Durot

Then, for any ϵ > 0

N(ϵ, λ1G1 + λ2G2, ) N((|λ1| + |λ2|) 1ϵ, G1, ) N((|λ1| + |λ2|) 1ϵ, G2, ).

Proof [Proof of Proposition 8.] Fix ϵ > 0. Let h = λ1g1 + λ2g2 λ1G1 + λ2G2, N1 = N(ϵ(|λ1| + |λ2|) 1, G1, ) and N2 = N(ϵ(|λ1| + |λ2|) 1, G2, ). We assume in the sequel that both N1 and N2 are ﬁnite since otherwise, the inequality in Proposition 8 is trivial. Then, there exists a pair (i, j) {1, . . . , N1} {1, . . . , N2} and (g1,i, g2,j) such that g1 g1,i < ϵ(|λ1| + |λ2|) 1 and g2 g2,j < ϵ(|λ1| + |λ2|) 1. Then, by the triangle inequality we have that

h λ1g1,i λ2g2,j = λ1(g1 g1,i) + λ2(g2 g2,j)

|λ1| g1 g1,i + |λ2| g2 g2,j < ϵ

which completes the proof.

Appendix C. Wasserstein Distance Lemmas

Recall that W1(F, G) denotes the ﬁrst Wasserstein distance between two probability distributions with distribution functions F and G. The following is a well-known representation of the Wasserstein-1 distance in one dimension (see, e.g., Bobkov and Ledoux (2019), or e.g., Proposition 2 of Meis and Mammen (2020)).

Proposition 9. Let F, G be distribution functions on R, each having ﬁnite ﬁrst moment. Then

W1(F, G) = Z

R |F(x) G(x)|dx.

The following is Proposition 3 of Meis and Mammen (2020).

Proposition 10. Let F, G be two distribution functions supported on [0, V ] and suppose H is a distribution function supported on a ﬁnite set of points. Then

W1(F, G) C(V, H)W1(F H, G H)

where C(V, H) > 0 depends only on V, H.

Appendix D. Gradient, Curvature, and Other Algorithmic Computations

In this section, we prove (24) and give an explicit formula of B(m) in the case of Laplace, Gaussian, and Gaussian-mixture distributions.

Unlinked Monotone Regression

D.1 Proof of (24)

mi Mn(m) = Z

Hn(y) n 1 n X

α=1 Φϵ(y mα)

n 1fϵ(y mi)dy

1[Yα, )(y)fϵ(y mi) Φϵ(y mα)fϵ(y mi) dy

Yα mi fϵ(y) dy Z

R Φϵ(y mα + mi)fϵ(y) dy ,

and (24) follows.

D.2 Curvature Derivation

It is convenient to re-parameterize the objective function as was done in the algorithm development in the main text. Let b Hm(y) := Pp j=1 πjΦϵ(y mj), and then recall

Mn,p(m) := Z (Hn(y) b Hn(y))2dy

where πj 0 and Pp j=1 πj = 1. (Here n = ny, and nx is deﬁned implicitly in terms of the πj and p.) Then, as derived above but in the new notation, we have

mi Mn,p(m) = 2πi

Z (Hn(y) b Hn(y))fϵ(y mi)dy

α=1 1 Φ(Yα mi)

j=1 πj EΦϵ(ϵ + mi mj)

α=1 Φϵ(Yα mi)

j=1 πj B(mi mj)

where again B(m) := EΦϵ(ϵ + m). Then (assuming i = j)

m2 i Mn,p(m) = 2πi(

α=1 n 1fϵ(Yα mi)

j =i πj B (mi mj)

mi mj Mn,p(m) = 2πiπj B (mi mj).

Thus, in order to compute the curvature (i.e., 2/ θ2), it suﬃces to compute B (which we do below in the three cases we consider). One more set of calculations that are useful are the following; we have

α=1 παΦϵ(y mα)

(πifϵ(y mi) πjfϵ(y mj))dy,

θ2 Mn(m) = Z

b Hm(y) Hn(y)

(πif ϵ(y mi) + πjf ϵ(y mj)) + (πifϵ(y mi) πjfϵ(y mj))2dy. (29)

Balabdaoui, Doss, and Durot

D.3 Computations for Laplace Distribution

Let λ > 0 and assume that

( 2 1e |z|/λ if z 0 1 2 1e z/λ if z > 0

and let fϵ(y) := Φ ϵ(y) = e |z|/λ/2λ for z R. We compute B(m) for m R. We have

(4λ) 1ey/λe |y m|/λdy + Z

0 (1 2 1e y/λ)e |y m|/λ(2λ) 1dy

which equals Z m 0

(4λ) 1e(y m/2)2/λdy + Z 0

m 0 (4λ) 1em/λdy

0 (2λ) 1(e(y m)/λ 2 1e m/λ)dy

1 2λe (y m)/λ 1

4λe (y m/2)2/λdy.

If m 0, then

8em/λ + |m|

4λ em/λ + 0 + 3

8em/λ. (30)

If m 0, then

8e m/λ + 0 + 1

4λe m/λ(2λ + m) + 1

8e m/λ . (31)

This gives an explicit formula for B(m). We can then compute that for any m R (including m = 0), B (m) = e |m|/λ(λ + |m|)/(4λ2). The calculations for B (m) are as follows. For m 0, we have

2λem/λ = ((4λ) 1em/λ + m(4λ2) 1em/λ) = em/λ 2λ λ m

= e |m|/λ λ + |m|

And for m 0 we have

4λe m/λ 2λ + m

4λ2 e m/λ = e m/λ 2λ + m

= e m/λ m + λ

D.4 Computations for Gaussian Errors

Now we consider the case where, for some σ > 0, Φϵ = Φ( /σ) is the cumulative distribution function of a N(0, σ2) random variable. It turns out we can write B( ) in terms of Φ: by Corollary 1 of Ellison (1964), B(m) = EΦϵ(N(m, σ2)/σ) = Φ(m/σ

2). Thus we also have

B (m) = N(0, σ2) density = 1

4πσ2 e m2/4σ2,

B (m) = 2m (2σ)3 πe m2/4σ2.

Unlinked Monotone Regression

D.5 Mixtures of Gaussian

This again relies on Corollary 1 of Ellison (1964). Recall Φ is the CDF of a N(0, 1) variable. Assume for an integer L 1, locations µi R, and weights λi summing to 1 that

B(m) = EΦϵ(ϵ + m) =

i,j λiλj EΦ Y µi + m

where Y N(µj, σ2 j ), so Y µi+m

σi N((µj µi + m)/σi, σ2 j /σ2 i ) and thus (33) equals (by Corollary 1 of Ellison (1964))

(m µi + µj)/σi q

1 + σ2 j /σ2 i

µj µi + m q

σ2 i + σ2 j

We then have that

µj µi + m q

σ2 i + σ2 j

σ2 i + σ2 j .

D.6 On Implementation of Algorithm 1

A few remarks about the implementation of Algorithm 1 are as follows. The entries of the initial counts vector should sum to nx; this relationship is then preserved throughout the algorithm. Generally the initializer for m is taken to be a constant (e.g., the median of the responses with p(0) = 1). In the algorithm we did not take care to force the counts to be integers when we divided by 2, but this can easily be done and should be done for easy interpretability. As end criterion, one can iterate for a ﬁxed number K of steps, or one can iterate until a stopping rule (e.g., the objective function decrease is smaller than a ﬁxed tolerance level) is satisﬁed. A heuristic choice for the parameter eps is eps = (Y(n) Y(1))/(n1/3σ) where σ2 is the variance of ϵ, and n1/3 is motivated by properties of classical isotonic regression.

Appendix E. Proofs

In this section we provide our proofs.

E.1 A Preparatory Lemma

We begin with a lemma that will be used several times in the proofs of the main results.

Lemma 3. Let m M. If F0 is continuous, then

Z 1{m(x) w}d F0(x) F0 m 1(w) 2 dw = 0. (34)

Balabdaoui, Doss, and Durot

Proof [Proof of Lemma 3] Recall that m 1 is deﬁned by (6) where the inﬁmum of an empty set is deﬁned to be 1. If the set in (6) is non-empty, then the inﬁmum is achieved by right-continuity of m. Hence, we have m m 1(y) y for all y m(1). Now, consider x [0, 1] and y m(1) such that m(x) y. Since the inﬁmum in (6) is achieved this implies that x m 1(y). Conversely, if we have x m 1(y) then monotonicity of m implies that m(x) m m 1(y) where as mentioned above, m m 1(y) y. It follows that for all x [0, 1] and y m(1) we have the equivalence

m(x) y x m 1(y). (35)

Now, consider y > m(1). The set in (6) is empty and therefore, m 1(y) = 1 by deﬁnition. The left-hand inequality in (35) does not hold if x [0, 1], and the righthand inequality does not hold neither if x < 1 since m 1(y) = 1. This mean that the above equivalence holds for all x [0, 1) and y R. Let X be a random variable with distribution function F0. Since P(X = 1) = 0 by assumption, it follows that for all w R, we have P(m(X) < w) = P(X < m 1(w)) (36)

and therefore,

P(m(X) w) P(X m 1(w)) = P(m(X) = w) P(X = m 1(w))

where the second probability on the right hand side equals zero since X has a continuous distribution function. It follows that

P(m(X) w) F0 m 1(w) 2 dw = Z

R P(m(X) = w)2dw

To see why the preceding equality holds true, note that since the distribution function of X, F0, is assumed to be continuous, then it follows that Z

R P(m(X) = w)2dw = Z

W P (X [a(w), b(w)))2 dw

where W is the set of point w R such that there exist x = x that satisfy m(x) = m(x ) = w, and for w W, a(w) < b(w) are such that m takes the constant value w on [a(w), b(w)), and a(w) = m 1(w). Using the well-known fact that a monotone function admits at most countably many constant parts, the set W is at most countable and therefore, Z

R P(m(X) = w)2dw Z

W dw = λ(W) = 0

where λ denotes the Lebesgue measure on R. Lemma 3 follows from (37) since

P(m(X) w) = Z 1{m(x) w}d F0(x).

Unlinked Monotone Regression

E.2 Proofs for Section 2

Proof of Proposition 1. Let F(u) = P(X u) for all u R. It follows from (36), that holds for all w R an m M, that

P(m1(X) t) = P(X m 1 1 (t)) (38)

= F m 1 1 (t) (39)

for all t R. Since m1(X) has the same distribution as m2(X), this implies that

F m 1 1 = F m 1 2 . (40)

It follows from the deﬁnition (6) of the inverse of a function m M, where we recall that the inﬁmum of an empty set is deﬁned to be one, that m 1 j (y) = 0 for all y mj(0) and m 1 j (y) = 1 for all y > mj(1). Hence, we deﬁne the inverse of the non-increasing left-continuous function F m 1 j as

(F m 1 j ) 1(t) = sup{y [mj(0); mj(1)], F m 1 j (y) t}

for all t R, with the convention that the supremum of an empty set is equal to mj(0). Our aim is to derive from (40) that the inverses of F m 1 1 and F m 1 2 are equal. This is not an immediate consequence of the equality in (40) since the deﬁnition of the inverse function of F m 1 j involves the function mj in addition to the function F m 1 j . However, we show below that the dependence on mj can be removed by restricting attention to a restricted support. We deﬁne the generalized inverse of F by

F 1(t) = sup{u [0, 1], F(u) t}

for all t R, with the convention that the supremum of an empty set is equal to zero. Similar to the proof of Lemma 3, it can be proved using that F is non-increasing and left-continuous that the equivalence

F(u) t F 1(t) u (41)

holds for all u (0, 1] and t R. Combining this with (35) (that holds for all x [0, 1) and y R), we obtain that the equivalence

F m 1 j (y) t mj F 1(t) y (42)

holds for all t > 0 and y > mj(0). For y mj(0), the inequalities on both sides of the equivalence in the previous display hold true for all t 1 since in that case, F m 1 j (y) = F(0) = 1. This means that the equivalence in (42) holds for all t (0, 1] and y R. Now, consider t (0, 1] such that t < F m 1 j (mj(0)). Since m 1 j (mj(0)) = 0, this means that t < F(0) where F(0) = 1. Otherwise said, we consider t (0, 1). Because t < F m 1 j (mj(0)), we have

(F m 1 j ) 1(t) = sup{y mj(1), F m 1 j (y) t}.

Moreover, the inequality F m 1 j (y) t cannot hold for y > mj(1) since t > 0 and therefore,

(F m 1 j ) 1(t) = sup{y R, F m 1 j (y) t}.

Balabdaoui, Doss, and Durot

Combining this with (40) proves that

(F m 1 1 ) 1(t) = (F m 1 2 ) 1(t)

for all t (0, 1). Using the equivalence in (42) (that holds for all t (0, 1] and y R), we also have

(F m 1 j ) 1(t) = sup{y R, mj F 1(t) y}

= mj F 1(t).

Hence, m1 F 1(t) = m2 F 1(t)

for all t (0, 1). This in turn implies that m1 = m2 on the support of X since the range of F 1 is the support of X.

Proof of Proposition 2. Proof of Claim 1. Recall that any element m M is bounded, and hence there exists K > 0 such that m K. Denote by Y(1) Y(n) the order statistics corresponding to Y1, . . . , Yn. We have for all y < Y(1) that Hn(y) = 0. Moreover, it follows from monotonicity of Φϵ and m that 0 Φϵ(y m(Xi)) Φϵ(y m(X(1))) 1 for all i {1, . . . , n} and therefore,

n Hn(y) n 1 n X

i=1 Φϵ(y m(Xi)) o2 = n 2 n X

i=1 Φϵ(y m(Xi))

Φϵ(y m(X(1)))2

Φϵ(y m(X(1))).

Now, existence of expectation of ϵ implies that Z c

Φϵ(t)dt < , and Z

c (1 Φϵ(t))dt < (43)

for arbitrary c R and therefore,

n Hn(y) n 1 n X

i=1 Φϵ(y m(Xi)) o2 dy Z Y(1)

Φϵ(y m(X(1)))dy

= Z Y(1) m(X(1))

Similarly, Hn(y) = 1 for y > Y(n) and hence

n Hn(y) n 1 n X

i=1 Φϵ(y m(Xi)) o2 = 1 n 1 n X

i=1 Φϵ(y m(Xi) 2

1 Φϵ(y m(X(n)) 2

1 Φϵ(y m(X(n)).

Unlinked Monotone Regression

Combined with (43), this proves that Z

n Hn(y) n 1 n X

i=1 Φϵ(y m(Xi)) o2 dy < . (45)

Since the integrand is bounded, (44) and (45) yield Z

n Hn(y) n 1 n X

i=1 Φϵ(y m(Xi)) o2 dy < ,

which proves that Mn(m) is ﬁnite.

Proof of Claim 2. Assume that Assumption A1 holds and consider m M. Then, we have that

Φϵ(y m0(x)) Φϵ(y m(x)) d F0(x) 2 dy

1 Φϵ(y m(x)) (1 Φϵ(y m0(x))) d F0(x) 2 dy (46)

Φϵ(y m0(x)) Φϵ(y m(x)) d F0(x) 2 dy. (47)

We further bound above the integral in (46) by

1 Φϵ(y m(x)) d F0(x) 2 dy

1 Φϵ(y m0(x)) d F0(x) 2 dy

1 Φϵ(y max(K0, K)) 2 dy (48)

where we recall that K m . The latter integral is ﬁnite using again (43). Similarly, we argue that the integral in (47) can be also bounded above by

Φϵ(y + max(K0, K)) 2 dy < .

This completes the proof that M(m) is ﬁnite.

Proof of Claim 3. Using again that Hn(y) = 0 for all y < Y(1), together with monotonicity of Φϵ and m, we have that

Mn(m) Z Y(1)

i=1 Φϵ(y m(Xi)) 2 dy

Φϵ(y m(X(1)))2dy

= n 2 Z Y(1) m(X(1))

, if m(X(1)) .

Balabdaoui, Doss, and Durot

Similarly, for y Y(n), it holds that

Mn(m) n 2 Z

1 Φϵ(y m(X(n))) 2dy

Y(n) m(X(n)) (1 Φϵ(t))2dt

, if m(X(n)) .

Hence, there exists some K > 0 (which may depend on n) such that any candidate m M for the minimization problem in (8) should satisfy K m(X(1)) . . . m(X(n)) K. By identifying an element m MK by the corresponding vector θ = (m(X(1)), . . . , m(X(n)))T , it is easy to see that the original minimization problem is equivalent to minimizing

e Mn(θ) =: Z

Hn(y) n 1 n X

i=1 Φϵ(y θi)

on the compact ﬁnite dimensional subset

SK =: (θ1, . . . , θn)T Rn : K θ1 . . . θn K .

Now, the function e Mn is continuous on SK since for any sequence (θp)p 0 in SN K converging (in any distance) to θ SK, the sequence of functions

Hn(y) n 1 n X

i=1 Φϵ(y θp,i)

converges pointwise by continuity of Φϵ (see Assumption A2) to the limit

Hn(y) n 1 n X

i=1 Φϵ(y θi)

Also, for y R, we have that (49) is no larger than

n 1 Pn i=1 Φϵ(y + K) 2, for y < Y(1) 4, for Y(1) y Y(n) 1 n 1 Pn i=1 Φϵ(y K) 2 , for y > Y(n)

where the function on the right side can be shown to be integrable using similar arguments as above. By the Lebesgue dominated convergence theorem, it follows that

lim p e Mn(θp) = e Mn(θ).

Thus, e Mn admits at least a minimizer in SK, bθn say. We conclude that Mn admits at least a minimizer bmn which is bounded by K, and such that ( bmn(X(1)), . . . , bmn(X(n)))T = bθn. The values of the minimizer being given by bθn at the observed covariates X1, . . . , Xn,

Unlinked Monotone Regression

any monotone interpolation of these values gives a solution to (8). In particular, there exists a solution bmn that takes constant values between successive covariates and that is right continuous. Proof of Claim 4. Without loss of generality, and possibly changing scale, we can assume that ϵ is supported on [ 1, 1]. We show below that there exists at least a solution to (8) that is bounded in sup-norm by K0 + 2, where K0 is taken from Assumption A1. For an arbitrary m M, we deﬁne the truncated version m by

K0 + 2 if m(x) K0 + 2 K0 2 if m(x) K0 2 m(x) otherwise

In the following, we place ourselves in the event ϵ 1 which occurs with probability 1. Consider y > K0 + 1. Since |Yi| K0 + 1 for all i , we then have Hn(y) = 1, and Φϵ(y m(Xi)) = Φϵ(y m(Xi)) = 1

for all Xi s such that m(Xi) K0 2. Also, for all Xi s such that m(Xi) K0 + 2 we have that Φϵ(y m(Xi)) Φϵ(y m(Xi)) 1.

This implies that Z

n Hn(y) n 1 n X

i=1 Φϵ(y m(Xi)) o2 dy

n Hn(y) n 1 n X

i=1 Φϵ(y m(Xi)) o2 dy. (50)

Similarly, it can be shown that Z K0 1

n Hn(y) n 1 n X

i=1 Φϵ(y m(Xi)) o2 dy

n Hn(y) n 1 n X

i=1 Φϵ(y m(Xi)) o2 dy. (51)

Now, consider y such that |y| K0 + 1. If for some i we have y > m(Xi) + 1 (or y < m(Xi) 1) then we have that

Φϵ(y m(Xi)) = Φϵ(y m(Xi)).

Indeed, if y > m(Xi) + 1, then m(Xi) < K0. In case m(Xi) > K0 2 we have m(Xi) = m(Xi) and Φϵ(y m(Xi)) = Φϵ(y m(Xi)) = 1. If m(Xi) K0 2, then m(Xi) = K0 2 and hence y m(Xi) 1 implying again that Φϵ(y m(Xi)) = Φϵ(y m(Xi)) = 1. Similar arguments can be used in case y < m(Xi) 1. Now, the equality in the above display holds also if |y m(Xi)| 1 since in that case, |m(Xi)| K0 + 2, implying that m(Xi) = m(Xi). Combining this with (50) and (51) shows that Z

n Hn(y) n 1 n X

i=1 Φϵ(y m(Xi)) o2 dy

n Hn(y) n 1 n X

i=1 Φϵ(y m(Xi)) o2 dy.

Balabdaoui, Doss, and Durot

From Claim 3 in Proposition 2, there exists at least a solution to (8) and from the arguments above, its truncated version also is a solution. Hence, there exists at least a solution that is bounded in the sup-norm by K0 +2 with probability 1. This completes the proof of the proposition.

E.3 Proofs for Section 3

We ﬁrst prove the propositions in Section 3 and ﬁnish with the proof of Theorems 1, 2, 3, 4 and 5.

Proof of Proposition 3. Let b Hn and H0 n be the distribution functions deﬁned as

b Hn(y) = 1

i=1 Φϵ(y bmn(Xi)), and H0 n(y) = 1

i=1 Φϵ(y m0(Xi)), (52)

for y R. Recall the Plancherel s identity for Fourier transforms: for a function g L1(R) L2(R), where L1(R) and L2(R) denote respectively the set of integrable, and the set of square integrable functions from R to R with respect to the Lebesgue measure it holds that Z

R g(x)2dx = 1

R |φg(x)|2dx

where φg is deﬁned in (10). If F1 and F2 are two distribution functions with ﬁnite expectations, it follows using integration by parts that

ψF2(x) ψF1(x) = ix Z

R (F2(t) F1(t))eitxdt

implying that

φF2 F1(x) = i ψF2(x) ψF1(x)

for x = 0. Moreover, if F1 and F2 have ﬁnite expectations then

Fj(x)dx < and Z

0 (1 Fj(x))dx < ,

for j {1, 2}, implying that F1 F2 L1(R) L2(R). Therefore, the Plancherel identity implies that Z

R (F2(x) F1(x))2dx = 1

1 x2 |ψF2(x) ψF1(x)|2dx.

We apply below this identity with F1 and F2 replaced respectively by b Ln and Ln, deﬁned in (14). Note that the two corresponding distributions have ﬁnite expectations since they are supported on a ﬁnite set. Hence, Z

b Ln(w) Ln(w) 2 dw = 1 2π

ψb Ln(t) ψLn(t) 2 dt.

Unlinked Monotone Regression

By Assumption A5, we can ﬁnd T > 0 such that |φfϵ(t)| |φfϵ(T)| > 0 for all T > T

and |t| T. Using that |ψF | 1 for any distribution function F, it follows from the previous display that for all T > T we have Z

b Ln(w) Ln(w) 2 dw 1 2π|φfϵ(T)|2

ψb Ln(t) ψLn(t) 2 dt + 4

1 2π|φfϵ(T)|2

ψb Ln(t) ψLn(t) 2 dt + 4

Now, using again Plancherel s identity we have Z

ψb Ln(t) ψLn(t) 2 dt = Z

R |φfϵ(t)|2|φb Ln Ln(t)|2dt

R |φfϵ (b Ln Ln)(t)|2dt

b Hn(y) H0 n(y) 2 dy

since b Hn = fϵ b Ln and H0 n = fϵ Ln. Here, (f g)(y) := R

R f(z)g(y z)dz. Hence, it follows from Assumption A4 that for suﬃciently large T, Z

b Ln(w) Ln(w) 2 dw T 2β

b Hn(y) H0 n(y) 2 dy + 4

Assuming that we have Z

R E b Hn(y) H0 n(y) 2 dy = O(n 1), (55)

it will follow that for all suﬃciently large T, Z

R E b Ln(w) Ln(w) 2 dw O(T 2βn 1) + 4

For T = Tn n1/(2β+1) we get Z

R E b Ln(w) Ln(w) 2 dw O 1 n1/(2β+1)

which proves Proposition 3. Now, we will show (55). From the inequality (a + b)2 2(a2 + b2), which holds for any a and b in R, and the deﬁnition of bmn it follows that Z

b Hn(y) H0 n(y) 2 dy 2 Z

b Hn(y) Hn(y) 2 dy + 2 Z

H0 n(y) Hn(y) 2 dy

H0 n(y) Hn(y) 2 dy

H0 n(y) H0(y) 2 dy + 8 Z

R (Hn(y) H0(y))2 dy

E[(Hn(y) H0(y))2] = n 1H0(y)(1 H0(y)),

Balabdaoui, Doss, and Durot

E[ H0 n(y) H0(y) 2] = 1 n Var Φϵ(y m0(X))

R (Φϵ(y m0(x)) H0(y))2 d F0(x).

Both the integrals

R H0(y)(1 H0(y))dy (56)

R (Φϵ(y m0(x)) H0(y))2 d F0(x)dy (57)

are ﬁnite, see Appendix A. This yields the result.

Proof of Proposition 4. In the sequel we denote by PX n and P X the empirical probability measure associated with X1, . . . , Xn and the true corresponding probability measure. Then, the two integrals in Proposition 4 are the integrated square of the empirical processes

b Ln(w) b L0 n(w) = (PX n P X)1{ bmn( ) w}

Ln(w) L0(w) = (PX n P X)1{m0( ) w}.

In Appendix B, we recall some of the basic tools of empirical processes that we need in this proof. In what follows, the notation means smaller or equal modulo a universal positive multiplicative constant. For all ﬁxed w R and m M, let kw,m be the function deﬁned by kw,m(x) = 1m(x) w for all x [0, 1]. Consider the set of functions

I := n kw,m, with m M and w [A, B] o .

Using the same notation as in van der Vaart and Wellner (1996) (for completeness, we provide deﬁnitions in Appendix B), let us write Gnk = n(PX n P X)k for k I. Since I is a subset of the class of monotone non-increasing functions f : R 7 [0, 1], it follows from van der Vaart and Wellner (1996, Theorem 2.7.5) that there exists a universal constant C > 0, such that for any δ > 0 and any probability measure Q,

log NB δ, I, L2(Q) C

(where NB( , , ) is deﬁned in Appendix B). Since I admits F(t) = 1 as an envelope, this and the inequality in (26) imply that

J(1, I) sup Q

1 + log NB(2δ, I, L2(Q))dδ

Unlinked Monotone Regression

where J(δ, F) is deﬁned in (27). Since X1, . . . , Xn are i.i.d. it follows now from van der Vaart and Wellner (1996, Theorem 2.14.1) that

E Gn 2 I 1/2 J(1, I). (58)

Let us denote

Mn = max R B A b Ln(w) b L0 n(w) 2 dw, R B A Ln(w) L0(w) 2 dw .

The ﬁrst two claims in the proposition now follow from (58) combined to the Markov s inequality. Now, using the inequality (a + b + c)2 3(a2 + b2 + c2) for any real numbers a, b and c, we have Z B

b L0 n(w) L0(w) 2 dw 3 Z B

b Ln(w) b L0 n(w) 2 dw

b Ln(w) Ln(w) 2 dw

A (Ln(w) L0(w))2 dw.

It thus follows from Proposition 3 that (16) holds provided that B A = OP (n2β/(2β+1)).

Proof of Proposition 5. The ﬁrst equality in Proposition 5 follows from Lemma 3 above combined with the deﬁnition of b L0 n and L0 n, while the second equality follows from Proposition 4.

Proof of Proposition 6. It follows from Lemma 4 below that Z b

a | bmn(x) m0(x)| d F0(x) Z Bn

An |F0 bm 1 n (x) F0 m 1 0 (x)|dx.

The proposition then follows from applying the Cauchy-Schwarz inequality.

Lemma 4. Let f : [0, 1] R and g : [0, 1] R be right-continuous non-decreasing functions. Let f 1 and g 1 be the corresponding generalized inverses, see (6) where the inﬁmum of an empty set is deﬁned to be one. Let H : [0, 1] [0, 1] be a continuous non-decreasing function. Then, for all a < b in [0, 1] we have

a |f(t) g(t)|d H(t) Z S(b)

I(a) |H g 1(x) H f 1(x)|dx

where I(a) = f(a) g(a) ; S(b) = f(b) g(b).

Balabdaoui, Doss, and Durot

Proof [Proof of Lemma 4.] For all real numbers u, let u+ = max(u, 0). We then have

a |f(t) g(t)|d H(t) = I1 + I2 (59)

a (f(t) g(t))+d H(t) and I2 = Z b

a (g(t) f(t))+d H(t).

Let us deal ﬁrst with I1. We have

0 I{x f(t) g(t)}dxd H(t) = Z b

g(t) I{x f(t)}dxd H(t)

g(t) I{x f(t)}dxd H(t),

where we use a change of variable for the second equality and the monotonicity of f for the third one. Similar to (35), the equivalence

t f 1(x) f(t) x

holds for all t [0, 1) and x R. Combining this with the Fubini theorem, we arrive at

g(t) I{t f 1(x)}dxd H(t)

a I{t f 1(x)}I{t<g 1(x)}d H(t)dx.

Hence, it follows from the continuity of H that

I1 = Z f(b)

H(g 1(x) b) H(f 1(x) a)

H(g 1(x)) H(f 1(x))

since H is non-decreasing. Since I(a) g(a) and S(b) f(b), this implies that

H g 1(x) H f 1(x)

Interchanging the roles of f and g, we obtain

H f 1(x) H g 1(x)

and therefore,

I1 + I2 Z S(b)

H f 1(x) H g 1(x) dx.

Unlinked Monotone Regression

Lemma 4 then follows from (59).

Proof of Theorem 1. Proof of the ﬁrst claim. If we have (12), then An and Bn from Proposition 6 are both of the order OP (1). Hence, the ﬁrst claim in Theorem 1 is an immediate consequence of Proposition 6 combined with Proposition 5. Proof of the second claim. Now, we show that (12) holds true for all a and b such that 0 < F0(a) F0(b) < 1. It follows from the deﬁnition of b Ln and Ln together with the H older inequality and Proposition 3 that

i=1 1{ bmn(Xi) w}

dw = Z 2 m0

b Ln(w) Ln(w) dw

b Ln(w) Ln(w) 2 dw 1/2

On the other hand,

i=1 1{ bmn(Xi) w}

dw m0 n 1 n X

i=1 1{ bmn(Xi)>2 m0 }

and therefore,

i=1 1{ bmn(Xi)>2 m0 } = o P (n).

By monotonicity of bmn this implies that

1{ bmn(b)>2 m0 }

i=1 1{Xi>b}

i=1 1{ bmn(Xi)>2 m0 } = o P (n).

By the law of large numbers, n 1 Pn i=1 1{Xi>b} converges in probablity to 1 F0(b) > 0 and therefore, it follows from the previous display that

1{ bmn(b)>2 m0 } = o P (1).

This implies that lim n P( bmn(b) > 2 m0 ) = 0.

One can prove similarly that

lim n P( bmn(a) < 2 m0 ) = 0.

This implies (12) by monotonicity of bmn, which completes the proof of the second claim in Theorem 1.

Proof of Theorem 2 As the arguments are very similar to those used in the proof of Theorem 1, we focus here on how the converge rate is obtained in the supersmooth

Balabdaoui, Doss, and Durot

case. Under Assumptions A0 A3, Assumption A4 and Assumption A5, we can show that this rate of convergence is driven by

T 2αn 1 + 4

for T = Tn as n which should be determined so that the above expression is smallest. This means that the ﬁrst term should converge to 0 or equivalently that there exists a sequence (Kn)n such that log(Tn) = Kn and Kn log n such that

2T β n γ = log n + (2α 1)Kn

or equivalently

Tn = c log n + (2α 1)Kn 1/β , for c = (γ/2)1/β.

It is not diﬃcult to see that the optimal choice of the sequence (Kn)n is Kn = (1 a) log n for some a (0, 1) (the case a = 0 is impossible because otherwise we would have Tn = c(2α)1/β(log(Tn))1/β). This in turn yields

Tn = c a log n + 2α log(Tn) 1/β ,

implying that Tn ca1/β(log n)1/β and that rate of convergence is (log n) 1/(2β).

Proof of Theorem 3. Without loss of generality, we can assume that the support points of ϵ are all in [ 1, 1]. From Proposition 2 we know that with probability 1 there exists a solution to (8) which is bounded in the sup-norm by K0 + 2. In the sequel, we denote by bmn such a solution. Recall b Hn and H0 n from (52). Using the Cauchy-Schwarz inequality it follows that Z

b Hn(y) H0 n(y) dy = Z K0+3

b Hn(y) H0 n(y) dy

(2K0 + 6) Z

b Hn(y) H0 n(y) 2 dy 1/2 .

Since the equality in (55) holds under Assumptions A0 to A2 it follows that Z

b Hn(y) H0 n(y) dy = OP (n 1/2).

Now note that b Hn and H0 n are distribution functions with bounded support, and hence they admit a ﬁnite ﬁrst moment. Therefore, denoting by W1(F, G) the Wassersteindistance of ﬁrst order between two probability distributions with respective distribution functions F and G, it follows from Proposition 9 in Appendix C that

W1 b Hn, H0 n = OP (n 1/2).

With b Ln and Ln taken from (14), it follows from Proposition 10 in Appendix C that there exists some constant C > 0 that depends only on the distribution of ϵ and K0 such that

W1 b Ln, Ln C W1 b Hn, H0 n .

Unlinked Monotone Regression

W1 b Ln, Ln = OP (n 1/2).

Using again Proposition 9 in Appendix C the latter rate yields Z

b Ln(y) Ln(y) dy = OP (n 1/2).

Combining this with Proposition 4 and the Cauchy-Schwarz inequality, we obtain Z

b L0 n(y) L0(y) dy = Z K0+2

b L0 n(y) L0(y) dy

= OP (n 1/2).

Using the result of Lemma 4, it follows that Z 1

0 | bmn(x) m0(x)|d F0(x) Z m0(1) bmn(1)

m0(0) bmn(0) |F0 bm 1 n (y) F0 m 1 0 (y)|dy

= Z m0(1) bmn(1)

m0(0) bmn(0)

b L0 n(y) L0(y) dy

b L0 n(y) L0(y) dy

implying that R 1 0 | bmn(x) m0(x)|d F0(x) = OP (n 1/2).

Proof of Theorem 4. Proof of Claim 1. We start with the case where the noise has an absolutely continuous density. We restrict attention to the ordinary smooth case because the arguments are very similar in the supersmooth one. Now, since the proof in the ordinary smooth case follows the same lines of the proof of Theorem 1, details are omitted and the reader is referred to the latter proof. Here, we only point out the main existing diﬀerences. Similar to (14) and (52),we deﬁne

b Lnx,ny(w) := 1

i=1 1{ bmnx,ny (Xi) w} and Lnx(w) := 1

i=1 1{m0(Xi) w}

for all w R, and

b Hnx,ny(y) = 1

i=1 Φϵ(y bmnx,ny(Xi)), and H0 nx(y) = 1

i=1 Φϵ(y m0(Xi))

for all y R. With similar arguments as for the proof of (54) we obtain that for all suﬃciently large T, Z

b Lnx,ny(w) Lnx(w) 2 dw T 2β

b Hnx,ny(y) H0 nx(y) 2 dy + 4

Moreover, it follows from the deﬁnition of bmnx,ny that Z

b Hnx,ny(y) H0 nx(y) 2 dy 2 Z

b Hnx,ny(y) Hny(y) 2 dy

H0 nx(y) Hny(y) 2 dy

Balabdaoui, Doss, and Durot

which is less than or equal to

H0 nx(y) Hny(y) 2 dy 8 Z

H0 nx(y) H0(y) 2 dy + 8 Z

Hny(y) H0(y) 2 dy

E[ Hny(y) H0(y) 2] = n 1 y H0(y)(1 H0(y)),

E[ H0 nx(y) H0(y) 2] = 1 nx Var Φϵ(y m0(X))

R (Φϵ(y m0(x)) H0(y))2 d F0(x).

The integrals I1 and I2 deﬁned in (56) and (57) are ﬁnite, since it can be shown that R 0 H0(y)dy < and R 0 (1 H0(y))dy < ; see Appendix A. Hence, we obtain that Z

R E b Hnx,ny(y) H0 nx(y) 2 dy = O((nx ny) 1). (61)

Combining this with (60) proves that for all suﬃciently large T, Z

R E b Lnx,ny(w) Lnx(w) 2 dw T 2βO((nx ny) 1) + 4

For T (nx ny)1/(2β+1) we get Z

R E b Lnx,ny(w) Lnx(w) 2 dw O 1 (nx ny)1/(2β+1)

which proves an analogue of Proposition 3 in the case of possibly unequal sample sizes. Next, we consider an analogue of Proposition 4. For this task, we denote by PX nx and P X the empirical probability measure associated with X1, . . . , Xnx and the true corresponding probability measure. We consider the empirical processes

b Lnx,ny(w) b L0 nx,ny(w) = (PX nx P X)1{ bmnx,ny ( ) w}

Lnx(w) L0(w) = (PX nx P X)1{m0( ) w};

where b L0 nx,ny(w) = Z 1{ bmnx,ny (x) w}d F0(x)

and L0 is deﬁned in (15). Then, with similar arguments as in the proof of Proposition 4, we obtain that for all random variables A < B (that may depend on n) it holds that Z B

b Lnx,ny(w) b L0 nx,ny(w) 2 dw (B A)OP (1/nx) (B A)OP (1/(nx ny)),

Lnx(w) L0(w) 2 dw (B A)OP (1/nx) (B A)OP (1/(nx ny)),

Unlinked Monotone Regression

where OP (1/nx) is uniform in A and B. Moreover, if B A = OP ((nx ny)2β/(2β+1)), then Z B

b L0 nx,ny(w) L0(w) 2 dw = OP ((nx ny) 1/(2β+1)).

Next, similar to Proposition 5 we obtain that for all random variables A < B such that B A = OP (1) it holds that Z B

F0 bm 1 nx,ny(w) F0 m 1 0 (w) 2 dw = Z B

b L0 n(w) L0(w) 2 dw

= OP ((nx ny) 1/(2β+1)).

Proposition 6 still holds in the case of possibly diﬀerent sample sizes with bmn replaced by bmnx,ny If (12) also holds, then An and Bn from Proposition 6 are both of the order OP (1). In that case, the second assertion in Theorem 4 is an immediate consequence of Proposition 6 combined with the preceding display. Hence, it remains to prove that (12) holds. It follows from the deﬁnition of b Lnx,ny and Lnx together with the Cauchy-Schwarz inequality and (62) that

i=1 1{ bmnx,ny (Xi) w}

dw = Z 2 m0

b Lnx,ny(w) Lnx(w) dw

which is bounded above by s m0

b Lnx,ny(w) Lnx(w) 2 dw = o P (1).

On the other hand,

i=1 1{ bmnx,ny (Xi) w}

dw m0 n 1 x

i=1 1{ bmnx,ny (Xi)>2 m0 }

and therefore,

i=1 1{ bmnx,ny (Xi)>2 m0 } = o P (nx).

By monotonicity of bmnx,ny this implies that

1{ bmnx,ny (b)>2 m0 }

i=1 1{Xi>b}

i=1 1{ bmnx,ny (Xi)>2 m0 } = o P (nx).

By the law of large numbers, n 1 x Pnx i=1 1{Xi>b} converges in probability to 1 F0(b) > 0 and therefore, it follows from the previous display that

1{ bmnx,ny (b)>2 m0 } = o P (1).

This implies that lim n P( bmnx,ny(b) > 2 m0 ) = 0.

Balabdaoui, Doss, and Durot

One can prove similarly that

lim n P( bmnx,ny(a) < 2 m0 ) = 0.

This implies (12) by monotonicity of bmn.

Proof of Claim 2. Now, we turn to the case where ϵ is supported on a ﬁnite set of points. Without loss of generality we assume that ϵ 1 with probability 1. The same proof of Claim 4 in Proposition 2 can be again used to show existence with probability 1 of an estimator bmnx,ny which is bounded in the sup-norm by K0 + 2. Now, the rate obtained above in (61) and the Cauchy-Schwarz inequality allow us to write that Z

b Hnx,ny(y) H0 nx(y) dy = Z K0+3

b Hnx,ny(y) H0 nx(y) dy

= OP ((nx ny) 1/2).

Using the same arguments as in the proof of Theorem 3 (in particular Proposition 9 in Appendix C) this implies that W1(b Hnx,ny, H0 nx) = OP ((nx ny) 1/2). Since b Hnx,ny(y) = R

R b Lnx,ny(y t)dΦϵ(t) and H0 nx(y) = R

R Lnx(y t)dΦϵ(t), we can use again Proposition 10 in Appendix C to ﬁnd a constant D > 0 depending only on the distribution of ϵ and K0 such that

W1(b Lnx,ny, Lnx) D W1(b Hnx,ny, H0 nx).

implying that Z

b Lnx,ny(w) Lnx(w) dw = Z K0+2

b Lnx,ny(w) Lnx(w) dw = OP ((nx ny) 1/2)

using again Proposition 2 of Meis and Mammen (2020). Therefore, Z K0+2

K0 2 |b L0 nx,ny(w) L0(w)|dw = Z K0+2

K0 2 |F0 bm 1 nx,ny(w) F0 m 1 0 (w)|dw

= OP ((nx ny) 1/2)

and hence Z 1

0 | bmnx,ny(x) m0(x)|d F0(x) Z bmnx,ny (1) m0(1)

bmnx,ny (0) m0(0) |F0 bm 1 nx,ny(w) F0 m 1 0 (w)|dw

K0 2 |F0 bm 1 nx,ny(w) F0 m 1 0 (w)|dw

= OP ((nx ny) 1/2),

which completes the proof of Theorem 4.

Proof of Theorem 5. Theorem 1, Theorem 2, or Theorem 3 imply that along any subsequence of {n} n=1 we can ﬁnd a further subsequence {ni} i=1 such that with probability 1

a | bmni(x) m0(x)|d F0(x) = 0.

Unlinked Monotone Regression

Call this event E1. For notational ease let bmni bmi. Further, by Corollary 2.3 of Stein and Shakarchi (2005) (stated for Lebesgue measure but the proof does not rely on the Lebesgue measure at all and the result holds for a general measure space), there exists another subsubsequence (which we call again {ni} i=1 for convenience) such that bmi converges F0-a.e. to m0. Recall C is a compact in the interior of [a, b] S0. Then since m0 is continuous on C, bmi converges on a dense subset of [a, b] S0 to m0 (for any points α, β [a, b] S0, the F0 measure of (α, β) is given by F0(β) F0(α), so if F0(β) F0(α) > 0 then there must be a point of convergence in (α, β), since convergence is F0-a.e.), and both bmi and m0 are monotone, it follows that bmi converges pointwise on all of C to m0 (one can sandwich any point in C, including its boundary points, by sequences of points above and below at which bmi converges to m0 and appeal to monotonicity). And, we can strengthen the convergence to uniform convergence on C, since m0 and bmi are monotone and m0 is continuous on C, again, for any ω E1. The elementary proof is as follows. Fix ϵ > 0. By uniform continuity of m0 on (the compact) [a, b], there exists δ > 0 such that |m0(x) m0(y)| ϵ for all x, y such that |x y| δ. Cover C with the set of open (in C s subspace topology) sets A(x, δ) := {y C : |x y| < δ/2} for all x C, and extract by compactness a ﬁnite subcover of these open sets, A(xi, δ), for i = 1, . . . , N. Let xi1 := inf A(xi, δ) and xi2 := sup A(xi, δ). Since C is closed, xij C, j = 1, 2. By pointwise convergence bmn(xij) is within ϵ of m0(xij), j = 1, 2, for all i and n large enough. Now, take any x C; let j be such that xj1 x xj2. Using monotonicity of bmn and of m0, we have for n large enough that bmn(x) bmn(xj2) m0(xj2) + ϵ m0(x) + 2ϵ. Similarly, using xj1, we have bmn(x) m0(x) 2ϵ, which proves the uniform convergence on C. Hence, for all ω E1, for all subsequences of {ni} i=1, we can ﬁnd a further subsequence (depending on ω) along which supx C | bmi(x) m0(x)| converges to zero. Hence, for all ω E1, this supremum distance converges to zero along the subsequence {ni} i=1. Therefore,

P lim i sup x C | bmi(x) m0(x)| = 0 = 1.

Since for any subsequence of {n} n=1, we have almost sure convergence along a subsubsequence, it follows that along the original sequence {n}

sup x C | bmn(x) m0(x)| 0, in probability

which completes the proof.

E.4 Proofs for Section 4

We begin with an auxiliary lemma. Below, we use the same notation b Hn and H0 n as in (52). We recall that H0 denotes the distribution function of Y . We use the same notation P X n and P X as in the proof of Proposition 4 and denote, moreover, by EX

the expectation corresponding to P X.

Lemma 5. Let Assumptions A0 and A1 hold. Assume that ϵ is supported on [ 1; 1] and independent of X. For any solution bmn M to (8) that is bounded in the sup-

Balabdaoui, Doss, and Durot

norm by K0 + 2, we then have Z

n H0(y) EX[Φϵ(y bmn(X))] o2 dy 2 Z

n Hn(y) b Hn(y) o2 dy + OP (n 1)

= OP (n 1). (63)

Moreover, Z Z

R (Φϵ(y m0(x)) Φϵ(y bmn(x))) dyd F0(x) 2 = OP (n 1). (64)

Proof [Proof of Lemma 5.] We can write

i=1 Φϵ(y bmn(X)) EX [Φϵ(y bmn(X))] = PX n P X Φϵ(y bmn( )).

For a ﬁxed y, consider the class of non-decreasing functions n x 7 Φ(y m(x)), x [0, 1], m monotone non-decreasing and m K0 + 2 o .

Using entropy arguments as in the proof of Proposition 4 by replacing the class I with the one deﬁned above we can show that for all random variables A < B such that B A = OP (1), it holds that Z B

i=1 Φϵ(y bmn(Xi)) EX [Φϵ(y bmn(X))] o2 dy = OP (n 1).

Since ϵ is supported on [ 1, 1], and bmn is assumed to be bounded in the sup-norm by K0 + 2, the above integral over [A, B] with A = K0 3 and B = K0 + 3, is equal to the same integral over the whole real line R. Hence, we get Z

i=1 Φϵ(y bmn(Xi)) EX [Φϵ(y bmn(X))] o2 dy = OP (n 1). (65)

On the other hand, with H0 the distribution function of Y we have

R (Hn(y) H0(y))2dy = Z

R E(Hn(y) H0(y))2dy

R H0(y)(1 H0(y))dy

since n Hn(y) is a binomial random variable with parameter n and probability of success H0(y). The integral on the right hand side is ﬁnite since Y has bounded support (included in [ K0 1, K0 + 1]) and therefore, Z

R (Hn(y) H0(y))2dy = OP (n 1).

Combining this with (65) together with the fact that for all real numbers a and b, we have (a + b)2 2a2 + 2b2, we conclude that Z

n H0(y) EX [Φϵ(y bmn(X))] o2 dy

n Hn(y) n 1 n X

i=1 Φϵ(y bmn(Xi)) o2 dy + OP (n 1).

Unlinked Monotone Regression

The ﬁrst inequality follows by deﬁnition of b Hn. Now, it follows from the deﬁnition of bmn that Z

n Hn(y) b Hn(y) o2 dy Z

n Hn(y) H0 n(y) o2 dy

n Hn(y) H0(y) o2 dy + 2 Z

n H0 n(y) H0(y) o2 dy.

Since ϵ is independent of X, both n Hn(y) and n H0 n(y) are the average of n i.i.d. bounded random variables with mean H0(y) and therefore,

n Hn(y) b Hn(y) o2 dy 2n 1 Z

R V ar(1{Y y})dy +2n 1 Z

R V ar(Φϵ(y m0(X))dy.

Both integrals on the right-hand side are ﬁnite since the integrands are bounded and equal to zero for all y K0 1 and all y K0 + 1. Hence, Z

n Hn(y) b Hn(y) o2 dy = OP (n 1).

This completes the proof of (63). Now, with F0 the distribution function of X, it follows from the assumption that bmn (as well as m0) is bounded in sup-norm by K0 + 2 that Z

n H0(y) EX [Φϵ(y bmn(X))] o2 dy

n Z (Φϵ(y m0(x)) Φϵ(y bmn(x))) d F0(x) o2 dy

n Z (Φϵ(y m0(x)) Φϵ(y bmn(x))) d F0(x) o2 dy

so it follows from the Jensen inequality and the Fubini theorem that Z

n H0(y) EX [Φϵ(y bmn(X))] o2 dy

Z (Φϵ(y m0(x)) Φϵ(y bmn(x))) d F0(x)dy 2

= 1 2K0 + 6

R (Φϵ(y m0(x)) Φϵ(y bmn(x))) dyd F0(x) 2 .

Combining this with (63) yields (64) and completes the proof of the lemma.

Proof of Theorem 6. Case of error with ﬁnite support: Suppose that ϵ is supported on a ﬁnite set such that all the points of the support belong to [ 1, 1] and let k 2. For any solution bmn to (8) which is bounded by K0 +2 (which exists with probability 1 in view of Proposition 2), it follows from Theorem 3 and the identity ak bk = (a b)(ak 1+ak 2b+. . .+bk 1) that

bmk n(x) mk 0(x) d F0(x) k(K0 + 2)k 1 Z 1

0 | bmn(x) m0(x)|d F0(x)

= OP (1/ n).

Balabdaoui, Doss, and Durot

To show (20), it is enough to show that Z 1

0 bmk n(x)d(Fn(x) F0(x)) 1 n Gn bmk n (66)

is OP (1/ n). Deﬁne

Mc = {m : m non-decreasing on [0, 1] and m c},

Gk,c = {mk : m Mc},

for c > 0. If k = 1, then bmn MK0+2 and it follows from similar arguments as in the proof of Proposition 4 that there exists M > 0 depending only on K0 such that Gn MK0+2 = OP (1) which implies from (66) that (20) holds true. Now, suppose that k 2. Using the decomposition mk = mk1m 0 + mk1m<0 we see that for any m MK0+2, mk is either the sum or the diﬀerence (depending on whether k is odd or even) of two functions in M(K0+2)k. Using Proposition 8 with (λ1, λ2) = (1, 1) or (λ1, λ2) = (1, 1), it follows that for any discrete measure Q and δ > 0

log N(δ, Gk,K0+2, L2(Q)) 2 log N(δ/2, M(K0+2)k, L2(Q)).

Using (26) and similar arguments as in the proof of Proposition 4, we conclude again that Gn Gk,K0+2 = OP (1). Together with (66), it follows that (20) holds true.

Case of uniform error: Now, we turn to the case where ϵ U[ 1, 1]. To compute the integral on the left-hand side of (64), we distinguish between several cases. We recall that, because Φϵ is the distribution function of a uniformly distributed random variable over [ 1, 1], Φϵ(t) is equal to 0 if t < 1, to 1 if t > 1, and to (t + 1)/2 otherwise. For all x such that m0(x) bmn(x) m0(x) + 2 we have

Φϵ(y m0(x)) Φϵ(y bmn(x))

(y m0(x) + 1)/2 if m0(x) 1 y bmn(x) 1 ( bmn(x) m0(x))/2 if bmn(x) 1 y m0(x) + 1 1 (y bmn(x) + 1)/2 if m0(x) + 1 y bmn(x) + 1 0 otherwise .

Hence, for all x such that m0(x) bmn(x) m0(x) + 2 we have Z

R (Φϵ(y m0(x)) Φϵ(y bmn(x))) dy

2( bmn(x) m0(x))2 + 1

2( bmn(x) m0(x))(m0(x) bmn(x) + 2)

= ( bmn(x) m0(x)).

Similarly, for all x such that m0(x) + 2 < bmn(x) we have

Φϵ(y m0(x)) Φϵ(y bmn(x))

(y m0(x) + 1)/2 if m0(x) 1 y m0(x) + 1 1 if m0(x) + 1 y bmn(x) 1 1 (y bmn(x) + 1)/2 if bmn(x) 1 y bmn(x) + 1 0 otherwise,

Unlinked Monotone Regression

which implies that Z

R (Φϵ(y m0(x)) Φϵ(y bmn(x))) dy = ( bmn(x) m0(x)).

Hence, Z 1{m0(x) bmn(x)}

R (Φϵ(y m0(x)) Φϵ(y bmn(x))) dyd F0(x)

= Z 1{m0(x) bmn(x)} ( bmn(x) m0(x)) d F0(x)

Similarly, Z 1{ bmn(x) m0(x)}

R (Φϵ(y m0(x)) Φϵ(y bmn(x))) dyd F0(x)

= Z 1{ bmn(x) m0(x)} ( bmn(x) m0(x)) d F0(x)

Combining the two previous displays yields Z Z

R (Φϵ(y m0(x)) Φϵ(y bmn(x))) dyd F0(x)

= Z ( bmn(x) m0(x)) d F0(x).

Now, from (64) it follows that

0 ( bmn(x) m0(x)) d F0(x) = OP (1/ n).

As we already know that R 1 0 bmn(x)d(Fn(x) F0(x)) = OP (1/ n), the second claim of the proposition now follows.

Appendix F. Proof of Proposition 7

To make the notation less cumbersome, we write in the following Zi for the i-th order statistic X(i). Suppose that bmn takes at least two distinct values and let 1 j < j n be such that bmn is constant on [Zj, Zj ), where Zj < Zj are two successive jump points of bmn. Consider the function mδ which is right-continuous, constant between the order statistics Z1, . . . , Zn, and

mδ(Zi) = bmn(Zi) + δ, i {j, . . . , j 1} bmn(Zi), otherwise.

Then, the function mδ as deﬁned above belongs to M, provided that |δ| is small enough. It follows from the deﬁnition (8) that Mn(mδ) Mn( bmn). Using Taylor expansion of Φϵ with the integral remainder term we can write that for i {j, . . . , j 1}

Φϵ(y bmn(Zi) δ) = Φϵ(y bmn(Zi)) δfϵ(y bmn(Zi)) + Rδ,i(y)

Balabdaoui, Doss, and Durot

where the remainder term Rδ,i is given below. Hence,

0 Mn(mδ) Mn( bmn) (67)

i/ {j,...,j 1} Φϵ(y bmn(Zi)) 1

i=j Φϵ(y bmn(Zi) δ) o2 dy

i=1 Φϵ(y bmn(Xi)) o2 dy

i/ {j,...,j 1} Φϵ(y bmn(Zi))

i=j Φϵ(y bmn(Zi)) + δ 1

i=j fϵ(y bmn(Zi)) 1

i=j Rδ,i(y) o2 dy

i=1 Φϵ(y bmn(Xi)) o2 dy

which equals

i=1 Φϵ(y bmn(Xi)) + δ 1

i=j fϵ(y bmn(Zi)) 1

i=j Rδ,i(y) o2 dy

i=1 Φϵ(y bmn(Xi)) o2 dy

which equals

i=1 Φϵ(y bmn(Xi)) δ

i=j fϵ(y bmn(Zi))

i=j Rδ,i(y) dy

i=j fϵ(y bmn(Zi))

i=j Rδ,i(y) 2 dy

which equals

i=1 Φϵ(y bmn(Xi)) δ(j j)fϵ(y bmn(Zj))

i=j Rδ,i(y) dy

δ(j j)fϵ(y bmn(Zj))

i=j Rδ,i(y) 2 dy

where for i = j, . . . , j 1

Rδ,i(y) = Z y bmn(Zi) δ

y bmn(Zi) f ϵ(t) (y bmn(Zi) δ t)dt

0 f ϵ(y bmn(Zi) u) (u δ) du, letting u = y bmn(Zi) t

0 f ϵ(y bmn(Zi) δv) (v 1) dv, letting v = u/δ.

Unlinked Monotone Regression

Thus, (Mn(mδ) Mn( bmn))/δ equals

i=1 Φϵ(y bmn(Xi)) fϵ(y bmn(Zj))dy

i=1 Φϵ(y bmn(Xi)) 1

i=j Rδ,i(y)dy

δ(j j)2fϵ(y bmn(Zj))2 2(j j)fϵ(y bmn(Zj)

i=j Rδ,i(y)

Pj 1 i=j Rδ,i(y) 2

which equals

i=1 Φϵ(y bmn(Xi)) fϵ(y bmn(Zj))dy

i=1 Φϵ(y bmn(Xi))

0 f ϵ(y bmn(Zi) δv) (v 1) dv dy

R fϵ(y bmn(Zj))2dy

R fϵ(y bmn(Zj))

0 f ϵ(y bmn(Zi) δv) (v 1) dvdy

i=j f ϵ(y bmn(Zi) δv) (v 1) dv 2 dy.

We show below that each term on the right hand side that depends on δ takes the form of δi, i = 1, 2, 3 times a ﬁnite integral, so that it tends to zero as δ 0. From Assumption A6, it follows that

i=1 Φϵ(y bmn(Xi)) j 1 X

0 f ϵ(y bmn(Zi) δv) (v 1) dv

D(j j) Hn(y) 1

i=1 Φϵ(y bmn(Xi))

which can be shown to be integrable on R using the property of Φϵ in (43). Also, Assumption A6 implies that there exists D > 0 such that supt R fϵ(t) D . Then,

R fϵ(y bmn(Zj))2dy D Z

R fϵ(y bmn(Zj))dy = D ,

Balabdaoui, Doss, and Durot

and by Fubini s Theorem

0 fϵ(y bmn(Zj))|f ϵ(y bmn(Zi) δv)| (v 1) dvdy

R fϵ(y bmn(Zj))|f ϵ(y bmn(Zi) δv)|dy (v 1)dv

0 (v 1)dv = D(j j)

using Assumption A6 and the fact that fϵ is a density. Finally, using again Assumption A6 and Fubini s Theorem we have

i=j f ϵ(y bmn(Zi) δv) (v 1) dv 2 dy

i=j |f ϵ(y bmn(Zi) δv)| (1 v) dvdy

which equals

R |f ϵ(y bmn(Zi) δv)|dy

R |f ϵ(t)|dt < ,

by Assumption A6. By using (67) and distinguishing between the cases of positive and negative values of δ it follows that

0 = lim δ 0 Mn(mδ) Mn( bmn)

= 2 n(j j) Z

i=1 Φϵ(y bmn(Xi)) fϵ(y bmn(Zj))dy

and therefore,

Hn(y) n 1 n X

i=1 Φϵ(y bmn(Xi)) fϵ(y bmk)dy

where bmk = bmn(Zj) = . . . = bmn(Zj 1). This is precisely the condition given in (21). In the case bmn takes a unique value, a similar reasoning give the same result, characterizing bmk for k = 1. Now, the alternative expression in (22) follows from the fact that for any a R

R Hn(y)fϵ(y a)dy = 1

Yi fϵ(y a)dy = 1 1

i=1 Φϵ(Yi a)

which completes the proof.

Unlinked Monotone Regression

0.0 0.1 0.2 0.3 0.4 0.5

0 1 2 3 4 5 6 7

Gamma(2) Gamma(1.5) Gamma(1) Gamma(0.75) Gamma(0.5) Gamma(0.25)

Figure 3: Plots of gamma densities with varying shape parameter.

Abubakar Abid, Ada Poon, and James Zou. Linear regression with shuﬄed labels. ar Xiv, May 2017.

R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical Inference under Order Restrictions. The Theory and Application of Isotonic Regression. John Wiley, London-New York-Sydney, 1972.

Sergey Bobkov and Michel Ledoux. One-dimensional empirical measures, order statistics, and Kantorovich transport distances, volume 261, number 1259. American Mathematical Society, 2019.

H. D. Brunk. Estimation of isotonic regression. In Nonparametric Techniques in Statistical Inference (Proc. Sympos., Indiana Univ., Bloomington, Ind., 1969), pages 177 197. Cambridge Univ. Press, London, 1970.

Chris Carolan and Richard Dykstra. Asymptotic behavior of the Grenander estimator at density ﬂat regions. The Canadian Journal of Statistics, 27(3):557 566, 1999. ISSN 0319-5724.

Alexandra Carpentier and Teresa Schl uter. Learning relationships between data obtained independently. In Proceedings of the 19th International Conference on Artiﬁcial Intelligence and Statistics, pages 658 666, 2016.

Raymond J. Carroll, David Ruppert, Leonard A. Stefanski, and Ciprian M. Crainiceanu. Measurement error in nonlinear models, volume 105 of Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton, FL, second edition, 2006. ISBN 978-1-58488-633-4; 1-58488-633-1. doi: 10.1201/9781420010138. URL https://doi.org/10.1201/9781420010138. A modern perspective.

Eric Cator. Adaptivity and optimality of the monotone least-squares estimator. Bernoulli, 17(2):714 735, 2011. ISSN 1350-7265. doi: 10.3150/10-BEJ289. URL https://doi.org/10.3150/10-BEJ289.

Balabdaoui, Doss, and Durot

Sabyasachi Chatterjee, Adityanand Guntuboyina, and Bodhisattva Sen. On risk bounds in isotonic and other shape restricted regression problems. The Annals of Statistics, 43(4):1774 1800, 2015.

Itai Dattner, Alexander Goldenshluger, and Anatoli Juditsky. On deconvolution of distribution functions. The Annals of Statistics, 39(5):2477 2501, 2011.

Itai Dattner, Markus Reiß, Mathias Trabs, et al. Adaptive quantile estimation in deconvolution with unknown error distribution. Bernoulli, 22(1):143 192, 2016.

Morris H De Groot and Prem K Goel. Estimation of the correlation coeﬃcient from a broken random sample. The Annals of Statistics, 8(2):264 278, March 1980.

A Delaigle and I Gijbels. Bootstrap bandwidth selection in kernel density estimation from a contaminated sample. Ann. Inst. Statist. Math., 56(1):19 47, 2004.

C ecile Durot. Sharp asymptotics for isotonic regression. Probab. Theory Related Fields, 122(2):222 240, 2002. ISSN 0178-8051. doi: 10.1007/s004400100171. URL https: //doi.org/10.1007/s004400100171.

C ecile Durot. On the Lp-error of monotonicity constrained estimators. Ann. Statist., 35(3):1080 1104, 2007. ISSN 0090-5364. doi: 10.1214/009053606000001497. URL https://doi.org/10.1214/009053606000001497.

C ecile Durot and Hendrik P. Lopuha a. A Kiefer-Wolfowitz type of result in a general setting, with an application to smooth monotone estimation. Electron. J. Stat., 8(2):2479 2513, 2014. ISSN 1935-7524. doi: 10.1214/14-EJS958. URL https: //doi.org/10.1214/14-EJS958.

C ecile Durot, Vladimir N. Kulikov, and Hendrik P. Lopuha a. The limit distribution of the L -error of Grenander-type estimators. Ann. Statist., 40(3):1578 1608, 2012. ISSN 0090-5364. doi: 10.1214/12-AOS1015. URL https://doi.org/10. 1214/12-AOS1015.

David Edelman. Estimation of the mixing distribution for a normal mean with applications to the compound decision problem. Ann. Statist., 16(4):1609 1622, 1988. ISSN 0090-5364. doi: 10.1214/aos/1176351056. URL https://doi.org/10.1214/ aos/1176351056.

Bob E Ellison. Two theorems for inferences about the normal distribution with applications in acceptance sampling. Journal of the American Statistical Association, 59 (305):89 95, 1964. doi: 10.2307/2282860.

Jianqing Fan. On the optimal rates of convergence for nonparametric deconvolution problems. Ann. Statist., 19(3):1257 1272, 1991. ISSN 0090-5364. doi: 10.1214/aos/ 1176348248. URL https://doi.org/10.1214/aos/1176348248.

R Fletcher. Practical methods of optimization. A Wiley-Interscience Publication. John Wiley & Sons, Ltd., Chichester, second edition, 1987.

P. Groeneboom. Estimating a monotone density. In Proceedings of the Berkeley conference in honor of Jerzy Neyman and Jack Kiefer, Vol. II (Berkeley, Calif., 1983), Wadsworth Statist./Probab. Ser., pages 539 555. Wadsworth, Belmont, CA, 1985.

Unlinked Monotone Regression

Piet Groeneboom. The concave majorant of Brownian motion. Ann. Probab., 11(4): 1016 1027, 1983.

Piet Groeneboom. Brownian motion with a parabolic drift and Airy functions. Probab. Theory Related Fields, 81(1):79 109, 1989. ISSN 0178-8051. doi: 10.1007/ BF00343738. URL https://doi.org/10.1007/BF00343738.

Piet Groeneboom and Geurt Jongbloed. Nonparametric estimation under shape constraints, volume 38 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, New York, Cambridge, 2014.

Peter Hall and Soumendra N Lahiri. Estimation of distributions, moments and quantiles in deconvolution problems. The Annals of Statistics, 36(5):2110 2134, 2008.

Thomas N Herzog, Fritz J Scheuren, and William E Winkler. Data quality and record linkage techniques. Springer Science & Business Media, 2007.

Enno Mammen. Estimating a smooth monotone regression function. Ann. Statist., 19(2):724 740, 1991. ISSN 0090-5364. doi: 10.1214/aos/1176348117. URL https: //doi.org/10.1214/aos/1176348117.

Jan Meis and Enno Mammen. Uncoupled isotonic regression with discrete errors. Personal communication, 2020.

Jorge Nocedal and Stephen J Wright. Numerical Optimization. Springer Series in Operations Research. Springer-Verlag, New York, New York, 1999.

Ashwin Pananjady, Martin J Wainwright, and Thomas A Courtade. Linear regression with an unknown permutation: Statistical and computational limits. In Proceedings of the 2016 54th Annual Allerton Conference on Communication, Control, and Computing, pages 417 424. IEEE, 2016.

Ashwin Pananjady, Martin J Wainwright, and Thomas A Courtade. Denoising linear models with permuted data. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 446 450. IEEE, 2017.

Ashwin Pananjady, Martin J Wainwright, and Thomas A Courtade. Linear regression with shuﬄed data: statistical and computational limits of permutation recovery. IEEE Trans. Inform. Theory, 64(5):3286 3300, 2018.

Philippe Rigollet and Jonathan Weed. Uncoupled isotonic regression via minimum wasserstein deconvolution. Information and Inference (to appear), 2019.

Tim Robertson, F T Wright, and R L Dykstra. Order restricted statistical inference. Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical Statistics. John Wiley & Sons, Ltd., Chichester, 1988.

Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas, and Matthew Sobek. IPUMS, USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020.

Elias M Stein and Rami Shakarchi. Real analysis, volume 3 of Princeton Lectures in Analysis. Princeton University Press, Princeton, NJ, 2005.

Balabdaoui, Doss, and Durot

Jayakrishnan Unnikrishnan, Saeid Haghighatshoar, and Martin Vetterli. Unlabeled sensing with random linear measurements. IEEE Trans. Inform. Theory, 64(5): 3237 3253, 2018.

Aad W. van der Vaart and Jon A. Wellner. Weak convergence and empirical processes. Springer Series in Statistics. Springer-Verlag, 1996.

Cun-Hui Zhang. Risk bounds in isotonic regression. The Annals of Statistics, 30(2): 528 555, 2002.