# Shuffled Deep Regression

Masahiro Kohjima
NTT Human Informatics Laboratories, NTT Corporation
1-1 Hikarinooka, Yokosuka, Kanagawa 239-0847, Japan
masahiro.kohjima.ev@hco.ntt.co.jp

## Abstract

Shuffled regression is the problem of learning regression models from shuffled data, which consist of a set of input features and a set of target outputs whose correspondence is unknown. This study proposes a new deep learning method for shuffled regression called Shuffled Deep Regression (SDR). We derive a sparse and stochastic variant of the expectation-maximization algorithm for SDR that iteratively updates discrete latent variables and the parameters of neural networks. The effectiveness of the proposal is confirmed by experiments on benchmark data.

## Introduction

Due to the difficulties of comprehensive data collection, such as privacy-aware or non-centralized data collection, the data to be analyzed are often given in the form of shuffled data, where the correspondence between inputs (feature vectors) and outputs (target values) is unknown (Carpentier and Schlüter 2016; Hsu, Shi, and Sun 2017; Pananjady, Wainwright, and Courtade 2017; Abid and Zou 2018). Since the true target value for each input is unknown, shuffled data can be collected while protecting privacy even when the outputs represent sensitive information such as annual income. However, to learn regression models from shuffled data, standard regression methods cannot be applied, since they require paired data whose correspondence between input and output is known. Shuffled regression (SR) is the problem of estimating a model from shuffled data and arises in domains such as data integration, computer vision, and sensor networks; see, e.g., (Pananjady, Wainwright, and Courtade 2017).

In the SR literature, existing studies focus on learning linear models (Carpentier and Schlüter 2016; Hsu, Shi, and Sun 2017; Abid and Zou 2018; Fang and Li 2023), and the learning of deep models has not been well investigated. Considering the known benefits of deep learning (Goodfellow, Bengio, and Courville 2016), the use of deep models should contribute to improving the performance on SR.

This study proposes a new deep learning-based method for shuffled regression called Shuffled Deep Regression (SDR). We derive a sparse stochastic expectation-maximization (SSEM) algorithm for parameter estimation that iteratively updates latent variables and the parameters of neural networks, based on the EM algorithm (Dempster, Laird, and Rubin 1977). SSEM adopts sparse EM (Neal and Hinton 1998) and online EM (Cappé and Moulines 2009)-like approaches to naturally handle the huge space of discrete latent variables, and it supports combination with stochastic gradient algorithms (such as Adam) to update neural network parameters. We also show that SSEM is related to iteratively reweighted least squares (IRLS) (Rubin 2004; Bishop 2006). Experiments conducted on benchmark datasets confirm the effectiveness of the proposals. The contributions of this paper are summarized below:

- We propose a new deep learning-based method for shuffled regression called shuffled deep regression (SDR).
- We develop a sparse and stochastic variant of the EM algorithm (SSEM) that can naturally handle the huge space of discrete latent variables and supports the use of stochastic gradient-based algorithms.
- We confirm the effectiveness of the proposed method by numerical experiments on benchmark data.

## Related Work

Shuffled regression (SR) (Pananjady, Wainwright, and Courtade 2017; Abid and Zou 2018), which is also called regression without correspondence (Hsu, Shi, and Sun 2017), uncoupled regression (Xu et al. 2019), and unlabeled sensing (Unnikrishnan, Haghighatshoar, and Vetterli 2018), arises in, e.g., pose and correspondence estimation (David et al. 2004), flow cytometry for measuring chemical characteristics of cells (Abid and Zou 2018), and linkage of health records (Shi, Li, and Cai 2021). SR has therefore been actively studied both theoretically and in applications. Theoretical studies have clarified various difficulties of SR. For instance, recovering the correspondence between input and output is NP-hard in general (Pananjady, Wainwright, and Courtade 2017). (Unnikrishnan, Haghighatshoar, and Vetterli 2018) also explored the conditions for obtaining unique solutions. Developing parameter estimation algorithms is a notable topic as well; (Abid and Zou 2018) developed an expectation-maximization (EM) based algorithm similar to our study, while (Slawski, Rahmani, and Li 2020) elucidated the connection to the robust subspace recovery problem. However, these existing studies mainly focus on simple models with some specific structure, such as the monotone model (Carpentier and Schlüter 2016) or the linear model (Pananjady, Wainwright, and Courtade 2017; Abid and Zou 2018), and their validity is limited to cases that suit such simple models. Our motivation is to apply deep models and to develop an applicable estimation algorithm.

Figure 1: (a) Paired data and (b) shuffled data. Shuffled regression (SR) is the problem of estimating a (true) unknown function from shuffled data.

Research directed towards practical settings has considered problems where richer data (to the extent practical) are available (Xu et al. 2019; Kohjima et al. 2022; Yamane et al. 2023). For example, (Xu et al. 2019; Kohjima et al. 2022) use pairwise ranking data, while (Yamane et al. 2023) assume the availability of some intermediate variables that weakly connect inputs and outputs. In this context, the use of more complex models such as a Gaussian process model (Kohjima et al. 2022) and a deep learning-based method (Yamane et al. 2023) has been investigated. However, these methods cannot be applied if ranking data or intermediate variables are not available. Our study considers the learning of deep models without using such additional data or variables.

## Preliminary: Regression with Paired Data

Let $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} \subseteq \mathbb{R}$ be an input feature space and an output space, respectively. We call the set of $n$ pairs of input and output, $\mathcal{D}_{\mathrm{PD}} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, paired data, where $x^{(i)} \in \mathcal{X}$ is an input feature and $y^{(i)} \in \mathcal{Y}$ is a target output. See Fig. 1a. In the (standard) regression problem, given (i) a regression model $h_\theta : \mathcal{X} \to \mathcal{Y}$ parametrized by $\theta$, (ii) an output probability distribution $f$, and (iii) paired data $\mathcal{D}_{\mathrm{PD}}$, the model parameter $\theta$ is estimated by maximizing the log-likelihood. Possible choices for the regression model $h_\theta$ include linear models and deep neural network models such as the (one-hidden-layer) multi-layer perceptron (MLP) shown in Table 1. Although the main scope of this study lies in the variety of deep models, including MLPs whose parameters are updated by stochastic algorithms such as SGD or Adam, our algorithm can handle any type of regression model.
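To make this setup concrete, the following sketch (our own illustrative PyTorch code, not taken from the paper; the class and function names are hypothetical) instantiates the two model families of Table 1 and fits them to paired data by maximum likelihood, which under a Gaussian output distribution $f$ reduces to minimizing the squared error.

```python
import torch
import torch.nn as nn

class LinearModel(nn.Module):
    """Linear model from Table 1: h_theta(x) = Wx + b."""
    def __init__(self, d: int):
        super().__init__()
        self.fc = nn.Linear(d, 1)

    def forward(self, x):
        return self.fc(x).squeeze(-1)

class MLP(nn.Module):
    """One-hidden-layer MLP from Table 1: h_theta(x) = W2 act(W1 x + b1) + b2."""
    def __init__(self, d: int, n_hidden: int = 20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, n_hidden), nn.ReLU(), nn.Linear(n_hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def fit_paired(model, x, y, n_steps=1000, lr=1e-3):
    """Maximum-likelihood fit on paired data; with a Gaussian output
    distribution f this is ordinary squared-error minimization."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = ((model(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return model
```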
| model | parameter $\theta$ | form |
|---|---|---|
| Linear | $W, b$ | $h_\theta(x) = Wx + b$ |
| MLP | $\{W_\ell, b_\ell\}_{\ell=1}^{2}$ | $h_\theta(x) = W_2\,\mathrm{act}(W_1 x + b_1) + b_2$ |

Table 1: Examples of regression model $h_\theta$.

For the output probability distribution $f$, the most standard choice is the normal distribution,
$$\mathcal{N}(y \mid c, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y-c)^2}{2\sigma^2} \right\}.$$
It is well known that the use of the normal distribution makes the log-likelihood function correspond to the squared-error loss. Since the Gaussian may not be appropriate for handling binary/non-negative/integer target outputs, the exponential family distribution is the more general choice, since it can express various types of distributions:
$$f(y \mid \eta) = p_0(y) \exp\{\eta\, T(y) - A(\eta)\}, \qquad (1)$$
where $\eta$ is the natural parameter, $T(y)$ is the sufficient statistic, and $A(\eta)$ is the log-normalizer. The exponential family includes the Normal, Bernoulli (Ber), and Poisson (Poi) distributions:
$$\mathrm{Ber}(y \mid r) = r^{y}(1-r)^{1-y}, \qquad \mathrm{Poi}(y \mid \lambda) = \lambda^{y}\exp(-\lambda)/y!.$$
By assigning specific values to $T(y)$ and $A(\eta)$, the above distributions are represented by Eq. (1) (see Table 2). Our notation using $f$ makes it transparent which output distribution is adopted.

| dist. | $\eta$ | $p_0(y)$ | $T(y)$ | $A(\eta)$ |
|---|---|---|---|---|
| Normal | $c/\sigma$ | $e^{-y^2/(2\sigma^2)}/\sqrt{2\pi\sigma^2}$ | $y/\sigma$ | $\eta^2/2$ |
| Bernoulli | $\log\left(\frac{r}{1-r}\right)$ | $1$ | $y$ | $\log(1+e^{\eta})$ |
| Poisson | $\log(\lambda)$ | $1/y!$ | $y$ | $\exp(\eta)$ |

Table 2: Examples of exponential family distribution $f$.

## Proposed Method: Shuffled Deep Regression

This section describes the method we propose for estimating deep regression models from shuffled data.

### Shuffled Data

Shuffled data are data in which the correspondence between input features and target outputs is unknown. Let $x_{1:K}$ and $y_{1:K}$ be a set of $K$ input features, $x_{1:K} = \{x_1, x_2, \dots, x_K\}$, and a set of $K$ target outputs, $y_{1:K} = \{y_1, y_2, \dots, y_K\}$, respectively ($x_k \in \mathcal{X}$, $y_k \in \mathcal{Y}$). Shuffled data $\mathcal{D}_{\mathrm{SD}}$ are defined as a pair of an input set and an output set, $\mathcal{D}_{\mathrm{SD}} = \{(x^{(i)}_{1:K}, y^{(i)}_{1:K})\}_{i=1}^{n}$.¹ See Fig. 1b. Unlike paired data, the one-to-one correspondence between input and output is lost in shuffled data, so the output corresponding to feature $x^{(i)}_k$ is one of $\{y^{(i)}_1, \dots, y^{(i)}_K\}$, but it is not known which one. Since shuffled data reduce to paired data when $K=1$, they can be regarded as a more general representation of paired data. We denote the matrix representation of $x_{1:K}$ by a bold capital letter, $\mathbf{X}_{1:K} \in \mathbb{R}^{K \times d}$. Table 3 summarizes our notation.

The shuffled regression (SR) problem tackled by this study is to estimate the parameters of the regression model given (i) a regression model $h_\theta : \mathcal{X} \to \mathcal{Y}$ parametrized by $\theta$, (ii) an output probability distribution $f$, and (iii) shuffled data $\mathcal{D}_{\mathrm{SD}}$. We emphasize that paired data $\mathcal{D}_{\mathrm{PD}}$ are not given.

¹This setup could easily be extended to an adaptive-$K$ setup ($K$ differs for each sample $i$), but that is omitted for simplicity.

| Symbol | Description |
|---|---|
| $n$ | size of shuffled data |
| $d$ | dimension of input features |
| $K$ | size of a set of input features/target outputs |
| $\mathcal{D}_{\mathrm{SD}}$ | shuffled data $\mathcal{D}_{\mathrm{SD}} = \{(x^{(i)}_{1:K}, y^{(i)}_{1:K})\}_{i=1}^{n}$ |
| $x^{(i)}_{1:K}$ | $i$-th set of input features $\{x^{(i)}_1, \dots, x^{(i)}_K\}$ |
| $y^{(i)}_{1:K}$ | $i$-th set of target outputs $\{y^{(i)}_1, \dots, y^{(i)}_K\}$ |
| $\mathbf{X}^{(i)}_{1:K}$ | $(K \times d)$ matrix representation of $x^{(i)}_{1:K}$ |
| $\theta$ | parameter of the regression model (see Table 1) |
| $h_\theta$ | regression model (see Table 1) |
| $f$ | output probability distribution (see Table 2) |
| $g$ | link function (see Table 4) |
| $\Pi$ | set of all $K \times K$ permutation matrices |
| $P^{(i)}$ | (latent) permutation matrix for the $i$-th data |
| $\eta^{(i)}_{1:K}$ | set of parameters of $f$, $\eta^{(i)}_{1:K} = (\eta^{(i)}_1, \dots, \eta^{(i)}_K)$ |
| $\eta^{(i)}_k$ | output of link func., $\eta^{(i)}_k = g(h_\theta([P^{(i)}\mathbf{X}^{(i)}_{1:K}]_k))$ |
| $\tilde{\Pi}^{(i)}$ | set of $M$ permutation matrices $\{\tilde{P}^{(i,m)}\}_{m=1}^{M}$ |
| $\gamma^{(i,m)}$ | responsibility of the $m$-th perm. matrix (Eq. (11)) |
| $\tilde{x}^{(i,m)}_k$ | shuffled input $\tilde{x}^{(i,m)}_k = [\tilde{P}^{(i,m)}\mathbf{X}^{(i)}_{1:K}]_k$ |
| $\tilde{\eta}^{(i,m)}_k$ | output of link func., $\tilde{\eta}^{(i,m)}_k = g(h_\theta(\tilde{x}^{(i,m)}_k))$ |
| $\check{z}^{(t)}$ | latent variable sampled from a categorical dist. |
| $\check{x}^{(t)}_k$ | shuffled input $\check{x}^{(t)}_k = [\tilde{P}^{(t,\check{z}^{(t)})}\mathbf{X}^{(t)}_{1:K}]_k$ |
| $\check{\eta}^{(t)}_k$ | output of link func., $\check{\eta}^{(t)}_k = g(h_\theta(\check{x}^{(t)}_k))$ |

Table 3: Notation summary for the paper.
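As an illustration of this data format, the following sketch (again our own illustrative code; the function name is hypothetical) simulates shuffled data $\mathcal{D}_{\mathrm{SD}}$ from paired data by splitting it into $n$ sets of size $K$ and applying an independent permutation to the outputs of each set, so that the within-set input-output correspondence is lost.

```python
import torch

def make_shuffled_data(x, y, K, seed=0):
    """Simulate shuffled data D_SD from paired data (x: [N, d], y: [N]):
    split into n = N // K sets of size K and permute the outputs within
    each set, destroying the input-output correspondence."""
    gen = torch.Generator().manual_seed(seed)
    n = x.shape[0] // K                          # leftover samples are dropped
    x_sets = x[: n * K].reshape(n, K, -1).clone()
    y_sets = y[: n * K].reshape(n, K).clone()
    for i in range(n):
        perm = torch.randperm(K, generator=gen)  # latent permutation P^(i)
        y_sets[i] = y_sets[i][perm]
    return x_sets, y_sets
```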
Note that early works on SR such as (Hsu, Shi, and Sun 2017; Pananjady, Wainwright, and Courtade 2017; Abid and Zou 2018) consider the setting where the data size is $n=1$. In practical data collection scenarios such as survey data collection, data are often collected multiple times from different questionnaire respondents; even if the correspondence between input and output is obscured to protect privacy, we can know in which data collection event the data were gathered, and the data are then represented as shuffled data with $n>1$.

### Generative Process

The generative process of shuffled data $\mathcal{D}_{\mathrm{SD}}$ is described by the following three steps. For each $i = 1, \dots, n$: (Step 1) a set of input features $x^{(i)}_{1:K}$ is drawn from some probability distribution $p(x_{1:K})$. (Step 2) A latent (unknown) $K \times K$ permutation matrix $P^{(i)}$, used for shuffling the order of the inputs (the rows of $\mathbf{X}_{1:K}$), is generated from the uniform distribution $u(P)$.² Then, in (Step 3), a set of outputs $y^{(i)}_{1:K}$ is generated from
$$p(y_{1:K} \mid P, x_{1:K}, \theta) = \prod_{k=1}^{K} f(y_k \mid \eta_k), \qquad \eta_k = g\big(h_\theta([P\mathbf{X}_{1:K}]_k)\big), \qquad (2)$$
where $[P\mathbf{X}_{1:K}]_k$ denotes the $k$-th row vector of $P\mathbf{X}_{1:K}$, and $g$ is the link function that connects the output of the regression model and the (natural) parameter of the output distribution $f$, as in the generalized linear model (Nelder and Wedderburn 1972; Fang and Li 2023) (see Table 4 for typical choices). The use of link functions can also be found in deep models (Ranganath et al. 2015).

²A permutation matrix $P$ is a doubly stochastic matrix whose elements are 0 or 1, i.e., $\sum_{k=1}^{K} [P]_{k\ell} = 1\ (\forall \ell)$, $\sum_{\ell=1}^{K} [P]_{k\ell} = 1\ (\forall k)$, and $[P]_{k\ell} \in \{0, 1\}\ (\forall k, \ell)$. Since the size of the set of all $K \times K$ permutation matrices is $K!$, $u(P) = 1/K!$ for all $P \in \Pi$.

| typical use | dist. $f$ | link func. $g(\hat{y})$ |
|---|---|---|
| Linear regression | Normal | $\hat{y}$ |
| Logistic regression | Bernoulli | $1/\{1+\exp(-\hat{y})\}$ |
| Poisson regression | Poisson | $\exp(\hat{y})$ |

Table 4: Typical use of distributions and (inverses of) link functions, based on the generalized linear model.

Figure 2: Graphical model representation of shuffled data. Shaded nodes represent observed variables.

Summarizing the above, the joint probability of the shuffled data $\mathcal{D}_{\mathrm{SD}}$ and the set of permutation matrices $\mathcal{P} = \{P^{(i)}\}_{i=1}^{n}$ given parameters $\theta$ is
$$p(\mathcal{D}_{\mathrm{SD}}, \mathcal{P} \mid \theta) = \prod_{i=1}^{n} p(y^{(i)}_{1:K} \mid P^{(i)}, x^{(i)}_{1:K}, \theta)\, p(x^{(i)}_{1:K})\, u(P^{(i)}).$$
Figure 2 shows the graphical model representation. Note that we denote the set of natural parameters as $\eta^{(i)}_{1:K} = (\eta^{(i)}_1, \dots, \eta^{(i)}_K)$ with $\eta^{(i)}_k = g(h_\theta([P^{(i)}\mathbf{X}^{(i)}_{1:K}]_k))$. Marginalizing out $\mathcal{P}$, we get the following likelihood of the shuffled data $\mathcal{D}_{\mathrm{SD}}$:
$$p(\mathcal{D}_{\mathrm{SD}} \mid \theta) = \sum_{P^{(1)}, \dots, P^{(n)} \in \Pi} p(\mathcal{D}_{\mathrm{SD}}, \mathcal{P} \mid \theta), \qquad (3)$$
where $\Pi$ is the set of all $K \times K$ permutation matrices. The size of the set, $|\Pi|$, is $K!$; later we show how to deal with this huge set. The parameter $\theta$ is estimated by maximizing the following log-likelihood function:
$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \log p(\mathcal{D}_{\mathrm{SD}} \mid \theta). \qquad (4)$$
Note that the order of applying the regression model $h_\theta$ (with link function $g$) and the permutation by $P$ is exchangeable; the identical likelihood function is derived regardless of the order, since $g(h_\theta([P\mathbf{X}_{1:K}]_k)) = [P\,(g(h_\theta(x_1)), \dots, g(h_\theta(x_K)))^{\top}]_k$ holds.
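For intuition about Eqs. (3)-(4), the following sketch (illustrative only; it assumes a Gaussian output distribution with identity link and unit variance, and a model mapping a [K, d] tensor to K predictions) evaluates the contribution of a single set to the marginal log-likelihood by enumerating all $K!$ permutations. This is feasible only for small $K$ and motivates the approximations developed next.

```python
import math
from itertools import permutations
import torch

def exact_loglik_one_set(model, x_set, y_set, sigma=1.0):
    """Brute-force log p(y_{1:K} | x_{1:K}, theta) for one set:
    log (1/K!) * sum_{P in Pi} prod_k N(y_k | h_theta(x_{P(k)}), sigma^2).
    Enumerates all K! permutations, so it is feasible only for small K."""
    K = x_set.shape[0]
    preds = model(x_set).reshape(-1)          # identity link: eta_k = h_theta(x_k)
    log_terms = []
    for perm in permutations(range(K)):       # one term per permutation matrix P
        resid = y_set - preds[list(perm)]
        log_terms.append(
            (-0.5 * resid ** 2 / sigma ** 2
             - 0.5 * math.log(2 * math.pi * sigma ** 2)).sum())
    return (torch.logsumexp(torch.stack(log_terms), dim=0)
            - math.log(math.factorial(K)))
```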
### Lower Bound of Objective Function

To maximize the objective function of Eq. (4), we use the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin 1977). EM indirectly maximizes the objective by using the following lower-bound function $\mathcal{L}$:
$$\mathcal{L}(q, \theta) = \mathbb{E}_{q(\mathcal{P})}\big[\log p(\mathcal{D}_{\mathrm{SD}}, \mathcal{P} \mid \theta) - \log q(\mathcal{P})\big], \qquad (5)$$
where $\mathbb{E}_{q(\mathcal{P})}$ denotes the expectation w.r.t. $q(\mathcal{P})$. By Jensen's inequality, $\mathcal{L}$ is shown to be a lower bound of the log-likelihood:
$$\log p(\mathcal{D}_{\mathrm{SD}} \mid \theta) = \log \mathbb{E}_{q(\mathcal{P})}\left[\frac{p(\mathcal{D}_{\mathrm{SD}}, \mathcal{P} \mid \theta)}{q(\mathcal{P})}\right] \ge \mathbb{E}_{q(\mathcal{P})}\left[\log \frac{p(\mathcal{D}_{\mathrm{SD}}, \mathcal{P} \mid \theta)}{q(\mathcal{P})}\right] = \mathcal{L}(q, \theta).$$
The lower bound $\mathcal{L}$ can be expanded to yield the following form:
$$\mathcal{L}(q, \theta) = \log p(\mathcal{D}_{\mathrm{SD}} \mid \theta) - \mathrm{KL}\big(q(\mathcal{P})\,\|\,p(\mathcal{P} \mid \mathcal{D}_{\mathrm{SD}}, \theta)\big), \qquad (6)$$
where $\mathrm{KL}(Q\,\|\,P)$ is the KL divergence between $Q$ and $P$.

### Difficulties Posed by the EM Algorithm in Handling Shuffled Data

The EM algorithm is the following two-step procedure that maximizes the log-likelihood $\log p(\mathcal{D}_{\mathrm{SD}} \mid \theta)$: (E-step) maximize $\mathcal{L}$ w.r.t. $q(\mathcal{P})$, and (M-step) maximize $\mathcal{L}$ w.r.t. $\theta$. For the E-step, from Eq. (6), $\mathcal{L}$ is maximized when $q$ equals the posterior probability of $\mathcal{P}$:
$$\text{(Exact E-step)}: \quad \hat{q}_{\mathrm{Ex}}(\mathcal{P}) = p(\mathcal{P} \mid \mathcal{D}_{\mathrm{SD}}, \theta), \qquad (7)$$
where the posterior decomposes as $p(\mathcal{P} \mid \mathcal{D}_{\mathrm{SD}}, \theta) = \prod_{i=1}^{n} p(P^{(i)} \mid \mathcal{D}_{\mathrm{SD}}, \theta)$ with
$$p(P^{(i)} \mid \mathcal{D}_{\mathrm{SD}}, \theta) = \frac{\prod_{k=1}^{K} f\big(y^{(i)}_k \mid \eta^{(i)}_k\big)}{\sum_{P \in \Pi} \prod_{k=1}^{K} f\big(y^{(i)}_k \mid g(h_\theta([P\mathbf{X}^{(i)}_{1:K}]_k))\big)}. \qquad (8)$$
However, the above exact E-step is infeasible for general $K$, since $\hat{q}_{\mathrm{Ex}}$ is a probability distribution over a $|\Pi|\,(= K!)$-dimensional probability vector. To avoid this difficulty, (Abid and Zou 2018) adopt the following winner-takes-all (WTA) variant of the E-step (for a linear model and normal distribution):
$$\text{(WTA E-step)}: \quad \hat{q}_{\mathrm{WTA}}(\mathcal{P}) = \mathbb{I}(\mathcal{P} = \hat{\mathcal{P}}), \quad \hat{\mathcal{P}} = \arg\max_{\mathcal{P}} p(\mathcal{P} \mid \mathcal{D}_{\mathrm{SD}}, \theta). \qquad (9)$$
This WTA variant has a computational advantage in general, but local convergence (monotonic improvement of the objective function) is not assured (Neal and Hinton 1998). Accordingly, we adopt the approach of the sparse EM algorithm (Neal and Hinton 1998). Sparse EM falls between exact EM and WTA EM and balances computation cost against convergence performance well.

### Sparse EM Algorithm for Shuffled Regression

Let $\tilde{\Pi}^{(i)} = \{\tilde{P}^{(i,1)}, \dots, \tilde{P}^{(i,m)}, \dots, \tilde{P}^{(i,M)}\}$ be a set of $M\ (\ll K!)$ candidates for the permutation matrix. Details on how to construct $\tilde{\Pi}^{(i)}$ are given at the end of this section. Sparse EM (Neal and Hinton 1998) considers the following E-step under the restriction that the probability of any permutation matrix not contained in $\tilde{\Pi}^{(i)}$ is zero:
$$\text{(Sparse E-step)}: \quad \hat{q}_{\mathrm{Sp}}(P^{(i)}) = \begin{cases} \gamma^{(i,m)} & \text{if } P^{(i)} = \tilde{P}^{(i,m)}, \\ 0 & \text{otherwise,} \end{cases} \qquad (10)$$
where $\gamma^{(i,m)}$, defined below, can be seen as the responsibility of the $m$-th permutation matrix for the $i$-th data:
$$\gamma^{(i,m)} = \frac{\prod_{k=1}^{K} f\big(y^{(i)}_k \mid \tilde{\eta}^{(i,m)}_k\big)}{\sum_{m'=1}^{M} \prod_{k=1}^{K} f\big(y^{(i)}_k \mid \tilde{\eta}^{(i,m')}_k\big)}, \quad \tilde{\eta}^{(i,m)}_k = g\big(h_\theta(\tilde{x}^{(i,m)}_k)\big), \quad \tilde{x}^{(i,m)}_k = [\tilde{P}^{(i,m)}\mathbf{X}^{(i)}_{1:K}]_k. \qquad (11)$$
Thus this E-step avoids the need to compute the summation over $\Pi$. For the M-step, from Eq. (5), the parameter $\theta$ that maximizes $\mathcal{L}$ is given by
$$\text{(Exact M-step)}: \quad \hat{\theta} = \arg\max_{\theta} \mathbb{E}_{q(\mathcal{P})}\big[\log p(\mathcal{D}_{\mathrm{SD}}, \mathcal{P} \mid \theta)\big].$$
When the sparse E-step is used, $q$ takes the form of the sparse distribution $\hat{q}_{\mathrm{Sp}}$ with many zeros (Eq. (10)). Thus the sparse variant of the M-step given below also avoids the summation over $\Pi$:
$$\text{(Sparse M-step)}: \quad \hat{\theta} = \arg\max_{\theta} \mathbb{E}_{\hat{q}_{\mathrm{Sp}}(\mathcal{P})}\big[\log p(\mathcal{D}_{\mathrm{SD}}, \mathcal{P} \mid \theta)\big],$$
where the objective of the sparse M-step can be expanded as
$$\mathbb{E}_{\hat{q}_{\mathrm{Sp}}}\big[\log p(\mathcal{D}_{\mathrm{SD}}, \mathcal{P} \mid \theta)\big] = \sum_{i=1}^{n} \sum_{m=1}^{M} \sum_{k=1}^{K} \gamma^{(i,m)} \log f\big(y^{(i)}_k \mid \tilde{\eta}^{(i,m)}_k\big) + \mathrm{const.} \qquad (12)$$
This objective function can be interpreted as the objective of a weighted regression problem using paired data with input $\tilde{x}^{(i,m)}_k$, output $y^{(i)}_k$, and sample weight $\gamma^{(i,m)}$. Therefore, if the distribution $f$ is Gaussian and linear models are adopted, the sparse EM algorithm that iteratively applies the sparse E-step and M-step can be seen as iteratively reweighted least squares (IRLS) (Rubin 2004; Bishop 2006). The use of numerical optimization techniques is allowed in the M-step: if the objective value (Eq. (12)) is improved (even if the maximum is not strictly attained), a monotone increase of the likelihood function is assured (while $\tilde{\Pi}^{(i)}$ is fixed), similar to the generalized EM algorithm (Dempster, Laird, and Rubin 1977; Neal and Hinton 1998). At infrequent intervals, $\tilde{\Pi}^{(i)}$ should be updated to reflect the change in the other parameters.
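A minimal sketch of the sparse E-step and the weighted M-step objective, assuming a Gaussian output distribution with identity link and representing each candidate permutation matrix by a length-$K$ index vector (the function names and this representation are ours, not from the paper):

```python
import torch

def responsibilities(model, x_set, y_set, cand_perms, sigma=1.0):
    """Sparse E-step (Eqs. (10)-(11)): responsibilities gamma^(i,m) over the M
    candidate permutations in cand_perms (list of length-K index tensors).
    Permutation-independent normalization constants cancel in the ratio."""
    with torch.no_grad():
        log_w = torch.stack([
            (-0.5 * (y_set - model(x_set[p]).reshape(-1)) ** 2 / sigma ** 2).sum()
            for p in cand_perms])
    return torch.softmax(log_w, dim=0)

def sparse_m_step_loss(model, x_set, y_set, cand_perms, gamma, sigma=1.0):
    """Negative sparse M-step objective (Eq. (12)) for one set: a weighted
    squared-error loss with sample weights gamma^(i,m)."""
    loss = 0.0
    for p, g in zip(cand_perms, gamma):
        resid = y_set - model(x_set[p]).reshape(-1)
        loss = loss + g * 0.5 * (resid ** 2 / sigma ** 2).sum()
    return loss
```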
### Sparse Stochastic EM for Shuffled Deep Regression

Here we develop the proposed sparse and stochastic variant of the EM algorithm (SSEM). Combined with a stochastic gradient algorithm such as SGD or Adam (Kingma and Ba 2014) and the sparse EM algorithm presented in the previous subsection, we can build an algorithm suitable for estimating deep learning models.³

³SGD or Adam itself cannot be directly applied due to the existence of the (latent) permutation matrix.

Algorithm 1: SSEM algorithm for Shuffled Deep Regression

Input: model $h_\theta$, output distribution $f$, shuffled data $\mathcal{D}_{\mathrm{SD}}$, hyperparameters (e.g., number of steps $T_{\max}$, interval $C$).
Output: model parameter $\theta$.

1. Randomly initialize $\{\tilde{\Pi}^{(i)}\}_i$ and $\theta$.
2. (Optional) Initialize $\{\tilde{\Pi}^{(i)}\}_i$ by solving SLR.
3. for $t = 1$ to $T_{\max}$ do
4. Randomly sample $(x^{(t)}_{1:K}, y^{(t)}_{1:K}) \sim \mathcal{D}_{\mathrm{SD}}$.
5. // E-step
6. Compute $\gamma^{(t,m)}$ following Eq. (11).
7. Randomly sample $\check{z}^{(t)} \sim \mathrm{Cat}(z^{(t)} \mid \{\gamma^{(t,m)}\}_m)$.
8. // M-step
9. Compute $\check{\eta}^{(t)}_k$ following Eq. (13).
10. Compute the loss $F$ (Eq. (14)) and update $\theta$ by Adam.
11. // Infrequent updates
12. Every $C$ steps/epochs, update $\{\tilde{\Pi}^{(i)}\}_i$.
13. end for

The proposed SSEM updates the parameters using stochastically selected (mini-batch) data and permutation matrices. The pseudo-code of the SSEM algorithm is given in Alg. 1. At each (algorithmic) time step $t$, the algorithm randomly samples $(x^{(t)}_{1:K}, y^{(t)}_{1:K})$ from the given shuffled data $\mathcal{D}_{\mathrm{SD}}$. In the E-step (lines 5-7), the parameters of $\hat{q}_{\mathrm{Sp}}$, $\gamma^{(t,m)}$, are computed, and $\check{z}^{(t)}$ is randomly sampled from the categorical distribution (Cat) with parameters $\{\gamma^{(t,m)}\}_m$. Next, in the M-step (lines 8-10), using the $\check{z}^{(t)}$-th permutation matrix, we compute
$$\check{\eta}^{(t)}_k = g\big(h_\theta(\check{x}^{(t)}_k)\big), \qquad \check{x}^{(t)}_k = [\tilde{P}^{(t,\check{z}^{(t)})}\mathbf{X}^{(t)}_{1:K}]_k. \qquad (13)$$
Then we compute the loss $F$ given below, which is a stochastic approximation of the M-step objective (Eq. (12)) obtained by ignoring constant terms and inverting the sign:
$$F(\theta; \{x^{(t)}_{1:K}, y^{(t)}_{1:K}\}) = -\sum_{k=1}^{K} \log f\big(y^{(t)}_k \mid \check{\eta}^{(t)}_k\big), \qquad (14)$$
and update the parameters with any of several optimizers such as Adam (Kingma and Ba 2014). Note that if (mini-)batch data $\{(x^{(t)}_{1:K}, y^{(t)}_{1:K})\}_{t=1}^{S}$ are available, we can update the parameters by minimizing the mini-batch objective $\frac{1}{S}\sum_{t=1}^{S} F(\theta; \{x^{(t)}_{1:K}, y^{(t)}_{1:K}\})$.
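The per-step logic of Alg. 1 (lines 4-10) can be sketched as follows, again under the assumption of a Gaussian output distribution with identity link and index-vector permutations (our own illustrative code); unlike the weighted objective above, a single permutation is sampled from the categorical distribution before the gradient step.

```python
import torch

def ssem_step(model, optimizer, x_set, y_set, cand_perms, sigma=1.0):
    """One SSEM update (Alg. 1, lines 4-10) for a single sampled set;
    cand_perms is the candidate set, given as a list of length-K index tensors."""
    # E-step: responsibilities gamma^(t,m) over the M candidates (Eq. (11))
    with torch.no_grad():
        log_w = torch.stack([
            (-0.5 * (y_set - model(x_set[p]).reshape(-1)) ** 2 / sigma ** 2).sum()
            for p in cand_perms])
        gamma = torch.softmax(log_w, dim=0)
    # Sample the latent index z^(t) from Cat(z | {gamma^(t,m)}_m)
    z = torch.multinomial(gamma, num_samples=1).item()
    # M-step: stochastic loss F (Eq. (14)) for the sampled permutation,
    # dropping permutation-independent constants
    resid = y_set - model(x_set[cand_perms[z]]).reshape(-1)
    loss = 0.5 * (resid ** 2 / sigma ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```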
### Updating Candidates of the Permutation Matrix

At the end of this section, we detail the construction of $\tilde{\Pi}^{(i)}$. The key is the following proposition, which states that the best permutation matrix can be found by a sorting operation.

**Proposition 1.** The permutation matrix $\hat{P}^{(i)}$ that maximizes the posterior probability (Eq. (8)) is the one that rearranges the elements of $\{x^{(i)}_k\}_{k=1}^{K}$ such that the order of magnitude of the values $\{g(h_\theta(x^{(i)}_k))\}_{k=1}^{K}$ matches the order of $\{T(y^{(i)}_k)\}_{k=1}^{K}$.

**Proof.** Without loss of generality, we assume that $y_{1:K}$ are sorted in ascending order of $\{T(y_k)\}_{k=1}^{K}$, i.e., $T(y_k) \le T(y_\ell)$ if $k < \ell$. Let us denote $g(h_\theta(x_k))$ by $g_k$ and define $g = (g_1, \dots, g_K)$ for simplicity. The proof is by contradiction. Assume that the posterior probability (or equivalently, the logarithm of its numerator, $\sum_k \log f(y_k \mid g'_k)$) is maximized at a rearrangement $g' = (g'_1, \dots, g'_K)$ whose entries are not in ascending order, i.e., there exist $k, \ell$ such that $k < \ell$ and $g'_k > g'_\ell$. This implies $\log f(y_k \mid g'_k) + \log f(y_\ell \mid g'_\ell) \le \log f(y_k \mid g'_\ell) + \log f(y_\ell \mid g'_k)$, since
$$\log f(y_k \mid g'_k) + \log f(y_\ell \mid g'_\ell) - \big\{\log f(y_k \mid g'_\ell) + \log f(y_\ell \mid g'_k)\big\} = g'_k T(y_k) + g'_\ell T(y_\ell) - g'_\ell T(y_k) - g'_k T(y_\ell) = -(g'_k - g'_\ell)\big\{T(y_\ell) - T(y_k)\big\} \le 0.$$
Therefore, we get
$$\sum_{k'=1}^{K} \log f(y_{k'} \mid g'_{k'}) = \log f(y_k \mid g'_k) + \log f(y_\ell \mid g'_\ell) + \sum_{k' \neq k, \ell} \log f(y_{k'} \mid g'_{k'}) \le \log f(y_k \mid g'_\ell) + \log f(y_\ell \mid g'_k) + \sum_{k' \neq k, \ell} \log f(y_{k'} \mid g'_{k'}).$$
This contradicts the assumption that $g'$ is maximal. ∎

Note that this proposition extends the result in (Abid and Zou 2018), where a Gaussian distribution is used as $f$, to any exponential family distribution.

In constructing $\tilde{\Pi}^{(i)}$, it is promising to use the neighborhood of the $\hat{P}^{(i)}$ obtained by the sorting stated in Prop. 1. After rearranging the order of the $K$ elements by $\hat{P}^{(i)}$, by further swapping the $\ell$-th and $(\ell+1)$-th smallest adjacent elements, we obtain a neighbor of the permutation matrix $\hat{P}^{(i)}$. Repeating this for all $\ell = 1, \dots, K-1$, we get $K-1$ neighbors; adding $\hat{P}^{(i)}$ itself, we can construct $K$ candidate permutation matrices to be used as $\tilde{\Pi}^{(i)}$. In the later experiments, we use this construction when updating $\tilde{\Pi}^{(i)}$ (line 12 in Alg. 1). Since this construction depends on the current parameter $\theta$, the update should be done at some (infrequent) intervals.

The above construction can also be used to obtain a good initial setting for learning deep models (line 2 in Alg. 1). For example, after estimating the parameters by solving shuffled regression with an easy-to-learn model such as a linear model, the estimated linear model can be used to obtain an initial setting of $\tilde{\Pi}^{(i)}$ in the same way as above.

## Experiment

This section confirms the effectiveness of the proposed SDR against linear model-based methods for shuffled regression.

| | MPG | Abalone | Boston | Concrete |
|---|---|---|---|---|
| data size $n_B$ | 398 | 3341 | 506 | 1030 |
| dimension $d$ | 7 | 10 | 13 | 8 |

Table 5: Dataset statistics.

| Source | K | LR (oracle) | DR (oracle) | SLR | SDR (ours) |
|---|---|---|---|---|---|
| MPG-1D | 2 | 23.97 ± 2.51 | 17.57 ± 2.40 | 24.79 ± 2.76 | **17.34 ± 2.07** |
| MPG-1D | 4 | 23.98 ± 2.48 | 17.67 ± 2.44 | 23.89 ± 2.64 | **17.96 ± 1.94** |
| MPG-1D | 8 | 23.98 ± 2.48 | 17.92 ± 2.78 | 24.36 ± 2.37 | **17.82 ± 1.94** |
| MPG-1D | 16 | 23.95 ± 2.47 | 17.64 ± 2.43 | 24.81 ± 2.22 | **17.99 ± 1.65** |
| MPG | 2 | 11.92 ± 1.10 | 7.69 ± 1.00 | 12.31 ± 1.30 | **9.53 ± 1.74** |
| MPG | 4 | 11.93 ± 1.09 | 7.74 ± 1.00 | 13.24 ± 1.59 | **10.44 ± 2.09** |
| MPG | 8 | 11.93 ± 1.09 | 7.65 ± 1.11 | 14.78 ± 3.31 | **11.51 ± 3.65** |
| MPG | 16 | 11.87 ± 0.91 | 8.12 ± 1.23 | 24.60 ± 8.51 | **21.40 ± 7.87** |
| Abalone | 2 | 4.99 ± 0.47 | 4.81 ± 0.73 | 4.98 ± 0.45 | **4.92 ± 0.95** |
| Abalone | 4 | 4.99 ± 0.47 | 4.81 ± 0.73 | 5.03 ± 0.39 | **4.79 ± 0.65** |
| Abalone | 8 | 4.99 ± 0.47 | 4.81 ± 0.65 | 5.26 ± 0.43 | **4.98 ± 0.51** |
| Abalone | 16 | 4.99 ± 0.47 | 4.79 ± 0.66 | 5.84 ± 0.73 | **5.35 ± 0.52** |
| Boston | 2 | 23.86 ± 5.69 | 11.01 ± 3.38 | 33.68 ± 6.10 | **11.68 ± 4.05** |
| Boston | 4 | 23.88 ± 5.70 | 11.17 ± 3.69 | 24.60 ± 4.95 | **12.36 ± 4.88** |
| Boston | 8 | 23.80 ± 5.79 | 10.93 ± 3.43 | 26.65 ± 5.75 | **18.71 ± 5.91** |
| Boston | 16 | 24.14 ± 6.19 | 10.89 ± 3.90 | 50.86 ± 10.38 | **48.57 ± 10.46** |
| Concrete | 2 | 119.06 ± 7.00 | 82.61 ± 37.10 | 139.79 ± 4.13 | **107.20 ± 17.76** |
| Concrete | 4 | 119.06 ± 7.00 | 82.40 ± 37.08 | 120.10 ± 5.58 | **111.24 ± 16.56** |
| Concrete | 8 | 119.06 ± 7.00 | 82.24 ± 36.93 | 129.14 ± 4.69 | **126.46 ± 12.06** |
| Concrete | 16 | 119.06 ± 6.96 | 94.69 ± 28.62 | **150.94 ± 20.14** | 156.73 ± 21.73 |

Table 6: Results on benchmark datasets. Averages and standard deviations are shown.
Bold face marks the best (lowest) MSE performance among the methods for shuffled regression. Note that LR and DR use paired data and do not solve shuffled regression.

**Data.** As regression problem instances we used four publicly available datasets from the UCI machine learning repository⁴: auto MPG data (MPG), abalone data (Abalone), Boston housing data (Boston), and concrete compressive strength data (Concrete). Their sample sizes $n_B$ and input dimensions $d$ are summarized in Table 5. We excluded samples with missing values and converted the categorical feature into a one-hot vector in MPG. Input features are standardized for all datasets. We also made MPG-1D, whose input feature dimension is $d=1$, by extracting the horsepower feature. We prepared 5 datasets by randomly dividing the data, using 60% for training, 20% for validation, and 20% for testing. We made the shuffled data used for training/validation by randomly dividing the training/validation data into $n$ sets of $K$ elements each⁵ and shuffling the indices within each set. The test data are paired data for evaluation. We ran 5 trials using the 5 sets of training, validation, and test data.

⁴https://archive.ics.uci.edu/ml/index.php

⁵The training data size $n$ is about 60% of the source data size divided by $K$. Data remaining after dividing by $K$ were excluded.

**Evaluation Metric.** We use the test mean squared error (test MSE) as the performance metric. The test MSE is defined as $\frac{1}{|\mathcal{D}_{\mathrm{test}}|}\sum_{(x^{(m)}, y^{(m)}) \in \mathcal{D}_{\mathrm{test}}} \{y^{(m)} - h_\theta(x^{(m)})\}^2$, where $\mathcal{D}_{\mathrm{test}}$ is the (paired) test data and $|\cdot|$ denotes the number of elements in a set. A lower test MSE indicates that the true regression function is estimated more precisely.

**Baselines.** As one baseline, we used linear models for shuffled regression (SLR), which can be seen as an extension of (Abid and Zou 2018) to $n>1$, trained by the IRLS scheme explained in the proposed method section. As oracle baselines, we also examined (standard) linear regression (LR) and deep regression (DR) using paired data with data size $K \cdot n$. We used the normal distribution as the output distribution $f$ and the identity function as the link function $g$ in all methods.

**Hyperparameters.** Both DR and SDR used a one-hidden-layer feedforward neural network with the ReLU activation function. The number of units was set to 20 for all problems. The parameters were optimized using Adam (Kingma and Ba 2014) with a learning rate of 0.001. The mini-batch sizes of SDR and DR were $32/K$ and 32, respectively. The maximum number of epochs was 2000, and we used early stopping with the validation data (which were also shuffled data) (Goodfellow, Bengio, and Courville 2016). The above was implemented using PyTorch (Paszke et al. 2019). Experiments were run on a computer with an Intel Xeon CPU and a GeForce GTX TITAN GPU.
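Under our reading of the settings above, the network configuration and the test MSE evaluation on paired test data might be sketched as follows (illustrative code; the function names and tensors are hypothetical, and training itself would follow Alg. 1):

```python
import torch
import torch.nn as nn

def build_sdr_network(d: int, n_hidden: int = 20):
    """One-hidden-layer ReLU network with 20 units, as described above."""
    return nn.Sequential(nn.Linear(d, n_hidden), nn.ReLU(), nn.Linear(n_hidden, 1))

def test_mse(model, x_test, y_test):
    """Test MSE on paired test data:
    (1/|D_test|) * sum_m (y^(m) - h_theta(x^(m)))^2."""
    with torch.no_grad():
        pred = model(x_test).reshape(-1)
        return ((y_test - pred) ** 2).mean().item()

# Hypothetical usage:
# model = build_sdr_network(d=7)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# ... train with SSEM (Alg. 1) on the shuffled training data ...
# mse = test_mse(model, x_test, y_test)
```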
**Overall Results.** Table 6 shows the results of the experiments. These results indicate that our SDR outperforms SLR in almost all settings. In addition, the performance of SDR exceeds LR and is comparable to that of DR, which uses paired data, under some settings such as MPG-1D (K=2-16), Abalone (K=2-4), and Boston (K=2-4). From Fig. 3a, we can also see that our method captures the non-linear structure behind the data, and this contributes to its improved performance. These results confirm the effectiveness of our method using deep models.

Figure 3: Detailed comparison of SLR and SDR (ours): (a) estimated regression lines (learned models) for MPG-1D when K = 16, and (b)(c) MSE performance varying the size of shuffled data (n) in MPG-1D/Abalone when K = 4 (fixed). Averages and standard deviations are shown. Lower values are better.

**Effect of Data Size n.** SDR and SLR have comparable performance on the same dataset when the data size $n$ is small; see, e.g., Boston (K=16) and Concrete (K=16) in Table 6.⁶ We therefore investigated their performance while varying $n$ (and fixing $K$), as shown in Fig. 3bc. These results confirm that, although SLR works well when the data size $n$ is small, the performance of SDR improves as $n$ increases and eventually outperforms SLR. This result is convincing, since deep learning-based methods generally require a large amount of data to perform well. Therefore, it is confirmed that the proposed deep learning-based method (SDR) outperforms the linear model-based method (SLR) in solving the shuffled regression problem when a sufficient amount of data is available, similar to the standard regression problem.

⁶$n=18$ for Boston (K=16) and $n=38$ for Concrete (K=16).

Figure 4: Effect of hyperparameters on SDR for Abalone (K=8): (a) convergence behavior of random initialization (red) and initialization using SLR (blue) for $\tilde{\Pi}^{(i)}$ over 5 different runs; dashed vertical lines indicate the epochs at which the candidate permutation matrices $\tilde{\Pi}^{(i)}$ are updated. (b)(c) MSE performance while varying the number of hidden units/the learning rate.

**Effect of Hyperparameters.** We also investigated the impact of hyperparameters on SDR, as shown in Fig. 4. Figure 4a shows the convergence behavior of SDR with random initialization and with initialization using SLR for the candidate permutation matrices $\tilde{\Pi}^{(i)}$. We can see that SDR with random initialization converges slowly, whereas SLR initialization gives faster convergence in all five runs. This implies that the initial values should be set carefully, and that an appropriate choice can significantly accelerate convergence. Figure 4b shows that, although the MSE performance degrades when the number of units is smaller than 10, the performance is stable when the number of units is larger than 10. Similarly, stable performance is confirmed when the learning rate is set between $10^{-3}$ and $10^{-2}$ (Fig. 4c). We can therefore say that SDR works well without sensitive tuning of these two hyperparameters, which supports the usefulness of our method.

## Conclusion

In this study, we proposed SDR for learning deep regression models from shuffled data. We derived the SSEM algorithm for parameter estimation and confirmed the effectiveness of the proposals by experiments on benchmark datasets. Remaining work includes theoretically analyzing how the data size $n$, set size $K$, and input dimension $d$ affect the performance.

## References

Abid, A.; and Zou, J. 2018. A stochastic expectation-maximization approach to shuffled linear regression. In Annual Allerton Conference on Communication, Control, and Computing, 470–477.

Bishop, C. M. 2006. Pattern Recognition and Machine Learning. Springer.
Cappé, O.; and Moulines, E. 2009. On-line expectation maximization algorithm for latent data models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 71(3): 593–613.

Carpentier, A.; and Schlüter, T. 2016. Learning relationships between data obtained independently. In Artificial Intelligence and Statistics, 658–666.

David, P.; Dementhon, D.; Duraiswami, R.; and Samet, H. 2004. SoftPOSIT: Simultaneous pose and correspondence determination. International Journal of Computer Vision, 59: 259–284.

Dempster, A. P.; Laird, N. M.; and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1): 1–22.

Fang, G.; and Li, P. 2023. Regression with label permutation in generalized linear model. In International Conference on Machine Learning, 9716–9760.

Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep Learning. MIT Press.

Hsu, D. J.; Shi, K.; and Sun, X. 2017. Linear regression without correspondence. In Advances in Neural Information Processing Systems.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.

Kohjima, M.; Nambu, Y.; Kurauchi, Y.; and Yamamoto, R. 2022. General Algorithm for Learning from Grouped Uncoupled Data and Pairwise Comparison Data. In International Conference on Neural Information Processing, 153–164.

Neal, R. M.; and Hinton, G. E. 1998. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, 355–368.

Nelder, J. A.; and Wedderburn, R. W. 1972. Generalized linear models. Journal of the Royal Statistical Society Series A: Statistics in Society, 135(3): 370–384.

Pananjady, A.; Wainwright, M. J.; and Courtade, T. A. 2017. Linear regression with shuffled data: Statistical and computational limits of permutation recovery. IEEE Transactions on Information Theory, 64(5): 3286–3300.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems, 8024–8035.

Ranganath, R.; Tang, L.; Charlin, L.; and Blei, D. 2015. Deep exponential families. In Artificial Intelligence and Statistics, 762–771.

Rubin, D. B. 2004. Iteratively reweighted least squares. Encyclopedia of Statistical Sciences, 6.

Shi, X.; Li, X.; and Cai, T. 2021. Spherical regression under mismatch corruption with application to automated knowledge translation. Journal of the American Statistical Association, 116(536): 1953–1964.

Slawski, M.; Rahmani, M.; and Li, P. 2020. A Sparse Representation-Based Approach to Linear Regression with Partially Shuffled Labels. In Uncertainty in Artificial Intelligence, 38–48.

Unnikrishnan, J.; Haghighatshoar, S.; and Vetterli, M. 2018. Unlabeled sensing with random linear measurements. IEEE Transactions on Information Theory, 64(5): 3237–3253.

Xu, L.; Niu, G.; Honda, J.; and Sugiyama, M. 2019. Uncoupled regression from pairwise comparison data. In Advances in Neural Information Processing Systems, 3992–4002.

Yamane, I.; Chevaleyre, Y.; Ishida, T.; and Yger, F. 2023. Mediated Uncoupled Learning and Validation with Bregman Divergences: Loss Family with Maximal Generality. In Artificial Intelligence and Statistics, 4768–4801.