# Solving Sparse & High-Dimensional-Output Regression via Compression

Renyuan Li, Department of Industrial Systems Engineering & Management, National University of Singapore, renyuan.li@u.nus.edu
Zhehui Chen, Google, zhehuichen@google.com
Guanyi Wang, Department of Industrial Systems Engineering & Management, National University of Singapore, guanyi.w@nus.edu.sg

Multi-Output Regression (MOR) has been widely used in scientific data analysis for decision-making. Unlike traditional regression models, MOR aims to simultaneously predict multiple real-valued outputs given an input. However, the increasing dimensionality of the outputs poses significant challenges regarding interpretability and computational scalability for modern MOR applications. As a first step to address these challenges, this paper proposes a Sparse & High-dimensional-Output REgression (SHORE) model by incorporating an additional sparsity requirement to address output interpretability, and then designs a computationally efficient two-stage optimization framework capable of solving SHORE with provable accuracy via compression on outputs. Theoretically, we show that the proposed framework is computationally scalable while maintaining the same order of training loss and prediction loss before and after compression, under either arbitrary or relatively weak sample-set conditions. Empirically, numerical results further validate the theoretical findings, showcasing the efficiency and accuracy of the proposed framework.

1 Introduction

The Multi-Output Regression (MOR) problem [8, 44] is a preponderant tool for factor prediction and decision-making in modern data analysis. Compared with traditional regression models that focus on a scalar output for each sample, MOR aims to predict multiple outputs $y \in \mathbb{R}^K$ simultaneously based on a given input $x \in \mathbb{R}^d$, i.e.,
$$
y^\star := \arg\min_{u \in \mathcal{Y}} \; \mathrm{dist}\big(u, \hat{g}(x)\big) \quad \text{with} \quad \hat{g} := \arg\min_{g \in \mathcal{G}} \; \sum_{i=1}^{n} \ell\big(y_i, g(x_i)\big),
$$
where $\{(x_i, y_i)\}_{i=1}^{n}$ denotes the given sample set with $x_i \in \mathbb{R}^d$ the $i$-th input feature vector and $y_i \in \mathbb{R}^K$ the corresponding output vector, $\ell : \mathbb{R}^K \times \mathbb{R}^K \to \mathbb{R}$ is the loss function, $\mathrm{dist} : \mathbb{R}^K \times \mathbb{R}^K \to \mathbb{R}$ is some prediction/distance metric, $\mathcal{Y}$ is the structure/constraint set for the multiple outputs, and $\mathcal{G}$ is the candidate set for the predicting model $g : \mathbb{R}^d \to \mathbb{R}^K$. Hence, MOR and its variants have been used for numerous regression tasks with structural requirements on multi-dimensional outputs arising from real applications, such as simultaneous estimation of biophysical parameters from remote sensing images [40], channel estimation through the prediction of several received signals [35], and grounding (e.g., factuality checking [16]) in the era of Large Language Models (LLMs, [34, 11]), to name but a few.

In this paper, we are interested in the interpretability issue of high-dimensional outputs obtained from modern MOR tasks. One typical example arises from algorithmic trading. In particular, in algorithmic trading, MOR helps to construct a portfolio [31] from a large number of financial instruments (e.g., different stocks, futures, options, equities, etc. [19]) based on given historical market and alternative data [22]. In this example, a high-dimensional output can be viewed as a decision, where every component denotes the investment in the corresponding financial instrument.
Thus, other than accuracy, quantitative researchers prefer outputs with only a few instruments to enhance interpretability for the underlying decision-making reasons, which naturally introduces a sparse output condition. Similar scenarios apply to other applications, including offline reinforcement learning in robotics[33], discovering genetic variations based on genetic markers[25]. As a result, the dramatic growth in output dimensions gives rise to two significant challenges: 1. High-dimensional-output impedes human interpretation for decision-making; 2. Approaches with better computational scalability are desired for training & predicting MOR. Upon these challenges, a conceptual question that motivates this research is: How to design a framework that predicts output with enhanced interpretability, better computational scalability, and provable accuracy under a modern high-dimensional-output setting? Generally speaking, this paper provides an affirmative answer as a first step to the above question. Before presenting the main contributions, let us first introduce the model that will be studied in this paper. Unlike the classical MOR model, we further assume that given outputs are of high-dimensional (i.e., d K), and to address the interpretability issue, these outputs have at most s non-zero components, i.e., yi 0 s, for all i [n] with some pre-determined sparsity-level s( K). Based on such given samples, this paper proposes the (uncompressed) Sparse & High-dimensional-Output REgression (SHORE) model that aims to predict an interpretable high-dimensional output y (i.e., s-sparse in this paper) via any input feature vector x. In particular, to be concise and still capture the essential relationship, the proposed (uncompressed) SHORE model predicts y from x under a linear model, i.e., y = arg min y 0 s dist(y, b Zx) for some distance metric (see Section 3.1, prediction stage) and the linear regression b Z is obtained by solving the following linear regression problem: b Z := arg min Z RK d b Ln(Z) := 1 n Y ZX 2 F , (1) where X := (x1 | | xn) Rd n is the input matrix and Y := (y1 | | yn) RK n is the corresponding column-sparse output matrix. 1.1 Contributions and Paper Organization This paper makes the following three main contributions: 1. We propose a two-stage computationally efficient framework for solving SHORE model. Specifically, the first training stage offers a computationally scalable reformulation on solving SHORE through compression in the output space. The second prediction stage then predicts high-dimensional outputs from a given input by solving a specific sparsity-constrained minimization problem via an efficient iterative algorithm. 2. We show that for arbitrarily given samples, the training loss in the first stage with compression is bounded by a 1 + δ multiplicative ratio of the training loss for the original one (1) with some positive constant δ. Additionally, the proposed iterative algorithm in the second stage exhibits global geometric convergence within a neighborhood of the ground-truth output, with a radius proportional to the given sample s optimal training loss. Furthermore, if all samples are drawn from a light-tailed distribution, the generalization error bound and sample complexity remain in the same order for SHORE with output compression. This finding indicates that the proposed framework achieves improved computational efficiency while maintaining the same order of generalization error bounds statistically. 3. 
We conduct rich numerical experiments that validate the theoretical findings and demonstrate the efficiency and accuracy of the proposed framework on both synthetic and real-world datasets. In summary, this paper studies the SHORE model through computational and statistical lenses and provides a computationally scalable framework with provable accuracy. The paper is organized as follows: Section 2 reviews related literature; Section 3 presents our proposed framework and provides theoretical results on sample complexity and generalization error bounds; Section 4 compares the proposed method with existing baselines in a suite of numerical experiments on both synthetic and real instances. Concluding remarks are given in Section 5. Notation. Given a positive integer n, we denote [n] := {1, . . . , n}. We use lowercase letters a as scalars and bold lowercase letters a as vectors, where ai is its i-th component with i [d], and bold upper case letters A as matrices. Without specific description, for a m-by-n matrix A, we denote Ai,j as its (i, j)-th component, A i,: as its i-th row, A:,j as its j-th column. For a symmetric square matrix A, we denote λmax(A), λmin(A) and λi(A) as its maximum, minimum and i-th largest eigenvalue, respectively. We denote a 1, a 2, a , A F , A op as the ℓ1, ℓ2, ℓ -norm of a vector a, the Frobenius norm and the operator norm of a matrix A, respectively. We denote I( ) as the indicator function, a 0 := Pd i=1 I(ai = 0) as the ℓ0-norm (i.e., the total number of nonzero components), supp(a) := {i [d] | ai = 0} as the support set. We denote VK s := {y RK | y 0 s} as a set of s-sparse vectors, B2(c; ρ) := {y RK | y c 2 ρ} as a closed ℓ2-ball with center c and radius ρ, N(µ, σ2) as a Gaussian distribution with mean µ and covariance σ2. For two sequences of non-negative reals {fn}n 1 and {gn}n 1, we use fn gn to indicate that there is a universal constant C > 0 such that fn Cgn for all n 1. We use standard order notation fn = O(gn) to indicate that fn gn and fn = e Oτ(gn) to indicate that fn gn lnc(1/τ) for some universal constants τ and c. Throughout, we use ϵ, δ, τ, c, c1, c2, . . . and C, C1, C2, . . . to denote universal positive constants, and their values may change from line to line without specific comments. 2 Literature Review Multi-output regression (MOR) and its variants have been studied extensively over the past decades. In this section, we focus on existing works related to our computational and statistical results. Computational part. Existing computational methods for solving MOR can be, in general, classified into two categories [8], known as problem transformation methods and algorithm adaptation methods. Problem transformation methods (e.g., Binary Relevance (BR), multi-target regressor stacking (MTRS) method [37], regression chains method [37]) aim to transform MOR into multiple single-output regression problems. Thus, any state-of-the-art single-output regression algorithm can be applied, such as ridge regression [15], regression trees [9], and etc. However, these transformation methods ignore the underlying structures/relations between outputs, which leads to higher computational complexities. In contrast, algorithm adaptation methods focus more on the underlying structures/relations between outputs. For instance, [36] investigates input component selection and shrinkage in multioutput linear regression; [1] later couples linear regressions and quantile mapping and thus captures joint relationships among variables. 
However, the output dimension considered in these works is relatively small compared with modern applications, and their assumptions concerning low-dimensional structure of outputs are hard to verify. To overcome these shortages, we consider high-dimensional-output regression with only an additional sparsity requirement on outputs. Statistical part. There are numerous works concerning statistical properties of traditional or multioutput regressions. [18] gives sharp results on "out-of-sample" (random design) prediction error for the ordinary least squares estimator of traditional linear regression. [45] proposes an empirical risk minimization framework for large-scale multi-label learning with missing outputs and provides excess risk generalization error bounds with additional bounded constraints. [28] investigates the generalization performance of structured prediction learning and provides generalization error bounds on three different scenarios, i.e., Lipschitz continuity, smoothness, and space capacity condition. [27] designs an efficient feature selection procedure for multiclass sparse linear classifiers (a special case for SHORE with sparsity-level s = 1), and proves that the proposed classifiers guarantee the minimax generalization error bounds in theory. A recent paper [42] studies transfer learning via multi-task representation learning, a special case in MOR, which proves statistically optimistic rates on the excess risk with regularity assumptions on the loss function and task diversity. In contrast with these works, our contributions concentrate on how generalization error bounds change before and after the compression under relatively weak conditions on the loss function and underlying distributions. Specific results in MLC. MLC is an important and special case for MOR with {0, 1}-valued output per dimension, i.e., y {0, 1}K, and thus, in this paragraph, we use labels to replace outputs. Here, we focus on dimensionality reduction techniques on outputs, in particular, the compressed sensing and low-rank conditions on the output matrix Y . The idea of compressed sensing rises from signal processing, which maps the original high-dimensional output space into a smaller one while ensuring the restricted isometry property (RIP). To the best of our knowledge, the compressed sensing technique is first used in [17] to handle a sparse expected output E[y|x]. Later, [39, 12] propose Principle Label Space Transformation (PLST) and conditional PLST through singular value decomposition and canonical component analysis respectively. More recently, many new compression approaches have been proposed, such as robust bloom filter [13], log time log space extreme classification [23], merged averaged classifiers via hashing [32], etc. Additionally, computational efficiency and statistical generalization bounds can be further improved when the output matrix Y ensures a low-rank condition. Under such a condition, [45] provides a general empirical risk minimization framework for solving MLC with missing labels. Compared with the above works, this paper studies MOR under a sparse & high-dimensional-output setting without additional correlation assumptions or low-rank assumptions for output space, and then provides a complete story through a computational and statistical lens. 3 Main Results 3.1 Two-Stage Framework This subsection presents a general framework for solving SHORE and then the computational complexity for the proposed framework with/without compression. 
Given a set of training samples {(xi, yi)}n i=1 as described in Section 1, the framework can be separated into two stages: (compressed) training stage & (compressed) prediction stage. Training stage. In the first training stage, the framework finds a compressed regressor by solving a linear regression problem with compressed outputs. In particular, the framework compresses the original large output space (K-dim) to a smaller "latent" output space (m-dim) by left-multiplying a so-called "compressed" matrix Φ Rm K to outputs. Thus, the compressed version of training stage in SHORE can be represented as follows, c W := arg min W Rm d b LΦ n (W ) := 1 n ΦY W X 2 F . (2) We would like to point out that the idea of compressing the output space into some smaller intrinsic dimension has been used in many existing works, e.g., [17, 39, 12] mentioned in Section 2. Prediction stage. In the second prediction stage, given any input x Rd, the framework predicts a sparse output by by solving the following prediction problem based on the learned regressor c W in the training stage, by(c W ) := arg min y Φy c W x 2 2 s.t. y VK s F, (3) where VK s is the set of s-sparse vectors in RK, and F is some feasible set to describe additional requirements of y. For example, by letting F be RK, RK + , {0, 1}K, the intersection VK s F denotes the set of s-sparse output, non-negative s-sparse output, {0, 1}-valued s-sparse output, respectively. We use by(c W ) (shorthanded in by) to specify that the predicted output is based on the regressor c W . To solve the proposed prediction problem (3), we utilize the following projected gradient descent method (Algorithm 1), which could be viewed as a variant/generalization of existing iterative thresholding methods [6, 21] for nonconvex constrained minimization. In particular, step 4 incorporates additional constraints from F other than sparsity into consideration, which leads to non-trivial modifications in designing efficient projection oracles and convergence analysis. Later, we show that the proposed Algorithm 1 ensures a near-optimal convergence (Theorem 2 and Theorem 4) while greatly reduces the computational complexity (Remark 2) of the prediction stage for solving compressed SHORE. Before diving into theoretical analysis, we first highlight the differences between the proposed prediction stage (3), general sparsity-constrained optimization (SCO), and sparse regression in the following remark. Remark 1. Proposed prediction stage v.s. General SCO: To be clear, the SCO here denotes the following minimization problem min α 0 k Aα β 2 2. Thus, the prediction stage is a special case of general SCO problem. In particular, the predicted stage takes a random projection matrix Φ with restricted isometry property (RIP) to be its A and uses c W x with c W obtained from the compressed training-stage to be its β. As a result (Theorem 2 and Theorem 4), the proposed Algorithm 1 for prediction stage ensures a globally linear convergence to a ball with center by (optimal solution of the prediction-stage) and radius O( Φby c W x 2), which might not hold for general SCO problems. Proposed prediction stage v.s. Sparse regression: Although the proposed prediction stage and sparse high-dimensional regression share a similar optimization formulation min β 0 k Y X β 2 2, the proposed prediction stage (3) is distinct from the sparse regression in the following parts: (1) Underlying Model: Most existing works about sparse high-dimensional regression assume that samples are i.i.d. 
generated from the linear relationship Y = X β + ϵ with underlying sparse ground truth β . In the proposed prediction stage, we do not assume additional underlying models on samples if there is no further specific assumption. The problem we studied in the predicted stage takes the random projection matrix Φ with restricted isometry property (RIP) as its X (whereas X in sparse regression does not ensure RIP), and uses c W x with c W obtained from the compressed training-stage as its Y . (2) Problem Task: The sparse regression aims to recover the sparse ground truth β given a sample set {(xi, yi)}n i=1 with n i.i.d. samples. In contrast, the task of the proposed prediction stage is to predict a sparse high-dimensional output by given a random projection matrix Φ and a single input x. As a quick summary, some typical and widely used iterative algorithms [38, 3, 4, 29] for sparse regression cannot be directly applied to the proposed prediction stage. Then, we provide the computational complexity with and without the compression for the proposed two-stage framework to complete this subsection. Remark 2. Training stage: Conditioned on XX is invertible, the compressed regressor c W has a closed form solution c W = ΦY X (XX ) 1 with overall computational complexity O(Kmn + mnd + nd2 + d3 + md2) O(Kmn). Compared with the computational complexity of finding b Z from the uncompressed SHORE (1) O(Knd + nd2 + d3 + Kd2) O(K(n + d)d), solving c W enjoys a smaller computational complexity on the training stage if m d. In later analysis (see Section 3.2), m = O(δ 2 s log( K τ )) with some predetermined constants δ, τ and sparsity-level s d, thus in many applications with large output space, the condition m d holds. Prediction stage: The computational complexity of each step-3 in Algorithm 1 is O(Km + K + Km + K) O(Km). The projection in step-4 is polynomially solvable with computational complexity O(K min{s, log K}) (see proof in Appendix A.5.1). Thus, the overall computational complexity of Algorithm 1 is O(K(m + min{s, log K})T). Compared with the complexity O(K(d + min{s, log K})) of predicting by from the uncompressed SHORE (1), the compressed version enjoys a smaller complexity on the prediction stage if (m + min{s, log K})T d + min{s, log K}. (4) In later analysis (see Theorem 2), since m = O(δ 2 s log( K τ )) with predetermined constants δ, τ, sparsity-level s d, and T = O(log[ by v(0) 2 Φby c W x 2 ]) from inequality (5) , we have condition 4 holds. Whole computational complexity: Based on the analysis of computational complexity above, we conclude that when the parameters (K, d, m, T) satisfies K > K1/3 > d O(δ 2 log(K/τ) T) = m T, the compressed SHORE enjoys a better computational complexity with respect to the original one (1). Algorithm 1 Projected Gradient Descent (for Second Stage) Input: Regressor c W , input sample x, stepsize η, total iterations T 1: Initialize point v(0) VK s F. 2: for t = 0, 1, . . . , T 1: do 3: Update ev(t+1) = v(t) η Φ (Φv(t) c W x). 4: Project v(t+1) = Π(ev(t+1)) := arg minv VK s F v ev(t+1) 2 2. 5: end for Output: v(T ). 3.2 Worst-Case Analysis for Arbitrary Samples We begin this subsection by introducing the generalization method of the compressed matrix Φ. Assumption 1. Given an m-by-K compressed matrix Φ, all components Φi,j for 1 i m and 1 j K, are i.i.d. generated from a Gaussian distribution N(0, 1/m). 
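To fix ideas, below is a minimal NumPy sketch of the two-stage framework: the compressed matrix is generated as in Assumption 1, the training stage uses the closed-form solution from Remark 2, and the prediction stage runs Algorithm 1 with the feasible set F = R^K_+ (the setting used in Section 4). The function and variable names are ours, and this is an illustrative sketch rather than the authors' released implementation.

```python
import numpy as np

def train_compressed(X, Y, m, rng):
    """Training stage (2): X is d-by-n, Y is K-by-n (columns are samples).
    Compress the outputs with a Gaussian Phi (Assumption 1) and solve the
    compressed least-squares problem in closed form (Remark 2)."""
    K = Y.shape[0]
    Phi = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, K))   # Phi_ij ~ N(0, 1/m)
    # W_hat = Phi Y X^T (X X^T)^{-1}, computed via a linear solve
    W_hat = np.linalg.solve(X @ X.T, X @ (Phi @ Y).T).T     # shape (m, d)
    return Phi, W_hat

def project_sparse_nonneg(v, s):
    """Projection onto s-sparse non-negative vectors: clip negatives,
    then keep the s largest entries (cf. Appendix A.5.1 for F = R^K_+)."""
    v = np.maximum(v, 0.0)
    out = np.zeros_like(v)
    idx = np.argpartition(v, -s)[-s:]
    out[idx] = v[idx]
    return out

def predict(Phi, W_hat, x, s, eta=0.9, T=60):
    """Prediction stage (3) via Algorithm 1: projected gradient descent."""
    K = Phi.shape[1]
    v = np.zeros(K)                                          # v^(0), feasible start
    target = W_hat @ x
    for _ in range(T):
        v_tilde = v - eta * (Phi.T @ (Phi @ v - target))     # gradient step (step 3)
        v = project_sparse_nonneg(v_tilde, s)                # projection step (step 4)
    return v
```

In Section 4 the same scheme is run with stepsize eta = 0.9 and T = 60, plus an additional early-stopping rule (Algorithm 2 in Appendix A.6.1).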
Before presenting the main theoretical results, let us first introduce the definition of restricted isometry property (RIP, [10]), which is ensured by the generalization method (Assumption 1). Definition 1. (V, δ)-RIP: A m-by-K matrix Φ is said to be (V, δ)-RIP over a given set of vectors V RK, if, for every v V, (1 δ) v 2 2 Φv 2 2 (1 + δ) v 2 2. In the rest of the paper, we use (s, δ)-RIP to denote (VK s , δ)-RIP. Recall VK s = {v RK | v 0 s} is the set of s-sparse vectors. Remark 3. From Johnson-Lindenstrauss Lemma [43], for any δ (0, 1), any τ (0, 1), and any finite vector set |V| < , if the number of rows m O δ 2 log( |V| τ ) , then the compressed matrix Φ generated by Assumption 1 satisfies (V, δ)-RIP with probability at least 1 τ. Now, we are poised to present the first result on training loss defined in (2). Theorem 1. For any δ (0, 1) and τ (0, 1), suppose compressed matrix Φ follows Assumption 1 with m O( 1 τ )). We have the following inequality for training loss ΦY c W X 2 F (1 + δ) Y b ZX 2 F , holds with probability at least 1 τ, where b Z, c W are optimal solutions for the uncompressed (1) and compressed SHORE (2), respectively. The proof of Theorem 1 is presented in Appendix A.1. In short, Theorem 1 shows that the optimal training loss for the compressed version is upper bounded within a (1 + δ) multiplicative ratio with respect to the optimal training loss for the uncompressed version. Intuitively, Theorem 1 implies that SHORE remains similar performances for both compressed and compressed versions, while the compressed version saves roughly O(Kn(d m) + Kd2) computational complexity in the training stage from Remark 2. Moreover, the lower bound condition on m O( 1 τ )) ensures that the generated compressed matrix Φ is (1, δ)-RIP with probability at least 1 τ. For people of independent interest, Theorem 1 only needs (1, δ)-RIP (independent with the sparsity level) due to the unitary invariant property of Φ from Assumption 1 (details in Appendix A.1). Additionally, due to the inverse proportionality between m and δ2, for fixed K and τ, the result can be written as ΦY c W X 2 F 1 + O(1/ m) Y b ZX 2 F , which is verified in our experiments 4. We then present the convergence result of the proposed Algorithm 1 for solving prediction problem (3). Theorem 2. For any δ (0, 1) and τ (0, 1), suppose the compressed matrix Φ follows Assumption 1 with m O( s τ )). With a fixed stepsize η ( 1 2 2δ, 1), the following inequality by v(t) 2 ct 1 by v(0) 2 + c2 1 c1 Φby c W x 2 holds for all t [T] simultaneously with probability at least 1 τ, where c1 := 2 2η + 2ηδ < 1 is some positive constant strictly smaller than 1, and c2 := 2η 1 + δ is some constant. The proof of Theorem 2 is given in Appendix A.2. Here, the lower bound condition on the number of rows m ensures that the generated compressed matrix Φ is (3s, δ)-RIP with probability at least 1 τ by considering a δ/2-net cover of set V = VK 3s B2(0; 1) from Johnson-Lindenstrauss Lemma [43]. Moreover, since the number of rows m required in Theorem 2 is greater than the one required in Theorem 1, term Φby c W x 2 can be further upper bounded using the uncompressed version (1 + δ) by b Zx 2 with probability at least 1 τ. Then we obtain a direct corollary of Theorem 2: suppose by v(0) 2 > Φby c W x 2, if t t := O log by v(0) 2/ Φby c W x 2 . log (1/c1) , (5) the proposed Algorithm 1 guarantees a globally linear convergence to a ball B(by; O( Φby c W x 2)). 
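As a quick sanity check on Assumption 1 and the (s, δ)-RIP scaling behind Theorems 1 and 2 (our own illustration, not one of the paper's experiments), one can sample random unit-norm s-sparse vectors and measure how much a Gaussian Φ distorts their norms; the worst observed distortion shrinks roughly like 1/√m, in line with m = O(δ⁻² s log(K/τ)).

```python
import numpy as np

def max_sparse_distortion(K, m, s, n_trials=2000, seed=0):
    """Worst observed | ||Phi v||_2^2 - 1 | over random unit-norm s-sparse v,
    for Phi with i.i.d. N(0, 1/m) entries (Assumption 1)."""
    rng = np.random.default_rng(seed)
    Phi = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, K))
    worst = 0.0
    for _ in range(n_trials):
        support = rng.choice(K, size=s, replace=False)   # random s-sparse support
        coeffs = rng.normal(size=s)
        coeffs /= np.linalg.norm(coeffs)                 # unit-norm sparse vector
        proj = Phi[:, support] @ coeffs                  # equals Phi v, using sparsity
        worst = max(worst, abs(proj @ proj - 1.0))
    return worst

if __name__ == "__main__":
    for m in (100, 300, 1000):
        print(m, round(max_sparse_distortion(K=20000, m=m, s=3), 3))
```

Sampling test vectors only gives an optimistic estimate of the true RIP constant (certifying it exactly is hard); the point is the qualitative 1/√m trend.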
In contrast with OMP used in [17] for multi-label predictions, Theorem 2 holds for arbitrary sample set without the so-called bounded coherence guarantee on Φ. Moreover, as reported in Section 4, the proposed prediction method (Algorithm 1) has better computational efficiency than OMP. 3.3 Generalization Error Bounds for IID Samples This subsection studies a specific scenario when every sample (xi, yi) is i.i.d. drawn from some underlying sub Gaussian distribution D over sample space Rd VK s . Specifically, we use =: µ and Var D = Σxx Σxy Σyx Σyy =: Σ 0(d+K) (d+K) to denote its mean and variance, respectively. Let ξx := x µx, ξy := y µy, ξ := (ξ x , ξ y ) be centered sub Gaussian random variables of x, y, (x , y ) , respectively. Let a, b {x, y}, we use Mab := ED[ab ] = Σab + µaµ b , c Mab := 1 n Pn i=1 ai(bi) to denote the population second (cross-)moments and empirical second (cross-)moments, respectively. Then, the population training loss is defined as min Z RK d L(Z) := E(x,y) D y Zx 2 2 with its optimal solution Z := Myx M 1 xx . Similarly, given a Φ, the compressed training loss is given by LΦ(W ) := E(x,y) D Φy W x 2 2 with optimal solution W := ΦMyx M 1 xx . We then define the following assumption: Assumption 2. Let D be σ2-sub Gaussian for some positive constant σ2 > 0, i.e., the inequality ED[exp(λu ξ)] exp λ2σ2/2 holds for any λ > 0 and unitary vector u Rd+K. Moreover, the covariance matrix Σxx is positive definite (i.e., its minimum eigenvalue λmin(Σxx) > 0). Remark 4. Assumption 2 ensures the light tail property of distribution D. Note that in some real applications, e.g., factuality check [31], algorithmic trading [16], one can normalize input and output vector to ensure bounded ℓ2-norm. Under such a situation, Assumption 2 is naturally satisfied. Our first result in this subsection gives the generalization error bounds. Theorem 3. For any δ (0, 1) and τ (0, 1 3), suppose compressed matrix Φ follows Assumption 1 with m O( s τ )), and Assumption 2 holds, for any constant ϵ > 0, the following results hold: (Matrix Error). The inequality for matrix error M 1/2 xx c M 1 xx M 1/2 xx op 4 holds with probability at least 1 2τ as the number of samples n n1 with n1 := max 64C2σ4 9λ2 min(Mxx) (d + log(2/τ)) , 322 µx 2 2σ2 λ2 min(Mxx) log(1/τ) 2 , where C is some fixed positive constant used in matrix concentration inequality of operator norm. (Uncompressed). The generalization error bound for uncompressed SHORE satisfies L( b Z) L(Z ) + 4ϵ with probability at least 1 3τ, as the number of samples n max{n1, n2} with 4( Z 2 F + K) d + 2 p d log(K/τ) + 2 log(K/τ) ϵ , 4 µy Z µx 2 2 d (Compressed). The generalization error bound for the compressed SHORE satisfies LΦ(c W ) LΦ(W ) + 4ϵ with probability at least 1 3τ, as the number of sample n max{n1, en2} with 4( W 2 F + Φ 2 F ) d + 2 p d log(m/τ) + 2 log(m/τ) ϵ , 4 Φµy W µx 2 2 d The proof of Theorem 3 is presented in Appendix A.3. The proof sketch mainly contains three steps: In Step-1, we represent the difference L( b Z) L(Z ) or LΦ(c W ) LΦ(W ) as a product between matrix error (in Theorem 3) and rescaled approximation error (see Appendix A.3 for definition), i.e., L( b Z) L(Z ) or LΦ(c W ) LΦ(W ) (matrix error) (rescaled approximation error); Step-2 controls the upper bounds for matrix error and rescaled approximation error separately, using concentration for sub Guassian variables; for Step-3, we combine the upper bounds obtained in Step-2 and complete the proof. 
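Written out, the Step-1 decomposition reads as follows (this restates the bound derived in Appendix A.3, with $\widehat{\mathbb{E}}$ the empirical expectation over the $n$ samples):
$$
L(\widehat Z) - L(Z^\star) \;=\; \big\|(\widehat Z - Z^\star)\,M_{xx}^{1/2}\big\|_F^2
\;\le\; \underbrace{\big\|\widehat{\mathbb{E}}\big[(y - Z^\star x)\,x^\top \widehat M_{xx}^{-1/2}\big]\big\|_F^2}_{\text{rescaled approximation error}}
\cdot \underbrace{\big\|\widehat M_{xx}^{-1/2} M_{xx}^{1/2}\big\|_{\mathrm{op}}^2}_{\text{matrix error}},
$$
and the compressed case is identical with $(\widehat Z, Z^\star, y)$ replaced by $(\widehat W, W^\star, \Phi y)$.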
Based on the result of Theorem 3, ignoring the logarithm term for τ, the proposed generalization error bounds can be bounded by L( b Z) L(Z ) + e Oτ max{ Z 2 F , µy Z µx 2 2, K} d LΦ(c W ) LΦ(W ) + e Oτ max{ W 2 F , Φµy W µx 2 2, Φ 2 F } d Remark 5. To make a direct comparison between the generalization error bounds of the uncompressed and the compressed version, we further control the norms W 2 F , Φµy W µx 2 2, Φ 2 F based on additional conditions on the compressed matrix Φ. Recall the generalization method of the compressed matrix Φ as mentioned in Assumption 1, we have the following event E1 := Φ Rm K W 2 F = ΦZ 2 F (1 + δ) Z 2 F Φµy W µx 2 F = Φ(µy Z µx) 2 F (1 + δ) µy Z µx 2 F holds with probability at least 1 τ due to the RIP property for a fixed matrix. Moreover, since every component Φi,j is i.i.d. drawn from a Gaussian distribution N(0, 1/m), using the concentration tail bound for chi-squared variables (See Lemma 1 in [26]), we have the following event Φ Rm K Φ 2 F K + 2 m + 2 log(1/τ) holds with probability at least 1 τ. Conditioned on these two events E1 and E2, the generalization error bound of the compressed version achieves the same order (ignoring the logarithm term of τ) as the generalization error bound of the uncompressed version. That is to say, LΦ(c W ) (1 + δ) L(Z ) + e Oτ max{ Z 2 F , µy Z µx 2 2, K} d holds with probability at least 1 5τ. Comparing with existing results on generalization error bounds mentioned in Section 2, we would like to emphasize that Theorem 4 guarantees that the generalization error bounds maintain the order before and after compression. This result establishes on i.i.d. sub Gaussian samples for the SHORE model without additional regularity conditions on loss function and feasible set as required in [45]. Additionally, we obtained a O(Kd/n) generalization error bound for squared Frobenius norm loss function L or LΦ, which is smaller than O(K2d/n) as presented in [Theorem 4, [45]]. We then give results on prediction error bounds. Theorem 4. For any δ (0, 1) and any τ (0, 1/3), suppose the compressed matrix Φ follows Assumption 1 with m O( s τ )), and Assumption 2 holds. Given any learned regressor c W from training problem (2), let (x, y) be a new sample drawn from the underlying distribution D, we have the following inequality holds with probability at least 1 τ: ED[ by y 2 2] 4 1 δ ED[ Φy c W x 2 2], where by is the optimal solution from prediction problem (3) with input vector x. The proof of Theorem 4 is presented in Appendix A.4. Theorem 4 gives an upper bound of ℓ2norm distance between by and y. Since v(T ) y 2 v(T ) by 2 + by y 2, combined with Theorem 2, we have ED[ v(T ) y 2 2] O(ED[ Φy c W x 2]) when T t defined in (5) (see Appendix A.5.5), where the final inequality holds due to the optimality of by. Hence, we achieve an upper bound of ℓ2-norm distance between v(T ) and y as presented in Theorem 4, see Remark 6. Remark 6. . For any δ (0, 1) and τ (0, 1/3), suppose compressed matrix Φ follows Assumption 1 with m O( s τ )), and Assumption 2 holds, for any constant ϵ > 0, the following inequality holds with probability at least 1 3τ: ED[ v(T ) y 2 2] O(ED[ Φy c W x 2]) O(LΦ(W ) + 4ϵ). 
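Spelled out (the short derivation is in Appendix A.5.5), the chain of bounds behind Remark 6 is
$$
\mathbb{E}_{\mathcal D}\big[\|v^{(T)} - y\|_2^2\big]
\;\le\; 2\,\mathbb{E}_{\mathcal D}\big[\|v^{(T)} - \widehat y\|_2^2\big] + 2\,\mathbb{E}_{\mathcal D}\big[\|\widehat y - y\|_2^2\big]
\;\le\; O\Big(\mathbb{E}_{\mathcal D}\big[\|\Phi \widehat y - \widehat W x\|_2^2\big]\Big) + \frac{8}{1-\delta}\,\mathbb{E}_{\mathcal D}\big[\|\Phi y - \widehat W x\|_2^2\big]
\;\le\; O\Big(\mathbb{E}_{\mathcal D}\big[\|\Phi y - \widehat W x\|_2^2\big]\Big)
\;\le\; O\big(L^{\Phi}(W^\star) + 4\epsilon\big),
$$
where the first step uses the triangle inequality with $(a+b)^2 \le 2a^2 + 2b^2$, the second uses Theorem 2 (with $T \ge \bar t$ as in (5)) and Theorem 4, the third uses the optimality of $\widehat y$ in (3), and the last uses Theorem 3.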
4 Numerical Experiments In this section, we conduct numerical experiments on two types of instances (i.e., synthetic data sets and real data sets) to validate the theoretical results and illustrate both efficiency and accuracy of the proposed prediction method compared with typical existing prediction baselines, i.e., Orthogonal Matching Pursuit (OMP, [46]), Fast Iterative Shrinkage-Thresholding Algorithm (FISTA, [2]) and Elastic Net (EN, [47]). Due to the space limit, we put the implemented prediction method (Algorithm 2) in Appendix A.6.1, aforementioned existing prediction baselines in Appendix A.6.2, experiment setting details and results for real data in Appendix A.7. Performance measures. Given a sample (x, y) with input x and corresponding true output y, we use v to denote the predicted output obtained from any prediction method, and measure the numerical performances based on the following three metrics: 1. For a ground truth y with sparsity-level s, the metric precision over selected supports, i.e., Precision@s := 1 s|supp(v) supp(y)| measures the percentage of correctly identified supports in the predicted output; 2. The metric output difference, i.e., Output diff := v y 2 2 measures the ℓ2-norm distance between the predicted output and the ground-truth; 3. For any given MOR c W and compressed matrix Φ, the metric prediction loss, i.e., Prediction Loss := Φv c W x 2 2 computes the prediction loss with respect to c W x. Synthetic data generation procedure. The synthetic data set is generated as follows: Every input xi for i [n] is i.i.d. drawn from a Gaussian distribution N(µx, Σxx), where its mean vector µx and covariance matrix Σxx are selected based on the procedures given in Appendix A.6.3. For any given sparsity-level s, underlying true regressor Z RK d, and Signal-to-Noise Ratio (SNR), the groundtruth yi (corresponding with its given input xi) is generated by yi = ΠVK s F Z xi + ϵi , where ϵi RK is a i.i.d. random noise drawn from the Gaussian distribution N(0K, SNR 2 Z xi IK). Parameter setting. For synthetic data, we set input dimension d = 104, output dimension K = 2 104, and sparsity-level s = 3. We generate in total n = 3 104, i.i.d. samples as described above, i.e., Ssyn := {(xi, yi)}3 104 i=1 with SNR 1 {1, 0.32, 0.032} to ensure the signal-to-noise decibels (d B, [14]) takes values on d B := 10 log(SNR2) {0, 10, 30}. We select the number of rows for compressed matrix Φ by m {100, 300, 500, 700, 1000, 2000}. For computing the empirical regressor c W Rm d, we first split the whole sample set Ssyn into two non-overlap subsets, i.e., a training set Stra with 80% and a testing set Stest with rest 20%. The regressor c W is therefore obtained by solving compressed SHORE (2) based on the training set Stra with a randomly generated compressed matrix Φ. For evaluating the proposed prediction method, Algorithm 2, we pick a fixed stepsize η = 0.9, F = RK + , and set the maximum iteration number as T = 60, and run prediction methods over the set Stest. Hardware & Software. All experiments are conducted in Dell workstation Precision 7920 with a 3GHz 48Cores Intel Xeon CPU and 128GB 2934MHz DDR4 Memory. The proposed method and other methods are solved using Py Torch version 2.3.0 and scikit-learn version 1.4.2 in Python 3.12.3. Numerical Results & Discussions. The results are demonstrated in Figure 1, which does not include the results from the Elastic Net and OMP due to relatively much longer running time. Figure 1: Numerical results on synthetic data. 
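For reference, the three performance measures defined above translate directly into code (a small sketch with our own function names; `Phi` and `W_hat` denote the compressed matrix and the regressor learned in the training stage):

```python
import numpy as np

def precision_at_s(v, y, s):
    """Precision@s := |supp(v) ∩ supp(y)| / s."""
    return len(set(np.flatnonzero(v)) & set(np.flatnonzero(y))) / s

def output_diff(v, y):
    """Output diff := ||v - y||_2^2."""
    return float(np.sum((v - y) ** 2))

def prediction_loss(Phi, W_hat, v, x):
    """Prediction loss := ||Phi v - W_hat x||_2^2."""
    r = Phi @ v - W_hat @ x
    return float(r @ r)
```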
In short, each dot in the figure represents the average value of 10 independent trials (i.e., experiments) of compressed matrices $\Phi^{(1)}, \ldots, \Phi^{(10)}$ for a given tuple of parameters $(K, d, n, \mathrm{SNR}, m)$. The shaded regions represent the empirical standard deviations over the 10 trials. In the first row, we plot the ratio of the training loss after and before compression, i.e., $\|\Phi Y - \widehat{W} X\|_F^2 \,/\, \|Y - \widehat{Z} X\|_F^2$, versus the number of rows $m$. The ratio converges to one as $m$ increases, which validates the result presented in Theorem 1. In the second row, we plot Precision@3 versus the number of rows. As we can observe, the proposed algorithm outperforms Correlation Decoding (CD) and FISTA. Based on Figure 1, we observe that the proposed algorithm achieves better computational cost and accuracy on most metrics. The running times for the proposed algorithm and the baselines are reported in Table 2 (see Appendix A.7), which further demonstrates the efficiency of the proposed algorithm. The implementation can be found on GitHub: https://github.com/from-ryan/Solving_SHORE_via_compression.

5 Conclusion and Future Directions

In conclusion, we propose a two-stage framework to solve the Sparse & High-dimensional-Output REgression (SHORE) problem. The computational and statistical results indicate that the proposed framework is computationally scalable while maintaining the same order of both training loss and prediction loss before and after compression under relatively weak sample-set conditions, especially in the sparse and high-dimensional-output setting where the input dimension is polynomially smaller than the output dimension. In numerical experiments, SHORE provides improved optimization performance over existing MOR methods on both synthetic and real data. We close with some potential questions for future investigation. The first is to extend our theoretical results to nonlinear/nonconvex SHORE frameworks [24]. The second direction is to improve existing variable-reduction methods for better scalability while sacrificing little prediction accuracy, e.g., through new designs and analyses of randomized projection matrices. The third direction is to explore general scenarios in which high-dimensional outputs enjoy additional geometric structures [30] arising from real applications in machine learning or operations management, beyond the s-sparsity and its variants discussed in this paper. Taking our result for SHORE as an initial step, we expect stronger follow-up work that applies to MOR with additional structures, which will eventually benefit the learning community in both practice and theory.

Acknowledgement

Renyuan Li and Guanyi Wang were supported by the National University of Singapore under AcRF Tier-1 grant (A-8000607-00-00) 22-5539-A0001. Zhehui Chen would like to thank Google for its support in providing the research environment and supportive community that made this work possible.

[1] Z. Abraham, P.-N. Tan, Perdinan, J. Winkler, S. Zhong, and M. Liszewska. Position preserving multi-output prediction. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part II 13, pages 320-335. Springer, 2013.
[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[3] D. Bertsimas and B. Van Parys. Sparse high-dimensional regression: Exact scalable algorithms and phase transitions (2017).
ar Xiv preprint ar Xiv:1709.10029, 2019. [4] D. Bertsimas and B. Van Parys. Sparse high-dimensional regression. The Annals of Statistics, 48(1):300 323, 2020. [5] K. Bhatia, K. Dahiya, H. Jain, P. Kar, A. Mittal, Y. Prabhu, and M. Varma. The extreme classification repository: Multi-label datasets and code, 2016. [6] T. Blumensath and M. E. Davies. Iterative hard thresholding for compressed sensing. Applied and computational harmonic analysis, 27(3):265 274, 2009. [7] H. Boche, R. Calderbank, G. Kutyniok, and J. Vybíral. Compressed sensing and its applications: MATHEON workshop 2013. Springer, 2015. [8] H. Borchani, G. Varando, C. Bielza, and P. Larranaga. A survey on multi-output regression. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(5):216 233, 2015. [9] L. Breiman. Bagging predictors. Machine learning, 24:123 140, 1996. [10] E. J. Candes and T. Tao. Decoding by linear programming. IEEE transactions on information theory, 51(12):4203 4215, 2005. [11] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1 45, 2024. [12] Y.-N. Chen and H.-T. Lin. Feature-aware label space dimension reduction for multi-label classification. Advances in neural information processing systems, 25, 2012. [13] M. M. Cisse, N. Usunier, T. Artieres, and P. Gallinari. Robust bloom filters for large multilabel classification tasks. Advances in neural information processing systems, 26, 2013. [14] S. S. Dey, G. Wang, and Y. Xie. Approximation algorithms for training one-node relu neural networks. IEEE Transactions on Signal Processing, 68:6696 6706, 2020. [15] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55 67, 1970. [16] O. Honovich, R. Aharoni, J. Herzig, H. Taitelbaum, D. Kukliansy, V. Cohen, T. Scialom, I. Szpektor, A. Hassidim, and Y. Matias. True: Re-evaluating factual consistency evaluation. ar Xiv preprint ar Xiv:2204.04991, 2022. [17] D. Hsu, S. M. Kakade, J. Langford, and T. Zhang. Multi-label prediction via compressed sensing, 2009. [18] D. Hsu, S. M. Kakade, and T. Zhang. An analysis of random design linear regression. ar Xiv preprint ar Xiv:1106.2363, 6, 2011. [19] J. C. Hull and S. Basu. Options, futures, and other derivatives. Pearson Education India, 2016. [20] I. M. Jacobs and J. Wozencraft. Principles of communication engineering. 1965. [21] P. Jain, A. Tewari, and P. Kar. On iterative hard thresholding methods for high-dimensional m-estimation. Advances in neural information processing systems, 27, 2014. [22] S. Jansen. Machine Learning for Algorithmic Trading: Predictive models to extract signals from market and alternative data for systematic trading strategies with Python. Packt Publishing Ltd, 2020. [23] K. Jasinska and N. Karampatziakis. Log-time and log-space extreme classification. ar Xiv preprint ar Xiv:1611.01964, 2016. [24] A. Kapoor, R. Viswanathan, and P. Jain. Multilabel classification using bayesian compressed sensing. Advances in neural information processing systems, 25, 2012. [25] S. Kim and E. P. Xing. Tree-guided group lasso for multi-response regression with structured sparsity, with an application to e QTL mapping. The Annals of Applied Statistics, 6(3):1095 1117, 2012. [26] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of statistics, pages 1302 1338, 2000. [27] T. Levy and F. 
Abramovich. Generalization error bounds for multiclass sparse linear classifiers. Journal of Machine Learning Research, 24(151):1 35, 2023. [28] S. Li and Y. Liu. Towards sharper generalization bounds for structured prediction. Advances in Neural Information Processing Systems, 34:26844 26857, 2021. [29] L. Liu, Y. Shen, T. Li, and C. Caramanis. High dimensional robust sparse regression. In International Conference on Artificial Intelligence and Statistics, pages 411 421. PMLR, 2020. [30] S. Ludwig et al. Algorithms above the noise floor. Ph D thesis, Massachusetts Institute of Technology, 2018. [31] Y. Ma, R. Han, and W. Wang. Portfolio optimization with return prediction using deep learning and machine learning. Expert Systems with Applications, 165:113973, 2021. [32] T. K. R. Medini, Q. Huang, Y. Wang, V. Mohan, and A. Shrivastava. Extreme classification in log memory using count-min sketch: A case study of amazon search with 50m products. Advances in Neural Information Processing Systems, 32, 2019. [33] M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. Wiele, V. Mnih, N. Heess, and J. T. Springenberg. Learning by playing solving sparse reward tasks from scratch. In International conference on machine learning, pages 4344 4353. PMLR, 2018. [34] A. Roberts, C. Raffel, and N. Shazeer. How much knowledge can you pack into the parameters of a language model? ar Xiv preprint ar Xiv:2002.08910, 2020. [35] M. Sánchez-Fernández, M. de Prado-Cumplido, J. Arenas-García, and F. Pérez-Cruz. Svm multiregression for nonlinear channel estimation in multiple-input multiple-output systems. IEEE transactions on signal processing, 52(8):2298 2307, 2004. [36] T. Similä and J. Tikka. Input selection and shrinkage in multiresponse linear regression. Computational Statistics & Data Analysis, 52(1):406 422, 2007. [37] E. Spyromitros-Xioufis, G. Tsoumakas, W. Groves, and I. Vlahavas. Multi-label classification methods for multi-target regression. ar Xiv preprint ar Xiv:1211.6581, pages 1159 1168, 2012. [38] Q. Sun, H. Zhu, Y. Liu, and J. G. Ibrahim. Sprem: sparse projection regression model for highdimensional linear regression. Journal of the American Statistical Association, 110(509):289 302, 2015. [39] F. Tai and H.-T. Lin. Multilabel classification with principal label space transformation. Neural Computation, 24(9):2508 2542, 2012. [40] D. Tuia, J. Verrelst, L. Alonso, F. Pérez-Cruz, and G. Camps-Valls. Multioutput support vector regression for remote sensing biophysical parameter estimation. IEEE Geoscience and Remote Sensing Letters, 8(4):804 808, 2011. [41] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019. [42] A. Watkins, E. Ullah, T. Nguyen-Tang, and R. Arora. Optimistic rates for multi-task representation learning. Advances in Neural Information Processing Systems, 36, 2024. [43] B. J. William and J. Lindenstrauss. Extensions of lipschitz mapping into hilbert space. Contemporary mathematics, 26(189-206):323, 1984. [44] D. Xu, Y. Shi, I. W. Tsang, Y.-S. Ong, C. Gong, and X. Shen. Survey on multi-output learning. IEEE transactions on neural networks and learning systems, 31(7):2409 2429, 2019. [45] H.-F. Yu, P. Jain, P. Kar, and I. Dhillon. Large-scale multi-label learning with missing labels. In International conference on machine learning, pages 593 601. PMLR, 2014. [46] Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on signal processing, 41(12):3397 3415, 1993. [47] H. Zou and T. 
Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(2):301 320, 2005. A Appendix / supplemental material A.1 Proof of Theorem 1 Recall the following theorem in the main text. Theorem 1. For any δ (0, 1) and τ (0, 1), suppose the compressed matrix Φ follows Assumption 1 with m O( 1 τ )). We have the following inequality for training loss ΦY c W X 2 F (1 + δ) Y b ZX 2 F holds with probability at least 1 τ, where b Z, c W are optimal solutions for the uncompressed (1) and compressed SHORE (2), respectively. Proof. Given a set of n samples {(xi, yi)}n i=1, and a matrix Φ generated from Assumption 1, we have ΦY c W X 2 F ΦY Φ b ZX 2 F = Φ(Y b ZX) 2 F , where the inequality holds due to the optimality of c W . Let Y = Y b ZX. Consider the singular value decomposition: Y = UY ΣY V Y , and then we have Φ(Y b ZX) 2 F = ΦY 2 F = ΦUY ΣY V Y 2 F = ΦUY ΣY 2 F . Set Φ := ΦUY . Since the generalization method for Φ ensures its (1, δ)-RIP with probability 1 τ (from the Johnson-Lindenstrauss Lemma), and UY is a real unitary matrix, then Lemma 1 for unitary invariant shows that Φ is also (1, δ)-RIP with probability 1 τ. Now, using Φ is (1, δ)-RIP and all columns in ΣY has at most one non-zero component, we have ΦUY ΣY 2 F = ΦΣY 2 F (1 + δ) ΣY 2 F = (1 + δ) Y 2 F . Combining the above inequalities together implies ΦY c W X 2 F (1 + δ) Y 2 F = (1 + δ) Y b ZX 2 F , which completes the proof. A.2 Proof of Theorem 2 Recall the following theorem in the main text. Theorem 2. For any δ (0, 1) and τ (0, 1), suppose the compressed matrix Φ follows Assumption 1 with m O( s τ )). With a fixed stepsize η ( 1 2 2δ, 1), the following inequality by v(t) 2 ct 1 by v(0) 2 + c2 1 c1 Φby c W x 2 holds for all t [T] simultaneously with probability at least 1 τ, where c1 := 2 2η + 2ηδ < 1 is some positive constant strictly smaller than 1, and c2 := 2η 1 + δ is some constant. Proof. Suppose Assumption 1 holds, then the randomized compressed matrix Φ ensures (3s, δ)-RIP with probability at least 1 τ (see Remark 3). Thus, to complete the proof, it is sufficient to show the above inequality holds for all t [T] under such a compressed matrix Φ with (3s, δ)-RIP. We conclude that the proof is, in general, separated into three steps. Step-1. We establish an upper bound on the ℓ2-norm distance between the current point and the optimal solution. Due to the optimality of the projection (step-4 in Algorithm 1), we have the following inequality ev(t+1) v(t+1) 2 2 ev(t+1) v 2 2 holds for all v VK s F, which further implies that ev(t+1) v + v v(t+1) 2 2 ev(t+1) v 2 2 v v(t+1) 2 2 2 v ev(t+1), v v(t+1) holds for all v VK s F. Step-2. We show one-iteration improvement based on the above inequality. Since by VK s F, we can replace v by by, which still ensures the above inequality. Set (t) := by v(t) for all t [T]. Based on the updating rule (step-3 in Algorithm 1), ev(t+1) = v(t) η Φ (Φv(t) c W x). Thus, the above inequality can be written as (t+1) 2 2 2 by v(t) + η Φ (Φv(t) c W x), (t+1) = 2 (t), (t+1) + 2η Φv(t) c W x, Φ (t+1) = 2 (t), (t+1) 2η Φ (t), Φ (t+1) + 2η Φby c W x, Φ (t+1) . (6) Using Lemma 2 with (t) 2 , (t+1) (t+1) 2 , (t) (t) 2 + (t+1) (t+1) 2 , (t) (t) 2 (t+1) all 3s-sparse vectors, and Φ a (3s, δ)-RIP matrix, we have (t) 2 , Φ (t+1) (t) 2 , (t+1) which implies 2η Φ (t), Φ (t+1) 2δη (t) 2 (t+1) 2 2η (t), (t+1) . 
Inserting the above result into inequality (6) gives (t+1) 2 2 (2 2η) (t), (t+1) + 2δη (t) 2 (t+1) 2 + 2η Φby c W x, Φ (t+1) (i) (2 2η + 2ηδ) (t) 2 (t+1) 2 + 2η Φby c W x 2 Φ (t+1) 2 (2 2η + 2ηδ) (t) 2 (t+1) 2 + 2η 1 + δ Φby c W x 2 (t+1) 2, where the above inequality (i) requests η < 1 to ensure the inequality (2 2η) (t), (t+1) (2 2η) (t) 2 (t+1) 2 holds. Therefore, dividing (t+1) 2 on both side implies (t+1) 2 (2 2η + 2ηδ) (t) 2 + 2η 1 + δ Φby c W x 2, (7) which gives the one-step improvement. Step-3. Combine everything together. To ensure contractions in every iteration, we pick stepsize η such that 2 2η + 2ηδ (0, 1) with η < 1, which gives η 1 2(1 δ), 1 . Using the above inequality (7) for one-step improvement, we have by v(t) 2 (2 2η + 2ηδ)t by v(0) 2 + 2η 1 + δ 2η 2ηδ 1 Φby c W x 2, which completes the proof. A.3 Proof of Theorem 3 Recall the following theorem in the main text. Theorem 3. For any δ (0, 1) and τ (0, 1 3), suppose compressed matrix Φ follows Assumption 1 with m O( s τ )), and Assumption 2 holds, for any constant ϵ > 0, the following results hold: (Matrix Error). The inequality for matrix error M 1/2 xx c M 1 xx M 1/2 xx op 4 holds with probability at least 1 2τ as the number of samples n n1 with n1 := max 64C2σ4 9λ2 min(Mxx) (d + log(2/τ)) , 322 µx 2 2σ2 λ2 min(Mxx) log(1/τ) 2 , where C is some fixed positive constant used in matrix concentration inequality of operator norm. (Uncompressed). The generalization error bound for uncompressed SHORE satisfies L( b Z) L(Z ) + 4ϵ with probability at least 1 3τ, as the number of samples n max{n1, n2} with 4( Z 2 F + K) d + 2 p d log(K/τ) + 2 log(K/τ) ϵ , 4 µy Z µx 2 2 d (Compressed). The generalization error bound for the compressed SHORE satisfies LΦ(c W ) LΦ(W ) + 4ϵ with probability at least 1 3τ, as the number of sample n max{n1, en2} with 4( W 2 F + Φ 2 F ) d + 2 p d log(m/τ) + 2 log(m/τ) ϵ , 4 Φµy W µx 2 2 d Proof. Let us start with the uncompressed version. Step-1. Note that the optimal solutions for population loss and empirical loss are Z = Myx M 1 xx and b Z = Y X 1 =: c Myx c M 1 xx , respectively. Thus, the generalization error bound is L( b Z) L(Z ) = b Z Z 2 Mxx = ( b Z Z )M 1/2 xx 2 F . ( b Z Z )M 1/2 xx = c Myx c M 1 xx Myx M 1 xx M 1/2 xx = c Myx Myx M 1 xx c Mxx c M 1 xx M 1/2 xx = b E[yx Z xx ] c M 1 xx M 1/2 xx = b E[(y Z x)x c M 1/2 xx ] c M 1/2 xx M 1/2 xx , where we use b E[ ] to denote the empirical distribution. Then, the above generalization error bound can be upper-bounded as follows ( b Z Z )M 1/2 xx F = b E[(y Z x)x c M 1/2 xx ] c M 1/2 xx M 1/2 xx F b E[(y Z x)x c M 1/2 xx ] F | {z } rescaled approximation error c M 1/2 xx M 1/2 xx op | {z } matrix error Step-2. Next, we provide upper bounds on these two terms b E[(y Z x)x c M 1/2 xx ] F and c M 1/2 xx M 1/2 xx op in the right-hand-side separately. For matrix error term c M 1/2 xx M 1/2 xx op, we have c M 1/2 xx M 1/2 xx 2 op = M 1/2 xx c M 1 xx M 1/2 xx op . Due to Assumption 2, the centralized feature vector ξx := x µx ensures the following inequality ED exp λv ξx exp λ2 v 2 2σ2 for all v Rd. Consider the empirical second moment of x, µ x + µxµ x . Since ξ1 x, . . . , ξn x are i.i.d. 
σ2-sub Gaussian random vector with zero mean and covariance matrix Σxx, then based on Lemma 3, there exists a positive constant C such that for any τ (0, 1), d + log(2/τ) n , d + log(2/τ) and based on Lemma 4, for any τ (0, 1), Let xx := Pn i=1 ξi x(ξi x) n Σxx and ξ := Pn i=1 ξi x n , then c Mxx can be represented by c Mxx = Σxx + µxµ x | {z } =:Mxx + xx + µx ξ + ξµ x , and thus we have M 1/2 xx c Mxx M 1/2 xx = Id + M 1/2 xx xx + µx ξ + ξµ x M 1/2 xx . Then the minimum eigenvalue of M 1/2 xx c Mxx M 1/2 xx can be lower bounded as follows λmin M 1/2 xx c Mxx M 1/2 xx 1 M 1/2 xx xx + µx ξ + ξµ x M 1/2 xx op 1 M 1/2 xx xx M 1/2 xx op M 1/2 xx µx ξ + ξµ x M 1/2 xx op d + log(2/τ) n 2 µx 2 λmin(Mxx) 4σ log(1/τ) n , where the final inequality holds with probability at least 1 2τ by inserting the above non-asymptotic bounds. Then, we have M 1/2 xx c M 1 xx M 1/2 xx op = λ 1 min M 1/2 xx c Mxx M 1/2 xx d + log(2/τ) n 2 µx 2 λmin(Mxx) 4σ holds with probability at least 1 2τ. It is easy to observe that as n n1 := max 64C2σ4 9λ2 min(Mxx) (d + log(2/τ)) , 322 µx 2 2σ2 λ2 min(Mxx) log(1/τ) 2 , we have M 1/2 xx c M 1 xx M 1/2 xx op 4 holds with probability 1 2τ. For rescaled approximation error term b E[(y Z x)x c M 1/2 xx ] F , we first compute variance proxy for the sub Gaussian vector y Z x. Note that the j-th component of the sub Gaussian vector y Z x can be written as [y Z x]j = ( [Z ] j,: | e j ) x y where [Z ] j,: is the j-th row of Z , e j is a K-dimensional vector with j-th component equals to one and rest components equal to zero. Thus, it is easy to observe that the ℓ2-norm square of ( [Z ] j,: | e j ) is [Z ]j,: 2 2 + 1, and therefore, based on the Assumption 2, we have exp λ[y Z x]j λ[µy Z µx]j exp λ2 ( [Z ]j,: 2 2 + 1)σ2/2 , i.e., a sub Gaussian with variance proxy σ2 j := ( [Z ]j,: 2 2 + 1)σ2. Thus the rescaled approximation error can be upper-bounded by b E[(y Z x)x c M 1/2 xx ] 2 b E[[y Z x]jx c M 1/2 xx ] 2 b E[([y Z x]j [µy Z µx]j)x c M 1/2 xx ] 2 2 | {z } =:T 1 j b E[[µy Z µx]jx c M 1/2 xx ] 2 2 | {z } =:T 2 j We control term T 1 j for all j [K] separately using Lemma 5 as follows: For all τ (0, 1), we have T 1 j σ2 j (d + 2 p d log(K/τ) + 2 log(K/τ)) For the term T 2 j , we have i=1 [µy Z µx]j(xi) c M 1/2 xx = [µy Z µx]j n c M 1/2 xx x1 n | | c M 1/2 xx xn n = [µy Z µx]2 j n Now let x1 n | | xn n = Ux Dx V x be the singular value decomposition of the matrix x1 n | | xn n , the above ℓ2-norm can be further written as [µy Z µx]2 j n (i) = [µy Z µx]2 j n 1 n Vx Id 0d (n d) 0(n d) d 0(n d) (n d) (ii) = [µy Z µx]2 j n d, where the equality (i) holds due to the definition of empirical matrix c Mxx = 1 n Pn i=1 xi(xi) , the equality (ii) holds due to the unitary property of matrix Vx. Combining the above two parts implies b E[(y Z x)x c M 1/2 xx ] 2 b E[[y Z x]jx c M 1/2 xx ] 2 j=1 T 1 j + 2 σ2 j (d + 2 p d log(K/τ) + 2 log(K/τ)) [µy Z µx]2 j n d = 2( Z 2 F + K) d + 2 p d log(K/τ) + 2 log(K/τ) n + 2 µy Z µx 2 2 d with inequality (iii) holds with probability at least 1 τ. Still, it is easy to observe that for any positive constant ϵ, as n n2 := max 4( Z 2 F + K) d + 2 p d log(K/τ) + 2 log(K/τ) ϵ , 4 µy Z µx 2 2 d we have b E[(y Z x)x c M 1/2 xx ] 2 F ϵ holds with probability at least 1 τ. Step-3. Combining two upper bounds together, if n max{n1, n2}, the following inequality for generalization error bound L( b Z) L(Z ) b E[(y Z x)x c M 1/2 xx ] 2 F M 1/2 xx c M 1 xx M 1/2 xx op 4ϵ holds with probability at least 1 3τ. We then study the compressed version. Step-1 . 
Similarly, its optimal solutions for population loss and empirical loss are W = ΦMyx M 1 xx and c W = ΦY X n 1 =: Φ c Myx c M 1 xx , respectively. Thus, the generalization error bound is LΦ(c W ) LΦ(W ) = c W W 2 Mxx = (c W W )M 1/2 xx 2 F . Still, we have (c W W )M 1/2 xx = b E[(Φy W x)x c M 1/2 xx ] c M 1/2 xx M 1/2 xx , and therefore, the generalization error bound can be upper-bounded by (c W W )M 1/2 xx 2 F b E[(Φy W x)x c M 1/2 xx ] 2 c M 1/2 xx M 1/2 xx 2 Step-2 . Next, we provide upper bounds on these two terms b E[(Φy W x)x c M 1/2 xx ] F and c M 1/2 xx M 1/2 xx op in the right-hand-side separately. Note that for the matrix error term c M 1/2 xx M 1/2 xx op, we could use the same upper bounded as mentioned in the proof of uncom- pressed version. Now, to give the upper bound on the rescaled approximation error term b E[(Φy W x)x c M 1/2 xx ] F , we first compute the variance proxy for the sub Gaussian vector Φy W x, which is eσ2 j := ( [W ]j,: 2 2 + Φj,: 2 2)σ2 for all j [m]. Thus, the rescaled approximation error for the compressed version can be upper bounded by b E[(Φy W x)x c M 1/2 xx ] 2 b E[[Φy W x]jx c M 1/2 xx ] 2 b E[([Φy W x]j [Φµy W µx]j)x c M 1/2 xx ] 2 b E[[Φµy W µx]jx c M 1/2 xx ] 2 Still using Lemma 5, for all τ (0, 1), we have e T 1 j eσ2 j (d + 2 p d log(m/τ) + 2 log(m/τ)) For the term e T 2 j , following the same proof procedures of the uncompressed version implies e T 2 j = [Φµy W µx]2 j n = [Φµy W µx]2 j n d Therefore, the rescaled approximation error for the compressed version is upper-bounded by b E[(Φy W x)x c M 1/2 xx ] 2 2( W 2 F + Φ 2 F ) d + 2 p d log(m/τ) + 2 log(m/τ) n + 2 Φµy W µx 2 2 d n with probability at least 1 τ. Similarly, it is easy to get that for any positive constant ϵ, as n en2 := max 4( W 2 F + Φ 2 F ) d + 2 p d log(m/τ) + 2 log(m/τ) ϵ , 4 Φµy W µx 2 2 d we have b E[(Φy W x)x c M 1/2 xx ] 2 F ϵ holds with probability at least 1 τ. Step-3 . Combining two upper bounds together, if n max{n1, en2}, the following inequality for generalization error bound LΦ(c W ) LΦ(W ) b E[(Φy W x)x c M 1/2 xx ] 2 F M 1/2 xx c M 1 xx M 1/2 xx op 4ϵ holds with probability at least 1 3τ. A.4 Proof of Theorem 4 Recall the following theorem in the main text. Theorem 4 For any δ (0, 1) and any τ (0, 1/3), suppose the compressed matrix Φ follows Assumption 1 with m O( s τ )), and Assumption 2 holds. Given any learned regressor c W from training problem (2), let (x, y) be a new sample drawn from the underlying distribution D, we have the following inequality holds with probability at least 1 τ: ED[ by y 2 2] 4 1 δ ED[ Φy c W x 2 2], where by is the optimal solution from prediction problem (3) with input vector x. Proof. Due to the optimality of by, we have Φby c W x 2 2 Φy c W x 2 Φby Φy + Φy c W x 2 2 Φy c W x 2 2 Φby Φy 2 2 2 Φby Φy, Φy c W x Φby Φy 2 2 2 Φby Φy 2 Φy c W x 2 Φby Φy 2 2 Φy c W x 2 Φby Φy 2 2 4 Φy c W x 2 (1 δ) by y 2 2 4 Φy c W x 2 2 where the final holds due to the (3s, δ)-RIP property of the compressed matrix Φ with probability at least 1 τ. Taking expectations on both sides implies (1 δ)ED[ by y 2 2] 4ED[ Φy c W x 2 2], which completes the story. A.5 Technical Lemma A.5.1 Proof of Claim Proposed in Remark 2 Proof. Let us discuss the computational complexity for F to be RK, RK + , {0, 1}K separately. Given a fixed v, If F = RK, the projection method arg minv VK s F v v 2 2 can be reformulate using the following mixed-integer programming (MIP), (v , z ) := arg minv,z PK p=1 zp(vp vp)2 s.t. PK p=1 zp s , with v the output of the projection method. 
A.5 Technical Lemmas

A.5.1 Proof of Claim Proposed in Remark 2

Proof. Let us discuss the computational complexity of the projection for $\mathcal{F} = \mathbb{R}^K$, $\mathbb{R}^K_+$, and $\{0,1\}^K$ separately. Let $\bar{v}$ be the given (fixed) point to be projected.

If $\mathcal{F} = \mathbb{R}^K$, the projection $\arg\min_{v \in \mathcal{V}^K_s \cap \mathcal{F}} \|v - \bar{v}\|_2^2$ can be reformulated as the following mixed-integer program (MIP):
$$
(v^*, z^*) := \operatorname*{arg\,min}_{v,\,z}\ \sum_{p=1}^{K}\Big[ z_p\,(v_p - \bar{v}_p)^2 + (1-z_p)\,\bar{v}_p^2 \Big]
\quad \text{s.t.}\quad \sum_{p=1}^{K} z_p \leq s,\ \ z_p \in \{0,1\}\ \ \forall p \in [K],
$$
with $v^*$ the output of the projection method (component $p$ is kept when $z_p = 1$ and set to zero otherwise). Sorting the absolute values $\{|\bar{v}_p|\}_{p=1}^{K}$ in decreasing order such that $|\bar{v}_{(1)}| \geq \cdots \geq |\bar{v}_{(K)}|$, the output $v^*$ of the proposed projection is
$$
[v^*]_j = \begin{cases} \bar{v}_j & \text{if } j \in \{(1), \ldots, (s)\}, \\ 0 & \text{otherwise}, \end{cases}
$$
with computational complexity $O(K\min\{s, \log K\})$.

If $\mathcal{F} = \mathbb{R}^K_+$, the projection $\arg\min_{v \in \mathcal{V}^K_s \cap \mathcal{F}} \|v - \bar{v}\|_2^2$ can be reformulated as the MIP
$$
(v^*, z^*) := \operatorname*{arg\,min}_{v,\,z}\ \sum_{p=1}^{K}\Big[ z_p\,(v_p - \bar{v}_p)^2 + (1-z_p)\,\bar{v}_p^2 \Big]
\quad \text{s.t.}\quad \sum_{p=1}^{K} z_p \leq s,\ \ z_p \in \{0,1\},\ \ v_p \geq 0\ \ \forall p \in [K].
$$
Sorting $\{\bar{v}_p\}_{p=1}^{K}$ in decreasing order such that $\bar{v}_{(1)} \geq \cdots \geq \bar{v}_{(K)}$, the output $v^*$ of the proposed projection is
$$
[v^*]_j = \begin{cases} \bar{v}_j\,\mathbb{I}(\bar{v}_j > 0) & \text{if } j \in \{(1), \ldots, (s)\}, \\ 0 & \text{otherwise}, \end{cases}
$$
with computational complexity $O(K\min\{s, \log K\})$.

If $\mathcal{F} = \{0,1\}^K$, the projection $\min_{v \in \mathcal{V}^K_s \cap \mathcal{F}} \|v - \bar{v}\|_2^2$ presented in step-4 of Algorithm 1 can be represented as
$$
\min_{z}\ \sum_{p=1}^{K}\Big[(1 - z_p)\,\bar{v}_p^2 + z_p(\bar{v}_p - 1)^2\Big]
\ \ \text{s.t.}\ \sum_{p=1}^{K} z_p \leq s,\ z_p \in \{0,1\}\ \forall p \in [K]
\;=\;
\min_{z}\ \sum_{p=1}^{K}\Big[\bar{v}_p^2 - z_p(2\bar{v}_p - 1)\Big]
\ \ \text{s.t.}\ \sum_{p=1}^{K} z_p \leq s,\ z_p \in \{0,1\}\ \forall p \in [K].
$$
Sort $\{2\bar{v}_p - 1\}_{p=1}^{K}$ in decreasing order such that $2\bar{v}_{(1)} - 1 \geq 2\bar{v}_{(2)} - 1 \geq \cdots \geq 2\bar{v}_{(K)} - 1$; then the optimal $z^*$ can be set by
$$
z^*_p = \begin{cases} \mathbb{I}(2\bar{v}_p - 1 > 0) & \text{if } p \in \{(1), \ldots, (s)\}, \\ 0 & \text{otherwise}. \end{cases}
$$
Regarding computational complexity, computing the sequence $\{2\bar{v}_p - 1\}_{p=1}^{K}$ needs $O(K)$, picking the top-$s$ elements of this sequence requires $O(K\min\{s, \log K\})$, and setting the optimal solution $z^*$ needs $O(s)$; thus the total computational complexity is $O(K) + O(K\min\{s, \log K\}) + O(s) = O(K\min\{s, \log K\})$.
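The sorting-based rules above translate directly into code. The following sketch, assuming NumPy and hypothetical function names of my own choosing (it is not the paper's implementation), performs the three projections using np.argpartition, so the top-$s$ selection runs in $O(K)$ time rather than via a full sort.

```python
# Selection-based projections onto V^K_s ∩ F for the three feasible sets above.
# Illustrative sketch only; function and variable names are not from the paper.
import numpy as np

def project_sparse(v_bar, s):
    """F = R^K: keep the s entries of largest magnitude, zero out the rest."""
    out = np.zeros_like(v_bar)
    keep = np.argpartition(np.abs(v_bar), -s)[-s:]
    out[keep] = v_bar[keep]
    return out

def project_sparse_nonneg(v_bar, s):
    """F = R^K_+: keep the s largest entries, clipped at zero."""
    out = np.zeros_like(v_bar)
    keep = np.argpartition(v_bar, -s)[-s:]
    out[keep] = np.maximum(v_bar[keep], 0.0)
    return out

def project_sparse_binary(v_bar, s):
    """F = {0,1}^K: set z_p = 1 on the top-s entries of 2*v_bar - 1 when positive."""
    out = np.zeros_like(v_bar)
    score = 2.0 * v_bar - 1.0
    keep = np.argpartition(score, -s)[-s:]
    out[keep] = (score[keep] > 0).astype(v_bar.dtype)
    return out

v_bar = np.array([0.9, -1.4, 0.2, 0.7, -0.1])
print(project_sparse(v_bar, 2))   # keeps the two largest magnitudes: 0.9 and -1.4
```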
A.5.2 Lemma for the Proof of Theorem 1

Lemma 1 (Unitary invariance). Let $\Phi \in \mathbb{R}^{m\times d}$ be a randomized compression matrix as described in Assumption 1, and let $U \in \mathbb{R}^{d\times d}$ be a real unitary matrix. Then $\widetilde{\Phi} := \Phi U$ is $(1,\delta)$-RIP with probability at least $1-\tau$.

Proof. Note that the $(i,j)$-th component of $\widetilde{\Phi}$ can be represented as $\widetilde{\Phi}_{i,j} = \Phi_{i,:}U_{:,j} = \sum_{\ell=1}^{d}\Phi_{i,\ell}U_{\ell,j}$. Since every component $\Phi_{i,\ell}$ of $\Phi$ is i.i.d. drawn from $N(0, 1/m)$, we have
$$
\sum_{\ell=1}^{d}\Phi_{i,\ell}U_{\ell,j} \;\sim\; N\Big(0,\ \frac{1}{m}\sum_{\ell=1}^{d}U_{\ell,j}^2\Big) \;=\; N(0, 1/m).
$$
Now we need to show that any two distinct components $\widetilde{\Phi}_{i_1,j_1}$ and $\widetilde{\Phi}_{i_2,j_2}$ of $\widetilde{\Phi}$ are independent. It is easy to observe that $\widetilde{\Phi}_{i_1,j_1}$ and $\widetilde{\Phi}_{i_2,j_2}$ are independent when $i_1 \neq i_2$, since the rows $\Phi_{i_1,:}$ and $\Phi_{i_2,:}$ are independent. If $i_1 = i_2 = i$, then the random vector
$$
\begin{pmatrix} \widetilde{\Phi}_{i,j_1} \\ \widetilde{\Phi}_{i,j_2} \end{pmatrix}
= \begin{pmatrix} \Phi_{i,:}U_{:,j_1} \\ \Phi_{i,:}U_{:,j_2} \end{pmatrix}
\;\sim\; N\left(\begin{pmatrix}0\\0\end{pmatrix},\ \begin{pmatrix} 1/m & 0 \\ 0 & 1/m \end{pmatrix}\right),
$$
since $U_{:,j_1}$ and $U_{:,j_2}$ are orthonormal. That is to say, $\widetilde{\Phi}_{i,j_1}$ and $\widetilde{\Phi}_{i,j_2}$ are jointly Gaussian and uncorrelated, which shows that they are independent. Combining the above, $\widetilde{\Phi}$ is a randomized matrix with components i.i.d. from $N(0, 1/m)$. Based on the existing result ([7], Theorem 1.5), when $m \geq C_1\delta^{-2}\big[\ln(eK) + \ln(2/\tau)\big]$ for any $\delta > 0$ and $\tau \in (0,1)$, $\widetilde{\Phi}$ ensures $(1,\delta)$-RIP with probability at least $1-\tau$.

A.5.3 Lemma for the Proof of Theorem 2

Lemma 2. For any integer parameter $s\ (\leq d)$ and positive parameter $\delta \in (0,1)$, let $\Phi \in \mathbb{R}^{m\times d}$ be an $(s,\delta)$-RIP matrix. For $u_1, u_2$, if $u_1$, $u_2$, $u_1 + u_2$, and $u_1 - u_2$ are all $s$-sparse, then the following inequality holds:
$$
-2\delta\big(\|u_1\|_2^2 + \|u_2\|_2^2\big) + 4\langle u_1, u_2\rangle
\;\leq\; 4\langle \Phi u_1, \Phi u_2\rangle
\;\leq\; 2\delta\big(\|u_1\|_2^2 + \|u_2\|_2^2\big) + 4\langle u_1, u_2\rangle.
$$

Proof. Since $u_1$, $u_2$, $u_1 + u_2$, and $u_1 - u_2$ are $s$-sparse, we have
$$
(1-\delta)\,\|u_1 + u_2\|_2^2 \;\leq\; \langle \Phi(u_1+u_2), \Phi(u_1+u_2)\rangle \;\leq\; (1+\delta)\,\|u_1 + u_2\|_2^2, \tag{8}
$$
$$
(1-\delta)\,\|u_1 - u_2\|_2^2 \;\leq\; \langle \Phi(u_1-u_2), \Phi(u_1-u_2)\rangle \;\leq\; (1+\delta)\,\|u_1 - u_2\|_2^2. \tag{9}
$$
Subtracting (9) from (8) gives
$$
-2\delta\big(\|u_1\|_2^2 + \|u_2\|_2^2\big) + 4\langle u_1, u_2\rangle
\;\leq\; 4\langle \Phi u_1, \Phi u_2\rangle
\;\leq\; 2\delta\big(\|u_1\|_2^2 + \|u_2\|_2^2\big) + 4\langle u_1, u_2\rangle,
$$
which completes the proof.

A.5.4 Lemma for the Proof of Theorem 3

Lemma 3. Let $\xi^1, \ldots, \xi^n$ be $n$ i.i.d. $\sigma^2$-sub-Gaussian random vectors with zero mean and covariance matrix $\Sigma$. Then there exists a positive constant $C$ such that for all $\tau \in (0,1)$,
$$
\left\|\frac{1}{n}\sum_{i=1}^{n}\xi^i(\xi^i)^\top - \Sigma\right\|_{\mathrm{op}} \;\leq\; C\sigma^2\max\left\{\sqrt{\frac{d+\log(2/\tau)}{n}},\ \frac{d+\log(2/\tau)}{n}\right\}
$$
holds with probability at least $1-\tau$.

Proof. Lemma 3 is a direct corollary of [Theorem 6.5, [41]]. It is easy to observe that the proposed Lemma 3 holds by setting the parameter $\delta$ listed in [Theorem 6.5, [41]] as $\min\{1,\ c\sqrt{\ln(2/\tau)/n}\}$ for some positive constant $c$.

Lemma 4. Let $\xi^1, \ldots, \xi^n$ be $n$ i.i.d. $\sigma^2$-sub-Gaussian random vectors with zero mean and covariance matrix $\Sigma$. Then for any $\tau \in (0,1)$, we have
$$
\left\|\frac{1}{n}\sum_{i=1}^{n}\xi^i\right\|_2 \;\leq\; 4\sigma\sqrt{\frac{\log(1/\tau)}{n}}
$$
with probability at least $1-\tau$.

Proof. We show this lemma by discretizing the unit $\ell_2$-norm ball $B_2(0;1)$. Let $N_{1/2}$ be a $\tfrac{1}{2}$-minimal cover of $B_2(0;1)$ with cardinality $|N_{1/2}| \leq 5^d$. Since for any vector $\xi \in \mathbb{R}^d$ we always have
$$
\|\xi\|_2 = \max_{\|v\|_2 \leq 1}\langle v, \xi\rangle
\;\leq\; \max_{v' \in N_{1/2}}\langle v', \xi\rangle + \max_{\|v''\|_2 \leq 1/2}\langle v'', \xi\rangle
\;=\; \max_{v' \in N_{1/2}}\langle v', \xi\rangle + \frac{1}{2}\max_{\|v''\|_2 \leq 1}\langle v'', \xi\rangle,
$$
it follows that $\|\xi\|_2 \leq 2\max_{v' \in N_{1/2}}\langle v', \xi\rangle$. Therefore, for any $\sigma^2$-sub-Gaussian random vector $\xi$,
$$
\mathbb{P}\big(\|\xi\|_2 \geq t\big) \;\leq\; \mathbb{P}\Big(\max_{v' \in N_{1/2}}\langle v', \xi\rangle \geq t/2\Big) \;\leq\; |N_{1/2}|\exp\left(-\frac{t^2}{8\sigma^2}\right),
$$
which implies that $\|\xi\|_2 \leq 4\sigma\sqrt{\log(1/\tau)}$ with probability at least $1-\tau$. Now, since $\bar{\xi} = \frac{1}{n}\sum_{i=1}^{n}\xi^i$ is a $\sigma^2/n$-sub-Gaussian random vector, inserting this variance proxy into the above inequality gives the desired result.

Lemma 5. Let $\eta(x)$ be a zero-mean, $\sigma_\eta^2$-sub-Gaussian random variable. Let $x^1, \ldots, x^n$ be $n$ i.i.d. $\sigma^2$-sub-Gaussian random vectors (not necessarily zero-mean) as described in Assumption 2. Conditioned on $\widehat{M}_{xx} = \frac{1}{n}\sum_{i=1}^{n}x^i(x^i)^\top \succ 0_{d\times d}$, for any $\tau \in (0,1)$, we have
$$
\Big\|\widehat{\mathbb{E}}\big[\eta(x)\,x^\top\widehat{M}_{xx}^{-1/2}\big]\Big\|_2^2 \;\leq\; \frac{\sigma_\eta^2\big(d + 2\sqrt{d\log(1/\tau)} + 2\log(1/\tau)\big)}{n}
$$
with probability at least $1-\tau$.

Proof. The proof of Lemma 5 can be found in [Lemma 5, [18]].

A.5.5 Discussion after Theorem 4

Since $\|v^{(T)} - y\|_2 \leq \|v^{(T)} - \widehat{y}\|_2 + \|\widehat{y} - y\|_2$, combined with Theorem 2 we have
$$
\mathbb{E}_{\mathcal{D}}\big[\|v^{(T)} - y\|_2^2\big]
\;\leq\; 2\,\mathbb{E}_{\mathcal{D}}\big[\|v^{(T)} - \widehat{y}\|_2^2\big] + 2\,\mathbb{E}_{\mathcal{D}}\big[\|\widehat{y} - y\|_2^2\big]
\;\leq\; O\big(\mathbb{E}_{\mathcal{D}}[\|\Phi\widehat{y} - \widehat{W}x\|_2^2]\big) + \frac{8}{1-\delta}\,\mathbb{E}_{\mathcal{D}}\big[\|\Phi y - \widehat{W}x\|_2^2\big]
\;\leq\; O\big(\mathbb{E}_{\mathcal{D}}[\|\Phi y - \widehat{W}x\|_2^2]\big).
$$

A.6 Additional Numerical Experiments on Synthetic Data

A.6.1 Implemented Prediction Method

The implemented prediction method is presented in Algorithm 2 below. Compared with Algorithm 1 proposed in the main content, it adds an additional stopping criterion,
$$
\frac{\|v^{(t)} - v^{(t-2)}\|_2}{0.01 + \|v^{(t)}\|_2} \;<\; 10^{-6},
$$
to ensure an earlier stop than Algorithm 1. Note that in the later numerical experiments, we use the terminology "early stopping" to denote that the iterates generated by the prediction algorithm satisfy this additional stopping criterion within the maximum iteration number, i.e., $T = 60$ (as listed in Section 4).

Algorithm 2 Implemented Projected Gradient Descent (for Second Stage)
Input: Regressor $\widehat{W}$, input sample $x$, stepsize $\eta$, total iterations $T$
1: Initialize point $v^{(0)} \in \mathcal{V}^K_s \cap \mathcal{F}$ and $t = 0$.
2: while $t < T$ and $\|v^{(t)} - v^{(t-2)}\|_2 / (0.01 + \|v^{(t)}\|_2) > 10^{-6}$ do
3:   Update $\widetilde{v}^{(t+1)} = v^{(t)} - \eta\,\Phi^\top(\Phi v^{(t)} - \widehat{W}x)$.
4:   Project $v^{(t+1)} = \Pi(\widetilde{v}^{(t+1)}) := \arg\min_{v \in \mathcal{V}^K_s \cap \mathcal{F}} \|v - \widetilde{v}^{(t+1)}\|_2^2$.
5:   Update $t := t + 1$.
6: end while
Output: $v^{(T)}$.
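As a concrete reference, here is a minimal Python sketch of Algorithm 2; it is an illustrative re-implementation rather than the paper's released code, and the default stepsize $\eta = 1/\|\Phi\|_2^2$ as well as the function names are assumptions. It reuses the magnitude-based projection for $\mathcal{F} = \mathbb{R}^K$ from Appendix A.5.1 and stops early once the relative-change criterion above is met.

```python
# Minimal sketch of Algorithm 2 (projected gradient descent with the additional
# relative-change stopping criterion). Not the paper's code; names are illustrative.
import numpy as np

def project_topk(v, s):
    """Projection onto {v : ||v||_0 <= s} for F = R^K: keep the s largest magnitudes."""
    out = np.zeros_like(v)
    keep = np.argpartition(np.abs(v), -s)[-s:]
    out[keep] = v[keep]
    return out

def predict_pgd(W_hat, Phi, x, s, eta=None, T=60, tol=1e-6):
    """Approximately solve min_{||v||_0 <= s} ||Phi v - W_hat x||_2^2."""
    K = Phi.shape[1]
    target = W_hat @ x
    if eta is None:                      # conservative default stepsize (an assumption)
        eta = 1.0 / np.linalg.norm(Phi, 2) ** 2
    history = [np.zeros(K)]              # history[t] = v^(t); v^(0) = 0
    for t in range(T):
        v = history[-1]
        if t >= 2 and np.linalg.norm(v - history[-3]) / (0.01 + np.linalg.norm(v)) < tol:
            break                        # early stopping criterion from Appendix A.6.1
        grad = Phi.T @ (Phi @ v - target)
        history.append(project_topk(v - eta * grad, s))
    return history[-1]
```

With `W_hat` and `Phi` from the earlier synthetic training sketch, `predict_pgd(W_hat, Phi, x, s)` returns an $s$-sparse prediction for a test input `x`.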
A.6.2 Discussions on Baselines

Baselines. We compare our proposed prediction method with the following baselines.

Orthogonal Matching Pursuit. Orthogonal Matching Pursuit (OMP) is a greedy prediction algorithm. It iteratively chooses the most relevant output, then performs least squares and updates the residuals. The built-in class OrthogonalMatchingPursuit from the Python package sklearn.linear_model is used in the experiments.

Correlation Decoding. Correlation decoding is a standard decoding algorithm. It computes the product of the transpose of the compression matrix $\Phi$ and the learned regressor $\widehat{W}$. For any test point $x$, the algorithm predicts the top $s$ labels in $\Phi^\top\widehat{W}x$ ordered by magnitude.

Elastic Net. The elastic net is a regression method that combines both the $\ell_1$-norm penalty and the $\ell_2$-norm penalty to guarantee the sparsity and the stability of the prediction. The parameters for the $\ell_1$-norm penalty and the $\ell_2$-norm penalty in Elastic Net are set to 0.1 throughout the experiments. The built-in class ElasticNet from the Python package sklearn.linear_model is used in the experiments.

Fast Iterative Shrinkage-Thresholding Algorithm. The fast iterative shrinkage-thresholding algorithm (FISTA) is an accelerated optimization method designed to efficiently solve certain classes of unconstrained convex optimization problems. It utilizes momentum-like strategies to speed up convergence, which is particularly effective for minimizing functions that may be non-smooth.

A.6.3 Procedures on Generating Mean & Covariance

In this subsection, we give the exact procedures for selecting $\mu_x$ and $\Sigma_{xx}$ as mentioned in Section 4. The mean vector $\mu_x$ is selected as follows: we first generate a $d$-dimensional vector $\mu$ from the standard $d$-dimensional Gaussian distribution $N(0_d, I_d)$, and then set $[\mu_x]_j = |\mu_j|$ for all $j \in [d]$ by taking absolute values of all components. The covariance matrix $\Sigma_{xx}$ is selected as follows: we first generate a $d$-by-$d$ matrix $A$, where every component of $A$ is i.i.d. generated from the standard Gaussian distribution $N(0,1)$, and the covariance matrix $\Sigma_{xx}$ is then set to be a positive semidefinite matrix constructed from $A$.
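For completeness, a short NumPy sketch of this generation procedure follows. Since the exact construction of $\Sigma_{xx}$ from $A$ is not fully specified above, the choice $\Sigma_{xx} = AA^\top/d$ below is only an assumption made so that the example runs; the remaining steps follow the description directly.

```python
# Illustrative sketch of the synthetic mean/covariance generation in Appendix A.6.3.
# The formula Sigma_xx = A A^T / d is an assumption made here only so the example
# runs; the paper's exact construction may differ.
import numpy as np

rng = np.random.default_rng(0)
d = 50

mu = rng.standard_normal(d)
mu_x = np.abs(mu)                        # componentwise absolute value, as described

A = rng.standard_normal((d, d))
Sigma_xx = A @ A.T / d                   # assumed PSD construction from A

# sample synthetic inputs x^1, ..., x^n from N(mu_x, Sigma_xx)
n = 1000
X = rng.multivariate_normal(mu_x, Sigma_xx, size=n).T   # d x n input matrix
```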
A.7 Additional Numerical Experiments on Real Data

In this subsection, we conduct experiments on real data and compare the performance of the proposed prediction method (see Algorithm 2) with four baselines, i.e., Orthogonal Matching Pursuit (OMP, [46]), Correlation Decoding (CD, [20]), Elastic Net (EN, [47]), and the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA, [2]); see Appendix A.6.2 for detailed explanations.

Real data. We select two benchmark datasets in multi-label classification, Wiki10-31K and EURLex-4K [5], due to their sparsity properties. Table 1 shows the details of the datasets.

| Dataset | Input Dim $d$ | Output Dim $K$ | Training $n$ | Training $\bar{d}$ | Training $\bar{K}$ | Test $n$ | Test $\bar{d}$ | Test $\bar{K}$ |
|---|---|---|---|---|---|---|---|---|
| EURLex-4K | 5,000 | 3,993 | 17,413 | 236.69 | 5.30 | 1,935 | 240.96 | 5.32 |
| Wiki10-31K | 101,938 | 30,938 | 14,146 | 673.45 | 18.64 | 6,616 | 659.65 | 19.03 |

Table 1: Statistics and details for the training and test sets, where $\bar{d}$, $\bar{K}$ denote the averaged numbers of non-zero components for input and output, respectively.

Figure 2: The figure reports the prediction running time (measured in seconds) on synthetic data with early stopping by the proposed algorithm under different compressed output dimensions. As we can observe, the running time first decreases dramatically and then increases almost linearly with respect to $m$. This phenomenon occurs because the maximum number of iterations in the implemented prediction method with early stopping is 60, which is relatively large; as $m$ increases but remains below 500, the actual number of iterations drops dramatically due to the early stopping criterion; after $m$ passes 500, the actual number of iterations stays around 10, and the running time then grows linearly as the dimension increases.

Parameter setting. In the prediction stage, we choose $s \in \{1, 3\}$ for EURLex-4K and $s \in \{3, 5\}$ for Wiki10-31K. We choose the number of rows $m \in \{100, 200, 300, 400, 500, 700, 1000, 2000\}$ on both EURLex-4K and Wiki10-31K. Ten independent trials of compression matrices $\Phi^{(1)}, \ldots, \Phi^{(10)}$ are implemented for each tuple of parameters $(s, m)$ on both datasets.

Empirical running time. Here, we report the running time of the proposed algorithm and the baselines on both synthetic and real datasets; see Table 2.

| Dataset | Prop. Algo. | OMP | CD | Elastic Net | FISTA |
|---|---|---|---|---|---|
| Synthetic Data | ~1 second | 200-400 seconds | <1 second | 700-900 seconds | <3 seconds |
| EURLex-4K | <1 second | 20-80 seconds | <1 second | - | ~1 second |
| Wiki10-31K | <5 seconds | 500-700 seconds | <5 seconds | - | 5-10 seconds |

Table 2: Running time comparison for each prediction method.

Numerical Results & Discussions. Figure 2 further illustrates that the computational complexity increases linearly with the growth of the compressed output dimension $m$ on synthetic data, once $m$ is greater than 500 so that the convergence of the prediction algorithm is ensured (see Remark 2). For real data, Figure 3 and Figure 4 present the accuracy results. In particular, the accuracy remains relatively stable with respect to $m$ when the compression matrix satisfies the RIP property with high probability. Moreover, based on the results presented in Figure 3, Figure 4, and Table 2, we observe that the proposed algorithm slightly outperforms the baselines on precision as $s$ increases while enjoying significantly better computational efficiency, especially on large instances, which demonstrates the stability of the proposed algorithm.

Figure 3: This figure reports the numerical results on the real dataset EURLex-4K. Each dot in the figure represents 10 independent trials (i.e., experiments) of compression matrices $\Phi^{(1)}, \ldots, \Phi^{(10)}$ for a given tuple of parameters $(s, m)$. The curves in each panel correspond to the averaged values of the proposed algorithm and the baselines over 10 trials; the shaded parts represent the empirical standard deviations over 10 trials. In the first row, we plot the output distance versus the number of rows. In the second row, we plot the precision versus the number of rows, and we cannot observe significant differences between these prediction methods.

Figure 4: This figure reports the numerical results on the real dataset Wiki10-31K. Similar to the plots for EURLex-4K above, each dot in the figure represents 10 independent trials (i.e., experiments) of compression matrices $\Phi^{(1)}, \ldots, \Phi^{(10)}$ for a given tuple of parameters $(s, m)$. The curves in each panel correspond to the averaged values of the proposed algorithm and the baselines over 10 trials; the shaded parts represent the empirical standard deviations over 10 trials. In the first row, the precision of the proposed algorithm outperforms FISTA, especially when $s$ is small. In the second and third rows, for output difference and prediction loss, there is only a slight improvement of the proposed algorithm over CD in output difference.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The paper's abstract is a short highlight of our introduction, covering the problems we study, the theoretical results we establish, and the numerical experiments we conduct.
Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We mention our assumptions, discuss the limitations of the proposed method, and compare our method with other existing methods.
Guidelines:
- The answer NA means that the paper has no limitations, while the answer No means that the paper has limitations, but those are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: Assumptions and complete proofs are presented in the Appendix, with short proof sketches in the main paper. All theorems, formulas, and proofs in the paper are numbered and cross-referenced.
Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: The numerical experiments are controlled by the default random seed, so they can be reproduced. Moreover, we not only provide the code but also include the detailed steps of the experiments in the main paper.
Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
  (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
  (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
  (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
  (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We follow the guidance to provide the code and data.
Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so "No" is an acceptable answer.
- Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We provide the details in the numerical experiment section.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: We provide confidence intervals for the results, which show the significant improvement.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: The compute resources are mentioned in the numerical section.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: The submission is anonymous.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: Our method aims to solve challenges raised by the recent surge of large language models, for which hallucination is a major issue. Besides, our paper can also be applied under fairness considerations.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: Our paper does not have this issue.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for Existing Assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We cite all papers whose assets are used in our numerical experiments.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: We provide an anonymized zip file.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.