# improving_deep_regression_with_tightness__a3e40926.pdf

Published as a conference paper at ICLR 2025

IMPROVING DEEP REGRESSION WITH TIGHTNESS

Shihao Zhang¹, Yuguang Yan², Angela Yao¹

¹National University of Singapore, ²Guangdong University of Technology

zhang.shihao@u.nus.edu, ygyan@gdut.edu.cn, ayao@comp.nus.edu.sg

For deep regression, preserving the ordinality of the targets with respect to the feature representation improves performance across various tasks. However, a theoretical explanation for the benefits of ordinality is still lacking. This work reveals that preserving ordinality reduces the conditional entropy H(Z|Y) of representation Z conditional on the target Y. Our findings also reveal that typical regression losses fail to sufficiently reduce H(Z|Y), despite its crucial role in generalization performance. With this motivation, we introduce an optimal transport-based regularizer that preserves the similarity relationships of targets in the feature space to reduce H(Z|Y). Additionally, we introduce a simple yet efficient strategy of duplicating the regressor targets, also with the aim of reducing H(Z|Y). Experiments on three real-world regression tasks verify the effectiveness of our strategies to improve deep regression. Code: https://github.com/needylove/Regression_tightness.

1 INTRODUCTION

Classification and regression are two fundamental tasks in machine learning. Classification maps input data to categorical targets, while regression maps the data to continuous target space. Representation learning in deep neural networks is well-studied for classification (Boudiaf et al., 2020; Achille & Soatto, 2018), but is less explored for regression. One emerging observation in deep regression is the importance of feature ordinality (Zhang et al., 2023). Preserving the ordinality of targets within the feature space leads to better performance, and various regularizers to enhance ordinality have been proposed (Gong et al., 2022; Keramati et al., 2023). But what is the underlying link between ordinality and regression performance?

The information bottleneck principle (Shwartz-Ziv & Tishby, 2017) suggests that a neural network learns representations Z that retain sufficient information about the target Y while compressing irrelevant information. The two aims can be regarded as minimizing the conditional entropies H(Y|Z) and H(Z|Y) (Zhang et al., 2024). Compression reduces representation complexity, prevents overfitting, and bounds the generalization error (Tishby & Zaslavsky, 2015; Kawaguchi et al., 2023; Zhang et al., 2024). We find that preserving ordinality enhances compression by minimizing H(Z|Y), i.e., the conditional entropy of the learned representation Z with respect to the target Y. Following (Boudiaf et al., 2020; Zhang et al., 2024), we refer to this conditional entropy as tightness, and its compression as tightening the representation.

But are ordinal feature spaces not learned naturally by the regressor? We explore this question through gradient analysis and by comparing regression with classification. We find that typical regressors are weak in tightening the learned representations. Specifically, given a fixed linear regressor with weight vector θ, the update direction of zi for a given sample i tends to follow the direction of θ. The movement of zi can be regarded as a probability density shift (Sonoda & Murata, 2019). Deep regressors therefore have only a limited ability to tighten the representation in directions perpendicular to θ. In contrast, we find that deep classifiers update zi more flexibly and in diverse directions, leading to better-tightened representations. This finding sheds light on why reformulating regression as a classification task may be more effective (Farebrother et al., 2024; Liu et al., 2019) and why classification losses benefit representation learning for regression (Zhang et al., 2023).

So how can regression representations be further tightened? We take inspiration from classification, where one-hot encodings allow a separate set of classification weights θk for each class k. Similarly, we augment the target space of the regressor into multiple targets. The multiple-target strategy adds extra dimensions to the regression output and incorporates additional regressors, making it more flexible to tighten the feature representations. Additionally, we introduce a Regression Optimal Transport (ROT) Regularizer, or ROT-Reg. ROT-Reg captures local similarity relationships through optimal transport plans. By encouraging similar transport plans between the target and feature space, the regularizer can tighten representations locally. It also helps to preserve the target space topology, which is desirable for regression representations (Zhang et al., 2024).

Our main contributions are three-fold:

We are the first to analyze the need for preserving target ordinality with respect to the representation space for deep regression and link it to feature tightness.

We reveal the weakness of standard regression in tightening learned feature representations, as the representation updating direction is constrained to follow a single line.

We introduce a multi-target learning strategy and an optimal transport-based regularizer, which tighten regression representations globally and locally, respectively.

2 RELATED WORK

Regression representation learning. Existing works mainly focus on the properties of continuity and ordinality. For continuity, DIR (Yang et al., 2021) tackles missing data by smoothing, based on the continuity of both targets and representations. VIR (Wang & Wang, 2024) computes representations with additional information from data with similar targets. Preserving representation continuity also encourages the feature manifold to be homeomorphic with respect to the target space and is highly desirable (Zhang et al., 2024).

For ordinality, RankSim (Gong et al., 2022) explicitly preserves ordinality for better performance. Keramati et al. (2023) further incorporate a contrastive regularizer to preserve ordinality. It is worth mentioning that continuity overlaps with ordinality conceptually; ensuring continuity for neighboring samples inherently involves ordinality. Although ordinality plays a key role in regression representation learning, its importance and characteristics are underexplored. This work tackles these questions by establishing connections between target ordinality and representation tightness.

Recasting regression as classification. For a diverse set of regression tasks, formulating them as a classification problem yields better performance (Li et al., 2022; Bhat et al., 2021; Farebrother et al., 2024). Previous works have hinted at task-specific reasons. For pose estimation, classification provides denser and more effective supervision (Gu et al., 2022). For crowd counting, classification is more robust to noise (Xiong & Yao, 2022). Later, Pintea et al. (2023) empirically found that classification helps when the data is imbalanced, and Zhang et al. (2023) suggest that regression lags in its ability to learn a high entropy feature space. A high entropy feature space implies that representations preserve necessary information about the target. In this work, we provide a derivation and further suggest that regression struggles to compress the representations.

3 ON THE TIGHTNESS OF REGRESSION REPRESENTATIONS

3.1 NOTATIONS & DEFINITIONS

Consider a dataset $\{x_i, z_i, y_i\}_{i=1}^{N}$ sampled from a distribution $P$, where $x_i$ is the input, $y_i \in \mathbb{R}$ is the corresponding label, and $z_i \in \mathcal{Z} \subset \mathbb{R}^d$ is the feature corresponding to the input $x_i$ extracted by a neural network. A regressor $f_\theta$ parameterized by $\theta$ maps $z_i$ to a predicted target $\hat{y}_i = f_\theta(z_i)$. Specifically, when $f_\theta$ is a linear regressor, which is typically the case in deep neural networks, we have $\hat{y}_i = \theta^\top z_i$. The encoder and $f_\theta$ are trained by minimizing a task-specific regression loss $L_{reg}$. Typically, the mean-squared error is used, i.e.,

$$L_{reg} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2.$$

To formulate regression as a classification problem, the continuous target $y$ is quantized to $K$ classes $y_i^c \in \{1, \dots, K\}$, and the cross-entropy loss is used to train the encoder and classifiers:

$$L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\theta_{y_i^c}^\top z_i)}{\sum_{j=1}^{K} \exp(\theta_j^\top z_i)},$$

where $\theta_k$ is the classifier¹ corresponding to the class $k$. The function $d(\cdot, \cdot)$ measures some distance between two points, e.g., Euclidean distance.
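The two losses defined above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the uniform binning in `quantize_targets` is one possible quantization scheme, the function names are hypothetical, and classes are indexed from 0 to K-1 rather than 1 to K.

```python
import numpy as np

def mse_loss(y, y_hat):
    """Mean-squared error between targets and predictions."""
    return float(np.mean((y - y_hat) ** 2))

def quantize_targets(y, num_classes, y_min, y_max):
    """Uniformly bin continuous targets into classes 0..K-1 (assumed scheme)."""
    # K-1 interior bin edges between y_min and y_max.
    edges = np.linspace(y_min, y_max, num_classes + 1)[1:-1]
    return np.digitize(y, edges)

def cross_entropy_loss(Z, Theta, labels):
    """Softmax cross-entropy with one linear classifier theta_k per class.

    Z: (N, d) features, Theta: (K, d) classifiers, labels: (N,) class ids.
    """
    logits = Z @ Theta.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))
```

With all-zero classifiers every class receives probability 1/K, so the cross-entropy equals log K, which is a convenient sanity check.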

3.2 ORDINALITY AND TIGHTNESS

This section shows that preserving ordinality tightens the learned representation, and conversely, tightening the representation will help preserve ordinality. A lower H(Z|Y) represents a higher compression (Zhang et al., 2024). The compression is maximized when H(Z|Y) is at its minimum (H(Z|Y) = −∞ for differential entropy and H(Z|Y) = 0 for discrete entropy).

First, we define the ordinality following (Gong et al., 2022):

Definition 1 (Ordinality). The ordinality is perfectly preserved if $\forall i, j, k$, the following holds: $d(y_i, y_j) < d(y_i, y_k) \Rightarrow d(z_i, z_j) < d(z_i, z_k)$.

Theorem 1. Let $B(z, \epsilon) = \{z' \in \mathcal{Z} \mid d(z, z') \le \epsilon\}$ be the closed ball centered at $z$ with radius $\epsilon$. Assume that $\forall (x, z, y) \sim P$ and $\forall \epsilon > 0$, $\exists (x', z', y') \sim P$ such that $z' \in B(z, \epsilon)$ and $y' \ne y$. Then if the ordinality is perfectly preserved, $\forall (x_i, z_i, y_i), (x_j, z_j, y_j) \sim P$, the following holds: $y_i = y_j \Rightarrow d(z_i, z_j) = 0$.

The detailed proof of Theorem 1 is given in Appendix A.1. Theorem 1 states that if the ordinality is perfectly preserved, then the tightness (i.e., H(Z|Y)) is minimized. This suggests that preserving ordinality will tighten the representations. The assumption in Theorem 1 is consistent with the learning goal of obtaining representations that change continuously with continuous targets.
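For intuition, the argument can be sketched as follows. This is a reader's sketch, not the proof from Appendix A.1, and it assumes that Definition 1 is stated with strict inequalities, i.e., $d(y_i, y_j) < d(y_i, y_k) \Rightarrow d(z_i, z_j) < d(z_i, z_k)$.

```latex
\begin{align*}
&\text{Fix } (x_i, z_i, y_i),\ (x_j, z_j, y_j) \text{ with } y_i = y_j,
  \text{ and let } \epsilon > 0. \\
&\text{By assumption, } \exists\, (x_k, z_k, y_k) \text{ with }
  d(z_i, z_k) \le \epsilon \text{ and } y_k \ne y_i. \\
&\text{Then } d(y_i, y_j) = 0 < d(y_i, y_k)
  \;\Rightarrow\; d(z_i, z_j) < d(z_i, z_k) \le \epsilon
  \quad \text{(ordinality)}. \\
&\text{Since } \epsilon > 0 \text{ is arbitrary, } d(z_i, z_j) = 0.
\end{align*}
```

Under this reading, the requirement $y' \ne y$ in the assumption is what makes the strict inequality on the target side available.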

Conversely, if the representations can be correctly mapped to the target and are perfectly tightened, then the representations collapse into a manifold homeomorphic to the target space (e.g., collapse into a single line when the target space is a line) [(Zhang et al., 2024), Proposition 2]. Thus, ordinality will be perfectly preserved locally. Note that preserving ordinality globally constrains the line to be straight, which is not necessary.

3.3 REGRESSION TIGHTNESS

Why are additional efforts to emphasize ordinality necessary? In this work, we find that standard deep regressors are weak in their ability to tighten the representations due to the gradient update direction with respect to the representations. Consider a fixed linear regressor with a typical regression loss (e.g., MSE, L1), which has the following gradient with respect to $z_i$:

$$\frac{\partial L_{reg}}{\partial z_i} = L'_{reg}(\theta^\top z_i - y_i)\,\theta^\top. \quad (1)$$

Here, the direction of $\frac{\partial L_{reg}}{\partial z_i}$ is determined solely by the direction of $\theta$. As such, during learning, all the $z_i$ are moved along the direction of $\theta$ (toward it or away from it). This movement can be regarded as a probability density shift (Sonoda & Murata, 2019), so regression suffers from a weak ability to change probability density in directions perpendicular to $\theta$, which indicates a limited ability to tighten representations in those directions. In other words, regressors can only move $z_i$ to $S_{y_i}$, but cannot tighten $S_{y_i}$, where $S_{y_i} = \{z \mid f_\theta(z) = y_i\}$ is the solution space of $f_\theta(z) = y_i$. More generally, for a differentiable regressor, we have the following:

Theorem 2. Assume $f_\theta$ is differentiable and $S_{\hat{y}_i}$ is a convex set, then $\forall z'_i, z'_j \in S_{\hat{y}_i}$:

$$\frac{\partial L_{reg}}{\partial z_i}\,(z'_i - z'_j) = 0, \quad (2)$$

where $\hat{y}_i$ is the predicted target of $z_i$.

The detailed proof of Theorem 2 is given in Appendix A.2. The regressor $f_\theta$ is generally differentiable for gradient backpropagation, and $S_{y_i}$ is commonly a convex set with widely used regressors, such as the linear regressor. Theorem 2 shows that the gradient with respect to the representation will be perpendicular to its solution space and has no effect within the solution space. In other words, with a fixed regressor, the gradient only moves the representations to the corresponding solution space and lags in its ability to tighten the feature space.

¹ In this work, a classifier represents a single $\theta_j$ rather than the whole set $\{\theta_j \mid j \in \{1, \dots, K\}\}$.
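The orthogonality in Eq. (2) can be checked numerically for the linear case. The sketch below is a minimal illustration, assuming an MSE loss on a single sample and a fixed linear regressor; the helper names and the numbers are hypothetical.

```python
import numpy as np

def mse_grad_wrt_z(theta, z, y):
    """Gradient of (theta^T z - y)^2 with respect to z: 2 (theta^T z - y) theta."""
    return 2.0 * (theta @ z - y) * theta

def sample_solution_space(theta, y_hat, rng):
    """Random point z' with theta^T z' = y_hat (a hyperplane for a linear regressor)."""
    z = rng.normal(size=theta.shape)
    # Project the random point onto the hyperplane theta^T z = y_hat.
    return z + (y_hat - theta @ z) / (theta @ theta) * theta

rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 0.5])
z_i, y_i = rng.normal(size=3), 1.3

grad = mse_grad_wrt_z(theta, z_i, y_i)
y_hat = theta @ z_i                      # predicted target of z_i

# Two arbitrary points in the solution space of the predicted target.
z_a = sample_solution_space(theta, y_hat, rng)
z_b = sample_solution_space(theta, y_hat, rng)

inner = grad @ (z_a - z_b)               # vanishes up to rounding error
```

The gradient is a scalar multiple of `theta`, so it has no component along any direction that stays inside the hyperplane, which is the sense in which the loss cannot tighten the solution space.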

In reality, the regressor is not fixed (i.e., it is updated during training), and the solution space also changes during training. In the case of a linear regressor, the gradient with respect to $\theta$ over a batch of $b$ samples is given by:

$$\frac{\partial L_{reg}}{\partial \theta} = \frac{1}{b} \sum_{i=1}^{b} L'_{reg}(\theta^\top z_i - y_i)\, z_i^\top = \frac{1}{b} \sum_{i=1}^{b} w_i z_i^\top. \quad (3)$$

Here, the direction of $\frac{\partial L_{reg}}{\partial \theta}$ will tend to be the weighted mean of the directions of the $z_i$. As discussed, the direction of $z_i$ approaches the direction of $\theta$. Thus, the $z_i$ are distributed around $\theta$ and their contributions offset each other, resulting in a limited impact on the direction of $\theta$.

It is worth mentioning that the tightness here is specific to H(Z|Y = yi) within Syi, which is indirectly related to the predicted results and performance. By contrast, the tightness outside Syi directly affects the predicted results and potentially plays a more important role.

3.4 COMPARISON IN TIGHTNESS FOR CLASSIFICATION

When comparing classification with regression, we find that classification has greater flexibility to tighten representations in diverse directions $\theta_k$, suggesting an ability to better tighten the representation. The gradient with respect to $z_i$ over a batch of $b$ samples is:

$$\frac{\partial L_{CE}}{\partial z_i} = \frac{\partial \left(-\frac{1}{b} \sum_{i=1}^{b} \theta_{y_i}^\top z_i\right)}{\partial z_i} + \frac{\partial \left(\frac{1}{b} \sum_{i=1}^{b} \log \sum_{j=1}^{K} e^{\theta_j^\top z_i}\right)}{\partial z_i} = \frac{1}{b} \left(\sum_{j=1}^{K} p_{ij} \theta_j^\top - \theta_{y_i}^\top\right), \quad (4)$$

where $p_{ij} = \frac{\exp(\theta_j^\top z_i)}{\sum_{k=1}^{K} \exp(\theta_k^\top z_i)}$ is the probability of sample $i$ belonging to class $j$. Here, the direction of $\frac{\partial L_{CE}}{\partial z_i}$ is affected by all $\theta_k$, and $z_i$ will approach $\theta_{y_i}$ with training. In contrast, the direction of $\frac{\partial L_{reg}}{\partial z_i}$ is purely determined by $\theta$. Classification moves $z_i$ toward its corresponding classifier $\theta_{y_i}$ even if the sample is correctly classified, whereas regression has no effect on $z_i$ if it is correctly predicted (i.e., $\frac{\partial L_{reg}}{\partial z_i} = 0$). This suggests that classification has a higher ability to tighten representations in the solution space $S_{y_i}$. Here, $S_{y_i}$ for classification is defined as the set of $z_i$ that are classified as class $y_i$.

In reality, the classifiers $\theta_k$ are not fixed and are updated with training. The gradient with respect to the $k$th classifier $\theta_k$ over a batch of $b$ samples is given by:

$$\frac{\partial L_{CE}}{\partial \theta_k} = -\frac{1}{b} \sum_{i: y_i = k} z_i^\top + \frac{1}{b} \sum_{i=1}^{b} \frac{e^{\theta_k^\top z_i}}{\sum_{j=1}^{K} e^{\theta_j^\top z_i}}\, z_i^\top = \frac{1}{b} \sum_{i=1}^{b} (p_{ik} - \delta_{y_i,k})\, z_i^\top, \quad (5)$$

where $p_{ik} = \frac{e^{\theta_k^\top z_i}}{\sum_{j=1}^{K} e^{\theta_j^\top z_i}}$ is the probability that sample $i$ belongs to class $k$, and $\delta_{y_i,k}$ is the Kronecker delta function. For classification, the direction of $\theta_k$ will be biased toward the $z_i$ belonging to class $k$, while each $z_i$ will also be biased toward its corresponding classifier. In contrast, for regression, the direction of $\theta$ will tend to be the weighted mean of the directions of the $z_i$. Thus, the effects of many $z_i$ on the direction of $\theta$ will offset each other and have a limited impact. As a result, changes in the directions of $\theta_k$ are generally greater than the change in the direction of $\theta$ in regression, and therefore classification can move $z$ more flexibly and thus potentially better tighten the representation.
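The closed forms in Eqs. (4) and (5) can be verified against finite differences. The sketch below handles a single sample (b = 1) with NumPy; all function names and the random data are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def ce_loss(Theta, z, label):
    """Cross-entropy of one sample; Theta has one row theta_k per class."""
    return -np.log(softmax(Theta @ z)[label])

def ce_grad_wrt_z(Theta, z, label):
    """Eq. (4) for b = 1: sum_j p_j theta_j - theta_y."""
    p = softmax(Theta @ z)
    return p @ Theta - Theta[label]

def ce_grad_wrt_theta(Theta, z, label):
    """Eq. (5) for b = 1: row k equals (p_k - delta_{y,k}) z."""
    p = softmax(Theta @ z)
    p[label] -= 1.0
    return np.outer(p, z)

def numerical_grad_z(Theta, z, label, eps=1e-6):
    """Central finite differences of the loss with respect to z."""
    g = np.zeros_like(z)
    for d in range(z.size):
        step = np.zeros_like(z)
        step[d] = eps
        g[d] = (ce_loss(Theta, z + step, label)
                - ce_loss(Theta, z - step, label)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
Theta = rng.normal(size=(5, 3))   # K = 5 classifiers, d = 3 features
z = rng.normal(size=3)
label = 2

analytic = ce_grad_wrt_z(Theta, z, label)
numeric = numerical_grad_z(Theta, z, label)
grad_theta = ce_grad_wrt_theta(Theta, z, label)
```

Unlike the regression gradient, `analytic` is a mixture of all K classifier directions, so its direction changes from sample to sample.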

Our analysis in Sec. 3 inspires us to tighten the regression representations. To this end, we introduce the Multiple Target (MT) strategy and the Regression Optimal Transport Regularizer (ROT-Reg) to tighten the representations globally (i.e., $\min_Z H(Z|Y)$) and locally (i.e., $\min_Z H(Z|Y = y_i)$ with $Z \in B(z_i, \epsilon)$, where $\epsilon$ controls the degree of locality). Inspired by the effect of multiple classifiers in classification, the MT strategy introduces additional targets as constraints to compress the representations. ROT-Reg encourages the representations to have local structures similar to those of the targets, which implicitly tightens the representations.

Figure 1: Illustration of the MT strategy. Changing the target from y to [y, y] will introduce an additional regressor to predict the additional y. The original solution space Sy0 is a line in the feature manifold. The additional y introduces a new constraint, tightening Sy0 from a line to a point.

4.1 TARGET SPACE WITH EXTRA DIMENSIONS BETTER TIGHTENS THE FEATURE SPACE

Our analysis in Sec. 3.4 suggests that classification outperforms regression in its ability to compress representations in multiple directions, resulting from multiple classifiers. Inspired by this, we introduce a simple yet efficient strategy, which adds extra dimensions to the target space to bring in extra regressors as constraints. Here, the additional regressors have a similar effect to individual classifiers. As shown in Figure 1, the additional constraints will result in a lower-dimensional $S_{y_i}$, which indicates higher compression. The number of additional targets depends on the intrinsic dimension of the feature manifold. In our Multiple Targets strategy, the final predicted target is the average over the multiple predicted targets:

$$\hat{y}_i = \frac{1}{M} \sum_{t=1}^{M} \hat{y}_i^t, \quad (6)$$

where $M$ is the total number of target dimensions and $\hat{y}_i^t$ is the $t$th predicted target.
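A minimal NumPy sketch of the MT strategy with linear regressors follows. It assumes that each of the M heads is trained against the same duplicated target with an MSE loss; the function names are hypothetical, and `solution_space_dim` only illustrates how extra independent regressors shrink the linear solution space, as in Figure 1.

```python
import numpy as np

def mt_forward(Z, Theta):
    """Z: (n, d) features, Theta: (M, d) stacked regressors.

    Returns the per-target predictions (n, M) and their average (n,), as in Eq. (6).
    """
    per_target = Z @ Theta.T
    return per_target, per_target.mean(axis=1)

def mt_mse(per_target, y):
    """MSE against the target duplicated M times (assumed training loss)."""
    return float(np.mean((per_target - y[:, None]) ** 2))

def solution_space_dim(Theta):
    """Dimension of {z : Theta z = y * 1} when it is non-empty: d - rank(Theta)."""
    return Theta.shape[1] - np.linalg.matrix_rank(Theta)

Z = np.array([[1.0, 2.0, 3.0]])
theta_single = np.array([[1.0, 0.0, 0.0]])                  # M = 1
theta_multi = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # M = 2

per_target, y_hat = mt_forward(Z, theta_multi)
loss = mt_mse(per_target, np.array([1.5]))
```

In a three-dimensional feature space, one regressor leaves a two-dimensional solution space while two independent regressors leave a one-dimensional one, which is the tightening effect the extra targets aim for.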

4.2 REGRESSION OPTIMAL TRANSPORT REGULARIZER (ROT-REG)

The MT strategy tightens the representations globally through additional regressors. We propose to further tighten the representations locally. Specifically, we preserve the local similarity relations between the target and representation space. The local similarities are characterized by a self entropic optimal transport model (Yan et al., 2024; Landa et al., 2021). The model determines the optimal plan for moving a set of samples onto the set itself with minimal transport cost, under the constraint that no sample can be moved to itself.

Formally, given a set $S = \{s_1, \dots, s_n\}$, the corresponding weight vector $p \in \mathbb{R}^n$ reflects how much mass each sample has, where the weights satisfy the simplex constraint $\sum_{i=1}^{n} p_i = 1$. Usually, one can simply implement $p$ as a uniform distribution, i.e., $p_i = \frac{1}{n}, \forall i \in [n]$. $C^S_{ij}$ is the transport cost between $s_i$ and $s_j$, which is generally taken to be the Euclidean distance between the samples, and $T^S_{ij}$ indicates how much mass is transported from the location of $s_i$ to that of $s_j$. The self entropic optimal transport is defined as follows:

$$\mathcal{T}(S) = \arg\min_{T} \langle C^S, T \rangle + \gamma\, \Omega(T) \quad (7)$$

$$\text{s.t.} \quad T \mathbf{1}_n = p, \quad T^\top \mathbf{1}_n = p, \quad T_{ii} = 0 \ \ \forall i \in [n], \quad (8)$$

where $\gamma$ is a trade-off parameter, and $\Omega(T) = \sum_{i=1}^{n} \sum_{j=1}^{n} T_{ij} \log T_{ij}$ is the negative entropic regularization, which is used to smooth the solution and speed up the optimization (Cuturi, 2013).

Given the solution $\widetilde{T}^S$ minimizing the above objective, the element $\widetilde{T}^S_{ij}$ measures the similarity relation between samples $s_i$ and $s_j$, since two samples with a large distance $C^S_{ij}$ will induce a small transport mass $\widetilde{T}^S_{ij}$ between them. As a result, the optimal total transport cost $\langle C^S, \widetilde{T}^S \rangle$ reflects the tightness of the samples.

Motivated by this, we employ the self optimal transport model to capture the local similarity relations of the target and representation spaces, respectively, and encourage relation consistency between the two spaces. Specifically, we first construct a self optimal transport model on the target space to obtain $\widetilde{T}^Y = \mathcal{T}(Y)$, which describes the local similarity relations between the regression targets. After that, we learn regression representations $Z$ such that the corresponding optimal transport matrix $\widetilde{T}^Z = \mathcal{T}(Z)$ is consistent with $\widetilde{T}^Y$, which is achieved by the following loss function:

$$L_{ot} = \left| \langle C^Z, \widetilde{T}^Y \rangle - \langle C^Z, \widetilde{T}^Z \rangle \right|. \quad (9)$$

ROT-Reg updates $C^Z$ through gradient backpropagation to minimize $L_{ot}$. In contrast, directly minimizing the gap between $\widetilde{T}^Y$ and $\widetilde{T}^Z$ can introduce optimization challenges, as the two matrices are obtained iteratively through the Sinkhorn algorithm rather than simply through gradient backpropagation. In addition, directly minimizing $\|C^Z - C^Y\|_F$ imposes an overly strict constraint on the feature manifold, forcing it to become identical to the target space, which is unnecessary.

It is worth mentioning that $\gamma$ controls the smoothness of the transport plan $\widetilde{T}$ and determines the degree of locality. When $\gamma = 0$, $\widetilde{T}$ will approach the minimal spanning tree (i.e., each sample only transports mass to its nearest neighbor), and $L_{ot}$ will encourage the representations to have the same minimal spanning tree as the targets, which has been shown to be a strategy for preserving the topology of the target space (Moor et al., 2020). In fact, the topological autoencoder (Moor et al., 2020) preserves topological information in this way. Compared to the topological autoencoder, $L_{ot}$ captures more local structures of the targets when $\gamma > 0$. The final loss $L_f$ sums the task-specific loss $L_t$ and the regularizer with a trade-off hyper-parameter $\lambda$:

$$L_f = L_t + \lambda L_{ot}. \quad (10)$$
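The regularizer can be sketched with a plain Sinkhorn iteration in NumPy. This is an illustrative sketch under several assumptions, not the released implementation: the marginals are uniform, the constraint $T_{ii} = 0$ is imposed by zeroing the diagonal of the Gibbs kernel, the loss is read as an absolute difference of transport costs, and the function names are hypothetical. NumPy has no automatic differentiation, so the code only evaluates the loss; in training, the two plans would act as constants and gradients would flow through the feature cost matrix.

```python
import numpy as np

def pairwise_euclidean(S):
    """Cost matrix of Euclidean distances between the rows of S, shape (n, n)."""
    diff = S[:, None, :] - S[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def self_entropic_ot(C, gamma, n_iters=1000):
    """Sinkhorn iterations for the self entropic OT plan with uniform marginals."""
    n = C.shape[0]
    p = np.full(n, 1.0 / n)
    K = np.exp(-C / gamma)        # Gibbs kernel
    np.fill_diagonal(K, 0.0)      # forbids moving a sample to itself (T_ii = 0)
    u, v = np.ones(n), np.ones(n)
    for _ in range(n_iters):
        u = p / (K @ v)           # match row marginals
        v = p / (K.T @ u)         # match column marginals
    return u[:, None] * K * v[None, :]

def rot_loss(Z, Y, gamma):
    """|<C^Z, T^Y> - <C^Z, T^Z>| for features Z (n, d) and targets Y (n, m)."""
    C_Z = pairwise_euclidean(Z)
    T_Y = self_entropic_ot(pairwise_euclidean(Y), gamma)
    T_Z = self_entropic_ot(C_Z, gamma)
    return float(abs(np.sum(C_Z * T_Y) - np.sum(C_Z * T_Z)))

rng = np.random.default_rng(0)
Y = rng.uniform(size=(8, 1))      # scalar regression targets
Z = rng.uniform(size=(8, 3))      # toy features
T_Y = self_entropic_ot(pairwise_euclidean(Y), gamma=0.5)
loss_value = rot_loss(Z, Y, gamma=0.5)
```

When the features reproduce the target geometry exactly, the two plans coincide and the loss is zero, so `rot_loss(Y, Y, gamma)` is a quick sanity check.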

5 EXPERIMENTS

We experiment on three deep regression tasks: age estimation, depth estimation, and coordinate prediction, and compare with RankSim (Gong et al., 2022), Ordinal Entropy (OE) (Zhang et al., 2023), and PH-Reg (Zhang et al., 2024). RankSim explicitly preserves ordinality and serves as the ordinality baseline. OE leverages classification for better regression representations and serves as a regression baseline. PH-Reg preserves the topological structure of the target space with the topological autoencoder (Moor et al., 2020) and tightens the representation with Birdal's regularizer (Birdal et al., 2021), serving as a topology baseline. More details are given in Appendix B.1.

5.1 REAL-WORLD DATASETS: AGE ESTIMATION AND DEPTH ESTIMATION

For age estimation, we use AgeDB-DIR (Yang et al., 2021) and evaluate using the Mean Absolute Error (MAE). γ and λ are set to 0.1 and 100, respectively. For depth estimation, we use NYUD2-DIR (Yang et al., 2021) and evaluate using the root mean squared error (RMSE) and the threshold accuracy δ1. γ and λ are set to 0.05 and 10, respectively. We set the total target dimension M to 8 for both tasks. Both AgeDB-DIR and NYUD2-DIR contain three disjoint subsets (i.e., Many, Med, and Few) divided from the whole set. We use the regression baseline models of (Yang et al., 2021), which use ResNet-50 (He et al., 2016) as the backbone, and follow their settings for both tasks.

Tables 1 and 2 show results on age estimation and depth estimation, respectively. Both the Multiple Targets strategy (MT) and Lot improve regression performance, and combining both further boosts performance. Specifically, combining both achieves a 0.52 overall improvement (i.e., ALL) in MAE on age estimation, and a 0.156 reduction of RMSE on depth estimation.

5.2 Lot PRESERVES THE LOCAL SIMILARITY RELATIONSHIPS

The effectiveness of Lot is verified with the coordinate prediction task from Zhang et al. (2024). This task predicts data coordinates sampled from manifolds such as Mammoth, Torus, and Circle, which have different topologies. The inputs are noisy data samples and the goal is to recover the true data coordinates. Figure 2 shows that Lot successfully preserves the similarity relationships of the targets, resulting in a feature manifold similar to the targets. Quantitative comparisons in Table 3 indicate that Lot performs similarly to PH-Reg, which is specifically designed to preserve similarity relationships. However, the Multiple Targets (MT) strategy has a limited impact in this context, likely because the target space is three-dimensional, providing sufficient constraints for the feature manifold.

Table 1: Quantitative comparison (MAE) on AgeDB-DIR. We report results as mean ± standard deviation over 10 runs. Bold numbers indicate the best performance.

| Method | ALL | Many | Med. | Few |
|---|---|---|---|---|
| Baseline | 7.80 ± 0.12 | 6.80 ± 0.06 | 9.11 ± 0.31 | 13.63 ± 0.43 |
| + RankSim | 7.62 ± 0.13 | 6.70 ± 0.10 | 8.90 ± 0.33 | 12.74 ± 0.48 |
| + OE | 7.65 ± 0.13 | 6.72 ± 0.09 | 8.77 ± 0.49 | 13.28 ± 0.73 |
| + PH-Reg | 7.32 ± 0.09 | **6.50 ± 0.15** | 8.38 ± 0.11 | 12.18 ± 0.38 |
| + MT | 7.67 ± 0.06 | 6.72 ± 0.08 | 8.87 ± 0.13 | 13.36 ± 0.16 |
| + Lot | 7.36 ± 0.08 | 6.55 ± 0.07 | 8.40 ± 0.14 | 12.14 ± 0.33 |
| + MT + Lot | **7.28 ± 0.05** | 6.52 ± 0.10 | **8.26 ± 0.19** | **11.86 ± 0.24** |

Table 2: Quantitative comparison on NYUD2-DIR.

| Method | RMSE ALL | RMSE Many | RMSE Med. | RMSE Few | δ1 ALL | δ1 Many | δ1 Med. | δ1 Few |
|---|---|---|---|---|---|---|---|---|
| Baseline | 1.477 | 0.591 | 0.952 | 2.123 | 0.677 | 0.777 | 0.693 | 0.570 |
| + RankSim | 1.522 | 0.565 | 0.889 | 2.213 | 0.666 | 0.791 | 0.735 | 0.513 |
| + OE | 1.419 | 0.671 | 0.925 | 2.005 | 0.668 | 0.727 | 0.702 | 0.596 |
| + PH-Reg | 1.450 | 0.789 | 0.911 | 2.002 | 0.620 | 0.621 | 0.680 | 0.596 |
| + MT | 1.367 | 0.605 | 0.854 | 1.952 | 0.715 | 0.776 | 0.759 | 0.636 |
| + Lot | 1.353 | 0.654 | 0.934 | 1.899 | 0.689 | 0.736 | 0.697 | 0.638 |
| + MT + Lot | 1.321 | 0.685 | 0.951 | 1.829 | 0.701 | 0.731 | 0.689 | 0.675 |

Table 3: Results (Lmse) on the coordinate prediction dataset. We report results as mean ± standard deviation over 10 runs. Bold numbers indicate the best performance.

| Method | Mammoth | Torus | Circle |
|---|---|---|---|
| Baseline | 211 ± 55 | 3.01 ± 0.11 | 0.154 ± 0.006 |
| + InfDrop | 367 ± 50 | 2.05 ± 0.04 | 0.093 ± 0.003 |
| + OE | 187 ± 88 | 2.83 ± 0.07 | 0.114 ± 0.007 |
| + Topological Autoencoder | 80 ± 61 | 0.95 ± 0.05 | 0.036 ± 0.004 |
| + PH-Reg | **49 ± 27** | **0.61 ± 0.05** | 0.013 ± 0.008 |
| + MT | 174 ± 76 | 2.99 ± 0.11 | 0.152 ± 0.005 |
| + Lot | 87 ± 26 | 0.77 ± 0.02 | **0.010 ± 0.001** |
| + MT + Lot | 76 ± 53 | 0.75 ± 0.03 | **0.010 ± 0.001** |

Figure 2: Visualization of the feature manifolds, which shows that Lot preserves the local similarity relationships of the target space. Recoverable panel titles: (a) Target Space, (b) Regression, (c) + PH-Reg.

Table 4: Correlation between feature and label similarities. Results are mean ± std dev over 10 runs.

| Method | RMSE (ALL) | Cosine: Spearman's | Cosine: Kendall's | Cosine: volume | Euclidean: Spearman's | Euclidean: Kendall's | Euclidean: volume |
|---|---|---|---|---|---|---|---|
| Baseline | 1.477 | 0.39 ± 0.15 | 0.27 ± 0.11 | 0.573 ± 0.071 | 0.35 ± 0.14 | 0.24 ± 0.10 | 7.72 ± 0.92 |
| + RankSim | 1.522 | 0.09 ± 0.04 | 0.06 ± 0.03 | 0.000 ± 0.000 | 0.60 ± 0.16 | 0.44 ± 0.13 | 4.26 ± 1.29 |
| + MT | 1.367 | 0.49 ± 0.14 | 0.34 ± 0.10 | 0.492 ± 0.047 | 0.47 ± 0.11 | 0.33 ± 0.08 | 6.56 ± 1.18 |
| + Lot | 1.353 | 0.48 ± 0.16 | 0.34 ± 0.12 | 0.010 ± 0.003 | 0.42 ± 0.15 | 0.29 ± 0.11 | 5.57 ± 0.77 |
| + MT + Lot | 1.321 | 0.64 ± 0.09 | 0.46 ± 0.08 | 0.006 ± 0.002 | 0.61 ± 0.11 | 0.44 ± 0.09 | 4.16 ± 0.51 |

Figure 3: Visualizations of the feature manifold on NYUD2-DIR for depth estimation. Preserving the ordinality (+ RankSim) has an effect similar to MT, which explicitly tightens the representations. Recoverable panel titles: (a) Baseline, (b) + RankSim.

5.3 TIGHTNESS AND ORDINALITY AFFECT EACH OTHER

Compression for a better ordinality. We examine the impact of tightness on the ordinality. Table 4 presents the Spearman's rank correlation coefficient (Spearman, 1961) and the Kendall rank correlation coefficient (Kendall, 1938) between the feature similarities (based on Cosine distance and Euclidean distance) and the label similarities. The two correlation coefficients measure how well ordinality is preserved. Since tightness compresses the feature manifold, reducing its volume, we use volume as a proxy for tightness, and the volume is approximated by the mean of the similarities between samples. The experiments are conducted on NYUD2-DIR, where we randomly sample 1000 pixels from a batch of 8 test images. The label similarities are calculated as the Euclidean distances between the 1000 pixels, while the corresponding feature similarities are the distances between their corresponding representations. The results in Table 4 show that standard regression fails to preserve the ordinality, while MT and Lot both improve the ordinality, although they are designed to tighten the representations. Combining both has a similar effect on preserving ordinality as RankSim, which is specifically designed for this purpose. The lower volumes of our method compared to the baseline indicate that the feature manifold is more compressed. We provide the visualization of the feature similarities in Appendix C.1.
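The measurements described above can be sketched with NumPy alone. This is a simplified illustration under stated assumptions: the rank correlations ignore ties (tau-a for Kendall), only the Euclidean variant is shown, the data is synthetic, and all function names are hypothetical.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distances between all unordered pairs of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    return D[np.triu_indices(len(X), k=1)]

def ranks(a):
    """Ranks 0..n-1 of the entries of a (assumes no ties)."""
    order = np.argsort(a)
    r = np.empty(len(a))
    r[order] = np.arange(len(a))
    return r

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

def kendall(a, b):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    sa = np.sign(a[:, None] - a[None, :])
    sb = np.sign(b[:, None] - b[None, :])
    n = len(a)
    return float((sa * sb).sum() / (n * (n - 1)))

def volume_proxy(X):
    """Mean pairwise distance, used here as a stand-in for manifold volume."""
    return float(pairwise_distances(X).mean())

rng = np.random.default_rng(0)
Y = rng.uniform(size=(50, 1))               # toy labels
Z = np.hstack([2.0 * Y, np.zeros((50, 2))]) # features that scale the labels

d_label, d_feat = pairwise_distances(Y), pairwise_distances(Z)
rho, tau = spearman(d_label, d_feat), kendall(d_label, d_feat)
```

Because the toy features are an exact rescaling of the labels, both coefficients equal one here; real features would give values below one, as in Table 4.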

Ordinality for a better compression. To further verify that preserving ordinality leads to better compression, we visualize the feature manifold of the depth estimation task in 3D space. This is done by changing the last hidden layer's feature space to three dimensions. As shown in Figure 3, explicitly preserving the ordinality (i.e., + RankSim) compresses the feature manifold into a thin line, which shows a similar effect to explicitly tightening the representations (i.e., + MT).

5.4 TIGHTNESS OF REGRESSION

Our theoretical analysis in Sec. 3 focuses on the gradient direction of representations. However, in reality, the neural network updates its parameters to update the representations indirectly. Here, we verify our analysis by visualizing the update of z and θ in the depth estimation task.

The update of z. We change the last hidden layer's feature space to two dimensions for visualization. We randomly sample 1000 pixels from a batch of 8 images in the NYUD2-DIR test set to visualize the feature manifold. Figure 4(a) displays the feature manifolds at epoch 1 (blue dots) and epoch 10 (the final epoch, red dots); the corresponding pixel representations are connected by black arrows. In line with our theoretical analysis, the directions of the representation updates follow the direction of θ. To verify this quantitatively, we calculate the principal component of the update directions using PCA. We find that the cosine distance between this principal component and θ is very small (0.03), indicating that the updating directions of representations from the beginning to the end of training follow the direction of θ. The visualization also shows that the feature manifold is tightened only to a limited extent in the direction perpendicular to θ throughout training. The visualizations of feature manifolds at each epoch are provided in Appendix C.2, which reveals that the tightening effect in the direction perpendicular to θ between adjacent epochs is even smaller.

Figure 4: (a) Visualization of the z update, which aligns with θ, (b) the θ update, which is steady throughout the training process, (c) the updating directions of the θs, which are distributed along a line centered at the origin.
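The quantitative check described above can be sketched as follows. The snippet uses synthetic update vectors rather than real features, and it assumes that the cosine distance is taken up to sign, since a principal component has no preferred orientation; the function names are hypothetical.

```python
import numpy as np

def first_principal_component(V):
    """Leading PCA direction of the rows of V, obtained from an SVD of centered data."""
    Vc = V - V.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Vc, full_matrices=False)
    return Vt[0]

def cosine_distance_up_to_sign(a, b):
    """1 - |cos(a, b)|, so that a and -a count as the same direction."""
    cos = abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos

rng = np.random.default_rng(0)
theta = np.array([3.0, 4.0])

# Synthetic representation updates: mostly along theta, plus small isotropic noise.
scales = rng.uniform(0.5, 1.5, size=200)
updates = scales[:, None] * theta + 0.01 * rng.normal(size=(200, 2))

pc = first_principal_component(updates)
dist = cosine_distance_up_to_sign(pc, theta)
```

A small `dist` means the dominant direction of the updates coincides with `theta`, which is the pattern reported for Figure 4(a).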

The update of θ. As discussed in Sec. 3, the effects of the zi on the direction of θ tend to offset each other and result in a limited impact, while changes in the directions of θk in classification are generally greater. Here we verify this quantitatively by calculating the cosine distance between θ at each epoch from 1 to 10 (the final epoch) and the final θ, i.e., θ at epoch 10. We also convert this regression task into a classification task by uniformly discretizing the target range into 10 classes, and monitor the change of θk in the same way. As shown in Figure 4(b), the changes in θk are all larger than the changes in θ. The maximum cosine distance between θ at different epochs is very small (i.e., 0.0004), which also verifies the limited change of θ.

Multiple θs. Adding additional θs (our MT strategy) with random initialization does not change the update speed of θ (see Figure 7 in Appendix C.3). The update directions of all θs are even aligned. Let $v^i_\theta = \theta^{i+1} - \theta^i$ be the update vector of θ at iteration $i$. Figure 4(c) plots the set of points $\{v^i_\theta \mid i = 500k,\ k \in \mathbb{Z},\ 0 \le k \le 100\}$ for three θs, which visualizes the change of the θs throughout the training process. The update vectors of the three θs are distributed along a line, with the origin as the center. When we calculate the principal components of $\{v_\theta\}$ for the three θs using PCA, the maximum cosine distance between the three principal components is less than 1e-4, which shows quantitatively that all θs are updated along the same direction. As shown in Eq. 3, $v_\theta$ is the weighted mean of a batch of $z$. Different θs lead to different magnitudes of the weighted mean, while the directions remain steady. It is worth mentioning that the multiple θs do not collapse to a single θ in practice, although their update directions are the same. This is because the θs are randomly initialized and their directions remain nearly unchanged during training (see Figure 4(b)), for three reasons: 1) The magnitude of $\frac{\partial L_{reg}}{\partial \theta}$ is scaled by $w_i$, and $w_i$ often follows a Gaussian distribution centered at the origin, as assumed in models like Bayesian linear regression. When $w$ and $z$ are independent or weakly dependent, $\mathbb{E}[w_i z_i]$ approaches 0, causing $\frac{\partial L_{reg}}{\partial \theta}$ to be scaled toward 0. 2) According to the central limit theorem, the updates of θ follow a Gaussian distribution. This causes partial offsets between updates and results in a reduced accumulated effect. In addition, we empirically observe that the mean of this Gaussian distribution approaches 0 (see Figure 4(c)), indicating that $w$ and $z$ are independent or weakly dependent. 3) The effects of the $z_i$ on the direction of $\frac{\partial L_{reg}}{\partial \theta}$ over a batch of samples offset each other, resulting in the stability of the direction of θ throughout training. This occurs because $z_i$ tends to be distributed around θ. More details can be found in Appendix C.3.
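The PCA comparison of update directions can be illustrated with a small NumPy sketch. It assumes snapshots of each θ are available as rows of an array; the synthetic data, the helper names, and the use of a sign-agnostic cosine distance are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def update_vectors(thetas):
    """v^i = theta^{i+1} - theta^i for a (T, d) array of snapshots."""
    return np.diff(thetas, axis=0)

def leading_component(vectors):
    """First principal component of a set of row vectors, via SVD."""
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def unsigned_cosine_distance(a, b):
    """Cosine distance that ignores the sign ambiguity of principal components."""
    cos = abs(float(a @ b)) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos

# Synthetic example: three thetas start at different random points, but
# their updates lie (almost) along one shared direction with different scales.
rng = np.random.default_rng(0)
shared_direction = np.array([1.0, 2.0, -1.0]) / np.sqrt(6.0)
components = []
for scale in (0.5, 1.0, 2.0):
    steps = scale * rng.normal(size=(100, 1)) * shared_direction
    steps = steps + 1e-4 * rng.normal(size=(100, 3))  # small off-axis noise
    start = rng.normal(size=3)                        # random initialization
    thetas = start + np.vstack([np.zeros(3), np.cumsum(steps, axis=0)])
    components.append(leading_component(update_vectors(thetas)))

max_distance = max(
    unsigned_cosine_distance(components[i], components[j])
    for i in range(3) for j in range(i + 1, 3)
)
print(max_distance)
```

In this toy setting the three θs never coincide, because their starting points differ, yet their leading principal components agree, mirroring the observation that aligned update directions do not imply collapse to a single θ.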

5.5 ABLATION STUDY

We conduct the ablation study on AgeDB-DIR for age estimation. The results are given in Table 5.

Table 5: Ablation study on AgeDB-DIR for age estimation. We report MAE as mean ± standard deviation over 10 runs; the default λ, γ and M are set to 100, 0.1 and 8, respectively.

| Method | Setting | ALL | Many | Med. | Few |
|---|---|---|---|---|---|
| Baseline | - | 7.80 ± 0.12 | 6.80 ± 0.06 | 9.11 ± 0.31 | 13.63 ± 0.43 |
| + $L_{ot}$ | λ = 1 | 7.68 ± 0.08 | 6.75 ± 0.11 | 8.81 ± 0.19 | 13.38 ± 0.37 |
| | λ = 10 | 7.55 ± 0.05 | 6.64 ± 0.07 | 8.71 ± 0.12 | 12.88 ± 0.35 |
| | λ = 100 | 7.36 ± 0.08 | 6.55 ± 0.07 | 8.40 ± 0.14 | 12.14 ± 0.33 |
| | λ = 1000 | 8.80 ± 0.19 | 7.17 ± 0.10 | 11.32 ± 0.48 | 17.26 ± 0.53 |
| | γ = 0.1 | 7.36 ± 0.08 | 6.55 ± 0.07 | 8.40 ± 0.14 | 12.14 ± 0.33 |
| | γ = 1 | 7.47 ± 0.12 | 6.61 ± 0.08 | 8.57 ± 0.31 | 12.55 ± 0.49 |
| | γ = 10 | 7.51 ± 0.07 | 6.63 ± 0.08 | 8.60 ± 0.25 | 12.75 ± 0.31 |
| + MT | M = 2 | 7.72 ± 0.11 | 6.77 ± 0.07 | 8.92 ± 0.20 | 13.37 ± 0.50 |
| | M = 4 | 7.74 ± 0.06 | 6.77 ± 0.10 | 8.96 ± 0.13 | 13.52 ± 0.41 |
| | M = 8 | 7.67 ± 0.06 | 6.72 ± 0.08 | 8.87 ± 0.13 | 13.36 ± 0.16 |
| | M = 16 | 7.70 ± 0.11 | 6.73 ± 0.13 | 8.93 ± 0.25 | 13.44 ± 0.36 |
| | M = 32 | 7.71 ± 0.08 | 6.74 ± 0.10 | 8.96 ± 0.32 | 13.40 ± 0.32 |
| | noise | 8.00 ± 0.23 | 6.91 ± 0.20 | 9.43 ± 0.33 | 14.36 ± 0.67 |

Table 6: Time and memory consumption.

| Method | Time (mins) | Memory (MB) |
|---|---|---|
| Baseline | 65 | 14433 |
| + MT | 70 | 14457 |
| + $L_{ot}$ | 74 | 14587 |
| + MT + $L_{ot}$ | 82 | 14689 |

Hyperparameters λ, γ. We keep λ and γ at their default values (100 and 0.1) and vary each individually to examine its impact. As shown in Table 5, the MAE (ALL) decreases consistently as λ increases from 1 to 100. However, when λ is set too high (e.g., 1000), the regularizer overwhelms the task-specific learning objective and performance degrades. For γ, the MAE (ALL) decreases consistently as γ decreases. However, we empirically find that setting γ too low (e.g., 0.01) easily results in NaN values when calculating the transport matrices using the Sinkhorn algorithm (Cuturi, 2013). We thus set γ = 0.1.
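One plausible mechanism for the NaN values is numerical underflow of the Gibbs kernel exp(-C/γ) in plain Sinkhorn iterations. The sketch below is a generic entropic optimal transport example between two toy point sets, not the self-transport formulation used in this paper; the cost matrix, marginals, and γ values are hypothetical, and the γ at which the failure appears depends on the scale of the costs.

```python
import numpy as np

def sinkhorn(cost, a, b, gamma, n_iters=200):
    """Plain (non-log-domain) Sinkhorn iterations for entropic OT."""
    with np.errstate(divide="ignore", invalid="ignore", under="ignore"):
        kernel = np.exp(-cost / gamma)  # underflows to 0 when cost / gamma is large
        u = np.ones_like(a)
        v = np.ones_like(b)
        for _ in range(n_iters):
            u = a / (kernel @ v)        # 0 in the denominator gives inf
            v = b / (kernel.T @ u)      # 0 * inf gives NaN
        return u[:, None] * kernel * v[None, :]

# Toy problem (hypothetical): squared-distance costs between 1 and 9.
x = np.linspace(0.0, 1.0, 5)
y = np.linspace(2.0, 3.0, 5)
cost = (x[:, None] - y[None, :]) ** 2
a = np.full(5, 0.2)
b = np.full(5, 0.2)

plan_ok = sinkhorn(cost, a, b, gamma=1.0)    # finite transport plan
plan_bad = sinkhorn(cost, a, b, gamma=1e-3)  # kernel underflows entirely
print(np.isfinite(plan_ok).all(), np.isnan(plan_bad).any())
```

Log-domain Sinkhorn updates are a common way to avoid this underflow; the source does not state whether such a variant was tried, so the default γ = 0.1 is kept as reported.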

Number of total targets M. As shown in Table 5, the performance generally improves as M increases while M ≤ 8, and stays steady when M increases further. The primary factor affecting the selection of M is the intrinsic dimension of the feature manifold, which determines how many additional constraints (i.e., M − 1) are required to compress the manifold. The range of Y has a limited impact on the selection of M, since M = 8 works well on both NYUD2-DIR (y ∈ [0.7, 10]) and AgeDB-DIR (y ∈ [0, 101]).

Mean ŷ vs. a single ŷ. We verify the mean operation in MT (see Eq. 6), which potentially brings in an ensemble effect. We find that the predictions $\hat{y}^t_i$ are very similar for all $t$. For a model with M targets, the MAE (ALL) results calculated from $\hat{y}^t_i$ for $t \in T$ have a mean of 7.579 and a standard deviation of 0.0003. Thus, the improvement of MT is not due to an ensemble effect, and the mean operation is optional.
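The check described above amounts to comparing the MAE of each individual prediction head with the MAE of their mean. The following sketch uses synthetic predictions in which all heads share almost the same error; the array shapes, noise scales, and helper name are assumptions chosen for illustration only.

```python
import numpy as np

def mae(pred, target):
    """Mean absolute error between two 1-D arrays."""
    return float(np.mean(np.abs(pred - target)))

rng = np.random.default_rng(0)
target = rng.uniform(0.0, 100.0, size=1000)

# Hypothetical setting: M = 8 heads whose predictions are nearly identical,
# i.e. one shared error term plus a tiny head-specific perturbation.
shared = target + rng.normal(scale=8.0, size=1000)
heads = shared[None, :] + rng.normal(scale=0.01, size=(8, 1000))

per_head_mae = np.array([mae(h, target) for h in heads])
mean_pred_mae = mae(heads.mean(axis=0), target)
print(per_head_mae.mean(), per_head_mae.std(), mean_pred_mae)
```

When the heads agree this closely, averaging cannot cancel their errors, so the MAE of the mean prediction matches the per-head MAE; an ensemble gain would require the head errors to be at least partly independent.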

Additional y vs. noise. Using noise as the additional targets does not work: as shown in the "noise" row of Table 5, the MAE (ALL) rises to 8.00, which is worse than the baseline (7.80).

Time and memory consumption. We monitor the time and memory consumption for training a model from beginning to end with a batch size of 128. Table 6 shows that, for the full method (+ MT + $L_{ot}$) relative to the baseline, the added memory is negligible (1.7%) and the added time is limited (17 min).

6 CONCLUSION

In this paper, for the regression task, we provide a theoretical analysis that suggests preserving ordinality enhances the representation tightness, and regression suffers from a weak ability to tighten the representations. Motivated by classification and the self entropic optimal transport, we introduce a simple yet effective method to tighten regression representations.

Acknowledgement. This research / project is supported by the Ministry of Education, Singapore, under the Academic Research Fund Tier 1 (FY2022), National Natural Science Foundation of China (62206061, U24A20233), Guangdong Basic and Applied Basic Research Foundation (2024A1515011901), Guangzhou Basic and Applied Basic Research Foundation (2023A04J1700).

REFERENCES

Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2897-2905, 2018.

Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4009-4018, 2021.

Tolga Birdal, Aaron Lou, Leonidas J Guibas, and Umut Simsekli. Intrinsic dimension, persistent homology and generalization in neural networks. Advances in Neural Information Processing Systems, 34:6776-6789, 2021.

Malik Boudiaf, Jérôme Rony, Imtiaz Masud Ziko, Eric Granger, Marco Pedersoli, Pablo Piantanida, and Ismail Ben Ayed. A unifying mutual information view of metric learning: cross-entropy vs. pairwise losses. In European Conference on Computer Vision, pp. 548-564. Springer, 2020.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/file/af21d0c97db2e27e13572cbf59eb343d-Paper.pdf.

Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taiga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, Aviral Kumar, and Rishabh Agarwal. Stop regressing: Training value functions via classification for scalable deep RL. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=dVpFKfqF3R.

Yu Gong, Greg Mori, and Frederick Tung. RankSim: Ranking similarity regularization for deep imbalanced regression. In International Conference on Machine Learning (ICML), 2022.

Kerui Gu, Linlin Yang, and Angela Yao. Dive deeper into integral pose regression. In International Conference on Learning Representations, 2022.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Kenji Kawaguchi, Zhun Deng, Xu Ji, and Jiaoyang Huang. How does information bottleneck help deep learning? In International Conference on Machine Learning, pp. 16049-16096. PMLR, 2023.

Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1-2):81-93, 1938.

Mahsa Keramati, Lili Meng, and R David Evans. ConR: Contrastive regularizer for deep imbalanced regression. In ICLR, 2023.

Boris Landa, Ronald R Coifman, and Yuval Kluger. Doubly stochastic normalization of the Gaussian kernel is robust to heteroskedastic noise. SIAM Journal on Mathematics of Data Science, 3(1):388-413, 2021.

Yanjie Li, Sen Yang, Peidong Liu, Shoukui Zhang, Yunxiao Wang, Zhicheng Wang, Wankou Yang, and Shu-Tao Xia. SimCC: A simple coordinate classification perspective for human pose estimation. In European Conference on Computer Vision, pp. 89-106. Springer, 2022.

Liang Liu, Hao Lu, Haipeng Xiong, Ke Xian, Zhiguo Cao, and Chunhua Shen. Counting objects by blockwise classification. IEEE Transactions on Circuits and Systems for Video Technology, 30(10):3513-3527, 2019.

Michael Moor, Max Horn, Bastian Rieck, and Karsten Borgwardt. Topological autoencoders. In International Conference on Machine Learning, pp. 7045-7054. PMLR, 2020.

Silvia L Pintea, Yancong Lin, Jouke Dijkstra, and Jan C van Gemert. A step towards understanding why classification helps regression. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 19972-19981, 2023.

Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.

Sho Sonoda and Noboru Murata. Transport analysis of infinitely deep neural network. Journal of Machine Learning Research, 20(2):1-52, 2019.

Charles Spearman. The proof and measurement of association between two things. 1961.

Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1-5. IEEE, 2015.

Ziyan Wang and Hao Wang. Variational imbalanced regression: Fair uncertainty quantification via probabilistic smoothing. Advances in Neural Information Processing Systems, 36, 2024.

Haipeng Xiong and Angela Yao. Discrete-constrained regression for local counting models. In European Conference on Computer Vision, pp. 621-636. Springer, 2022.

Yuguang Yan, Zhihao Xu, Canlin Yang, Jie Zhang, Ruichu Cai, and Michael Kwok-Po Ng. An optimal transport view for subspace clustering and spectral clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 16281-16289, 2024.

Yuzhe Yang, Kaiwen Zha, Yingcong Chen, Hao Wang, and Dina Katabi. Delving into deep imbalanced regression. In International Conference on Machine Learning, pp. 11842-11851. PMLR, 2021.

Shihao Zhang, Linlin Yang, Michael Bi Mi, Xiaoxu Zheng, and Angela Yao. Improving deep regression with ordinal entropy. ICLR, 2023.

Shihao Zhang, Kenji Kawaguchi, and Angela Yao. Deep regression representation learning with topology. In International Conference on Machine Learning (ICML), 2024.

A.1 PROOF OF THEOREM 1

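Theorem 1 Let $B(z, \epsilon) = \{z' \in \mathcal{Z} \mid d(z, z') \le \epsilon\}$ be the closed ball centered at $z$ with radius $\epsilon$. Assume that $\forall (x, z, y) \in P$ and $\forall \epsilon > 0$, $\exists (x', z', y') \in P$ such that $z' \in B(z, \epsilon)$ and $y' \neq y$. Then if the ordinality is perfectly preserved, $\forall (x_i, z_i, y_i), (x_j, z_j, y_j) \in P$, the following holds: $y_i = y_j \Rightarrow d(z_i, z_j) = 0$.

Proof Suppose $y_i = y_j$. By the triangle inequality,

$$
\begin{aligned}
d(z_i, z_j) &= d(z_i - z_k + z_k - z_j) && (11) \\
&\le d(z_i - z_k) + d(z_k - z_j), && (12)
\end{aligned}
$$

where $z_k \in B(z_i, \epsilon)$ with $y_k \neq y_i$, which exists by assumption. Since $d(y_k, y_j) \le d(y_k, y_i)$, and the ordinality is perfectly preserved, we have:

$$
d(z_k - z_j) \le d(z_i - z_k). \qquad (13)
$$

Thus:

$$
0 \le d(z_i, z_j) \le 2 d(z_i - z_k) \le 2\epsilon. \qquad (14)
$$

Letting $\epsilon \to 0$, the result follows.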

A.2 PROOF OF THEOREM 2

We first give a lemma:

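Lemma 1 Let $S_y = \{z \mid g(z) = y\}$ be a convex set, where $z \in \mathbb{R}^n$ is the representation, $y$ is the target and $g$ is the regressor. Assume $g$ is differentiable, then $\forall z_k, z_i, z_j \in S_y$, we have:

$$
\nabla g(z_k)(z_i - z_j) = 0. \qquad (15)
$$

Proof Let $z^\epsilon_k = (1 - \epsilon) z_k + \epsilon z_i$, where $\epsilon \in [0, 1]$. Since $g$ is differentiable, using Taylor expansion, we have:

$$
\begin{aligned}
g(z^\epsilon_k) &= g((1 - \epsilon) z_k + \epsilon z_i) && (16) \\
&= g(z_k + \epsilon (z_i - z_k)) && (17) \\
&= g(z_k) + \epsilon \nabla g(z_k)(z_i - z_k) + o(\epsilon). && (18)
\end{aligned}
$$

Since $S_y$ is a convex set, we have $z^\epsilon_k \in S_y$. Thus:

$$
\begin{aligned}
g(z^\epsilon_k) &= g(z_k) + \epsilon \nabla g(z_k)(z_i - z_k) + o(\epsilon) && (19) \\
y &= y + \epsilon \nabla g(z_k)(z_i - z_k) + o(\epsilon) && (20) \\
\frac{o(\epsilon)}{\epsilon} &= \nabla g(z_k)(z_k - z_i). && (21)
\end{aligned}
$$

Taking the limit:

$$
\nabla g(z_k)(z_k - z_i) = \lim_{\epsilon \to 0} \frac{o(\epsilon)}{\epsilon} = 0. \qquad (22)
$$

Similarly, we have:

$$
\nabla g(z_k)(z_k - z_j) = 0. \qquad (23)
$$

Combining the two equations above, we have:

$$
\begin{aligned}
\nabla g(z_k)(z_i - z_j) &= \nabla g(z_k)(z_i - z_k + z_k - z_j) && (24) \\
&= \nabla g(z_k)(z_i - z_k) + \nabla g(z_k)(z_k - z_j) && (25) \\
&= 0. && (26)
\end{aligned}
$$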
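Lemma 1 can be checked numerically in the simplest case of a linear regressor, whose level sets are hyperplanes and therefore convex. The weights, bias, target value, and the projection helper below are hypothetical choices made only for this illustration.

```python
import numpy as np

# Hypothetical linear regressor g(z) = theta . z + bias; its gradient is theta.
theta = np.array([2.0, -1.0, 0.5])
bias = 0.3

def g(z):
    return float(theta @ z + bias)

def project_to_level_set(z, y):
    """Shift z along theta so that g(z) equals y."""
    return z + (y - g(z)) * theta / (theta @ theta)

rng = np.random.default_rng(0)
y = 1.7
z_i, z_j = (project_to_level_set(rng.normal(size=3), y) for _ in range(2))

# Both points lie on S_y, so the gradient is orthogonal to their difference.
inner = float(theta @ (z_i - z_j))
print(g(z_i), g(z_j), inner)
```

For a nonlinear $g$ the gradient depends on $z_k$, but the same orthogonality holds whenever the level set is convex, which is the content of the lemma.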

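Theorem 2 Assume $f_\theta$ is differentiable and $S_{y'_i}$ is a convex set, then $\forall z'_i, z'_j \in S_{y'_i}$:

$$
\frac{\partial L_{reg}}{\partial z_i}(z'_i - z'_j) = 0, \qquad (27)
$$

where $y'_i$ is the predicted target of $z_i$.

Proof By the chain rule,

$$
\begin{aligned}
\frac{\partial L_{reg}}{\partial z_i} &= \frac{\partial L_{reg}(g(z_i) - y_i)}{\partial z_i} && (28) \\
&= \frac{\partial L_{reg}(g(z_i) - y_i)}{\partial (g(z_i) - y_i)} \cdot \frac{\partial (g(z_i) - y_i)}{\partial z_i} && (29) \\
&= L'_{reg}(g(z_i) - y_i) \nabla g(z_i). && (30)
\end{aligned}
$$

Based on Lemma 1, we have:

$$
\nabla g(z_i)(z'_i - z'_j) = 0. \qquad (31)
$$

Thus:

$$
\begin{aligned}
\frac{\partial L_{reg}}{\partial z_i}(z'_i - z'_j) &= L'_{reg}(g(z_i) - y_i) \nabla g(z_i)(z'_i - z'_j) && (32) \\
&= L'_{reg}(g(z_i) - y_i) \cdot 0 && (33) \\
&= 0. && (34)
\end{aligned}
$$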

B.1 DETAILS ABOUT THE REAL-WORLD TASKS

For age estimation on AgeDB-DIR, we adopt the suggested hyper-parameters to train RankSim, where λ and γ are set to 2 and 1000, and the results of OE and PH-Reg are adopted from their published papers. The evaluation metric is the mean absolute error (MAE): $\frac{1}{N} \sum_{i=1}^{N} |y_i - y'_i|$, where $N$ is the total number of samples, and $y_i$, $y'_i$ are the label and the predicted result.

For depth estimation on NYUD2-DIR, we adopt the suggested hyper-parameters of OE and PH-Reg to train the models. For RankSim, we train the model with γ ranging from 1 to 1000. We report the best results for all three baselines. The evaluation metrics are the threshold accuracy $\delta_1$, defined as the percentage of $y_p$ such that $\max\left(\frac{y_p}{y'_p}, \frac{y'_p}{y_p}\right) < 1.25$, and the root mean squared error (RMS): $\sqrt{\frac{1}{n} \sum_p (y_p - y'_p)^2}$, where $n$ is the number of evaluated $y_p$.
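The three metrics above can be written compactly in Python. This is a minimal sketch that assumes labels and predictions are given as equal-length sequences of positive numbers; the function names and the toy values are illustrative and not taken from the released code.

```python
import math

def mae(labels, preds):
    """Mean absolute error: (1/N) * sum |y_i - y'_i|."""
    return sum(abs(y - p) for y, p in zip(labels, preds)) / len(labels)

def delta1(labels, preds, threshold=1.25):
    """Percentage of entries with max(y/y', y'/y) below the threshold.

    Assumes strictly positive values, as is the case for depth.
    """
    hits = sum(1 for y, p in zip(labels, preds) if max(y / p, p / y) < threshold)
    return 100.0 * hits / len(labels)

def rms(labels, preds):
    """Root mean squared error: sqrt((1/n) * sum (y_p - y'_p)^2)."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels))

# Toy values (hypothetical).
labels = [1.0, 2.0, 4.0, 8.0]
preds = [1.1, 2.0, 6.0, 7.0]
print(mae(labels, preds), delta1(labels, preds), rms(labels, preds))
```

In the toy example only the third entry violates the 1.25 ratio (6.0 / 4.0 = 1.5), so three of the four entries count toward $\delta_1$.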

C VISUALIZATIONS

C.1 VISUALIZATION OF THE UPDATING OF z

We provide the visualization of the feature similarities in Figure 5.

C.2 VISUALIZATION OF THE UPDATING OF z

The visualizations of feature manifolds at each epoch are provided in Figure 6. For the neural collapse of regression, the feature manifold will collapse into a single line when the target space is a line and the compression is maximized (Zhang et al., 2024). This trend can be observed in Figure 6, where the feature manifold looks like a thick line and evolves to a thinner line over training. However, standard regression's limited ability to tighten representations results in a slower collapse. In contrast, our proposed method and RankSim both accelerate this collapse, as shown in Figure 3.

Figure 5: Feature similarity matrices (Euclidean distance). Tightening the representations results in better ordinality. Recovered panel titles: (a) Baseline, (b) + RankSim, (e) + MT + $L_{ot}$.

C.3 UPDATING OF MULTIPLE θ

The experiments are conducted on NYUD2-DIR. We change the last hidden layer's feature space to three dimensions for visualization, and the M in our MT strategy is set to 3. The change of multiple θs throughout training is shown in Figure 7. We further plot the update vectors over the first 500 iterations, $\{v^i_\theta \mid i \in \mathbb{Z},\ 0 \le i \le 500\}$, for three θs. The visualizations are given in Figure 8. They show that the update directions of the θs align with each other, even for a neural network without training.

Figure 6: Change of z between adjacent epochs. Panels (a) to (i) show the change from epoch 1 to 2 through epoch 9 to 10, one panel per consecutive pair of epochs.

Figure 7: Change of the multiple θs.

Figure 8: Change of θs within the iteration [0, 500].