# Deep Regression Representation Learning with Topology

Shihao Zhang¹, Kenji Kawaguchi¹, Angela Yao¹

¹National University of Singapore. Correspondence to: Shihao Zhang. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024.

Most works studying representation learning focus only on classification and neglect regression. Yet, the learning objectives and, therefore, the representation topologies of the two tasks are fundamentally different: classification targets class separation, leading to disconnected representations, whereas regression requires ordinality with respect to the target, leading to continuous representations. We thus wonder how the effectiveness of a regression representation is influenced by its topology, with evaluation based on the Information Bottleneck (IB) principle. The IB principle is an important framework that provides principles for learning effective representations. We establish two connections between it and the topology of regression representations. The first connection reveals that a lower intrinsic dimension of the feature space implies a reduced complexity of the representation Z. This complexity can be quantified as the conditional entropy of Z on the target Y, and it serves as an upper bound on the generalization error. The second connection suggests that a feature space which is topologically similar to the target space will better align with the IB principle. Based on these two connections, we introduce PH-Reg, a regularizer specific to regression that matches the intrinsic dimension and topology of the feature space with those of the target space. Experiments on synthetic and real-world regression tasks demonstrate the benefits of PH-Reg. Code: https://github.com/needylove/PH-Reg.

## 1. Introduction

Regression is a fundamental task in machine learning in which input samples are mapped to a continuous target space. Representation learning empowers models to automatically extract, transform, and leverage relevant information from data, leading to improved performance. The information bottleneck (IB) principle (Shwartz-Ziv & Tishby, 2017) provides a theoretical framework and guiding principle for learning effective representations. The IB principle suggests learning a representation Z with sufficient information about the target Y but minimal information about the input X. For the representation Z, sufficiency keeps all necessary information on Y, while minimality reduces Z's complexity and prevents overfitting. The optimal representation, as specified by Achille & Soatto (2018b;a), is the most useful (sufficient) and minimal. Yet, these works considered only classification and neglected regression.

The IB principle is applicable to both classification and regression in that both learn minimal and sufficient representations. However, there are some fundamental differences. For example, classification shortens the distance between features belonging to the same class while elongating the distance between features of different classes; the shortening and elongating of distances can be interpreted as minimality and sufficiency, respectively (Boudiaf et al., 2020). The two effects lead to disconnected representations (Brown et al., 2022a). By contrast, in regression, the representations are shown to be continuous and connected and to form an ordinal relationship with respect to the target (Zhang et al., 2023).
The disconnected and connected representations are topologically different, as they have different 0th Betti numbers. The 0th Betti number represents connectivity in topology and influences the shape of the feature space¹. While a few works investigate the influence of representation topology in classification (Hofer et al., 2019; Chen et al., 2019), regression is overlooked. We thus wonder what topology the feature space should have for effective regression and how the topology of the feature space is connected to the IB principle.

In this work, we establish two connections between the topology of the feature space and the IB principle for regression representation learning in deep learning. To establish the connections, we first demonstrate that minimizing the conditional entropies H(Y|Z) and H(Z|Y) can better align with the IB principle. The entropy of a random variable reflects its uncertainty. Specifically, for regression, the conditional entropy H(Z|Y) is linked to the minimality of Z and serves as an upper bound on the generalization error.

The first connection reveals that H(Z|Y) is bounded by the intrinsic dimension (ID) of the feature space, which suggests encouraging a lower-ID feature space for better generalization ability. However, the ID of the feature space should not be less than the ID of the target space, to guarantee sufficient representation capabilities. Thus, a feature space with an ID equal to that of the target space is desirable. The intrinsic dimension is a fundamental property of data topology. Intuitively, it can be regarded as the minimal number of dimensions needed to describe the representation without significant information loss (Ansuini et al., 2019).

The second connection reveals that having a representation Z homeomorphic to the target space Y is desirable when both H(Y|Z) and H(Z|Y) are minimal. The homeomorphism between two spaces can be described intuitively as a continuous deformation of one space into the other. From a topological viewpoint, two spaces are considered the same if they are homeomorphic (Hatcher, 2001). However, directly enforcing homeomorphism is challenging, since the representation Z typically lies in a high-dimensional space that cannot be modeled without sufficient data samples. As such, we opt to enforce topological similarity between the target and feature spaces. Here, topological similarity refers to similarity in topological features, such as clusters and loops, and their localization (Trofimov et al., 2023).

These connections naturally inspire us to learn a regression feature space that is topologically similar to, and has the same intrinsic dimension as, the target space. To this end, we introduce a regularizer called the Persistent Homology Regression Regularizer (PH-Reg). In classification, interest has grown in regulating the intrinsic dimension. For instance, Zhu et al. (2018) explicitly penalize the intrinsic dimension as a regularizer, while Ma et al. (2018) use intrinsic dimensions as weights for noisy-label correction. However, a theoretical justification for using the intrinsic dimension as a regularizer is lacking, and these works overlook the topology of the target space. Experiments on various regression tasks demonstrate the effectiveness of PH-Reg.

¹In this work, the feature space refers to the set of projected data points, i.e. the manifold, rather than the entire ambient space.
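As an intuition aid for the 0th Betti number mentioned above, the sketch below counts the connected components of a sampled point set via an ε-neighborhood graph. This is purely illustrative and not part of the paper's method; the threshold `eps` and the two toy point clouds are arbitrary choices for this sketch.

```python
# Illustrative only: estimating the 0th Betti number (number of connected
# components) of a sampled feature space via an epsilon-neighborhood graph.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

def betti_0(points: np.ndarray, eps: float) -> int:
    """Count connected components of the eps-neighborhood graph of `points`."""
    dist = cdist(points, points)        # pairwise Euclidean distances
    adj = csr_matrix(dist <= eps)       # edge iff distance <= eps
    n_components, _ = connected_components(adj, directed=False)
    return n_components

rng = np.random.default_rng(0)
# "Classification-like" features: two well-separated clusters -> beta_0 = 2.
clusters = np.concatenate([rng.normal(0.0, 0.1, (100, 2)),
                           rng.normal(5.0, 0.1, (100, 2))])
# "Regression-like" features: points along a connected curve -> beta_0 = 1.
t = np.linspace(0.0, 2.0 * np.pi, 200)
curve = np.stack([t, np.sin(t)], axis=1)

print(betti_0(clusters, eps=0.5))  # expected: 2 (disconnected)
print(betti_0(curve, eps=0.5))     # expected: 1 (connected)
```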
Our main contributions are three-fold:

- We are the first to investigate effective feature space topologies for regression. We establish novel connections between the topology of the feature space and the IB principle, which also provides justification for exploiting the intrinsic dimension as a regularizer.
- Based on the IB principle, we demonstrate that H(Z|Y) serves as an upper bound on the generalization error in regression, providing insights for enhancing generalization ability.
- We introduce a regularizer named PH-Reg based on the established connections. Applying PH-Reg achieves significant improvements in coordinate prediction on synthetic datasets and in real-world regression tasks such as super-resolution, age estimation, and depth estimation.

## 2. Related Works

**Intrinsic dimension.** Raw data and learned data representations often lie on manifolds of lower intrinsic dimension but are embedded within a higher-dimensional ambient space (Bengio et al., 2013). The intrinsic dimension of the feature space of the last hidden layer has shown a strong connection with a network's generalization ability (Ansuini et al., 2019), and several widely used regularizers, such as weight decay and dropout, effectively reduce the intrinsic dimension (Brown et al., 2022b). Commonly, generalization ability increases as the intrinsic dimension decreases. However, a theoretical justification for why this happens has been lacking, and our established connections provide an explanation for this phenomenon in regression. The intrinsic dimension can be estimated by methods such as TwoNN (Facco et al., 2017) and Birdal's estimator (Birdal et al., 2021). Among the relevant studies, Birdal et al. (2021) is the most closely related to ours. This work demonstrates that the generalization error can be bounded by the intrinsic dimension of training trajectories, which possess fractal structures. However, their analysis is based on the parameter space, while ours is on the feature space. Furthermore, we take the target space into consideration, ensuring sufficient representation capabilities.

**Topological data analysis.** Topological data analysis is a recent field that provides a set of topological and geometric tools to infer robust features of complex data (Chazal & Michel, 2021). It can be coupled with feature learning to ensure that learned representations are robust and reflect the training data's underlying topological and geometric information (Rieck et al., 2020). It has benefitted diverse tasks ranging from fMRI data analysis (Rieck et al., 2020) to AI-generated text detection (Tulchinskii et al., 2023). It can also be used as a tool to compare data representations (Barannikov et al., 2021a) and data manifolds (Barannikov et al., 2021b). To learn representations that reflect the topology of the training data, a common strategy is to preserve topologically relevant distances of different dimensions between the input space and the feature space (Moor et al., 2020; Trofimov et al., 2023). We follow Moor et al. (2020) to preserve topological information. However, unlike classification, regression's target space is naturally a metric space, rich in topology induced by the metric and crucial for the intended task. Consequently, we leverage the topology of the target space, marking the first exploration of topology specific to effective representation learning for regression.
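The TwoNN estimator referenced above has a compact maximum-likelihood form, shown below as a simplified sketch. The original method additionally discards a fraction of the largest distance ratios and can use a linear fit instead of the MLE used here, so treat this as an approximation rather than a reference implementation; the Swiss-roll test data and all parameter choices are assumptions of this sketch.

```python
# Simplified sketch of the TwoNN intrinsic-dimension estimator (Facco et al., 2017).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(x: np.ndarray) -> float:
    """Estimate the intrinsic dimension of points `x` of shape (N, D)."""
    # Distances to the two nearest neighbors (first column is the point itself).
    nn = NearestNeighbors(n_neighbors=3).fit(x)
    dist, _ = nn.kneighbors(x)
    r1, r2 = dist[:, 1], dist[:, 2]
    mu = r2 / r1                          # ratios follow a Pareto law with exponent = ID
    mu = mu[mu > 1.0]                     # guard against duplicated points
    return len(mu) / np.sum(np.log(mu))   # maximum-likelihood estimate of the ID

# Sanity check on a 2D manifold (Swiss roll) embedded in 3D.
rng = np.random.default_rng(0)
t = 3 * np.pi * (1 + 2 * rng.random(2000))
h = 20 * rng.random(2000)
swiss_roll = np.stack([t * np.cos(t), h, t * np.sin(t)], axis=1)
print(f"estimated ID = {twonn_id(swiss_roll):.2f}")  # should be close to 2
```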
## 3. Learning a Desirable Regression Representation

From a topological point of view, what topological properties should a representation for regression have? More simply put, what shape or structure should the feature space have for effective regression? In this work, we suggest that a desirable regression representation should (1) have a feature space topologically similar to the target space and (2) have a feature space whose intrinsic dimension matches that of the target space. We arrive at this conclusion by establishing two connections between the topology of the feature space and the Information Bottleneck principle.

Below, we first introduce the notation in Sec. 3.1 and connect the IB principle with the two terms H(Z|Y) and H(Y|Z) in Sec. 3.2. We then demonstrate that H(Z|Y) upper-bounds the generalization error in regression in Sec. 3.3. This later provides justification for why a lower ID implies higher generalization ability. Subsequently, we establish the first connection in Sec. 3.5, revealing that H(Z|Y) is bounded by the ID of the feature space. Finally, we establish the second connection, the topological similarity between the feature and target spaces, in Sec. 3.6. Two motivating examples are provided in Sec. 3.4 to enhance intuitive understanding of the two connections.

### 3.1. Notations

Consider a dataset $S = \{x_i, z_i, y_i\}_{i=1}^N$ with $N$ samples, sampled from a distribution $P$, with corresponding labels $y_i \in Y$. To predict $y_i$, a neural network first encodes the input $x_i$ into a representation $z_i \in \mathbb{R}^d$ before applying a regressor $f$, i.e. $\hat{y}_i = f(z_i)$. The encoder and the regressor $f$ are trained by minimizing a task-specific regression loss $\mathcal{L}_m$ based on a distance between $\hat{y}_i$ and $y_i$, i.e. $\mathcal{L}_m = g(\|\hat{y}_i - y_i\|_2)$. Typically, an L2 loss is used, i.e. $\mathcal{L}_m = \frac{1}{N}\sum_i \|\hat{y}_i - y_i\|_2^2$, though more robust variants exist, such as the L1 loss or the scale-invariant error (Eigen et al., 2014). Note that the dimensionality of $y$ is task-specific and is not limited to 1. We denote by $X$, $Y$, and $Z$ the random variables representing $x$, $y$, and $z$, respectively.

### 3.2. IB Purely between Y and Z

The IB tradeoff is a practical implementation of the IB principle in machine learning. It suggests that a desirable $Z$ should contain sufficient information about the target $Y$ (i.e., maximize the mutual information $I(Z;Y)$) and minimal information about the input $X$ (i.e., minimize $I(Z;X)$). The trade-off between the two aims is typically formulated as the minimization of the associated Lagrangian,

$$\mathcal{L}_{IB} := -I(Z;Y) + \beta I(Z;X),$$

where $\beta > 0$ is the Lagrange multiplier. To establish the connections, we first reformulate the IB tradeoff purely in terms of relationships between $Y$ and $Z$. The following theorem shows that minimizing the conditional entropies $H(Y|Z)$ and $H(Z|Y)$ can be seen as a proxy for optimizing the IB tradeoff when $\beta \in (0, 1)$:

**Theorem 1.** *Assume that the conditional entropy $H(Z|X)$ is a fixed constant for $Z \in \mathcal{Z}$ for some set $\mathcal{Z}$ of random variables, or that $Z$ is deterministic given $X$. Then,*

$$\min_Z \mathcal{L}_{IB} = \min_Z \{(1-\beta)H(Y|Z) + \beta H(Z|Y)\}.$$

The detailed proof of Theorem 1 is provided in Appendix A.1. Here, we provide a brief overview by decomposing the terms. Minimizing the conditional entropy $H(Y|Z)$ encourages the learned representation $Z$ to be informative about the target variable $Y$. When considering $I(Z;Y)$ as signal, the term $H(Z|Y)$ in Theorem 1 can be thought of as noise, since $I(Z;Y) = H(Z) - H(Z|Y)$ and $H(Z)$ represents the total information. Consequently, minimizing $H(Z|Y)$ can be seen as learning a minimal representation by reducing noise. The minimality can reduce the complexity of $Z$ and prevent neural networks from overfitting (Tishby & Zaslavsky, 2015).
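The identity behind Theorem 1 follows from two standard decompositions of mutual information; the short derivation below restates the argument from Appendix A.1.

```latex
% Using I(Z;X) = H(Z) - H(Z|X) = I(Z;Y) + H(Z|Y) - H(Z|X) and I(Z;Y) = H(Y) - H(Y|Z):
\begin{aligned}
\mathcal{L}_{IB} &= -I(Z;Y) + \beta I(Z;X) \\
                 &= (\beta-1)\,I(Z;Y) + \beta H(Z\mid Y) - \beta H(Z\mid X) \\
                 &= (1-\beta)\,H(Y\mid Z) + \beta H(Z\mid Y)
                    + \underbrace{(\beta-1)H(Y) - \beta H(Z\mid X)}_{\text{constant in } Z \text{ under the assumption of Theorem 1}} .
\end{aligned}
```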
It is worth mentioning that the fixed-constant assumption in Theorem 1 holds for most neural networks, as neural networks are commonly deterministic functions. For stochastic representations, we commonly learn a distribution $p(Z|X)$ that approaches a fixed distribution, such as the standard Gaussian in a VAE; in this case, $H(Z|X)$ tends to a fixed constant for $Z$. Discussions about the choice of $\beta$, i.e. $\beta \in (0,1)$ versus $\beta > 1$, and further illustrations are given in Appendix B.

### 3.3. H(Z|Y) Upper-Bounds the Generalization Error

Next, we show that $H(Z|Y)$ upper-bounds the generalization error.

**Theorem 2.** *Consider a dataset $S = \{x_i, z_i, y_i\}_{i=1}^N$ sampled from a distribution $P$, where $x_i$ is the input, $z_i$ is the corresponding representation, and $y_i$ is the label. Let $d_{\max} = \max_{y \in Y} \min_{y_i \in S} \|y - y_i\|_2$ be the maximum distance of $y$ to its nearest $y_i$. Assume $(Z|Y=y_i)$ follows a distribution $D$ and that the dispersion of $D$ is bounded by its entropy:*

$$\mathbb{E}_{z \sim D}[\|z - \bar{z}\|_2] \le Q(H(D)), \tag{1}$$

*where $\bar{z}$ is the mean of the distribution $D$ and $Q(H(D))$ is some function of $H(D)$. Assume the regressor $f$ is $L_1$-Lipschitz continuous. Then, as $d_{\max} \to 0$, we have:*

$$\mathbb{E}_{\{x,z,y\} \sim P}[\|f(z) - y\|_2] \le \mathbb{E}_{\{x,z,y\} \sim S}[\|f(z) - y\|_2] + 2L_1 Q(H(Z|Y)). \tag{2}$$

The detailed proof of Theorem 2 is provided in Appendix A.2, and a comparison to a related bound is given in Appendix C. Theorem 2 states that the generalization error $|\mathbb{E}_P[\|f(z)-y\|_2] - \mathbb{E}_S[\|f(z)-y\|_2]|$, defined as the difference between the population risk $\mathbb{E}_P[\|f(z)-y\|_2]$ and the empirical risk $\mathbb{E}_S[\|f(z)-y\|_2]$, is bounded in terms of the $H(Z|Y)$ term from Theorem 1. Theorem 2 thus suggests that minimizing $H(Z|Y)$ will improve generalization performance.

The tightness of the bound in Theorem 2 depends on the function $Q$, which bounds the dispersion $\mathbb{E}_{z \sim D}[\|z - \bar{z}\|_2]$ of a distribution by its entropy. For a given distribution $D$, such a $Q$ exists whenever the dispersion and the entropy are bounded, as we can then find a $Q$ that scales the entropy to be larger than the dispersion. Proposition 1 provides examples of the function $Q$ for various distributions; the corresponding proof is provided in Appendix A.2.

**Proposition 1.** *If $D$ is a multivariate normal distribution $\mathcal{N}(\bar{z}, \Sigma = k \cdot I)$, where $k > 0$ is a scalar and $\bar{z}$ is the mean of the distribution $D$, then the function $Q(H(D))$ in Theorem 2 can be selected as $Q(H(D)) = \sqrt{\frac{d\,(e^{2H(D)})^{1/d}}{2\pi e}}$, where $d$ is the dimension of $z$. If $D$ is a uniform distribution, then $Q(H(D))$ can be selected as $Q(H(D)) = \frac{e^{H(D)}}{\sqrt{12}}$.*
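A quick numerical sanity check of the Gaussian case of Proposition 1: for $D = \mathcal{N}(\bar z, kI)$ the expression for $Q(H(D))$ simplifies to $\sqrt{dk} = \sqrt{\mathbb{E}\|z-\bar z\|_2^2}$, so the bound reduces to Jensen's inequality for this family. The dimension, variance, and sample count below are arbitrary choices for the check.

```python
# Monte-Carlo check of E||z - z_bar||_2 <= Q(H(D)) for D = N(0, k*I) in d dims,
# where H(D) = 0.5 * d * log(2*pi*e*k) and Q(H) = sqrt(d * (e^{2H})^{1/d} / (2*pi*e)).
import numpy as np

d, k = 8, 0.5
rng = np.random.default_rng(0)
z = rng.normal(0.0, np.sqrt(k), size=(100_000, d))    # samples from N(0, k*I)

entropy = 0.5 * d * np.log(2 * np.pi * np.e * k)      # differential entropy H(D)
q = np.sqrt(d * np.exp(2 * entropy) ** (1 / d) / (2 * np.pi * np.e))  # = sqrt(d*k)
dispersion = np.linalg.norm(z, axis=1).mean()         # Monte-Carlo E||z - z_bar||_2

print(f"E||z - z_bar||_2 = {dispersion:.3f} <= Q(H(D)) = {q:.3f}")
```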
### 3.4. Motivating Examples

**Encouraging the same intrinsic dimension.** Figure 1(a) plots pixel-wise representations of the last hidden layer's feature space, depicted as dots whose colors correspond to the ground-truth depth. These representations are obtained from a batch of 32 images from the NYU-v2 test set for depth estimation; a modified ResNet-50 produces them, with the last hidden layer changed to dimension 3 for visualization. The figure shows that the representations lie on a manifold whose ID varies locally from 1 (blue region) to 3 (green region). The black arrow represents the linear regressor's weight vector $\theta$, and the predicted depth $\hat{Y} = f(Z) = \theta^\top z$ is obtained by mapping $Z$ (the dots) onto $\theta$. The gray plane represents the solution space of $f(z) = \hat{y}_i$, and the entropy of the distribution of $Z$ in this plane, i.e. $H(Z|\hat{Y} = \hat{y}_i)$, can be seen as an approximation of $H(Z|Y = y_i)$.

The target space for depth estimation is one-dimensional; enforcing an intrinsic dimension that matches the 1D target space will squeeze the feature space into a line. Under such a scenario, the solution space of $f(z) = \hat{y}_i$ is compressed into a point, implying $H(Z|\hat{Y} = \hat{y}_i) = 0$ (discrete case) and a lower $H(Z|Y = y_i)$. A lower $H(Z|Y = y_i)$ for all $i$ implies a lower $H(Z|Y)$. Thus, by controlling the ID, we obtain a lower $H(Z|Y)$, implying higher generalization ability. Since the ID of the feature space is commonly higher than the ID of the target space, the first connection generally encourages learning a lower-ID feature space. In classification, we tighten clusters for a lower $H(Z|Y)$, while in regression, lowering the ID achieves a lower $H(Z|Y)$. Lowering the ID of the feature space can be intuitively understood as tightening the clusters in classification, where each solution space corresponds to a cluster.

**Enforcing topological similarity.** Figure 1(b) provides a PCA visualization (from 100 dimensions to 3 dimensions; a t-SNE visualization can be found in Figure 3) of the feature space for a Mammoth-shaped target space (see Sec. 5.1 for details). This feature space is topologically similar to the target space, which indicates that regression potentially captures the topology of the target space. The second connection suggests improving such similarity.

Figure 1. (a) Visualization of the feature space from the depth estimation task. Enforcing an ID equal to that of the target space (1-dimensional) will squeeze the feature space into a line, reducing the unnecessary $H(Z|Y = y_i)$ corresponding to the solution space of $f(z) = \hat{y}_i$ (the gray quadrilateral) for all $i$ and implying a lower $H(Z|Y)$. (b) Visualization of the feature space (right) and the Mammoth-shaped target space (left); see Sec. 5.1 for details. The feature space is topologically similar to the target space.

### 3.5. Encouraging the Same Intrinsic Dimension

Now we can establish our first connection, which reveals that $H(Z|Y)$ is bounded by the ID of the feature space. Note that the intrinsic dimension is not a single well-defined mathematical object, and different mathematical definitions exist (Ma et al., 2018; Birdal et al., 2021). We first define the intrinsic dimension following Ghosh & Motani (2023):

**Definition 1 (Intrinsic Dimension).** We define the intrinsic dimension of the manifold $\mathcal{M}$ of a random variable $X$ as

$$\mathrm{Dim}_{ID}^{\mathcal{M}} = \lim_{\epsilon \to 0^+} \mathbb{E}_{\rho \sim p(X)}[d_\epsilon(\rho)], \tag{4}$$

where $d_\epsilon(\rho) = \min n$ s.t. $(V_1, V_2, \dots, V_n) \cong X_\epsilon^\rho$ can be regarded as the intrinsic dimension locally at point $\rho$ of the manifold $\mathcal{M}$. Here $V_1, V_2, \dots, V_n$ are random variables, and $\cong$ means there exist continuous functions $f_1, f_2$ such that $f_1(V_1, V_2, \dots, V_n) = X_\epsilon^\rho$ and $f_2(X_\epsilon^\rho) = (V_1, V_2, \dots, V_n)$. $X_\epsilon^\rho$ is a new random variable that follows the distribution $P_\epsilon^\rho$ given by

$$P_\epsilon^\rho(X) = \begin{cases} \frac{P(X)}{c}, & \text{if } \|X - \rho\| \le \epsilon \\ 0, & \text{otherwise,} \end{cases} \tag{5}$$

where $c = \int_{\|X - \rho\| \le \epsilon} P(X)\,dX$.

The manifold is assumed to locally resemble an $n$-dimensional Euclidean space. Intuitively, we can consider the ID as the expectation of $n$ over the distribution on this manifold.
**Theorem 3.** *Assume that $z$ lies on a manifold $\mathcal{M}$ and that $\mathcal{M}_i \subseteq \mathcal{M}$ is the manifold corresponding to the distribution $(z|y = y_i)$. Let $C(\epsilon)$ be some function of $\epsilon$:*

$$C(\epsilon) = \int_{\|z - z'\| \le \epsilon} P'(z)\,dz, \tag{6}$$

*where $P'(z)$ is the probability of $z$ when $(z|y = y_i)$ is uniformly distributed across $\mathcal{M}_i$, and $z'$ is any point on $\mathcal{M}_i$. Then, as $\epsilon \to 0^+$, we have:*

$$H(Z|Y) = \mathbb{E}_{y_i \sim Y} H(Z|Y = y_i) \le \mathbb{E}_{y_i \sim Y}\left[\log(\epsilon)\,\mathrm{Dim}_{ID}^{\mathcal{M}_i} + \log\frac{K}{C(\epsilon)}\right], \tag{7}$$

*for some fixed scalar $K$, where $\mathrm{Dim}_{ID}^{\mathcal{M}_i}$ is the intrinsic dimension of the manifold $\mathcal{M}_i$.*

Theorem 3 is derived from Proposition 1 of Ghosh & Motani (2023); the detailed proof is provided in Appendix A.3. Theorem 3 states that the conditional entropy $H(Z|Y)$ is bounded by the IDs of the manifolds corresponding to the distributions $(z|y = y_i)$, and the bound is tight when the $(z|y = y_i)$ are uniformly distributed across those manifolds. Since $\mathcal{M}_i \subseteq \mathcal{M}$, Theorem 3 suggests that reducing the intrinsic dimension of the feature space $\mathcal{M}$ will lead to a lower $H(Z|Y)$, which in turn implies better generalization performance by Theorem 2. On the other hand, the ID of $\mathcal{M}$ should not be less than the intrinsic dimension of the target space, to guarantee sufficient representation capabilities. Thus, an $\mathcal{M}$ with an intrinsic dimension equal to that of the target space is desirable.

### 3.6. Enforcing Topological Similarity

Below, we establish the second connection: topological similarity between the feature and target spaces. We first define the optimal representation following Achille & Soatto (2018b).

**Definition 2 (Optimal Representation).** The representation $Z$ is optimal if (1) $H(Y|Z) = H(Y|X)$ and (2) $Z$ is fully determined given $Y$, i.e. $H(Z|Y)$ is minimal.

In Definition 2, $H(Y|Z) = H(Y|X)$ means $Z$ is sufficient for the target $Y$, while a minimal $H(Z|Y)$ means $Z$ discards all information that is not relevant to $Y$ and is fully determined given $Y$. For continuous entropy, a minimal $H(Z|Y)$ implies $H(Z|Y) = -\infty$, as $Z$ is distributed as a delta function once $Y$ is given. In the discrete case, $H(Z|Y) = 0$.

**Proposition 2.** *Let the target $Y = Y^* + N^*$, where $Y^*$ is fully determined by $X$ and $N^*$ is the aleatoric uncertainty, which is independent of $X$. Assume the underlying mappings $f^*$ from $Z$ to $Y^*$ and $g^*$ from $Y^*$ to $Z$ are continuous, where continuity is with respect to the topology induced by the Euclidean distance. Then the representation $Z$ is optimal if and only if $Z$ is homeomorphic to $Y^*$.*

The detailed proof of Proposition 2 is provided in Appendix A.4. Proposition 2 demonstrates that the optimal $Z$ is homeomorphic to $Y^*$, implying the need to learn a $Z$ that is homeomorphic to $Y^*$. However, directly enforcing homeomorphism is challenging, since $Y^*$ is generally unknown and the representation $Z$ typically lies in a high-dimensional space that cannot be modeled without sufficient data samples. As such, we opt to enforce topological similarity between the target and feature spaces, preserving topological features in a manner similar to homeomorphism. Here, topological similarity refers to similarity in topological features, such as clusters and loops, and their localization (Trofimov et al., 2023). The two established connections imply that the desired $Z$ should be topologically similar to the target space and share the same ID as the target space. More illustrations are given in Appendix D and E.
## 4. PH-Reg for Regression

Our analysis in Sec. 3 inspires us to learn a feature space that is (1) topologically similar to the target space and (2) of intrinsic dimension (ID) equal to that of the target space. To this end, we propose a regularizer named the Persistent Homology Regression Regularizer (PH-Reg). PH-Reg features two terms: an intrinsic dimension term $\mathcal{L}_d$ and a topology term $\mathcal{L}_t$. $\mathcal{L}_d$ follows Birdal's regularizer (Birdal et al., 2021) to control the ID of the feature space; additionally, it considers the target space to ensure sufficient representation capabilities. $\mathcal{L}_t$ exploits the topological autoencoder (Moor et al., 2020) to encourage topological similarity. Note that the two regularizer terms are mainly introduced to verify our connections, and other ID and topology regularizers can also be considered. However, empirical observations suggest that our $\mathcal{L}_d$ and $\mathcal{L}_t$ align well with the established connections, perform well, and do not conflict with each other.

We first introduce some notation. Let $Z_n$ represent a set of $n$ samples from $Z$, and let $Y_n$ be the labels corresponding to $Z_n$. We denote by $\mathrm{PH}_0(\mathrm{VR}(Z_n))$ the 0-dimensional persistent homology. Intuitively, $\mathrm{PH}_0(\mathrm{VR}(Z_n))$ can be regarded as a set of edge lengths, where the edges come from the minimum spanning tree obtained from the distance matrix $A_{Z_n}$ of $Z_n$. We denote by $\pi_{Z_n}$ and $\pi_{Y_n}$ the sets of edge indices of the minimum spanning trees of $Z_n$ and $Y_n$, respectively, and by $A_\cdot[\pi_\cdot]$ the corresponding edge lengths. Let $E(Z_n) = \sum_{\gamma \in \mathrm{PH}_0(\mathrm{VR}(Z_n))} |I(\gamma)|$ be the sum of edge lengths of the minimum spanning tree corresponding to $Z_n$; we define $E(Y_n)$ similarly. Some topology preliminaries are given in Appendix F.

Birdal et al. (2021) suggest estimating the intrinsic dimension as the slope between $\log E(Z_n)$ and $\log n$. Note that the definition of intrinsic dimension used in Birdal et al. (2021) is based on the 0-dimensional persistent homology $\mathrm{PH}_0(\mathrm{VR}(Z_n))$, which is different from ours (Definition 1, following Ghosh & Motani (2023)). However, both definitions describe the same object, i.e. the intrinsic dimension, so it is reasonable to exploit the method of Birdal et al. (2021) to constrain it. Let $\mathbf{e}' = [\log E(Z_{n_1}), \log E(Z_{n_2}), \dots, \log E(Z_{n_m})]$, where $Z_{n_i}$ is a subset sampled from a batch, with size $n_i = |Z_{n_i}|$ and $n_i < n_j$ for $i < j$, and let $\mathbf{n} = [\log n_1, \log n_2, \dots, \log n_m]$. Birdal et al. (2021) encourage a feature space of lower intrinsic dimension by minimizing the slope between $\mathbf{e}'$ and $\mathbf{n}$, which can be estimated via least squares:

$$\mathcal{L}_d' = \frac{m \sum_{i=1}^m \mathbf{n}_i \mathbf{e}'_i - \sum_{i=1}^m \mathbf{n}_i \sum_{i=1}^m \mathbf{e}'_i}{m \sum_{i=1}^m \mathbf{n}_i^2 - \left(\sum_{i=1}^m \mathbf{n}_i\right)^2}. \tag{9}$$

Intuitively, the growth rate of $E(Z_n)$ is proportional to the volume of the corresponding manifold, and this volume is related to the intrinsic dimension. In fact, there is a classical result on the growth rate of $E(Z_n)$ (Steele, 1988), showing that the growth rate (i.e. the slope) can constrain the intrinsic dimension.

$\mathcal{L}_d'$ purely encourages the feature space to have a lower intrinsic dimension; sometimes it may even result in an intrinsic dimension lower than that of the target space (see Figure 3, Swiss Roll, where the target space is two-dimensional and the feature space is almost one-dimensional). In contrast, we wish to lower the ID of the feature space while preventing it from falling below that of the target space. We propose to minimize the slope between $\mathbf{e} = [\mathbf{e}_1, \mathbf{e}_2, \dots, \mathbf{e}_m]$ and $\mathbf{n}$:

$$\mathcal{L}_d = \left|\frac{m \sum_{i=1}^m \mathbf{n}_i \mathbf{e}_i - \sum_{i=1}^m \mathbf{n}_i \sum_{i=1}^m \mathbf{e}_i}{m \sum_{i=1}^m \mathbf{n}_i^2 - \left(\sum_{i=1}^m \mathbf{n}_i\right)^2}\right|, \tag{10}$$

where $\mathbf{e}_i = \log E(Z_{n_i}) / \log E(Y_{n_i})$. Compared with $\mathcal{L}_d'$, $\mathcal{L}_d$ further exploits the topological information of the target space through $\log E(Y_{n_i})$. When the feature and target spaces have the same ID, $E(Z_{n_i}) = E(Y_{n_i})$ for all $i$ and $\mathcal{L}_d = 0$ attains its minimum. As shown in Figure 2(c) and Figure 3, $\mathcal{L}_d$ controls the ID of the feature space well while better preserving the topology of the target space.
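The quantities above reduce to simple computations on pairwise distances. Below is a minimal PyTorch sketch of $E(Z_n)$ and of $\mathcal{L}_d$ in Eqs. (9)–(10); it is an illustration of the idea rather than the released PH-Reg implementation, and the subset sizes, the SciPy MST backend, and the random subset sampling are assumptions of this sketch.

```python
# Sketch of E(Z_n) (total MST edge length, i.e. the sum of 0-dimensional
# persistence intervals) and of the slope-based term L_d. The MST edge indices
# are computed with SciPy on detached tensors; gradients flow through the
# gathered distances.
import torch
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_edge_sum(points: torch.Tensor) -> torch.Tensor:
    """E(S): sum of the MST edge lengths of the point set `points` of shape (n, d)."""
    dist = torch.cdist(points, points)                        # pairwise distances
    mst = minimum_spanning_tree(dist.detach().cpu().numpy())  # sparse MST of the graph
    rows, cols = mst.nonzero()                                # indices of MST edges
    rows = torch.as_tensor(rows, dtype=torch.long)
    cols = torch.as_tensor(cols, dtype=torch.long)
    return dist[rows, cols].sum()                             # differentiable edge-length sum

def slope(log_n: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
    """Least-squares slope of e against log_n (cf. Eq. (9))."""
    m = log_n.numel()
    return (m * (log_n * e).sum() - log_n.sum() * e.sum()) / \
           (m * (log_n ** 2).sum() - log_n.sum() ** 2)

def loss_d(z: torch.Tensor, y: torch.Tensor, sizes=(25, 50, 100, 200)) -> torch.Tensor:
    """L_d of Eq. (10): |slope| of log E(Z_n)/log E(Y_n) against log n."""
    sizes = [n for n in sizes if n <= z.shape[0]]
    log_n, e = [], []
    for n in sizes:
        idx = torch.randperm(z.shape[0])[:n]                  # random subset of the batch
        log_n.append(torch.log(torch.tensor(float(n))))
        e.append(torch.log(mst_edge_sum(z[idx])) / torch.log(mst_edge_sum(y[idx])))
    return slope(torch.stack(log_n), torch.stack(e)).abs()
```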
The topological autoencoder (Moor et al., 2020) enforces topological similarity between the feature and target spaces by preserving 0-dimensional topologically relevant distances of the two spaces. We exploit it as the topology term $\mathcal{L}_t$:

$$\mathcal{L}_t = \|A_{Z_{n_m}}[\pi_{Z_{n_m}}] - A_{Y_{n_m}}[\pi_{Z_{n_m}}]\|_2^2 + \|A_{Z_{n_m}}[\pi_{Y_{n_m}}] - A_{Y_{n_m}}[\pi_{Y_{n_m}}]\|_2^2. \tag{11}$$

As shown in Figure 2(d) and Figure 3, $\mathcal{L}_t$ preserves the topology of the target space well. We define the Persistent Homology Regression Regularizer, PH-Reg, as $\mathcal{L}_R = \mathcal{L}_d + \mathcal{L}_t$. As shown in Figure 2(e) and Figure 3, PH-Reg both encourages a lower intrinsic dimension and preserves the topology of the target space. The final loss function is defined as

$$\mathcal{L} = \mathcal{L}_m + \lambda_t \mathcal{L}_t + \lambda_d \mathcal{L}_d, \tag{13}$$

where $\mathcal{L}_m$ is the task-specific regression loss and $\lambda_d$, $\lambda_t$ are trade-off parameters whose values are determined by the magnitude of the task-specific loss $\mathcal{L}_m$; e.g. for a high $\mathcal{L}_m$, $\lambda_d$ and $\lambda_t$ should also be set to high values.

Figure 2. Visualization of the last hidden layer's feature space from the depth estimation task (panels: (a) Regression, (b) Regression $+\mathcal{L}_d'$, (c) Regression $+\mathcal{L}_d$, (d) Regression $+\mathcal{L}_t$, (e) Regression $+\mathcal{L}_R$). The representations are obtained with a modified ResNet-50, with the last hidden layer changed to dimension 3 for visualization. The target space is a 1-dimensional line, and colors represent the ground-truth depth. (b) $\mathcal{L}_d'$ encourages a lower intrinsic dimension yet fails to preserve the topology of the target space. (c) $\mathcal{L}_d$ takes the target space into consideration and can further preserve its topology. (d) $\mathcal{L}_t$ enforces topological similarity between the feature and target spaces. (e) Adding $\mathcal{L}_t$ to $\mathcal{L}_d$ better preserves the topology of the target space.
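A matching sketch of the topology term $\mathcal{L}_t$ in Eq. (11), in the same illustrative spirit as the $\mathcal{L}_d$ sketch above: it gathers MST-edge distances from both spaces and penalizes their mismatch. The released implementation may differ in how subsets and edge indices are handled.

```python
# Sketch of L_t (Eq. (11)), following the topological-autoencoder idea of
# matching MST-edge ("topologically relevant") distances across the two spaces.
import torch
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_edges(points: torch.Tensor):
    """Return (distance matrix, MST edge indices) for a point set of shape (n, d)."""
    dist = torch.cdist(points, points)
    mst = minimum_spanning_tree(dist.detach().cpu().numpy())
    rows, cols = mst.nonzero()
    idx = (torch.as_tensor(rows, dtype=torch.long),
           torch.as_tensor(cols, dtype=torch.long))
    return dist, idx

def loss_t(z: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """L_t: match MST-edge distances between the feature and target spaces."""
    a_z, pi_z = mst_edges(z)   # A_Z and pi_Z
    a_y, pi_y = mst_edges(y)   # A_Y and pi_Y
    return ((a_z[pi_z] - a_y[pi_z]) ** 2).sum() + ((a_z[pi_y] - a_y[pi_y]) ** 2).sum()

# Training loss of Eq. (13), with loss_d from the previous sketch and
# task-dependent trade-off weights lambda_t, lambda_d:
#   loss = task_loss + lambda_t * loss_t(z, y) + lambda_d * loss_d(z, y)
```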
## 5. Experiments

We compare our method with four methods:

1. Information Dropout (InfDrop) (Achille & Soatto, 2018b). InfDrop serves as an IB baseline. It functions as a regularizer designed based on the IB principle, aiming to learn representations that are minimal, sufficient, and disentangled.
2. Ordinal Entropy (OE) (Zhang et al., 2023). OE acts as a regression baseline. It takes advantage of classification by learning a higher-entropy feature space for regression tasks.
3. Birdal's regularizer (i.e., $\mathcal{L}_d'$) (Birdal et al., 2021), which serves as an intrinsic dimension baseline.
4. Topological Autoencoder (i.e., $\mathcal{L}_t$) (Moor et al., 2020), which serves as a topology baseline.

Note that the proposed PH-Reg is mainly introduced to verify the established connections; we do not aim for state-of-the-art results.

### 5.1. Coordinate Prediction on the Synthetic Dataset

To verify the topological relationship between the feature space and the target space, we synthesize a dataset containing points sampled from topologically different objects: a Swiss roll, a torus, a circle, and the more complex Mammoth object (Coenen & Pearce, 2019). We randomly sample 3000 points with coordinates $y \in \mathbb{R}^3$ from each object. These 3000 points are then divided into 100 for training, 100 for validation, and 2800 for testing. Each point $y_i$ is encoded into a 100-dimensional vector $x_i = [f_1(y_i), f_2(y_i), f_3(y_i), f_4(y_i), \text{noise}]$, where dimensions 1–4 are signal and the remaining 96 dimensions are noise. The coordinate prediction task aims to learn the mapping $G(x) = \hat{y}$ from $x$ to $y$, and the mean-squared error $\mathcal{L}_{mse} = \frac{1}{N}\sum_i \|\hat{y}_i - y_i\|_2^2$ is adopted as the evaluation metric. We use a two-layer fully connected neural network with 100 hidden units as the baseline architecture. More details are given in Appendix G.

Table 1. Results ($\mathcal{L}_{mse}$) on the synthetic dataset. We report results as mean ± standard deviation over 10 runs. Bold numbers indicate the best performance.

| Method | Swiss Roll | Mammoth | Torus | Circle |
| --- | --- | --- | --- | --- |
| Baseline | 2.99 ± 0.43 | 211 ± 55 | 3.01 ± 0.11 | 0.154 ± 0.006 |
| + InfDrop | 4.15 ± 0.37 | 367 ± 50 | 2.05 ± 0.04 | 0.093 ± 0.003 |
| + OE | 2.95 ± 0.69 | 187 ± 88 | 2.83 ± 0.07 | 0.114 ± 0.007 |
| + $\mathcal{L}_d'$ | 2.74 ± 0.85 | 141 ± 104 | 1.13 ± 0.06 | 0.171 ± 0.04 |
| + $\mathcal{L}_d$ | 0.66 ± 0.08 | 89 ± 66 | 0.62 ± 0.12 | 0.090 ± 0.019 |
| + $\mathcal{L}_t$ | 1.83 ± 0.70 | 80 ± 61 | 0.95 ± 0.05 | 0.036 ± 0.004 |
| + $\mathcal{L}_d$ + $\mathcal{L}_t$ | **0.61 ± 0.17** | **49 ± 27** | **0.61 ± 0.05** | **0.013 ± 0.008** |

Table 1 shows that encouraging a lower intrinsic dimension while considering the target space ($+\mathcal{L}_d$) enhances performance, particularly for the Swiss roll and torus. In contrast, naively lowering the intrinsic dimension ($+\mathcal{L}_d'$) performs poorly. Enforcing topological similarity between the feature space and the target space ($+\mathcal{L}_t$) decreases $\mathcal{L}_{mse}$ by more than 60%, except for the Swiss roll. The best gains, however, are achieved by incorporating both $\mathcal{L}_t$ and $\mathcal{L}_d$, which decreases $\mathcal{L}_{mse}$ by more than 90% for the circle coordinate prediction task.

Figure 3 shows feature space visualizations obtained with t-SNE (100 dimensions → 3 dimensions). The feature space of the regression baseline shows a structure similar to the target space, especially for the Swiss roll and Mammoth, which indicates that regression potentially captures the topology of the target space. Regression $+\mathcal{L}_t$ clearly preserves the topology of the target space. Regression $+\mathcal{L}_d$ potentially preserves the topology of the target space, e.g. for the circle, while it primarily reduces the complexity of the feature space by maintaining the same intrinsic dimension as the target space. Combining $\mathcal{L}_d$ and $\mathcal{L}_t$ preserves the topology information while also reducing the complexity of the feature space, i.e. lowering its intrinsic dimension.

Figure 3. t-SNE visualization of the feature spaces (100 dimensions → 3 dimensions) with topologically different target spaces (columns: Target Space, Regression, Regression $+\mathcal{L}_t$, Regression $+\mathcal{L}_d$, Regression $+\mathcal{L}_t+\mathcal{L}_d$, Regression $+\mathcal{L}_d'$).
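A sketch of how the synthetic inputs described above can be generated: each 3-D coordinate $y$ is encoded into a 100-D vector whose first four entries carry the signal $f_1(y),\dots,f_4(y)$ and whose remaining 96 entries are noise built from other randomly chosen coordinates (cf. Appendix G). The exact sign pattern of $f_2$–$f_4$ and the grouping of the noise dimensions are reconstructions (the extracted text lost its minus signs), so treat the details as assumptions.

```python
# Sketch of the synthetic coordinate-prediction inputs: 4 signal + 96 noise dims.
import numpy as np

def signal(y: np.ndarray) -> np.ndarray:
    """[f1, f2, f3, f4](y): linear combinations from which y can be recovered
    (sign pattern assumed; see lead-in)."""
    y1, y2, y3 = y[..., 0], y[..., 1], y[..., 2]
    return np.stack([y1 + y2 + y3,
                     y1 + y2 - y3,
                     y1 - y2 + y3,
                     -y1 + y2 + y3], axis=-1)

def encode(coords: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Encode (N, 3) coordinates into (N, 100) inputs."""
    n = coords.shape[0]
    x_signal = signal(coords)                                   # (N, 4) signal dims
    noise_blocks = []
    for _ in range(24):                                         # 24 * 4 = 96 noise dims
        other = coords[rng.integers(0, n, size=n)]              # other random samples
        noise_blocks.append(signal(other))
    return np.concatenate([x_signal] + noise_blocks, axis=-1)   # (N, 100)

rng = np.random.default_rng(0)
t = 3 * np.pi * (1 + 2 * rng.random(3000))
y = np.stack([t * np.cos(t), 20 * rng.random(3000), t * np.sin(t)], axis=1)  # Swiss roll
x = encode(y, rng)
print(x.shape)  # (3000, 100)
```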
### 5.2. Real-World Regression Tasks

We conduct experiments on three real-world regression tasks: depth estimation (Table 4), super-resolution (Table 3), and age estimation (Table 2). The target spaces of the three tasks are topologically different: a 1-dimensional line for depth estimation, a 3-dimensional space for super-resolution, and discrete points for age estimation. Detailed settings, related introductions, and more discussions are given in Appendix H. Results on the three tasks demonstrate that both $\mathcal{L}_t$ and $\mathcal{L}_d$ can enhance performance, and combining both further boosts performance. Specifically, combining both achieves a 0.48 overall improvement (i.e. ALL) on age estimation, a PSNR improvement of 0.096 on super-resolution for Urban100, and a reduction of 6.7% in $\delta_1$ error on depth estimation.

Table 2. Quantitative comparison (MAE) on AgeDB. We report results as mean ± standard deviation over 3 runs. Bold numbers indicate the best performance.

| Method | ALL | Many | Med. | Few |
| --- | --- | --- | --- | --- |
| Baseline | 7.80 ± 0.12 | 6.80 ± 0.06 | 9.11 ± 0.31 | 13.63 ± 0.43 |
| + InfDrop | 8.04 ± 0.14 | 7.14 ± 0.20 | 9.10 ± 0.71 | 13.61 ± 0.32 |
| + OE | 7.65 ± 0.13 | 6.72 ± 0.09 | 8.77 ± 0.49 | 13.28 ± 0.73 |
| + $\mathcal{L}_d'$ | 7.75 ± 0.05 | 6.80 ± 0.11 | 8.87 ± 0.05 | 13.61 ± 0.50 |
| + $\mathcal{L}_d$ | 7.64 ± 0.07 | 6.82 ± 0.07 | 8.62 ± 0.20 | 12.79 ± 0.65 |
| + $\mathcal{L}_t$ | 7.50 ± 0.04 | 6.59 ± 0.03 | 8.75 ± 0.03 | 12.67 ± 0.24 |
| + $\mathcal{L}_d$ + $\mathcal{L}_t$ | **7.32 ± 0.09** | **6.50 ± 0.15** | **8.38 ± 0.11** | **12.18 ± 0.38** |

Table 3. Quantitative comparison (PSNR, dB) of super-resolution results on public benchmarks and the DIV2K validation set. Bold numbers indicate the best performance.

| Method | Set5 | Set14 | B100 | Urban100 | DIV2K |
| --- | --- | --- | --- | --- | --- |
| Baseline | 32.241 | 28.614 | 27.598 | 26.083 | 28.997 |
| + InfDrop | 32.219 | 28.626 | 27.594 | 26.059 | 28.980 |
| + OE | 32.280 | 28.659 | 27.614 | 26.117 | 29.005 |
| + $\mathcal{L}_d'$ | 32.252 | 28.625 | 27.599 | 26.078 | 28.989 |
| + $\mathcal{L}_d$ | 32.293 | 28.644 | 27.619 | 26.151 | 29.022 |
| + $\mathcal{L}_t$ | **32.322** | 28.673 | 27.624 | 26.169 | 29.031 |
| + $\mathcal{L}_d$ + $\mathcal{L}_t$ | 32.288 | **28.686** | **27.627** | **26.179** | **29.038** |

Table 4. Depth estimation results on NYU-Depth-v2. Bold numbers indicate the best performance.

| Method | $\delta_1$ | $\delta_2$ | $\delta_3$ | REL | RMS | log10 |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 0.792 | 0.955 | 0.990 | 0.153 | 0.512 | 0.064 |
| + InfDrop | 0.791 | **0.960** | **0.992** | 0.153 | 0.507 | 0.064 |
| + OE | **0.811** | - | - | **0.143** | **0.478** | **0.060** |
| + $\mathcal{L}_d'$ | 0.804 | 0.954 | 0.988 | 0.151 | 0.502 | 0.063 |
| + $\mathcal{L}_d$ | 0.795 | 0.959 | **0.992** | 0.150 | 0.497 | 0.063 |
| + $\mathcal{L}_t$ | 0.798 | 0.958 | 0.990 | 0.149 | 0.502 | 0.063 |
| + $\mathcal{L}_d$ + $\mathcal{L}_t$ | 0.807 | 0.959 | **0.992** | 0.144 | 0.481 | 0.061 |

### 5.3. Ablation Studies

**Hyperparameters $\lambda_t$ and $\lambda_d$:** We keep $\lambda_d$ and $\lambda_t$ at their default value of 10 for Swiss roll coordinate prediction and vary one of them to examine its impact. Figure 4(a) shows that when $\lambda_t \le 10$, the MSE decreases consistently as $\lambda_t$ increases. However, the topology term tends to overtake the original learning objective when set too high, i.e. 1000. Regarding $\lambda_d$, as shown in Figure 4(b), the MSE remains relatively stable over a large range of $\lambda_d$, with a slight increase in variance when $\lambda_d = 1000$.

**Sample size ($n_m$):** In practice, we model the feature space using a limited number of samples within a batch. For dense prediction tasks, the number of available samples is very large (number of pixels per image × batch size), while it is constrained to the batch size for image-wise prediction tasks. We investigate the influence of $n_m$ from Eqs. 10 and 11 on Swiss roll coordinate prediction. Figure 4(c) shows that PH-Reg performs better with a larger $n_m$, while maintaining stability even with a small $n_m$.

Figure 4. Ablation study based on (a–c) the Swiss roll coordinate prediction task and (d) the depth estimation task. Panels: (a) MSE vs. $\lambda_t$, (b) MSE vs. $\lambda_d$, (c) MSE vs. sample size, (d) ID of different methods.

**ID of different methods:** Figure 4(d) displays the intrinsic dimension of the last hidden layer, estimated using TwoNN (Facco et al., 2017), on the NYU-Depth-v2 test set for different methods throughout training. While our method is based on Birdal's estimator (Birdal et al., 2021), another estimator, TwoNN, also captures a decrease in ID when $\mathcal{L}_d$ is applied. We observe that without $\mathcal{L}_d$, the intrinsic dimension tends to increase after epoch 3, potentially overfitting to details, whereas $\mathcal{L}_d$ prevents such a trend.

Table 5. Quantitative comparison of time consumption and memory usage on the synthetic dataset and NYU-Depth-v2; the corresponding training durations are 10,000 epochs and 1 epoch, respectively.

| $n_m$ | Regularizer | Coordinate Prediction (2-layer MLP): Training (s) | Memory (MB) | Depth Estimation (ResNet-50): Training (s) | Memory (MB) |
| --- | --- | --- | --- | --- | --- |
| 0 | - | 8.88 | 959 | 1929 | 11821 |
| 100 | $\mathcal{L}_t$ | 175.06 | 959 | 1942 | 11833 |
| 100 | $\mathcal{L}_d$ | 439.68 | 973 | 1950 | 12211 |
| 100 | $\mathcal{L}_t$ + $\mathcal{L}_d$ | 617.41 | 973 | 1980 | 12211 |
| 300 | $\mathcal{L}_t$ + $\mathcal{L}_d$ | - | - | 2370 | 12211 |

**Efficiency:** Efficiency-wise, the computational cost is dominated by finding the minimum spanning tree from the distance matrix of the samples, which has a complexity of $O(n_m^2 \log n_m)$ using the simple Kruskal's algorithm and can be sped up with more advanced methods (Bauer, 2021). The synthetic experiments (Table 5) use a simple 2-layer MLP, so the regularizer adds significant computing time.
However, the real-world experiments on depth estimation (Table 5) use a Res Net-50 backbone, and the added time and memory are negligible (18.6% and 0.3%, respectively), even with nm = 300. These increases are only during training and do not add demands for inference. 6. Conclusion In this paper, we establish novel connections between topology and the IB principle for regression representation learning. The established connections imply that the desired Z should exhibit topological similarity to the target space and share the same intrinsic dimension as the target space. Inspired by the connections, we proposed a regularizer to learn the desired Z. Experiments on synthetic and real-world regression tasks demonstrate its benefits. Acknowledgement This research / project is supported by the Ministry of Education, Singapore, under the Academic Research Fund Tier 1 (FY2022). Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. Deep Regression Representation Learning with Topology Achille, A. and Soatto, S. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947 1980, 2018a. Achille, A. and Soatto, S. Information dropout: Learning optimal representations through noisy computation. IEEE transactions on pattern analysis and machine intelligence, 40(12):2897 2905, 2018b. Ansuini, A., Laio, A., Macke, J. H., and Zoccolan, D. Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems, 32, 2019. Barannikov, S., Trofimov, I., Balabin, N., and Burnaev, E. Representation topology divergence: A method for comparing neural network representations. ICML, 2021a. Barannikov, S., Trofimov, I., Sotnikov, G., Trimbach, E., Korotin, A., Filippov, A., and Burnaev, E. Manifold topology divergence: a framework for comparing data manifolds. Neur IPS, 34:7294 7305, 2021b. Bauer, U. Ripser: efficient computation of vietoris rips persistence barcodes. Journal of Applied and Computational Topology, 5(3):391 423, 2021. Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35 (8):1798 1828, 2013. Bevilacqua, M., Roumy, A., Guillemot, C., and Alberi Morel, M. L. Low-complexity single-image superresolution based on nonnegative neighbor embedding. 2012. Birdal, T., Lou, A., Guibas, L. J., and Simsekli, U. Intrinsic dimension, persistent homology and generalization in neural networks. Advances in Neural Information Processing Systems, 34:6776 6789, 2021. Boudiaf, M., Rony, J., Ziko, I. M., Granger, E., Pedersoli, M., Piantanida, P., and Ayed, I. B. A unifying mutual information view of metric learning: cross-entropy vs. pairwise losses. In European conference on computer vision, pp. 548 564. Springer, 2020. Brown, B. C., Caterini, A. L., Ross, B. L., Cresswell, J. C., and Loaiza-Ganem, G. Verifying the union of manifolds hypothesis for image data. In The Eleventh International Conference on Learning Representations, 2022a. Brown, B. C., Juravsky, J., Caterini, A. L., and Loaiza Ganem, G. Relating regularization and generalization through the intrinsic dimension of activations. ar Xiv preprint ar Xiv:2211.13239, 2022b. Chazal, F. and Michel, B. 
An introduction to topological data analysis: fundamental and practical aspects for data scientists. Frontiers in artificial intelligence, 4:108, 2021. Chen, C., Ni, X., Bai, Q., and Wang, Y. A topological regularizer for classifiers via persistent homology. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2573 2582. PMLR, 2019. Coenen, A. and Pearce, A. Understanding umap, mammoth dataset, 2019. URL https://github.com/ MNoichl/UMAP-examples-mammoth-/tree/ master. Eigen, D., Puhrsch, C., and Fergus, R. Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems, 27, 2014. Facco, E., d Errico, M., Rodriguez, A., and Laio, A. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific reports, 7(1): 12140, 2017. Ghosh, R. and Motani, M. Local intrinsic dimensional entropy. AAAI, 2023. Hatcher, A. Algebraic topology. Cambridge University Press, 2001. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Hofer, C., Kwitt, R., Niethammer, M., and Dixit, M. Connectivity-optimized representation learning via persistent homology. In International conference on machine learning, pp. 2751 2760. PMLR, 2019. Huang, J.-B., Singh, A., and Ahuja, N. Single image superresolution from transformed self-exemplars. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5197 5206, 2015. Kawaguchi, K., Deng, Z., Ji, X., and Huang, J. How does information bottleneck help deep learning? ICML, 2023. Lee, J. H., Han, M.-K., Ko, D. W., and Suh, I. H. From big to small: Multi-scale local planar guidance for monocular depth estimation. ar Xiv preprint ar Xiv:1907.10326, 2019. Lim, B., Son, S., Kim, H., Nah, S., and Mu Lee, K. Enhanced deep residual networks for single image superresolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 136 144, 2017. Deep Regression Representation Learning with Topology Ma, X., Wang, Y., Houle, M. E., Zhou, S., Erfani, S., Xia, S., Wijewickrema, S., and Bailey, J. Dimensionalitydriven learning with noisy labels. In International Conference on Machine Learning, pp. 3355 3364. PMLR, 2018. Martin, D., Fowlkes, C., Tal, D., and Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 2, pp. 416 423. IEEE, 2001. Moor, M., Horn, M., Rieck, B., and Borgwardt, K. Topological autoencoders. In International conference on machine learning, pp. 7045 7054. PMLR, 2020. Rieck, B., Yates, T., Bock, C., Borgwardt, K., Wolf, G., Turk-Browne, N., and Krishnaswamy, S. Uncovering the topology of time-varying fmri data using cubical persistence. Advances in neural information processing systems, 33:6900 6912, 2020. Shwartz-Ziv, R. and Tishby, N. Opening the black box of deep neural networks via information. ar Xiv preprint ar Xiv:1703.00810, 2017. Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. Indoor segmentation and support inference from rgbd images. In European conference on computer vision, pp. 746 760. Springer, 2012. Steele, J. M. Growth rates of euclidean minimal spanning trees with power weighted edges. The Annals of Probability, 16(4):1767 1787, 1988. 
Timofte, R., Agustsson, E., Van Gool, L., Yang, M.-H., and Zhang, L. Ntire 2017 challenge on single image super-resolution: Methods and results. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 114 125, 2017. Tishby, N. and Zaslavsky, N. Deep learning and the information bottleneck principle. In 2015 ieee information theory workshop (itw), pp. 1 5. IEEE, 2015. Trofimov, I., Cherniavskii, D., Tulchinskii, E., Balabin, N., Burnaev, E., and Barannikov, S. Learning topologypreserving data representations. ICLR, 2023. Tulchinskii, E., Kuznetsov, K., Kushnareva, L., Cherniavskii, D., Barannikov, S., Piontkovskaya, I., Nikolenko, S., and Burnaev, E. Intrinsic dimension estimation for robust detection of ai-generated texts. Neur IPS, 2023. Yang, Y., Zha, K., Chen, Y., Wang, H., and Katabi, D. Delving into deep imbalanced regression. In International Conference on Machine Learning, pp. 11842 11851. PMLR, 2021. Zeyde, R., Elad, M., and Protter, M. On single image scaleup using sparse-representations. In Curves and Surfaces: 7th International Conference, Avignon, France, June 24-30, 2010, Revised Selected Papers 7, pp. 711 730. Springer, 2012. Zhang, S., Yang, L., Mi, M. B., Zheng, X., and Yao, A. Improving deep regression with ordinal entropy. ICLR, 2023. Zhang, Y., Tian, Y., Kong, Y., Zhong, B., and Fu, Y. Residual dense network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2472 2481, 2018. Zhu, W., Qiu, Q., Huang, J., Calderbank, R., Sapiro, G., and Daubechies, I. Ldmnet: Low dimensional manifold regularized neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2743 2751, 2018. Deep Regression Representation Learning with Topology A.1. Proof of the Theorem 1 Theorem 1 Assume that the conditional entropy H(Z|X) is a fixed constant for Z Z for some set Z of the random variables, or Z is determined given X. Then, min Z IB = min Z {(1 β)H(Y|Z) + βH(Z|Y)}. Proof From the definition of the mutual information, we have I(Z; X) = H(Z) H(Z|X) = I(Z; Y) + H(Z|Y) H(Z|X). By substituting the right-hand side of this equation into I(Z; X), IB = I(Z; Y) + βI(Z; X) = (β 1)I(Z; Y) + βH(Z|Y) βH(Z|X) (14) Since I(Z; Y) = H(Y) H(Y|Z), IB = (β 1)(H(Y) H(Y|Z)) + βH(Z|Y) βH(Z|X) (15) = (1 β)H(Y|Z) + βH(Z|Y) + (β 1)H(Y) βH(Z|X). (16) 1) If H(Z|X) is a constant for Z Z. Since H(Y) is a fixed constant for any Z, this implies that IB = (1 β)H(Y|Z) + βH(Z|Y) + C, where C is a fixed constant for Z Z. Thus: min Z IB = min Z {(1 β)H(Y|Z) + βH(Z|Y)}. 2) If Z is determined given X, then H(Z|X) is not a term can be optimized. Since H(Y) is a fixed constant for any Z: min Z IB = min Z {(1 β)H(Y|Z) + βH(Z|Y)}. A.2. Proof of the Theorem 2 and Proposition 1 Theorem 2 Consider dataset S = {xi, zi, yi}N i=1 sampled from distribution P, where xi is the input, zi is the corresponding representation, and yi is the label. Let dmax = maxy Y minyi S ||y yi||2 be the maximum distance of y to its nearest yi. Assume (Z|Y = yi) follows a distribution D and the dispersion of D is bounded by its entropy: Ez D[||z z||2] Q(H(D)), (17) where z is the mean of the distribution D and Q(H(D)) is some function of H(D). Assume the regressor f is L1-Lipschitz continuous, then as dmax 0, we have: E{x,z,y} P [||f(z) y||2] E{x,z,y} S(||f(z) y||2) + 2L1Q(H(Z|Y)) (18) Proof For any sample {xi, zi, yi}, we define its local neighborhood set Ni as Ni = {{x, z, y} : ||y yi||2 < ||y yj||2, j = i, p(y) > 0}. 
(19) For each set Ni, we have E{x,z,y} Ni[||f(z) y||2] = E{x,z,y} Ni[||f(z) f(zi) + f(zi) yi + yi y||2] (20) E{x,z,y} Ni[||f(z) f(zi)||2] + E{x,z,y} Ni[||f(zi) yi||2] + E{x,z,y} Ni[||yi y||2] (21) L1E{x,z,y} Ni[||z zi||2] + E{x,z,y} Ni[||f(zi) yi||2] + dmax (22) =L1E{x,z,y} Ni[||z zi + zi zi||2] + E{x,z,y} Ni[||f(zi) yi||2] + dmax (23) L1E{x,z,y} Ni[||z zi||2 + || zi zi||2] + E{x,z,y} Ni[||f(zi) yi||2] + dmax (24) =L1E{x,z,y} Ni[||z zi||2] + L1|| zi zi||2 + E{x,z,y} Ni[||f(zi) yi||2] + dmax (25) Deep Regression Representation Learning with Topology We denote the probability distribution over {Ni} as P , where P(Ni) = P({x, z, y} Ni}). Then, we have E{x,z,y} P [||f(z) y||2] = ENi P E{x,z,y} Ni[||f(z) y||2] (26) ENi P [L1E{x,z,y} Ni[||z zi||2] + L1|| zi zi||2 + E{x,z,y} Ni[||f(zi) yi||2] + dmax] (27) =L1ENi P E{x,z,y} Ni[||z zi||2] + L1ENi P || zi zi||2 + E{x,z,y} S(||f(zi) yi||2) + dmax (28) As dmax 0, we can approximate ENi P E{x,z,y} Ni[||z zi||2] as Eyi YE{(x,z,y)|y=yi}[||z zi||2]. Since (Z|Y = yi) D, we have H(Z|Y) = Ey YH(Z|Y = y) = H(Z|Y = yi) = H(Z|Y = yj) = H(D) for all 1 i, j N, and ENi P ||zi zi||2 can thus be approximate as E{(x,z,y)|y=yi}||z zi||2. We have: E{x,z,y} P [||f(z) y||2] (29) L1ENi P E{x,z,y} Ni[||z zi||2] + L1ENi P || zi zi||2 + E{x,z,y} S(||f(zi) yi||2) + dmax (30) = L1Eyi YE{(x,z,y)|y=yi}[||z zi||2] + L1E{(x,z,y)|y=yi}||zi zi||2 + E{x,z,y} S(||f(zi) yi||2) (31) L1Eyi Y[Q(H(Z|Y = yi))] + L1Q(H(Z|Y = yi)) + E{x,z,y} S(||f(zi) yi||2) (32) = 2L1Q(H(Z|Y)) + E{x,z,y} S(||f(zi) yi||2) (33) Proposition 1 If D is a multivariate normal distribution N( z, Σ = k I), where k > 0 is a scalar and z is the mean of the distribution D. Then, the function Q(H(D)) in Theorem 2 can be selected as Q(H(D)) = q d(e2H(D)) 1 d 2πe , where d is the dimension of z. If D is a uniform distribution, then the Q(H(D)) can be selected as Q(H(D)) = e H(D) Proof We first consider the case when D N( z, Σ = k I). Assume Z N( z, Σ), then H(Z) = 1 2 log(2πe)n|Σ|: z p(z) log(p(z))dz (34) z p(z) log 1 2π)d|Σ| 1 2 e 1 2 (z z)TΣ 1(z z)dz (35) z p(z) log 1 2π)d|Σ| 1 2 dz Z z p(z) log e 1 2 (z z)TΣ 1(z z)dz (36) 2 log(2π)d|Σ| + log e i,j (zi zi)(Σ 1)ij(zj zj)] (37) 2 log(2π)d|Σ| + log e i,j (zi zi)(zj zj)(Σ 1)ij] (38) 2 log(2π)d|Σ| + log e i,j E[(zi zi)(zj zj)](Σ 1)ij (39) 2 log(2π)d|Σ| + log e i Σji(Σ 1)ij (40) 2 log(2π)d|Σ| + log e j (ΣΣ 1)j (41) 2 log(2π)d|Σ| + log e 2 log(2π)d|Σ| + log e 2 log(2πe)d|Σ| (44) Deep Regression Representation Learning with Topology We have the following: E[||z z||2 2] = tr(Σ) = dk. (45) The following also holds: |Σ| = kd. (46) Thus, we have: (E[||z z||2])2 E[||z z||2 2] = d|Σ| 1 d = d(e2H(Z) (2πe)d ) 1 d = d(e2H(Z)) 1 d 2πe (47) E[||z z||2] d(e2H(Z)) 1 d 2πe (48) Thus, Q(H(D)) in Theorem 2 can be selected as Q(H(D)) = q d(e2H(D)) 1 d 2πe , when D N( z, Σ = k I). Similarly, if D is a uniform distribution U(a, b), then its variance is given by: E[||z z||2 2] = (b a)2 and its entropy is given by: H(D) = log(b a). (50) (E[||z z||2])2 E[||z z||2 2] = (b a)2 12 = e2H(D) E[||z z||2] e H(D) Thus, Q(H(D)) in Theorem 2 can be selected as Q(H(D)) = e H(D) 12 , when D is a uniform distribution A.3. Proof of the Theorem 3 We first show a straightforward result of [(Ghosh & Motani, 2023), Proposition 1]: Lemma 1 Assume that z lies in a manifold M and the Mi M is a manifold corresponding to the distribution (z|y = yi). Assume for all features zi Mi, the following holds: Z ||z zi|| ϵ P(z)dz = C(ϵ), (53) where C(ϵ) is some function of ϵ. 
The above imposes a constraint where the distribution (z|y = yi) is uniformly distributed across Mi. Then, as ϵ 0+, we have: H(Z|Y) = Eyi YH(Z|Y = yi) = Eyi Y[ log(ϵ)Dim IDMi + log K C(ϵ)], (54) for some fixed scalar K. Dim IDMi is the intrinsic dimension of the manifold Mi. Deep Regression Representation Learning with Topology Proof By using the same proof technique as [(Ghosh & Motani, 2023), Proposition 1], we can show H(Z|Y = yi) = log(ϵ)Dim IDMi + log K C(ϵ), (55) Since H(Z|Y) = Eyi YH(Z|Y = yi), the result follows. Theorem 3 Assume that z lies in a manifold M and the Mi M is a manifold corresponding to the distribution (z|y = yi). Let C(ϵ) be some function of ϵ: ||z z || ϵ P (z)dz, (56) where P (z) is the probability of z when (z|y = yi) is uniformly distributed across Mi, and z is any point on Mi. Then, as ϵ 0+, we have: H(Z|Y) = Eyi YH(Z|Y = yi) Eyi Y[ log(ϵ)Dim IDMi + log K C(ϵ)], (57) for some fixed scalar K. Dim IDMi is the intrinsic dimension of the manifold Mi. Proof Since the uniform distribution has the largest entropy over all distributions over the support Mi, based on Lemma 1, we thus have: H(Z|Y) = Eyi YH(Z|Y = yi) Eyi Y[ log(ϵ)Dim IDMi + log K C(ϵ)], (58) A.4. Proof of the Proposition 2 Proposition 2 Let the target Y = Y + N where Y is fully determined by X and N is the aleatoric uncertainty that is independent of X. Assume the underlying mapping f from Z to Y and g from Y to Z are continuous, where the continuous mapping is based on the topology induced by the Euclidean distance. Then the representation Z is optimal if and only if Z is homeomorphic to Y . Proof If Z is optimal (optimal Z Z is homeomorphic to Y ): H(Y|Z) = H(Y + N |Z) = H(Y |Z) + H(N |Z, Y ) = H(Y |Z) + H(N |Y ), (59) H(Y|X) = H(Y + N |X) = H(Y |X) + H(N |X, Y ) = H(Y |X) + H(N |Y ). (60) Since Z is optimal, we have H(Y|Z) = H(Y|X). Based on the two equations above, we have: H(Y |Z) = H(Y |X). (61) Since Y is fully determined by X and H(Y |Z) = H(Y |X), Y is also fully determined by Z. Thus, for each zi Z, there exists and only exists one y i Y corresponding to the zi, and thus the mapping function f exists. Z is optimal also means Z is fully determined given Y, Since N is independent of Z: H(Z|Y) = H(Z|Y + N ) = H(Z|Y ), (62) thus, for each y i, there exist and only exist one zi corresponding to the y i. Thus, the mapping function f is a bijection, and its inverse f 1 is g . Since f and f 1 are continuous, Z is homeomorphic to Y. If Z is homeomorphic to Y : ( Z is homeomorphic to Y optimal Z ): Z is homeomorphic to Y means a continuous bijection exist between Z and Y , thus H(Z|Y) = H(Z|Y ) = H(Y |Z) and H(Z|Y) is minimal. We have: H(Y|Z) = H(Y |Z) + H(N |Y ) = H(N |Y ) = H(Y |X) + H(N |Y ) = H(Y|X), (63) thus, Z is optimal. Deep Regression Representation Learning with Topology B. Discussions about Theorem 1 Choice of β in Theorem 1: β in (0, 1) means we need to maximize I(Z; Y) for sufficiency, while we want to minimize I(Z; X) for minimality. When β > 1, then we value the minimality more than the sufficiency, resulting in the need to maximize H(Y|Z). But, in the typical setting, we always value I(Z; Y) more than I(Z, X) for a good task-specific performance, and β > 0 will lead Z compressed to be a single point, as H(Z|Y) is minimal and H(Y|Z) is maximized in this case. The qualitative behavior should change based on 0 < β < 1 or β > 1, as it controls which we value the more: sufficiency or minimality. 
Difference between the target Y and the predicted Y : The target Y is different from the predicted Y . H(Y |Z) always equals to 0 if we exploit neural networks as deterministic functions. In an extreme case, we can treat the predicted Y as the representation Z, which shows minimizing H(Y |Y) = 0 is the learning target. From the invariance representation learning point of view, lowering H(Z|Y) is learning invariance representations with respect to Y. More discussions about the assumptions: For discrete entropy, Z is determined given Y implies H(Z|Y) is a constant for Z, as H(Z|Y) = 0 here. However, this does not hold for differential entropy. For differential entropy, Z is determined given Y means H(Z|Y) = , since given Y, Z is distributed as a delta function in this case. C. Connections with the bound in Kawaguchi et al. (2023) Kawaguchi et al. (2023) provide several bounds that are all applicable for both classification and regression for various cases. In the case where the encoder model ϕ (whose output is z) and the training dataset of the downstream task ˆS = {xi, yi}N i=1 are dependent (e.g., this is when z = ϕ(x) for x / S is dependent of all N training data points (xi, yi)N i=1 through the training of ϕ by using ˆS), they show that any valid and general generalization bound of the information bottleneck must include two terms, I(X; Z|Y) and I(ϕ; ˆS), where the second term measures the effect of overfitting the encoder ϕ. This is because the encoder ϕ can compress all information to minimize I(X; Z|Y) arbitrarily well while overfitting to the training data: e.g., we can simply set ϕ(xi) = yi for all (xi, yi) ˆS and ϕ(x) = c = y for all (x, y) / ˆS for some constant c to achieve the best training loss while minimizing I(X; Z|Y) and performing arbitrarily poorly for test loss. Given this observation, they prove the first rigorous generalization bounds for two separate cases based on the dependence of ϕ and ˆS. Their generalization bounds scales with I(X; Z|Y) without the second term I(ϕ; ˆS) in case of ϕ and ˆS being independent, and with I(X; Z|Y) and I(ϕ; ˆS) in case of ϕ and ˆS being dependent. In Theorem 2, we consider the case where ϕ and ˆS are independent, since z in (x, z, y) P is drawn without dependence on the entire N data points {xi, yi}N i=1 in equation (18). Thus, our results are consistent and do not contradict previous findings. Unlike the previous bounds, our bound is determined by the function Q, which characterizes the dispersion or the standard deviation of a distribution by its entropy. The function Q exists for general cases, as the dispersion or the standard deviation and the entropy commonly can be estimated for a specific distribution. We thus can find a function Q to upper bound the relationship on the entropy and its dispersion or the standard deviation. It is worth mentioning that we are not targeting a tight or an advanced bound here. Our bound is introduced to support the analysis that follows after Theorem 2, which is challenging with the previous bounds. D. From Proposition 2, whether the optimal representation Z is the one that equals the ground-truth label? Such Z can be one of the optimal/best representations under the negligible aleatoric uncertainty setting. However, Z is not unique and Proposition 2 is broader than this statement as it says that the optimal representation Z should be homeomorphic to the ground truth - i.e. they only need to be topologically equivalent. 
A practical benefit of Proposition 2 is its guidance on the desirable topological properties of Z. For example, if the target space is a single connected component (i.e., β0 = 1), then the feature space should likewise be a single connected component. This does not hold automatically: the task-specific loss alone cannot guarantee a single connected feature space, and the topology of the feature space is also influenced by the input space. In addition, we observe empirically on depth estimation that the feature space sometimes consists of multiple disconnected components.

Table 6. Depth estimation on NYU-Depth-V2. Here, Random means the encoder is fixed in a random state, while PH-Reg means we first train the encoder purely with PH-Reg for 1 epoch, then fix it and train the regressor.

Encoder | Regressor | δ1 | REL | RMS | log10
Random | Linear | 0.398 | 0.390 | 1.144 | 0.153
PH-Reg | Linear | 0.428 | 0.391 | 1.043 | 0.153
Random | Non-linear | 0.412 | 0.381 | 1.121 | 0.149
PH-Reg | Non-linear | 0.440 | 0.374 | 1.052 | 0.141

E. Discussions about the regressor

1. Do the appropriate properties (i.e., lower ID and homeomorphism) depend on the regressor? Although the appropriate feature space may vary from regressor to regressor, the appropriate properties, as supported by our theorems, do not depend on the regressor. Specifically, our Theorem 1 shows that the IB tradeoff is fully determined by the values of H(Z|Y) and H(Y|Z). These entropy terms do not depend on the regressor that maps Z to the predicted Ŷ. In addition, for depth estimation, representations learned purely by PH-Reg without any other loss terms are also highly competitive (see Table 6).

2. Topology regularization with simple vs. highly expressive regressors. For both regressors, our regularizer leads to representations with a higher signal-to-noise ratio (since H(Z|Y) and H(Y|Z) are minimized). This should make it easier for the regressor to estimate the true underlying signal. However, more expressive regressors have a higher capacity to estimate the underlying signal directly, so the room for improvement from the regularization is reduced. In the extreme case of excessive expressiveness, our regularizer may again lead to improvements, as it may limit overfitting by minimizing noise in the learned representation. The topological properties supported by our theorems align with invariant representation learning (i.e., invariance to noise), where invariance serves as a general prior for desirable representation properties (Bengio et al., 2013).

F. Preliminaries on Topology

The simplicial complex is a central object in topological data analysis, and it can be exploited as a tool to model the shape of data. Given a finite set of samples S = {s_i}, a simplicial complex K can be seen as a collection of simplices σ = {s_0, ..., s_k} of varying dimensions: vertices (|σ| = 1), edges (|σ| = 2), and their higher-dimensional counterparts (|σ| > 2). The faces of a simplex σ = {s_0, ..., s_k} are the simplices spanned by the subsets of {s_0, ..., s_k}. The dimension of the simplicial complex K is the largest dimension of its simplices. A simplicial complex can be regarded as a high-dimensional generalization of a graph, and a graph can be seen as a 1-dimensional simplicial complex.
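As a concrete instance of these objects in dimension one, the 0th Betti number of a graph (a 1-dimensional simplicial complex) is simply its number of connected components and can be computed with a union-find structure. The sketch below is illustrative and is not part of PH-Reg.

```python
# Sketch: beta_0 (number of connected components) of a graph given by
# vertices and edges, computed with union-find.
def betti0(num_vertices, edges):
    parent = list(range(num_vertices))

    def find(a):                      # representative of a's component (path halving)
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for u, v in edges:                # each edge merges at most two components
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    return len({find(v) for v in range(num_vertices)})

print(betti0(3, [(0, 1)]))            # 2: vertex 2 is isolated
print(betti0(3, [(0, 1), (1, 2)]))    # 1: all vertices connected
```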
For each S, there exist many ways to build simplicial complexes, and the Vietoris-Rips complexes are widely used:

Definition 3 (Vietoris-Rips Complexes). Given a finite set of samples S, sampled from the feature space or the target space, and a threshold α ≥ 0, the Vietoris-Rips complex VR_α(S) is defined as:

$VR_\alpha(S) = \{\{s_0, \ldots, s_k\} \subseteq S \mid d(s_i, s_j) \le \alpha\}$,   (64)

where d(s_i, s_j) is the Euclidean distance between samples s_i and s_j. That is, VR_α(S) is the set of all simplices {s_0, ..., s_k} whose pairwise distances d(s_i, s_j) are within the threshold α.

Let C_k(VR_α(S)) denote the vector space generated by its k-dimensional simplices over Z_2 (this is not specific to Z_2, but Z_2 is a typical choice). The boundary operator ∂_k : C_k(VR_α(S)) → C_{k−1}(VR_α(S)), which maps each simplex to its boundary, i.e., the sum of all its faces, is a homomorphism from C_k(VR_α(S)) to C_{k−1}(VR_α(S)). It can be shown that ∂_k ∘ ∂_{k+1} = 0, which leads to the chain complex

$\cdots \xrightarrow{\partial_{k+2}} C_{k+1}(VR_\alpha(S)) \xrightarrow{\partial_{k+1}} C_k(VR_\alpha(S)) \xrightarrow{\partial_k} C_{k-1}(VR_\alpha(S)) \xrightarrow{\partial_{k-1}} \cdots$,

and the k-th homology group H_k(VR_α(S)) is defined as the quotient group H_k(VR_α(S)) := ker ∂_k / im ∂_{k+1}. Here ker denotes the kernel, the set of all elements mapped to the zero element, and im denotes the image, the set of all outputs. The rank of H_k(VR_α(S)) is known as the k-th Betti number β_k, which counts the number of k-dimensional holes and can be used to represent the topological features of the manifold from which the set of points S is sampled.

Figure 5. Illustration of (a) the use of PH-Reg for regression (an encoder maps the input x to the representation z, a regressor maps z to the prediction ŷ, and the task-specific loss compares ŷ with y), and (b) the calculation of PH0(VR(S)) from VR_α at different birth and death thresholds. Here S = {s1, s2, s3}: three connected components ({{s1}, {s2}, {s3}}, i.e., β0 = 3) are born at α = 0; one dies (two remain, {{s1, s3}, {s2}}) at α = α1, and another dies (one remains, {{s1, s2, s3}}) at α = α2. Thus PH0(VR(S)) = {[0, α1], [0, α2]}.

However, H_k(VR_α(S)) is computed at a single α and is therefore easily affected by small changes in S; it is not robust and is of limited use for real-world datasets. Persistent homology considers all possible values of α instead of a single one, which results in a sequence of β_k. This is achieved through a nested sequence of simplicial complexes, called a filtration: VR_0(S) ⊆ VR_{α_1}(S) ⊆ ... ⊆ VR_{α_m}(S) for 0 ≤ α_1 ≤ ... ≤ α_m. Let γ_i = [α_i, α_j] be the interval corresponding to a k-dimensional hole born at threshold α_i and dying at threshold α_j; we denote by PH_k(VR(S)) = {γ_i} the set of birth-death intervals of the k-dimensional holes. We only exploit PH_0(VR(S)) in our PH-Reg, since we exploit the topological autoencoder as the topology part and higher-order topological features merely increase its runtime. An illustration of the calculation of PH_0(VR(S)) is given in Figure 5(b). We define $E(S) = \sum_{\gamma \in PH_0(VR(S))} |I(\gamma)|$, where |I(γ)| is the length of the interval γ.

G. More details about the Coordinate Prediction task

Details about the synthetic dataset: We encode coordinates y_i ∈ R^3 into 100-dimensional vectors x_i = [f_1(y_i), f_2(y_i), f_3(y_i), f_4(y_i), noise], where dimensions 1-4 are the signal and the remaining 96 dimensions are noise. The encoder functions f_i are defined as:

f_1(y_i) = y_{i1} + y_{i2} + y_{i3}
f_2(y_i) = y_{i1} + y_{i2} − y_{i3}
f_3(y_i) = y_{i1} − y_{i2} + y_{i3}
f_4(y_i) = −y_{i1} + y_{i2} + y_{i3}

As shown above, the exact coordinates y_i can be recovered once f_1(y_i), f_2(y_i), f_3(y_i), f_4(y_i) are given. We fill the remaining 96 dimensions with noise by applying f_1, f_2, f_3, f_4 to other randomly selected samples y_j. The proximity of y_j to y_i can intuitively be seen as an indicator of the noise's relationship to the signal.
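The synthetic inputs described above can be generated with a short script. The sketch below follows the stated f_1–f_4; the target sampling range and the exact way the 96 noise dimensions are filled (here, f_1–f_4 applied to 24 other randomly chosen samples y_j) are assumptions rather than the paper's exact protocol.

```python
# Sketch of the synthetic coordinate-prediction dataset (assumptions noted above).
import numpy as np

rng = np.random.default_rng(0)

def signal(y):
    """f1..f4: four linear views from which y can be recovered exactly."""
    y1, y2, y3 = y
    return np.array([y1 + y2 + y3, y1 + y2 - y3, y1 - y2 + y3, -y1 + y2 + y3])

def make_dataset(n_samples=1000):
    ys = rng.uniform(-1.0, 1.0, size=(n_samples, 3))          # hypothetical target range
    xs = np.empty((n_samples, 100))
    for i, y in enumerate(ys):
        xs[i, :4] = signal(y)                                  # 4 signal dimensions
        others = rng.choice(np.delete(np.arange(n_samples), i), size=24, replace=False)
        xs[i, 4:] = np.concatenate([signal(ys[j]) for j in others])  # 96 noise dimensions
    return xs, ys

xs, ys = make_dataset()
print(xs.shape, ys.shape)              # (1000, 100) (1000, 3)
# Sanity check: y is linearly recoverable from the four signal dimensions.
f = xs[0, :4]
y_rec = 0.5 * np.array([f[0] - f[3], f[0] - f[2], f[0] - f[1]])
print(np.allclose(y_rec, ys[0]))       # True
```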
More training details: We train the models for 10000 epochs using AdamW as the optimizer with a learning rate of 0.001. We report results as mean ± standard deviation over 10 runs. For the regression baseline Ordinal Entropy and the IB baseline Information Dropout, we tried various weights {0.01, 0.1, 1, 10} and report the best results. The trade-off parameters λ_d and λ_t are set to 10 and 100 by default, respectively; for Mammoth, λ_t is set to 10000 and λ_d to 1, and for the torus and circle, λ_d is set to 1.

Table 7. Results on AgeDB. We report results as mean ± standard deviation over 3 runs. Bold numbers indicate the best performance.

Method | MAE ALL | MAE Many | MAE Med. | MAE Few | GM ALL | GM Many | GM Med. | GM Few
Baseline (Yang et al., 2021) | 7.80 ± 0.12 | 6.80 ± 0.06 | 9.11 ± 0.31 | 13.63 ± 0.43 | 4.98 ± 0.05 | 4.32 ± 0.06 | 6.19 ± 0.07 | 10.29 ± 0.57
+ Information dropout (Achille & Soatto, 2018b) | 8.04 ± 0.14 | 7.14 ± 0.20 | 9.10 ± 0.71 | 13.61 ± 0.32 | 5.11 ± 0.06 | 4.49 ± 0.17 | 6.14 ± 0.49 | 10.54 ± 0.65
+ Ordinal Entropy (Zhang et al., 2023) | 7.65 ± 0.13 | 6.72 ± 0.09 | 8.77 ± 0.49 | 13.28 ± 0.73 | 4.91 ± 0.14 | 4.29 ± 0.06 | 6.04 ± 0.51 | 10.09 ± 0.62
+ L′d | 7.75 ± 0.05 | 6.80 ± 0.11 | 8.87 ± 0.05 | 13.61 ± 0.50 | 4.96 ± 0.04 | 4.33 ± 0.09 | 6.05 ± 0.36 | 10.43 ± 0.40
+ Ld | 7.64 ± 0.07 | 6.82 ± 0.07 | 8.62 ± 0.20 | 12.79 ± 0.65 | 4.85 ± 0.05 | 4.27 ± 0.06 | 5.91 ± 0.13 | 9.75 ± 0.53
+ Lt | 7.50 ± 0.04 | 6.59 ± 0.03 | 8.75 ± 0.03 | 12.67 ± 0.24 | 4.77 ± 0.07 | 4.27 ± 0.06 | 6.09 ± 0.03 | 9.34 ± 0.70
+ Ld + Lt | 7.32 ± 0.09 | 6.50 ± 0.15 | 8.38 ± 0.11 | 12.18 ± 0.38 | 4.69 ± 0.07 | 4.15 ± 0.08 | 5.64 ± 0.09 | 8.99 ± 0.38

Table 8. Quantitative comparison of super-resolution results on public benchmarks and the DIV2K validation set. We report results as PSNR (dB) / SSIM. Bold numbers indicate the best performance.

Method | Set5 | Set14 | B100 | Urban100 | DIV2K
Baseline (Lim et al., 2017) | 32.241 / 0.8656 | 28.614 / 0.7445 | 27.598 / 0.7120 | 26.083 / 0.7645 | 28.997 / 0.8189
+ Information dropout (Achille & Soatto, 2018b) | 32.219 / 0.8649 | 28.626 / 0.7441 | 27.594 / 0.7113 | 26.059 / 0.7624 | 28.980 / 0.8182
+ Ordinal Entropy (Zhang et al., 2023) | 32.280 / 0.8653 | 28.659 / 0.7445 | 27.614 / 0.7119 | 26.117 / 0.7641 | 29.005 / 0.8188
+ L′d | 32.252 / 0.8653 | 28.625 / 0.7443 | 27.599 / 0.7118 | 26.078 / 0.7638 | 28.989 / 0.8186
+ Ld | 32.293 / 0.8660 | 28.644 / 0.7453 | 27.619 / 0.7127 | 26.151 / 0.7662 | 29.022 / 0.8197
+ Lt | 32.322 / 0.8663 | 28.673 / 0.7455 | 27.624 / 0.7127 | 26.169 / 0.7665 | 29.031 / 0.8196
+ Ld + Lt | 32.288 / 0.8663 | 28.686 / 0.7462 | 27.627 / 0.7132 | 26.179 / 0.7670 | 29.038 / 0.8201

H. Details about the real-world tasks

H.1. Evaluation metrics

Depth Estimation. We denote the predicted depth at position p as y_p and the corresponding ground-truth depth as y*_p; the total number of pixels is n. The metrics are: 1) threshold accuracy δ_1: the percentage of pixels p such that $\max(y_p / y^*_p,\; y^*_p / y_p) < t_1$, where t_1 = 1.25; 2) average relative error (REL): $\frac{1}{n}\sum_p \frac{|y_p - y^*_p|}{y^*_p}$; 3) root mean squared error (RMS): $\sqrt{\frac{1}{n}\sum_p (y_p - y^*_p)^2}$; 4) average log10 error: $\frac{1}{n}\sum_p |\log_{10}(y_p) - \log_{10}(y^*_p)|$.

Age Estimation. Given N images for testing, y_i and y*_i are the i-th prediction and ground truth, respectively. The evaluation metrics are: 1) MAE: $\frac{1}{N}\sum_{i=1}^{N} |y_i - y^*_i|$; and 2) Geometric Mean (GM): $\big(\prod_{i=1}^{N} |y_i - y^*_i|\big)^{\frac{1}{N}}$.

H.2. Age estimation on the AgeDB-DIR dataset

We exploit AgeDB-DIR (Yang et al., 2021) for the age estimation task. We follow the setting of Yang et al. (2021) and exploit their regression baseline model, which uses ResNet-50 (He et al., 2016) as the backbone. The evaluation metrics are MAE and geometric mean (GM), and the results are reported on the whole set and on the three disjoint subsets, i.e., Many, Med. and Few. The trade-off parameters λ_d and λ_t are set to 0.1 and 1, respectively.
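For reference, the evaluation metrics of Section H.1 can be implemented in a few lines. The sketch below is a plain NumPy version; the handling of invalid pixels, reduction order, and the small constant added for numerical stability in GM follow common practice and are assumptions, not the paper's exact evaluation code.

```python
# Sketch implementations of the metrics in Sec. H.1 (assumptions noted above).
import numpy as np

def depth_metrics(pred, gt):
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)                              # threshold accuracy
    rel = np.mean(np.abs(pred - gt) / gt)                       # average relative error
    rms = np.sqrt(np.mean((pred - gt) ** 2))                    # root mean squared error
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))      # average log10 error
    return delta1, rel, rms, log10

def age_metrics(pred, gt):
    err = np.abs(pred - gt)
    mae = err.mean()
    gm = np.exp(np.mean(np.log(err + 1e-8)))                    # geometric mean of errors
    return mae, gm

# Usage with dummy values:
pred, gt = np.array([2.0, 3.1, 0.9]), np.array([2.1, 3.0, 1.0])
print(depth_metrics(pred, gt))
print(age_metrics(pred, gt))
```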
Table 7 shows that both L_t and L_d improve the performance, and combining both achieves an overall (i.e., ALL) improvement of 0.48 on MAE and 0.29 on GM.

H.3. Super-resolution on the DIV2K dataset

We use the DIV2K dataset (Timofte et al., 2017) for 4x super-resolution training (without the 2x pretrained model) and evaluate on the validation set of DIV2K and the standard benchmarks: Set5 (Bevilacqua et al., 2012), Set14 (Zeyde et al., 2012), BSD100 (Martin et al., 2001), and Urban100 (Huang et al., 2015). We follow the setting of Lim et al. (2017) and exploit their small-size EDSR model as our baseline architecture. We adopt the standard metrics PSNR and SSIM. The trade-off parameters λ_d and λ_t are set to 0.1 and 1, respectively. Table 3 shows that both L_d and L_t contribute to improving the baseline, and adding both terms has the largest impact.

H.4. Depth estimation on the NYU-Depth-v2 dataset

We exploit NYU-Depth-v2 (Silberman et al., 2012) for the depth estimation task. We follow the setting of Lee et al. (2019) and use ResNet-50 as our baseline architecture. We exploit the standard metrics of threshold accuracy δ_1, δ_2, δ_3, average relative error (REL), root mean squared error (RMS), and average log10 error. The trade-off parameters λ_d and λ_t are both set to 0.1. Table 4 shows that exploiting L_t and L_d results in reductions of 6.7% and 8.9% in the δ_1 and δ_2 errors, respectively.

Table 9. Quantitative comparison (MAE) on AgeDB. We report results as mean ± standard deviation over 3 runs.

Method | ALL | Many | Med. | Few
Baseline | 7.80 ± 0.12 | 6.80 ± 0.06 | 9.11 ± 0.31 | 13.63 ± 0.43
+ LDS (Yang et al., 2021) | 7.67 | 6.98 | 8.86 | 10.89
+ FDS (Yang et al., 2021) | 7.55 | 6.50 | 8.97 | 13.01
+ LDS + FDS (Yang et al., 2021) | 7.55 | 7.01 | 8.24 | 10.79
+ PH-Reg (ours) | 7.32 ± 0.09 | 6.50 ± 0.15 | 8.38 ± 0.11 | 12.18 ± 0.38
LDS + FDS vs. Baseline | +0.25 | -0.19 | +0.97 | +2.94
Ours vs. Baseline | +0.48 | +0.30 | +0.73 | +1.45

Table 10. Quantitative comparison (PSNR (dB)) of super-resolution results on public benchmarks and the DIV2K validation set. Bold numbers indicate the best performance.

Method | Set5 | Set14 | B100 | Urban100 | DIV2K
Baseline (small-size EDSR (Lim et al., 2017)) | 32.24 | 28.61 | 27.60 | 26.08 | 29.00
EDSR (Lim et al., 2017) | 32.46 | 28.80 | 27.71 | 26.64 | 29.25
RDN (Zhang et al., 2018) | 32.47 | 28.81 | 27.72 | 26.61 | -
Baseline + PH-Reg (ours) | 32.29 | 28.69 | 27.63 | 26.18 | 29.04
RDN vs. EDSR | +0.01 | +0.01 | +0.01 | -0.03 | -
Ours vs. Baseline | +0.05 | +0.08 | +0.03 | +0.10 | +0.04

H.5. Different improvement gap between synthetic and real-world datasets

Our synthetic dataset is relatively simple and clean, and the corresponding task (coordinate prediction) is directly related to the topology of the target space, hence the larger improvement on the synthetic data. The improvements on the real-world datasets are significant (verified by Welch's t-test) and are also comparable to or better than those of competing works in the literature (Zhang et al., 2023; Yang et al., 2021; Lim et al., 2017; Zhang et al., 2018). Under the same settings, our improvements are competitive with recently published works. For age estimation on AgeDB, our MAE (ALL) improvement is almost 2-fold that of LDS + FDS (0.48 vs. 0.25, see Table 9). For super-resolution on DIV2K, we improve the PSNR on Set5 by 0.05 (see Table 10), while typical state-of-the-art papers in super-resolution show increments of 0.01 on PSNR (Set5) (Lim et al., 2017; Zhang et al., 2018). For depth estimation, our improvements are on par with the competing method (Zhang et al., 2023) (REL: 0.144 vs. 0.143, see Table 4).
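The Welch's t-test mentioned above compares per-run metric values of the baseline and PH-Reg without assuming equal variances. A minimal sketch using SciPy is shown below; the per-run numbers are made up for illustration and are not the paper's actual runs.

```python
# Sketch of a Welch's t-test on per-run results (illustrative values only).
from scipy.stats import ttest_ind

baseline_mae = [7.72, 7.80, 7.88]      # hypothetical per-run MAE (ALL) values
phreg_mae    = [7.23, 7.32, 7.41]

t_stat, p_value = ttest_ind(baseline_mae, phreg_mae, equal_var=False)  # Welch's t-test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small p: improvement unlikely due to chance
```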
However, on NYU-Depth-V2, the representation of the head part (with many samples in the target space) is already relatively well learned, which reduces the impact of our regularizers and might be a reason for the smaller improvement. In the regression baseline (Figure 2(a)), the representation in the blue region (head part) already shows a lower intrinsic dimension and collapses to something like a line, so there is little opportunity for our regularizers to have a strong impact (hence the smaller improvement on the MAE (Many)). In contrast, the impact on the synthetic dataset and on the green region (corresponding to the tail part, with limited samples in the target space) is more significant, resulting in a larger improvement. Similar effects are observed on the other two real-world datasets: the feature space on AgeDB tends to be a line (although the target space is discrete, it is very dense), and the feature space on DIV2K tends to be an object with a high density in the middle region (the target space is 3D).
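A practical way to check observations such as "the head-part features collapse to a line" is to estimate the intrinsic dimension of the features in each region. The sketch below uses the TwoNN estimator of Facco et al. (2017) as one common choice; it is a diagnostic assumption, not necessarily the estimator used in the paper, and the synthetic line-like and Gaussian features stand in for real head- and tail-region features.

```python
# Sketch: TwoNN intrinsic-dimension estimate from the ratio of 2nd to 1st
# nearest-neighbor distances (assumptions noted above).
import numpy as np
from scipy.spatial import cKDTree

def twonn_id(features):
    """MLE intrinsic dimension from nearest-neighbor distance ratios."""
    tree = cKDTree(features)
    dists, _ = tree.query(features, k=3)     # self, 1st and 2nd neighbors
    mu = dists[:, 2] / dists[:, 1]
    mu = mu[np.isfinite(mu) & (mu > 1.0)]    # drop degenerate pairs (duplicates)
    return len(mu) / np.sum(np.log(mu))

rng = np.random.default_rng(0)
line_like = np.outer(rng.uniform(size=2000), rng.normal(size=64))  # features on a line in R^64
blob = rng.normal(size=(2000, 64))                                 # full-dimensional features
print("line-like features ID ~", round(twonn_id(line_like), 1))    # close to 1
print("Gaussian features  ID ~", round(twonn_id(blob), 1))         # much larger
```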