# neural_feature_learning_in_function_space__e82a999d.pdf Journal of Machine Learning Research 25 (2024) 1-76 Submitted 9/23; Revised 1/24; Published 5/24 Neural Feature Learning in Function Space Xiangxiang Xu xuxx@mit.edu Lizhong Zheng lizhong@mit.edu Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Cambridge, MA 02139-4307, USA Editor: Kilian Weinberger We present a novel framework for learning system design with neural feature extractors. First, we introduce the feature geometry, which unifies statistical dependence and feature representations in a function space equipped with inner products. This connection defines function-space concepts on statistical dependence, such as norms, orthogonal projection, and spectral decomposition, exhibiting clear operational meanings. In particular, we associate each learning setting with a dependence component and formulate learning tasks as finding corresponding feature approximations. We propose a nesting technique, which provides systematic algorithm designs for learning the optimal features from data samples with off-the-shelf network architectures and optimizers. We further demonstrate multivariate learning applications, including conditional inference and multimodal learning, where we present the optimal features and reveal their connections to classical approaches. Keywords: Feature geometry, information processing, neural feature learning, nesting technique, multivariate dependence decomposition 1. Introduction Learning useful feature representations from data observations is a fundamental task in machine learning. Early developments of such algorithms focused on learning optimal linear features, e.g., linear regression, PCA (Principal Component Analysis) (Pearson, 1901), CCA (Canonical Correlation Analysis) (Hotelling, 1936), and LDA (Linear Discriminant Analysis) (Fisher, 1936). The resulting algorithms admit straightforward implementations, with well-established connections between learned features and statistical behaviors of data samples. However, practical learning applications often involve data with complex structures which linear features fail to capture. To address such problems, practitioners employ more complicated feature designs and build inference models based on these features, e.g., kernel methods (Cortes and Vapnik, 1995; Hofmann et al., 2008) and deep neural networks (Le Cun et al., 2015). The feature representations serve as the information carrier, capturing useful information from data for subsequent processing. An illustration of such feature-centric learning systems is shown in Figure 1, which consists of two parts: 1. A learning module which generates a collection of features from the data. Data can take different forms, for example, input-output pairs1 or some tuples. The features can be either specified implicitly, e.g., by a kernel function in kernel methods, or explicitly parameterized . This work was presented in part at 2022 58th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, Sep. 2022 (Xu and Zheng, 2022). 1. In literature, the input variables are sometimes referred to as independent/predictor variables, and the output variables are also referred to as dependent/response/target variables. 2024 Xiangxiang Xu and Lizhong Zheng. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-1202.html. Xu and Zheng as feature extractors, e.g., artificial neural networks. The features are learned via a training process, e.g., optimizing a training objective defined on the features. 2. An assembling module which uses learned features to build an inference model or a collection of inference models. The inference models are used to provide information about data. For example, when the data take the form of input-output pairs, a typical inference model provides prediction or estimation of output variables based on the input variables. The assembling module determines the relation between features and resulting models, which can also be specified implicitly, e.g., in kernel methods. Learning Data Assembling Model Feature Figure 1: Schematic diagram of a general feature-centric learning system Learning systems are commonly designed with a predetermined assembling module, which allows the whole system to be learned in an end-to-end manner. One representative example of such designs is deep neural networks. On one hand, this end-to-end characteristic makes it possible to employ large-scale and often over-parametrized neural feature extractors (Le Cun et al., 2015; Krizhevsky et al., 2017; He et al., 2016; Vaswani et al., 2017), which can effectively capture hidden structures in data. On the other hand, the choices of assembling modules and learning objectives are often empirically designed, with design heuristics varying a lot across different tasks. Such heuristic choices make the learned feature extractors hard to interpret and adapt, often viewed as black boxes. In addition, the empirical designs are typically inefficient, especially for multivariate learning problems, e.g., multimodal learning (Ngiam et al., 2011), where there can exist many potentially useful assembling modules and learning objectives to consider. To address this issue and obtain more principled designs, recent developments have adopted statistical and information-theoretical tools in designing training objectives. The design goal is to guarantee that learned features are informative, i.e., carry useful information for the inference tasks. To this end, a common practice is to incorporate information measures in learning objectives, such as mutual information (Tishby et al., 2000; Tishby and Zaslavsky, 2015) and rate distortion function (Chan et al., 2022). However, information measures might not effectively characterize the usefulness of features for learning tasks, due to the essentially different information processing natures. For example, originally introduced in characterizing the optimal rate of a communication system (Shannon, 1948), the mutual information is invariant to bijective transformations on variables as the amount of information to communicate does not change after such transformations. As a consequence, when we consider the mutual information between feature representations and the input/output variables (Tishby and Zaslavsky, 2015), features up to such bijective transformations, e.g., sigmoid or exponential mappings, are all equivalent and give the same mutual information. This is in contrast to their often highly distinct performances on learning tasks. Due to this intrinsic discrepancy, it is often impractical to learn features based solely on the information measures. As a compromise, the information measures are often applied to analyze features learned from other objectives (Tishby and Zaslavsky, 2015), or used as regularization terms in the training objective to facilitate learning (Belghazi et al., 2018), where the learned features generally depend on case-by-case design choices. Neural Feature Learning in Function Space In this work, we aim to establish a framework for learning feature representations that capture the statistical nature of data, and can be assembled to build different inference models without retraining. In particular, the features are learned to approximate and represent the statistical dependence of data, instead of solving specific inference tasks. To this end, we introduce a geometry on function spaces, coined feature geometry, which we apply to connect statistical dependence with features. This connection allows us to represent statistical dependence by corresponding operations in feature spaces. Specifically, the features are learned by approximating the statistical dependence, and the approximation error measures the amount of information carried by features. The resulting features capture the statistical dependence and thus are useful for general inference tasks. Our main contributions of this work are as follows. We establish a framework for designing learning systems that separate feature learning and feature usages, where the learned features can be assembled to build different inference models without retraining. In particular, we introduce the feature geometry, which converts feature learning problems to corresponding function-space operations on statistical dependence. The resulting optimal features capture the statistical dependence of data and can be adapted to different inference tasks. We propose a nesting technique for decomposing statistical dependence and learning the associated feature representations. The nesting technique provides a systematic approach to construct training objectives and develop learning algorithms, where we can learn optimal features by employing the state-of-the-art deep learning practices, e.g., network architecture and optimizer designs. We present the applications of this unified framework in multivariate learning problems, where we design feature learning algorithms to decompose and represent the corresponding multivariate dependence. As case studies, we consider two learning scenarios: learning for conditional inference and multimodal learning with missing modalities. We investigate the optimal features and demonstrate their relations to classical learning problems, such as maximum likelihood estimation and the maximum entropy principle (Jaynes, 1957a,b). The rest of the paper is organized as follows. In Section 2, we introduce the feature geometry, including operations on feature spaces and function representations of statistical dependence. In Section 3, we present the learning system design in a bivariate learning setting, where we demonstrate the feature learning algorithm and the design of assembling modules. In Section 4, we introduce the nesting technique, as a systematic approach to learning algorithm design in multivariate learning problems. We then demonstrate the applications of the nesting technique, where we study a conditional inference problem in Section 5, and a multimodal learning problem in Section 6. Then, we present our experimental verification for the proposed learning algorithms in Section 7, where we compare the learned features with the theoretical solutions. Finally, we summarize related works in Section 8 and provide some concluding remarks in Section 9. 2. Notations and Preliminaries For a random variable Z, we use Z to denote the corresponding alphabet, and use z to denote a specific value in Z. We use PZ to denote the probability distribution of Z. For the sake of simplicity, we present our development with finite alphabets and associated discrete random variables. The corresponding results can be readily extended to general alphabets under certain regularity conditions, which we discuss in Appendix A.1 for completeness. Xu and Zheng 2.1 Feature Geometry 2.1.1 Vector Space Given an inner product space with inner product , and its induced norm , we can define the projection and orthogonal complement as follows. Definition 1 Give a subspace W of V, we denote the projection of a vector v V onto W by Π (v; W) arg min w W v w 2 . (1) In addition, we use V W to denote the orthogonal complement of W in V, viz., V W {v V: v, w = 0 for all w W}. We use to denote the direct sum of orthogonal subspaces, i.e., V = V1 V2 indicates that V = V1 + V2 and V1 V2. Then we have the following facts. Fact 1 If V = V1 V2, then V2 = V V1. In addition, if W is a subspace of V, then V = W (V W). Fact 2 (Orthogonality Principle) Given v V and a subspace W of V, then w = Π (v; W) if and only if w W and v w V W. In addition, we have v = Π (v; W) + Π (v; V W). 2.1.2 Feature Space Given an alphabet Z, we use PZ to denote the collection of probability distributions supported on Z, and use relint(PZ) to denote the relative interior of PZ, i.e., the collection of distributions P PZ with P(z) > 0 for all z Z. We use FZ {Z R} to denote the collection of features (functions) on given Z. Specifically, we use 1Z FZ to denote the constant feature that takes value 1, i.e., 1Z(z) 1 for all z Z. We define the inner product on FZ as2 f1, f2 ER [f1(Z)f2(Z)], where R relint(PZ) is referred to as the metric distribution. This defines operations, e.g., norm and projection, on FZ. Specifically, we have the induced norm f p f, f , and the projection of f FZ onto a subspace GZ of FZ, i.e., Π (f; GZ), is defined according to Definition 1. We also use FZ {f FZ : ER [f(Z)] = 0} to denote the collection of zero-mean features on Z, which corresponds to the orthogonal complement of constant features, i.e., FZ = FZ span{1Z}. For each k 1, we use F k Z (FZ)k = {Z Rk} to denote the collection of k-dimensional features. For f F k Z, we use span{f} FZ to denote the subspace of FZ spanned by all dimensions of f, i.e., span{f} = span{f1, . . . , fk}, and use Π (f; GZ) to denote the corresponding projection on each dimension, i.e., Π (f; GZ) = [Π (f1; GZ) , . . . , Π (fk; GZ)]T G k Z (GZ)k. For f1 F k1 Z , f2 F k2 Z , we denote Λf1,f2 ER f1(Z)f T 2 (Z) Rk1 k2. Specifically, we define Λf Λf,f = ER f(Z)f T(Z) for feature f F k Z. We also use [k] to denote the set {i Z: 1 i k} with ascending order, and define f[k] F k Z as f[k] : z 7 (f1(z), . . . , fk(z))T for given f1, . . . , fk FZ. 2.1.3 Neural Feature Extractors Our development will focus on neural feature extractors, i.e., features represented by neural networks. To begin, we consider a neural network of a given architecture with m trainable parameters 2. Throughout our development, we use EQ [f(Z)] to denote the expectation of function f with respect to Q PZ, i.e., E ˆ Z Q f( ˆZ) . Specifically, we will also use E [f(Z)] to represent EPZ [f(Z)], i.e., the expectation with respect to the underlying distribution; similar conventions apply to conditional expectations. Neural Feature Learning in Function Space and k-dimensional outputs. Let Z denote the domain of input data Z, and denote the trainable parameters by3 θ Rm. Then, for each θ Rm, we can denote the associated neural feature as f(θ) F k Z. For the sake of simplicity, our discussions focus on the ideal case where the neural network has sufficient expressive power4, i.e., each feature in F k Z can be well-approximated by a large m and some θ Rm. Under this idealized assumption, optimizing over network parameters θ is equivalent to finding the optimal f F k Z. Therefore, we will omit the network parameter θ and use f to denote such neural feature extractor, which we represented as a trapezoid block5, as shown in Figure 2a. (a) Feature Extractor (b) Linear Layer (c) Feature Subspace Figure 2: Schematic representations of neural feature extractors. (a): a general feature extractor f F k Z; (b): a linear layer with weight matrix W and bias b; (c): the composition of feature extractor blocks, where each dimension of the output lies in the feature subspace span{φ}. The linear layer is an important building block in neural feature extractors, which performs an affine transformation on the input, as shown in Figure 2b. In particular, a linear layer with input dimension d and output dimension k is specified by a weight matrix W Rk d and bias vector b Rk, which outputs Wv + b for an input v Rd. We can use linear layers to construct feature subspaces, as shown in Figure 2c, where we have considered a given d-dimensional feature φ = (φ1, . . . , φd)T F d X and a linear layer without the bias term. Then, the collection of output features by varying weight matrix W is6 {Wφ: W Rk d} = spank{φ}. Note that since the generated features are restricted to a subspace, the cascaded feature extractor has restricted expressive power, which can generally affect the feature learning. We will discuss such effects in our later developments (cf. Section 3.3). 2.1.4 Joint Functions Given alphabets X, Y and a metric distribution RX,Y relint(PX Y), FX Y is composed of all joint functions of x and y. In particular, for given f FX, g FY, we use f g to denote their product ((x, y) 7 f(x) g(y)) FX Y, and refer to such functions as product functions. For each product function γ FX Y, we can find σ = γ 0, and f FX, g FY with f = g = 1, such that γ = σ (f g). (2) We refer to (2) as the standard form of product functions. 3. We use underlined lowercase letters to denote Euclidean vectors, e.g., θ, v, b, to distinguish them from scalars. 4. Such idealized network can be constructed by using the universal approximation theorem; see, e.g.,Cybenko (1989), for detailed discussions. 5. The literature sometimes use the longer (shorter) bases to indicate the larger (smaller) dimensions of input and output. However, our schematic representation does not make this distinction. 6. If we also consider the bias term b, the feature subspace will become spank{1Z, φ1, . . . , φd}, where 1Z represents the constant function Z z 7 1. Xu and Zheng In addition, for given f = (f1, . . . , fk)T F k X and g = (g1, . . . , gk)T F k Y , we denote f g Pk i=1 fi gi. For two subspaces GX and GY of FX and FY, respectively, we denote the tensor product of GX and GY as GX GY span{f g: f GX, g GY}. Note that by extending each f = (x 7 f(x)) FX to ((x, y) 7 f(x)) FX Y, FX becomes a subspace of FX Y, with the metric distribution being the marginal distribution RX of RX,Y . We then denote the orthogonal complement of FX in FX Y as FY|X FX Y FX. (3) We establish a correspondence between the distribution space PZ and the feature space FZ by the density ratio function. Definition 2 Given a metric distribution R relint(PZ), for each P PZ, we define the (centered) density ratio function ℓP;R FZ as ℓP;R(z) P(z) R(z) R(z) , for all z Z. It is easy to verify that ℓP;R has mean zero , i.e., ℓP;R FZ. We will simply refer to ℓP;R as the density ratio or likelihood ratio and use ℓP to denote ℓP;R when there is no ambiguity about the metric distribution R. In particular, given random variables X and Y with the joint distribution PX,Y , we denote the density ratio ℓPX,Y ;PXPY by i X;Y , i.e., i X;Y (x, y) = PX,Y (x, y) PX(x)PY (y) PX(x)PY (y) , for all x X, y Y. (4) We refer to i X;Y as the canonical dependence kernel (CDK) function, which connects the (X; Y ) dependence with FX Y. With the feature geometry, we can associate function-space concepts with corresponding operations on features, which we summarize as follows. Property 1 Consider the feature geometry on FX Y defined by the metric distribution RX,Y = PXPY . Then, we have f1 g1, f2 g2 = EPX [f1(X)f2(X)] EPY [g1(Y )g2(Y )] for given f1, f2 FX, g1, g2 FY. In addition, For any k 1 and f F k X, g F k Y , we have Π (f; span{1X}) = EPX [f(X)] , Π f; FX = f EPX [f(X)] , (5) i X;Y , f g = EPX,Y f T(X)g(Y ) (EPX [f(X)])T EPY [g(Y )] , (6) f g 2 = tr(ΛfΛg), (7) where Λf = EPX f(X)f T(X) , Λg = EPY g(Y )g T(Y ) . 2.1.5 Feature Geometry on Data Samples In practice, the variables of interest typically have unknown and complicated probability distributions, with only data samples available for learning. We can similarly define the feature geometry on data samples by considering the corresponding empirical distributions. To begin, given a dataset consisting of n samples of Z, represented as {zi}n i=1, we denote the corresponding empirical distribution ˆPZ PZ by i=1 1{zi=z}, for all z Z, (8) Neural Feature Learning in Function Space where 1{ } denotes the indicator function. Then, for any function f of Z, we have E ˆPZ [f(Z)] = P z Z ˆPZ(z) f(z) = 1 n Pn i=1 f(zi), which is the empirical average of f over the dataset. Specifically, given a dataset D = {(xi, yi)}n i=1 with n sample pairs of (X, Y ), the corresponding joint empirical distribution ˆPX,Y PX Y is given by ˆPX,Y (x, y) = 1 i=1 1{xi=x,yi=y}, for all x X, y Y. (9) It is easily verified that the marginal distributions of ˆPX,Y are the empirical distributions ˆPX of {xi}n i=1 and ˆPY of {yi}n i=1. Therefore, we can express the CDK function associated with the empirical distribution ˆPX,Y as [cf. (4)] ˆi X;Y (x, y) = ˆPX,Y (x, y) ˆPX(x) ˆPY (y) ˆPX(x) ˆPY (y) , for all x X, y Y. (10) As a result, given the dataset D, we can define the associated feature geometry on FX Y by using the metric distribution RX,Y = ˆPX ˆPY . From Property 1, we can evaluate the induced geometric quantities over data samples in D, by replacing the distributions by the corresponding empirical distributions, and applying the empirical CDK function ˆi X;Y in (6). Such characteristic allows us to discuss theoretical properties and algorithmic implementations in a unified notation. In our later developments, we will state both theoretical results and algorithms designs using the same notation of distribution, say, PX,Y . This PX,Y corresponds to the underlying distribution in theoretical analyses, and represents the corresponding empirical distribution in algorithm implementations. Remark 3 Note that for finite Z, FZ is a space with a finite dimension |Z|. Therefore, we can choose a basis of FZ and represent each feature as corresponding coefficient vectors in Euclidean space R|Z|. Similarly, we can represent operators on function spaces as matrices. Such conventions have been introduced and adopted in previous works, e.g., (Huang et al., 2024; Xu et al., 2022), which we summarize in Appendix A.2 for completeness and comparisons. 2.2 Modal Decomposition We then investigate a representation of joint functions in FX Y by using features in FX and FY. This representation will be our basic tool to connect statistical dependence with feature spaces. Throughout this section, we set the metric distribution of FX Y to the product form7, i.e., RX,Y = RXRY . This induced metric distributions on FX and FY are RX and RY , respectively. 2.2.1 Definitions and Properties We use the operator ζ on FX Y to denote the optimal rank-1 approximation, i.e., ζ(γ) arg min γ : γ =f g f FX,g FY γ γ , for all γ FX Y. (11) In addition, for all k 1, we define the operator ζk as ζ1 ζ and 7. Under this choice, the norm on FX Y corresponds to the Frobenius (Hilbert Schmidt) norm. Xu and Zheng which we refer to as the k-th mode of γ. Then, we use i=1 ζi(γ) and rk(γ) γ ζ k(γ) (13) to denote the superposition of the top k modes and the corresponding remainder, respectively. Remark 4 If the minimization problem (11) has multiple solutions, the corresponding ζ(γ) (and ζk(γ)) might not be unique. In this case, ζ1(γ), . . . ζk(γ) represent one of such solutions obtained from the sequential rank-1 approximations. For each γ FX Y, we define the rank of γ as rank(γ) inf{k 0: rk(γ) = 0}. Suppose K rank(γ), then from the definition (13), we have γ = ζ K(γ) + r K(γ) = ζ K(γ), i.e., i=1 ζi(γ). (14) For each i [K], let us denote the standard form [cf. (2)] of ζi(γ) by ζi(γ) = σi (f i g i ). Therefore, from (14) we obtain i=1 σi f i (x) g i (y), for all x X, y Y, (15) where f i = g i = 1 and σi = ζi(γ) . We refer to (15) as the modal decomposition of γ, which is a special case of Schmidt decomposition (Schmidt, 1907; Ekert and Knight, 1995), or singular value decomposition (SVD) in function space. We list several useful characterizations as follows. Fact 3 Let K rank(γ), then σ1 σ2 σK > 0. In addition, for all i, j = 1, . . . , K, we have8 f i , f j = g i , g j = δij, and ζi(γ), ζj(γ) = 0, if i < j, ζi(γ), rj(γ) = 0, if i j, where rj( ) is as defined in (13). Fact 4 For all i [K], we have (f i , g i ) = arg max (f,g) Di γ, f g where we have recursively defined each Di as Di = {(f, g) FX FY : f = g = 1 and f, f j = g, g j = 0 for all j [i 1]}. Fact 5 (Eckart Young Mirsky theorem, Eckart and Young 1936) For all γ FX Y and k 1, we have ζ k(γ) = arg min γ : rank(γ ) k γ γ = arg min γ : γ =f g, f F k X, g F k Y Therefore, we refer to ζ k(γ) as the rank-k approximation of γ, and the remainder rk(γ) represents the approximation error. 8. We adopt the Kronecker delta notation ( 0 if i = j, 1 if i = j. Neural Feature Learning in Function Space 2.2.2 Constrained Modal Decomposition We then introduce the constrained modal decomposition, which provides an effective implementation of projection operators. For given subspace GX of FX and subspace GY of FY, we define ζ(γ|GX, GY) arg min γ : γ =f g f GX,g GY ζk(γ|GX, GY) ζ i=1 ζi(γ|GX, GY) , for all k 1, (17) where ζ1(γ|GX, GY) ζ(γ|GX, GY). Similarly, we denote ζ k(γ|GX, GY) Pk i=1 ζi(γ|GX, GY). We can extend the properties of modal decomposition to the constrained case. In particular, we have the following extension of Fact 5, of which a proof is provided in Appendix C.1. Proposition 5 Suppose GX and GY are subspace of FX and FY, respectively. Then, for all γ FX Y and k 1, we have ζk(γ|GX, GY) = ζk(Π (γ; GX GY)), and ζ k(γ|GX, GY) = ζ k(Π (γ; GX GY)) = arg min γ : γ =f g, f G k X, g G k Y where we have defined G k X (GX)k and G k Y (GY)k. Therefore, we can implement projection operators by solving an equivalent constrained low-rank approximation (or modal decomposition) problem. 2.3 Statistical Dependence and Induced Features Given (X, Y ) PX,Y , we consider the space FX Y with the metric distribution RX,Y = PXPY . Then, we can characterize the statistical dependence between X and Y by the CDK function i X;Y , as defined in (4). We can characterize the energy of i X;Y as i X;Y 2, also known as the mean square contingency (Pearson, 1904). Specifically, it is easy to verify that i X;Y = 0 if and only if X and Y are independent. Suppose rank(i X;Y ) = K and let the modal decomposition be [cf. (15)] i=1 ζi(i X;Y ) = i=1 σi (f i g i ), (19) where for each i [K], ζi(i X;Y ) = σi (f i g i ) is the standard form of i-th rank-one dependence mode, with strength characterized by σi = ζi(i X;Y ) . Note that since different modes are orthogonal (cf. Fact 3), we have i X;Y 2 = PK i=1 σ2 i . From σ1 σK, these modes are ordered by their contributions to the joint dependence. In particular, the features f i s, g i s are the maximally correlated features in FX, FY, known as Hirschfeld Gebelein R enyi (HGR) maximal correlation functions (Hirschfeld, 1935; Gebelein, 1941; R enyi, 1959). To see this, let us denote the covariance for given f FX, g FY as cov(f, g) EPX,Y [f(X)g(Y )] EPXPY [f(X)g(Y )] . (20) From Fact 4 and the fact cov(f, g) = i X;Y , f g , we obtain the following corollary. Xu and Zheng Corollary 6 (HGR Maximal Correlation Functions) For each i = 1, . . . , K, we have σi = cov(f i , g i ) = EPX,Y [f i (X)g i (Y )] and (f i , g i ) = arg max (f,g) Di cov(fi, gi), where we have recursively defined each Di as Di = {(f, g) FX FY : f = g = 1 and f, f j = g, g j = 0 for all j [i 1]}. We can further extend the results to the constrained modal decomposition (cf. Section 2.2.2) of i X;Y . Specifically, given subspaces GX and GY of FX and FY, respectively, let ζi(i X;Y |GX, GY) = ˆσi ( ˆf i ˆg i ), i 1, (21) be corresponding standard form representations. Then, we can interpret ˆσi, ˆf i , ˆg i as the solution to a constrained maximal correlation problem, formalized as the following extension of Corollary 6. A proof is provided in Appendix C.2. Proposition 7 Given subspaces GX and GY of FX and FY, respectively, for each i 1, we have ˆσi = cov( ˆf i , ˆg i ) = EPX,Y ˆf i (X)ˆg i (Y ) , ( ˆf i , ˆg i ) = arg max (f,g) ˆDi cov(f, g), where cov denotes the covariance [cf. (20)], and where we have recursively defined each ˆDi as ˆDi = {(f, g) GX GY : f = g = 1 and f, ˆf j = g, ˆg j = 0 for all j [i 1]}. In particular, we can interpret CCA (Canonical Correlation Analysis) as the modal decomposition constrained to linear functions. Example 1 (Canonical Correlation Analysis) Suppose X and Y are vector spaces, and GX, GY are the space of all linear functions defined on X, Y, respectively. Then, Proposition 7 gives solutions to CCA (Canonical Correlation Analysis) (Hotelling, 1936), where ˆσi s are canonical correlations. Weak Dependence and Local Geometric Analyses In the particular case where the statistical dependence between X and Y is weak, we can establish further connections between feature geometry and conventional information measures. Such analyses have been extensively studied in Huang et al. (2024), referred to as the local geometric analysis, formalized as follows. Definition 8 (ϵ-Dependence) Given (X, Y ) PX,Y , X and Y are ϵ-dependent if i X;Y = O(ϵ). For such weakly dependent variables, we can characterize their mutual information as follows. Lemma 9 (Huang et al. 2024, Lemma 4.11) If X and Y are ϵ-dependent, then we have the mutual information I(X; Y ) = 1 2 i X;Y 2 + o(ϵ2). Therefore, from (19) we can also decompose the mutual information into different modes: I(X; Y ) = 1 2 i X;Y 2 + o(ϵ2) = 1 2 PK i=1 σ2 i + o(ϵ2). 3. Dependence Approximation and Feature Learning In this section, we demonstrate the learning system design with feature geometry in a learning setting. In particular, we consider optimal feature representations of the statistical dependence, and present learning such features from data and assembling them to build inference models. To begin, let X and Y denote the random variables of interest, with the joint distribution PX,Y . We characterize the statistical dependence between X and Y as the CDK function [cf. (4)] i X;Y FX Y. In our development, we consider the feature geometry on FX Y with respect to the metric distribution RX,Y = PXPY . We also assume i X;Y has the modal decomposition (19). Neural Feature Learning in Function Space 3.1 Low Rank Approximation of Statistical Dependence In learning applications, the joint distribution PX,Y is typically unknown with enormous complexity, making direct computation or estimation of i X;Y infeasible. To tackle this problem, we consider the representation of i X;Y using features of X and Y , and develop the feature learning algorithms that can be effectively implemented on (X, Y ) samples. Specifically, for given k 1 and k-dimensional features f F k X and g F k Y , we consider the approximation of i X;Y by the rank-k joint function f g = Pk i=1 fi gi. With this formulation, we can convert the computation of the rank-k approximation ζ k(i X;Y ) to an optimization problem, where the objective is the approximation error i X:Y f g , and where the optimization variables are k-dimensional features f and g. Then, ζ k(i X;Y ) can be represented by the resulting optimal features in a factorized form. However, we cannot directly compute the error i X:Y f g for given f and g, due to the unknown i X:Y . To address this issue, we introduce the H-score, proposed in (Xu and Huang, 2020; Xu et al., 2022). Definition 10 Given k 1 and f F k X, g F k Y , the H-score H (f, g) is defined as i X;Y 2 i X;Y f g 2 (22) = E f T(X)g(Y ) (E [f(X)])T E [g(Y )] 1 2 tr (ΛfΛg) , (23) where Λf = E f(X)f T(X) , Λg = E g(Y )g T(Y ) . The H-score measures the goodness of the approximation, with a larger H-score value indicating a smaller approximation error. In particular, for k-dimensional feature inputs, the maximum value of H-score gives the total energy of top-k dependence modes, achieved by the optimal rank-k approximation. Formally, we have the following property from Fact 5. . Property 2 Given k 1, let σi = ζi(i X;Y ) for i [k]. Then, for all f F k X and g F k Y , 2 ζ k(i X;Y ) 2 = 1 i=1 σ2 i , (24) where the inequality holds with equality if and only if f g = ζ k(i X;Y ). In practice, for given features f and g, we can efficiently compute the H-score H (f, g) from data samples, by evaluating corresponding empirical averages in (23). Since H (f, g) is differentiable with respect to f and g, we can use it as the training objective for learning the low-rank approximation of i X;Y , where we use neural networks to parameterize f and g and optimize their parameters by batch (minibatch) gradient descent. Suppose the networks have sufficient expressive power, then the optimal solution gives the desired low-rank approximation ζ k(i X;Y ). Remark 11 It is worth mentioning that the optimal features learned from finite data samples correspond to the modal decomposition of the associated empirical distribution (cf. Section 2.1.5), which is generally different from the underlying distribution. As a result, the learned features will deviate from the theoretical values of features; see, e.g., Huang and Xu (2020); Makur et al. (2020) for detailed discussions on the sample complexity of learning such features. Xu and Zheng Note that in this particular bivariate setting, the roles of X and Y (and the learned features f and g) are symmetric. Moreover, we learn the features by directly factorizing the statistical dependence between X and Y , instead of solving a specific inference task, e.g., predicting Y based on X, or vice versa. Nevertheless, we can readily solve these inference tasks by simply assembling the learned features, as we will demonstrate next. 3.2 Feature Assembling and Inference Models With features f F k X, g F k Y learned from maximizing the H-score H (f, g), we then discuss the construction of different inference models by assembling these features. We first consider the case where k rank(i X;Y ) and we have learned f g = i X;Y (cf. Property 2). Then, we have the following proposition. A proof is provided in Appendix C.3. Proposition 12 Suppose f g = i X;Y . Then, we have i X;Y 2 = tr(ΛfΛg) and PY |X(y|x) = PY (y) 1 + f T(x)g(y) . (25) In addition, for any d-dimensional function ψ F d Y , we have E [ψ(Y )|X = x] = E [ψ(Y )] + Λψ,gf(x), (26) where Λψ,g = E ψ(Y )g T(Y ) . Therefore, we can compute the strength of (X; Y ) dependence, i.e., i X;Y from the features f and g. In addition, the posterior distribution (25) and conditional expectation (26) are useful for supervised learning tasks. Specifically, we consider the case where X and Y are the input variable and target variable, respectively. The Y represents the categorical label in classification tasks, or the target to estimate in regression tasks. In classification tasks, we can compute the posterior distribution PY |X of the label Y from (25). The corresponding corresponding MAP (maximum a posteriori) estimation is ˆy MAP(x) = arg max y Y PY |X(y|x) = arg max y Y PY (y) 1 + f T(x)g(y) , (27) where PY can be obtained from training set. This approach is also referred to as the maximal correlation regression (MCR) (Xu and Huang, 2020). Similarly, the maximum likelihood estimation (MLE) is given by ˆy MLE(x) = arg max y Y PX|Y (x|y) = arg max y Y f T(x)g(y). (28) If the target variable Y is continuous, it is often of interest to estimate Y , or more generally, some transformation ψ of Y . Then, the MMSE (minimum mean square error) estimation of ψ(Y ) based on X = x is the conditional expectation E [ψ(Y )|X = x]. From (26), we can efficiently compute the conditional expectation, where E [ψ(Y )] and Λψ,g = E ψ(Y )g T(Y ) can be evaluated from the training dataset by taking the corresponding empirical averages. Therefore, we obtain the model for estimating ψ(Y ) for any given ψ, by simply assembling the learned features without retraining. In practice, it can happen that feature dimension k < rank(i X;Y ), due to a potentially large rank(i X;Y ). In such case, the best approximation of i X;Y would be the rank-k approximation ζ k(i X;Y ), and we can establish a similar result as follows. A proof is provided in Appendix C.4. Neural Feature Learning in Function Space Proposition 13 Suppose f g = ζ k(i X;Y ) for k 1. Then, for all d-dimensional function ψ spand{1Y, g 1, . . . , g k}, we have E [ψ(Y )|X = x] = E [ψ(Y )]+Λψ,gf(x), where 1Y is the constant function (y 7 1) FY, and where for each i [k], g i is obtained from the standard form of ζi(i X;Y ): σi(f i g i ) = ζi(i X;Y ). 3.3 Constrained Dependence Approximation We can readily extend the above analysis to the constrained low-rank approximation problem. Specifically, we consider the constrained rank-k approximation [cf. (18)] ζ k(i X;Y |GX, GY) for k 1, where GX and GY are subspaces of FX and FY, respectively. Analogous to Property 2, when we restrict f G k X and g G k Y , the H-score H (f, g) is maximized if and only if f g = ζ k(i X;Y |GX, GY). H Wx X Y φ ψ Wy f g Figure 3: Features f, g as the output of linear layers. The linear layers are represented as triangle modules, with inputs φ, ψ, and weight matrices Wx, Wy, respectively. As an application, we can characterize the effects of the restricted expressive power on feature learning. To begin, we consider the maximization of H-score H (f, g), where features f and g are k-dimensional outputs of neural networks. In particular, we assume the last layers of the networks are linear layers, which is a common network architecture design in practice. The overall network architecture is shown in Figure 3, where we express f as the composition of feature extractor φ F dx X and the last linear layer with weight matrix Wx Rk dx. Similarly, we represent g as the composition of ψ F dy Y and the linear layer with weight Wy Rk dy. Suppose we have trained the weights Wx, Wy and the parameters in φ, ψ to maximize the Hscore H (f, g), and the weights Wx and Wy have converged to their optimal values with respect to φ and ψ. Note that for any given φ, ψ, f = Wxφ takes values from the set {Wxφ: Wx Rk dx} = spank{φ}, and, similarly, g = Wyψ takes values from spank{ψ}. Therefore, the optimal (f, g) corresponds to the solution of a constrained low-rank approximation problem, and we have f g = ζ k(i X;Y | span{φ}, span{ψ}). In addition, from Proposition 5 and the orthogonality principle, we can express the approximation error as i X;Y f g 2 = i X;Y i X;Y + i X;Y ζ k(i X;Y ) 2 = i X;Y i X;Y 2 + rk(i X;Y ) 2, (29) where i X;Y Π (i X;Y ; span{φ} span{ψ}). Note that (29) decomposes the overall approximation into two terms, where the first term characterizes the effects of insufficient expressive power of φ, ψ, and the second term characterizes the impacts of feature dimension k. 3.4 Relationship to Classification DNNs We conclude this section by discussing a relation between the dependence approximation framework and deep neural networks, studied in Xu et al. (2022). We consider a classification task where X and Y denote the input data and the target label to predict, respectively. Then, we can interpret Xu and Zheng the log-likelihood function of DNN as an approximation of the H-score, and thus DNN also learns strongest modes of (X; Y ) dependence. X f P (f,g,b) Y |X Figure 4: A classification DNN for predicting label Y based on the input X. All layers before the classification layer are represented as feature extractor f. The weight and bias associated with each class Y = y are denoted by g(y) and b(y), respectively, which give weight matrix GT and bias vector b with G = [g(1), . . . , g(|Y|)], b = [b(1), . . . , g(|Y)]T. The softmax module outputs a posterior probability, parameterized by f, g and b. To begin, let {(xi, yi)}n i=1 denote the training data, with empirical distribution PX,Y as defined in (9). We depict the architecture of typical classification DNN in Figure 4, where we abstract all layers before classification layer as a k-dimensional feature extractor f F k X. The feature f is then processed by a classification layer with weight matrix GT and the bias vector b, and activated by the softmax function9. Without loss of generality, we assume Y = {1, . . . , |Y|}, then we can represent G and b as G [g(1), . . . , g(|Y|)] Rk |Y|, b [b(1), . . . , b(|Y|)]T R|Y|, (30) where g(y) Rk and b(y) R denote the weight and bias associated with each class Y = y, respectively. Then, the softmax output of (GTf(x) + b) gives a parameterized posterior P (f,g,b) Y |X (y|x) exp(f(x) g(y) + b(y)) P y Y exp(f(x) g(y ) + b(y )). (31) The network parameters are trained to maximize10 the resulting log-likelihood11 function L(f, g, b) 1 i=1 log P (f,g,b) Y |X (yi|xi) = E( ˆ X, ˆY ) PX,Y h log P (f,g,b) Y |X ( ˆY | ˆX) i . (32) We further define L(f, g) max b FY L(f, g, b), by setting the bias b to its optimal value with respect to given f and g. It can be verified that L(f, g) depends only on the centered versions of f and g, formalized as follows. A proof is provided in Appendix C.5. Property 3 For all k 1 and f F k X, g F k Y , we have L(f, g) = L( f, g), where we have defined f F k X, g F k Y as f Π f; FX , g Π g; FY , i.e., f(x) = f(x) E [f(X)], g(y) = g(y) E [g(Y )], for all x X, y Y. 9. The softmax function is defined such that, for all k > 1 and each v = (v1, . . . , vk)T Rk, we have softmax(v) Rk, with each i-th entry being [softmax(v)]i exp(vi) Pk j=1 exp(vj) , i [k]. 10. This is equivalent to minimizing L(f, g, b), i.e., the log loss (cross entropy loss). 11. Throughout our development, all logarithms are base e, i.e., natural. Neural Feature Learning in Function Space Therefore, it is without loss of generalities to restrict our discussions to zero-mean f and g. Specifically, we can verify that for the trivial choice of feature f = 0, the resulting likelihood function is L(0, g) = L(0, 0) = H(Y ), achieved when the posterior distribution satisfies P (0,g,b) Y |X = PY , where H( ) denotes the Shannon entropy. In general, we have the following characterization of L(f, g), which extends (Xu et al., 2022, Theorem 4). A proof is provided in Appendix C.6. Proposition 14 Suppose X and Y are ϵ-dependent. For all k 1, and f F k X, g F k Y , if L(f, g) L(0, 0) = H(Y ), then we have f g = O(ϵ), and L(f, g) = L(0, 0) + 1 2 i X;Y 2 i X;Y f g 2 | {z } =H (f,g) +o(ϵ2) (33) which is maximized if and only if f g = ζ k(i X,Y ) + o(ϵ). From Proposition 14, the H-score H (f, g) coincides with likelihood function L(f, g) in the local regime. For a fully expressive feature extractor f of dimension k, the optimal feature f and weight matrix GT are approximating the rank-k approximation of (X; Y ) dependence. In this sense, the weight matrix GT in classification DNN essentially characterizes a feature of the label Y , with a role symmetric to feature extractor f. However, unlike the H-score implementation, the classification DNN is restricted to categorical Y to make the softmax function (31) computable. Remark 15 For general X, Y , the optimal features f, g that optimize the likelihood function (32) can deviate from the low-rank approximation characterization in Proposition 14 and have different behaviors. See Xu et al. (2018) for detailed discussions. 4. Nesting Technique for Dependence Decomposition In multivariate learning applications, it is often difficult to summarize the statistical dependence as some bivariate dependence. Instead, the statistical dependence of interest is typically only a component decomposed from the original dependence. In this section, we introduce a nesting technique, which allows us to implement such dependence decomposition operations by training corresponding neural feature extractors. For the ease of presentation, we adopt the bivariate setting introduced previously and consider the feature geometry on FX Y with metric distribution PXPY . We will discuss the multivariate extensions in later sections. 4.1 Nesting Configuration and Nested H-score The nesting technique is a systematic approach to learn features representing projected dependence components or their modal decomposition. In particular, for a given dependence component of interest, we can construct corresponding training objective for learning the dependence component. The resulting training objective is an aggregation of different H-scores, where the inputs to these H-scores are features forming a nested structure. We refer to such functions as the nested H-scores. To specify a nested H-score, we introduce its configuration, referred to as the nesting configuration, defined as follows. Definition 16 Given X, Y and k l 1, we define an l-level nesting configuration for kdimensional features as the tuple n (d1, . . . , dl); G(1) X , . . . , G(l) X ; GY o , where (d1, , dl) is a sequence with di > 0 and Pl i=1 di = k; Xu and Zheng G(1) X , . . . , G(l) X is an increasing sequence of l subspaces of FX: G(1) X G(l) X ; GY is a subspace of FY. Nested H-score Given a nesting configuration C = n (d1, . . . , dl); G(1) X , . . . , G(l) X ; GY o for kdimensional features, the associated nested H-score is a function of k-dimensional feature pair f and g, which we denote by H (f, g; C), specified as follows. To begin, let us define ki Pi j=1 dj for each 0 i l, representing the total dimension up to i-th level. Then, we define the domain of H (f, g; C), denoted by dom(C), as dom(C) n (f, g): f F k X, g G k Y , fj G(i) X , for all ki 1 < j ki o . (34) Then, for (f, g) dom(C) and each i [l], we obtain the H-score H (f[ki], g[ki]) by taking the first ki dimensions of f, g. We define the nested H-score H (f, g; C) by taking the sum of these l H-scores, H (f, g; C) i=1 H (f[ki], g[ki]), (f, g) dom(C). (35) From (35), the nested H-score aggregates different H-scores with nested input features. The nested structure of features is specified by the increasing sequence of dimension indices: [k1] [kl] = [k], determined by the sequence (d1, . . . , dl). The domain of features is specified by subspaces in the configuration. When G(i) X = GX for all i [l], we can simply write the configuration as C = {(d1, . . . , dl); GX; GY} without ambiguity. In particular, we can represent the original H-score for k-dimensional input features as a nested H-score configured by {k; FX; FY} . Remark 17 Note that the nested H-score (35) is obtained by using a sum function to aggregate different H-score terms H (f[ki], g[ki]), i = 1, . . . , l. We shall comment that the choice of such aggregation functions is not unique. Generally, for an l-level nesting configuration, we can apply any differentiable Γ: Rl R as an aggregation function if Γ is strictly increasing in each argument. The aggregated result Γ(H (f[k1], g[k1]), . . . , H (f[kl], g[kl])) defines a nested H-score that satisfies the same collection of properties. For the ease of presentation, we adopt the sum form (35) throughout our development, but also provide general discussions in Appendix B for completeness. Remark 18 By symmetry, we can also define the configuration n (d1, . . . , dl); GX; G(1) Y , . . . , G(l) Y o and the associated nested H-score, for subspaces GX of FX and G(1) Y G(l) Y of FY. Refinements of Nesting Configuration Given a nesting configuration for k-dimensional features C = n (d1, . . . , dl); G(1) X , . . . , G(l) X ; GY o , the sequence (d1, . . . , dl) defines a partition that separates the k dimensions into l different groups. By refining such partition, we can construct new configurations with higher levels, which we refer to as refined configurations. In particular, the finest refinement corresponds to the partition where each group has only one dimension. We use C to denote the finest refinement of C, given by C n (1)k; G(1) X d1, . . . , G(l) X dl ; GY o , (36) where we have used (1)k to denote the all-one sequence of length k, and where G(1) X d1, . . . , G(l) X dl represents the length-k sequence starting with d1 terms of G(1) X , followed by d2 terms of G(2) X , Neural Feature Learning in Function Space ++ ++ f[2] g[2] ++ ++ f[3] g[3] ++ ++ f[k] g[k] f[k 1] g[k 1] i=1 H (f[i], g[i]) Figure 5: Nesting technique for modal decomposition: the nested H-score is computed with a nested architecture, where ++ denotes the concatenation operation of two features. up to dl terms of G(l) X . From (34), such refinements do not change the domain, and we have dom(C ) = dom(C). The corresponding nested H-score is H (f, g; C ) = i=1 H (f[i], g[i]), (f, g) dom(C). (37) 4.2 Nesting Technique for Modal Decomposition We then demonstrate the application of nesting technique in learning modal decomposition. Given k-dimensional features f, g, we consider the nesting configuration {(1)k; FX; FY}, which can also be obtained from the original H-score by the refinement (36): {(1)k; FX; FY} = {k; FX; FY} . The corresponding nested H-score is the sum of k H-scores: H (f, g; {(1)k; FX; FY}) = i=1 H (f[i], g[i]). (38) Note that from Property 2, for each i [k], the H-score H (f[i], g[i]) is maximized if and only if f[i] g[i] = ζ i(i X;Y ). Therefore, all k terms of H-scores are maximized simultaneously, if and only if we have f[i] g[i] = ζ i(i X;Y ) for all i [k]. By definition, this is also equivalent to fi gi = ζi(i X;Y ), i [k]. (39) Hence, the nested H-score H (f, g; {(1)k; FX; FY}) is maximized if and only if we have (39), which gives the top k modes of (X; Y ) dependence. In practice, we can compute the nested H-score by using a nested architecture as shown in Figure 5, where we have used the ++ symbol to indicate Xu and Zheng the concatenation of two vectors, i.e., u ++ v [ u v ] for two column vectors u, v. By maximizing the nested H-score, we can use (39) to retrieve each i-th dependence mode from corresponding feature pair (fi, gi), for i [k]. Compared with the features learned in Section 3, the nesting technique provides several new applications. First, from Fact 3, the learned features f and g have orthogonal dimensions, i.e., different dimensions are uncorrelated. In addition, from (39), we can compute the energy contained in each i-th dependence mode, via ζi(i X;Y ) 2 = fi gi 2 = E f2 i (X) E g2 i (Y ) , for i [k]. This provides a spectrum of (X; Y ) dependence and characterizes the usefulness or contribution of each dimension. Similarly, we can retrieve top k maximal correlation functions f i , g i and coefficients σi, by using the relations [cf. (19) and Corollary 6] E f2 i (X) , g i = gi q E g2 i (Y ) , σi = q E f2 i (X) E g2 i (Y ) , i [k]. (40a) Nested Optimality From (39), for any d k, we can represent the optimal rank-d approximation of i X;Y as ζ d(i X;Y ) = f[d] g[d], which corresponds to top d-dimensions of learned features. We refer to this property as nested optimality: the learned features give a collection of optimal solutions for different dimensions, with a nested structure. This nested optimality provides a convenient and principled feature selection on the fly: we can obtain the optimal selection of d feature pairs by simply taking the top d dimensions of learned features. It is worth noting that, this optimal selection indeed gives the optimal d-dimensional feature extraction, corresponding to the most significant d modes of the (X; Y ) dependence. This equivalence is hardly guaranteed in typical feature selection approaches. In practice, we can choose the dimension d based on the dependence spectrum, such that selected features capture sufficient amount of dependence information, and then take f[d], g[d] for further processing. We can readily extend the discussion to constrained modal decomposition problems. Let GX and GY be subspaces of FX and FY, respectively. Then, the nested H-score H (f, g; {(1)k; GX; GY}) defined for k-dimensional features f G k X, g G k Y , is maximized if and only if fi gi = ζi(i X;Y |GX, GY), for all i [k]. (41) From (41), we can establish a similar nested optimality in the constrained case. In particular, when GX and GY correspond to the collection of features that can be expressed by neural feature extractors, the result also characterizes the effects of restricted expressive power of neural networks (cf. Section 3.3). Specifically, from (41), when we use feature extractors with restricted expressive power, we can still guarantee the learned features have uncorrelated dimensions. 4.3 Nesting Technique for Projection With the nesting technique, we can also operate projections of statistical dependence in feature spaces. Such operations are the basis of multivariate dependence decomposition, which we will detail in the following sections. To begin, let GX denote a subspace of FX. Then, from FX = GX (FX GX), we obtain an orthogonal decomposition of function space FX Y = FX FY = (GX FY) ((FX GX) FY). (42) Therefore, by projecting the statistical dependence i X;Y to these function spaces, we obtain its orthogonal decomposition [cf. Fact 2] i X;Y = Π (i X;Y ; GX FY) + Π (i X;Y ; (FX GX) FY) . (43) Neural Feature Learning in Function Space In particular, the first term Π (i X;Y ; GX FY) characterizes the dependence component aligned with the subspace GX, and the second term represents the component orthogonal to GX. For convenience, we denote these two dependence components by π(i X;Y ) and π (i X;Y ), respectively, and demonstrate the geometry of the decomposition in Figure 6. i X;Y π (i X;Y ) Figure 6: Orthogonal decomposition of the CDK function i X;Y : π(i X;Y ) denotes the projection onto the plane GX FY, and π (i X;Y ) denotes the residual, orthogonal to the plane. In general, the information carried by decomposed dependence components depends on the choices of subspace GX, which varies in different learning settings. In spite of such differences, we can learn the decomposition with a unified procedure, which we demonstrate as follows. To begin, we consider the feature representations of the dependence components. For example, by applying the rank-k approximation on the orthogonal component π (i X;Y ), we obtain ζ k(π (i X;Y )) = ζ k(Π (i X;Y ; (FX GX) FY)) = ζ k(i X;Y |FX GX, FY), which can be represented as a pair of k-dimensional features. To learn such feature representations, we introduce the two-level nesting configuration Cπ {( k, k); (GX, FX); FY} (44) for some feature dimensions k, k 1. The corresponding nested H-score is = H ( f, g) + H f f defined on the domain [cf. (34)] dom(Cπ) = f f : f G k X, g F k Y , f F k X, g F k Y where for convenience, we explicitly express the first-level features as f, g, both of dimension k. We can use a nested network architecture to compute the nested H-score (45), as shown in Figure 7. To see the roles of the two H-score terms in (45), note that if we maximize only the first term H ( f, g) of the nested H-score over the domain (46), we will obtain the solution to a constrained dependence approximation problem (cf. Section 3.3): f g = ζ k(i X;Y |GX, FY). Specifically, if k is sufficiently large, we would get f g = Π (i X;Y ; GX FY) = π(i X;Y ), which gives the aligned component. With such f and g, we can express the second H-score term as 2 h i X;Y 2 i X;Y f g | {z } =π (i X;Y ) Xu and Zheng Figure 7: Two-level Nested Network With Features f f Therefore, if we maximize the second H-score over only f and g, we would get the orthogonal dependence component: f g = ζ k(π (i X;Y )). This gives a two-phase training strategy for computing the decomposition (43). In contrast, the nested H-score (45) provides a single training objective to obtain both dependence components simultaneously. We formalize the result as the following theorem, of which a proof is provided in Appendix C.7. Theorem 19 Given k rank(Π (i X;Y ; GX FY)), H f f is maximized if and only if f g = Π (i X;Y ; GX FY) , (47a) f g = ζ k(i X;Y |FX GX, FY). (47b) We can further consider the modal decomposition of dependence components, to obtain features with nested optimality. To learn such features, it suffices to consider the refined configuration C π = n (1) k+k; (G k X, F k X); FY o , which we formalize as follows. A proof is provided in Appendix C.8. Theorem 20 Given k rank(Π (i X;Y ; GX FY)), H f f is maximized if and only if fi gi = ζi(i X;Y |GX, FY) for all i [ k], (48a) fi gi = ζi(i X;Y |FX GX, FY) for all i [k]. (48b) 4.4 Learning With Orthogonality Constraints We conclude this section by discussing an application of the nesting technique, where the goal is to learn optimal features uncorrelated to some given features. Such settings arise in many learning scenarios, where the given features correspond to some prior knowledge. For instance, when the given features are extracted from pre-trained models, we can use this formulation to avoid information overlapping and extract only the part not carried by these models. Another example is to extract privacy-preserving features, where we use the given features to indicate the sensitive information. In feature geometry, the uncorrelatedness conditions correspond to orthogonality constraints, and we can formalize the learning problem as follows. Given a k-dimensional feature φ F k X, our goal is to learn k-dimensional feature f from X for inferring Y , with the constraint that Neural Feature Learning in Function Space span{f} span{φ}, i.e., E [fi(X)φj(X)] = fi, φj = 0, for all i [k], j [ k]. We therefore consider the constrained low-rank approximation problem minimize f F k X,g F k Y : span{f} span{φ} i X;Y f g . (49) We can demonstrate that the solution to (49) corresponds to learning the decomposition (43), with the choice GX = span{φ}. To see this, we rewrite (49) as i X;Y = πφ(i X;Y ) + (i X;Y πφ(i X;Y )) . (50) where we have denoted the aligned component πφ(i X;Y ) Π (i X;Y ; span{φ} FY) . In addition, note that the orthogonality constraint of (49) is f (FX span{φ})k , g F k Y . Therefore, it follows from Proposition 5 that the solution to (49) is f g = ζ k(i X;Y |FX span{φ}, FY) = ζ k(Π (i X;Y ; (FX span{φ}) FY)) = ζ k (i X;Y πφ(i X;Y )) , (51) where to obtain the last equality we have used the orthogonal decomposition (50), as well as the fact that (FX span{φ}) FY = FX Y (span{φ} FY). Remark 21 From the decomposition (50), we can characterize the amount of dependence information captured by feature φ, as the energy πφ(i X;Y ) 2. This quantity (with a 1/2 scaling factor) is also referred to as the single-sided H-score (Xu and Huang, 2020; Xu et al., 2022) of φ, due to the connection: max g F k Y H (φ, g) = 1 2 πφ(i X;Y ) 2. Figure 8: Nesting technique for learning features orthogonal to feature φ, where the φ block is frozen during learning. The feature φ can be given either in its analytical expressions, or as a pretrained neural network. To learn the features (51), we can apply the nesting technique and maximize the nested H-score configured by Cπ with GX = span{φ}. Specifically, from Theorem 19, f = φ is already in the optimal solution set. Therefore, we can fix f to f = φ, and optimize f=φ = H (φ, g) + H φ f Xu and Zheng over g F k Y , f F k X, and g F k Y . We can compute the objective (52) by the nested network structure as shown in Figure 8. It is also worth noting that from Proposition 14, we can also interpret the solution to (49) as the features extracted by classification DNNs subject to the same orthogonality constraints. However, compared with the H-score optimization, putting such equality constraints in DNN training typically requires non-trivial implementation. 5. Learning With Side Information In this section, we study a multivariate learning problem involving external knowledge and demonstrate learning algorithm design based on the nesting technique. Specifically, we consider the problem of learning features from X to infer Y , and assume some external knowledge S is available for the inference. We refer to S as the side information, which corresponds to extra data sources for facilitating the inference. In particular, we consider the setting where we cannot obtain (X, S) joint pair during the information processing, e.g., X and S are collected and processed by different agents in a distributed system. Otherwise, we can apply the bivariate dependence learning framework, by treating the (X, S) pair as a new variable and directly learn features to predict Y . Feature Extractor X Inference ˆY Figure 9: Learning Setting With Side Information S We depict this learning setting in Figure 9, where the inference is based on the both features extracted from X and the side information S. Our goal is to design an efficient feature extractor which carries only the information not captured by S. In addition, we need to design a fusion mechanism for the inference module to combine such features with the side information S, and provide inference results conditioned on the side information. Let PX,S,Y denote the joint distribution of X, S, Y . Throughout our development in this section, we consider the feature geometry on FX S Y with the metric distribution RX,S,Y PXPS,Y . 5.1 Dependence Decomposition and Feature Learning To begin, we represent the joint dependence in function space as the CDK function i X;S,Y FX S Y. Since the side information S is provided for the inference, we focus on the dependence between X and target Y not captured by the side information. To this end, we separate the (X; S) dependence from the joint dependence, by considering the orthogonal decomposition of function space [cf. Fact 1 and (3)]: FX S Y = FX S FY|(X S). (53) This induces an orthogonal decomposition of the joint dependence i X;S,Y = πM(i X;S,Y ) + πC(i X;S,Y ), (54) where we have defined πM(γ) Π (γ; FX S) and πC(γ) Π γ; FY|(X S) for all γ FX S Y. We characterize the decomposed components as follows, a proof of which is provided in Appendix C.9. Neural Feature Learning in Function Space Proposition 22 We have πM(i X;S,Y ) = i X;S = ℓP M X,S,Y , where P M X,S,Y PX|SPSPY |S. From Proposition 22, we have πM(i X;S,Y ) = i X;S = ℓP M X,S,Y , where P M X,S,Y PX|SPSPY |S. More generally, the space FX S characterizes CDK functions associated with such Markov distributions, which we formalize as follows. A proof of which is provided in Appendix C.10. Proposition 23 Given QX,S,Y with QX = PX, QS,Y = PS,Y , let i(Q) X;S,Y denote the corresponding CDK function. Then, i(Q) X;S,Y FX S if and only if QX,S,Y = QX|SQSQY |S. Hence, we refer to the dependence component πM i X;S,Y = i X;S as the Markov component. Then, we have πC i X;S,Y = i X;S,Y i X;S, which characterizes the joint dependence not captured by S. We refer to it as the Conditional dependence component, and also denote it by i X;Y |S, i.e., i X;Y |S(x, s, y) i X;S,Y (x, s, y) i X;S(x, s) = " PX,S,Y P M X,S,Y RX,S,Y (x, s, y). (55) Therefore, the conditional dependence component i X;Y |S vanishes, i.e., i X;Y |S = 0, if and only if X and Y are conditionally independent given S. In general, from the Pythagorean relation, we can write i X;Y |S 2 = i X;S,Y 2 i X;S 2 , (56) analogous to the expression of the conditional mutual information I(X; Y |S) = I(X; S, Y ) I(X; S). Indeed, we can establish an explicit connection in the local regime where X and (S, Y ) are ϵdependent, i.e., i X;S,Y = O(ϵ). Then, from Lemma 9 we obtain i X;S,Y 2 = 2 I(X; S, Y )+o(ϵ2), and similarly, i X;S 2 = 2 I(X; S) + o(ϵ2). Therefore, (56) becomes i X;Y |S 2 = i X;S,Y 2 i X;S 2 = 2 I(X; Y |S) + o(ϵ2). We then discuss learning these two dependence components by applying the nesting technique. To begin, note that since i X;S = πM(i X;S,Y ) = Π (i X;S,Y ; FX FS) , (57) i X;Y |S = πC(i X;S,Y ) = Π i X;S,Y ; FX FY|S , (58) we recognize the decomposition i X;S,Y = i X;S + i X;Y |S as a special case of (43). Therefore, similar to our discussions in Section 4.3, we consider the nesting configuration CMC and its refinement C MC, where [cf. (44)] CMC ( k, k); FX; (FS, FS Y) . (59) The corresponding nested H-scores are defined on dom(CMC) = dom(C MC) = f f : f F k X, g F k S , f F k X, g F k Y S In particular, we can compute the nested H-score configured by CMC from a nested network structure as shown in Figure 10. Then, we can obtain both dependence components by optimizing the nested H-scores. Formally, we have the following corollary of Theorem 19 and Theorem 20. Xu and Zheng f g ++ ++ f f Figure 10: Nesting Technique for Learning With Side Information S Corollary 24 Given k rank(i X;S), H f f is maximized if and only if f g = i X;S, (61a) f g = ζ k(i X;Y |S). (61b) In addition, H f f is maximized if and only if fi gi = ζi(i X;S), i [ k]. (62a) fi gi = ζi(i X;Y |S), i [k]. (62b) 5.2 Feature Assembling and Inference Models We then assemble the features for inference tasks, particularly the inference conditioned on S. We first consider the case where we have learned both dependence components i X;S and i X;Y |S, for which we have the following characterization (cf. Proposition 12). A proof is provided in Appendix C.11. Proposition 25 Suppose features f F k X, g F k S and f F k X, g F k S Y satisfy f g = i X;S, f g = i X;Y |S. Then, we have i X;S 2 = tr(Λ fΛ g), i X;Y |S 2 = tr(ΛfΛg), and PY |X,S(y|x, s) = PY |S(y|s) 1 + f T(x)g(s, y) 1 + f T(x) g(s) In addition, for any function ψ F d Y , E [ψ(Y )|X = x, S = s] = E [ψ(Y )|S = s] + 1 1 + f T(x) g(s) Λ(s) ψ,gf(x), (64) where we have defined Λ(s) ψ,g E ψ(Y )g T(s, Y )|S = s for each s S. Therefore, we can compute the strength of both the Markov component i X;S and the conditional component i X:Y |S from the features. Similarly, we can further compute the spectrum of the dependence components, by learning the modal decomposition according to (62). From Proposition 25, we can obtain inference models conditioned on the side information S. In particular, for classification task, we can use (63) to compute the posterior probability, with the resulting MAP estimation conditioned on S = s [cf. (27)]: ˆy MAP(x; s) = arg max y Y PY |X,S(y|x, s) = arg max y Y PY |S(y|s) 1 + f T(x)g(s, y) 1 + f T(x) g(s) Neural Feature Learning in Function Space Specifically, PY |S can be obtained by a separate discriminative model that predicts Y from the side information S. In addition, when Y is continuous, we can obtain the MMSE estimator of ψ(Y ) conditioned on S = s from (64), where we can learn E [ψ(Y )|S = s] and Λ(s) ψ,g = E ψ(Y )g T(s, Y )|S = s separately from (S, Y ) pairs. As we construct both models by assembling learned features, the model outputs depend on input data X only through the features f and f of X, as desired. Moreover, we can conduct a conditional independence test without learning the complete conditional dependence i X;Y |S. In particular, suppose we have learned features f F k X, g F k S Y with f g = ζ k(i X;Y |S) for some k 1. Then we obtain tr(ΛfΛg) = ζ k(i X;Y |S) 2 0, where the equality holds if and only if i X;Y |S = 0, i.e., X and Y are conditionally independent given S. 5.3 Theoretical Properties and Interpretations We conclude this section by demonstrating theoretical properties of the learned features. In particular, we focus on the conditional dependence component i X;Y |S and associated features, as the Markov component i X;S shares the same properties as discussed in the bivariate case. To begin, let K rank(i X;Y |S), and let the modal decomposition of i X;Y |S be ζi(i X;Y |S) = σi (f i g i ), i [K], (66) where we have represented each mode in the standard form. Then, we can interpret the σi, f i , g i as the solution to a constrained maximal correlation problem. To see this, note that from i X;Y |S = Π i X;S,Y ; FY|X,S = Π i X;S,Y ; FX FY|S , we can obtain σi(f i g i ) = ζi(i X;Y |S) = ζi(i X;S,Y |FX, FY|S). Therefore, f i , g i are the constrained maximal correlation function of X and (S, Y ) as defined in Proposition 7, with the subspaces GX = FX, GS Y = FY|S. 5.3.1 Local Posterior Distribution and Conditional Dependence In a local analysis regime, we can simplify the posterior distribution PY |X,S as follows. A proof is provided in Appendix C.12. Proposition 26 If X and (S, Y ) are ϵ-dependent, we have PY |X,S(y|x, s) = PY |S(y|s) i=1 σif i (x)g i (s, y) + o(ϵ), (67) where σi, f i , g i are as defined in (66). From (67), the dominant term of PY |X,S(y|x, s) depends on x only through f i (x), i = 1, . . . , K. Therefore, the feature f [K](X) = (f 1 (X), . . . , f K(X))T captures the conditional dependence between X and Y given S except possibly higher-order terms of ϵ. 5.3.2 Relationship to Multitask Classification DNNs We can also establish a connection between the side information problem and deep neural networks for multitask learning. Specifically, we consider a multitask classification task where X and Y denote the input data and target label to predict, respectively, and S denotes the index for tasks. When conditioned on different values of S, the dependence between data and label are generally different. We then demonstrate that a multitask DNN also learns the optimal approximation of the conditional dependence component i X;Y |S. We consider a classical multitask classification DNN design (Caruana, 1993; Ruder, 2017), as shown in Figure 11. In this figure, feature f F k X of X is shared among all tasks. For each task Xu and Zheng softmax GT 1 GT 1 softmax GT 1 GT 2 softmax GT 1 GT |S| P (f,g,b) Y |X,S=1 P (f,g,b) Y |X,S=2 P (f,g,b) Y |X,S=|S| Figure 11: A multihead network for extracting feature f shared among different tasks in S = {1, . . . , |S|}. Each task s S corresponds to a separate classification head with weight matrix GT s and bias vector bs for generating the associated posterior P (f,g,b) Y |X,S=s. s S, the corresponding classification head with weight matrix GT s R|Y| k and bias bs R|Y| are applied to compute the corresponding posterior probability P (f,g,b) Y |X,S (y|x, s) exp (f(x) g(s, y) + b(s, y)) P y Y exp (f(x) g(s, y ) + b(s, y )), (68) where g FS Y and b FY are related to Gs and bs via [cf. (30)] Gs(i, y) = gi(s, y) for all i [k], y Y, bs = [b(s, 1), . . . , b(s, |Y|)]T. (69) Given data samples {(xi, si, yi)}n i=1 with the empirical distribution PX,S,Y , we write the corresponding likelihood function as LS(f, g, b) 1 i=1 log P (f,g,b) Y |X,S (yi|xi, si) = E( ˆ X, ˆY , ˆS) PX,Y,S h log P (f,g,b) Y |X,S ( ˆY | ˆX, ˆS) i . (70) Note that we can relate the posterior probability P (f,g,b) Y |X,S to the posterior P (f,g,b) Y |X in the ordinary classification DNN, as defined in (31). To see this, note that for all s S, we have P (f,g,b) Y |X,S=s = P (f,g(s),b(s)) Y |X , where we have defined g(s) F k Y and b(s) FY for each s S, as g(s)(y) g(s, y), b(s)(y) b(s, y). Then, we rewrite (70) as LS(f, g, b) = P s S PS(s)L(s) S (f, g(s), b(s)), where L(s) S (f, g, b) E( ˆ X, ˆY ) PX,Y |S=s h log P (f,g,b) Y |X ( ˆY | ˆX) i is the expected likelihood value conditioned on S = s. We further assume the all bias terms are trained to their optimal values with respect to f and g. This gives the likelihood s S PS(s)L(s) S (f, g(s)) = max b FS Y LS(f, g, b), (71) Neural Feature Learning in Function Space where we have denoted L(s) S (f, g) maxb FY L(s) S (f, g, b) for each s S. Then, from Property 3 we can verify that LS(f, g) depends only on some centered features, formalized as follows. Property 4 We have LS(f, g) = LS( f, g), where we have defined f Π f; FX , and g Π g; FY|S , i.e., f(x) = f(x) E [f(X)] and g(s, y) = g(s, y) E [g(s, Y )|S = s]. Therefore, we can focus on centered features f F k X and g F k Y|S, i.e., E [f(X)] = 0 and E [g(s, Y )|S = s] = 0 for all s S. We also restrict to features f, g that perform better than the trivial choice of zero features, by assuming that L(s) S (f, g(s)) L(s) S (0, g(s)) = L(s) S (0, 0) = H(Y |S = s), for all s S. (72) Then, we have the following characterization, which extends Proposition 14 to the multitask setting. A proof is provided in Appendix C.13. Theorem 27 Suppose X and (S, Y ) are ϵ-dependent. For f F k X and g F k Y|S with (72), LS(f, g) = LS(0, 0) + 1 2 i X;Y |S 2 i X;Y |S f g 2 + o(ϵ2), (73) which is maximized if and only if f g = ζ k(i X;Y |S) + o(ϵ). Therefore, the multitask classification network essentially learns features approximating the conditional dependence i X;Y |S. Different from the nested H-score implementation, this multitask network implements the conditioning by directly applying a separate classification head for each task S = s. As a consequence, this design requires |S| many different heads, and is not applicable when the side information S is continuous or has complicated structures. 6. Multimodal Learning With Missing Modalities In this section, we demonstrate another multivariate learning application, where we need to conduct inferences based on different data sources. In particular, we focus on the setting where the goal is to infer Y from two different data sources, denoted by X1 and X2. We refer to such problems as the multimodal learning12 problems, and are particularly interested in the cases where we have missing modalities: either X1 or X2 can be missing during the inference. Our goal is to design a learning system to solve all the three problems: (i) inferring Y based on X1, (ii) inferring Y based on X2, and (iii) inferring Y based on (X1, X2). Throughout our discussions in this section, we use PX1,X2,Y to denote the joint distribution of X1 X1, X2 X2, Y Y. For convenience, we also denote X (X1, X2) X X1 X2. We consider the feature geometry on FX Y = FX1 X2 Y, with the metric distribution RX1,X2,Y = PX1,X2PY , or equivalently, RX,Y = PXPY . 6.1 Dependence Decomposition To begin, we decompose the joint dependence i X1,X2;Y FX1 X2 Y as i X1,X2;Y = πB(i X1,X2;Y ) + πI(i X1,X2;Y ), (74) 12. The literature typically uses multimodal to refer to the different forms (modalities) of data sources, e.g., video, audio, and text. However, such distinction is insignificant in our treatment, when we model each data source as a random variable. Xu and Zheng where we have defined πB(γ) Π (γ; FX1 Y + FX2 Y) , πI(γ) γ πB(γ). We refer to πB(i X1,X2;Y ) as the Bivariate dependence component, and refer to πI(i X1,X2;Y ) as the Interaction component. The bivariate dependence component πB(i X1,X2;Y ) is uniquely determined by all pairwise dependencies among X1, X2, Y . Formally, let QB denote the collection of distributions with the same pairwise marginal distributions as PX1,X2,Y , i.e., QB {QX1,X2,Y PX1 X2 Y : QX1,X2 = PX1,X2, QX1,Y = PX1,Y , QX2,Y = PX2,Y }. (75) Then we have the following result. A proof is provided in Appendix C.14. Proposition 28 For all QX1,X2,Y QB, we have πB(i(Q) X1,X2;Y ) = πB(i X1,X2;Y ), where i(Q) X1,X2;Y denotes the CDK function associated with QX1,X2,Y . πB = πB1 + πB2 FX1 Y + FX2 Y Figure 12: Decompose the joint dependence i X1,X2;Y into bivariate dependence component πB and interaction dependence component πI. The plane denotes the sum of FX1 Y and FX2 Y. We show the relation between different dependence components in Figure 12, where we have further decomposed the bivariate dependence component πB(i X1,X2;Y ) as πB(i X1,X2;Y ) = πB1(i X1,X2;Y )+ πB2(i X1,X2;Y ) for some πBi(i X1,X2;Y ) FXi Y, i = 1, 2. For comparison, we have also demonstrated πM1(i X1,X2;Y ) = i X1;Y and πC1(i X1,X2;Y ) = i X1,X2;Y i X1;Y , obtained from the decomposition introduced in Section 5.1. Note that since the interaction component πI(i X1;X2;Y ) does not capture any bivariate dependence, we can also obtain πM1(i X1,X2;Y ) directly from the bivariate component πB(i X1,X2;Y ) via a projection: πM1(i X1,X2;Y ) = πM1(πB(i X1,X2;Y )). 6.2 Feature Learning from Complete Data We consider learning the features representations for the two dependence components. Here, we assume the data are complete (X1, X2, Y ) triplets with the empirical distribution PX1,X2,Y . We will discuss the learning with incomplete data later. Again, we apply the nesting technique to design the training objective. Note that since X = (X1, X2), with GX = FX1 + FX2, we can express the two components as [cf. (43)] πB(i X;Y ) = Π (i X;Y ; GX FY) , (76) πI(i X;Y ) = Π (i X;Y ; (FX GX) FY) . (77) Neural Feature Learning in Function Space Figure 13: Nesting Technique for Learning Features from Multimodal Data Therefore, we consider the nesting configuration CBI and its refinement C BI, as [cf. (44)] CBI ( k, k); (FX1 + FX2, FX); FY . (78) The corresponding nested H-scores are defined on dom(CBI) = dom(C BI) = f f : f F k X1 + F k X2, g F k Y , f F k X, g F k Y Specifically, we can compute the nested H-score configured by CBI using a nested network structure as shown in Figure 13. Then we can obtain both dependence components by maximizing the corresponding nested H-scores, formalized as follows (cf. Theorem 19 and Theorem 20). Corollary 29 Given k rank(πB(i X1,X2;Y )), the nested H-score H f f is maximized if and only if f g = πB(i X1,X2;Y ), (80a) f g = ζ k (πI(i X1,X2;Y )) . (80b) In addition, H f f is maximized if and only if fi gi = ζi (πB(i X1,X2;Y )) , i [ k], (81a) fi gi = ζi (πI(i X1,X2;Y )) , i [k]. (81b) 6.3 Feature Assembling and Inference Models We then illustrate how to assemble the learned features for the inference tasks and deal with incomplete data. For convenience, we define the conditional expectation operators τi, i = 1, 2, such that for f F k X1 X2 with k 1, we have [τi(f)](xi) E [f(X1, X2)|Xi = xi] , for all xi Xi. (82) Note that we can also interpret τi as to the projection onto FXi, i.e., τi(f) = Π (f; FXi). Then, we have the following result. A proof is provided in Appendix C.15. Xu and Zheng Proposition 30 Suppose we have f g = πB(i X1,X2;Y ), f g = πI(i X1,X2;Y ) for features f = f(1) + f(2) with f(i) F k Xi, i = 1, 2, g F k Y , f F k X, and g F k Y . Then, we have PY |X1,X2(y|x1, x2) = PY (y) 1 + f T(x1, x2) g(y) + f T(x1, x2)g(y) , (83a) PY |X1(y|x1) = PY (y) 1 + f(1)(x1) + [τ1( f(2))](x1) T g(y) , (83b) PY |X2(y|x2) = PY (y) 1 + f(2)(x2) + [τ2( f(1))](x2) T g(y) . (83c) In addition, for all ψ F d Y , we have E [ψ(Y )|X1 = x1, X2 = x2] = E [ψ(Y )] + Λψ, g f(x1, x2) + Λψ,gf(x1, x2), (84a) E [ψ(Y )|X1 = x1] = E [ψ(Y )] + Λψ, g f(1)(x1) + [τ1( f(2))](x1) , (84b) E [ψ(Y )|X2 = x2] = E [ψ(Y )] + Λψ, g f(2)(x2) + [τ2( f(1))](x2) . (84c) From Proposition 30, we can obtain inference models for all three different types of input data, by simply assembling the learned features in different ways. The resulting inference models also reveal the different roles of two dependence components. For example, the features associated with the interaction dependence component, i.e., f and g, are used only when we have both X1 and X2 observations. In practice, we can use (83) and (84) for classification and estimation tasks, respectively. To apply (84), we can compute Λψ, g and Λψ,g from the corresponding empirical averages over the training dataset, and learn features τ1( f(2)) and τ2( f(1)) from (X1, X2) pairs. For example, we can use Proposition 12 to implement the conditional expectation operators τ1 and τ2 [cf. (26)]. 6.4 Theoretical Properties and Interpretations We then introduce several theoretical properties of the dependence decomposition and induced feature representations, including their connections to the principle of maximum entropy (Jaynes, 1957a,b) and the optimal transformation of variables (Breiman and Friedman, 1985). 6.4.1 Dependence Decomposition We can relate the bivariate-interaction decomposition (74) to decomposition operations in both the probability distribution space and the data space. Decomposition in Distribution Space We assume that for all (x1, x2, y) X1 X2 Y, [πB(i X1,X2;Y )](x1, x2, y) 1, [πI(i X1,X2;Y )](x1, x2, y) 1, (85) and define the associated distributions P B X1,X2,Y (x1, x2, y) PX1,X2(x1, x2)PY (y)(1 + [πB(i X1,X2;Y )](x1, x2, y)), (86a) P I X1,X2,Y (x1, x2, y) PX1,X2(x1, x2)PY (y)(1 + [πI(i X1,X2;Y )](x1, x2, y)). (86b) Then, we have the following characterization, a proof of which is provided in Appendix C.16. Proposition 31 Under assumption (85), we have P B X1,X2,Y , P I X1,X2,Y PX1 X2 Y, with marginal distributions P B X1,X2 = P I X1,X2 = PX1,X2 and P B Xi,Y = PXi,Y , P I Xi,Y = PXi PY for i = 1, 2. Neural Feature Learning in Function Space From Proposition 31, P I X1,X2,Y has marginal distributions P I Xi,Y = PXi PY , i = 1, 2, and does not capture (X1; Y ) or (X2; Y ) dependence. On the other hand, P B X1,X2,Y has the same pairwise marginal distributions as PX1,X2,Y , i.e., P B X1,X2,Y QB with QB as defined in (75). We can show that P B X1,X2,Y also achieves the maximum entropy in QB in the local analysis regime. Formally, let P ent X1,X2,Y arg max QX1,X2,Y QB H(QX1,X2,Y ) (87) denote the entropy maximizing distribution on QB, where H(QX1,X2,Y ) denotes the entropy of (X1, X2, Y ) QX1,X2,Y . Then we have the following result. A proof is provided in Appendix C.17. Proposition 32 Suppose X = (X1, X2) and Y are ϵ-dependent, and let i(ent) X1,X2;Y denote the CDK function associated with P ent X1,X2,Y . Then, we have πB(i X1,X2;Y ) i(ent) X1,X2;Y = o(ϵ), or equivalently, P ent X1,X2,Y (x1, x2, y) = P B X1,X2,Y (x1, x2, y) + o(ϵ), for all x1, x2, y. (88) Decomposition in Data Space For each triplet (x1, x2, y) X1 X2 Y, we consider the decomposition (x1, x2, y) 7 (x1, x2), (x1, y), (x2, y). (89) Suppose the dataset13 D n x(i) 1 , x(i) 2 , y(i) o i [n] has the empirical distribution PX1,X2,Y , where each tuple x(i) 1 , x(i) 2 , y(i) X1 X2 Y. Then, by applying this decomposition on D and grouping the decomposed pairs, we obtain three separate datasets n x(i) 1 , x(i) 2 o i [n] , n x(i) 1 , y(i) o i [n] , n x(i) 2 , y(i) o i [n] , (90) which have empirical distributions PX1,X2, PX1,Y , and PX2,Y , respectively. Therefore, we can interpret the decomposition (89) as extracting the bivariate dependence component from the joint dependence: the new pairwise datasets retain all pairwise dependence, but do not capture any interaction among X1, X2, Y . Indeed, it is easy to see that, for each dataset with an empirical distribution taken from QB, the decomposition (89) leads to the same pairwise datasets. Reversely, we can reconstruct P B X1,X2,Y from the pairwise datasets (90). We will discuss the reconstruction algorithm design later. 6.4.2 Feature Representations Let K rank(πB(i X1,X2;Y )) and K rank(πI(i X1,X2;Y )). Then, we can represent the dependence modes of the bivariate component πB(i X1,X2;Y ) and πI(i X1,X2;Y ) in their standard forms, as ζi(πB(i X1,X2;Y )) = σi ( f i g i ), i [ K], (91a) ζi(πI(i X1,X2;Y )) = σi (f i g i ), i [K]. (91b) By applying Proposition 7, we can interpret these features as solutions to corresponding constrained maximal correlation problems. For example, since ζi(πB(i X1,X2;Y )) = ζi(i X1,X2;Y |FX1 + FX2, FY), ( f i , g i ) is the i-th constrained maximal correlation function pair of X = (X1, X2) and Y restricted to subspaces FX1 + FX2 and FY, respectively. 13. Though the dataset is modeled as a multiset without ordering, we introduce the index i for the convenience of presentation, which corresponds to a specific realization for traversing the dataset. Xu and Zheng The top mode ( σ1, f 1 , g 1) in (91a) also characterizes the optimal solution to a classical regression formulation. Specifically, given input variables X1, X2 and the output variable Y , Breiman and Friedman (1985) formulated the regression problem minimize φ(1) FX1,φ(2) FX2 ψ FY : ψ =1 E ψ(Y ) φ(1)(X1) φ(2)(X2) 2 , (92) where the minimization is over zero-mean functions φ(1), φ(2), and ψ. The solution of (92), referred to as the optimal transformations (Breiman and Friedman, 1985), can be characterized as follows. A proof is provided in Appendix C.18. Proposition 33 The minimum value of optimization problem (92) is 1 σ2 1, which can be achieved by φ(1) + φ(2) = σ1 f 1 and ψ = g 1. Therefore, the optimal transformations depend on, and thus characterize only, the top mode of the bivariate dependence component πB(i X1,X2;Y ). 6.5 Learning With Missing Modalities We conclude this section by briefly discussing feature learning based on incomplete samples. 6.5.1 Learning from Pairwise Samples A special case of the incomplete samples is the pairwise datasets (90) obtained from the decomposition (89). Specifically, suppose we obtain (90) from D n x(i) 1 , x(i) 2 , y(i) o i [n], and let PX1,X2,Y denote the empirical distribution of D. Since the bivariate dependence is retained in the decomposition (89), we can learn πB(i X1,X2;Y ) from the pairwise datasets (90). In particular, when we set k = 0 in CBI [cf. (78)], we have H f f = 2 H ( f, g), H ( f, g) = H f(1) + f(2), g = E f(1)(X1) + f(2)(X2) T g(Y ) E h f(1)(X1) + f(2)(X2) i T E [ g(Y )] 1 2 tr Λ fΛ g = H ( f(1), g) + H ( f(2), g) tr Λ f(1), f(2)Λ g . (93) Therefore, we can evaluate (93) from the pairwise datasets (90), since each H ( f(i), g) depends only on PXi,Y for i = 1, 2, and Λ f(1), f(2) depends only on PX1,X2. Then, from Corollary 29, we can obtain πB(i X1,X2;Y ) and the same set of features. 6.5.2 General Heterogeneous Training Data We then consider general forms of heterogeneous training data, as shown in Table 1. In particular, suppose there are n n0 + n1 + n2 training samples, and we group them into separate datasets: D0 contains n0 complete observations of (X1, X2, Y ), and, for i = 1, 2, each Di has ni sample pairs of (Xi, Y ). Our goal is to learn features from the heterogeneous data and obtain similar inference models as we introduced in Section 6.3. In this case, in addition to the empirical distributions of these datasets, we also need to consider the sample sizes n0, n1, n2 that indicate the relative qualities. To begin, we use a metric distribution Neural Feature Learning in Function Space Datasets Empirical Distribution Remark D0 = (x(i) 1 , x(i) 2 , y(i)) n0 i=1 ˆP (0) X1,X2,Y Complete Observation D1 = (x(i) 1 , y(i)) n0+n1 i=n0+1 ˆP (1) X1,Y X2 missing D2 = (x(i) 2 , y(i)) n0+n1+n2 i=n0+n1+1 ˆP (2) X2,Y X1 missing Table 1: Heterogeneous Training Data With Missing Modalities of the product form RX1,X2,Y = RX1,X2RY , where RX1,X2 and RY correspond to some empirical distributions of training data. For example, we can set RX1,X2 = ˆP (0) X1,X2 and RY = η0 ˆP (0) Y + η1 ˆP (1) Y + η2 ˆP (2) Y with ηi ni/n for i = 0, 1, 2, which correspond to the empirical distributions of all (X1, X2) sample pairs and all Y samples, respectively. Then, for any given QX1,X2,Y PX1 X2 Y , we use a weighted sum L(QX1,X2,Y ) to characterize the difference between QX1,X2,Y and the heterogeneous observations, defined as L(QX1,X2,Y ) η0 ℓˆP (0) X1,X2,Y ℓQX1,X2,Y 2 + η1 ℓˆP (1) X1,Y ℓQX1,Y 2 + η2 ℓˆP (2) X2,Y ℓQX2,Y 2. We use P (est) X1,X2,Y to denote the optimal distribution that minimizes (94). We can again apply the nesting technique to learn the feature representations associated with P (est) X1,X2,Y . To begin, we use H (f, g; QX,Y ) to denote the H-score computed over the joint distribution QX,Y , defined as H (f, g; QX,Y ) 1 ℓQX,Y 2 ℓQX,Y f g 2 = EQX,Y f T(X)g(Y ) (ERX [f(X)])T ERY [g(Y )] 1 2 tr (ΛfΛg) , with Λf = ERX f(X)f T(X) and Λg = ERY g(Y )g T(Y ) . Then, we define the H-score associated with the heterogeneous datasets shown in Table 1, as Hm(f, g) η0 H f, g; ˆP (0) X1,X2,Y + η1 H τ1(f), g; ˆP (1) X1,Y + η2 H τ2(f), g; ˆP (2) X2,Y , (95) where we have defined conditional expectation operators τi, i = 1, 2 as in (82), with respect to the distribution RX1,X2. By applying the same nesting configuration CBI, we can obtain the corresponding nested H-score = Hm( f, g) + Hm Then, we have the following theorem, which extends Corollary 29 to the heterogeneous datasets. A proof is provide in Appendix C.19. Theorem 34 Given k rank πB ℓP (est) X1,X2,Y , the nested H-score Hm in (96) is maximized if and only if f g = πB ℓP (est) X1,X2,Y , f g = ζ k πI ℓP (est) X1,X2,Y Xu and Zheng We can also use the refined configuration C BI to obtain modal decomposition of the dependence components. The inference models can be built by assembling learned features, as we have discussed in Section 6.3. Furthermore, we can show that the estimation P (est) X1,X2,Y coincides with the maximum likelihood estimation (MLE) in a local analysis regime. Formally, let P {D0, D1, D2; QX1,X2,Y } denote the probability of observing datasets D0, D1, D2, when all data samples are independently generated by QX1,X2,Y . Then, we can write the MLE solution as P (ML) X1,X2,Y arg max QX1,X2,Y P {D0, D1, D2; QX1,X2,Y } , (98) and we have the following characterization. A proof is provided in Appendix C.20. Theorem 35 If L(RX1,X2,Y ) = O(ϵ2), then we have P (ML) X1,X2,Y (x1, x2, y) = P (est) X1,X2,Y (x1, x2, y) + o(ϵ), for all x1, x2, y. (99) 7. Experimental Verification To verify the learning algorithms as well as established theoretical properties, we design a series of experiments with various types of data. The main goal is to compare the features learned by neural feature extractors and the corresponding theoretical results. To allow such comparisons, we generate data from given probability distributions of which we know the analytical form of optimal features. The source codes for all experiments are available at github.com/Xiangxiang Xu/NFE, and we defer the implementation details to Appendix D. 7.1 Learning Maximal Correlation Functions We first consider learning dependence modes, i.e., maximal correlation functions from sample pairs of (X, Y ), by maximizing the nested H-score (38). We verify the effectiveness by experiments on both discrete and continuous data, and also discuss one application in analyzing sequential data. 7.1.1 Discrete Data The simplest case for dependence learning is when X and Y are both discrete with small alphabet sizes |X| and |Y|. In this case, we can design neural feature extractors with ideal expressive powers. Suppose X = {1, . . . , |X|}, then we can express f F k X on X by first mapping each i X to i-th standard basis vector in R|X|, also known as the one-hot encoding in practice, and then applying a linear function R|X| Rk to the mapped result, which we implement by using a linear layer. Then, any f F k X can be expressed in this way by setting corresponding weights in the linear layer. Similarly, we can express g F k Y using another linear layer. In the experiment, we set |X| = 8, |Y| = 6, and randomly generate a PX,Y PX Y. We generate N = 30 000 training samples from PX,Y , and learn k = 3 dimensional features f, g by maximizing H (f, g; {(1)k; FX; FY}). Then, we normalize each fi, gi to obtain corresponding estimations of f i , g i , and σi by applying (40). We show the estimated features and singular values in Figure 14, which are consistent with the corresponding theoretical values computed from PX,Y . 7.1.2 Continuous Data We proceed to consider a continuous dataset with degenerate dependence modes, i.e., the singular values σi s are not all distinct. Neural Feature Learning in Function Space Theoretical Learned Theoretical Learned Theoretical Learned (a) f i (x) for each i = 1, 2, 3 Theoretical Learned Theoretical Learned Theoretical Learned (b) g i (y) for each i = 1, 2, 3 Theoretical Learned Figure 14: Top three dependence modes learned from a discrete dataset, which are consistent with the theoretical results. In particular, we consider X, Y taking values from X = Y = [ 1, 1], where the joint probability density function p X,Y takes a raised cosine form: p X,Y (x, y) = 1 4 [1 + cos(π(x y))] , (x, y) [ 1, 1]2. (100) Then, it can be verified that the corresponding marginal distributions of X, Y are uniform distributions p X = p Y = Unif([ 1, 1]). In addition, the resulting CDK function is i X;Y (x, y) = p X,Y (x, y) p X(x)p Y (y) p X(x)p Y (y) = cos(π(x y)). (101) Note that we have cos(π(x y)) = cos(πx + θ0) cos(πy + θ0) + sin(πx + θ0) sin(πy + θ0), for any θ0 [ π, π). Therefore, we have rank(i X;Y ) = 2, and the associated dependence modes are given by σ1 = σ2 = 1/2 and the maximal correlation functions 2 cos(πx + θ0), f 2 (x) = 2 sin(πx + θ0), (102a) 2 cos(πy + θ0), g 2(y) = 2 sin(πy + θ0), (102b) Xu and Zheng (a) Histogram of Training Data 1.0 0.5 0.0 0.5 1.0 x 1.0 0.5 0.0 0.5 1.0 y (b) Learned Dependence Modes Figure 15: Dependence modes learned from continuous data, in consistent with theoretical results. for any θ0 [ π, π). During this experiment, we first generate N = 50 000 sample pairs of X, Y for training, with histogram showing in Figure 15a. Then, we learn k = 2 dimensional features f1, f2 of X and g1, g2 of Y by maximizing the nested H-score (38), where f and g are parameterized neural feature extractors detailed in Appendix D.1.2. Figure 15b shows the learned functions after the normalization (40). The learned results well match the theoretical results (102): (i) The learned f 1 and f 2 are sinusoids differ in phase by π/2, and (ii) g i coincides with f i , for each i = 1, 2. It is also worth mentioning that due to the degeneracy σ1 = σ2, the initial phase θ0 in learned sinusoids (102) can be different during each run of the training algorithm. 1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 x ψ1 Est. Opt. ψ2 Est. Opt. ψ3 Est. Opt. Figure 16: MMSE estimators E [ψi(Y )|X = x] obtained from learning dependence modes, in comparison with theoretical results. Neural Feature Learning in Function Space Based on the learned dependence modes, we then demonstrate estimating functions of Y based on observed X = x. Here, we consider the functions ψ1(y) = y, ψ2(y) = y2, ψ3(y) = ey. From Proposition 12, we can compute the learned MMSE estimator E [ψi(Y )|X = x] for each i, by estimating E [ψi(Y )] and Λψi,g = E ψi(Y )g T(Y ) from the training set and then applying (26). For comparison, we compute the theoretical values E [ψi(Y )|X = x] = Z 1 1 p Y |X(y|x)ψi(y) dy, with p Y |X(y|x) = 1 2 [1 + cos(π(y x))], which gives E [Y |X = x] = 1 π sin(πx), E Y 2 X = x = 1 π2 cos(πx), (103a) E e Y X = x = e2 1 2e(1 + π2) (π sin(πx) cos(πx) + π2 + 1). (103b) Figure 7.1.2 shows the estimators obtained by applying (26) on the learned features, which are consistent with the theoretically optimal estimators given by (103). 7.1.3 Sequential Data We proceed with an example of learning dependence modes among sequence pairs. For simplicity, we consider binary sequences X and Y , of lengths l and m, respectively. Suppose we have the Markov relation X U V Y for some unobserved binary factors U, V U = V = {0, 1}. In addition, we assume14 X = (X1, . . . , Xl)T, Y = (Y1, . . . , Ym)T satisfy X|U = i BMS(l, qi), Y |V = i BMS(m, qi), for i = 0, 1, (104) where BMS(l, q) denotes the distribution of a binary first-order Markov sequence of length l with state flipping probability q. The corresponding state transition diagram is shown in Figure 17a. Therefore, if Z BMS(l, q), then Z1 Unif({0, 1}) and (Z1, . . . , Zl) forms a first order Markov chain over binary states {0, 1}, with flipping probability P {Zi+1 = Zi|Zi = z} = q for both z = 0, 1. Formally, Z BMS(l, q) if and only if h (1 q)δzizi+1 q1 δzizi+1 i for all (z1, . . . , zl) {0, 1}l. As a consequence, the resulting alphabets are X = {0, 1}l, Y = {0, 1}m, with sizes |X| = 2l, |Y| = 2m. In our experiment, we set l = 40, m = 30, q0 = 0.1, q1 = 0.9, and use the following joint distribution PU,V : Prob. U = 0 U = 1 V = 0 0.1 0.2 V = 1 0.4 0.3 We generate N = 50 000 training sample pairs of X, Y , with instances shown in Figure 17b. We also generate N = 10 000 sample pairs in the same manner, as the testing dataset. Then, we learn k = 1 dimensional features f and g by maximizing H (f, g) over the training set. We plot the extracted features in Figure 18. In particular, each point represents an (f(x), g(y)) pair 14. For convenience, we adopt the vector notation to represent sequences. Xu and Zheng (a) State Transitions of BMS(l, q) 5 10 15 20 25 30 35 40 5 10 15 20 25 30 5 10 15 20 25 30 35 40 5 10 15 20 25 30 5 10 15 20 25 30 35 40 5 10 15 20 25 30 (b) Generated Samples of X, Y Pair Figure 17: Sequential Data: Model and Generated X, Y Samples Pairs 1.0 0.5 0.0 0.5 1.0 f(x) (U, V ) = (0, 0) (U, V ) = (0, 1) (U, V ) = (1, 0) (U, V ) = (1, 1) Figure 18: Features Learned from Sequential Data X and Y evaluated on an instance from testing set, with corresponding values of binary factors (U, V ) shown for comparison. For the ease of demonstration, here we plot only 1, 000 sample pairs randomly chosen from the testing set. From the figure, the learned features are clustered according to the underlying factors. This essentially reconstructs the hidden factors U, V . For example, one can apply a standard clustering algorithm on the features, e.g., k-means (Hastie et al., 2009), then count the proportion of each cluster, to obtain an estimation of PU,V up to permutation of symbols. For a closer inspection, we can compare the learned features with the theoretical results, formalized as follows. A proof is provided in Appendix C.21. Proposition 36 Suppose X, Y satisfy the Markov relation X U V Y with U, V {0, 1} and the conditional distributions (104). Then, we have rank(i X;Y ) 1, and the corresponding maximal correlation functions f 1 , g 1 satisfy f 1 (x) = c [tanh (2w ϕ(x) + b U) tanh(b U)] , (105a) g 1(y) = c tanh 2w ϕ(y) + b V tanh(b V ) , (105b) Neural Feature Learning in Function Space for some c, c R, where w 1 2 log PU(1) PU(0), b V 1 2 log PV (1) PV (0), and where we have defined ϕ: {0, 1} R, such that for each z = (z1, . . . , zl)T {0, 1}l, we have ϕ(z) l 1 i=1 δzizi+1. Then, we compute the correlation coefficients between f(X) and f 1 (X), and between g(Y ), g 1(Y ), respectively, using sample pairs in the testing set. The absolute values of both correlation coefficients are greater than 0.99, demonstrating the effectiveness of the learning algorithm. 7.2 Learning With Orthogonality Constraints We verify the feature learning with orthogonality constraints presented in Section 4.4 on the same dataset used for Section 7.1.2. Here, we consider the settings k = k = 1, i.e., we learn onedimensional feature f(x) uncorrelated to given one dimensional feature φ FX. Note that without any orthogonality constraints [cf. (102)], the optimal feature will be sinusoids with any initial phase, i.e., f 1 (x) = 2 cos(πx + θ0) for any θ0 [ π, π). Here, we consider the following two choices of φ, (x 7 x) and (x 7 x2), which are even and odd functions, respectively. Since the underlying p X is uniform on [ 1, 1], we can verify the optimal features under the two constraints are f 1 (x) = 2 cos(πx) for φ(x) = x, and f 1 (x) = 2 sin(πx) for φ(x) = x2. 1.0 0.5 0.0 0.5 1.0 x Learned Theoretical (a) φ(x) = x 1.0 0.5 0.0 0.5 1.0 x Learned Theoretical (b) φ(x) = x2 Figure 19: Learning features uncorrelated to given φ under different settings of φ. The learned results are compared with theoretical results. By maximizing the nested H-score restricted to f = φ [cf. (52)], we can learn the optimal feature f 1 , as shown in Figure 19. The learned features are in consistent with the theoretical ones, validating the effectiveness of the learning algorithm. 7.3 Learning With Side Information We design an experiment to verify the connection between our learning algorithm and the multitask classification DNN, as demonstrated in Theorem 27. In particular, we consider the discrete X, Y, S with |X| = 8, |S| = |Y| = 3, and randomly choose a joint distribution PX,S,Y on X S Y. Then, we generate N = 50 000 training samples of (X, S, Y ) triples. In our implementation, we set k = |S| 1 = 2, k = 1 and maximize the nested H-score configured by CMC [cf. (59)] on the training set. Xu and Zheng Side Info Multihead NN (a) Learned feature f(x) Side Info Multihead NN Side Info Multihead NN Side Info Multihead NN (b) Learned feature g(s, y) Figure 20: Experimental verification of the connection between learning with side information and training a multihead neural network. For comparison, we also train the multihead network shown in Figure 11, where we maximize the log-likelihood function (70) to learn the corresponding feature f and weight matrices Gs for all s S. Then, we convert the weights to g FS Y via the correspondence [cf. (69)] g(s, y) = Gs(1, y). The features learned by our algorithm (labeled as Side Info ) and the multihead neural network are shown in Figure 20, where the results are consistent. 7.4 Multimodal Learning With Missing Modalities To verify the multimodal learning algorithms presented in Section 6, we consider multimodal classification problems in two different settings. Suppose X1, X2 are multimodal data variables, and Y Y = { 1, 1} denotes the binary label to predict. In the first setting, we consider the training set with complete (X1, X2, Y ) samples. In the second setting, only the pairwise observations of (X1, X2), (X1, Y ), and (X2, Y ) are available, presented in three different datasets. In both settings, we set X1 = X2 = [ 1, 1] with PX1,X2(x1, x2) = 1 4 [1 + cos(2π(x1 x2))] . (106) We consider predicting Y based on the learned features, where some modality in X1, X2 might be missing during the prediction. 7.4.1 Learning from Complete Observations We consider the X1, X2, Y dependence specified by (106) and the conditional distribution PY |X1,X2(y|x1, x2) = 1 4 [cos(πx1) + cos(πx2) + cos(π(x1 + x2))] (107) for x1, x2 [ 1, 1] and y = 1. It can be verified that PY satisfies PY (1) = PY ( 1) = 1 2. The corresponding CDK function and dependence components [cf. (74)] are given by i X1,X2;Y (x1, x2, y) = y 2 [cos(πx1) + cos(πx2) + cos(π(x1 + x2))] (108a) [πB(i X1,X2;Y )](x1, x2, y) = y 2 [cos(πx1) + cos(πx2)] , (108b) Neural Feature Learning in Function Space [πI(i X1,X2;Y )](x1, x2, y) = y 2 cos(π(x1 + x2)). (108c) Therefore, we have rank(πB(i X1,X2;Y )) = rank(πI(i X1,X2;Y )) = 1, and the functions obtained from modal decompositions [cf. (91)] are g 1(y) = g 1(y) = y and f 1 (x1, x2) = cos(πx1) + cos(πx2), f 1 (x1, x2) = 2 cos(π(x1 + x2)). (109) (a) X1, X2 Histogram Theoretical f (b) f 1 (x1, x2) Theoretical f (c) f 1 (x1, x2) Theoretical (d) PY |X1,X2(1|x1, x2) Figure 21: Features and posterior probability learned from multimodal data X1, X2, Y , in comparison with theoretical results. In the experiment, we first generate N = 50 000 triples of (X1, X2, Y ) for training. The histogram of (X1, X2) pair is shown in Figure 21a. Then, we set k = k = 1, and learn the features f, f, g, g by maximizing the nested H-score configured by CBI [cf. (78)]. We then normalize f, f to obtain the estimated f 1 and f 1 , and compute the posterior probability PY |X1,X2(1|x1, x2) based on (83a). The results of learned features f 1 , f 1 and posterior PY |X1,X2(1|x1, x2) are shown in Figure 21, which are consistent with the theoretical results. We then consider the prediction problem with missing modality, i.e., predict label Y based on unimodal data X1 or X2. In particular, based on learned f = f(1) + f(2), we train two separate networks to operate as τ1 and τ2, then apply (83b) and (83c) to estimate the posteriors PY |X1 and PY |X2. Then, for each i = 1, 2, the MAP prediction of Y based on observed Xi = xi can be obtained by comparing PY |Xi(1|xi) with the threshold 1/2, via arg max y Y PY |Xi(y|xi) = ( 1 if PY |Xi(1|xi) > 1/2, 1 if PY |Xi(1|xi) 1/2. We plot the estimated results in Figure 22, in comparison with the threshold 1/2 and the theoretical values PY |Xi(y|x) = 1 4 cos(πx), for i = 1, 2. (110) From the figure, the estimated posteriors PY |X1, PY |X2 have consistent trends with the ground truth posteriors, and the induced Y predictions are well aligned. Xu and Zheng Learned Theoretical 1/2 (a) PY |X1(1|x1) Learned Theoretical 1/2 (b) PY |X2(1|x2) Figure 22: Label prediction from unimodal data using learned features 7.4.2 Learning from Pairwise Observations We proceed to consider the multimodal learning with only pairwise observations. Specifically, we consider the joint distribution of (X1, X2, Y ), specified by (106) and PY |X1,X2(y|x1, x2) = 1 4 [cos(πx1) + cos(πx2)] . (111) for x1, x2 [ 1, 1] and y = 1. It can be verified that PY (1) = PY ( 1) = 1 2, and the associated CDK function satisfies i X1,X2;Y (x1, x2, y) = y 2 [cos(πx1) + cos(πx2)] . (112) and i X1,X2;Y = πB(i X1,X2;Y ). Therefore, the interaction dependence component πI(i X1,X2;Y ) = 0, and the joint dependence can be learned from all pairwise samples, as we discussed in Section 6.5.1. We then construct an experiment to verify learning joint dependence from all pairwise observations. Specifically, we generate N = 50 000 triples of (X1, X2, Y ) from (106) and (111). Then, we adopt the decomposition (89) on each triple, to obtain three pairwise datasets with samples of (X1, X2), (X1, Y ), (X2, Y ), where each dataset has N sample pairs. We use these three pairwise datasets for training and learn one dimensional f FX, g FY that maximize H ( f, g). Here, we compute H ( f, g) based on the minibatches from the three pairwise datasets, according to (93). Based on learned f, g, we then compute the normalized f 1 and posterior distribution PY |X1,X2(1|x1, x2), as shown in Figure 23, where the learned results match the theoretical values. Similar to the previous setting, we consider the unimodal prediction problem, and show the estimated results in Figure 24. It is worth noting that from (108b), (112), the joint distributions PX1,X2,Y in both settings contain the same bivariate dependence component. Therefore, the theoretical results for f 1 and PY |X1, PY |X2 are the same. 8. Related Works Maximal Correlation Functions: Optimality, Learning Algorithms, and Applications The Hirschfeld Gebelein R enyi (HGR) maximal correlation (Hirschfeld, 1935; Gebelein, 1941; R enyi, 1959) provides an important connection between statistical dependence and function space. The same concept has been studied with various formulations, and often in different terminologies, Neural Feature Learning in Function Space Theoretical f (a) f 1 (x1, x2) Theoretical (b) PY |X1,X2(1|x1, x2) Figure 23: Features and posterior probability learned from pairwise datasets of (X1, X2), (X1, Y ), and (X2, Y ), in comparison with theoretical results. Learned Theoretical 1/2 (a) PY |X1(1|x1) Learned Theoretical 1/2 (b) PY |X2(1|x2) Figure 24: Label prediction from unimodal data, with features learned from pairwise datasets. such as correspondence analysis (Greenacre, 2017), functional canonical variates (Buja, 1990), and principal inertial components (du Pin Calmon et al., 2017). See, e.g., (Huang et al., 2024, Section II), for a detailed discussion. The optimality of HGR maximal correlation functions has been extensively studied in literature, e.g., as the optimal transformations in regression problems (Breiman and Friedman, 1985; Buja, 1990). In particular, if one variable is categorical and represents the class label, the maximal correlation functions also provide the best inter-class separability (Xu and Huang, 2020, Section III-B). The maximal correlation functions also play a fundamental role in the discussion of informative feature extraction and local information geometry, as we will detail next. The first practical algorithm for learning such functions is the alternating conditional expectations Xu and Zheng (ACE) algorithm (Breiman and Friedman, 1985; Buja, 1990), which learns the top dependence mode and is mostly used for processing low-dimensional data. To tackle high-dimensional data, recent developments focused on learning maximal correlation functions by DNNs. Specifically, the H-score was derived from the low-rank approximation of the canonical dependence kernel, referred as the Soft-HGR objective (Wang et al., 2019). It was also introduced as the local approximation of the log loss function (Xu et al., 2022), as we have discussed in Section 3.4. In an independent line of work (Hao Chen et al., 2021), a special form of H-score was proposed for self-supervised learning tasks, referred to as the spectral contrastive loss therein. There are also other formulations and algorithms proposed for computing the maximal correlation functions; see, e.g., Andrew et al. (2013) and Hsu et al. (2021). Compared to the low-rank approximation formulation, such approaches typically require explicit whitening procedures or computations of matrix inverses, which are less applicable for learning high-dimensional features. The sample complexity for learning maximal correlation functions have been investigated recently; see, e.g., Huang and Xu (2020); Makur et al. (2020). The maximal correlation learning, particularly the H-score maximization formulation, has been widely applied in feature extraction for multimodal data, e.g., Wang et al. (2019) and the follow-up works. For example, Xu and Huang (2020) developed maximal correlation regression (MCR), which uses maximal correlation functions to solve classification tasks. Informative Feature Extraction and Local Information Geometry Huang et al. (2024) has provided an in-depth characterization of informative features by applying information-theoretic tools, with a focus on bivariate learning problems and the local analysis regime. Under such settings, it was shown that a series of statistical and learning formulations lead to the same optimal features, characterized as the HGR maximal correlation functions. Examples of these formulations include the information bottleneck (Tishby et al., 2000) and a cooperative game. There are similar information-theoretic discussions on multivariate problems. For example, Xu and Zheng (2022) presented the information-theoretic optimality of the features defined in (66). The features introduced in (91a) were also studied by Xu and Huang (2021) in characterizing distributed hypothesis testing problems. Decomposition of Probability Distributions The decomposition of probability distributions has been studied, particularly in the context of information geometry (Amari and Nagaoka, 2000). For example, Amari (2001) established a decomposition in distribution space and also investigated the maximum entropy formulation (87), cf. (Amari, 2001, Theorem 7). The information geometry induced by classification DNNs, with softmax activation and log loss function, was investigated in Xu et al. (2018), which corresponds to the optimal features without the weak dependence assumption [cf. Proposition 14]. The learning framework we presented is built upon these existing works and the perspectives from information theory, statistical analysis, and machine learning. The key component is a unified function-space view of feature learning design, which has provided nontrivial extensions to existing works. In particular, with the function-space viewpoint and nesting technique, we have developed several operations on general multivariate statistical dependence, with the existing bivariate learning algorithm (Wang et al., 2019; Xu and Huang, 2020) being atomic operations and special cases. Our developments have also revealed the learning implications of existing theoretical developments, e.g., Amari and Nagaoka (2000) and Huang et al. (2024). Neural Feature Learning in Function Space 9. Conclusions and Discussion We have presented a framework for designing learning systems with neural feature extractors, which allows us to learn informative features and assemble them to build different inference models. Based on the feature geometry, we use feature representations as the interface, and convert learning problems to corresponding function-space operations. We then introduce the nesting technique for implementing such operations, which provides a systematic design of both feature learning and feature assembling modules. We demonstrate its applications in learning multivariate dependence by considering conditional inference and multimodal learning problems. The developed framework has provided a unified solution to general feature-centric learning problems, where we have used DNNs as building blocks to directly represent the statistical dependence. The connection between feature and statistical dependence enables principled learning algorithm designs, especially in tackling complicated multivariate dependence. Such designs also provide a learning-based computational approach for investigating the statistical dependence behind high-dimensional data with often complicated structures. 9.1 Applications and Extensions For simplicity, our presentation focused on theoretical concepts and basic settings, which can be readily applied to practical learning scenarios and extended to more complicated settings. For instance, we can establish the feature optimality in multivariate problems by extending the bivariate characterizations (Huang et al., 2024; Xu and Huang, 2020); see, e.g, the discussions in Xu and Zheng (2022). For applications, the results in Proposition 12 can be applied to classification and estimation tasks, e.g., image classification (Xu and Huang, 2020). Similarly, we can employ the results in Section 4.4 to address physical or engineering constraints induced by practical learning applications. We can also apply the learning framework to process multimodal data dependencies and learn useful features, as discussed in Section 5 and Section 6. In all such examples, we can incorporate pretrained models to reduce data requirements and computational costs. Our discussions on the dependence decomposition framework can also be extended to general settings. In particular, we can get refined dependence components by iteratively applying the decomposition (43). One example is to decompose the dependence of a random process into different dependence components, with each component representing the dependence at a certain time delay, as discussed in Xu and Zheng (2023b). More generally, the nesting technique can be extended by considering a configuration (cf. Definition 16) with subspace sequences of both FX and FY. We can also extend our discussions and analyses on supervised learning examples to other scenarios, e.g., contrastive (Chen et al., 2020) and self-supervised (Hao Chen et al., 2021) learning problems. In addition to the neural feature extractors, the feature geometry can also be applied to implicit features, e.g., the feature subspaces in kernel methods. In particular, Xu and Zheng (2023a) developed a quantitative measure of kernel choices and demonstrated a connection between kernel methods and existing feature learning approaches. 9.2 Future Directions The established framework provides abundant opportunities for further explorations in both theoretical analyses and applications. To highlight the general framework, our development has adopted simplified or idealized assumptions on learning factors, e.g., the network expressive power. By relaxing such idealized assumptions, further investigations can give a better theoretical understanding and also provide practical guidance. Preliminary examples include our discussions on the expressive Xu and Zheng power of feature extractors (cf. Section 3.3) and the sample size of training data (cf. Section 6.5.2). More in-depth characterizations on the learning behaviors, e.g., generalizability, can be built upon the spectrum decomposition nature of this learning framework, by using existing analyses on linear operators (Dunford and Schwartz, 1988) or the spectral methods for matrices (Chen et al., 2021). Another interesting direction is to investigate the interaction between feature geometry, the manifold structure of the neural feature extractor, and the geometry of data spaces (Bronstein et al., 2017, 2021). Moreover, the nested H-score and the induced optimization behaviors can be of independent interest to optimization studies, e.g., the landscape and convergence analyses. In terms of applications, the established framework can be extended to other operations on statistical dependence beyond decomposing or learning given dependence. For instance, a potential application is to generate data satisfying certain dependence constraints, e.g., independence or conditional independence, by employing generative models. The integration with generative models, as well as the learning algorithm design, is of practical interest for further studies. Acknowledgments This work was supported by the Office of Naval Research (ONR) under grant N00014-19-1-2621. Appendix A. Data Alphabets in Feature Geometry We discuss the feature geometry on general data alphabets, e.g., continuous alphabets in Appendix A.1. In Appendix A.2, we briefly summarize the vector and matrix notation conventions established for discrete alphabets, which have been extensively used in related works, e.g., Huang et al. (2024); Xu et al. (2022). A.1 General Alphabets: Regularity Conditions and Examples Our development on feature spaces can be extended to general Hilbert spaces (Young, 1988; Weidmann, 2012), where the alphabet Z and the metric distribution PZ (cf. Section 2.1.2) are extended to a measurable set and the measure (Weidmann, 2012, Example 6), respectively. One particularly important example is Z = Rd with d 1. We consider the random variable Z Z. For simplicity, we restrict to the case where Z has the probability density function p Z. Then, we define the feature space FZ and the inner product on FZ as [cf. Section 2.1.2] FZ f : Z R, Z f2(z)p Z(z)dz < , and f1, f2 Z f1(z)f2(z)p Z(z)dz for f1, f2 FZ, respectively. We can define the joint function similarly for X, Y under the corresponding regularity conditions. For instance, we consider (X, Y ) with alphabets X = Y = R and the joint probability density function p X,Y . Then, the corresponding definition of the CDK function becomes [cf. (4)] i X;Y (x, y) = p X,Y (x, y) p X(x)p Y (y) 1. (113) Note that to have i X;Y FX Y, we need to assume i X;Y 2 = ZZ [p X,Y (x, y)]2 p X(x)p Y (y) dxdy 1 < , (114) where the norm is defined with respect to the metric distribution p Xp Y . Neural Feature Learning in Function Space Remark 37 The results can be further generalized to the case where density functions do not necessarily exist. See, e.g., Lancaster (1958) and (Buja, 1990, Proposition 3.1) for detailed discussions. In particular, we have the following result for bivariate normal variables, which has been extensively discussed in literature; see, e.g., Lancaster (1958) and Huang et al. (2024). Example 2 For bivariate normal variables X Y with |ρ| < 1, we have i X;Y (x, y) = i! Hei(x) Hei(y). (115) where for each i 1, Hei denotes the i-th (probabilist s) Hermite polynomial, defined as Hei(x) ( 1)i e x2 Then, each i-th mode of i X;Y can be written in the standard form as ζi(i X;Y ) = σi(f i g i ), with σi = |ρ|i, f i = 1 i! Hei, g i = [sgn(ρ)]i i! Hei. (116) Finally, i X;Y 2 = X i 1 σ2 i = ρ2 The equation (115) is referred to as Mehler s identity or Mehler s decomposition (Mehler, 1866). In some scenarios, X and Y can be of different types, e.g., X is continuous, and Y is discrete. A classical example is mixture models, where X and Y corresponds to the observation variable and the latent categorical variable, respectively. Specifically, we introduce the following Gaussian mixture example, which is a multidimensional extension of an example discussed in Buja (1990). Example 3 We consider the probability model with Y = {1, 1}, PY 1 2, and X|Y = y N(y µ, Σ) for y Y, where µ Rd and Σ is a positive definite matrix of order d, for some d 1. Then, we have rank(i X;Y ) = 1 with the standard form i X;Y = σ1(f 1 g 1), where g 1(y) = y and f 1 (x) = c tanh(µTΣ 1x) for some c R. Proof We have i X;Y (x, y) = p X|Y (x|y) 2(x yµ)TΣ 1(x yµ)) 2(x + µ)TΣ 1(x + µ)) + exp( 1 2(x µ)TΣ 1(x µ)) 1 = 2 exp(yµTΣ 1x) exp(µTΣ 1x) + exp( µTΣ 1x) 1 = y exp(µTΣ 1x) exp( µTΣ 1x) exp(µTΣ 1x) + exp( µTΣ 1x) = y tanh(µTΣ 1x), which implies that rank(i X;Y ) = 1, f 1 (x) tanh(µTΣ 1x), and g 1(y) = y. From Example 3, the maximal correlation function f 1 can be represented by a d 1 linear layer activated by tanh( ), with zero bias and weight Σ 1µ. Xu and Zheng A.2 Finite Alphabets: Information Vector and Canonical Dependence Matrix For finite data alphabets, it can be convenient to introduce the vector and matrix representations of features. To begin, we assume the random variable Z takes finite many possible values, i.e., |Z| < , then the resulting feature space FZ is a finite-dimensional vector space. It is sometimes more convenient to represent features using vector and matrix notations. Specifically, each f FZ can be equivalently expressed as one vector in R|Z| as follows. Suppose RZ is the metric distribution, then we can construct an orthonormal basis n b(z) Z : z Z o of FZ, where b(z) Z (z ) δzz p RZ(z) for all z, z Z. (117) We refer to this basis as the canonical basis of FZ. For all f FZ, we can represent f as a linear combination of these basis functions, i.e., z Z ξ(z ) b(z ) Z , where the coefficient ξ(z) for each z Z is given by ξ(z) = D f, b(z) Z E = f(z) p RZ(z). (118) In particular, when f is the density ratio ℓPZ for some PZ FZ, the corresponding coefficient ξ(z) for each z Z is ξ(z) = D ℓPZ, b(z) Z E = PZ(z) RZ(z) p RZ(z) . (119) This establishes a one-to-one correspondence between ℓPZ (or PZ) and the vector ξ [ξ(z), z Z]T R|Z|, which is referred to as the information vector associated with ℓPZ (or PZ). Similarly, for X and Y with |X| < , |Y| < , we can represent the CDK function i X;Y FX Y as an |X| |Y| matrix BX;Y : BX;Y (x; y) PX,Y (x, y) PX(x)PY (y) p PY (y) , (120) which is referred to as the canonical dependence matrix (CDM) of X and Y . With the metric distribution PXPY on FX Y, each BX;Y (x; y) is the coefficient associated with the basis function b(x,y) X,Y [cf. (117)]. In addition, we have rank( BX;Y ) = rank(i X;Y ). Suppose σi(f i g i ) = ζi(i X;Y ), for each 1 i rank(i X;Y ), then σi is the i-th singular vector of BX;Y , and the corresponding i-th left and right singular vector pair are given by ξX i R|X|, ξY i R|Y|, with [cf. (118)] ξX i (x) = p PX(x) f i (x), ξY i (y) = p PY (y) g i (y). (121) Therefore, for small-scale discrete data, we can use the connection (121) to obtain the modal decomposition by solving the SVD of the corresponding CDM. Neural Feature Learning in Function Space Appendix B. Characterization of Common Optimal Solutions We consider optimization problems defined on a common domain D. Specifically, given k functions hi : D R, i = 1, . . . , k, let us consider optimization problems maximize u D hi(u), i = 1, . . . , k. (122) For each i = 1, . . . , k, we denote the optimal solution and optimal value for i-th problem by D i arg max u D hi(u), t i max u D hi(u), (123) respectively, and suppose D i = , i = 1, . . . , k. Then, the set D k i=1D i represents the collection of common optimal solutions for all k optimization problems (122). When such common solutions exist, i.e., D is nonempty, we can obtain D by a single optimization program, by using an objective that aggregates original k objectives. We formalize the result as follows. Proposition 38 If D = , we have D = arg max u D Γ(h1(u), h2(u), . . . , hk(u)) for every Γ: Rk R that is strictly increasing in each argument. Proof Let D arg maxu D h(u) with h(u) Γ(h1(u), h2(u), . . . , hk(u)). Then the proposition is equivalent to D = D . We then establish D D and D D , respectively. To prove D D , take any u D . Then, for all u D, we have hi(u) hi(u ), i = 1, . . . , k, which implies that h(u) = Γ(h1(u), . . . , hk(u)) Γ(h1(u ), . . . , hk(u )) = h(u ). Therefore, we have u D . Since u can be arbitrarily chosen from D , we have D D . We then establish D D , which is equivalent to (D \ D ) (D \ D ). (124) Note that (124) is trivially true if D\D = . Otherwise, take any u D\D . Then, we have hi(u ) hi(u ) for all i [k], and the strict inequality holds for at least one i [k]. This implies that h(u ) = Γ(h1(u ), . . . , hk(u )) < Γ(h1(u ), . . . , hk(u )) = h(u ). Hence, u / D , and thus u (D \ D ), which establishes (124). To apply Proposition 38, the first step is to test the existence of common optimal solutions. A naive test is to solve all optimization problems (122) and then check the definition, which can be difficult in practice. Instead, we can consider a related multilevel optimization problem as follows. Let D0 D, and for each i = 1, . . . , k as, we define Di and ti R as Di arg max u Di 1 hi(u), ti max u Di 1 hi(u). (125) Note that for each i, the Di solved by level i optimization problem gives the elements in Di that maximize hi, and ti denotes the corresponding optimal value. Therefore, Dk can be obtained by sequentially solving k optimization problems as defined in (125). Then, the following result provides an approach to test the existence of common optimal solutions. Xu and Zheng Proposition 39 The following three statements are equivalent: 1. The optimization problems (122) have common optimal solutions, i.e., D = ; 2. ti = t i for all i = 1, . . . , k; 3. Di 1 D i = for all i = 1, . . . , k. In addition, if one of these statements holds, then we have Dk = D . Proof We establish the equivalence of the statements 1 to 3, by showing that 1 = 2 , 2 = 3 , and 3 = 1 . 1 = 2 Suppose 1 holds, and take any u D = k i=1D i . We then establish 2 by induction. First, note that u D0. For the induction step, we can show that for each i = 1, . . . , k, if u Di 1, then u Di and t i = ti. Indeed, we have t i ti = max u Di 1 hi(u) hi(u ) = t i , where the first inequality follows from the fact that D = D0 D1 Dk, where the second inequality follows from the inductive assumption u Di 1, and where the last equality follows from that u D D i . 2 = 3 For each i = 2, . . . , k, ti = t i implies that there exists some ui Di 1, such that hi(ui) = ti = t i = maxu D hi(u), and thus ui D i . Therefore, ui Di 1 D i , which establishes 3 . 3 = 1 For each i = 2, . . . , k, from Di 1 D i = and the definitions (123) and (125), we have Di = Di 1 D i . It can also be verified that Di = Di 1 D i holds for i = 1. Therefore, we obtain Dk = Dk 1 D k = Dk 2 D k 1 D k = = D0 = D D = D . (126) This implies that D = Dk 1 D k = . Finally, from (126) we know that statement 3 implies Dk = D . Since all three statements are equivalent, each statement implies Dk = D . Appendix C. Proofs C.1 Proof of Proposition 5 Let ˆγ Π (γ; GX GY). We first consider the second equality of (18), i.e., ζ k(ˆγ) = arg min γ : γ =f g, f G k X, g G k Y For each k rank(ˆγ), consider γ = f g with f G k X, g G k Y . Then, we have γ γ 2 = γ ˆγ + ˆγ γ 2 = γ ˆγ 2 + ˆγ γ 2 γ ˆγ 2 + rk(ˆγ) 2, (128) Neural Feature Learning in Function Space where to obtain the second equality we have used the orthogonality principle with fact that (γ ˆγ) GX GY and (ˆγ γ ) GX GY. Note that the inequality in (128) holds with equality if and only if γ = ζ k(ˆγ), which establishes (127). We then establish ζk(γ|GX, GY) = ζk(ˆγ) by induction. To begin, set k = 1 in (127), and the right hand side becomes ζ(γ|GX, GY), which implies ζ(γ|GX, GY) = ζ(ˆγ), i.e., ζ1(γ|GX, GY) = ζ1(ˆγ). As the inductive hypothesis, suppose we have ζi(γ|GX, GY) = ζi(ˆγ), i = 1, . . . , m. (129) From (17), we have ζm+1(γ|GX, GY) = ζ i=1 ζi(γ|GX, GY) i=1 ζi(ˆγ); GX GY i=1 Π (ζi(ˆγ); GX GY) = ζm+1(ˆγ), (133) where (130) (131) follow from the inductive assumption (129), where (132) follows from the linearity of projection operator. To obtain the first equality of (132), we have again applied the assumption (129): for i = 1, . . . , m, ζi(ˆγ) = ζi(γ|GX, GY) GX GY. Finally, ζ k(γ|GX, GY) = ζ k(ˆγ) can be readily obtained by definition. C.2 Proof of Proposition 7 It is easy to verify that cov( ˆf i , ˆg i ) = i X;Y , ˆf i ˆg i = ζi(i X;Y ) = ˆσ i , where i X;Y = Π (i X;Y ; GX GY). From Fact 4, we have ( ˆf i , ˆg i ) = arg max(f,g) ˆD i i X;Y , f g for each i 1, where we have defined each ˆD i as ˆD i {(f, g) FX FY : f = g = 1 and f, ˆf j = g, ˆg j = 0 for all j [i 1]}. Therefore, for each i and (f, g) ˆDi ˆD i, we have cov(fi, gi) = i X;Y , fi gi = i X;Y , fi gi ˆσ i = cov( ˆf i , ˆg i ), where the second equality follows from that fi gi GX GY. Hence, we obtain ( ˆf i , ˆg i ) = arg max(f,g) ˆDi cov(f, g). C.3 Proof of Proposition 12 The result of i X;Y directly follows from Property 1. In addition, f T(x)g(Y ) = i X;Y (x, y) = PX,Y (x, y) PX(x)PY (y) PX(x)PY (y) , for all (x, y) X Y, Xu and Zheng which implies PY |X(y|x) = PY (y) 1 + f T(x)g(y) , i.e., (25). Therefore, for all ψ F d Y , we have E [ψ(Y )|X = x] = X y Y PY |X(y|x)ψ(y) = X y Y PY (y)(1 + f T(x)g(y))ψ(y) y Y PY (y)ψ(y) + X y Y PY (y)(ψ(y)g T(y))f(x) = E [ψ(Y )] + Λψ,gf(x), which gives (26). C.4 Proof of Proposition 13 E [ψ(Y )|X = x] = X y Y PY |X(y|x)ψ(y) y Y PY (y)(1 + i X;Y (x, y))ψ(y) y Y PY (y)[1 + ζ k(i X;Y )](x, y)ψ(y) + X y Y PY (y) X i>k σ i f i (x)g i (y)ψ(y) y Y PY (y)(1 + f T(x)g(y))ψ(y) y Y PY (y)ψ(y) + X y Y PY (y)(ψ(y)g T(y))f(x) = E [ψ(Y )] + Λψ,gf(x), where the fourth equality follows from the fact that X y Y PY (y) X i>k σ i f i (x)g i (y)ψ(y) = X i>k σ i f i (x)E [g i (Y )ψ(Y )] = 0, due to ψj {1Y, g 1, . . . , g k} for each j [d]. C.5 Proof of Property 3 The property is equivalent to L(f, g) = L(f + u, g + v) for all u, v Rk. Hence, it suffices to prove that L(f + u, g + v) L(f, g), for all u, v Rk. (134) Note that for all b FY, since (f(x) + u)T (g(y) + v) + b(y) = f(x) g(y) + v Tf(x) + u Tg(y) + b(y) + u Tv, (135) we have [cf. (31)] P (f+u,g+v,b) Y |X = P (f,g,b+u Tg) Y |X , which implies that L(f +u, g+v, b) = L(f, g, b+u Tg). Therefore, we obtain L(f + u, g + v) = max b FY L(f + u, g + v, b) max b FY L(f, g, b + u Tg) L(f, g). (136) Neural Feature Learning in Function Space C.6 Proof of Proposition 14 We first prove a useful lemma. Lemma 40 Suppose p > 0, q > 0, p + q = 1, then we have log p exp u min u2, u0|u| , for all u R, where u0 ln 2 3 min{p, q}. Proof [Proof of Lemma 40] Let pmin min{p, q}. h(u) log p exp p 1u + q exp q 1u . Then, we have h (u) = exp p 1u exp q 1u p exp (p 1u) + q exp ( q 1u) h (u) = p exp p 1u + q exp q 1u 2 exp p 1u + exp q 1u 2 exp p 1u exp q 1u 2 [p exp (p 1u) + q exp ( q 1u)]2 = 4 exp (p 1 q 1) u [p exp (p 1u) + q exp ( q 1u)]2 . Moreover, for all |u| u0, we have exp (p 1 q 1) u exp |p 1 q 1| u0 exp u0 p exp p 1u + q exp q 1u exp u0 As a result, for all |u| u0, we have h (u) 4 exp 3u0 Therefore, h (u0) h (0)+2 (u0 0) = 2u0 > u0, and similarly, h ( u0) = h (0) h ( u0) 2u0, i.e., h ( u0) 2u0 u0. Moreover, for all |u| u0, h(u) h(0) + h (0) u + 1 2u2 2 = u2. Therefore, for all u > u0, h (u) h (u0) > u0, which implies that h(u) h(u0) + u0(u u0) u2 0 + u0u u2 0 = u0u. Similarly, we have h(u) u0u for all u < u0. Proceeding to the proof of Proposition 14, we consider zero-mean k-dimensional f, g. Without loss of generality, we assume b FY satisfies E [b(Y )] = H(Y ). Then, let a FY be a(y) b(y) log PY (y), and define γ FX Y as γ(x, y) f(x) g(y) + a(y). Note that since exp(f(x) g(y) + b(y)) = PY (y) exp(f(x) g(y) + a(y)) = PY (y) exp(γ(x, y)), P (f,g,b) Y |X (y|x) = exp(f(x) g(y) + b(y)) P y Y exp(f(x) g(y ) + b(y )) = PY (y) exp(γ(x, y)) P y Y PY (y ) exp(γ(x, y )). Xu and Zheng = E( ˆ X, ˆY ) PXPY h (i X;Y ( ˆX, ˆY ) + 1) log P (f,g,b) Y |X ( ˆY | ˆX) i = E( ˆ X, ˆY ) PXPY (i X;Y ( ˆX, ˆY ) + 1) log PY ( ˆY ) + γ( ˆX, ˆY ) log X y Y PY (y ) exp(γ( ˆX, y )) = H(Y ) + i X;Y , γ E ˆ X PX y Y PY (y ) exp(γ( ˆX, y )) As a result, for all (f, g, b) that satisfies L(f, g, b) H(Y ), we have i X;Y , γ E ˆ X PX y Y PY (y ) exp(γ( ˆX, y )) In addition, note that for all x X and y Y, we have y Y PY (y ) exp(γ(x, y )) = PY (y) exp(γ(x, y)) + (1 PY (y)) X PY (y ) 1 PY (y) exp(γ(x, y )) PY (y) exp(γ(x, y)) + (1 PY (y)) exp PY (y) 1 PY (y) γ(x, y) . where the inequality follows from Jensen s inequality and P y Y PY (y)γ(x, y) = 0. Let us define q X min x X PX(x) > 0, q Y min y Y PY (y) > 0. Then, from Lemma 40 we have y Y PY (y ) exp(γ(x, y )) log PY (y) exp(γ(x, y)) + (1 PY (y)) exp PY (y) 1 PY (y) γ(x, y) min (PY (y)γ(x, y))2, ln 2 3 q Y |PY (y)γ(x, y)| ln 2 q2 Y 3 min (γ(x, y))2, |γ(x, y)| , which implies that y Y PY (y ) exp(γ( ˆX, y )) PX(x) log X y Y PY (y ) exp(γ(x, y )) ln 2 q Xq2 Y 3 min (γ(x, y))2, |γ(x, y)| . (139) On the other hand, since i X;Y = O(ϵ), there exists a constant C > 0 such that i X;Y Cϵ. Therefore, i X;Y , γ i X;Y γ γmax ϵ, (140) Neural Feature Learning in Function Space where γmax maxx X,y Y |γ(x, y)|. Hence, by combining (138), (139) and (140), we obtain ln 2 q Xq2 Y 3 min γ2 max, γmax C γmax ϵ, (141) where we have taken (x, y) = arg maxx ,y |γ(x , y )| in (139). From (141), if ϵ < ln 2 3 q Xq2 Y 3C we have γmax < ϵ. This implies that |γ(x, y)| < ϵ, for all x X, y Y, (142) and we obtain γ < ϵ. In addition, note that since γ 2 = f g + a 2 = f g 2 + a 2, we obtain f g = O(ϵ). From (142), we have y Y PY (y ) exp(γ(x, y )) = X y Y PY (y ) 1 + γ(x, y ) + (γ(x, y ))2 y Y PY (y )(γ(x, y ))2 + o(ϵ2) h γ(x, ˆY ) i2 + o(ϵ2). y Y PY (y ) exp(γ( ˆX, y )) 2 E( ˆ X, ˆY ) PXPY h γ( ˆX, ˆY ) i2 + o(ϵ2) = 1 2 γ 2 + o(ϵ2), and the likelihood (137) becomes L(f, g, b) = H(Y ) + i X;Y , γ E ˆ X PX y Y PY (y ) exp(γ( ˆX, y )) = H(Y ) + i X;Y , γ 1 2 γ 2 + o(ϵ2) 2 i X;Y 2 i X;Y γ 2 H(Y ) + o(ϵ2) 2 i X;Y 2 i X;Y f g a 2 H(Y ) + o(ϵ2) 2 i X;Y 2 i X;Y f g 2 a 2 H(Y ) + o(ϵ2), (143) where the last equality follows from the fact that i X;Y f g a 2 = i X;Y f g 2 + a 2, due to the orthogonality (i X;Y f g) FY a. Finally, from (143), for given f, g, L(f, g, b) is maximized when a = o(ϵ). Therefore, we have L(f, g) = max b FY L(f, g, b) = 1 2 i X;Y 2 i X;Y f g 2 H(Y ) + o(ϵ2), which gives (33). Xu and Zheng C.7 Proof of Theorem 19 Throughout this proof, we consider L1( f, g, f, g) i X;Y f g 2, L2( f, g, f, g) i X;Y f g f g 2 defined on the domain D {( f, g, f, g): f G k X, g F k Y , f F k X, g F k Y }, and let D 1 arg min ( f, g,f,g) D L1( f, g, f, g), D 2 arg min ( f, g,f,g) D L2( f, g, f, g). We also define γ, γ FX Y, as γ Π (i X;Y ; GX FY) , γ i X;Y γ = Π (i X;Y ; (FX GX) FY) . (144) Then, it suffices to establish that D D 1 D 2 = {( f, g, f, g) D: f g = γ, f g = ζ k(γ)}. (145) To see this, note that the common solution D (145) coincides with the optimal solution (47) to be established. From Proposition 38, since D = , we have D = arg min ( f, g,f,g) D L1( f, g, f, g) + L2( f, g, f, g). (146) Furthermore, from the definition of the H-score (cf. Definition 10), we have = H ( f, g) + H f f 2 i X;Y 2 L1( f, g, f, g) + 1 2 i X;Y 2 L2( f, g, f, g) = i X;Y 2 1 2 L1( f, g, f, g) + L2( f, g, f, g) , which implies that arg max ( f, g,f,g) D H f f = arg min ( f, g,f,g) D L1( f, g, f, g) + L2( f, g, f, g) = D . It remains only to establish (145). From Proposition 39, it suffices to establish that t2 = t 2 and (cf. (145)) D2 = {( f, g, f, g) D: f g = γ, f g = γ}, (147) where we have defined t 2 min ( f, g,f,g) D L2( f, g, f, g), t2 min ( f, g,f,g) D 1 L2( f, g, f, g), D2 arg min ( f, g,f,g) D 1 L2( f, g, f, g). To this end, let us define h G k X and h (FX GX)k as h Π (f; GX), and h f h = Π (f; FX GX). Then, we have L2( f, g, f, g) = i X;Y f g f g 2 (148) Neural Feature Learning in Function Space = γ + γ f g ( h + h) g 2 (149) = γ f g h g 2 + γ h g 2 (150) γ h g 2 (151) rk(γ) 2 (152) where (150) follows from the orthogonality principle, since γ f g h g GX FY and (γ h g) GX FY. Moreover, it is easily verified that the lower bound (152) is tight: all inequalities hold with equality for ( f, g, f, g) satisfying (47). Therefore, we have t 2 = min ( f, g,f,g) D L2( f, g, f, g) = rk(γ) 2 (153) On the other hand, since k rank( γ), from Proposition 5, we have D 1 = {( f, g, f, g): f g = γ}. Hence, for all ( f, g, f, g) D 1, we have L2( f, g, f, g) = i X;Y f g f g 2 = i X;Y γ f g 2 = γ f g 2 rk(γ) 2, where the inequality holds with equality if and only if f g = ζ k(γ). As a result, D2 = arg min( f, g,f,g) D 1 L2( f, g, f, g) is given by (147), and we have t2 = min ( f, g,f,g) D 1 L2( f, g, f, g) = rk(γ) 2 = t 2. C.8 Proof of Theorem 20 To begin, we have i=1 H ( f[i], g[i]) + i=1 H f f[i] i=1 H ( f[i], g[i]) + k 1 H ( f, g) + H f f[i] Note that for f f dom(C π) and each i [ k], H ( f[i], g[i]) is maximized if and only if f[i] g[i] = ζ i(i X;Y |GX, FY). (154) In addition, from Theorem 19, for each i [k], the term k 1 H ( f, g) + H f f[i] is maximized if and only if f g = Π (i X;Y ; GX FY) , f[i] g[i] = ζ i(Π (i X;Y ; (FX GX) FY)). (155) Note that (48) is the common solution of (154) and (155). Hence, the proof is completed by applying Proposition 38. Xu and Zheng C.9 Proof of Proposition 22 The relation i X;S = ℓP M X,S,Y can be directly verified from definition. To establish πM(i X;S,Y ) = Π (i X;S,Y ; FX S) = i X;S, from Fact 2, it suffices to show that (i X;S,Y i X;S) FX S. To this end, note that i X;S,Y (x, s, y) i X;S(x, s) = PX,S,Y (x, s, y) P M X,S,Y (x, s, y) RX,S,Y (x, s, y) = PX,S,Y (x, s, y) PX|S(x|s)PS(s)PY |S(y|s) RX,S,Y (x, s, y) . Therefore, for all f FX S, we have i X;S,Y i X;S, f = X x X,s S,y Y PX,S,Y (x, s, y)f(x, s) X x X,s S,y Y PX|S(x|s)PS(s)PY |S(y|s) f(x, s) = EPX,S [f(X, S)] EPX,S [f(X, S)] C.10 Proof of Proposition 23 Let γ = i(Q) X;S,Y , then the statement is equivalent to γ FX S QX,S,Y = QX|SQSQY |S. = If γ FX S, then we have QX,S,Y (x, s, y) = QX(x)QS,Y (s, y) (1 + γ(x, s)) = PX(x)PS,Y (s, y) (1 + γ(x, s)) = PX(x)PS(s) (1 + γ(x, s)) PY |S(y|s), for all x, s, y. (156) QX,S(x, s) = X y Y QX,S,Y (x, s, y) = PX(x)PS(s) (1 + γ(x, s)) . (157) From (156) and (157), we obtain QX,S,Y (x, s, y) = QX,S(x, s)PY |S(y|s) = QX,S(x, s)QY |S(y|s) = QX|S(x|s)QS(s)QY |S(y|s). (158) = It suffices to note that i(Q) X;S,Y (x, s, y) = QX,S,Y (x, s, y) QX(x)QS,Y (s, y) 1 = QX,S(x, s)QY |S(y|s) QX(x)QS(s)QY |S(y|s) 1 = QX,S(x, s) QX(x)QS(s) 1. Neural Feature Learning in Function Space C.11 Proof of Proposition 25 The results on i X;S and i X;Y |S directly follows from Property 1. To establish (63), note that from Proposition 22, we have PX,S,Y (x, s, y) PX|S(x|s)PS(s)PY |S(y|s) = PX(x)PS,Y (s, y) i X;Y |S(x, s, y), which implies that PY |X,S(y|x, s) = PY |S(y|s) 1 + PX(x)PS(s) PX,S(x, s) i X;Y |S(x, s, y) = PY |S(y|s) 1 + 1 1 + i X;S(x, s) i X;Y |S(x, s, y) (159) = PY |S(y|s) 1 + f T(x)g(s, y) 1 + f T(x) g(s) which further implies (64). C.12 Proof of Proposition 26 When X and (S, Y ) are ϵ-dependent, we have i X;S,Y = O(ϵ), and thus i X;S(x, s) = O(ϵ), i X;Y |S(x, s, y) = O(ϵ). Therefore, we have 1 1 + i X;S(x, s) i X;Y |S(x, s, y) = (1 i X;S(x, s)) i X;Y |S(x, s, y) + o(ϵ) = i X;Y |S(x, s, y) + o(ϵ). Then, it follows from (159) that PY |X,S(y|x, s) = PY |S(y|s) 1 + i X;Y |S(x, s, y) + o(ϵ). Finally, the proof is completed by using the decomposition (66). C.13 Proof of Theorem 27 For each s, x, y, let us define R(s) X,Y (x, y) PX|S=s(x)PY |S=s(y) and i(s) X;Y (x, y) PX,Y |S=s(x, y) PX|S=s(x)PY |S=s(y) PX|S=s(x)PY |S=s(y) . (161) In addition, we define µs Rk as µs E [f(X)|S = s] for each s S. Also, let β FS Y be defined as β(s, y) µT s g(s, y). Then, for each s S, from (72) and Proposition 14 we have |f T(x)g(s, y) β(s, y)| = |(f(x) µs)Tg(s, y)| = O(ϵ), for all x, s, y, (162) which implies that f T(x)g(s, y) = O(ϵ), for all x, s, y. (163) Xu and Zheng In addition, from Property 3, we have L(s) S (f, g(s)) = L(s) S (f µs, g(s)) 2 i(s) X;Y 2 R(s) X,Y i(s) X;Y (f µs) g(s) 2 R(s) X,Y H(Y |S = s) + o(ϵ2) = i(s) X;Y , f g(s) µT s g(s) 2 f g(s) µT s g(s) 2 R(s) X,Y H(Y |S = s) + o(ϵ2), (164) where , R and R denote the inner product and corresponding induced norm on the function space, with respect to the metric distribution R. For the first two terms in (164), we compute their expectations over PS. For the first term, X s S PS(s) i(s) X;Y , f g(s) µT s g(s) x X,s S,y Y PS(s)R(s) X,Y (x, y)i(s) X;Y (x, y) f T(x)g(s)(y) µT s g(s)(y) x X,s S,y Y PX(x)PS,Y (s, y)i X;Y |S(x, s, y) f T(x)g(s, y) β(s, y) = i X;Y |S, f g i X;Y |S, β = i X;Y |S, f g (165) where to obtain the second equality we have used the facts that i(s) X;Y (x, y) = PX,Y |S=s(x, y) PX|S=s(x)PY |S=s(y) PX|S=s(x)PY |S=s(y) = PX,S,Y (x, s, y) P M X,S,Y (x, s, y) PX(x)PS,Y (s, y) PX(x) PX|S=s(x) = i X;Y |S(x, s, y) 1 1 + i X;S(x, s) (166) PS(s)R(s) X,Y (x, y) = P M X,S,Y (x, s, y) = PX(x)PS,Y (s, y) (1 + i X;S(x, s)), (167) and where to obtain (165) we have used the fact that i X;Y |S FS Y β. For the second term of (164), we have X s S PS(s) f g(s) µT s g(s) 2 R(s) X,Y x X,s S,y Y P M X,S,Y (x, s, y) h (f(x) µs)Tg(s, y) i2 (168) x X,s S,y Y PX(x)PS,Y (s, y) h (f(x) µs)Tg(s, y) i2 + o(ϵ2) (169) x X,s S,y Y PX(x)PS,Y (s, y) f T(x)g(s, y) β(s, y) 2 + o(ϵ2) (170) Neural Feature Learning in Function Space = f g β 2 + o(ϵ2) (171) = f g 2 + β 2 + o(ϵ2), (172) where to obtain the second equality we have used (162), (167), and the fact that i X;S = O(ϵ). Furthermore, we can show that β = o(ϵ). To see this, note that β(s, y) = µT s g(s, y) = X x X PX|S(x|s)f T(x)g(s, y) x X PX(x) [1 + i X;S(x, s)] f T(x)g(s, y) x X PX(x) i X;S(x, s) f T(x)g(s, y). Then, from i X;S = O(ϵ) and (163), we obtain |β(s, y)| = O(ϵ2) and β = O(ϵ2) = o(ϵ). Therefore, we can refine (172) as X s S PS(s) f g(s) µT s g(s) 2 R(s) X,Y = f g 2 + o(ϵ2). (173) Combining (164), (165), and (173), we have LS(f, g) = X s S PS(s) L(s) S (f, g(s)) = i X;Y |S, f g 1 2 f g 2 H(Y |S) + o(ϵ2) 2 i X;Y |S 2 i X;Y |S f g 2 H(Y |S) + o(ϵ2), which is maximized if and only if f g = ζ k(i X;Y |S) + o(ϵ). C.14 Proof of Proposition 28 Given any h FX1 Y + FX2 Y, we can represent h = h(1) + h(2) with h(i) FXi Y, i = 1, 2, i.e., h(x1, x2, y) = h(1)(x1, y) + h(2)(x2, y). Then, for any QX1,X2,Y QB, we have i(Q) X1,X2;Y , h(i) = X x1 X1,x2 X2,y Y PX1,X2(x1, x2)PY (y)i(Q) X1,X2;Y (x1, x2, y) h(i)(xi, y) x1 X1,x2 X2,y Y [QX1,X2,Y (x1, x2, y) PX1,X2(x1, x2)PY (y)] h(i)(xi, y) xi Xi,y Y [PXi,Y (xi, y) PXi(xi)PY (y)] h(i)(xi, y) = i Xi;Y , h(i) for i = 1, 2. As a result, πB(i(Q) X1,X2;Y ), h = i(Q) X1,X2;Y πI(i(Q) X1,X2;Y ), h = i(Q) X1,X2;Y , h = i(Q) X1,X2;Y , h(1) + i(Q) X1,X2;Y , h(2) i X1;Y , h(1) + i X2;Y , h(2) , where the second equality follows from the fact that πI(i(Q) X1,X2;Y ), h = 0. Hence, we obtain πB(i(Q) X1,X2;Y ) πB( ℓPX1,X2,Y ), h = 0 for all h FX1 Y + FX2 Y, which implies that πB(i(Q) X1,X2;Y ) πB( ℓPX1,X2,Y ) = 0. Xu and Zheng C.15 Proof of Proposition 30 From the definition, we have PX1,X2,Y (x1, x2, y) = PX1,X2(x1, x2)PY (y) (1 + [πB(i X1,X2;Y )](x1, x2, y) + [πI(i X1,X2;Y )](x1, x2, y)) = PX1,X2(x1, x2)PY (y) 1 + f T(x1, x2) g(y) + f T(x1, x2)g(y) , which implies (83a). To obtain (83b), note that from (174), we have PX1,Y (x1, y) = X x2 X2 P B X1,X2,Y (x1, x2, y) = PX1(x1)PY (y) 1 + E f T(x1, X2) g(y)|X1 = x1 = PX1(x1)PY (y) 1 + f(1)(x1) + [τ1( f(2))](x1) T g(y) , where to obtain the last equality we have used the fact that τ1( f) = f(1) + τ1( f(2)). Similarly, we can obtain (83b). Finally, (84) can be readily obtained from (83). C.16 Proof of Proposition 31 Note that since i X1,X2;Y FX1 X2 Y, we have πB(i X1,X2;Y ) FX1 X2 Y, which implies that X (x1,x2,y) X1 X2 Y PX1,X2(x1, x2)PY (y) [πB(i X1,X2;Y )](x1, x2, y) = πB(i X1,X2;Y ), 1 = 0. Therefore, it follows from the definition of P B X1,X2,Y that X (x1,x2,y) X1 X2 Y P B X1,X2,Y (x1, x2, y) = 1. Similarly, we have X (x1,x2,y) X1 X2 Y P I X1,X2,Y (x1, x2, y) = 1. Moreover, from (85), we have P B X1,X2,Y (x1, x2, y) 0, P I X1,X2,Y (x1, x2, y) 0 for all (x1, x2, y). Therefore, we obtain P B X1,X2,Y , P I X1,X2,Y PX1 X2 Y. Since i X1,X2;Y FX1 X2, we have πB(i X1,X2;Y ) FX1 X2. Therefore, for any (ˆx1, ˆx2) X1 X2, let f(x1, x2) = δx1ˆx1δx2ˆx2, then we have X y Y [πB(i X1,X2;Y )](ˆx1, ˆx2, y) = X (x1,x2,y) X1 X2 Y [πB(i X1,X2;Y )](x1, x2, y) δx1ˆx1δx2ˆx2 = πB(i X1,X2;Y ), f = 0. This implies that P B X1,X2 = PX1,X2. Similarly, we have P I X1,X2 = PX1,X2. Finally, note that since PX1,X2,Y (x1, x2, y) = PX1,X2(x1, x2)PY (y) [1 + [πB(i X1,X2;Y )](x1, x2, y) + [πI(i X1,X2;Y )](x1, x2, y)] , PX1,Y (x1, y) P B X1,Y (x1, y) = P I X1,Y (x1, y) PX1(x1)PY (y) x 2 X2 PX1,X2(x1, x 2)PY (y)[πI(i X1,X2;Y )](x1, x 2, y) Neural Feature Learning in Function Space To obtain the last equality, note that for any ˆx1, ˆy X1 Y, let us define γ FX1 Y as γ(x1, y) = δx1ˆx1δyˆy. Then, from πI(i X1,X2;Y ) FX1 Y, we have 0 = πI(i X1,X2;Y ), γ = X x 2 X2 PX1,X2(ˆx1, x 2)PY (ˆy)[πI(i X1,X2;Y )](ˆx1, x 2, ˆy). (174) Similarly, we can show that PX2,Y (x2, y) P B X2,Y (x2, y) = P I X2,Y (x2, y) PX2(x2)PY (y) = 0. C.17 Proof of Proposition 32 Note that since H(QX1,X2,Y ) = H(PX1,X2) + H(PY ) IQ(X1, X2; Y ), P ent X1,X2,Y = arg min QX1,X2,Y QB IQ(X1, X2; Y ), (175) where IQ(X1, X2; Y ) denotes the mutual information between (X1, X2) and Y with respect to QX1,X2,Y . Specifically, when we take PX1,X2,Y as the QX1,X2,Y , we have IP (X1, X2; Y ) = 1 2 i X1,X2;Y 2 + o(ϵ2). Therefore, to solve (175), it suffices to consider QX1,X2,Y with IQ(X1, X2; Y ) < IP (X1, X2; Y ) = O(ϵ2). Since QX1,X2QY = PX1,X2PY relint(PX1 X2 Y), we have ℓQX1,X2,Y 2 = O(ϵ2) [cf. (Sason and Verd u, 2016, Eq. (338))]. Therefore, IQ(X1, X2; Y ) = 1 2 ℓQX1,X2,Y 2 + o(ϵ2) 2 πB( ℓQX1,X2,Y ) 2 + 1 2 πI( ℓQX1,X2,Y ) 2 + o(ϵ2) 2 πB(i X1,X2;Y ) 2 + 1 2 πI( ℓQX1,X2,Y ) 2 + o(ϵ2) 2 πB(i X1,X2;Y ) 2 + o(ϵ2) where the inequality holds with equality when πI( ℓQX1,X2,Y ) = o(ϵ). Hence, we obtain i(ent) X1,X2;Y = ℓP ent X1,X2,Y = πB(i X1,X2;Y ) + o(ϵ), which gives (88). C.18 Proof of Proposition 33 For all φ φ(1) +φ(2) FX1 +FX2 and ψ FY with ψ = 1, we have EPX1,X2PY [ψ(Y )φ(X1, X2)] = 0. Hence, E [ψ(Y )φ(X1, X2)] = EPX1,X2PY [(1 + i X1,X2;Y (X1, X2, Y )) ψ(Y )φ(X1, X2)] Xu and Zheng = i X1,X2;Y , φ ψ = πB(i X1,X2;Y ), φ ψ + πI(i X1,X2;Y ), φ ψ = πB(i X1,X2;Y ), φ ψ , where the first equality follows from the definition of i X1,X2;Y , and where the last equality follow from the fact that πI(i X1,X2;Y ) (FX1 Y + FX2 Y) φ ψ. Therefore, we can rewrite the objective (92) as E ψ(Y ) φ(1)(X1) φ(2)(X2) 2 = E h (ψ(Y ) φ(X1, X2))2i = ψ 2 + φ 2 2 E [ψ(Y )φ(X1, X2)] = 1 + φ ψ 2 2 πB(i X1,X2;Y ), φ ψ = 1 + πB(i X1,X2;Y ) φ ψ 2 πB(i X1,X2;Y ) 2. Finally, the proof is completed by noting that πB(i X1,X2;Y ) 2 = P K i=1 σ2 i , and we have πB(i X1,X2;Y ) φ ψ 2 r1(πB(i X1,X2;Y )) 2 = where the inequality becomes equality if and only if φ ψ = ζ(πB(i X1,X2;Y )) = σ1( f 1 g 1). C.19 Proof of Theorem 34 We first introduce a useful lemma. Lemma 41 For all k 1, f F k X, g F k Y , let h Π (f; FX1 + FX2), h = f h. Then, we have Hm(f, g) = 1 2 L(RX1,X2,Y ) η0 πI( ℓˆP (0) X1,X2,Y ) h g 2 LB( h g) , (176) where we have defined LB(γ) η0 πB( ℓˆP (0) X1,X2,Y ) πB(γ) 2 + η1 ℓˆP (1) X1,Y πM1(γ) 2 + η2 ℓˆP (2) X2,Y πM2(γ) 2. (177) Proof By definition, we have H (f, g; ˆP (0) X1,X2,Y ) = 1 ℓˆP (0) X1,X2,Y 2 ℓˆP (0) X1,X2,Y f g 2 ℓˆP (0) X1,X2,Y 2 (πI( ℓˆP (0) X1,X2,Y ) h g) + (πB( ℓˆP (0) X1,X2,Y ) h g) 2 ℓˆP (0) X1,X2,Y 2 πI( ℓˆP (0) X1,X2,Y ) h g 2 πB( ℓˆP (0) X1,X2,Y ) h g 2 , where to obtain the last equality we have used the orthogonality between πI( ℓˆP (0) X1,X2,Y ) h g (FX1 Y + FX2 Y) and πB( ℓˆP (0) X1,X2,Y ) h g (FX1 Y + FX2 Y). Neural Feature Learning in Function Space In addition, for each i = 1, 2, from τi(f) = τi( h) we have H (τi(f), g; ˆP (i) Xi,Y ) = H (τi( h), g; ˆP (i) Xi,Y ) = 1 ℓˆP (i) Xi,Y 2 ℓˆP (i) Xi,Y τi( h) g 2 , ℓˆP (i) Xi,Y 2 ℓˆP (i) Xi,Y πMi( h g) 2 , (179) where the last equality follows from that πMi( h g) = τi( h) g. From (178) and (179), we can rewrite (95) as Hm(f, g) = η0 H (f, g; ˆP (0) X1,X2,Y ) + η1 H τ1(f), g; ˆP (1) X1,Y + η2 H τ2(f), g; ˆP (2) X2,Y 2 ℓˆP (0) X1,X2,Y 2 πI( ℓˆP (0) X1,X2,Y ) h g 2 πB( ℓˆP (0) X1,X2,Y ) h g 2 2 ℓˆP (i) Xi,Y 2 ℓˆP (i) X1,Y πMi( h g) 2 2 L(RX1,X2,Y ) η0 πI( ℓˆP (0) X1,X2,Y ) h g 2 η0 πB( ℓˆP (0) X1,X2,Y ) h g 2 + i=1 ηi ℓˆP (i) X1,Y πMi( h g) 2 # 2 L(RX1,X2,Y ) η0 πI( ℓˆP (0) X1,X2,Y ) h g 2 LB( h g) , where the last equality follows from the fact that h g = πB( h g). Let use define h Π (f; FX1 + FX2) and h f h. Then, from Lemma 41, we have Hm( f, g) = 1 2 L(RX1,X2,Y ) η0 πI( ℓˆP (0) X1,X2,Y ) 2 LB( f g) , (180) and, similarly, 2 L(RX1,X2,Y ) η0 πI( ℓˆP (0) X1,X2,Y ) h g 2 LB( f g + h g) . (181) Therefore, (180) is minimized if and only if f g = πB ℓP (est) X1,X2,Y and (181) is minimized if and only if f g + h g = πB ℓP (est) X1,X2,Y , h g = ζ k πI ℓˆP (0) X1,X2,Y Hence, the common solution of (182) and (183) is f g = πB ℓP (est) X1,X2,Y , h g = 0, h g = ζ k πI ℓˆP (0) X1,X2,Y which is equivalent to (97). Finally, from Proposition 38, this is also the solution that maximizes the nested H-score (96). Xu and Zheng C.20 Proof of Theorem 35 We first introduce two useful facts. Fact 6 (Cover and Thomas 2006, Theorem 11.1.2) Given m samples Z1, . . . , Zm i.i.d. generated from QZ PZ, th probability of observing {Zi}m i=1 = {zi}m i=1, denoted by P {{zi}m i=1; QZ}, satisfies log P {{zi}m i=1; QZ} = m h H( ˆPZ) + D( ˆPZ QZ) i , where ˆPZ is the empirical distribution of {zi}m i=1. Fact 7 (Huang et al. 2024, Lemma 4.5) Given a reference distribution R relint(PZ). Then, for P, Q PZ with ℓP;R = O(ϵ), ℓQ;R = O(ϵ), we have 2 ℓP;R ℓQ;R 2 + o(ϵ2). Note that since the data from three different datasets are generated independently, we have P {D0, D1, D2; QX1,X2,Y } = P {D0; QX1,X2,Y } P {D1; QX1,X2,Y } P {D2; QX1,X2,Y } = P {D0; QX1,X2,Y } P {D1; QX1,Y } P {D2; QX2,Y } . Therefore, from Fact 6, log P {D0, D1, D2; QX1,X2,Y } = log P {D0; QX1,X2,Y } log P {D1; QX1,Y } log P {D2; QX2,Y } = n0 D ˆP (0) X1,X2,Y QX1,X2,Y + n1 D ˆP (1) X1,Y QX1,Y + n2 D ˆP (2) X2,Y QX2,Y + n0 H( ˆP (0) X1,X2,Y ) + n1 H( ˆP (1) X1,Y ) + n2 H( ˆP (2) X2,Y ) = n L(ML)(QX1,X2,Y ) + n0 H( ˆP (0) X1,X2,Y ) + n1 H( ˆP (1) X1,Y ) + n2 H( ˆP (2) X2,Y ) where the second equality follows from Fact 6, and where we have defined L(ML)(QX1,X2,Y ) η0 D ˆP (0) X1,X2,Y QX1,X2,Y + η1 D ˆP (1) X1,Y QX1,Y + η2 D ˆP (2) X2,Y QX2,Y . Hence, we can rewrite the maximum likelihood solution P (ML) X1,X2,Y as P (ML) X1,X2,Y = arg max QX1,X2,Y P {D0, D1, D2; QX1,X2,Y } = arg min QX1,X2,Y L(ML)(QX1,X2,Y ). From L(RX1,X2,Y ) = O(ϵ2), we obtain ℓˆP (0) X1,X2,Y = O(ϵ), ℓˆP (1) X1,Y = O(ϵ), ℓˆP (2) X2,Y By definition, we have L(P (est) X1,X2,Y ) < L(RX1,X2,Y ) = O(ϵ2), which implies that ℓP (est) X1,X2,Y We first consider QX1,X2,Y with ℓQX1,X2,Y ℓP (est) X1,X2,Y Then, we have ℓQX1,X2,Y ℓQX1,X2,Y ℓP (est) X1,X2,Y + ℓP (est) X1,X2,Y = O(ϵ), (186) Neural Feature Learning in Function Space and it follows from Fact 7 that L(ML)(QX1,X2,Y ) = η0 D ˆP (0) X1,X2,Y QX1,X2,Y + η1 D ˆP (1) X1,Y QX1,Y + η2 D ˆP (2) X2,Y QX2,Y η0 ℓˆP (0) X1,X2,Y ℓQX1,X2,Y 2 + η1 ℓˆP (1) X1,Y ℓQX1,Y 2 + η2 ℓˆP (2) X2,Y ℓQX2,Y 2 + o(ϵ2) 2 L(QX1,X2,Y ) + o(ϵ2). (187) Therefore, for QX1,X2,Y that satisfies (185), the minimum value of L(ML)(QX1,X2,Y ) is achieved by QX1,X2,Y = P (est) X1,X2,Y + o(ϵ). Now we consider the case of QX1,X2,Y with ℓQX1,X2,Y ℓP (est) X1,X2,Y > ϵ. Let ϵ = ϵ/ ℓQX1,X2,Y ℓP (est) X1,X2,Y < 1 and define QX1,X2,Y ϵ QX1,X2,Y + (1 ϵ ) P (est) X1,X2,Y . (188) Then, we have ℓ QX1,X2,Y = ϵ ℓQX1,X2,Y + (1 ϵ ) ℓP (est) X1,X2,Y , which implies ℓ QX1,X2,Y ℓP (est) X1,X2,Y = ϵ ℓQX1,X2,Y ℓP (est) X1,X2,Y As a result, we can apply the same analysis on QX1,X2,Y and obtain [cf. (187)] L(ML)( QX1,X2,Y ) = 1 2 L( QX1,X2,Y ) + o(ϵ2). (190) Hence, we obtain from (189) that L(ML)( QX1,X2,Y ) > L(ML)(P (est) X1,X2,Y ) = 1 2 L(P (est) X1,X2,Y ) + o(ϵ2) (191) for ϵ sufficiently small. In addition, since L(ML) is convex, it follows from Jensen s inequality that (cf. (188)) L(ML)( QX1,X2,Y ) ϵ L(ML)(QX1,X2,Y ) + (1 ϵ ) L(ML)(P (est) X1,X2,Y ). (192) As a result, from (191) and (192) we have L(ML)(QX1,X2,Y ) > L(ML)(P (est) X1,X2,Y ). Combining both cases of QX1,X2,Y , we obtain (99). C.21 Proof of Proposition 36 It suffices to establish that there exist α: U R, β : V R, such that i X;U(x, u) = α(u) [tanh (2w ϕ(x) + b U) tanh(b U)] , (193a) i Y ;V (y, v) = β(v) tanh 2w ϕ(y) + b V tanh(b V ) . (193b) To see this, note that from the Markov relation X U V Y , we have PX,Y (x, y) = X u U,v V PX|U(x|u) PY |V (y|v) PU,V (u, v) Xu and Zheng = PX(x)PY (y) X u U,v V PU,V (u, v) (1 + i X;U(x, u)) (1 + i Y ;V (y, v)) = PX(x)PY (y) u U,v V PU,V (u, v) i X;U(x, u) i Y ;V (y, v) where to obtain the last equality we have used the fact that X u U PU(u) i X;U(x, u) = X v V PV (v) i Y ;V (y, v) = 0. Therefore, we obtain i X;Y (x, y) = PX,Y (x, y) PX(x)PY (y) 1 u U,v V PU,V (u, v) i X;U(x, u) i Y ;V (y, v) = E [α(U)β(V )] [tanh (2w ϕ(x) + b U) tanh(b U)] tanh 2w ϕ(y) + b V tanh(b V ) , which gives (105). It remains only to establish (193). For symmetry, we consider only (193a). To begin, by definition, we have PX|U=u(x) = 1 h q (1 δxixi+1) u (1 qu)δxixi+1 i , which implies log PX|U=u(x) = log 1 i=1 (1 δxixi+1) + log(1 qu) i=1 δxixi+1. Therefore, we obtain 1 2 log PX|U=1(x) PX|U=0(x) = 1 2 log PX|U=1(x) log PX|U=0(x) i=1 (1 δxixi+1) + log 1 q1 i=1 δxixi+1 i=1 δxixi+1 As a consequence, PX|U=1(x) PX|U=0(x) PX(x) = PX|U=1(x) PX|U=0(x) PU(1) PX|U=1(x) + PU(0) PX|U=0(x) PX|U=1(x) PX|U=0(x) 1 PU(1) PU(0) PX|U=1(x) PX|U=0(x) + 1 Neural Feature Learning in Function Space = 1 2PU(0)PU(1) PU(1) PU(0) PX|U=1(x) PX|U=0(x) 1 PU(1) PU(0) PX|U=1(x) PX|U=0(x) + 1 PU(1) PU(0) 1 PU(1) PU(0) + 1 = 1 2PU(0)PU(1) [tanh (2w ϕ(x) + b U) tanh(b U)] , where to obtain the last equality we have used the fact that t 1 t+1 = tanh 1 2 log t . Hence, with u = 1 u, we have i X;U(x, u) = PX,U(x, u) PX(x)PU(u) = PX|U=u(x) PX(x) = PX|U=u(x) PU(u)PX|U=u(x) PU(u )PX|U=u (x) = PU(u ) PX|U=u(x) PX|U=u (x) = PU(u ) ( 1)u+1 PX|U=1(x) PX|U=0(x) 2PU(u) [tanh (2w ϕ(x) + b U) tanh(b U)] , which gives (193a) as desired, with α(u) = ( 1)u+1 Appendix D. Implementation Details of Experiments We implement our experiments in Python 3 (Van Rossum and Drake, 2009), where we use the Py Torch (Paszke et al., 2019) library for neural network training and use the Matplotlib (Hunter, 2007) library for plotting. We also make use of Num Py (Harris et al., 2020) and Sci Py (Virtanen et al., 2020) for the computation. In the experiments, we apply Adam (Kingma and Ba, 2015) as the optimizer with the default parameters: a learning rate of 10 3, β1 = 0.9, β2 = 0.999, and ϵ = 10 8. For each MLP (multilayer perceptron) used in the experiments, we set the activation function to be the softplus function x 7 log(1 + ex), which are applied to all layers except the output layer. It is worth mentioning that our choices of network architectures, optimizers and hyperparameters are not optimized with respect to the used data distributions. It is possible to further optimize such choices to improve the performance or convergence. D.1 Learning Maximal Correlation Functions We first introduce the implementation details for Section 7.1, where the goal is to learn maximal correlation functions for different data. The corresponding learning objective is the nested H-score (38), which are maximized during the training. D.1.1 Implementation of Section 7.1.1 We set |X| = 8, |Y| = 6, and feature dimension k = 3. To generate the discrete distributions PX,Y , we draw (|X| |Y|) i.i.d. numbers from Unif[0, 1] and divide each number by their sum. We then Xu and Zheng use the resulting |X| |Y| table as the values for the probability mass function PX,Y . To ensure reproducible results, we set the random seed of Num Py to 20 230 606 in the generating process. Then, we generate N = 30 000 training sample pairs of (X, Y ) from PX,Y , then apply one-hot encoding such that the inputs are represented as |X| and |Y| dimensional vectors. Then, we use two one-layer linear networks as the feature extractors f and g. We train the networks with a minibatch size of 128 for 100 epochs. Then, we obtain the estimated f i , g i , and σi by applying (40) and compare them with corresponding theoretical values, which we compute from the SVD of corresponding CDM matrix [cf. (120)], with the results shown in Figure 14. Note that since f i g i = ( f i ) ( g i ), both (f i , g i ) and ( f i , g i ) are the optimal feature pairs. For the sake of presentation, we applied a sign modification before the plotting. D.1.2 Implementation of Section 7.1.2 In this experiment, we first generate N = 50 000 samples of (X, Y ) R2 for training, to learn k = 2 dimensional features f and g. We use two MLPs of the same architecture as the feature extractors for f and g. Specifically, each MLP is with three layers, where the dimensions for all intermediate features, from input to output, are: input = 1 32 32 2 = output. We then train the networks with a minibatch size of 256 for 100 epochs and use the learned features for estimation tasks, as demonstrated in Section 7.1.2. D.1.3 Implementation of Section 7.1.3 In this experiment, we set k = 1. To extract f and g from input sequences X and Y , we use onedimensional convolutional neural networks as the feature extractors, which are used in sentence classifications (Kim, 2014; Zhang and Wallace, 2017). In particular, f and g are of the same architecture, composed of an embedding (linear) layer, a 1 dimensional convolutional layer, an average pooling layer, and a fully connected (linear) layer. We use feature extractor f as an example to illustrate the processing of sequential data. First, we represent x sequence as a one-hot encoded list, i.e., each xi {(1, 0)T, (0, 1)T}. Then, the embedding layer maps each xi to a 4-dimensional vector. The one-dimensional convolutional layer then processes the length-l list of embedded 4-dimensional vectors, by 32 convolutional kernels of size 4. We then activate the convolution results by the Re LU function x 7 max{x, 0}. The output from each convolutional kernel is further averaged by the average pooling layer, leading to a 32 dimensional feature, with each dimension corresponding to a convolutional kernel. Finally, we feed the 32 dimensional feature to the fully connected layer and generate k = 1 dimensional output. Then, we train the feature extractors f and g with a minibatch size of 128 for 100 epochs. The learned features are shown in Section 7.1.3. D.2 Learning With Orthogonality Constraints In this experiment, we use the same dataset generated in Appendix D.1.2. We set k = k = 1, i.e., we learn one-dimensional feature f from X orthogonal to given one-dimensional f. To this end, we use three MLPs of the same architecture as the feature extractors for g, f, g, with dimensions input = 1 32 32 1 = output. We then train the networks with a minibatch size of 256 for 100 epochs to maximize the nested H-score restricted to f = φ [cf. (52)], for φ(x) = x and φ(x) = x2, respectively. Neural Feature Learning in Function Space D.3 Learning With Side Information We set |X| = 8, |S| = |Y| = 3, and generate PX,S,Y in a manner similar to Appendix D.1.1, with the same random seed. Then, we generate N = 50 000 training samples of (X, S, Y ) triples. In our implementation, we set k = |S| 1 = 2 and k = 1, and set feature extractors f F 2 X, g F 2 S , f FX, g FS Y as corresponding one-layer linear networks with one-hot encoded inputs. In particular, we convert each (s, y) to one unique one-hot vector of in R|S| |Y| as the input to the network g. Then, we train these feature extractors on the training set with a minibatch size of 256 for 100 epochs, to maximize the nested H-score configured by CMC [cf. (59)]. For comparison, we train a multihead network shown in Figure 11 on the same dataset, with the same minibatch size and epochs. The feature f is again implemented by a one-layer linear network. In particular, we maximize the log-likelihood function (70) to learn the corresponding feature and weights. Then, we convert the weights to g FS Y via the correspondence [cf. (69)] g(s, y) = Gs(1, y). The comparison between features learned from two approaches are shown in Figure 20. For the sake of presentation, we have normalized the features before plotting, such that f and each g(s, ) are zero-mean, and unit variance with respect to Unif(X) and Unif(Y), respectively. D.4 Multimodal Learning With Missing Modalities D.4.1 Implementation of Section 7.4.1 We first generate N = 50 000 triples of (X1, X2, Y ) for training. In implementing the algorithm, we k = k = 1. To represent f FX, we set each f(i), i {1, 2} as an MLP with dimensions input = 1 32 32 1 = output. To represent f, we use an MLP with dimensions input = 2 32 32 1 = output with the input set to X1 ++X2 = (X1, X2)T. Since Y is discrete, we use one-layer linear network as g and g, where the inputs are one-hot encoded Y . Therefore, both g and g are with one linear layer, taking |Y| = 2 dimensional input and outputting one-dimensional feature. We then train these feature extractors on the training set with a minibatch size of 256 for 100 epochs, to maximize the nested H-score configured by CBI [cf. (78)]. The learned features are then shown in Figure 21. For the prediction problem with missing modality, we use two MLPs to represent conditional expectation operators φ1 τ1( f(2)) and φ2 τ2( f(1)), each with dimensions input = 1 32 32 1 = output. These two networks are optimized by minimizing the corresponding mean square error. For example, we train φ1 by minimizing E h f(2)(X2) φ1(X1) 2i over all parameters in φ1 network. The training of φ1, φ2 is with minibatch size of 256, and 100 epochs. D.4.2 Implementation of Section 7.4.2 We generate N = 50 000 triples of (X1, X2, Y ) from (106) and (111), and adopt the decomposition (89) on each triple. This gives three pairwise datasets with samples of (X1, X2), (X1, Y ), (X2, Y ), where each dataset has N sample pairs. Then, we adopt the same setting of networks to represent one-dimensional f(1), f(2) and g. We then train these feature extractors on the three datasets for 100 epochs with a minibatch size of 256, to maximize H ( f, g). Here, we compute H ( f, g) based on the minibatches from the three pairwise datasets according to (93). To solve the prediction problem with missing modality, we use the same network architectures and training settings to learn τ1( f(2)) and τ2( f(1)), as introduced in Appendix D.4.1. Xu and Zheng S-I Amari. Information geometry on hierarchy of probability distributions. IEEE transactions on information theory, 47(5):1701 1711, 2001. Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191. American Mathematical Soc., 2000. Galen Andrew, Raman Arora, JeffBilmes, and Karen Livescu. Deep canonical correlation analysis. In International conference on machine learning, pages 1247 1255. PMLR, 2013. Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In International conference on machine learning, pages 531 540. PMLR, 2018. Leo Breiman and Jerome H Friedman. Estimating optimal transformations for multiple regression and correlation. Journal of the American statistical Association, 80(391):580 598, 1985. Michael M Bronstein, Joan Bruna, Yann Le Cun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4): 18 42, 2017. Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veliˇckovi c. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. ar Xiv preprint ar Xiv:2104.13478, 2021. Andreas Buja. Remarks on functional canonical variates, alternating least squares methods and ace. The Annals of Statistics, pages 1032 1069, 1990. R Caruana. Multitask learning: A knowledge-based source of inductive bias1. In Proceedings of the Tenth International Conference on Machine Learning, pages 41 48. Citeseer, 1993. Kwan Ho Ryan Chan, Yaodong Yu, Chong You, Haozhi Qi, John Wright, and Yi Ma. Redunet: A white-box deep network from the principle of maximizing rate reduction. The Journal of Machine Learning Research, 23(1):4907 5009, 2022. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597 1607. PMLR, 2020. Yuxin Chen, Yuejie Chi, Jianqing Fan, Cong Ma, et al. Spectral methods for data science: A statistical perspective. Foundations and Trends in Machine Learning, 14(5):566 806, 2021. Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273 297, 1995. Thomas M Cover and Joy A Thomas. Elements of information theory (wiley series in telecommunications and signal processing), 2006. George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303 314, 1989. Flavio du Pin Calmon, Ali Makhdoumi, Muriel M edard, Mayank Varia, Mark Christiansen, and Ken R Duffy. Principal inertia components and applications. IEEE Transactions on Information Theory, 63(8):5011 5038, 2017. Neural Feature Learning in Function Space Nelson Dunford and Jacob T Schwartz. Linear operators, part 1: general theory, volume 10. John Wiley & Sons, 1988. Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211 218, 1936. Artur Ekert and Peter L Knight. Entangled quantum systems and the schmidt decomposition. American Journal of Physics, 63(5):415 423, 1995. Ronald A Fisher. The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2):179 188, 1936. Hans Gebelein. Das statistische problem der korrelation als variations-und eigenwertproblem und sein zusammenhang mit der ausgleichsrechnung. ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift f ur Angewandte Mathematik und Mechanik, 21(6):364 379, 1941. Michael Greenacre. Correspondence analysis in practice. CRC press, 2017. JeffZ Hao Chen, Colin Wei, Adrien Gaidon, and Tengyu Ma. Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems, 34:5000 5011, 2021. Charles R. Harris, K. Jarrod Millman, St efan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fern andez del R ıo, Mark Wiebe, Pearu Peterson, Pierre G erard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with Num Py. Nature, 585(7825):357 362, September 2020. doi: 10.1038/s41586-020-2649-2. URL https://doi.org/10.1038/s41586-020-2649-2. Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770 778, 2016. Hermann O Hirschfeld. A connection between correlation and contingency. In Proceedings of the Cambridge Philosophical Society, volume 31, pages 520 524, 1935. Thomas Hofmann, Bernhard Sch olkopf, and Alexander J Smola. Kernel methods in machine learning. 2008. Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321 377, 1936. Hsiang Hsu, Salman Salamatian, and Flavio P Calmon. Generalizing correspondence analysis for applications in machine learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):9347 9362, 2021. Shao-Lun Huang and Xiangxiang Xu. On the sample complexity of HGR maximal correlation functions for large datasets. IEEE Transactions on Information Theory, 67(3):1951 1980, 2020. Xu and Zheng Shao-Lun Huang, Anuran Makur, Gregory W. Wornell, and Lizhong Zheng. Universal features for high-dimensional learning and inference. Foundations and Trends in Communications and Information Theory, 21(1-2):1 299, 2024. ISSN 1567-2190. doi: 10.1561/0100000107. URL http://dx.doi.org/10.1561/0100000107. John D Hunter. Matplotlib: A 2D graphics environment. Computing in science & engineering, 9 (03):90 95, 2007. Edwin T Jaynes. Information theory and statistical mechanics. Physical review, 106(4):620, 1957a. Edwin T Jaynes. Information theory and statistical mechanics. ii. Physical review, 108(2):171, 1957b. Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746 1751, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/ D14-1181. URL https://aclanthology.org/D14-1181. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann Le Cun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84 90, 2017. Henry Oliver Lancaster. The structure of bivariate distributions. The Annals of Mathematical Statistics, 29(3):719 736, 1958. Yann Le Cun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436 444, 2015. Anuran Makur, Gregory W Wornell, and Lizhong Zheng. On estimation of modal decompositions. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2717 2722. IEEE, 2020. F Gustav Mehler. Ueber die entwicklung einer function von beliebig vielen variablen nach laplaceschen functionen h oherer ordnung. 1866. Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In ICML, 2011. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, highperformance deep learning library. Advances in neural information processing systems, 32, 2019. Karl Pearson. Liii. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin philosophical magazine and journal of science, 2(11):559 572, 1901. Karl Pearson. Mathematical contributions to the theory of evolution XIII: On the theory of contingency and its relation to association and normal correlation. Draper s Co. Research Memoirs. Biometric Series, No. 1 (Reprinted 1948), 1904. Neural Feature Learning in Function Space Alfr ed R enyi. On measures of dependence. Acta Mathematica Academiae Scientiarum Hungarica, 10(3-4):441 451, 1959. Sebastian Ruder. An overview of multi-task learning in deep neural networks. ar Xiv preprint ar Xiv:1706.05098, 2017. Igal Sason and Sergio Verd u. f-divergence inequalities. IEEE Transactions on Information Theory, 62(11):5973 6006, 2016. doi: 10.1109/TIT.2016.2603151. Erhard Schmidt. Zur theorie der linearen und nichtlinearen integralgleichungen. i. teil: Entwicklung willk urlicher funktionen nach systemen vorgeschriebener. Mathematische Annalen, 63:433 476, 1907. Claude E Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379 423, 1948. Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 ieee information theory workshop (itw), pages 1 5. IEEE, 2015. Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. ar Xiv preprint physics/0004057, 2000. Guido Van Rossum and Fred L. Drake. Python 3 Reference Manual. Create Space, Scotts Valley, CA, 2009. ISBN 1441412697. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Nature methods, 17(3):261 272, 2020. Lichen Wang, Jiaxiang Wu, Shao-Lun Huang, Lizhong Zheng, Xiangxiang Xu, Lin Zhang, and Junzhou Huang. An efficient approach to informative feature extraction from multimodal data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5281 5288, 2019. Joachim Weidmann. Linear operators in Hilbert spaces, volume 68. Springer Science & Business Media, 2012. Xiangxiang Xu and Shao-Lun Huang. Maximal correlation regression. IEEE Access, 8:26591 26601, 2020. Xiangxiang Xu and Shao-Lun Huang. An information theoretic framework for distributed learning algorithms. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 314 319. IEEE, 2021. Xiangxiang Xu and Lizhong Zheng. Multivariate feature extraction. In 2022 58th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1 8. IEEE, 2022. Xiangxiang Xu and Lizhong Zheng. Kernel subspace and feature extraction. In 2023 IEEE International Symposium on Information Theory (ISIT) (ISIT 2023), pages 1032 1037, Taipei, Taiwan, June 2023a. Xu and Zheng Xiangxiang Xu and Lizhong Zheng. Sequential dependence decomposition and feature learning. In 2023 59th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1 8. IEEE, 2023b. Xiangxiang Xu, Shao-Lun Huang, Lizhong Zheng, and Lin Zhang. The geometric structure of generalized softmax learning. In 2018 IEEE Information Theory Workshop (ITW), pages 1 5. IEEE, 2018. Xiangxiang Xu, Shao-Lun Huang, Lizhong Zheng, and Gregory W Wornell. An information theoretic interpretation to deep neural networks. Entropy, 24(1):135, 2022. Nicholas Young. An introduction to Hilbert space. Cambridge university press, 1988. Ye Zhang and Byron C Wallace. A sensitivity analysis of (and practitioners guide to) convolutional neural networks for sentence classification. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 253 263, 2017.