# f-MICL: Understanding and Generalizing InfoNCE-based Contrastive Learning

Published in Transactions on Machine Learning Research (10/2023)

Yiwei Lu (yiwei.lu@uwaterloo.ca), School of Computer Science, University of Waterloo; Vector Institute
Guojun Zhang (guojun.zhang@huawei.com), Huawei Noah's Ark Lab
Sun Sun (sun.sun@nrc-cnrc.gc.ca), School of Computer Science, University of Waterloo; National Research Council Canada
Hongyu Guo (hongyu.guo@uottawa.ca), National Research Council Canada; University of Ottawa
Yaoliang Yu (yaoliang.yu@uwaterloo.ca), School of Computer Science, University of Waterloo; Vector Institute

Reviewed on OpenReview: https://openreview.net/forum?id=ZD03VUZmRx

In self-supervised contrastive learning, a widely-adopted objective function is InfoNCE, which uses the heuristic cosine similarity for the representation comparison, and is closely related to maximizing the Kullback-Leibler (KL)-based mutual information. In this paper, we aim to answer two intriguing questions: (1) Can we go beyond the KL-based objective? (2) Besides the popular cosine similarity, can we design a better similarity function? We provide answers to both questions by generalizing the KL-based mutual information to the f-Mutual Information in Contrastive Learning (f-MICL) using the f-divergences. To answer the first question, we provide a wide range of f-MICL objectives which share the nice properties of InfoNCE (e.g., alignment and uniformity), and meanwhile result in similar or even superior performance. For the second question, assuming that the joint feature distribution is proportional to the Gaussian kernel, we derive an f-Gaussian similarity with better interpretability and empirical performance. Finally, we identify close relationships between the f-MICL objective and several popular InfoNCE-based objectives.
Using benchmark tasks from both vision and natural language, we empirically evaluate f-MICL with different f-divergences on various architectures (SimCLR, MoCo, and MoCo v3) and datasets. We observe that f-MICL generally outperforms the benchmarks and that the best-performing f-divergence is task and dataset dependent.

Equal contribution.

## 1 Introduction

Recent advances in self-supervised learning aim at learning similar representations from different augmented views of the same data sample. However, naively implementing this idea would easily make representations converge to some trivial constant (i.e., feature collapse) in practice. To address this problem, researchers propose new algorithms either from the model architecture perspective or the training objective perspective. The former method (e.g., Grill et al. (2020); Chen & He (2021); Zhang et al. (2021)) applies techniques such as stop gradient or a predictor module to create asymmetry in networks, while the latter method encourages the contrastiveness between similar (positive) and dissimilar (negative) sample pairs through the objective design. In this paper, we intend to deepen our understanding of contrastive learning by generalizing the current objective design.

To achieve self-supervised contrastive learning, existing objectives are proposed from different perspectives such as the mutual information (e.g., InfoNCE (Wu et al., 2018; van den Oord et al., 2018; Chen et al., 2020; Hénaff et al., 2020; He et al., 2020)), the information redundancy (e.g., Barlow Twins (Zbontar et al., 2021)), and the regularization (e.g., VICReg (Bardes et al., 2021)).
In particular, the InfoNCE objective is widely used. It aims to maximize the probability of picking a similar sample pair among a batch of sample pairs, and can be interpreted as a lower bound of the mutual information (MI) between two views of samples (van den Oord et al., 2018; Bachman et al., 2019; Tian et al., 2020a; Tschannen et al., 2020). This is consistent with the well-known InfoMax principle (Linsker, 1988). To measure the similarity between sample pairs, cosine similarity is usually adopted.

To attain the aforementioned goals of better understanding and generalizing contrastive learning, we here focus on the widely-adopted InfoNCE objective, and pose two questions regarding it: (1) MI is essentially the Kullback-Leibler (KL) divergence between the joint distribution and the product of the marginal distributions. Is this KL-based objective optimal? If not, can we go beyond the KL-based objective? (2) Besides the commonly used cosine similarity for measuring the distance between samples, can we provide a better similarity function with a theoretical basis?

To answer the above two questions, we generalize the KL-based mutual information to the broader f-divergence family (Ali & Silvey, 1966; Csiszár, 1967), and propose the benchmark of f-mutual information in contrastive learning (f-MICL). By searching through a wide range of f-divergences, we observe that the KL divergence is not always the best, and several other f-divergences in fact show similar or even superior performance in practice. For the second question, while it is challenging to provide an answer based on the InfoNCE objective, it is possible to derive a proper similarity function under the f-MICL framework. By assuming that the joint feature distribution is proportional to the popularly-adopted Gaussian kernel, we propose a novel f-Gaussian similarity function that enjoys better empirical performance.
Finally, we show the generalization of f-MICL by drawing connections between the f-MICL objective and several popular InfoNCE-based objectives (e.g., SimCLR (Chen et al., 2020), MoCo (He et al., 2020), and Alignment and Uniformity (AU) (Wang & Isola, 2020)). We identify that those objectives are closely related to f-MICL: Alignment and Uniformity (AU) (Wang & Isola, 2020) can be treated as a special case, and SimCLR (Chen et al., 2020) and MoCo (He et al., 2020) are upper bounds for a transformed f-MICL. These results provide a different angle to better understand InfoNCE. Moreover, we show both theoretically and empirically that nice properties of InfoNCE (e.g., alignment and uniformity (Wang & Isola, 2020)) can be naturally extended to f-MICL.

We summarize our main contributions as follows:

- Motivated by InfoNCE, we propose a general framework for contrastive learning by extending the MI to the general f-MI, which provides a wide range of objective choices.
- Instead of using heuristic similarity functions, we provide a novel similarity function, called f-Gaussian similarity, based on the convex conjugate and an assumption on the joint feature distribution.
- We identify close relationships between our f-MICL objective and several InfoNCE-based contrastive learning objectives.
- Empirically, we show that f-MICL achieves notable improvement over benchmarks on various datasets, and the best-performing f-divergence depends on the specific task and dataset. In addition, our proposed f-Gaussian similarity consistently outperforms the cosine similarity.

## 2 f-Mutual Information

To provide answers to the above two questions regarding the InfoNCE objective, we first extend the KL-based mutual information to the more general f-mutual information. The definition of the f-mutual information (f-MI) is as follows:

Definition 1 (f-mutual information, Csiszár 1967).
Consider a pair of random variables $(X, Y)$ with joint density function $p(x, y)$ and marginal densities $p(x)$ and $p(y)$. The f-mutual information $I_f$ between $X$ and $Y$ is defined as

$$I_f(X; Y) := \int f\!\left(\frac{p(x, y)}{p(x)p(y)}\right) p(x)p(y)\, \mathrm{d}\lambda(x, y), \tag{1}$$

where $f : \mathbb{R}_+ \to \mathbb{R}$ is (closed) convex with $f(1) = 0$, and $\lambda$ is a dominating measure (e.g., Lebesgue).

Note that the f-MI is essentially the f-divergence between the joint distribution and the product of marginal distributions. It is well-known that the f-MI is non-negative and symmetric. Moreover, provided that $f$ is strictly convex, $I_f(X; Y) = 0$ iff $X$ and $Y$ are independent (Ali & Silvey, 1966).

Since it is challenging to provide an accurate estimation of the f-divergences in high dimensions, Nguyen et al. (2010) used the convex conjugate as a lower bound for the f-divergences. With this result we can lower bound $I_f(X; Y)$ as follows:

$$I_f(X; Y) \ge \sup_{s \in \mathcal{F}} \; \mathbb{E}_{(X,Y) \sim p(x,y)}\, s(X, Y) - \mathbb{E}_{(X,Y) \sim p(x)p(y)}\, f^* \circ s(X, Y), \tag{2}$$

where $p(x, y)$ denotes the joint density, $p(x)p(y)$ stands for the product of marginal densities, and the symbol $\circ$ denotes function composition. The function $f^*(t) := \sup_{x \in \mathbb{R}_+} (xt - f(x))$ is known as the convex conjugate¹ of $f$ and is monotonically increasing, and $s(\cdot)$ belongs to $\mathcal{F}$, a class of functions on $(x, y)$ that we can parameterize. Using results in Nguyen et al. (2010), one can show that the right-hand side of eq. (2) is equal to $I_f(X; Y)$ if there exists $s^\star \in \mathcal{F}$ such that for any $(x, y) \in \operatorname{supp}(p(x)) \times \operatorname{supp}(p(y))$, where $\operatorname{supp}(\cdot)$ denotes the support of a distribution, we have:

$$s^\star(x, y) = f'\!\left(\frac{p(x, y)}{p(x)p(y)}\right). \tag{3}$$

In other words, plugging the optimal $s^\star(x, y)$ into eq. (2) we obtain equality. In Table 1 we list common choices of f-divergences, their conjugates, and the derivatives. We also include the composition $f^* \circ f'$ for later purposes (see Theorem 4 below).

With the introduction of the more general f-MI we now proceed to the design of the objective and similarity function. Then we will analyze the property of the proposed framework and compare it with some existing InfoNCE-based benchmarks.
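To make the tightness claim for eq. (2) concrete, the following is a minimal sketch (not the authors' code) that evaluates eq. (1) and the variational bound on a small discrete joint distribution. The function names, the toy table `TOY_JOINT`, and the use of the counting measure as the dominating measure are illustrative assumptions; the `(f, f*, f')` triples are transcribed from Table 1.

```python
import math

# (f, f_conj, f_prime) triples transcribed from Table 1.
DIVERGENCES = {
    "KL": (lambda u: u * math.log(u),
           lambda t: math.exp(t - 1.0),
           lambda u: math.log(u) + 1.0),
    "JS": (lambda u: -(u + 1.0) * math.log((1.0 + u) / 2.0) + u * math.log(u),
           lambda t: -math.log(2.0 - math.exp(t)),
           lambda u: math.log(2.0) + math.log(u / (1.0 + u))),
    "Pearson": (lambda u: (u - 1.0) ** 2,
                lambda t: t * t / 4.0 + t,
                lambda u: 2.0 * (u - 1.0)),
    "SH": (lambda u: (math.sqrt(u) - 1.0) ** 2,
           lambda t: t / (1.0 - t),
           lambda u: 1.0 - u ** -0.5),
}


def marginals(joint):
    """Row and column sums of a joint probability table."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py


def f_mutual_information(joint, f):
    """Eq. (1) with the counting measure as the dominating measure."""
    px, py = marginals(joint)
    return sum(f(joint[i][j] / (px[i] * py[j])) * px[i] * py[j]
               for i in range(len(px)) for j in range(len(py)))


def variational_bound(joint, f_conj, critic):
    """Right-hand side of eq. (2) for one fixed critic s(i, j)."""
    px, py = marginals(joint)
    cells = [(i, j) for i in range(len(px)) for j in range(len(py))]
    positive = sum(joint[i][j] * critic(i, j) for i, j in cells)
    negative = sum(px[i] * py[j] * f_conj(critic(i, j)) for i, j in cells)
    return positive - negative


def optimal_critic(joint, f_prime):
    """Eq. (3): s*(x, y) = f'(p(x, y) / (p(x) p(y)))."""
    px, py = marginals(joint)
    return lambda i, j: f_prime(joint[i][j] / (px[i] * py[j]))


# Hypothetical joint distribution of two dependent binary variables.
TOY_JOINT = [[0.30, 0.10],
             [0.15, 0.45]]

for name, (f, f_conj, f_prime) in DIVERGENCES.items():
    exact = f_mutual_information(TOY_JOINT, f)
    tight = variational_bound(TOY_JOINT, f_conj, optimal_critic(TOY_JOINT, f_prime))
    loose = variational_bound(TOY_JOINT, f_conj, lambda i, j: 0.0)
    print(f"{name}: I_f={exact:.6f} optimal critic={tight:.6f} zero critic={loose:.6f}")
```

With the critic of eq. (3) the bound should coincide with $I_f$ up to floating-point error, whereas an arbitrary critic (here the constant zero) only gives a lower bound.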
¹ More precisely, this is the monotone convex conjugate since we restrict the domain of $f$ to $\mathbb{R}_+$.

Table 1: Common choices of f-divergences. KL: Kullback-Leibler; JS: Jensen-Shannon; SH: Squared Hellinger; VLC: Vincze-Le Cam (Le Cam, 2012). For JS, we define $\varphi(u) = -(u + 1) \log \frac{1+u}{2} + u \log u$. The Tsallis-α divergence is defined in Tsallis (1988). See Appendix A.1 for more details.

| Divergence | $f(u)$ | $f^*(t)$ | $f'(u)$ | $f^* \circ f'(u)$ |
| --- | --- | --- | --- | --- |
| KL | $u \log u$ | $\exp(t - 1)$ | $\log u + 1$ | $u$ |
| JS | $\varphi(u)$ | $-\log(2 - e^t)$ | $\log 2 + \log \frac{u}{1+u}$ | $-\log \frac{2}{1+u}$ |
| Pearson $\chi^2$ | $(u - 1)^2$ | $t^2/4 + t$ | $2(u - 1)$ | $u^2 - 1$ |
| SH | $(\sqrt{u} - 1)^2$ | $\frac{t}{1 - t}$ | $1 - u^{-1/2}$ | $u^{1/2} - 1$ |
| Tsallis-α | $\frac{u^\alpha}{\alpha - 1}$ | $\left(\frac{\alpha - 1}{\alpha} t\right)^{\frac{\alpha}{\alpha - 1}}$ | $\frac{\alpha}{\alpha - 1} u^{\alpha - 1}$ | $u^\alpha$ |
| VLC | $\frac{(u - 1)^2}{u + 1}$ | $4 - t - 4\sqrt{1 - t}$ | $1 - \frac{4}{(u + 1)^2}$ | $3 - \frac{8}{u + 1} + \frac{4}{(u + 1)^2}$ |

### 3.1 f-MICL objective

Contrastive learning is a popular self-supervised method for representation learning. In contrastive learning, we expect similar sample pairs to be close to each other in the embedding space, and dissimilar pairs to be far apart. Based on the f-MI introduced in §2, we propose a general framework for contrastive learning, coined f-MICL. We denote by $g : \mathcal{X} \to \mathbb{S}^{d-1}$ the feature encoder (usually constructed by a neural network) from the input space $\mathcal{X}$ to the hypersphere, and we use the shorthands $x^g := g(x)$ and $y^g := g(y)$ to represent the feature embeddings. The notation $p_d$ stands for the data distribution, and $p_\times := p_d \otimes p_d$ means its self product (product of marginals, e.g., pairs of images). We denote by $p_+$ the distribution of positive pairs, i.e., two samples with similar feature embeddings (joint distribution, e.g., the same image with different data augmentation). Using the lower bound of the f-MI in eq. (2), we have the general f-MICL objective as follows:

$$\mathbb{E}_{(x,y) \sim p_+}\, s(x^g, y^g) - \mathbb{E}_{(x,y) \sim p_\times}\, f^* \circ s(x^g, y^g), \tag{4}$$

where $s(x^g, y^g)$ can be understood as the similarity measurement between two feature embeddings in the context of contrastive learning. Essentially, we are studying the variational lower bound eq. (2) in the feature space, with the feature embeddings learnable. We can treat the first term as the similarity score between positive pairs with similar feature embeddings, and the second term as the similarity score between two random samples, a.k.a. negative pairs. As $f^*$ is an increasing function, maximizing the f-MI is equivalent to simultaneously maximizing the similarity between positive pairs and minimizing the similarity between negative pairs.

With eq. (4) we have answered the first question: there is a spectrum of f-MICL objectives that can be applied in contrastive learning by using different $f$ functions. We will discuss how to choose the best $f$ empirically in §4.

### 3.2 f-Gaussian similarity

Previous works in contrastive learning usually adopt some heuristic similarity function such as the cosine similarity function $s(x^g, y^g) = {x^g}^\top y^g$. Although it shows promising performance in practice, our second question is: can we provide a better similarity function than the popular cosine similarity? Note that eq. (3) is a natural choice of $s(\cdot)$ from the perspective of deriving the f-MI. In the context of contrastive learning, by denoting the density functions of the marginal feature distributions as $p_g(x^g)$ and $p_g(y^g)$, and the density of the joint feature distribution as $p_g(x^g, y^g)$, from eq. (3) we have an optimal similarity function as follows:

Lemma 2 (e.g., Nguyen et al. 2010, Lemma 1). Suppose $f$ is differentiable, and the embedding function $g$ is fixed. The following similarity function $s^\star$ maximizes eq. (4):

$$s^\star(x^g, y^g) = f'\!\left(\frac{p_g(x^g, y^g)}{p_g(x^g)\, p_g(y^g)}\right). \tag{5}$$

Figure 1: Experiment for verifying Assumption 3. Here we draw the relation between the squared distances $\|x^g - y^g\|^2$ and the averaged log likelihood $\log p_g$, with $\log p_g$ estimated by the flow model RealNVP (Dinh et al., 2017). (left) Gaussian prior; (right) Uniform prior. The features are learned by SimCLR trained on CIFAR-10. See more details in Appendix D.3.

Evidently, the optimal $s^\star$ induces the f-MI on the feature space, which is a lower bound of the original f-MI. Although eq. (5) provides an optimal similarity function, it nevertheless depends on the unknown density functions. How can we implement eq. (5) in practice? Among various known density functions, it is natural to choose a typical kernel function for structured data for validation (Balcan et al., 2008; Powell, 1987; Murphy, 2012), e.g., the Gaussian kernel.

Assumption 3 (Gaussian kernel). The joint feature density (w.r.t. the uniform distribution over the hypersphere $\mathbb{S}^{d-1}$) is proportional to a Gaussian kernel, namely

$$p_g(x^g, y^g) \propto G_\sigma(\|x^g - y^g\|^2) = \mu \exp\!\left(-\frac{\|x^g - y^g\|^2}{2\sigma^2}\right), \tag{6}$$

where $\mu := C \exp(1/\sigma^2) / c^2$ is a constant that we determine below.

Since $x^g, y^g \in \mathbb{S}^{d-1}$ have unit (Euclidean) norm, we have $p_g(x^g, y^g) \propto \exp({x^g}^\top y^g / \sigma^2)$, which belongs to the von Mises-Fisher bivariate distribution (Mardia, 1975, Eq. 2.11). It is clear that the marginals of $p_g$ are uniform. Indeed, for any orthogonal matrix $Q$, with $\frac{1}{C} := \mathbb{E}_{(x^g, y^g) \sim \mathbb{S}^{d-1} \times \mathbb{S}^{d-1}} \exp({x^g}^\top y^g / \sigma^2)$ we have

$$p_g(Q x^g) = \mathbb{E}_{y^g \sim \mathbb{S}^{d-1}}\, C \exp\!\left(\frac{(Q x^g)^\top y^g}{\sigma^2}\right) = \mathbb{E}_{y^g \sim \mathbb{S}^{d-1}}\, C \exp\!\left(\frac{{x^g}^\top Q^\top y^g}{\sigma^2}\right) = p_g(x^g) =: c, \tag{7}$$

where we have used the fact that the only invariant distribution on $\mathbb{S}^{d-1}$ w.r.t. the orthogonal group is the uniform distribution. Similarly, we have $p_g(y^g) \equiv c$ (where $c$ is the reciprocal of the surface area of the hypersphere $\mathbb{S}^{d-1}$). The distribution (6) has a nice interpretation² in terms of maximum entropy (Mardia, 1975), and admits the factorization

$$p_g(x^g, y^g) = p_g(x^g) \cdot p_g(y^g \mid x^g) = p_g(y^g) \cdot p_g(x^g \mid y^g), \tag{8}$$

² As suggested by the action editor, we may also interpret the distribution (6) as a copula, i.e., a joint density on $\mathbb{S}^{d-1} \times \mathbb{S}^{d-1}$ with uniform marginals. Note that the conventional notion of copula replaces the hypersphere $\mathbb{S}^{d-1}$ with the unit interval $[0, 1]$. More generally, we could consider the copula $p_g(x^g, y^g) \propto h({x^g}^\top y^g)$ for an increasing function $h$, to capture other types of correlation between the two views $x^g$ and $y^g$.
where the marginals $p_g(x^g)$ and $p_g(y^g)$ are uniform while the conditionals $p_g(y^g \mid x^g)$ and $p_g(x^g \mid y^g)$ again belong to the von Mises-Fisher distribution.

Combining eq. (5) and Assumption 3, we can write the similarity function with the Gaussian kernel as follows:

$$s_f(x^g, y^g) = f' \circ G_\sigma(\|x^g - y^g\|^2). \tag{9}$$

As noted above, the product of marginals $p_g(x^g)\, p_g(y^g)$ is a constant, which has been absorbed into our definition of $G_\sigma$; see $\mu$ in Assumption 3. We observe that $s_f(\cdot)$ depends on the choice of $f$ as well, thus we call it the f-Gaussian similarity. As a result, we have provided a new way to design the similarity function, again from the f-MI perspective.

Verifying Assumption 3: One may object that Assumption 3 is too strong for practical usage. For example, replacing the Gaussian kernel $G_\sigma$ with any other decreasing function would also provide a valid assumption. However, we found that among several popular choices only the Gaussian kernel works well in practice. Also, we can empirically verify that Assumption 3 approximately holds. To this end, it is sufficient to check whether the log density, i.e., $\log p_g(x^g, y^g)$, is linear in the squared distance between each positive pair, i.e., $\|x^g - y^g\|^2$. In Figure 1, we use the flow-based model RealNVP (Dinh et al., 2017)³ to estimate the log density with a Gaussian prior and a uniform prior, and learn the feature encoder $g$ from SimCLR (Chen et al., 2020). We observe that the linear relationship approximately holds for CIFAR-10.⁴ We will empirically compare our f-Gaussian similarity with the cosine similarity in §4.

³ RealNVP applies real-valued non-volume preserving transformations for log-likelihood computation.

⁴ The linear relationship in Figure 1 might also depend on the data, i.e., the CIFAR-10 dataset here. In practice, other customized datasets might require additional verification.

### 3.3 Implementation

With our designed f-Gaussian similarity $s_f$ we now have an implementable f-MICL objective in eq. (4). Substituting the f-Gaussian similarity $s_f$ in eq. (9) into our objective eq. (4), we obtain a specific f-MICL objective:

$$\mathbb{E}_{(x,y) \sim p_+}\, s_f(x^g, y^g) - \mathbb{E}_{(x,y) \sim p_\times}\, f^* \circ s_f(x^g, y^g). \tag{10}$$

Given a batch of $N$ samples, its empirical estimation is as follows:

$$\frac{1}{N} \sum_{i=1}^{N} s_f(x_i^g, y_i^g) - \frac{1}{N(N-1)} \sum_{i \ne j} f^* \circ s_f(x_i^g, x_j^g), \tag{11}$$

where $x_i$ and $y_i$ are two types of data augmentation of the $i$-th sample, and $x_i$ and $x_j$ are different samples with independently sampled data augmentations.

With the f-MICL objective in eq. (11) we propose our algorithm for contrastive learning in Algorithm 1. To balance the two terms in our objective, we additionally include a weighting parameter $\alpha$ in front of the second term (which also absorbs the parameter $\mu$ in $G_\sigma$). This change can still be incorporated within our f-MICL framework, as we show in Appendix A.2.

Algorithm 1: f-MICL

Input: batch size $N$, function $f$, weighting parameter $\alpha$, constant $\mu$ (in $G_\sigma$), variance $\sigma^2$

- for each sampled mini-batch $\{z_i\}_{i=1}^N$ do
  - for $k$ in $1, \dots, N$ do
    - randomly sample two augmentation functions $t_1$ and $t_2$
    - $y_k \leftarrow t_1(z_k)$, $x_k \leftarrow t_2(z_k)$
  - define $s_f(x^g, y^g) = f' \circ G_\sigma(\|x^g - y^g\|^2)$
  - compute $L$ as $-\frac{1}{N} \sum_{i=1}^{N} s_f(x_i^g, y_i^g) + \frac{\alpha}{N(N-1)} \sum_{i \ne j} f^* \circ s_f(x_i^g, x_j^g)$
  - update $g$ by minimizing $L$

Figure 2: Network architecture of f-MICL. image$_i$: the $i$-th image in the current batch; $f$: the function used in the f-mutual information (§2); $g$: feature embedding; $t, t_1, t_2$: augmentation functions drawn from the same family $\mathcal{T}$ of augmentations; $f'$: the derivative; $f^*$: the Fenchel conjugate. The symbol $\circ$ denotes the function composition. The sum of the two terms gives the variational lower bound of f-mutual information. $x_i$ and $y_i$ are two types of data augmentation of the $i$-th sample, and $x_i$ and $x_j$ are different samples with independently sampled data augmentations. max stands for maximization. See eq. (11) for more details.

Figure 2 gives a high-level summary of our f-MICL framework. Given a batch of samples (e.g., images), we generate positive pairs via data augmentation and negative pairs using other augmented samples in the same batch. This sampling method follows SimCLR (Chen et al., 2020).

### 3.4 f-MICL family

In this section, we will deepen the understanding of f-MICL by drawing connections with some popular contrastive learning methods.

Figure 3: f-MICL generalizes InfoNCE-based objectives.

Connection with InfoNCE: Firstly, we show that InfoNCE is an upper bound of f-MICL. Recall our f-MICL objective in eq. (4), and the popular InfoNCE objective $L_{\text{InfoNCE}}$ as follows (here we take the maximization) (van den Oord et al., 2018):

$$\mathbb{E}_{(x,y) \sim p_+}\, s(x^g, y^g) - \mathbb{E}_{x \sim p_d} \log \mathbb{E}_{y \sim p_d} \exp(s(x^g, y^g)). \tag{12}$$

Consider that we perform a Donsker-Varadhan (DV) shift transformation $v$ (Donsker & Varadhan, 1975; Tsai et al., 2021) on eq. (4), so that we maximize the following objective over $v$:

$$\mathbb{E}_{(x,y) \sim p_+}\, s(x^g, y^g) - v - \mathbb{E}_{(x,y) \sim p_\times}\, f^*\big(s(x^g, y^g) - v\big). \tag{13}$$

In practice, such a shift transformation can be approximated by a scaling factor ($\alpha$ in Algorithm 1) such that eq. (4) and eq. (13) are equivalent. When $f$ generates the KL divergence, so that $f(u) = u \log u$ and $f^*(t) = \exp(t - 1)$ from Table 1, the maximizer of eq. (13) over $v$ is $v^\star = \log\big(\mathbb{E}_{(x,y) \sim p_\times} \exp(s(x^g, y^g))\big) - 1$. With $v^\star$, eq. (13) can be written as follows:

$$\mathbb{E}_{(x,y) \sim p_+}\, s(x^g, y^g) - \log \mathbb{E}_{(x,y) \sim p_\times} \exp(s(x^g, y^g)). \tag{14}$$

According to Jensen's inequality we have

$$\mathbb{E}_{x \sim p_d} \log \mathbb{E}_{y \sim p_d} \exp(s(x^g, y^g)) \le \log \mathbb{E}_{(x,y) \sim p_\times} \exp(s(x^g, y^g)). \tag{15}$$

Therefore,

$$\mathbb{E}_{(x,y) \sim p_+}\, s(x^g, y^g) - \log \mathbb{E}_{(x,y) \sim p_\times} \exp(s(x^g, y^g)) \le L_{\text{InfoNCE}}. \tag{16}$$

The above transformation shows that the InfoNCE objective is an upper bound of the f-MICL objective.
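The chain from eq. (13) to eq. (16) can be checked numerically. The sketch below is illustrative only: the score values are made up, the helper names are hypothetical, and expectations are replaced by plain averages over a small table of negative-pair scores.

```python
import math


def mean(values):
    return sum(values) / len(values)


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    top = max(values)
    return top + math.log(mean([math.exp(v - top) for v in values]))


def shifted_kl_objective(pos_scores, neg_scores, v):
    """Eq. (13) with the KL conjugate f*(t) = exp(t - 1)."""
    return mean(pos_scores) - v - mean([math.exp(s - v - 1.0) for s in neg_scores])


def dv_objective(pos_scores, neg_scores):
    """Eq. (14): the value of eq. (13) at the optimal shift."""
    return mean(pos_scores) - log_mean_exp(neg_scores)


def infonce_objective(pos_scores, neg_rows):
    """Eq. (12): the log-mean-exp is taken per anchor, then averaged."""
    return mean(pos_scores) - mean([log_mean_exp(row) for row in neg_rows])


# Hypothetical similarity scores: NEG_ROWS[i][j] = s(x_i, y_j) for random pairs.
POS_SCORES = [0.9, 0.7, 0.8]
NEG_ROWS = [[0.1, -0.4, 0.3],
            [-0.2, 0.5, 0.0],
            [0.6, -0.1, -0.7]]
NEG_FLAT = [s for row in NEG_ROWS for s in row]

v_star = log_mean_exp(NEG_FLAT) - 1.0  # maximizer of eq. (13) for KL
print("eq. (13) at v*:", shifted_kl_objective(POS_SCORES, NEG_FLAT, v_star))
print("eq. (14):      ", dv_objective(POS_SCORES, NEG_FLAT))
print("eq. (12):      ", infonce_objective(POS_SCORES, NEG_ROWS))
```

The first two printed values should agree, and the third should be at least as large, which is the Jensen gap of eq. (15).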
In other words, maximizing f-MICL can potentially increase the InfoNCE objective.

Connection with Alignment and Uniformity (AU) (Wang & Isola, 2020): We further show that the Alignment and Uniformity (AU) loss is a special case of f-MICL. Wang & Isola (2020) show that InfoNCE approximately aligns positive feature embeddings while encouraging uniformly distributed negative ones. Wang & Isola (2020) further propose a new objective which quantifies such properties. Here we show that this new objective is essentially a subclass of the InfoNCE loss under the f-MICL framework. Concretely, applying the f-Gaussian similarity function for the KL divergence, we have $f'(u) = \log u + 1$ from Table 1 and thus, up to an additive constant and the scaling $1/(2\sigma^2)$, $s_f(x^g, y^g) = -\|x^g - y^g\|^2$. Using $s_f(x^g, y^g)$ in eq. (14) we can recover the AU objective:

$$-\mathbb{E}_{(x,y) \sim p_+} \|x^g - y^g\|^2 - \log \mathbb{E}_{(x,y) \sim p_\times} \exp\!\big(-\|x^g - y^g\|^2\big). \tag{17}$$

Note that for KL, this is equivalent to the cosine similarity with a scaling factor: $-\|x^g - y^g\|^2 = 2\,{x^g}^\top y^g - 2$.

Connection with the Spectral Contrastive Loss: Here we show that the f-MICL objective is closely related to the Spectral Contrastive Loss (HaoChen et al., 2021). Recall our objective:

$$\mathbb{E}_{(x,y) \sim p_+}\, s_f(x^g, y^g) - \mathbb{E}_{(x,y) \sim p_\times}\, f^* \circ s_f(x^g, y^g), \tag{18}$$

where $s_f(x^g, y^g) = f' \circ G_\sigma(\|x^g - y^g\|^2)$. If we choose the Pearson $\chi^2$ divergence, where $f(u) = (u - 1)^2$, $f'(u) = 2(u - 1)$, and $f^* \circ f'(u) = u^2 - 1$, we have our $\chi^2$-MICL objective:

$$2\, \mathbb{E}_{(x,y) \sim p_+}\, G_\sigma(\|x^g - y^g\|^2) - \mathbb{E}_{(x,y) \sim p_\times}\, G_\sigma(\|x^g - y^g\|^2)^2 - 1. \tag{19}$$

Up to an additive constant, this recovers the spectral contrastive loss if we choose the proper hyperparameter and apply the cosine similarity instead. Thus the spectral contrastive loss is a special case of $\chi^2$-MICL.

More on AU: Finally, based on our objective in eq. (10) we will show that the alignment and uniformity (AU) property of InfoNCE also extends to the general f-MICL family:

(1) Alignment: In the ideal case, maximizing the first term of eq. (10) would yield $x^g = y^g$ for all $(x, y) \sim p_+$, i.e., similar sample pairs should have aligned representations. Note that the derivative $f'$ is increasing since $f$ is convex.

(2) Uniformity: We demonstrate the uniformity property by minimizing the second term of eq. (10), or more rigorously and realistically, its empirical version in eq. (11).

Theorem 4 (Uniformity). Suppose that the batch size $N$ satisfies $2 \le N \le d + 1$, with $d$ the dimension of the feature space. If the real function

$$h(t) = f^* \circ f' \circ G_\sigma(t) \ \text{ is strictly convex on } [0, 4], \tag{20}$$

then all minimizers of the second term of eq. (11), i.e., $\sum_{i \ne j} f^* \circ s_f(x_i^g, x_j^g)$, satisfy the following condition: the feature representations of all samples are distributed uniformly on the unit hypersphere $\mathbb{S}^{d-1}$.

In Theorem 4, the assumption $N \le d + 1$ is always satisfied in our experiments in §4. For instance, on CIFAR-10 we chose $N = d = 512$. Also, we say that the samples are distributed uniformly if the feature vectors form a regular simplex, so that the distances between all sample pairs are the same. Although minimizing the negative term gives uniformity, the positive term is also needed for aligning similar pairs, as we observe in §4. This implies a tradeoff between alignment and uniformity.

In fact, eq. (20) provides us guidance to select proper f-divergences for uniformity. In Table 1, we list some common choices of f-divergences. By inspecting the last column and using the definition of $G_\sigma$, we can easily verify that they all satisfy eq. (20). However, this is not true for all f-divergences. In Appendix A.1 we also provide some counterexamples that violate eq. (20) and thus the assumption of Theorem 4, such as the Reversed Kullback-Leibler (RKL) and the Neyman $\chi^2$ divergences. Experimentally, we found that these divergences generally result in feature collapse (i.e., all feature vectors are the same) and thus poor performance in downstream applications.
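As a rough illustration of the loss in Algorithm 1 and of the uniformity statement in Theorem 4, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, features are fixed unit vectors rather than encoder outputs, and the defaults `alpha = sigma = mu = 1` are arbitrary choices for the demonstration.

```python
import math


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def gaussian_kernel(sq_dist, sigma=1.0, mu=1.0):
    """G_sigma from Assumption 3, evaluated at a squared distance."""
    return mu * math.exp(-sq_dist / (2.0 * sigma ** 2))


def fmicl_loss(xs, ys, f_prime, f_conj, alpha=1.0, sigma=1.0, mu=1.0):
    """Loss L from Algorithm 1; xs[i] and ys[i] are unit-norm views of sample i."""
    n = len(xs)

    def similarity(a, b):
        # f-Gaussian similarity of eq. (9): s_f = f' o G_sigma.
        return f_prime(gaussian_kernel(squared_distance(a, b), sigma, mu))

    positive = sum(similarity(x, y) for x, y in zip(xs, ys)) / n
    negative = sum(f_conj(similarity(xs[i], xs[j]))
                   for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    return -positive + alpha * negative


def points_on_circle(angles_deg):
    """Unit vectors in the plane (d = 2) at the given angles."""
    return [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles_deg]


KL = (lambda u: math.log(u) + 1.0, lambda t: math.exp(t - 1.0))   # (f', f*)
PEARSON = (lambda u: 2.0 * (u - 1.0), lambda t: t * t / 4.0 + t)  # (f', f*)

# N = 3 features with d = 2, so N <= d + 1 as in Theorem 4.
simplex = points_on_circle([0, 120, 240])    # regular simplex (equilateral triangle)
skewed = points_on_circle([0, 90, 180])      # unevenly spread features
collapsed = points_on_circle([0, 0, 0])      # feature collapse

for name, (f_prime, f_conj) in (("KL", KL), ("Pearson", PEARSON)):
    # Each configuration is perfectly aligned (ys = xs), so only the
    # negative term differs between the three cases.
    losses = [fmicl_loss(pts, pts, f_prime, f_conj) for pts in (simplex, skewed, collapsed)]
    print(name, [round(v, 4) for v in losses])
```

With the positive pairs perfectly aligned in all three cases, the regular simplex should give the smallest loss and the collapsed configuration the largest, for both KL and Pearson.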
## 4 Experiments

In this section, we empirically evaluate our answers to the two questions: (1) Can we go beyond the KL-based objective (§4.2): we apply various f-MICL objectives to popular vision and language datasets. In particular, we show that under the same network architecture design, f-MICL can always provide a better choice of objective. We observe that the best-performing f-divergence is largely dataset dependent. (2) Can we design a better similarity function (§4.3): we show that the proposed f-Gaussian similarity is more powerful than the heuristic cosine similarity, regardless of the choice of $f$. Moreover, we confirm empirically in §4.4 that f-MICL extends the nice property of alignment and uniformity.

### 4.1 Experimental settings

Our detailed settings can be found in Appendix D. In all our experiments, we change only the objective of different methods for fair comparison. We use the f-Gaussian similarity in f-MICL by default.

Vision task. Our vision datasets include CIFAR-10 (Krizhevsky et al., 2009), STL-10 (Coates et al., 2011), Tiny ImageNet (Chrabaszcz et al., 2017), and ImageNet (Deng et al., 2009) for image classification. After learning the feature embeddings, we evaluate the quality of representations using the test accuracy via a linear classifier. Note that we use $\alpha = 40$ across all vision experiments. (1) Smaller datasets: For feature encoders, we use ResNet-18 (He et al., 2016) for CIFAR-10 and ResNet-50 (He et al., 2016) for the rest. Our implementation is based on SimCLR (Chen et al., 2020), where we used the same 3-layer projection head during training. All models are trained for 800 epochs. (2) ImageNet: We choose Vision Transformer (ViT-S) (Dosovitskiy et al., 2020) as our feature encoder. We choose the smaller ViT-S model with 6 blocks because larger ViT models are extremely expensive to train on GPUs. Our implementations are based on MoCo v3 (Chen et al., 2021), where models are trained for 1000 epochs.
Table 2: We compare the test accuracy (%) obtained with the linear evaluation on the vision datasets. On the Wikipedia dataset, we compare the semantic textual similarity (STS) via the Spearman's correlation. For each dataset and each method we take three different runs to get the mean and the standard deviation. Baselines: MoCo, SimCLR, AU. f-MICL: KL, JS, Pearson, VLC.

| Dataset | MoCo | SimCLR | AU | KL | JS | Pearson | VLC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CIFAR-10 | 90.30 ± 0.19 | 89.71 ± 0.37 | 90.41 ± 0.26 | 90.61 ± 0.47 | 89.66 ± 0.28 | 89.35 ± 0.52 | 89.13 ± 0.33 |
| STL-10 | 83.69 ± 0.22 | 82.97 ± 0.32 | 84.44 ± 0.19 | 85.33 ± 0.39 | 85.94 ± 0.17 | 82.64 ± 0.37 | 85.94 ± 0.72 |
| Tiny ImageNet | 35.72 ± 0.17 | 30.56 ± 0.28 | 41.20 ± 0.19 | 39.46 ± 0.20 | 42.98 ± 0.18 | 43.45 ± 0.54 | 38.65 ± 0.45 |
| Wikipedia | 77.88 ± 0.15 | 77.40 ± 0.12 | 77.95 ± 0.08 | 78.02 ± 0.13 | 76.76 ± 0.09 | 77.59 ± 0.12 | 55.07 ± 0.13 |

Table 3: We compare the test accuracy (%) with SOTA methods on ImageNet. We take three different runs to get the mean, where the standard deviations are less than 0.1% for f-MICL. Baselines: SwAV, BYOL, Barlow Twins, VICReg, RényiCL, MoCo v3. f-MICL: KL, JS, Pearson.

| Dataset | SwAV | BYOL | Barlow Twins | VICReg | RényiCL | MoCo v3 | KL | JS | Pearson |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ImageNet | 75.3 | 74.3 | 73.2 | 73.2 | 76.2 | 73.2 | 73.9 | 76.5 | 74.6 |

Language task. To show the wide applicability of our f-MICL framework, we also conduct experiments on a natural language dataset, English Wikipedia (Gao et al., 2021). We follow the experimental setting in Gao et al. (2021), which applies BERT-based models to SimCLR (Devlin et al., 2019; Liu et al., 2019). Specifically, we choose the BERT-base model due to limited computing resources. For f-MICL objectives, we choose $\alpha = 409600$. The application task is called semantic textual similarity (STS (Agirre et al., 2013)) and we report the averaged Spearman's correlation in Table 2 for comparison.

### 4.2 f-MICL objectives

Smaller datasets and language task. We first compare f-MICL with several InfoNCE-based contrastive learning algorithms (i.e., SimCLR (Chen et al., 2020), MoCo (He et al., 2020), and AU (Wang & Isola, 2020)) on smaller datasets and the language task in Table 2.
Here we choose four f-divergences with the best overall performance. See Appendix D for results on other f-divergences. From Table 2 we observe that: (1) As we have shown in §3.4 that f-MICL generalizes InfoNCE-based objectives, empirically KL-MICL achieves similar performance to the baselines. In practice, we can tune the hyperparameter $\alpha$ such that KL-MICL outperforms the InfoNCE-based objectives. (2) KL-MICL is not always the optimal choice. We can see that the best-performing f-MICL objectives correspond to four different f-divergences on the four datasets.

The above results indicate that f-MICL can provide a wide range of objective choices for the downstream tasks. Although how to derive an optimal f-divergence deserves more study in theory, in practice we can select the best $f$ among several common f-divergences on a validation set.

Besides the f-divergences in Table 1, we have identified f-divergences that do not satisfy the condition of Theorem 4. In our experiments, we found that applying such f-divergences, e.g., the RKL and Neyman $\chi^2$ divergences, would result in feature collapse. For example, in Figure 4 we show that with RKL the features all collapse to a constant.

ImageNet results: To further demonstrate the efficacy of f-MICL we then compare with several popular self-supervised learning methods, including both contrastive and non-contrastive ones. These methods can be categorized into the ResNet-based (SwAV (Caron et al., 2020), BYOL (Huynh et al., 2020), Barlow Twins (Zbontar et al., 2021), VICReg (Bardes et al., 2021), and RényiCL (Lee & Shin, 2022)) and the ViT-based (Dosovitskiy et al., 2020). For the ResNet-based methods, we directly retrieve results from Bardes et al. (2021), which are obtained by training ResNet-50 models for 1000 epochs. Chen et al. (2021) show that these two types of methods are directly comparable in terms of the model size and supervised learning performance. For the ViT-based method and our f-MICL, we apply ViT-S for 1000 epochs. Specifically, our f-MICL follows MoCo v3 with only the objective changed for different choices of the f-divergence. Results in Table 3 show that: (1) By only changing the objective function, our method improves MoCo v3 by 2.1% using JS-MICL. (2) f-MICL objectives are comparable with the ResNet-based methods (e.g., SwAV and RényiCL).

Overall, our experiments confirm that f-MICL can provide a better choice of objective than InfoNCE on a variety of datasets, tasks, and encoder architectures.

Table 4: Comparison between the cosine and f-Gaussian similarities on CIFAR-10 with the test accuracy (%). For the Tsallis-α divergence we take $\alpha = 3$.

| Similarity | KL | JS | Pearson | SH | Tsallis-α | VLC |
| --- | --- | --- | --- | --- | --- | --- |
| Cosine | 89.95 ± 0.26 | 88.06 ± 0.33 | 87.79 ± 0.42 | 87.06 ± 0.55 | 88.55 ± 0.28 | 10.00 ± 0.00 |
| Gaussian | 90.61 ± 0.47 | 89.66 ± 0.28 | 89.35 ± 0.52 | 89.52 ± 0.25 | 89.15 ± 0.42 | 89.13 ± 0.33 |

Figure 4: (left and middle) Distances between pairs of normalized features within a batch. Green region: similar pairs. Orange region: dissimilar pairs. f-MICL gives nearly uniform distances for dissimilar pairs for the f-divergences in Table 1. For non-satisfying f-divergences such as the RKL, the features collapse to a constant and thus the distances are zero. (right) The test accuracy vs. the batch size after training 200 epochs for all algorithms.

### 4.3 f-Gaussian similarity

Next, we want to examine the effect of our similarity function while fixing the f-divergences. In Table 4 we compare the cosine and f-Gaussian similarities for different f-divergences on CIFAR-10. It can be seen that under our f-MICL framework the f-Gaussian similarity consistently outperforms the cosine similarity for various f-divergences.⁵ This agrees with our Lemma 2 and eq. (9), and also supports the validity of Assumption 3.
In particular, we identify that without the theoretical guarantee, the heuristic cosine similarity would fail for certain f-MICL objectives (e.g., VLC).

⁵ As we have shown the equivalence between the cosine and Gaussian similarities for KL, the differing results for KL merely reflect different choices of scaling factors.

4.4 Alignment and uniformity test

We empirically check the properties of alignment and uniformity for f-MICL by plotting the pairwise distance $\|x_i^g - x_j^g\|$ of the feature representations within the same batch on CIFAR-10. We compute the distances between the normalized features of every pair from a random batch, and then sort the pairs in increasing order. From Figure 4 we can see that f-MICL gives nearly uniform distances for dissimilar pairs (orange regions) on both datasets with various proper f-divergences. In contrast, a randomly initialized model gives a less uniform distribution for dissimilar pairs. Besides, we observe small pairwise distances for similar pairs (green regions).

4.5 Sensitivity to batch size

Finally, we study the sensitivity of our f-MICL framework to the batch size on CIFAR-10. On the right panel of Figure 4, we evaluate the classification accuracy by varying the batch size for different f-divergences and SimCLR. For all batch sizes and with a proper choice of f-divergence, our performance is always better than SimCLR. In other words, we require fewer negative samples to achieve the same performance.

5 Related Work

Contrastive learning. Self-supervised contrastive learning learns representations by contrasting sample pairs. Recently it has been shown analytically that improving the contrastiveness can benefit the downstream applications (Saunshi et al., 2019; Tosh et al., 2021).
For popular contrastive learning methods such as Contrastive Predictive Coding (CPC) (van den Oord et al., 2018), SimCLR (Chen et al., 2020), and MoCo (He et al., 2020), the loss functions can be interpreted as a lower bound of mutual information, which is essentially the KL divergence between the joint distribution and the product of marginal distributions. Besides the KL divergence, other statistical divergences or distances have been individually studied in the context of contrastive learning, e.g., the Wasserstein distance (Ozair et al., 2019), Pearson χ² divergence (Tsai et al., 2021), and Jensen-Shannon divergence (Hjelm et al., 2018).

f-divergences have been widely used in generative models (Nowozin et al., 2016) and domain adaptation (Acuna et al., 2021) for measuring the discrepancy between two distributions, where the variational lower bound is often employed for estimation. Compared to f-GAN (Nowozin et al., 2016) and f-DAL (Acuna et al., 2021), which minimize the f-divergence between two different distributions, our f-MICL objective maximizes the f-divergence between the joint distribution and the product of marginal distributions. This agrees with our purpose of contrasting sample pairs. Moreover, we provide a theoretical criterion for choosing proper f-divergences.

Mutual information also plays an important role in deep representation learning (Tian et al., 2020a; Bachman et al., 2019; Hjelm et al., 2018; Tian et al., 2020b; Poole et al., 2019; Belghazi et al., 2018). Our work differs in two respects: (a) loss-function-wise, our losses partially cover the losses in the literature and generalize them, e.g., Poole et al. (2019) consider several variational lower bounds of mutual information, of which we generalize the DV objective; (b) application-wise, none of them considers contrastive learning, e.g., Poole et al. (2019) consider mutual information estimation, Belghazi et al. (2018) improve adversarial generative models, and Hjelm et al. (2018) consider representation learning that maximizes local and global information.

Metric learning. Our work is closely related to metric learning (Kaya & Bilge, 2019; Suárez-Díaz et al., 2018), which aims to learn a distance metric that brings similar objects closer and pushes dissimilar objects further apart. In contrastive learning, a pre-defined similarity metric, e.g., the cosine similarity (Chen et al., 2020; He et al., 2020) or a bilinear function (van den Oord et al., 2018; Tian et al., 2020a; Hénaff et al., 2020), is commonly used to measure the sample similarity. These pre-designed metrics may not necessarily lead to satisfactory performance in practice. In comparison, the design of our similarity function is empirically tailored for contrastive learning.

Finally, we summarize existing representation learning methods that utilize f-divergences and compare them with f-MICL in Table 5 for a clear view of the literature.

Table 5: Comparison between different representation learning methods that apply f-divergences.

| Method | Objective | f-divergence | Similarity | Task |
|---|---|---|---|---|
| CPC (van den Oord et al., 2018) | InfoNCE | KL | log-bilinear | predictive coding |
| RPC (Tsai et al., 2021) | RPC | Pearson χ² | cosine | predictive coding |
| MINE (Belghazi et al., 2018) | DV bound of MI | KL | neural net | GAN generation |
| DIM (Hjelm et al., 2018) | JSD/InfoNCE | JS & KL | neural net | representation learning |
| Poole et al. (2019) | MI bounds | KL | joint/separable | representation learning |
| SimCLR (Chen et al., 2020) | InfoNCE | KL | cosine | contrastive learning |
| MoCo (He et al., 2020) | InfoNCE | KL | cosine | contrastive learning |
| AU (Wang & Isola, 2020) | InfoNCE | KL | Gaussian | contrastive learning |
| f-MICL (Ours) | f-MI | general | f-Gaussian | contrastive learning |

6 Conclusion

We developed f-MICL for contrastive learning, which generalizes the KL-based mutual information to the f-mutual information. With f-MICL we provided a broad spectrum of objective choices with better downstream performance.
We also proposed a novel f-Gaussian similarity function, which shows superior performance to the commonly used cosine similarity. In addition, we confirmed the generalization of f-MICL by comparing with popular InfoNCE-based objectives. Empirically, we exhibited the efficacy of f-MICL across a wide range of datasets from both vision and natural language.

Limitations and future work. While f-MICL provides a variety of objective functions, it is still unclear in theory how to choose an optimal f for a given task and dataset, so in practice we usually rely on a validation set for selection. An interesting direction for future work is to learn an optimal f-divergence using a parametrized neural network. Moreover, Lee & Shin (2022) applied the skew Rényi divergence for contrastive learning. However, we observe that applying Rényi-MICL naively leads to a large variance (similar to Section 4.1 in Lee & Shin 2022), and we leave the discussion of skew divergences for future work. Additionally, McAllester & Stratos (2020) showed that there exist inherent statistical limitations on accurately estimating the mutual information with various lower bounds. In future work it would be interesting to examine whether such limitations extend to f-MI, and whether a limited estimation of f-MI necessarily affects f-MICL, whose goal is to compare and learn representations through (lower bounds of) f-MI.

Acknowledgement

We thank the reviewers and the action editor for their constructive comments. Part of this work was performed during YL's internship at NRC. YY thanks NSERC and CIFAR for funding support. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.

References

David Acuna, Guojun Zhang, Marc T Law, and Sanja Fidler. f-domain-adversarial learning: Theory and algorithms. In ICML, 2021.
Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. *SEM 2013 shared task: Semantic textual similarity. In Second joint conference on lexical and computational semantics (*SEM), volume 1: proceedings of the Main conference and the shared task: semantic textual similarity, 2013.
Syed Mumtaz Ali and Samuel D Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Methodological), 1966.
Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. NeurIPS, 2019.
Maria-Florina Balcan, Avrim Blum, and Nathan Srebro. A theory of learning with similarity functions. Machine Learning, 72(1):89-112, 2008.
Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021.
Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. JMLR, 2019.
Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. MINE: mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.
Sergiy V Borodachov, Douglas P Hardin, and Edward B Saff. Discrete energy on rectifiable sets. Springer, 2019.
Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33:9912-9924, 2020.
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020. URL http://proceedings.mlr.press/v119/chen20j.html.
Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750-15758, 2021.
Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9640-9649, 2021.
Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of ImageNet as an alternative to the CIFAR datasets. arXiv preprint arXiv:1707.08819, 2017.
Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215-223. JMLR Workshop and Conference Proceedings, 2011.
Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, pp. 229-318, 1967.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR. IEEE, 2009.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In ICLR, 2017.
Monroe D Donsker and SR Srinivasa Varadhan. Asymptotic evaluation of certain markov process expectations for large time, i. Communications on Pure and Applied Mathematics, 28(1):1-47, 1975.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings. In Empirical Methods in Natural Language Processing (EMNLP), 2021.
Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271-21284, 2020.
Jeff Z HaoChen, Colin Wei, Adrien Gaidon, and Tengyu Ma. Provable guarantees for self-supervised deep learning with spectral contrastive loss. NeurIPS, 2021.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
Olivier J Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. In ICML. PMLR, 2020.
R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, 2018.
Tri Huynh, Simon Kornblith, Matthew R Walter, Michael Maire, and Maryam Khademi. Boosting contrastive self-supervised learning with false negative cancellation. arXiv preprint arXiv:2011.11765, 2020.
Mahmut Kaya and Hasan Şakir Bilge. Deep metric learning: A survey. Symmetry, 2019.
Diederik P Kingma and Prafulla Dhariwal. Glow: generative flow with invertible 1×1 convolutions. In NeurIPS, 2018.
Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902-1914, 2001.
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images, 2009. Technical report.
Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer Science & Business Media, 2012.
Kyungmin Lee and Jinwoo Shin. RényiCL: Contrastive representation learning with skew Rényi divergence. Advances in Neural Information Processing Systems, 35:6463-6477, 2022.
Ralph Linsker. Self-organization in a perceptual network. Computer, 1988.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. International Conference on Learning Representations, 2017.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2018.
K. V. Mardia. Statistics of directional data. Journal of the Royal Statistical Society: Series B, 37(3):349-371, 1975. URL https://doi.org/10.1111/j.2517-6161.1975.tb01550.x.
David McAllester and Karl Stratos. Formal limitations on the measurement of mutual information. In International Conference on Artificial Intelligence and Statistics, pp. 875-884. PMLR, 2020.
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018.
Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
Xuan Long Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 2010.
Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NeurIPS, 2016.
Sherjil Ozair, Corey Lynch, Yoshua Bengio, Aaron van den Oord, Sergey Levine, and Pierre Sermanet. Wasserstein dependency measure for representation learning. In NeurIPS, 2019.
Ben Poole, Sherjil Ozair, Aaron Van Den Oord, Alex Alemi, and George Tucker. On variational bounds of mutual information. In International Conference on Machine Learning, pp. 5171-5180. PMLR, 2019.
M. J. D. Powell. Radial Basis Functions for Multivariable Interpolation: A Review. Clarendon Press, USA, 1987. ISBN 0198536127.
Ralph Rockafellar. Characterization of the subdifferentials of convex functions. Pacific Journal of Mathematics, 1966.
Igal Sason and Sergio Verdú. f-divergence inequalities. IEEE Transactions on Information Theory, 2016.
Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In ICML, 2019. URL http://proceedings.mlr.press/v97/saunshi19a.html.
Juan Luis Suárez-Díaz, Salvador García, and Francisco Herrera. A tutorial on distance metric learning: Mathematical foundations, algorithms, experimental analysis, prospects and challenges (with appendices on mathematical background and detailed algorithms explanation). arXiv preprint arXiv:1812.05944, 2018.
Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In ECCV. Springer, 2020a.
Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning. In NeurIPS, 2020b. URL https://proceedings.neurips.cc//paper_files/paper/2020/hash/4c2e5eaae9152079b9e95845750bb9ab-Abstract.html.
Christopher Tosh, Akshay Krishnamurthy, and Daniel Hsu. Contrastive learning, multi-view redundancy, and linear models.
In Algorithmic Learning Theory. PMLR, 2021.
Yao-Hung Hubert Tsai, Martin Q Ma, Muqiao Yang, Han Zhao, Louis-Philippe Morency, and Ruslan Salakhutdinov. Self-supervised representation learning with relative predictive coding. In ICLR, 2021.
Constantino Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of statistical physics, 1988.
Michael Tschannen, Josip Djolonga, Paul K. Rubenstein, Sylvain Gelly, and Mario Lucic. On mutual information maximization for representation learning. In ICLR, 2020.
Jean-Baptiste Hiriart Urruty and Claude Lemaréchal. Convex analysis and minimization algorithms. Springer-Verlag, 1993.
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML. PMLR, 2020.
Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pp. 12310-12320. PMLR, 2021.
Shaofeng Zhang, Feng Zhu, Junchi Yan, Rui Zhao, and Xiaokang Yang. Zero-CL: Instance and feature decorrelation for negative-free symmetric contrastive learning. In International Conference on Learning Representations, 2021.

Table 6: A summary of common f-divergences. KL: Kullback-Leibler; JS: Jensen-Shannon; SH: Squared Hellinger. For JS, we define $\varphi(u) = -(u+1)\log\frac{1+u}{2} + u\log u$. For Pearson χ², we take $f^*(t) = -1$ if $t \le -2$. For Jeffrey, $\widehat{W} = W + W^{-1}$ and $W(\cdot)$ is the Lambert-W product log function. The Tsallis-α divergence is defined in Tsallis (1988) and we have α > 1 for f-divergences. We ignore the additive constant of magnitude $1/(\alpha-1)$ because it does not change the optimization problem. The Vincze-Le Cam divergence can be found in Le Cam (2012) and is closely related to the χ² and Hellinger divergences. For the Vincze-Le Cam divergence we require $-3 < t < 1$ and $f^*(t) = -1$ if $t \le -3$.

| Divergence | $f(u)$ | $f^*(t)$ | $f'(u)$ | $f^* \circ f'(u)$ |
|---|---|---|---|---|
| KL | $u\log u$ | $\exp(t-1)$ | $\log u + 1$ | $u$ |
| Reverse KL | $-\log u$ | $-1-\log(-t)$ | $-1/u$ | $\log u - 1$ |
| JS | $\varphi(u)$ | $-\log(2-e^{t})$ | $\log 2 + \log\frac{u}{1+u}$ | $-\log 2 + \log(1+u)$ |
| Pearson χ² | $(u-1)^2$ | $t^2/4 + t$ | $2(u-1)$ | $u^2 - 1$ |
| SH | $(\sqrt{u}-1)^2$ | $\frac{t}{1-t}$ | $1 - u^{-1/2}$ | $u^{1/2} - 1$ |
| Neyman χ² | $\frac{(1-u)^2}{u}$ | $2 - 2\sqrt{1-t}$ | $1 - u^{-2}$ | $2 - 2u^{-1}$ |
| Jeffrey | $(u-1)\log u$ | $\widehat{W}(e^{1-t}) + t - 2$ | $1 - \frac{1}{u} + \log u$ | $\widehat{W}(e^{1/u}/u) + \log u - \frac{1+u}{u}$ |
| Tsallis-α | $\frac{u^{\alpha}}{\alpha-1}$ | $\left(\frac{\alpha-1}{\alpha}\,t\right)^{\frac{\alpha}{\alpha-1}}$ | $\frac{\alpha}{\alpha-1}u^{\alpha-1}$ | $u^{\alpha}$ |
| Vincze-Le Cam | $\frac{(u-1)^2}{u+1}$ | $4 - t - 4\sqrt{1-t}$ | $\frac{(u-1)(u+3)}{(u+1)^2}$ | $3 - \frac{4}{u+1}$ |

A Additional theoretical results

In this appendix, we provide additional theoretical results, including additional f-divergences and the theory for weighting parameters.

Notations. We assume that a dominating measure λ (e.g., Lebesgue) is given and all other probability measures are represented as some density w.r.t. λ. Given the joint density $p(x, y)$, we denote $p(x) = \int p(x, y)\,\mathrm{d}\lambda(y)$ and $p(y) = \int p(x, y)\,\mathrm{d}\lambda(x)$ as the marginals. We use $\mathrm{supp}(\cdot)$ to denote the support of a distribution, and $f^*$ to denote the conjugate of a function $f$. Every norm presented is Euclidean. We use $x^g := g(x)$ as the shorthand notation for the feature embedding, with $x$ a raw sample. The notation $p_d$ stands for the data distribution, and $p_\times := p_d \otimes p_d$ means its self product. We denote $p_+$ as the distribution of positive pairs, i.e., two samples with similar feature embeddings. The symbol $\circ$ denotes function composition.

A.1 Additional f-divergences

We expand Table 1 and give more examples of f-divergences in Table 6. As we will see in the proof of Theorem 4, Table 1 gives a special class of f-divergences that guarantees uniformity. A detailed description of f-divergences can be found in, e.g., Sason & Verdú (2016).
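As a sanity check on the last column of Table 6, the following sketch numerically verifies several rows. It relies on the Fenchel-Young equality for a differentiable convex $f$, namely $f^*(f'(u)) = u f'(u) - f(u)$, which is a standard fact rather than something stated in this appendix. The dictionary `DIVERGENCES`, the helper `max_residual`, and the evaluation grid are illustrative choices, not part of the paper's code.

```python
import math

# Each entry holds (f, f', claimed f* o f') for one row of Table 6.
DIVERGENCES = {
    "KL": (
        lambda u: u * math.log(u),
        lambda u: math.log(u) + 1.0,
        lambda u: u,
    ),
    "JS": (
        lambda u: -(u + 1.0) * math.log((1.0 + u) / 2.0) + u * math.log(u),
        lambda u: math.log(2.0) + math.log(u / (1.0 + u)),
        lambda u: -math.log(2.0) + math.log(1.0 + u),
    ),
    "Pearson": (
        lambda u: (u - 1.0) ** 2,
        lambda u: 2.0 * (u - 1.0),
        lambda u: u ** 2 - 1.0,
    ),
    "SH": (
        lambda u: (math.sqrt(u) - 1.0) ** 2,
        lambda u: 1.0 - u ** -0.5,
        lambda u: math.sqrt(u) - 1.0,
    ),
}


def max_residual(name, grid=(0.1, 0.5, 1.0, 2.0, 5.0)):
    """Largest gap between u*f'(u) - f(u) and the tabulated f* o f'(u)."""
    f, f_prime, composed = DIVERGENCES[name]
    return max(abs(u * f_prime(u) - f(u) - composed(u)) for u in grid)


# Residuals should vanish up to floating-point error for every listed row.
residuals = {name: max_residual(name) for name in DIVERGENCES}
```

The same check can be extended to other rows by adding their $f$, $f'$, and tabulated composition to the dictionary.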
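A.2 Weighting parameters

In Algorithm 1 we added a weighting parameter α to balance the alignment and uniformity. We prove that even after adding this parameter we are still maximizing the f-mutual information, although with respect to a different f.

Proposition 5 (weighting parameter). Given α > 0 and a closed convex function $f : \mathbb{R}_+ \to \mathbb{R}$ such that $f(1) = 0$, define $f_\alpha : \alpha \cdot \mathrm{dom} f \to \mathbb{R}$ with

$$f_\alpha(x) = \alpha f\!\left(\frac{x}{\alpha}\right) - \alpha f\!\left(\frac{1}{\alpha}\right)$$

for any $x \in \alpha \cdot \mathrm{dom} f$. Then $I_{f_\alpha}$ is still a valid f-mutual information (see Definition 1). Besides, by replacing $f$ with $f_\alpha$ in eq. (10) we have the following optimization problem:

$$\sup_{g \in \mathcal{G}} \; \mathbb{E}_{(x,y)\sim p_+}\, f' \circ G_\sigma(\|x^g - y^g\|^2) \;-\; \alpha\, \mathbb{E}_{(x,y)\sim p_\times}\, f^* \circ f' \circ G_\sigma(\|x^g - y^g\|^2),$$

where $G_\sigma(\|x^g - y^g\|^2) = \mu \exp\!\left(-\frac{\|x^g - y^g\|^2}{2\sigma^2}\right)$ is the Gaussian kernel.

Note that $\alpha \cdot \mathrm{dom} f$ denotes scalar multiplication of a set, applied element-wise. According to Definition 1, $f_\alpha$ also defines a valid f-divergence. This proposition tells us that rescaling the second term with factor α is equivalent to changing the function $f$ to another convex function $f_\alpha$. The transformation from $f$ to $\alpha f(x/\alpha)$ is also known as right scalar multiplication (Urruty & Lemaréchal, 1993). Let us now move on to our proof:

Proof. By definition, $f_\alpha$ is convex and closed with $f_\alpha(1) = 0$, and thus $I_{f_\alpha}$ is a valid f-mutual information according to Definition 1. Moreover, we have $f_\alpha'(x) = f'(x/\alpha)$ for any $x \in \alpha \cdot \mathrm{dom} f$ and

$$\begin{aligned}
f_\alpha^*(t) &= \sup_{x \in \mathrm{dom} f_\alpha} \; xt - f_\alpha(x) \\
&= \sup_{x \in \alpha \cdot \mathrm{dom} f} \; xt - \alpha f\!\left(\frac{x}{\alpha}\right) + \alpha f\!\left(\frac{1}{\alpha}\right) \\
&= \sup_{x/\alpha \in \mathrm{dom} f} \; \frac{x}{\alpha}(\alpha t) - \alpha f\!\left(\frac{x}{\alpha}\right) + \alpha f\!\left(\frac{1}{\alpha}\right) \\
&= \alpha \sup_{x/\alpha \in \mathrm{dom} f} \left[\frac{x}{\alpha}\, t - f\!\left(\frac{x}{\alpha}\right)\right] + \alpha f\!\left(\frac{1}{\alpha}\right) \\
&= \alpha f^*(t) + \alpha f\!\left(\frac{1}{\alpha}\right),
\end{aligned}$$

where in the last line we used the definition of $f^*(t)$. Plugging $f_\alpha'$ and $f_\alpha^*$ into eq. (10) yields the desired result.

Lemma 2 (e.g., Nguyen et al. 2010, Lemma 1). Suppose $f$ is differentiable, and the embedding function $g$ is fixed. The following similarity function $s^*$ maximizes eq. (4):

$$s^*(x^g, y^g) = f'\!\left(\frac{p_g(x^g, y^g)}{p_g(x^g)\,p_g(y^g)}\right).$$

Proof. From Definition 1, we are computing the following supremum:

$$\sup_{s} \int \left[\frac{p_g(x^g, y^g)}{p_g(x^g)\,p_g(y^g)}\, s(x^g, y^g) - f^*\!\big(s(x^g, y^g)\big)\right] \mathrm{d}\big(p_d^g \otimes p_d^g\big).$$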
(22)

Suppose $s$ is unconstrained and we fix $g$. The optimal solution should satisfy:

$$\frac{p_g(x^g, y^g)}{p_g(x^g)\,p_g(y^g)} \in (\partial f^*)\big(s^*(x^g, y^g)\big), \tag{23}$$

almost surely for $(x, y) \sim p_d \otimes p_d$. From (3.11) of Rockafellar (1966) this is equivalent to:

$$s^*(x^g, y^g) \in \partial f\!\left(\frac{p_g(x^g, y^g)}{p_g(x^g)\,p_g(y^g)}\right). \tag{24}$$

If $f$ is differentiable, then for any $u \in \mathrm{dom} f$, $\partial f(u) = \{f'(u)\}$ is a singleton.

Theorem 4 (Uniformity). Suppose that the batch size $N$ satisfies $2 \le N \le d + 1$, with $d$ the dimension of the feature space. If the real function

$$h(t) = f^* \circ f' \circ G_\sigma(t) \ \text{ is strictly convex on } [0, 4], \tag{20}$$

then all minimizers of the second term of eq. (11), i.e., $\sum_{i \ne j} f^* \circ s_f(x_i^g, x_j^g)$, satisfy the following condition: the feature representations of all samples are distributed uniformly on the unit hypersphere $S^{d-1}$.

Note that we say the samples are distributed uniformly if the feature vectors form a regular simplex (see Figure 5), and thus the distances between all sample pairs are the same.

Proof. From the definition of $h$ it is clear that $h$ is decreasing, since $f^*$ and $f'$ are both monotonically increasing while $G_\sigma$ is decreasing. Using $h$ we rewrite the second term of eq. (11) as

$$\min_{x_1^g, \ldots, x_N^g \in S^{d-1}} \; \sum_{i \ne j} h\big(\|x_i^g - x_j^g\|^2\big). \tag{25}$$

When $N \in [2, d+1]$, there exists a neat characterization of the minimizers, see e.g. Borodachov et al. (2019). We include the proof below for completeness. Writing $x_i$ for $x_i^g$ and applying Jensen's inequality, we have:

$$\begin{aligned}
\frac{1}{N(N-1)} \sum_{i \ne j} h\big(\|x_i - x_j\|^2\big)
&\ge h\!\left(\frac{1}{N(N-1)} \sum_{i \ne j} \|x_i - x_j\|^2\right) \\
&= h\!\left(\frac{1}{N(N-1)} \sum_{i, j} \|x_i - x_j\|^2\right) \\
&= h\!\left(\frac{1}{N(N-1)} \sum_{i, j} \big(2 - 2 x_i^\top x_j\big)\right) \\
&\ge h\!\left(\frac{2N}{N-1}\right),
\end{aligned} \tag{26}$$

where in the first line we used Jensen's inequality; in the third line we used $\|x_i\| = \|x_j\| = 1$ for any $i, j \in [N]$; in the last line we note that $\sum_{i,j} x_i^\top x_j = \|\sum_{i=1}^N x_i\|^2 \ge 0$ and $h$ is a decreasing function. When $h$ is strictly convex and decreasing, it is in fact strictly decreasing, and hence the two inequalities above can be attained iff

$$\sum_i x_i = 0, \quad \text{and} \quad \|x_i - x_j\|^2 \equiv c \ \text{ for all } i \ne j, \tag{27}$$

namely that $\{x_1, \ldots, x_N\}$ form a regular simplex with its center at the origin.
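The configuration in (27) can be illustrated numerically. The sketch below builds one particular centered regular simplex, obtained by centering and rescaling the standard basis vectors of $\mathbb{R}^N$; this construction, the helper names `centered_regular_simplex` and `sq_dist`, and the choice $N = 5$ are assumptions made for illustration and do not come from the paper. It then checks unit norms, a zero centroid, and a common squared pairwise distance, which for this construction equals $2N/(N-1)$.

```python
import math


def centered_regular_simplex(n):
    """Return n unit vectors in R^n that form a regular simplex centered at 0.

    Each point is a standard basis vector minus the centroid (1/n, ..., 1/n),
    rescaled by sqrt(n / (n - 1)) so that it has unit Euclidean norm.
    """
    scale = math.sqrt(n / (n - 1))
    return [
        [scale * ((1.0 if k == i else 0.0) - 1.0 / n) for k in range(n)]
        for i in range(n)
    ]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors given as lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


n = 5
points = centered_regular_simplex(n)

# Quantities corresponding to the constraints and to condition (27).
norms = [math.sqrt(sum(p * p for p in x)) for x in points]
centroid = [sum(x[k] for x in points) / n for k in range(n)]
pair_sq_dists = [
    sq_dist(points[i], points[j]) for i in range(n) for j in range(i + 1, n)
]
```

The points span an $(N-1)$-dimensional subspace, so the example is compatible with the requirement $N \le d + 1$.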
We remark that when $h$ is merely convex, points forming a centered regular simplex may form a strict subset of the minimizers.

To see the necessity of $N \le d + 1$, let us note that

$$x_i^\top x_j = \begin{cases} 1, & i = j \\ -\frac{1}{N-1}, & i \ne j \end{cases}, \tag{28}$$

which follows from

$$\sum_{i \ne j} \|x_i - x_j\|^2 = 2N^2 = N(N-1)c \implies c = \frac{2N}{N-1} = 2 + \frac{2}{N-1}. \tag{29}$$

Figure 5: A regular simplex on a hypersphere.

Performing simple Gaussian elimination we note that the matrix $X^\top X$ has rank $N - 1$, where $X = [x_1, \ldots, x_N] \in \mathbb{R}^{d \times N}$. Therefore, we must have $N - 1 \le d$.

Lastly, we need to show when $h$ is a (strictly) convex function, which may not always be true depending on the f-divergence. We give the following characterization (we ignore the constants µ and $2\sigma^2$ in Assumption 3 as they do not affect convexity):

$h$ strictly convex: $h_{\mathrm{KL}}(t) = e^{-t}$, $h_{\mathrm{JS}}(t) = \log(1 + e^{-t}) - \log 2$, $h_{\mathrm{Pearson}}(t) = e^{-2t} - 1$, $h_{\mathrm{SH}}(t) = e^{-t/2} - 1$, $h_{\mathrm{Tsallis}}(t) = e^{-\alpha t}$, $h_{\mathrm{VLC}}(t) = 3 - \frac{4}{1 + e^{-t}}$;

$h$ convex but not strictly convex: $h_{\mathrm{RKL}}(t) = -t - 1$ (RKL stands for Reverse Kullback-Leibler, see Appendix A.1);

$h$ concave: $h_{\mathrm{Neyman}}(t) = 2 - 2e^{t}$ (Neyman stands for Neyman χ², see Appendix A.1).

Only for the last case do we lack the guarantee that the minimizing configurations form a regular simplex. For RKL, in fact, any configuration that is centered at the origin suffices since $h$ is a linear function.

C Estimation of f-MICL Objective

In Section 3.3 we have provided the empirical estimation of our objective in eq. (11). However, it remains a question whether our estimation of the f-mutual information is consistent. In this part we derive an upper bound for the estimation error from statistical learning theory. We denote our f-MICL objective as $i_f(X; Y)$ (eq. (10)) and its empirical estimation as $\hat{i}_f(X; Y)$ (eq. (11)). After fixing the similarity function, we have $T(x, y) = s_f(x^g, y^g) = f' \circ G_\sigma(\|x^g - y^g\|^2)$.

Theorem 6 (estimation error). Suppose that the function $T$ is taken from a function class $\mathcal{T}$ and define $\mathcal{T}_x$ as the function class of $T(x, \cdot)$ given some $x \in \mathrm{supp}(p_d)$.
Denote $R_N^{P}$ to be the Rademacher complexity w.r.t. the distribution $P$ with $N$ i.i.d. drawn samples. Then for any $T \in \mathcal{T}$, the estimation error $|i_f(X; Y) - \hat{i}_f(X; Y)|$ is upper bounded, with probability at least $1 - \delta$, by:

$$2 R_N^{p_+}(\mathcal{T}) + 2\mu\, \mathbb{E}_{x \sim p_d} R_N^{p_d}(\mathcal{T}_x) + \frac{1}{N} \sum_{j=1}^{N} R_{N-1}^{p_d}(\mathcal{T}_{x_j}) + (r_T + 2 r_f) \sqrt{\frac{\log(6/\delta)}{2(N-1)}}, \tag{30}$$

with the constants $r_T = f'(\mu) - f'(\mu e^{-2/\sigma^2})$ and $r_f = f^* \circ f'(\mu) - f^* \circ f'(\mu e^{-2/\sigma^2})$. Here the constant µ is from our Gaussian kernel in Assumption 3.

Rademacher complexity evaluates the richness of a class of real-valued functions with respect to a probability distribution, and its formal definition can be found in Koltchinskii (2001).

Non-i.i.d. proof. Our conclusion is theoretically non-trivial since our sample pairs are non-i.i.d.: although the individual samples are assumed to be i.i.d., the negative pairs are not independently drawn (e.g., $(x_1, x_2)$ and $(x_1, x_3)$), which makes the derivation challenging.

Note that the function class $\mathcal{T}$ depends on the class of the feature encoder $g$ and the f-divergence. Our estimation error in eq. (30) is composed of three parts: (1) the Rademacher complexity of the function class $\mathcal{T}$; in general, if $\mathcal{T}$ is richer then its Rademacher complexity is also larger; (2) the expected Rademacher complexity of the one-sided function class $\mathcal{T}_x$ and its empirical estimation; (3) an error term that decreases with more samples.

Since the encoders are usually built with neural networks, we can use the existing theory (Bartlett et al., 2019) to give more detailed bounds for the Rademacher complexities of $\mathcal{T}$. Specifically, if the Vapnik-Chervonenkis (VC) dimension of $\mathcal{T}$ is finite, then our estimation error in eq. (30) goes to zero as $N \to \infty$ (Mohri et al., 2018).

Approximation and estimation tradeoff. In order to minimize the estimation error in eq. (30), we should choose a simpler function class $\mathcal{T}$ to reduce the Rademacher complexities. However, $\mathcal{T}$ should also be rich enough so that eq.
(3) can be satisfied, since our objective $i_f(X; Y)$ should approximate the f-mutual information $I_f(X; Y)$ if we choose the optimal $T$. Therefore, there is a natural tradeoff between approximation and estimation errors when we change the complexity of $\mathcal{T}$.

D Additional experimental results

We present additional experimental details in this appendix to further support the experiments in the main paper.

D.1 Implementation details

In this paper, we follow the implementations of SimCLR (https://github.com/sthalles/SimCLR) and MoCo v3 (https://github.com/facebookresearch/moco-v3). For vision tasks, we use ResNet (He et al., 2016) and ViT-S (Dosovitskiy et al., 2020) as the feature encoder, and we adopt a sampling procedure similar to that of SimCLR/MoCo. For the language dataset, we follow the exact experimental setting of Gao et al. (2021) and only change the objective. Our experimental settings are detailed below:

Hardware and package: We train on a GPU cluster with NVIDIA T4 and P100. The platform we use is PyTorch. Specifically, the pairwise summation can be easily implemented using torch.nn.functional.pdist from PyTorch.

Datasets: The datasets we consider include CIFAR-10, STL-10 (Coates et al., 2011), Tiny ImageNet (Chrabaszcz et al., 2017), ImageNet (Deng et al., 2009) and English Wikipedia (Gao et al., 2021).

Augmentation method: For each sample in a dataset we create a sample pair, a.k.a. positive pair, using two different augmentation functions. For image samples, we choose the augmentation functions to be the standard ones in contrastive learning, e.g., in Chen et al. (2020) and He et al. (2020). The augmentation is a composition of random flipping, cropping, color jittering and gray scaling. For text samples, following the augmentation method of Gao et al. (2021) we use dropout masks.
Neural architecture: For CIFAR-10 we use ResNet-18 (He et al., 2016); for STL-10 and Tiny ImageNet we use ResNet-50 (He et al., 2016); for ImageNet we use ViT-S (Dosovitskiy et al., 2020); for the Wikipedia dataset we use BERT-base (Devlin et al., 2019).

Batch size and embedding dimension: For experiments on CIFAR-10 we choose batch size 512; for STL-10 we choose batch size 64 to accommodate single-GPU training; for Tiny ImageNet, we choose batch size 256; for ImageNet, we choose batch size 1024. For all the vision datasets, we choose the embedding dimension to be 512. Regarding the language dataset, the batch size is 64 with the feature dimension 768. In all of these cases, our assumption $N \le d + 1$ in Theorem 4 is satisfied.

Hyperparameters: In all our experiments we fix the constant factor µ = 1. We find that in practice the weight parameter α often needs to be large (e.g., on the Wikipedia dataset), which requires moderate tuning.

Optimizer and learning rate scheduler: For smaller vision tasks, we use SGD with momentum for optimization and the cosine learning rate scheduler (Loshchilov & Hutter, 2017). For the ImageNet task and the natural language task, we use Adam with weight decay (Loshchilov & Hutter, 2018) and the linear decay scheduler.

Table 7: Detailed experimental settings. arch: the neural network architecture used; N: batch size; d: the dimension of the feature representation; lr: learning rate; µ: the constant factor in the Gaussian kernel $G_\sigma$; $1/(2\sigma^2)$ and α follow from Algorithm 1; epoch: the number of epochs we run; k: the number of nearest neighbors in k-NN evaluation.

| Dataset | arch | N | d | lr | µ | $(2\sigma^2)^{-1}$ | α | epoch | k |
|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | ResNet-18 | 512 | 512 | 0.1 | 1 | 1 | 40 | 800 | 200 |
| STL-10 | ResNet-50 | 64 | 512 | 0.1 | 1 | 1 | 40 | 800 | 200 |
| Tiny ImageNet | ResNet-50 | 256 | 512 | 0.1 | 1 | 1 | 40 | 800 | 200 |
| ImageNet | ViT-S | 1024 | 512 | 0.1 | 1 | 1 | 40 | 1000 | n/a |
| Wikipedia | BERT-base | 64 | 768 | 3e-5 | 1 | 20 | 409600 | 1 | n/a |
Evaluation metric: For vision tasks, we evaluate the learned embeddings with k-nearest-neighbor (k-NN) evaluation (small datasets only) and linear evaluation. For the NLP task, we use Spearman's correlation to evaluate the averaged semantic textual similarity score (Gao et al., 2021).

Baseline methods: For the four baseline methods, we follow the implementations in: MoCo: https://github.com/facebookresearch/moco; SimCLR: https://github.com/sthalles/SimCLR; Uniformity: https://github.com/SsnL/align_uniform; MoCo v3: https://github.com/facebookresearch/moco-v3. For a fair comparison we use the experimental settings in Table 7 for all the baseline methods, which might differ from the original settings.

Table 7 gives common choices of hyperparameters for different datasets. Note that we may need to further finetune α and σ for different f-divergences. See our supplementary code for more details.

D.2 Additional ablation study on weighting parameter

We provide an additional ablation study on the weighting parameter α. We perform experiments using a vision dataset (CIFAR-10) and a language dataset (Wikipedia). For CIFAR-10, we vary α from 0.1 to 50 for the KL and JS divergences and run for 200 epochs. We also run the KL experiment with α = 1 for 800 epochs and observe an accuracy of 83.58% (lower than SimCLR). This observation provides further empirical evidence that KL-MICL is different from InfoNCE and needs special tuning of α to perform well. Table 8 justifies our choice of α in Table 7: the downstream test accuracy is highest at α = 40.

For the Wikipedia dataset, we observe that a much bigger α is desirable for maximum performance. We vary α from 1 to $10^6$ for the KL and Pearson χ² divergences and run for 1 epoch, as there is a large number of samples ($10^6$) in the language dataset. Table 9 justifies our choice of α in Table 7, where the best performance is reached at α = 409600.
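The value α = 409600 lies on a doubling grid that starts at 100, since 409600 = 100 · 2^12. A minimal sketch of such a grid is given below. The function name `doubling_grid` and the upper limit of 10^6 (the largest α in the ablation) are assumptions for illustration; the stopping rule actually used is not specified here.

```python
def doubling_grid(start, limit):
    """Return start, 2*start, 4*start, ... up to and including limit."""
    values = []
    alpha = start
    while alpha <= limit:
        values.append(alpha)
        alpha *= 2
    return values


# Candidate weights between 100 and 10^6 (the limit is an assumed cap).
grid = doubling_grid(100, 10**6)
num_doublings = grid.index(409600)  # doublings needed to reach 409600
```

Each candidate in `grid` would then be evaluated with a short training run, and the α with the best downstream score kept.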
The value α = 409600 is found by starting from α = 100 and doubling iteratively.

Table 8: Ablation study on weighting parameter α for KL and JS divergences on CIFAR-10. We compare test accuracies (%) for different choices of α using k-NN evaluation.

| α | 0.1 | 1 | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|---|---|
| KL | 13.16 | 77.60 | 83.53 | 83.77 | 81.39 | 84.19 | 82.77 |
| JS | 8.84 | 73.31 | 81.39 | 83.21 | 83.49 | 84.06 | 82.61 |

Table 9: Ablation study on weighting parameter α for KL and Pearson χ² divergences on Wikipedia. We compare the semantic textual similarity (STS) via Spearman's correlation for different choices of weighting parameter α.

| α | 1 | 10 | $10^2$ | $10^3$ | $10^4$ | $10^5$ | 409600 | $10^6$ |
|---|---|---|---|---|---|---|---|---|
| KL | 67.52 | 70.47 | 72.43 | 75.12 | 76.90 | 77.78 | 78.02 | 77.78 |
| Pearson | 64.58 | 67.78 | 71.58 | 74.03 | 74.95 | 74.40 | 77.59 | 76.47 |

Figure 6: The training loss curves of various f-divergences on CIFAR-10 with 200 epochs.

D.3 Additional experiments

Our final experiments show that f-MICL is stable in terms of training and that the variation in performance is well controlled.

Training stability. We depict the training loss curves of different divergences on CIFAR-10 in Figure 6. This figure shows that our methods exhibit stable training dynamics with fast convergence.

k-NN evaluation and additional f-divergences. We show more detailed results of Table 2 in Table 10, including experiments using k-nearest neighbor (k-NN) evaluation. Additionally, we add experiments on other f-divergences such as the Squared Hellinger and Tsallis-α divergences.

Verification of Assumption 3. Throughout the paper we assume (Assumption 3) that the joint feature distribution is proportional to a Gaussian kernel. Is this a valid assumption? In this experiment, we provide empirical evidence that the assumption approximately holds in practice. Recall that, if the RBF kernel is Gaussian, Assumption 3 says that the joint feature distribution of positive pairs is:

$$p_g(x^g, y^g) \propto \exp\left(-\frac{\|x^g - y^g\|^2}{2\sigma^2}\right). \quad (31)$$
To estimate the joint density of positive pairs, we use normalizing flows, a popular method for density estimation. Popular normalizing flow models include NICE (Dinh et al., 2014), Real NVP (Dinh et al., 2017) and Glow (Kingma & Dhariwal, 2018). Equation (31) is equivalent to the following:

$$\log p_g(x^g, y^g) = -\frac{\|x^g - y^g\|^2}{2\sigma^2} + \text{const}, \quad (32)$$

and thus it suffices to show that the log-likelihood is linear in the squared distance between the two features of each positive pair. In Figure 7, we plot the relation between $\log p_g$, estimated by Real NVP with a Gaussian prior (code available at https://github.com/ikostrikov/pytorch-flows), and the squared distances $\|x^g - y^g\|^2$. The representations are learned by SimCLR and by f-MICL with the KL divergence on the CIFAR-10 dataset. To alleviate the estimation error in the flow model, we divide the distances into small intervals and compute the average log-likelihood within each interval. We can see that the log-likelihood is roughly linear w.r.t. the squared distance, which supports Assumption 3. Moreover, in Figure 8, we also show the estimation obtained by training Real NVP with a uniform prior. Over 5 different combinations of random data augmentations, we observe that the linear relationship generally holds and that the estimations with the Gaussian prior and the uniform prior are very similar.

Figure 7: Experiment for verifying Assumption 3. We draw the relation between the squared distances $\|x^g - y^g\|^2$ and the averaged $\log p_g$ with Real NVP. The features are learned by different algorithms trained on CIFAR-10. (left) SimCLR; (right) f-MICL with the KL divergence.

Table 10: Test accuracy (%) on the smaller vision datasets. For the Wikipedia dataset we evaluate the semantic textual similarity (STS) via Spearman's correlation. For each method, we take three separate runs and show the mean and standard deviation. Baselines: MoCo, SimCLR, Uniformity; f-MICL: KL, JS, Pearson, SH, Tsallis, VLC.

| Evaluation | Dataset | MoCo | SimCLR | Uniformity | KL | JS | Pearson | SH | Tsallis | VLC |
|---|---|---|---|---|---|---|---|---|---|---|
| Linear | CIFAR-10 | 90.30 ± 0.19 | 89.71 ± 0.37 | 90.41 ± 0.26 | 90.61 ± 0.47 | 89.66 ± 0.28 | 89.35 ± 0.52 | 89.52 ± 0.25 | 89.15 ± 0.42 | 89.13 ± 0.33 |
| Linear | STL-10 | 83.69 ± 0.22 | 82.97 ± 0.32 | 84.44 ± 0.19 | 85.33 ± 0.39 | 85.94 ± 0.17 | 82.64 ± 0.37 | 82.80 ± 0.27 | 84.79 ± 0.34 | 85.94 ± 0.72 |
| Linear | Tiny ImageNet | 35.72 ± 0.17 | 30.56 ± 0.28 | 41.20 ± 0.19 | 34.95 ± 0.20 | 42.98 ± 0.18 | 43.45 ± 0.54 | 40.83 ± 0.67 | 32.99 ± 0.49 | 38.65 ± 0.45 |
| k-NN | CIFAR-10 | 88.70 ± 0.22 | 84.92 ± 0.39 | 89.42 ± 0.18 | 89.34 ± 0.57 | 89.12 ± 0.38 | 89.44 ± 0.60 | 88.13 ± 0.18 | 89.18 ± 0.62 | 89.15 ± 0.23 |
| k-NN | STL-10 | 78.77 ± 0.25 | 74.34 ± 0.14 | 79.57 ± 0.52 | 79.99 ± 0.47 | 80.45 ± 0.19 | 76.64 ± 0.26 | 78.31 ± 0.33 | 76.11 ± 0.24 | 79.34 ± 0.62 |
| k-NN | Tiny ImageNet | 36.22 ± 0.20 | 29.60 ± 0.39 | 37.44 ± 0.27 | 36.17 ± 0.29 | 38.20 ± 0.26 | 38.14 ± 0.63 | 35.56 ± 0.77 | 33.11 ± 0.52 | 35.21 ± 0.33 |
| STS | Wikipedia | 77.88 ± 0.15 | 77.40 ± 0.12 | 77.95 ± 0.08 | 78.02 ± 0.13 | 76.76 ± 0.09 | 77.59 ± 0.12 | 73.60 ± 0.10 | 72.68 ± 0.09 | 55.07 ± 0.13 |

Figure 8: Additional experiment for verifying Assumption 3. Here we take 5 different combinations of random data augmentations and draw the relation between the squared distances $\|x^g - y^g\|^2$ and the averaged $\log p_g$ with Real NVP. (left column) Gaussian prior; (right column) Uniform prior. The features are learned by SimCLR trained on CIFAR-10.
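The interval-averaging step used in the verification of Assumption 3 can be sketched as follows: squared distances are grouped into equal-width bins, the log-likelihoods are averaged within each bin, and a least-squares line is fitted to the binned points. Under a Gaussian kernel the slope of that line would be close to -1/(2σ²). The function names, the number of bins, and the synthetic data (an exactly linear relation plus a small alternating perturbation) are assumptions for illustration. In the actual experiment the log-likelihoods come from a trained Real NVP model, which is not reproduced here.

```python
from statistics import mean


def binned_means(sq_dists, log_probs, num_bins):
    """Average (squared distance, log-likelihood) within equal-width bins."""
    lo, hi = min(sq_dists), max(sq_dists)
    width = (hi - lo) / num_bins or 1.0  # guard against identical distances
    buckets = [[] for _ in range(num_bins)]
    for d, lp in zip(sq_dists, log_probs):
        idx = min(int((d - lo) / width), num_bins - 1)
        buckets[idx].append((d, lp))
    # Empty bins are skipped rather than reported as zeros.
    return [(mean(d for d, _ in b), mean(lp for _, lp in b)) for b in buckets if b]


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys against xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Synthetic stand-in for flow-based estimates (not measured data).
inv_two_sigma_sq, const = 1.0, 3.0
sq_dists = [0.01 * i for i in range(1, 201)]
log_probs = [
    -inv_two_sigma_sq * d + const + 0.05 * (-1) ** i
    for i, d in enumerate(sq_dists)
]

points = binned_means(sq_dists, log_probs, num_bins=10)
slope, intercept = fit_line([x for x, _ in points], [y for _, y in points])
```

Averaging within bins suppresses the per-sample perturbation, so the fitted slope stays close to the value used to generate the synthetic data.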