Published in Transactions on Machine Learning Research (07/2024)

Mini-Batch Optimization of Contrastive Loss

Jaewoong Cho (jwcho@krafton.com, KRAFTON), Kartik Sreenivasan (kartik.sreenivasan@databricks.com, Databricks), Keon Lee (keonlee@krafton.com, KRAFTON), Kyunghoo Mun (kmun@andrew.cmu.edu, Carnegie Mellon University), Soheun Yi (soheuny@andrew.cmu.edu, Carnegie Mellon University), Jeong-Gwan Lee (jeonggwan.lee@krafton.com, KRAFTON), Anna Lee (annalee@krafton.com, KRAFTON), Jy-yong Sohn (jysohn1108@gmail.com, Yonsei University), Dimitris Papailiopoulos (dimitris@papail.io, University of Wisconsin-Madison), Kangwook Lee (kangwook.lee@wisc.edu, University of Wisconsin-Madison, KRAFTON)

Reviewed on OpenReview: https://openreview.net/forum?id=Nux7OVXpJ9

Abstract

Contrastive learning has gained significant attention as a pre-training method for self-supervised learning due to its ability to leverage large amounts of unlabeled data. A contrastive loss function ensures that embeddings of positive sample pairs (e.g., from the same class or different views of the same data) are similar, while embeddings of negative pairs are dissimilar. However, practical constraints such as large memory requirements make it infeasible to consider all possible positive and negative pairs, leading to the use of mini-batches. In this paper, we investigate the theoretical aspects of mini-batch optimization in contrastive learning with the InfoNCE loss. We show that mini-batch optimization is equivalent to full-batch optimization if and only if all $\binom{N}{B}$ mini-batches are selected, while sub-optimality may arise when examining only a subset. We then demonstrate that utilizing high-loss mini-batches can speed up SGD convergence and propose a spectral clustering-based approach for identifying these high-loss mini-batches.
Our experimental results validate our theoretical findings and demonstrate that our proposed algorithm outperforms vanilla SGD, providing a better understanding of mini-batch optimization in contrastive learning.

Equal Contributions.

1 Introduction

Contrastive learning is commonly used as a pre-training method for self-supervised learning, due to its ability to leverage the vast amount of freely available unlabeled data (Jaiswal et al., 2020). The contrastive loss function is designed to ensure that the embeddings of two samples are similar if they are considered a positive pair, in cases such as coming from the same class (Khosla et al., 2020), being an augmented version of one another (Chen et al., 2020a), or being two different modalities of the same data (Radford et al., 2021). Conversely, if two samples do not form a positive pair, they are considered a negative pair, and the contrastive loss encourages their embeddings to be dissimilar. In practice, it is not feasible to consider all possible positive and negative pairs when implementing a contrastive learning algorithm due to the quadratic memory requirement $O(N^2)$ when working with N samples. To mitigate this issue of full-batch training, practitioners typically choose a set of mini-batches of size $B = O(1)$, and consider the loss computed for positive and negative pairs within each of the $N/B$ batches (Chen et al., 2022; 2020a; Hu et al., 2021; Zeng et al., 2021; Chen et al., 2021; Zolfaghari et al., 2021; Gadre et al., 2023). For instance, Gadre et al. (2023) train a model on a dataset where $N = 1.28 \times 10^7$ with $B = 4096$. This approach results in a memory requirement of $O(B^2) = O(1)$ for each mini-batch, and a total computational complexity linear in the number of chosen mini-batches.
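As a concrete illustration of this memory trade-off, the following sketch (with hypothetical sizes of our own choosing, far smaller than real training runs) compares the $N \times N$ similarity matrix required by the full-batch loss with the $B \times B$ matrices used per mini-batch:

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, d = 1024, 32, 16  # hypothetical sizes for illustration only

U = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

# Full-batch training must materialize an N x N matrix of pairwise scores.
full_scores = U @ V.T
assert full_scores.shape == (N, N)

# Mini-batch training only needs a B x B matrix per batch: the per-batch
# memory footprint is O(B^2) instead of O(N^2).
batch = np.arange(B)
batch_scores = U[batch] @ V[batch].T
assert batch_scores.shape == (B, B)

print(full_scores.size // batch_scores.size)  # (N/B)^2 = 1024
```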
Despite the widespread practical use of mini-batch optimization in contrastive learning, there remains a lack of theoretical understanding as to whether this approach is truly reflective of the original goal of minimizing full-batch contrastive loss. This paper examines the theoretical aspects of optimizing mini-batches loaded for contrastive learning with the InfoNCE loss (Oord et al., 2018). Our study primarily focuses on this loss function due to its fundamental importance in contrastive learning and its analytical tractability. Main Contributions. The main contributions of this paper are twofold. First, we show that under certain parameter settings, mini-batch optimization is equivalent to full-batch optimization if and only if all $\binom{N}{B}$ mini-batches are selected. The results are based on an interesting connection between contrastive learning and neural collapse (Lu & Steinerberger, 2022). From a computational complexity perspective, the identified equivalence condition may be seen as somewhat prohibitive, as it implies that all $\binom{N}{B} = O(N^B)$ mini-batches must be considered. Our second contribution, however, shows that the Ordered SGD (OSGD) algorithm (Kawaguchi & Lu, 2020) can be effective in finding mini-batches that contain the most informative pairs and thereby speeding up convergence. We show that the convergence result from Kawaguchi & Lu (2020) can be applied directly to contrastive learning. We also show that OSGD can improve the convergence rate of SGD by a constant factor in certain scenarios. Furthermore, in a novel approach to address the challenge of applying OSGD to the $\binom{N}{B}$ mini-batch optimization (which involves examining $O(N^B)$ batches to select high-loss ones), we reinterpret the batch selection as a min-cut problem in graph theory. This transformative shift empowers us to select high-loss batches efficiently via a spectral clustering algorithm established in this field. The following informal theorems summarize the main findings.
Theorem 1 (informal). Under certain parameter settings, mini-batch optimization is equivalent to full-batch optimization if and only if all $\binom{N}{B}$ mini-batches are selected. Although the $\binom{N}{B}$ mini-batch contrastive loss and the full-batch loss are neither identical nor differ by a constant factor, the optimal solutions for both mini-batch and full-batch optimization are identical (see Section 4).

Theorem 2 (informal). In a demonstrative toy example, OSGD, operating on the principle of selecting high-loss batches, can potentially converge to the optimal solution of mini-batch contrastive loss optimization faster by a constant factor compared to SGD (see Section 5.1).

We validate our theoretical findings and the efficacy of the proposed spectral clustering-based batch selection method by conducting experiments on both synthetic and real data. On synthetic data, we show that our proposed batch-selection algorithms do indeed converge to the optimal solution of full-batch optimization significantly faster than the baselines. We also conduct experiments by pre-training on CIFAR-100 (Krizhevsky et al., 2009) and Tiny ImageNet (Le & Yang, 2015) using the proposed method. We evaluate the performance of the resulting models on retrieval tasks. Our experiments on real datasets demonstrate that our batch selection method outperforms vanilla SGD in practically relevant settings.

2 Related Work

Contrastive losses. Contrastive learning has been used for several decades to learn a similarity metric to be used later for verification or recognition applications (Misra & Maaten, 2020; Aberdam et al., 2021). Chopra et al. (2005) proposed one of the early versions of the contrastive loss, which has been updated and improved over the years (Sohn, 2016; Song & Ermon, 2020; Schroff et al., 2015; Khosla et al., 2020; Oord et al., 2018).
More recently, contrastive learning has been shown to rival and even surpass traditional supervised learning methods, particularly on image classification tasks (Chen et al., 2020b; Bachman et al., 2019). Further, its multi-modal adaptation leverages vast unstructured data, extending its effectiveness beyond image and text modalities (Radford et al., 2021; Jia et al., 2021; Pham et al., 2021; Ma et al., 2021; Sachidananda et al., 2022; Elizalde et al., 2023; Goel et al., 2022; Lee et al., 2022; Ramesh et al., 2021; 2022). Unfortunately, these methods require extremely large batch sizes in order to perform effectively. Follow-up works showed that using momentum or carefully modifying the augmentation schemes can alleviate this issue to some extent (He et al., 2020; Chen et al., 2020b; Grill et al., 2020; Wang & Qi, 2022).

Effect of batch size. While most successful applications of contrastive learning use large batch sizes (e.g., 32,768 for CLIP and 8,192 for SimCLR), recent efforts have focused on reducing batch sizes and improving convergence rates (Yeh et al., 2022; Chen et al., 2022). Yuan et al. (2022) carefully studied the effect of the batch size on the convergence rate when a model is trained to minimize the SimCLR loss, and proved that the gradient of the solution is bounded by $O(1/\sqrt{B})$. They also propose SogCLR, an algorithm with a modified gradient update whose correction term allows for an improved convergence rate with better dependence on B. It has been shown that the performance for small batches can be improved with a technique called hard negative mining (Robinson et al., 2021; Kalantidis et al., 2020; Zhang & Stratos, 2021).

Neural collapse. Neural collapse is a phenomenon observed in Papyan et al. (2020) where the final classification layer of deep neural nets collapses to the simplex Equiangular Tight Frame (ETF) when trained well past the point of zero training error (Ji et al., 2022; Zhou et al., 2022).
Lu & Steinerberger (2022) proved that this occurs when minimizing the cross-entropy (CE) loss over the unit ball. Jiang et al. (2023) generalized neural collapse to practically relevant settings where the number of classes is larger than the dimension of the embeddings. Extending beyond CE loss minimization, Graf et al. (2021) delved into the optimal embeddings obtained by minimizing the supervised contrastive (SC) loss on balanced datasets and compared them with CE loss minimization. Kini et al. (2023) extended this study by characterizing the optimal embeddings of SC loss minimization combined with ReLU in the last layer, which can learn symmetric embedding structures despite data imbalances, as well as the sufficient and necessary conditions for mini-batches to achieve the same optimal embeddings as full-batch optimization. Our work explores this problem in the unsupervised contrastive learning setting. We characterize (1) the optimal embeddings that minimize the InfoNCE loss for any given temperature parameter under certain assumptions, and (2) conditions for mini-batches to achieve the same optimal embeddings as full-batch optimization.

Optimal permutations for SGD. The performance of SGD without replacement under different permutations of samples has been well studied in the literature (Bottou, 2009; Recht & Ré, 2012; 2013; Nagaraj et al., 2019; Ying et al., 2020; Ahn et al., 2020; Rajput et al., 2020; Mishchenko et al., 2020; Safran & Shamir, 2021b;a; Gürbüzbalaban et al., 2021; Nguyen et al., 2021; Lu et al., 2021; Rajput et al., 2022; Tran et al., 2021; Lu et al., 2022; Cha et al., 2023; Cho & Yun, 2023). One can view batch selection in contrastive learning as a method to choose a specific permutation among the possible $\binom{N}{B}$ mini-batches of size B.
However, it is important to note that these bounds do not indicate an improved convergence rate for general non-convex functions and thus would not apply to the contrastive loss, particularly in the setting where the embeddings come from a shared embedding network. We show that in the case of OSGD (Kawaguchi & Lu, 2020), we can indeed prove that the contrastive loss satisfies the necessary conditions to guarantee convergence.

3 Problem Setting

Suppose we are given a dataset $\{(x_i, y_i)\}_{i=1}^N$ of N positive pairs (data sample pairs that are conceptually similar or related), where $x_i$ and $y_i$ are two different views of the same data. Note that this setup includes both the multi-modal setting (e.g., CLIP (Radford et al., 2021)) and the uni-modal setting (e.g., SimCLR (Chen et al., 2020a)) as follows. For the multi-modal case, one can view $(x_i, y_i)$ as two different modalities of the same data, e.g., $x_i$ is the image of a scene while $y_i$ is the text description of the scene. For the uni-modal case, one can consider $x_i$ and $y_i$ as different augmented images from the same image. We consider the contrastive learning problem where the goal is to find embedding vectors for $\{x_i\}_{i=1}^N$ and $\{y_i\}_{i=1}^N$, such that the embedding vectors of positive pairs $(x_i, y_i)$ are similar, while ensuring that the embeddings of other (negative) pairs are well separated. Let $u_i \in \mathbb{R}^d$ be the embedding vector of $x_i$, and $v_i \in \mathbb{R}^d$ be the embedding vector of $y_i$. In practical settings, one typically considers parameterized encoders so that $u_i = f_\theta(x_i)$ and $v_i = g_\phi(y_i)$. We define embedding matrices $U := [u_1, \ldots, u_N]$ and $V := [v_1, \ldots, v_N]$, which are the collections of embedding vectors. Now, we focus on the setting of directly optimizing the embeddings instead of the model parameters θ and ϕ in order to gain theoretical insights into learning embeddings.
Note that this setting is commonly utilized to achieve a deeper understanding of the underlying principles and mechanisms (Graf et al., 2021; Kini et al., 2023). Consider the problem of directly optimizing the embeddings for N pairs, which is given by

$$\min_{U,V} \mathcal{L}_{\text{con}}(U, V) \quad \text{s.t.} \quad \|u_i\| = 1, \ \|v_i\| = 1 \quad \forall i \in [N], \qquad (1)$$

where $\|\cdot\|$ denotes the $\ell_2$ norm and $[N]$ denotes the collection of all integers from 1 to N. In this work, we focus on the standard InfoNCE loss (Oord et al., 2018) defined as

$$\mathcal{L}_{\text{con}}(U, V) := -\frac{1}{N} \sum_{i=1}^{N} \left( \log \frac{e^{u_i^\top v_i/\tau}}{\sum_{j=1}^{N} e^{u_i^\top v_j/\tau}} + \log \frac{e^{v_i^\top u_i/\tau}}{\sum_{j=1}^{N} e^{v_i^\top u_j/\tau}} \right), \qquad (2)$$

with τ being a positive scalar known as a temperature parameter. Note that $\mathcal{L}_{\text{con}}(U, V)$ is the full-batch version of the loss which contrasts all embeddings with each other. However, due to the large computational complexity and memory requirements during optimization, practitioners often consider the following mini-batch version instead. Note that there exist $\binom{N}{B}$ different mini-batches, each of which has B samples. For $k \in \left[\binom{N}{B}\right]$, let $\mathcal{B}_k$ be the k-th mini-batch satisfying $\mathcal{B}_k \subseteq [N]$ and $|\mathcal{B}_k| = B$. Let $U_{\mathcal{B}_k} := \{u_i\}_{i \in \mathcal{B}_k}$ and $V_{\mathcal{B}_k} := \{v_i\}_{i \in \mathcal{B}_k}$. Then, the contrastive loss for the k-th mini-batch is $\mathcal{L}_{\text{con}}(U_{\mathcal{B}_k}, V_{\mathcal{B}_k})$.

4 Optimization for Full-Batch and Mini-Batch

In this section, we investigate the relationship between the problem of optimizing the full-batch loss $\mathcal{L}_{\text{con}}(U, V)$ and the problem of optimizing the mini-batch loss $\mathcal{L}_{\text{con}}(U_{\mathcal{B}_k}, V_{\mathcal{B}_k})$. Towards this goal, we prove three main results, the proofs of which are in Appendix C.1. We derive the optimal solution that minimizes the full-batch loss (Lemma 1, Theorem 3). We show that the solution that minimizes the average of the $\binom{N}{B}$ mini-batch losses is identical to the one that minimizes the full-batch loss (Proposition 1, Theorem 4). We show that minimizing the mini-batch loss summed over only a strict subset of the $\binom{N}{B}$ mini-batches can lead to a sub-optimal solution that does not minimize the full-batch loss (Theorem 5).
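For concreteness, a minimal NumPy sketch of the full-batch InfoNCE loss in Equation (2) (our own reference implementation, not the authors' code) is given below. Evaluated at a 2-dimensional cross-polytope with $U = V$ and $\tau = 1$, it reproduces the value 1.253 that reappears as the optimal toy-example loss in Section 5.1:

```python
import numpy as np

def _lse(S, axis):
    # numerically stable log-sum-exp along the given axis
    m = S.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.exp(S - m).sum(axis=axis))

def info_nce_full_batch(U, V, tau=1.0):
    """Full-batch InfoNCE loss of Equation (2).

    U, V: (N, d) arrays whose rows are the unit-norm embeddings u_i, v_i.
    """
    S = U @ V.T / tau                  # S[i, j] = u_i . v_j / tau
    row = np.diag(S) - _lse(S, 1)      # log softmax of u_i against all v_j
    col = np.diag(S) - _lse(S, 0)      # log softmax of v_i against all u_j
    return -(row + col).mean()

# N = 4 unit vectors forming a cross-polytope in R^2, with V = U.
U = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
loss = info_nce_full_batch(U, U)
print(round(loss, 3))  # 1.253
```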
4.1 Full-Batch Contrastive Loss Optimization

In this section, we characterize the optimal solution for the full-batch loss minimization in Equation (1). We start by providing the definition of the simplex equiangular tight frame (ETF), which turns out to be the optimal solution in certain cases. The original definition of ETF (Sustik et al., 2007) is for N vectors in a d-dimensional space where $N \geq d + 1$.¹ Papyan et al. (2020) define the ETF for the case where $N \leq d + 1$ to characterize the phenomenon of neural collapse. In our work, we will introduce and use the latter definition of simplex ETFs, which is stated below.

Definition 1 (Simplex ETF). We say a set of N vectors $\{u_i\}_{i=1}^N$ forms a simplex Equiangular Tight Frame (ETF) if $\|u_i\| = 1 \ \forall i \in [N]$ and $u_i^\top u_j = -1/(N-1) \ \forall i \neq j$.

In the following lemma, we first prove that the optimal solution of full-batch contrastive learning is the simplex ETF for $N \leq d + 1$, which follows almost directly from Lu & Steinerberger (2022).

Lemma 1 (Optimal solution when $N \leq d + 1$). Suppose $N \leq d + 1$. Then, the optimal solution $(U^\star, V^\star)$ of the full-batch contrastive learning problem in Equation (1) satisfies two properties: (i) $U^\star = V^\star$, and (ii) the columns of $U^\star$ form a simplex ETF.

Recall that in this work, we explore the setting of directly optimizing embeddings as an initial effort to gain theoretical insight. However, Lemma 1 presents a result that can be extended to a more practically relevant setting where embeddings are generated through linear encoders parameterized by θ and ϕ: $u_i = W_\theta x_i$ and $v_i = W_\phi y_i$. The details of this setting and the characteristics of the optimal encoders are stated below.

Corollary 1. Consider a setting where embedding vectors are generated via overparameterized linear encoders $W_\theta, W_\phi \in \mathbb{R}^{d \times d_0}$: $u_i = W_\theta x_i$, $v_i = W_\phi y_i$, where $x_i, y_i \in \mathbb{R}^{d_0}$. We define $X := [x_1, \ldots, x_N]$ and $Y := [y_1, \ldots, y_N]$. Suppose $d, d_0 \geq N - 1$ and $X^\top X$, $Y^\top Y$ are invertible.
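Definition 1 is easy to verify numerically. One standard construction (our sketch, not taken from the paper) projects the N standard basis vectors of $\mathbb{R}^N$ orthogonally to the all-ones vector and rescales, which yields N unit vectors with pairwise inner product exactly $-1/(N-1)$:

```python
import numpy as np

def simplex_etf(N):
    """N unit vectors in R^N (spanning an (N-1)-dim subspace) forming a simplex ETF."""
    P = np.eye(N) - np.ones((N, N)) / N   # projection orthogonal to the all-ones vector
    return np.sqrt(N / (N - 1)) * P       # rows are the ETF vectors

N = 5
U = simplex_etf(N)
G = U @ U.T                               # Gram matrix of the construction
print(np.round(G, 3))                     # 1 on the diagonal, -1/(N-1) off-diagonal
```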
Then, the optimal solution $(W_\theta^\star, W_\phi^\star)$ is given as: (i) $W_\theta^\star = U^\star (X^\top X)^{-1} X^\top$; (ii) $W_\phi^\star = V^\star (Y^\top Y)^{-1} Y^\top$, where $(U^\star, V^\star)$ satisfies the two properties in Lemma 1.

Several practical scenarios satisfy $N > d + 1$. However, the approach used in Lu & Steinerberger (2022) cannot be directly applied for $N > d + 1$, leaving it as an open problem. While solving the open problem for the general case seems difficult, we characterize the optimal solution for the specific case of $N = 2d$, subject to the conditions stated below. However, the characterization for this case is still challenging, so we employ a certain conjecture: we conjecture that the optimal embeddings $(U^\star, V^\star)$ for $N = 2d$ are symmetric and antipodal. These properties of embeddings are defined as follows.

Definition 2 (Symmetric and Antipodal). Embedding matrices U and V are called symmetric and antipodal if (U, V) satisfies two properties: (i) symmetric, i.e., $U = V$; (ii) antipodal, i.e., for each $i \in [N]$, there exists $j(i)$ such that $u_{j(i)} = -u_i$.

Note that the symmetric property also holds for the optimal embeddings when $N \leq d + 1$, and the antipodal property is a common assumption in geometric problems such as the sphere covering problem in Borodachov (2022). This provides a rationale for our conjecture. Theorem 3 shows that when $N = 2d$, the optimal solution for the full-batch loss minimization, under a symmetric and antipodal configuration, forms a cross-polytope, which is defined as follows.

Definition 3 (Simplex cross-polytope). We say a set of N vectors $\{u_i\}_{i=1}^N$ forms a simplex cross-polytope if, for all i, the following three conditions hold: $\|u_i\| = 1$; there exists a unique j such that $u_i^\top u_j = -1$; and $u_i^\top u_k = 0$ for all $k \notin \{i, j\}$.

Theorem 3 (Optimal solution when $N = 2d$). Let

$$(U^\star, V^\star) := \arg\min_{(U,V) \in \mathcal{A}} \mathcal{L}_{\text{con}}(U, V) \quad \text{s.t.} \quad \|u_i\| = 1, \ \|v_i\| = 1 \quad \forall i \in [N], \qquad (3)$$

where $\mathcal{A} := \{(U, V) : U, V \text{ are symmetric and antipodal}\}$. Then, the columns of $U^\star$ form a simplex cross-polytope for $N = 2d$. Proof Outline.
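Definition 3 can likewise be checked numerically. In $\mathbb{R}^d$ the $N = 2d$ signed standard basis vectors $\{\pm e_1, \ldots, \pm e_d\}$ form a simplex cross-polytope (a sketch of ours):

```python
import numpy as np

def simplex_cross_polytope(d):
    """Return the N = 2d vertices {+e_1, ..., +e_d, -e_1, ..., -e_d} in R^d."""
    return np.vstack([np.eye(d), -np.eye(d)])

U = simplex_cross_polytope(3)   # N = 6 unit vectors in R^3
G = U @ U.T
N = U.shape[0]
for i in range(N):
    assert np.isclose(G[i, i], 1.0)             # unit norm
    assert np.sum(np.isclose(G[i], -1.0)) == 1  # exactly one antipodal partner
    rest = [G[i, k] for k in range(N) if not np.isclose(abs(G[i, k]), 1.0)]
    assert np.allclose(rest, 0.0)               # orthogonal to all other vectors
print("cross-polytope conditions hold")
```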
By the antipodality assumption, for a given $i \in [N]$ we can apply Jensen's inequality to the $N - 2$ indices excluding $u_i$ itself and its antipodal point $-u_i$. We then show that the simplex cross-polytope minimizes the resulting lower bound while satisfying the conditions that make the applications of Jensen's inequality tight. See Appendix C for the full proof.

For the general case of $N > d + 1$, excluding $N = 2d$, we still leave it as an open problem.

¹See Def. 4 in Appendix B for the full definition.

Figure 1: (a) Comparing mini-batch loss and full-batch loss when N = 10, B = 2, and d = 2. We illustrate this by manipulating a single embedding vector $u_1$ while maintaining all other embeddings ($v_1$ and $\{u_i, v_i\}_{i=2}^{10}$) at their optimal solutions. Specifically, $u_1 = [u_{1,1}, u_{1,2}]$ is varied as $[\cos(\theta), \sin(\theta)]$ for $\theta = \arctan(u_{1,2}/u_{1,1}) \in [-\pi, \pi]$. While the two loss functions are not identical, corroborating Proposition 1, their minimizers align, providing empirical support for Theorem 4. (b) The relationship between full-batch and mini-batch optimization in contrastive learning. Consider optimizing N = 4 pairs of d = 3 dimensional embedding vectors $\{(u_i, v_i)\}_{i=1}^N$, where $u_i$ and $v_i$ are shown as a colored square and circle, respectively. The index i is written in the square/circle. The black rounded box represents a batch. We compare three batch selection options: (i) full batch, i.e., $\mathcal{S}_B = \{\{1, 2, 3, 4\}\}$ with B = 4; (ii) all $\binom{N}{B} = 6$ mini-batches of size B = 2, i.e., $\mathcal{S}_B = \{\{1, 2\}, \{1, 3\}, \{1, 4\}, \{2, 3\}, \{2, 4\}, \{3, 4\}\}$; and (iii) some mini-batches, e.g., $\mathcal{S}_B = \{\{1, 2\}, \{3, 4\}\}$. Here, $\mathcal{S}_B$ is the set of mini-batches, where each mini-batch is represented by the set of its constituent sample indices. Our theoretical/empirical findings are: the optimal embedding that minimizes the full-batch loss and the one that minimizes the sum of the $\binom{N}{B}$ mini-batch losses are identical, while the one that minimizes the mini-batch losses summed over only a strict subset of the $\binom{N}{B}$ batches does not guarantee the negative correlation between $u_i$ and $u_j$ for $i \neq j$. This illustration is supported by our mathematical results in Theorems 4 and 5.

4.2 Mini-Batch Loss Optimization

Here we consider the mini-batch contrastive loss optimization problem, where we first choose multiple mini-batches of size B and then find U, V that minimize the sum of contrastive losses computed for the chosen mini-batches. Note that this is the loss that is typically considered in contrastive learning, since computing the full-batch loss is intractable in practice. Let us consider a subset of all possible $\binom{N}{B}$ mini-batches and denote their indices by $\mathcal{S}_B \subseteq \left[\binom{N}{B}\right]$. For a fixed $\mathcal{S}_B$, the mini-batch optimization problem is formulated as:

$$\min_{U,V} \mathcal{L}^{\text{con}}_{\text{mini}}(U, V; \mathcal{S}_B) \quad \text{s.t.} \quad \|u_i\| = 1, \ \|v_i\| = 1 \quad \forall i \in [N], \qquad (4)$$

where the loss of the given mini-batches is $\mathcal{L}^{\text{con}}_{\text{mini}}(U, V; \mathcal{S}_B) := \frac{1}{|\mathcal{S}_B|} \sum_{i \in \mathcal{S}_B} \mathcal{L}_{\text{con}}(U_{\mathcal{B}_i}, V_{\mathcal{B}_i})$. To analyze the relationship between the full-batch loss minimization in Equation (1) and the mini-batch loss minimization in Equation (4), we first compare the objective functions of the two problems.

Proposition 1. The mini-batch loss and the full-batch loss are not identical, nor is one a simple scaling of the other by a constant factor. In other words, when $\mathcal{S}_B = \left[\binom{N}{B}\right]$, for all $B \geq 2$, there exists no constant c such that $\mathcal{L}^{\text{con}}_{\text{mini}}(U, V; \mathcal{S}_B) = c \cdot \mathcal{L}_{\text{con}}(U, V)$ for all U, V.

We visualize this proposition in Figure 1(a).
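Proposition 1 can be probed numerically (our own sketch, self-contained and not from the paper's code): evaluating both objectives at two different random embedding configurations yields two different ratios, so no single constant c can relate the two losses:

```python
import numpy as np
from itertools import combinations

def info_nce(U, V, tau=1.0):
    # InfoNCE loss of Equation (2), restricted to the given rows.
    S = U @ V.T / tau
    d = np.diag(S)
    m1, m2 = S.max(axis=1), S.max(axis=0)
    row = d - (m1 + np.log(np.exp(S - m1[:, None]).sum(axis=1)))
    col = d - (m2 + np.log(np.exp(S - m2[None, :]).sum(axis=0)))
    return -(row + col).mean()

def mini_batch_loss(U, V, B, tau=1.0):
    # Average InfoNCE loss over all (N choose B) mini-batches.
    N = U.shape[0]
    return np.mean([info_nce(U[list(b)], V[list(b)], tau)
                    for b in combinations(range(N), B)])

rng = np.random.default_rng(0)
ratios = []
for _ in range(2):
    U = rng.standard_normal((6, 4))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # unit-norm embeddings
    V = rng.standard_normal((6, 4))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    ratios.append(mini_batch_loss(U, V, B=2) / info_nce(U, V))
print(ratios)  # configuration-dependent ratio: no constant c exists
```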
Interestingly, the following result shows that the optimal solutions of both problems are identical.

Theorem 4 (Optimization with all possible $\binom{N}{B}$ mini-batches). Suppose $B \geq 2$. The set of minimizers of the $\binom{N}{B}$ mini-batch problem in Equation (4) is the same as that of the full-batch problem in Equation (1) for two cases: (i) $N \leq d + 1$, and (ii) $N = 2d$ with the pairs (U, V) restricted to those satisfying the conditions stated in Def. 2. In such cases, the solutions (U, V) of the $\binom{N}{B}$ mini-batch optimization problem satisfy the following: Case (i): $\{u_i\}_{i=1}^N$ forms a simplex ETF and $u_i = v_i$ for all $i \in [N]$; Case (ii): $\{u_i\}_{i=1}^N$ forms a simplex cross-polytope.

Proof Outline. Similar to the proof of Lemma 1, we bound the objective function from below using Jensen's inequality. We then show that this lower bound is equivalent to a scaling of the bound from the proof of Lemma 1 by employing non-trivial counting arguments. Furthermore, we demonstrate that both the simplex ETF and the simplex cross-polytope minimize this lower bound for each case while satisfying the conditions that make the applications of Jensen's inequality tight. See Appendix C.1 for the detailed proof.

Now, we present mathematical results specifying the cases when the solutions of mini-batch optimization and full-batch optimization differ. First, we show that when B = 2, minimizing the mini-batch loss over any strict subset of the $\binom{N}{B}$ batches is not equivalent to minimizing the full-batch loss.

Theorem 5 (Optimization with fewer than $\binom{N}{B}$ mini-batches). Suppose B = 2 and $N \leq d + 1$. Then, the minimizer of Equation (4) for $\mathcal{S}_B \subsetneq \left[\binom{N}{B}\right]$ is not the minimizer of the full-batch optimization in Equation (1).

Proof Outline. We show that there exist embedding vectors that are not the simplex ETF, and have a strictly lower objective value.
This implies that the optimal solution of any set of mini-batches that does not contain all $\binom{N}{2}$ mini-batches is not the same as that of the full-batch problem. See Appendix C.1 for the detailed proof.

The result of Theorem 5 extends to the general case of $B \geq 2$ under a mild assumption; please see Propositions 2 and 3 in Appendix C.1. Figure 1 summarizes the main findings in this section. These findings suggest that mini-batch optimization can achieve the optimal embeddings of full-batch optimization, as long as all $\binom{N}{B}$ possible mini-batches are considered. However, this poses a computational challenge, as it involves loading all these mini-batches. In the following section, we introduce efficient algorithms for speeding up the $\binom{N}{B}$ mini-batch optimization.

5 Ordered SGD for Mini-Batch Contrastive Learning

Recall that the optimal embeddings for the full-batch optimization problem in Equation (1) can be obtained by minimizing the sum of the $\binom{N}{B}$ mini-batch losses. An easy way of achieving the optimal embeddings is using gradient descent (GD) on the sum of losses for the $\binom{N}{B}$ mini-batches, or using a stochastic approach that applies GD on the loss for a randomly chosen mini-batch. Recent works found that applying GD on selectively chosen batches outperforms SGD in some cases (Kawaguchi & Lu, 2020; Lu et al., 2021; Loshchilov & Hutter, 2015). A natural question is whether this holds for mini-batch contrastive learning, i.e., (i) Is SGD enough to guarantee good convergence in mini-batch contrastive learning?, and (ii) Can we come up with a batch selection method that performs better than vanilla SGD? To answer these questions, we show that Ordered SGD (OSGD) (Kawaguchi & Lu, 2020) can potentially accelerate convergence compared to vanilla SGD in a demonstrative toy example (Section 5.1), and extend the convergence results from Kawaguchi & Lu (2020) to mini-batch contrastive loss optimization (Section 5.2).
We reformulate the batch selection problem into a min-cut problem in graph theory (Cormen et al., 2022), by considering a graph with N nodes where each node represents a positive pair and each edge weight serves as a proxy for the contrastive loss between two nodes. This allows us to devise an efficient batch selection algorithm by leveraging a spectral clustering algorithm (Ng et al., 2001) (Section 5.3).

5.1 Convergence Comparison in a Toy Example: OSGD vs. SGD

This section investigates the convergence of two gradient-descent-based methods, OSGD and SGD. The lemma below shows that the contrastive loss is geodesically non-quasi-convex, which implies the hardness of proving the convergence of gradient-based methods to the optimal embeddings for contrastive learning in Equation (1). The proof is provided in Appendix C.2.

Figure 2: (a) Toy example considered in Section 5.1. (b) The training loss curves of three algorithms (OSGD, SGD, and $\binom{N}{B}$ full-batch gradient descent) applied on the toy example when N = 4 and B = 2. The x-axis represents the number of update steps, while the y-axis displays the loss in Equation (2). OSGD converges the fastest among the three methods.

Lemma 2. The contrastive loss $\mathcal{L}_{\text{con}}(U, V)$ is a geodesically non-quasi-convex function of U, V on $\mathcal{T} = \{(U, V) : \|u_i\| = \|v_i\| = 1, \ \forall i \in [N]\}$.

In order to compare the convergence of OSGD and SGD to the optimal embeddings, we focus on a toy example where convergence to the optimal solution is achievable with appropriate initialization. Consider a scenario where we have N = 4 embedding vectors $\{u_i\}_{i=1}^N$ with $u_i \in \mathbb{R}^2$. Each embedding vector is defined as $u_1 = (\cos\theta_1, \sin\theta_1)$, $u_2 = (\cos\theta_2, -\sin\theta_2)$, $u_3 = (-\cos\theta_3, \sin\theta_3)$, $u_4 = (-\cos\theta_4, -\sin\theta_4)$ for parameters $\{\theta_i\}_{i=1}^N$.
For all i, the initial parameters are set as $\theta^{(0)}_i = \epsilon > 0$, and the other embedding vectors are initialized as $v^{(0)}_i = u^{(0)}_i$. This setting is illustrated in Figure 2(a). Over time steps t, we consider updating the parameters $\theta^{(t)} := [\theta^{(t)}_1, \theta^{(t)}_2, \theta^{(t)}_3, \theta^{(t)}_4]$ using a gradient-descent-based method on $\mathcal{L}_{\text{con}}(U_{\mathcal{B}}, V_{\mathcal{B}})$ with a learning rate η:

$$\theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta \mathcal{L}_{\text{con}}(U_{\mathcal{B}^{(t)}}, V_{\mathcal{B}^{(t)}}).$$

At each time step t, each learning algorithm begins by selecting a mini-batch $\mathcal{B}^{(t)} \subseteq \{1, 2, 3, 4\}$ with batch size $|\mathcal{B}^{(t)}| = 2$. SGD randomly selects a mini-batch, while OSGD selects a mini-batch as follows:

$$\mathcal{B}^{(t)} = \arg\max_{\mathcal{B} \in \mathcal{S}} \mathcal{L}_{\text{con}}(U_{\mathcal{B}}(\theta^{(t)}), V_{\mathcal{B}}(\theta^{(t)})).$$

For a sufficiently small margin ρ > 0, let $T_{\text{OSGD}}, T_{\text{SGD}}$ be the minimal times required for the algorithms to reach the condition that each component of the expected value $\mathbb{E}[\theta^{(T)}]$ falls within $(\pi/4 - \rho, \pi/4)$:

$$T_{\text{OSGD}} := \min\{T \geq 0 : \mathbb{E}[\theta^{(T)}_{\text{OSGD}}] \in (\pi/4 - \rho, \pi/4)^N\}; \quad T_{\text{SGD}} := \min\{T \geq 0 : \mathbb{E}[\theta^{(T)}_{\text{SGD}}] \in (\pi/4 - \rho, \pi/4)^N\},$$

where $\theta^{(t)}_{\text{OSGD}}$ and $\theta^{(t)}_{\text{SGD}}$ denote the model parameters at step t when OSGD and SGD are used, respectively. Under this setting, the following theorem compares OSGD and SGD in terms of the lower bound on the time required for convergence to the optimal solution.

Theorem 6. Consider the described setting where the parameters $\theta^{(t)}$ of the embedding vectors are updated, as shown in Figure 2(b). Suppose there exist ϵ, T such that for all t satisfying $\mathcal{B}^{(t)} = \{1, 3\}$ or $\{2, 4\}$, $\|\nabla_{\theta^{(t)}} \mathcal{L}_{\text{con}}(U_{\mathcal{B}^{(t)}}, V_{\mathcal{B}^{(t)}})\| \leq \epsilon$, and $T_{\text{OSGD}}, T_{\text{SGD}} < T$. Then, we have the following inequalities:

$$T_{\text{OSGD}} \geq \frac{\tau(\pi/4 - \rho - \epsilon) + O(\eta^2\epsilon + \eta\epsilon^3)}{\eta\epsilon}, \qquad T_{\text{SGD}} \geq \frac{3(e^{2/\tau} + 1)}{e^{2/\tau} - 1} \cdot \frac{\tau(\pi/4 - \rho - \epsilon) + O(\eta^2\epsilon + \eta^2\sqrt{\epsilon})}{\eta\epsilon} + O(\eta\epsilon^3 + \eta\sqrt{\epsilon}).$$

Corollary 2. Suppose the lower bounds of $T_{\text{OSGD}}, T_{\text{SGD}}$ in Theorem 6 are tight, and the learning rate η is small enough. Then, $T_{\text{OSGD}}/T_{\text{SGD}} = (e^{2/\tau} - 1)/(3(e^{2/\tau} + 1)) < 1$.

In Figure 2(b), we present training loss curves of the full-batch contrastive loss in Equation (2) for various algorithms implemented on the toy example.
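The toy example admits a compact simulation (our own sketch; the quadrant parametrization, numerical gradient, and hyperparameters below are our choices for illustration, not the paper's experimental setup). OSGD evaluates the loss of every size-2 batch and descends on the worst one, while SGD picks a batch uniformly at random:

```python
import numpy as np
from itertools import combinations

TAU = 1.0

def embed(theta):
    c, s = np.cos(theta), np.sin(theta)
    # quadrant parametrization: u1=(c,s), u2=(c,-s), u3=(-c,s), u4=(-c,-s)
    return np.array([[c[0], s[0]], [c[1], -s[1]], [-c[2], s[2]], [-c[3], -s[3]]])

def batch_loss(theta, batch):
    U = embed(theta)[list(batch)]
    S = U @ U.T / TAU
    logits = np.diag(S) - np.log(np.exp(S).sum(axis=1))
    return -2 * logits.mean()          # symmetric InfoNCE with v_i = u_i

def grad(theta, batch, h=1e-6):
    # central-difference gradient of the batch loss w.r.t. theta
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros(4); e[i] = h
        g[i] = (batch_loss(theta + e, batch) - batch_loss(theta - e, batch)) / (2 * h)
    return g

def train(select, steps=800, eta=0.05, eps=0.05, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.full(4, eps)
    batches = list(combinations(range(4), 2))
    for _ in range(steps):
        theta = theta - eta * grad(theta, select(theta, batches, rng))
    return theta

def osgd_select(theta, batches, rng):
    return max(batches, key=lambda b: batch_loss(theta, b))  # highest-loss batch

def sgd_select(theta, batches, rng):
    return batches[rng.integers(len(batches))]               # uniform random batch

theta_osgd = train(osgd_select)
print(np.round(theta_osgd, 2))  # each component ends near pi/4 ~ 0.785
```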
One can observe that the losses of all algorithms eventually converge to 1.253, the optimal loss achievable when the solution satisfies $u_i = v_i$ and $\{u_i\}_{i=1}^N$ forms a simplex cross-polytope. Note that the optimality of this solution is given in Theorem 4. This empirical evidence corroborates our theoretical findings, indicating that OSGD converges faster than SGD.

Figure 3: Histograms of batch counts over the N/B batches, for the contrastive loss measured from ResNet-18 models trained on CIFAR-100 using SGD, where N = 50,000 and B = 20. Each plot is derived from a distinct training epoch. Here we compare two batch selection methods: (i) randomly shuffling the N samples and partitioning them into N/B batches of size B; (ii) our spectral clustering (SC) method given in Algorithm 1. The histograms show that batches generated through the proposed spectral clustering method tend to contain a higher proportion of large loss values when compared to random batch selection. Similar results are observed in different settings, details of which are given in Appendix E.1.

5.2 Convergence of OSGD in the Mini-Batch Contrastive Learning Setting

Recall that it is challenging to prove the convergence of gradient-descent-based methods to the optimal embeddings for the contrastive learning problem in Equation (1) due to the non-quasi-convexity of the contrastive loss $\mathcal{L}_{\text{con}}$. Instead, we show that OSGD minimizes a modification of the empirical contrastive loss $\mathcal{L}_{\text{con}}$ that accounts for the order of the batch losses, and is guaranteed to converge at a sublinear rate to a critical point of this loss by extending the results in Kawaguchi & Lu (2020).
We consider a proxy, the weighted contrastive loss, defined as
$$\widetilde{L}_{\mathrm{con}}(U, V) := \frac{1}{q} \sum_{j=1}^{\binom{N}{B}} \gamma_j \, L_{\mathrm{con}}(U_{B^{(j)}}, V_{B^{(j)}}), \qquad \gamma_j = \sum_{l=0}^{q-1} \binom{j-1}{l}\binom{\binom{N}{B} - j}{k - l - 1} \Big/ \binom{\binom{N}{B}}{k},$$
for two arbitrary natural numbers $k, q \le \binom{N}{B}$, where $B^{(j)}$ is the mini-batch with the $j$-th largest loss among all batches of size $B$. Indeed, this is a natural objective obtained by applying OSGD to our problem: OSGD updates the embeddings using the gradient averaged over the $q$ batches that have the largest losses among $k$ randomly chosen batches (see Algorithm 2 in Appendix C.2). Let $U^{(t)}, V^{(t)}$ be the embedding matrices after applying OSGD for $t$ steps starting from $U^{(0)}, V^{(0)}$, using learning rate $\eta_t$. Then the following theorem, proven in Appendix C.2, holds.

Theorem 7 (Convergence results). Consider sampling $t^*$ from $[T]$ with probability proportional to $\{\eta_t\}_{t=0}^T$, that is, $P(t^* = t) = \eta_t / \big(\sum_{i=0}^T \eta_i\big)$. Then, for any $\rho > \rho_0 = \sqrt{8/(B\tau^2) + 4e^{2/\tau}/(B\tau^2)}$,
$$E\Big[\big\|\nabla \widetilde{L}_{\mathrm{con}}(U^{(t^*)}, V^{(t^*)})\big\|^2\Big] \le (\rho + \rho_0)^2 \cdot \frac{\widetilde{L}_{\mathrm{con}}(U^{(0)}, V^{(0)}) - \widetilde{L}_{\mathrm{con}}^* + 8\rho \sum_{t=0}^{T} \eta_t^2}{\sum_{t=0}^{T} \eta_t},$$
where $\widetilde{L}_{\mathrm{con}}^*$ denotes the minimum value of $\widetilde{L}_{\mathrm{con}}$. Given a sufficiently small learning rate $\eta_t = O(t^{-1/2})$, $E\big[\|\nabla \widetilde{L}_{\mathrm{con}}\|^2\big]$ decays at the rate $\widetilde{O}(T^{-1/2})$.

5.3 Spectral Clustering-based Approach

Applying OSGD to mini-batch contrastive learning has a potential benefit, as shown in Section 5.1, but it also has some challenges. Choosing the $q$ highest-loss batches in OSGD is only possible after evaluating the losses of all $\binom{N}{B}$ combinations. This requires computational complexity $O\big(\binom{N}{B} \log \binom{N}{B}\big)$, which is infeasible for large $N$. A naive solution is to first randomly choose $k$ batches and then select the $q$ high-loss batches among them. However, this naive random batch selection method does not guarantee that the chosen $q$ batches have the highest losses among all $\binom{N}{B}$ candidates.
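The OSGD weights $\gamma_j$ can be computed directly with binomial coefficients. A useful sanity check: $\gamma_j$ is the probability that the $j$-th ranked batch lands in the top-$q$ of a uniformly sampled size-$k$ subset, so for $k \ge q$ the weights sum to exactly $q$. The sketch below is ours (with $M$ standing in for $\binom{N}{B}$), not the paper's code.

```python
from math import comb

def osgd_weights(M, k, q):
    """gamma_j for j = 1..M, where M is the number of candidate batches.

    gamma_j = P(the j-th ranked batch is sampled AND fewer than q of the
    other sampled batches rank above it), i.e. it is among the top-q of
    a uniform size-k sample. Assumes q <= k <= M."""
    return [sum(comb(j - 1, l) * comb(M - j, k - l - 1) for l in range(q))
            / comb(M, k)
            for j in range(1, M + 1)]

# e.g. N = 8, B = 2 gives M = C(8, 2) = 28 candidate batches
g = osgd_weights(M=28, k=8, q=3)
```

Note that the weights are non-increasing in $j$: higher-loss batches receive more weight, which is precisely the ordering effect $\widetilde{L}_{\mathrm{con}}$ encodes.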
Motivated by these issues of OSGD, we relax the problem by reducing our search space so that the $q = N/B$ chosen batches $B_1, \ldots, B_q$ form a partition of the $N$ samples, i.e., $B_i \cap B_j = \emptyset$ for $i \ne j$ and $\cup_{i \in [q]} B_i = [N]$. We suggest an alternative batch selection method inspired by graph theory.

Algorithm 1: Spectral Clustering Method
Input: the number of positive pairs $N$, mini-batch size $B$, embedding matrices $U, V$
Output: selected mini-batches $\{B_j\}_{j=1}^{N/B}$
1. Construct the affinity matrix $A$: $A_{ij} = w(i, j)$ in Equation (5) if $i \ne j$, and $0$ otherwise.
2. Construct the degree matrix $D$ from $A$: $D_{ii} = \sum_{j=1}^N A_{ij}$, and $D_{ij} = 0$ for $i \ne j$.
3. $L \leftarrow D - A$; $k \leftarrow N/B$.
4. Find the $k$ largest eigenvectors of $L$, denoted $V_k \in R^{N \times k}$.
5. Normalize the rows of $V_k$ to have unit $\ell_2$-norm.
6. Apply the K-means clustering algorithm to the rows of the normalized $V_k$ to get cluster centers $Z \in R^{k \times k}$.
7. Construct a bipartite graph $G_{\mathrm{assign}}$: (i) the first partite set is the rows of $V_k$ and (ii) the second set is the collection of $B$ copies of each center in $Z$.
8. Compute distances between the rows of $V_k$ and the $B$ copies of each center in $Z$, and assign these as edge weights in $G_{\mathrm{assign}}$.
9. Solve the minimum-weight matching problem in $G_{\mathrm{assign}}$ using a method such as the Hungarian algorithm.
10. Return $\{B_j\}_{j=1}^{N/B}$.

The mini-batch contrastive loss $L_{\mathrm{con}}(U_B, V_B)$ for a given batch $B$ is lower bounded using Jensen's inequality as follows:
$$L_{\mathrm{con}}(U_B, V_B) = -\frac{1}{B}\sum_{i \in B}\left[\log\frac{e^{u_i^\top v_i/\tau}}{\sum_{j \in B} e^{u_i^\top v_j/\tau}} + \log\frac{e^{v_i^\top u_i/\tau}}{\sum_{j \in B} e^{v_i^\top u_j/\tau}}\right]$$
$$= \frac{1}{B}\sum_{i \in B}\left[\log\Big(1 + \sum_{j \in B\setminus\{i\}} e^{u_i^\top(v_j - v_i)/\tau}\Big) + \log\Big(1 + \sum_{j \in B\setminus\{i\}} e^{v_i^\top(u_j - u_i)/\tau}\Big)\right]$$
$$\ge \frac{1}{B(B-1)}\sum_{i \in B}\sum_{j \in B\setminus\{i\}}\left[\log\big(1 + (B-1)e^{u_i^\top(v_j - v_i)/\tau}\big) + \log\big(1 + (B-1)e^{v_i^\top(u_j - u_i)/\tau}\big)\right].$$
We employ this lower bound due to a nice property: it can be expressed as a summation of losses over pairs $(i, j)$ of samples within batch $B$. We consider a graph $G$ with $N$ nodes, where the weight $w(k, l)$ between nodes $k$ and $l$ is defined as
$$w(k, l) := \sum_{(i,j) \in \{(k,l),(l,k)\}} \log\big(1 + (B-1)e^{u_i^\top(v_j - v_i)/\tau}\big) + \log\big(1 + (B-1)e^{v_i^\top(u_j - u_i)/\tau}\big).$$
(5)

Recall that our goal is to choose $q$ batches having the highest contrastive losses among the $\binom{N}{B}$ batches. In this scenario, our target problem is equivalent to clustering the $N$ nodes of graph $G$ into $q$ clusters of equal size, where the objective is to minimize the sum of the weights of inter-cluster edges. This is the min-cut problem (Cormen et al., 2022). To approximate its solution on the graph $G$, we propose a variant of the well-known spectral clustering algorithm (Ng et al., 2001). Since the spectral clustering algorithm cannot guarantee an equal number of nodes in each cluster, we introduce an additional step to ensure that each cluster (batch) has an equal number $B$ of positive pairs. This step addresses a minimum-weight matching problem in a bipartite graph (Crouse, 2016) involving the $N$ data points and replicated cluster centers from spectral clustering. One partite set is the collection of data points and the other consists of $B$ copies of each cluster center. The aim is to allocate exactly $B$ data points to each center while minimizing the overall cost, defined as the total distance from data points to their respective centers. This ensures each cluster receives an equal number of data points while minimizing the assignment cost. The pseudo-code of our batch selection method² is provided in Algorithm 1. The computational complexity of our spectral clustering (SC) method is $O(k^2 B^2 N)$, while the complexity of OSGD is $O\big(\binom{N}{B} \log \binom{N}{B}\big)$ due to the necessity of sorting $\binom{N}{B}$ mini-batches. We provide a detailed complexity comparison of the algorithms in Appendix F. Figure 3 shows that the proposed SC method indeed favors batches with larger loss values.

²Our algorithm finds N/B clusters at once, instead of only finding a single best cluster.
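Algorithm 1 can be sketched end to end: build the affinity from Equation (5), embed via Laplacian eigenvectors, run K-means, and balance the clusters by min-cost matching against $B$ replicated centers. This is a minimal numpy/scipy sketch under our own choices (unnormalized Laplacian, a plain Lloyd's K-means, scipy's assignment solver in place of the Hungarian algorithm); it is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def affinity_matrix(U, V, B, tau=1.0):
    """A[i, j] = w(i, j) from Equation (5), with zero diagonal."""
    Suv, Svu = U.T @ V, V.T @ U                   # Suv[i, j] = u_i^T v_j
    cu = np.log1p((B - 1) * np.exp((Suv - np.diag(Suv)[:, None]) / tau))
    cv = np.log1p((B - 1) * np.exp((Svu - np.diag(Svu)[:, None]) / tau))
    C = cu + cv
    A = C + C.T                                   # sum over (i,j) in {(k,l),(l,k)}
    np.fill_diagonal(A, 0.0)
    return A

def spectral_batches(U, V, B, tau=1.0, iters=50, seed=0):
    """Partition [N] into N/B batches of exactly B positive pairs."""
    N = U.shape[1]
    k = N // B
    A = affinity_matrix(U, V, B, tau)
    L = np.diag(A.sum(axis=1)) - A                # graph Laplacian D - A
    _, vecs = np.linalg.eigh(L)                   # eigenvalues ascending
    Vk = vecs[:, -k:]                             # k largest eigenvectors
    Vk = Vk / (np.linalg.norm(Vk, axis=1, keepdims=True) + 1e-12)
    rng = np.random.default_rng(seed)             # plain Lloyd's K-means
    Z = Vk[rng.choice(N, k, replace=False)]
    for _ in range(iters):
        lab = np.linalg.norm(Vk[:, None] - Z[None], axis=2).argmin(axis=1)
        Z = np.stack([Vk[lab == c].mean(axis=0) if (lab == c).any() else Z[c]
                      for c in range(k)])
    # balanced assignment: B copies of each center, min-cost perfect matching
    reps = np.repeat(Z, B, axis=0)                # (k*B, k)
    cost = np.linalg.norm(Vk[:, None] - reps[None], axis=2)
    rows, cols = linear_sum_assignment(cost)
    cluster = cols[np.argsort(rows)] // B         # replica index -> center id
    return [np.where(cluster == c)[0] for c in range(k)]

rng = np.random.default_rng(0)
U = rng.standard_normal((4, 12)); U /= np.linalg.norm(U, axis=0)
V = rng.standard_normal((4, 12)); V /= np.linalg.norm(V, axis=0)
batches = spectral_batches(U, V, B=3)             # 4 batches of 3 pairs each
```

The replication trick is what enforces the equal-size constraint: since there are exactly $B$ copies of each center and $N = kB$ points, any perfect matching assigns exactly $B$ points per cluster.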
Compared with such an alternative approach, our method is (i) more efficient when we update models for multiple iterations, and (ii) guaranteed to load all samples within $N/B$ batches, and is thus expected to have better convergence (Bottou, 2009; Haochen & Sra, 2019; Gürbüzbalaban et al., 2021).

Figure 4: The behavior of embedding matrices $U, V$ optimized by different batch selection methods for $N = 8$ and $B = 2$. (a)-(d): Heatmaps of the $N \times N$ matrix of pairwise inner products $u_i^\top v_j$, where (a): ground-truth solution (ETF for $d = 2N$, cross-polytope for $d = N/2$); (b): full-batch loss optimized with GD; (c): sum of all $\binom{N}{B}$ mini-batch losses optimized with GD; (d): a partial sum of $\binom{N}{B}$ mini-batch losses optimized with GD. Note that both (b) and (c) reach the ground-truth solution in (a), while (d) does not, supporting our theoretical results in Section 4.2. Further, (e) compares the convergence of three mini-batch selection algorithms, 1) SGD, 2) OSGD, and 3) our spectral clustering method, when updating embeddings for 500 steps. OSGD and our method nearly converge to the optimal solution, while SGD does not. Here, the y-axis represents the Frobenius norm $\|U^\top V - U^{*\top} V^*\|_F$ of the difference between the heatmaps of the optimal solution and the updated embeddings.

6 Experiments

We validate our theoretical findings and the effectiveness of the proposed method by providing experimental results on synthetic and real datasets.
We first show that the experimental results on the synthetic dataset coincide with our two main theoretical results: (i) the relationship between the full-batch contrastive loss and the mini-batch contrastive loss given in Section 4, and (ii) the analysis of the convergence of OSGD and the proposed SC method given in Section 5. To demonstrate the practicality of our method, we provide experimental results on CIFAR-100 (Krizhevsky et al., 2009) and Tiny ImageNet (Le & Yang, 2015). Details of the experimental setting can be found in Appendix E, and our code is available at https://github.com/krafton-ai/mini-batch-cl.

6.1 Synthetic Dataset

Consider optimizing the embedding matrices $U, V$ using GD, where each column of $U, V$ is initialized as a multivariate normal vector and then normalized so that $\|u_i\| = \|v_i\| = 1$ for all $i$. We use learning rate $\eta = 0.5$ and apply the normalization step at every iteration. First, we compare the minimizers of three optimization problems: (i) full-batch optimization in Equation (1); (ii) mini-batch optimization in Equation (4) with $S_B$ containing all $\binom{N}{B}$ mini-batches; and (iii) mini-batch optimization with $S_B$ a strict subset of the $\binom{N}{B}$ mini-batches. We apply GD to each problem for $N = 8$ and $B = 2$, obtain the updated embedding matrices, and then show heatmaps of the $N \times N$ Gram matrix containing all pairwise inner products $u_i^\top v_j$ in Figure 4(b)-(d). One can observe that when either the full batch or all $\binom{N}{B}$ mini-batches are used for training, the trained embedding vectors reach the simplex ETF and simplex cross-polytope solutions for $d = 2N$ and $d = N/2$, respectively, as proved in Theorem 4. In contrast, when a strict subset of the $\binom{N}{B}$ mini-batches is used for training, these solutions are not achieved. Second, we compare the convergence speed of three algorithms: (i) the proposed SC method; (ii) OSGD, which updates the model using the top-1 batch exhibiting the highest loss among the $\binom{N}{B}$ possible batches at each step (see Algorithm 5); and (iii) SGD (see Algorithm 4).
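The synthetic-experiment loop, full-batch GD on the embeddings with renormalization after every step, can be sketched as follows. This is our own minimal sketch with an analytic gradient of the symmetric InfoNCE loss; the paper's experiments may differ in temperature and other details.

```python
import numpy as np

def softmax(S, axis):
    e = np.exp(S - S.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def full_batch_loss(U, V, tau=1.0):
    S = U.T @ V / tau
    lu = np.diag(S) - np.log(np.exp(S).sum(axis=1))   # u_i -> v_i direction
    lv = np.diag(S) - np.log(np.exp(S).sum(axis=0))   # v_i -> u_i direction
    return -(lu + lv).mean()

def gd_step(U, V, eta=0.5, tau=1.0):
    """One full-batch GD step on L_con, then project columns back to the sphere."""
    N = U.shape[1]
    S = U.T @ V / tau
    # dL/dS: row-softmax + column-softmax minus twice the identity, averaged
    G = (softmax(S, 1) + softmax(S, 0) - 2 * np.eye(N)) / N
    U_new = U - eta * (V @ G.T) / tau
    V_new = V - eta * (U @ G) / tau
    return (U_new / np.linalg.norm(U_new, axis=0),
            V_new / np.linalg.norm(V_new, axis=0))

rng = np.random.default_rng(0)
d, N = 4, 8                      # d = N/2, so a cross-polytope optimum is expected
U = rng.standard_normal((d, N)); U /= np.linalg.norm(U, axis=0)
V = rng.standard_normal((d, N)); V /= np.linalg.norm(V, axis=0)
losses = [full_batch_loss(U, V)]
for _ in range(200):
    U, V = gd_step(U, V)
    losses.append(full_batch_loss(U, V))
```

The renormalization after each step is the projection onto the unit-norm constraint used throughout the paper's synthetic experiments.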
Figure 4(e) shows that both OSGD and the proposed method nearly converge to the ground-truth solutions proved in Theorem 4 within 500 steps, while SGD does not. We obtain similar results for other values of $N$ and $d$, given in Appendix E.2.

Table 1: Top-1 retrieval accuracy on CIFAR-100 (or Tiny ImageNet), when each algorithm employs one of the following datasets to pretrain ResNet-18 with the SimCLR and SogCLR objectives: (1) CIFAR-100, (2) Tiny ImageNet, (3) CIFAR-100-C, or (4) Tiny ImageNet-C. The SC algorithm proposed in Section 5.3 outperforms the existing baselines.

| Image Retrieval | CIFAR-100 SimCLR | CIFAR-100 SogCLR | Tiny ImageNet SimCLR | Tiny ImageNet SogCLR | CIFAR-100-C SimCLR | CIFAR-100-C SogCLR | Tiny ImageNet-C SimCLR | Tiny ImageNet-C SogCLR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SGD | 70.29 | 60.66 | 78.39 | 67.48 | 38.48 | 38.34 | 42.42 | 45.04 |
| OSGD (approximated) | 70.00 | 61.48 | 78.57 | 69.31 | 38.55 | 38.71 | 43.11 | 45.63 |
| SC | 74.71 | 73.32 | 82.95 | 79.73 | 38.99 | 41.51 | 43.45 | 46.81 |

6.2 Real Datasets

In this section, we show that the proposed SC method is effective in more practical settings where the embedding is learned by a parameterized encoder, and that it can easily be applied to existing uni-modal frameworks such as SimCLR (Chen et al., 2020a) and SogCLR (Yuan et al., 2022). We integrate different batch selection methods, namely (i) the proposed SC method; (ii) OSGD without replacement (see Algorithm 6), an approximated OSGD chosen because evaluating all $\binom{N}{B}$ batches is intractable on real datasets; and (iii) SGD (see Algorithm 4), into these frameworks. We compare the pre-trained models' performances on retrieval downstream tasks. We conduct mini-batch contrastive learning with mini-batch size $B = 32$ using ResNet-18-based encoders on the CIFAR-100 and Tiny ImageNet datasets. All learning is executed on a single NVIDIA A100 GPU. The training code and hyperparameters are based on the official codebase of SogCLR³ (Yuan et al., 2022).
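The retrieval metric used for these comparisons amounts to top-1 nearest-neighbor matching by cosine similarity between paired embeddings. A minimal sketch (the function name and the synthetic data are ours):

```python
import numpy as np

def top1_retrieval_accuracy(Z1, Z2):
    """Fraction of rows in Z1 whose most cosine-similar row in Z2
    is its own positive pair (i.e., the row with the same index)."""
    Z1 = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    Z2 = Z2 / np.linalg.norm(Z2, axis=1, keepdims=True)
    sims = Z1 @ Z2.T                                  # cosine similarity matrix
    return float((sims.argmax(axis=1) == np.arange(len(Z1))).mean())

rng = np.random.default_rng(0)
Z1 = rng.standard_normal((100, 16))                   # embeddings of view 1
Z2 = Z1 + 0.1 * rng.standard_normal((100, 16))        # mildly perturbed view 2
acc = top1_retrieval_accuracy(Z1, Z2)
```

In the experiments, `Z1` and `Z2` would be the normalized encoder outputs of the two augmented views of each validation image.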
We use the LARS optimizer (You et al., 2017) with momentum 0.9 and weight decay $10^{-6}$. We use a learning rate scheduler that starts with a warm-up phase over the first 10 epochs, during which the learning rate increases linearly to its maximum value $\eta_{\max} = 0.075\sqrt{B}$. After this warm-up stage, we employ a cosine annealing (half-cycle) schedule for the remaining epochs. For the approximated OSGD, we use $k = 1500$ and $q = 150$. To expedite batch selection in the proposed SC method, we begin by randomly partitioning the $N$ positive pairs into $kB$-sized clusters, using $k = 40$. We then apply the SC method to each $kB$-sized cluster to generate $k$ mini-batches, resulting in a total of $(N/kB) \cdot k = N/B$ mini-batches. We train models for a total of 100 epochs. We measure the performance of the models on the retrieval task, defined as finding the positive pair of a given image among all pairs (the number of images in the validation dataset): (1) we extract normalized embedding features from augmented images using the pre-trained models; (2) we identify the pair of the given augmented image among all of the validation images via the cosine similarity of the embedding vectors. We also measure the performance of the models pre-trained on CIFAR-100 (or Tiny ImageNet) under a harder setting, where various corruptions are applied per image so that a set of corrupted images can be regarded as hard negative samples. We employ CIFAR-100-C (or Tiny ImageNet-C), corrupted datasets (Hendrycks & Dietterich, 2019) designed for robustness evaluation. CIFAR-100-C (Tiny ImageNet-C) has the same images as CIFAR-100 (Tiny ImageNet), but these images have been altered by 19 (15) different types of corruption (e.g., image noise, blur, etc.). Each type of corruption has five severity levels. We utilize images corrupted at severity level 1.
These images tend to be more similar to each other than those corrupted at higher severity levels, which consequently makes it more challenging to retrieve positive pairs among the other images. To perform the retrieval task, we follow these procedures: (i) we apply two distinct augmentations to each image to generate positive pairs; (ii) we extract normalized embedding features from the augmented images using the pre-trained models; (iii) we identify the pair of the given augmented image among the augmentations of the 19 (15) corrupted images via the cosine similarity of the embedding vectors. This process is iterated across the 10K CIFAR-100 images (10K Tiny ImageNet images). The top-1 accuracy measures the percentage of retrieved images that match their positive pair, where each pair contains two different views stemming from a single image. Table 1 presents the top-1 retrieval accuracy on CIFAR-100, Tiny ImageNet, and their corrupted counterparts. The proposed SC method surpasses SGD and the approximated OSGD. While we demonstrate the efficacy of the SC method, it has a limitation in that it requires more time for batch selection compared to SGD. To mitigate this issue in practice, we suggest performing batch selection with the SC method in parallel with updating the model through gradient descent. This approach involves using the same batches for multiple epochs while the next batch selection proceeds, based on the assumption that batches with higher loss values do not change rapidly. Exploring this approach is beyond the scope of this work, so we leave it as future work.

7 Conclusion

We theoretically analyzed mini-batch contrastive learning with the InfoNCE loss. First, we showed that the solutions of mini-batch optimization and of full-batch optimization are identical if and only if all $\binom{N}{B}$ mini-batches are considered.

³https://github.com/Optimization-AI/SogCLR
Second, we analyzed the convergence of OSGD and devised the spectral clustering (SC) method, a new batch selection method that handles the complexity issue of OSGD in mini-batch contrastive learning. Experimental results support our theoretical findings and the superiority of SC. Our future work includes (1) characterizing the optimal embeddings for the $N > d + 1$ case, (2) generalizing our findings to broader contrastive losses, and (3) exploring a complexity-efficient batch selection method.

References

Aviad Aberdam, Ron Litman, Shahar Tsiper, Oron Anschel, Ron Slossberg, Shai Mazor, R Manmatha, and Pietro Perona. Sequence-to-sequence contrastive learning for text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15302–15312, 2021.

Kwangjun Ahn, Chulhee Yun, and Suvrit Sra. SGD with shuffling: optimal rates without component convexity and large epoch requirements. Advances in Neural Information Processing Systems, 33:17526–17535, 2020.

Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. Advances in Neural Information Processing Systems, 32, 2019.

Sergiy Borodachov. Optimal antipodal configuration of 2d points on a sphere in R^d for covering. arXiv preprint arXiv:2210.12472, 2022.

Léon Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. In Proceedings of the Symposium on Learning and Data Science, Paris, volume 8, pp. 2624–2633, 2009.

Jaeyoung Cha, Jaewook Lee, and Chulhee Yun. Tighter lower bounds for shuffling SGD: Random permutations and beyond, 2023.

Changyou Chen, Jianyi Zhang, Yi Xu, Liqun Chen, Jiali Duan, Yiran Chen, Son Dinh Tran, Belinda Zeng, and Trishul Chilimbi. Why do we need large batch sizes in contrastive learning? A gradient-bias perspective. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022.
Hao Chen, Benoit Lagadec, and Francois Bremond. ICE: Inter-instance contrastive encoding for unsupervised person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14960–14969, 2021.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. PMLR, 2020a.

Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020b.

Hanseul Cho and Chulhee Yun. SGDA with shuffling: faster convergence for nonconvex-PŁ minimax optimization, 2023.

Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pp. 539–546. IEEE, 2005.

Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 2022.

David F Crouse. On implementing 2D rectangular assignment algorithms. IEEE Transactions on Aerospace and Electronic Systems, 52(4):1679–1696, 2016.

Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.

Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. CLAP: Learning audio concepts from natural language supervision. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5, 2023.

Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. DataComp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023.
Shashank Goel, Hritik Bansal, Sumit Bhatia, Ryan Rossi, Vishwa Vinay, and Aditya Grover. CyCLIP: Cyclic contrastive language-image pretraining. In Advances in Neural Information Processing Systems, volume 35, pp. 6704–6719. Curran Associates, Inc., 2022.

Florian Graf, Christoph Hofer, Marc Niethammer, and Roland Kwitt. Dissecting supervised contrastive learning. In International Conference on Machine Learning, pp. 3821–3830. PMLR, 2021.

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.

M. Gürbüzbalaban, A. Ozdaglar, and P. A. Parrilo. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, 186(1):49–84, 2021.

Jeff Haochen and Suvrit Sra. Random shuffling beats SGD after finite epochs. In International Conference on Machine Learning, pp. 2624–2633. PMLR, 2019.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738, 2020.

Dan Hendrycks and Thomas G. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In 7th International Conference on Learning Representations, ICLR 2019, 2019.

Qianjiang Hu, Xiao Wang, Wei Hu, and Guo-Jun Qi. AdCo: Adversarial contrast for efficient learning of unsupervised representations from self-trained negative adversaries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1074–1083, 2021.

Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. A survey on contrastive self-supervised learning. Technologies, 9(1):2, 2020.
Wenlong Ji, Yiping Lu, Yiliang Zhang, Zhun Deng, and Weijie J Su. An unconstrained layer-peeled perspective on neural collapse. In International Conference on Learning Representations, 2022.

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pp. 4904–4916. PMLR, 2021.

Jiachen Jiang, Jinxin Zhou, Peng Wang, Qing Qu, Dustin Mixon, Chong You, and Zhihui Zhu. Generalized neural collapse for a large number of classes. arXiv preprint arXiv:2310.05351, 2023.

Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, and Diane Larlus. Hard negative mixing for contrastive learning. Advances in Neural Information Processing Systems, 33:21798–21809, 2020.

Kenji Kawaguchi and Haihao Lu. Ordered SGD: A new stochastic optimization framework for empirical risk minimization. In International Conference on Artificial Intelligence and Statistics, pp. 669–679. PMLR, 2020.

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33:18661–18673, 2020.

Ganesh Ramachandra Kini, Vala Vakilian, Tina Behnia, Jaidev Gill, and Christos Thrampoulidis. Supervised-contrastive loss learns orthogonal frames and batching matters. arXiv preprint arXiv:2306.07960, 2023.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Ya Le and Xuan Yang. Tiny ImageNet visual recognition challenge. CS 231N, 7(7):3, 2015.

Janghyeon Lee, Jongsuk Kim, Hyounguk Shon, Bumsoo Kim, Seung Hwan Kim, Honglak Lee, and Junmo Kim. UniCLIP: Unified framework for contrastive language-image pre-training. In Advances in Neural Information Processing Systems, 2022.

Ilya Loshchilov and Frank Hutter. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343, 2015.

Jianfeng Lu and Stefan Steinerberger. Neural collapse under cross-entropy loss. Applied and Computational Harmonic Analysis, 59:224–241, 2022. ISSN 1063-5203. Special Issue on Harmonic Analysis and Machine Learning.

Yucheng Lu, Si Yi Meng, and Christopher De Sa. A general analysis of example-selection for stochastic gradient descent. In International Conference on Learning Representations, 2021.

Yucheng Lu, Wentao Guo, and Christopher De Sa. GraB: Finding provably better data permutations than random reshuffling. In Advances in Neural Information Processing Systems, 2022.

Shuang Ma, Zhaoyang Zeng, Daniel McDuff, and Yale Song. Active contrastive learning of audio-visual video representations. In International Conference on Learning Representations, 2021.

Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Random reshuffling: Simple analysis with vast improvements. Advances in Neural Information Processing Systems, 33:17309–17320, 2020.

Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717, 2020.

Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli. SGD without replacement: Sharper rates for general smooth convex functions. In International Conference on Machine Learning, pp. 4703–4711. PMLR, 2019.

Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14, 2001.

Lam M Nguyen, Quoc Tran-Dinh, Dzung T Phan, Phuong Ha Nguyen, and Marten Van Dijk. A unified convergence analysis for shuffling-type gradient methods. The Journal of Machine Learning Research, 22(1):9397–9440, 2021.
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.

Hieu Pham, Zihang Dai, Golnaz Ghiasi, Hanxiao Liu, Adams Wei Yu, Minh-Thang Luong, Mingxing Tan, and Quoc V Le. Combined scaling for zero-shot transfer learning. arXiv preprint arXiv:2111.10050, 2021.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Shashank Rajput, Anant Gupta, and Dimitris Papailiopoulos. Closing the convergence gap of SGD without replacement. In International Conference on Machine Learning, pp. 7964–7973. PMLR, 2020.

Shashank Rajput, Kangwook Lee, and Dimitris Papailiopoulos. Permutation-based SGD: Is random optimal? In International Conference on Learning Representations, 2022.

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pp. 8821–8831. PMLR, 2021.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.

Benjamin Recht and Christopher Re. Toward a noncommutative arithmetic-geometric mean inequality: Conjectures, case-studies, and consequences. In Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, pp. 11.1–11.24. PMLR, 2012.
Benjamin Recht and Christopher Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201–226, 2013.

Joshua David Robinson, Ching-Yao Chuang, Suvrit Sra, and Stefanie Jegelka. Contrastive learning with hard negative samples. In International Conference on Learning Representations, 2021.

Vin Sachidananda, Shao-Yen Tseng, Erik Marchi, Sachin Kajarekar, and Panayiotis Georgiou. CALM: Contrastive aligned audio-language multirate and multimodal representations. arXiv preprint arXiv:2202.03587, 2022.

Itay Safran and Ohad Shamir. How good is SGD with random shuffling?, 2021a.

Itay Safran and Ohad Shamir. Random shuffling beats SGD only after many epochs on ill-conditioned problems. Advances in Neural Information Processing Systems, 34:15151–15161, 2021b.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.

Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. Advances in Neural Information Processing Systems, 29, 2016.

Jiaming Song and Stefano Ermon. Understanding the limitations of variational mutual information estimators. In International Conference on Learning Representations, 2020.

Mátyás A Sustik, Joel A Tropp, Inderjit S Dhillon, and Robert W Heath Jr. On the existence of equiangular tight frames. Linear Algebra and its Applications, 426(2-3):619–635, 2007.

Trang H Tran, Lam M Nguyen, and Quoc Tran-Dinh. SMG: A shuffling gradient-based method with momentum. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 10379–10389. PMLR, 2021.

Xiao Wang and Guo-Jun Qi. Contrastive learning with stronger augmentations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.

Chun-Hsiao Yeh, Cheng-Yao Hong, Yen-Chi Hsu, Tyng-Luh Liu, Yubei Chen, and Yann LeCun. Decoupled contrastive learning. In European Conference on Computer Vision, pp. 668–684. Springer, 2022.

Bicheng Ying, Kun Yuan, and Ali H. Sayed. Variance-reduced stochastic learning under random reshuffling. IEEE Transactions on Signal Processing, 68:1390–1408, 2020. doi: 10.1109/TSP.2020.2968280.

Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.

Zhuoning Yuan, Yuexin Wu, Zi-Hao Qiu, Xianzhi Du, Lijun Zhang, Denny Zhou, and Tianbao Yang. Provable stochastic optimization for global contrastive learning: Small batch does not harm performance. In International Conference on Machine Learning, pp. 25760–25782. PMLR, 2022.

Dewen Zeng, Yawen Wu, Xinrong Hu, Xiaowei Xu, Haiyun Yuan, Meiping Huang, Jian Zhuang, Jingtong Hu, and Yiyu Shi. Positional contrastive learning for volumetric medical image segmentation. In Medical Image Computing and Computer Assisted Intervention, MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part II 24, pp. 221–230. Springer, 2021.

Wenzheng Zhang and Karl Stratos. Understanding hard negatives in noise contrastive estimation. In North American Chapter of the Association for Computational Linguistics, 2021.

Jinxin Zhou, Xiao Li, Tianyu Ding, Chong You, Qing Qu, and Zhihui Zhu. On the optimization landscape of neural collapse under MSE loss: Global optimality with unconstrained features. In International Conference on Machine Learning, pp. 27179–27202. PMLR, 2022.

Mohammadreza Zolfaghari, Yi Zhu, Peter Gehler, and Thomas Brox. CrossCLR: Cross-modal contrastive learning for multi-modal video representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1450–1459, 2021.
Organization of the Appendix

1. Appendix A outlines the limitations of this paper.
2. In Appendix B we introduce some additional definitions for posterity.
3. In Appendix C we provide detailed proofs of the results from the main paper, as well as any intermediate results/lemmas that we found useful.
(a) Specifically, in Appendix C.1 we provide proofs of the results from Section 4, which focuses on the relationship between the optimal solutions for minimizing the mini-batch and full-batch contrastive loss.
(b) Appendix C.2 contains the proofs of the results from Section 5, which concern the application of Ordered SGD to mini-batch contrastive learning.
(c) Appendix C.3 supplements Appendix C.2 in that it contains auxiliary notation and proofs required for the proof of Theorem 7.
4. Appendix D specifies the pseudo-code and details for the two algorithms: (i) Stochastic Gradient Descent and (ii) Ordered SGD.
5. Appendix E provides some additional results.
6. In Appendix F we provide computational complexities for the three algorithms.

A Limitations

We note that our theoretical results have two major limitations:
1. While we would like to extend our results to the general case of $N > d + 1$, we were only able to characterize the optimal solution for the specific case of $N = 2d$. Furthermore, our result for the case of $N = 2d$ in Theorem 4 requires the conjecture that the optimal solution is symmetric and antipodal. However, as mentioned by Lu & Steinerberger (2022), the general case of $N > d + 1$ seems quite challenging in the non-asymptotic regime.
2. In practice, the embeddings are usually the output of a shared neural network encoder. However, our results are for the case when the embeddings only have a norm constraint. Thus, our results do not readily indicate any generalization to unseen data.
We expect, however, that it is possible to extend our results to the shared-encoder setting by assuming sufficient overparameterization.

B Additional Definitions

Definition 4 (Sustik et al. (2007)). A set of $N$ vectors $\{u_i\}_{i=1}^N$ in $\mathbb{R}^d$ forms an equiangular tight frame (ETF) if (i) they are all unit norm: $\|u_i\| = 1$ for every $i \in [N]$; (ii) they are equiangular: $|u_i^\top u_j| = \alpha \ge 0$ for all $i \ne j$ and some $\alpha \ge 0$; and (iii) they form a tight frame: $UU^\top = (N/d) I_d$, where $U$ is the $d \times N$ matrix whose columns are $u_1, u_2, \ldots, u_N$, and $I_d$ is the $d \times d$ identity matrix.

C.1 Proofs of Results from Section 4

Lemma 1 (Optimal solution when $N \le d + 1$). Suppose $N \le d + 1$. Then, the optimal solution $(U^\star, V^\star)$ of the full-batch contrastive learning problem in Equation (1) satisfies two properties: (i) $U^\star = V^\star$, and (ii) the columns of $U^\star$ form a simplex ETF.

Proof. First, we define the contrastive loss as the sum of two symmetric one-sided contrastive loss terms to simplify the notation. We denote the following term as the one-sided contrastive loss:
$$L(U, V) = -\frac{1}{N} \sum_{i=1}^N \log \frac{e^{u_i^\top v_i/\tau}}{\sum_{j=1}^N e^{u_i^\top v_j/\tau}}. \quad (6)$$
Then, the overall contrastive loss is given by the sum of the two one-sided contrastive losses:
$$L^{\mathrm{con}}(U, V) = L(U, V) + L(V, U). \quad (7)$$
Since $L^{\mathrm{con}}$ is symmetric in its arguments, results pertaining to the optimum of $L(U, V)$ readily extend to $L^{\mathrm{con}}$. Now, let us consider the simpler problem of minimizing the one-sided contrastive loss from Equation (6), which reduces the problem to exactly the same setting as Lu & Steinerberger (2022):
$$L(U, V) = -\frac{1}{N} \sum_{i=1}^N \log \frac{e^{u_i^\top v_i/\tau}}{\sum_{j=1}^N e^{u_i^\top v_j/\tau}} = \frac{1}{N} \sum_{i=1}^N \log \Big( 1 + \sum_{j \ne i} e^{(v_j - v_i)^\top u_i/\tau} \Big).$$
Note that, for any fixed $1 \le i \le N$,
$$\sum_{j \ne i} e^{(v_j - v_i)^\top u_i/\tau} = e^{-v_i^\top u_i/\tau} \sum_{j \ne i} e^{v_j^\top u_i/\tau} = (N-1)\, e^{-v_i^\top u_i/\tau} \cdot \frac{1}{N-1} \sum_{j \ne i} e^{v_j^\top u_i/\tau}$$
$$\overset{(a)}{\ge} (N-1)\, e^{-v_i^\top u_i/\tau} \exp\Big( \frac{\sum_{j \ne i} v_j^\top u_i}{\tau(N-1)} \Big) \overset{(b)}{=} (N-1)\, e^{-v_i^\top u_i/\tau} \exp\Big( \frac{\bar v^\top u_i - v_i^\top u_i}{\tau(N-1)} \Big) = (N-1) \exp\Big( \frac{\bar v^\top u_i - N (v_i^\top u_i)}{\tau(N-1)} \Big),$$
where (a) follows by applying Jensen's inequality to $e^t$, and (b) follows from the definition $\bar v := \sum_{i=1}^N v_i$.
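As a quick numerical sanity check of Definition 4 (a sketch of ours, not part of the paper's proofs): the snippet below builds the simplex ETF of $N = d+1$ unit vectors by centering and normalizing the standard basis, then verifies the unit-norm, equiangular, and tight-frame conditions. The construction and variable names are our own; the centered-basis construction is a standard one.

```python
import numpy as np

N = 5            # number of points; a simplex ETF of N points needs d >= N - 1
d = N - 1
# Centered unit vectors: u_i = (e_i - (1/N)1) / ||e_i - (1/N)1||
C = np.eye(N) - np.ones((N, N)) / N
U = C / np.linalg.norm(C, axis=0)           # columns are unit vectors in R^N
G = U.T @ U                                 # Gram matrix
assert np.allclose(np.diag(G), 1.0)         # (i) unit norm
off = G[~np.eye(N, dtype=bool)]
assert np.allclose(off, -1.0 / (N - 1))     # (ii) equiangular, alpha = 1/(N-1)
# The points span the hyperplane orthogonal to the all-ones vector;
# express them in R^d using an orthonormal basis of that hyperplane.
basis = np.linalg.svd(C)[0][:, :d]
Ud = basis.T @ U                            # d x N matrix of coordinates
assert np.allclose(Ud @ Ud.T, (N / d) * np.eye(d))   # (iii) tight frame: UU^T = (N/d) I_d
```

All three assertions pass, confirming that the centered standard basis realizes a simplex ETF with pairwise inner product $-1/(N-1)$.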
Since $\log(\cdot)$ is monotonic, we have $x > y \iff \log(x) > \log(y)$, and therefore
$$N \cdot L(U, V) \ge \sum_{i=1}^N \log \Big( 1 + (N-1) \exp\Big( \frac{\bar v^\top u_i}{\tau(N-1)} - \frac{N (v_i^\top u_i)}{\tau(N-1)} \Big) \Big) \quad (8)$$
$$\overset{(c)}{\ge} N \log \Big( 1 + (N-1) \exp\Big( \frac{\bar v^\top \bar u}{\tau N(N-1)} - \frac{N \sum_{i=1}^N v_i^\top u_i}{\tau N(N-1)} \Big) \Big), \quad (9)$$
where (c) follows by applying Jensen's inequality to the convex function $\phi(t) = \log(1 + a e^{bt})$ for $a, b > 0$, and (d) $\bar u := \sum_{i=1}^N u_i$. Note that for equalities to hold in Equations (8) and (9), we need constants $c_i, c$ such that
$$v_j^\top u_i = c_i \quad \forall j \ne i, \quad (10)$$
$$\frac{\bar v^\top u_i}{N-1} - \frac{N (v_i^\top u_i)}{N-1} = c \quad \forall i \in [N]. \quad (11)$$
Since $\log(\cdot)$ and $\exp(\cdot)$ are both monotonic, minimizing the lower bound in Equation (9) is equivalent to the maximization problem
$$\max_{U, V} \; N \sum_{i=1}^N v_i^\top u_i - \bar v^\top \bar u. \quad (12)$$
All that remains is to show that the solution that maximizes Equation (12) also satisfies the conditions in Equations (10) and (11). To see this, first note that the maximization problem can be written as
$$\max \; v_{\mathrm{stack}}^\top \big( (N I_N - \mathbf{1}_N \mathbf{1}_N^\top) \otimes I_d \big)\, u_{\mathrm{stack}},$$
where $v_{\mathrm{stack}} = (v_1, v_2, \ldots, v_N)$ is the vector in $\mathbb{R}^{Nd}$ formed by stacking the vectors $v_i$ together, and $u_{\mathrm{stack}}$ is similarly defined. $I_N$ denotes the $N \times N$ identity matrix, $\mathbf{1}_N$ denotes the all-one vector in $\mathbb{R}^N$, and $\otimes$ denotes the Kronecker product. It is easy to see that $\|u_{\mathrm{stack}}\| = \|v_{\mathrm{stack}}\| = \sqrt{N}$, since each $\|u_i\| = \|v_i\| = 1$. Since the eigenvalues of $A \otimes B$ are the products of the eigenvalues of $A$ and $B$, in order to analyze the spectrum of the middle term in the above maximization problem, it suffices to consider the eigenvalues of $N I_N - \mathbf{1}_N \mathbf{1}_N^\top$. As shown by the elegant analysis in Lu & Steinerberger (2022), $(N I_N - \mathbf{1}_N \mathbf{1}_N^\top) p = N p$ for any $p \in \mathbb{R}^N$ such that $\sum_{i=1}^N p_i = 0$, and $(N I_N - \mathbf{1}_N \mathbf{1}_N^\top) q = 0$ for any $q = k \mathbf{1}_N$ with $k \in \mathbb{R}$. Therefore, it follows that its eigenvalues are $N$ with multiplicity $N-1$, and $0$.
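The spectral claim above is easy to verify numerically. The sketch below (our own illustration, with assumed variable names) checks that $N I_N - \mathbf{1}_N \mathbf{1}_N^\top$ has eigenvalue $0$ on constant vectors and eigenvalue $N$ with multiplicity $N-1$ on zero-mean vectors.

```python
import numpy as np

N = 6
A = N * np.eye(N) - np.ones((N, N))
evals = np.sort(np.linalg.eigvalsh(A))
# One zero eigenvalue (eigenvector 1_N), and N with multiplicity N - 1.
assert np.isclose(evals[0], 0.0)
assert np.allclose(evals[1:], N)
# A annihilates constant vectors and scales zero-mean vectors by N.
p = np.random.randn(N); p -= p.mean()
assert np.allclose(A @ p, N * p)
assert np.allclose(A @ np.ones(N), 0.0)
```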
Since its largest eigenvalue is $N$, and since $\|u_{\mathrm{stack}}\| = \|v_{\mathrm{stack}}\| = \sqrt{N}$, applying the Cauchy-Schwarz inequality we have that
$$v_{\mathrm{stack}}^\top \big( (N I_N - \mathbf{1}_N \mathbf{1}_N^\top) \otimes I_d \big)\, u_{\mathrm{stack}} \le N \|v_{\mathrm{stack}}\| \|u_{\mathrm{stack}}\| = N^2.$$
Moreover, we see that setting $u_i = v_i$ and setting $\{u_i\}_{i=1}^N$ to be the simplex ETF attains the maximum above, while also satisfying the conditions in Equations (10) and (11) with $c_i = -1/(N-1)$ and $c = -N/(N-1)$. Therefore, the inequalities in Equations (8) and (9) are actually equalities for $u_i = v_i$ when they are chosen to be the simplex ETF in $\mathbb{R}^d$, which is attainable since $d \ge N - 1$. Therefore, we have shown that if $U^\star = \{u_i^\star\}_{i=1}^N$ is the simplex ETF and $u_i^\star = v_i^\star$ for all $i \in [N]$, then $(U^\star, V^\star) = \arg\min_{U,V} L(U, V)$ over the unit sphere in $\mathbb{R}^d$. All that remains is to show that this is also the minimizer for $L^{\mathrm{con}}$. First note that $(U^\star, V^\star)$ is also the minimizer for $L(V, U)$ by symmetry: one can repeat the proof exactly by simply exchanging $u_i$ and $v_i$ to see that this is indeed true. Now, recalling Equation (7), we have
$$\min L^{\mathrm{con}} = \min \big( L(U, V) + L(V, U) \big) \ge \min L(U, V) + \min L(V, U) = L(U^\star, V^\star) + L(V^\star, U^\star). \quad (13)$$
However, since the minimizer of both terms in Equation (13) is the same, the inequality becomes an equality. Therefore, we have shown that $(U^\star, V^\star)$ is the minimizer of $L^{\mathrm{con}}$, completing the proof.

Remark 1. In the proof of the above lemma, we only show that the simplex ETF attains the minimum loss in Equation (1), but not that it is the only minimizer. The proof of Lu & Steinerberger (2022) can be extended to show that this is indeed true as well. We omit it here for ease of exposition.

Theorem 3 (Optimal solution when $N = 2d$). Let
$$(U^\star, V^\star) := \arg\min_{(U,V) \in \mathcal{A}} L^{\mathrm{con}}(U, V) \quad \text{s.t. } \|u_i\| = 1, \; \|v_i\| = 1 \;\; \forall i \in [N], \quad (3)$$
where $\mathcal{A} := \{(U, V) : U, V \text{ are symmetric and antipodal}\}$. Then, the columns of $U^\star$ form a simplex cross-polytope for $N = 2d$.

Proof.
By applying the logarithmic property that allows division to be represented as subtraction,
$$L(U, V) = -\frac{1}{N} \sum_{i=1}^N \log \frac{e^{u_i^\top v_i/\tau}}{\sum_{j=1}^N e^{u_i^\top v_j/\tau}} = -\frac{1}{N} \sum_{i=1}^N \Big[ u_i^\top v_i/\tau - \log \sum_{j=1}^N e^{u_i^\top v_j/\tau} \Big].$$
Since $U = V$ (symmetric property), the contrastive loss satisfies
$$L^{\mathrm{con}}(U, V) = 2 L(U, U) = -\frac{2}{N} \sum_{i=1}^N \Big[ u_i^\top u_i/\tau - \log \sum_{j=1}^N e^{u_i^\top u_j/\tau} \Big] = -\frac{2}{\tau} + \frac{2}{N} \sum_{i=1}^N \log \sum_{j=1}^N e^{u_i^\top u_j/\tau}. \quad (14)$$
Since $\|u_i\| = 1$ for any $i \in [N]$, we can derive the following relations: $\|u_i - u_j\|^2 = 2 - 2 u_i^\top u_j$, i.e., $u_i^\top u_j = 1 - \|u_i - u_j\|^2/2$. We incorporate these relations into Equation (14) as follows:
$$L^{\mathrm{con}}(U, V) = -\frac{2}{\tau} + \frac{2}{N} \sum_{i=1}^N \log \sum_{j=1}^N e^{1/\tau - \|u_i - u_j\|^2/2\tau} = \frac{2}{N} \sum_{i=1}^N \log \sum_{j=1}^N e^{-\|u_i - u_j\|^2/2\tau}.$$
The antipodal property of $U$ indicates that for each $i \in [N]$, there exists a $j(i)$ such that $u_{j(i)} = -u_i$. By applying this property, we can manipulate the summation of $e^{-\|u_i - u_j\|^2/2\tau}$ over $j$ as follows:
$$\sum_{j=1}^N e^{-\|u_i - u_j\|^2/2\tau} = e^{-\|u_i - u_i\|^2/2\tau} + e^{-\|u_i - u_{j(i)}\|^2/2\tau} + \sum_{j \ne i, j(i)} e^{-\|u_i - u_j\|^2/2\tau} = 1 + e^{-2/\tau} + \sum_{j \ne i, j(i)} e^{-\|u_i - u_j\|^2/2\tau}.$$
Hence,
$$L^{\mathrm{con}}(U, V) = \frac{2}{N} \sum_{i=1}^N \log \Big( 1 + e^{-2/\tau} + \sum_{j \ne i, j(i)} e^{-\|u_i - u_j\|^2/2\tau} \Big)$$
$$\overset{(a)}{\ge} \frac{2}{N(N-2)} \sum_{i=1}^N \sum_{j \ne i, j(i)} \log \Big( 1 + e^{-2/\tau} + (N-2)\, e^{-\|u_i - u_j\|^2/2\tau} \Big)$$
$$= \frac{2}{N(N-2)} \sum_{i=1}^N \sum_{j \ne i} \log \Big( 1 + e^{-2/\tau} + (N-2)\, e^{-\|u_i - u_j\|^2/2\tau} \Big) - \frac{2}{N-2} \log\big( 1 + (N-1) e^{-2/\tau} \big)$$
$$\overset{(b)}{\ge} \frac{2}{N(N-2)} \sum_{i=1}^N \sum_{j \ne i} \log \Big( 1 + e^{-2/\tau} + (N-2)\, e^{-\|u_i^\star - u_j^\star\|^2/2\tau} \Big) - \frac{2}{N-2} \log\big( 1 + (N-1) e^{-2/\tau} \big),$$
where (a) follows by applying Jensen's inequality to the concave function $f(t) = \log(1 + e^{-2/\tau} + t)$, and (b) follows by Lemma 3 and the fact that the function $g(t) = \log[1 + e^{-2/\tau} + (N-2) e^{-t/2\tau}]$ is convex and monotonically decreasing; $\{u_1^\star, \ldots, u_N^\star\}$ denotes a set of vectors which forms a cross-polytope. Both inequalities (a) and (b) are equalities only when the columns of $U$ form a cross-polytope. Therefore, the columns of $U^\star$ form a cross-polytope.

Lemma 3. Given a function $g(t)$ that is convex and monotonically decreasing, let
$$U^\star := \arg\min_{U \in \mathcal{A}} \sum_{i=1}^N \sum_{j \ne i} g(\|u_i - u_j\|^2) \quad \text{s.t. } \|u_i\| = 1 \;\; \forall i \in [N], \quad (15)$$
where $\mathcal{A} := \{U : U \text{ is antipodal}\}$. Then, the columns of $U^\star$ form a simplex cross-polytope for $N = 2d$.

Proof. Suppose $N = 2d$ and $U \in \mathcal{A}$, and let $g(t)$ be convex and monotonically decreasing.
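Theorem 3 can be illustrated numerically. The sketch below (ours; the helper names `one_sided` and `L_con` are assumptions, not the paper's code) compares the contrastive loss of the cross-polytope $\{\pm e_i\}$ against random symmetric antipodal competitors with $U = V$ and $N = 2d$; by the theorem the cross-polytope should never lose.

```python
import numpy as np

rng = np.random.default_rng(0)
d, tau = 3, 0.5
N = 2 * d

def one_sided(U, V, tau):
    # L(U, V) = -(1/N) sum_i log( exp(u_i.v_i/tau) / sum_j exp(u_i.v_j/tau) )
    S = U.T @ V / tau
    return float(np.mean(np.log(np.exp(S).sum(axis=1)) - np.diag(S)))

def L_con(U, V, tau):
    return one_sided(U, V, tau) + one_sided(V, U, tau)

# Cross-polytope: the 2d columns are +e_i and -e_i
cross = np.hstack([np.eye(d), -np.eye(d)])
loss_cross = L_con(cross, cross, tau)

for _ in range(20):
    W = rng.standard_normal((d, d))
    W /= np.linalg.norm(W, axis=0)
    antipodal = np.hstack([W, -W])          # symmetric, antipodal competitor
    assert loss_cross <= L_con(antipodal, antipodal, tau) + 1e-9
```

Every random symmetric antipodal configuration incurs at least the cross-polytope loss, consistent with the theorem.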
Let $j(i)$ denote the corresponding index for $i$ such that $u_{j(i)} = -u_i$, so that $\|u_i - u_{j(i)}\|^2 = 4$. Under these conditions, we derive the following:
$$\sum_{i=1}^N \sum_{j \ne i} g(\|u_i - u_j\|^2) = N g(4) + \sum_{i=1}^N \sum_{j \ne i, j(i)} g(\|u_i - u_j\|^2)$$
$$\overset{(a)}{\ge} N g(4) + N(N-2)\, g\Big( \frac{1}{N(N-2)} \sum_{i=1}^N \sum_{j \ne i, j(i)} \|u_i - u_j\|^2 \Big) = N g(4) + N(N-2)\, g\Big( \frac{1}{N(N-2)} \Big( \sum_{i=1}^N \sum_{j=1}^N \|u_i - u_j\|^2 - 4N \Big) \Big)$$
$$= N g(4) + N(N-2)\, g\Big( \frac{1}{N(N-2)} \Big( \sum_{i=1}^N \sum_{j=1}^N (2 - 2 u_i^\top u_j) - 4N \Big) \Big) = N g(4) + N(N-2)\, g\Big( \frac{1}{N(N-2)} \Big( 2N^2 - 4N - 2 \Big\| \sum_{i=1}^N u_i \Big\|^2 \Big) \Big)$$
$$\overset{(b)}{\ge} N g(4) + N(N-2)\, g\Big( \frac{2N^2 - 4N}{N(N-2)} \Big) = N g(4) + N(N-2)\, g(2),$$
where (a) follows by Jensen's inequality, and (b) follows from the fact that $\|\sum_{i=1}^N u_i\|^2 \ge 0$ and the function $g(t)$ is monotonically decreasing. The equality conditions for (a) and (b) hold only when the columns of $U$ form a cross-polytope. We can conclude that the columns of $U^\star$ form a cross-polytope.

Proposition 1. The mini-batch loss and full-batch loss are not identical, nor is one a simple scaling of the other by a constant factor. In other words, when $S_B = \left[\binom{N}{B}\right]$, for all $B \ge 2$, there exists no constant $c$ such that $L^{\mathrm{con}}_{\mathrm{mini}}(U, V; S_B) = c \cdot L^{\mathrm{con}}(U, V)$ for all $U, V$.

Proof. Consider $\widetilde U, \widetilde V$ defined such that $u_i = v_i = e_i$ for all $i \in [N]$, where $e_i$ is the $i$-th unit vector in $\mathbb{R}^N$. First note that $u_i^\top v_i = 1$ for all $i \in [N]$ and $u_i^\top v_j = 0$ for $i \ne j$. Then,
$$L(\widetilde U, \widetilde V) = \log(e^{1/\tau} + N - 1) - 1/\tau, \quad (16)$$
$$\frac{1}{\binom{N}{B}} \sum_{i=1}^{\binom{N}{B}} L(\widetilde U_{B_i}, \widetilde V_{B_i}) = \log(e^{1/\tau} + B - 1) - 1/\tau. \quad (17)$$
We now consider the second part of the statement. For contradiction, assume that there exists some $c \in \mathbb{R}$ such that $L^{\mathrm{con}}_{\mathrm{mini}}(U, V; S_B) = c \cdot L^{\mathrm{con}}(U, V)$ for all $U, V$. Let $\widehat U, \widehat V$ be defined such that $\hat u_i = \hat v_i = e_1$ for all $i \in [N]$, where $e_1 = (1, 0, \ldots, 0)$. Note that $\hat u_i^\top \hat v_j = 1$ for all $i, j \in [N]$. Then,
$$L(\widehat U, \widehat V) = \log(N), \quad (18)$$
$$\frac{1}{\binom{N}{B}} \sum_{i=1}^{\binom{N}{B}} L(\widehat U_{B_i}, \widehat V_{B_i}) = \log(B). \quad (19)$$
From Equations (16) and (17), we have that $c = \frac{\log(e^{1/\tau} + B - 1) - 1/\tau}{\log(e^{1/\tau} + N - 1) - 1/\tau}$, whereas from Equations (18) and (19), we have that $c = \frac{\log(B)}{\log(N)}$, which is a contradiction. Therefore, there exists no $c \in \mathbb{R}$ satisfying the given condition.

Theorem 4 (Optimization with all possible $\binom{N}{B}$ mini-batches). Suppose $B \ge 2$.
The set of minimizers of the $\binom{N}{B}$ mini-batch problem in Equation (4) is the same as that of the full-batch problem in Equation (1) in two cases: (i) $N \le d + 1$, and (ii) $N = 2d$ with the pairs $(U, V)$ restricted to those satisfying the conditions stated in Def. 2. In such cases, the solutions $(U, V)$ of the $\binom{N}{B}$ mini-batch optimization problem satisfy the following. Case (i): $\{u_i\}_{i=1}^N$ forms a simplex ETF and $u_i = v_i$ for all $i \in [N]$. Case (ii): $\{u_i\}_{i=1}^N$ forms a simplex cross-polytope.

Proof. Case (i): Suppose $N \le d + 1$. For simplicity, first consider just one of the two terms in the two-sided loss. The optimization problem then becomes
$$\min_{U, V} \frac{1}{\binom{N}{B}} \sum_{i=1}^{\binom{N}{B}} L(U_{B_i}, V_{B_i}) \quad \text{s.t. } \|u_i\| = 1, \; \|v_i\| = 1 \;\; \forall i \in [N].$$
Similar to the proof of Lemma 1, we have that
$$\sum_{i=1}^{\binom{N}{B}} L(U_{B_i}, V_{B_i}) = \frac{1}{B} \sum_{i=1}^{\binom{N}{B}} \sum_{j \in B_i} \log \Big( 1 + \sum_{k \in B_i, k \ne j} e^{u_j^\top (v_k - v_j)/\tau} \Big)$$
$$\overset{(a)}{\ge} \frac{1}{B} \sum_{i=1}^{\binom{N}{B}} \sum_{j \in B_i} \log \Big( 1 + (B-1) \exp\Big( \frac{\sum_{k \in B_i, k \ne j} u_j^\top (v_k - v_j)}{\tau(B-1)} \Big) \Big) = \frac{1}{B} \sum_{i=1}^{\binom{N}{B}} \sum_{j \in B_i} \log \Big( 1 + (B-1) \exp\Big( \frac{\sum_{k \in B_i} u_j^\top v_k - B\, u_j^\top v_j}{\tau(B-1)} \Big) \Big)$$
$$\overset{(b)}{\ge} \binom{N}{B} \log \Big( 1 + (B-1) \exp\Big( \frac{\sum_{i=1}^{\binom{N}{B}} \sum_{j \in B_i} \sum_{k \in B_i} u_j^\top v_k - \sum_{i=1}^{\binom{N}{B}} \sum_{j \in B_i} B\, u_j^\top v_j}{\tau \binom{N}{B} B (B-1)} \Big) \Big),$$
where (a) and (b) follow by applying Jensen's inequality to $e^t$ and to $\log(1 + a e^{bt})$ for $a, b > 0$, respectively. Note that for the equalities to hold in Jensen's inequalities, we need constants $c_j, c$ such that
$$u_j^\top v_k = c_j \quad \forall k \ne j, \quad (20)$$
$$\frac{\bar u^\top v_i}{N-1} - \frac{N (u_i^\top v_i)}{N-1} = c \quad \forall i \in [N]. \quad (21)$$
Now, we carefully consider the two terms in the numerator:
$$A_1 := \sum_{i=1}^{\binom{N}{B}} \sum_{j \in B_i} \sum_{k \in B_i} u_j^\top v_k, \qquad A_2 := \sum_{i=1}^{\binom{N}{B}} \sum_{j \in B_i} B\, u_j^\top v_j.$$
To simplify $A_1$, first note that for any fixed $l, m \in [N]$ such that $l \ne m$, there are $\binom{N-2}{B-2}$ batches that contain both $l$ and $m$, and for $l = m$, there are $\binom{N-1}{B-1}$ batches that contain that index. Since these terms all occur in $A_1$, we have that
$$A_1 = \binom{N-2}{B-2} \sum_{l \ne m} u_l^\top v_m + \binom{N-1}{B-1} \sum_{l=1}^N u_l^\top v_l.$$
Similarly, we have that
$$A_2 = B \binom{N-1}{B-1} \sum_{l=1}^N u_l^\top v_l.$$
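The batch-counting step for $A_1$ can be checked by brute force for small $N, B$. The sketch below (our own illustration, not the paper's code) enumerates all $\binom{N}{B}$ mini-batches and confirms that the triple sum matches the closed form built from $\binom{N-2}{B-2}$ and $\binom{N-1}{B-1}$.

```python
import numpy as np
from itertools import combinations
from math import comb

rng = np.random.default_rng(1)
N, B, d = 7, 3, 4
U = rng.standard_normal((d, N))
V = rng.standard_normal((d, N))
S = U.T @ V                                  # S[j, k] = u_j . v_k

# A1 = sum over batches of the full similarity block within the batch
A1 = sum(S[np.ix_(b, b)].sum() for b in map(list, combinations(range(N), B)))
off_diag = S.sum() - np.trace(S)
# each unordered/ordered pair (l, m), l != m, lies in C(N-2, B-2) batches;
# each diagonal pair (l, l) lies in C(N-1, B-1) batches
closed_form = comb(N - 2, B - 2) * off_diag + comb(N - 1, B - 1) * np.trace(S)
assert np.isclose(A1, closed_form)
```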
Plugging these back into the above inequality, we have that
$$\sum_{i=1}^{\binom{N}{B}} L(U_{B_i}, V_{B_i}) \ge \binom{N}{B} \log \Big( 1 + (B-1) \exp\Big( \frac{\sum_{l=1}^N \sum_{m=1}^N u_l^\top v_m - N \sum_{l=1}^N u_l^\top v_l}{\tau N(N-1)} \Big) \Big) = \binom{N}{B} \log \Big( 1 + (B-1) \exp\Big( \frac{\bar u^\top \bar v - N \sum_{i=1}^N u_i^\top v_i}{\tau N(N-1)} \Big) \Big).$$
Observe that the term inside the exponential is identical to that of Equation (9), and therefore we can reuse the same spectral-analysis argument to show that the simplex ETF also minimizes $\sum_{i=1}^{\binom{N}{B}} L(U_{B_i}, V_{B_i})$. Once again, since the proof is symmetric, the simplex ETF also minimizes $\sum_{i=1}^{\binom{N}{B}} L(V_{B_i}, U_{B_i})$.

Case (ii): Suppose $N = 2d$, and $U, V$ are symmetric and antipodal. We consider the following optimization problem:
$$\min_{(U,V) \in \mathcal{A}} \frac{1}{\binom{N}{B}} \sum_{i=1}^{\binom{N}{B}} L^{\mathrm{con}}(U_{B_i}, V_{B_i}) \quad \text{s.t. } \|u_i\| = 1, \; \|v_i\| = 1 \;\; \forall i \in [N], \quad (22)$$
where $\mathcal{A} := \{(U, V) : U, V \text{ are symmetric and antipodal}\}$. Since $U = V$ (symmetric property), the contrastive loss satisfies
$$L^{\mathrm{con}}(U_{B_i}, V_{B_i}) = 2 L(U_{B_i}, U_{B_i}) = -\frac{2}{B} \sum_{j \in B_i} \Big[ u_j^\top u_j/\tau - \log \sum_{k \in B_i} e^{u_j^\top u_k/\tau} \Big] = -\frac{2}{\tau} + \frac{2}{B} \sum_{j \in B_i} \log \sum_{k \in B_i} e^{u_j^\top u_k/\tau}. \quad (23)$$
Therefore, the solution of the optimization problem in Equation (22) is identical to the minimizer of the following optimization problem:
$$U^\star := \arg\min_U \sum_{i=1}^{\binom{N}{B}} \sum_{j \in B_i} \log \sum_{k \in B_i} e^{u_j^\top u_k/\tau}.$$
The objective of the optimization problem can be rewritten by reorganizing the summations as
$$\sum_{j=1}^N \sum_{i \in I_j} \log \sum_{k \in B_i} e^{u_j^\top u_k/\tau}, \quad (24)$$
where $I_j := \{i : j \in B_i\}$ represents the set of batch indices containing $j$. We then divide the summation term in Equation (24) into two terms:
$$\sum_{i \in I_j} \log \sum_{k \in B_i} e^{u_j^\top u_k/\tau} = \sum_{i \in A_j} \log \sum_{k \in B_i} e^{u_j^\top u_k/\tau} + \sum_{i \in A_j^c} \log \sum_{k \in B_i} e^{u_j^\top u_k/\tau}, \quad (25)$$
by partitioning the set $I_j$ for each $j \in [N]$ as follows, with $k(j)$ being the index for which $u_{k(j)} = -u_j$:
$$A_j := \{i : j \in B_i \text{ and } k(j) \in B_i\}; \qquad A_j^c := \{i : j \in B_i \text{ and } k(j) \notin B_i\}.$$
We will prove that the columns of $U^\star$ form a cross-polytope by showing that the minimizer of each term of the RHS in Equation (25) also forms a cross-polytope. Let us start with the first term of the RHS in Equation (25).
Starting by applying Jensen's inequality to the concave function $f(x) := \log(e^{1/\tau} + e^{-1/\tau} + x)$, we get:
$$\sum_{i \in A_j} \log \sum_{k \in B_i} e^{u_j^\top u_k/\tau} = \sum_{i \in A_j} \log \Big( e^{1/\tau} + e^{-1/\tau} + \sum_{k \in B_i \setminus \{j, k(j)\}} e^{u_j^\top u_k/\tau} \Big)$$
$$\ge \frac{1}{B-2} \sum_{i \in A_j} \sum_{k \in B_i \setminus \{j, k(j)\}} \log \Big( e^{1/\tau} + e^{-1/\tau} + (B-2)\, e^{u_j^\top u_k/\tau} \Big) = \frac{\binom{N-3}{B-3}}{B-2} \sum_{k \notin \{j, k(j)\}} \log \Big( e^{1/\tau} + e^{-1/\tau} + (B-2)\, e^{u_j^\top u_k/\tau} \Big)$$
$$= \frac{\binom{N-3}{B-3}}{B-2} \Big[ \sum_{k \ne j} \log \Big( e^{1/\tau} + e^{-1/\tau} + (B-2)\, e^{u_j^\top u_k/\tau} \Big) - \log\big( e^{1/\tau} + (B-1) e^{-1/\tau} \big) \Big],$$
since the $k = k(j)$ term equals $\log(e^{1/\tau} + e^{-1/\tau} + (B-2)e^{-1/\tau}) = \log(e^{1/\tau} + (B-1)e^{-1/\tau})$. Summing over $j$ and using $u_j^\top u_k = 1 - \|u_j - u_k\|^2/2$,
$$\sum_{j=1}^N \sum_{i \in A_j} \log \sum_{k \in B_i} e^{u_j^\top u_k/\tau} \ge \frac{\binom{N-3}{B-3}}{B-2} \Big[ \sum_{j=1}^N \sum_{k \ne j} \log \Big( e^{1/\tau} + e^{-1/\tau} + (B-2)\, e^{1/\tau} e^{-\|u_j - u_k\|^2/2\tau} \Big) - N \log\big( e^{1/\tau} + (B-1) e^{-1/\tau} \big) \Big]$$
$$\overset{(a)}{\ge} \frac{\binom{N-3}{B-3}}{B-2} \Big[ \sum_{j=1}^N \sum_{k \ne j} \log \Big( e^{1/\tau} + e^{-1/\tau} + (B-2)\, e^{1/\tau} e^{-\|u_j^\star - u_k^\star\|^2/2\tau} \Big) - N \log\big( e^{1/\tau} + (B-1) e^{-1/\tau} \big) \Big],$$
where (a) follows by Lemma 3 and the fact that $g(t) = \log(a + b e^{-t/2\tau})$ for $a, b > 0$ is convex and monotonically decreasing; $\{u_1^\star, \ldots, u_N^\star\}$ denotes a set of vectors which forms a cross-polytope. All equalities hold only when the columns of $U$ form a cross-polytope.

Next, consider the second term of the RHS in Equation (25). By following a similar procedure, we get:
$$\sum_{j=1}^N \sum_{i \in A_j^c} \log \sum_{k \in B_i} e^{u_j^\top u_k/\tau} \ge \frac{1}{B-1} \sum_{j=1}^N \sum_{i \in A_j^c} \sum_{k \in B_i \setminus \{j\}} \log \Big( e^{1/\tau} + (B-1)\, e^{u_j^\top u_k/\tau} \Big)$$
$$= \frac{\binom{N-3}{B-2}}{B-1} \Big[ \sum_{j=1}^N \sum_{k \ne j} \log \Big( e^{1/\tau} + (B-1)\, e^{u_j^\top u_k/\tau} \Big) - N \log\big( e^{1/\tau} + (B-1) e^{-1/\tau} \big) \Big]$$
$$\ge \frac{\binom{N-3}{B-2}}{B-1} \Big[ \sum_{j=1}^N \sum_{k \ne j} \log \Big( e^{1/\tau} + (B-1)\, e^{1/\tau} e^{-\|u_j^\star - u_k^\star\|^2/2\tau} \Big) - N \log\big( e^{1/\tau} + (B-1) e^{-1/\tau} \big) \Big],$$
where $\{u_1^\star, \ldots, u_N^\star\}$ again denotes a set of vectors which forms a cross-polytope. Both terms of the RHS in Equation (25) attain their minimum value when $U$ forms a cross-polytope. Therefore, we can conclude that the columns of $U^\star$ form a cross-polytope.

Theorem 5 (Optimization with fewer than $\binom{N}{B}$ mini-batches). Suppose $B = 2$ and $N \le d + 1$. Then, the minimizer of Equation (4) for $S_B \subsetneq \left[\binom{N}{B}\right]$ is not the minimizer of the full-batch optimization in Equation (1).

Proof. Consider a set of batches $S_B \subsetneq \left[\binom{N}{2}\right]$ with batch size $B = 2$. Without loss of generality, assume that $(1, 2) \notin \bigcup_{i \in S_B} \{B_i\}$. For contradiction, assume that the simplex ETF $(U^\star, V^\star)$ is indeed the optimal solution of the loss over these $S_B$ batches.
Then, by definition, we have that for any $(U, V) \ne (U^\star, V^\star)$,
$$\frac{1}{|S_B|} \sum_{i \in S_B} L(U^\star_{B_i}, V^\star_{B_i}) \le \frac{1}{|S_B|} \sum_{i \in S_B} L(U_{B_i}, V_{B_i}) \iff \sum_{i \in S_B} L(U^\star_{B_i}, V^\star_{B_i}) \le \sum_{i \in S_B} L(U_{B_i}, V_{B_i}), \quad (26)$$
where $(U^\star, V^\star)$ is defined such that $u_i^\star = v_i^\star$ for all $i \in [N]$ and $u_i^{\star\top} v_j^\star = -1/(N-1)$ for all $i \ne j$. Also recall that $\|u_i\| = \|v_i\| = 1$ for all $i \in [N]$. Therefore, we also have
$$\sum_{i \in S_B} L(U^\star_{B_i}, V^\star_{B_i}) = \frac{1}{2} \sum_{i \in S_B} \sum_{j \in B_i} \log \Big( 1 + \sum_{k \in B_i, k \ne j} \exp\big( u_j^{\star\top} (v_k^\star - v_j^\star)/\tau \big) \Big) = \sum_{i \in S_B} \log \Big( 1 + \exp\Big( -\frac{1}{\tau} \Big( \frac{1}{N-1} + 1 \Big) \Big) \Big), \quad (27)$$
where the last equality is due to the fact that $|B_i| = 2$. Now, let us consider $(\widetilde U, \widetilde V)$ defined such that $\tilde u_i = \tilde v_i$ for all $i \in [N]$, and $\tilde u_i^\top \tilde v_j = -1/(N-2)$ for all $i \ne j$ with $(i, j) \notin \{(1, 2), (2, 1)\}$. Intuitively, this is equivalent to placing $\tilde u_2, \ldots, \tilde u_N$ on a simplex ETF of $N-1$ points and setting $\tilde u_1 = \tilde u_2$. This is clearly possible because $d \ge N - 1 \implies d \ge N - 2$, which is the condition required to place $N - 1$ points on a simplex ETF in $\mathbb{R}^d$. Therefore,
$$\sum_{i \in S_B} L(\widetilde U_{B_i}, \widetilde V_{B_i}) = \frac{1}{2} \sum_{i \in S_B} \sum_{j \in B_i} \log \Big( 1 + \sum_{k \in B_i, k \ne j} \exp\big( \tilde u_j^\top (\tilde v_k - \tilde v_j)/\tau \big) \Big) = \sum_{i \in S_B} \log \Big( 1 + \exp\Big( -\frac{1}{\tau} \Big( \frac{1}{N-2} + 1 \Big) \Big) \Big), \quad (28)$$
where the last equality follows since $(1, 2) \notin \bigcup_{i \in S_B} \{B_i\}$. It is easy to see from Equations (27) and (28) that $\sum_{i \in S_B} L(\widetilde U_{B_i}, \widetilde V_{B_i}) < \sum_{i \in S_B} L(U^\star_{B_i}, V^\star_{B_i})$, which contradicts Equation (26). Therefore, the optimal solution of minimizing the contrastive loss over any $S_B \subsetneq \left[\binom{N}{2}\right]$ batches is not the simplex ETF, completing the proof.

Proposition 2. Suppose $B \ge 2$, and let $S_B \subseteq \left[\binom{N}{B}\right]$ be a set of mini-batch indices. If there exist two data points that never belong together in any mini-batch, i.e., $\exists\, i, j \in [N]$ s.t. $\{i, j\} \not\subseteq B_k$ for all $k \in S_B$, then the optimal solution of Equation (4) is not the minimizer of the full-batch problem in Equation (1).

Proof. The proof follows in a fairly similar manner to that of Theorem 5. Consider a set of batches $S_B \subseteq \left[\binom{N}{B}\right]$ of size $B \ge 2$. Without loss of generality, assume that $\{1, 2\} \not\subseteq B_k$ for any $k \in S_B$. For contradiction, assume that the simplex ETF $(U^\star, V^\star)$ is the optimal solution of the loss over these $S_B$ batches.
Then, by definition, we have that for any $(U, V) \ne (U^\star, V^\star)$,
$$\frac{1}{|S_B|} \sum_{i \in S_B} L(U^\star_{B_i}, V^\star_{B_i}) \le \frac{1}{|S_B|} \sum_{i \in S_B} L(U_{B_i}, V_{B_i}) \iff \sum_{i \in S_B} L(U^\star_{B_i}, V^\star_{B_i}) \le \sum_{i \in S_B} L(U_{B_i}, V_{B_i}), \quad (29)$$
where $(U^\star, V^\star)$ is defined such that $u_i^\star = v_i^\star$ for all $i \in [N]$ and $u_i^{\star\top} v_j^\star = -1/(N-1)$ for all $i \ne j$. Also recall that $\|u_i\| = \|v_i\| = 1$ for all $i \in [N]$. Therefore, we also have
$$\sum_{i \in S_B} L(U^\star_{B_i}, V^\star_{B_i}) = \frac{1}{B} \sum_{i \in S_B} \sum_{j \in B_i} \log \Big( 1 + \sum_{k \in B_i, k \ne j} \exp\big( u_j^{\star\top} (v_k^\star - v_j^\star)/\tau \big) \Big) = \sum_{i \in S_B} \log \Big( 1 + (B-1) \exp\Big( -\frac{1}{\tau} \Big( \frac{1}{N-1} + 1 \Big) \Big) \Big). \quad (30)$$
Now, let us consider $(\widetilde U, \widetilde V)$ defined such that $\tilde u_i = \tilde v_i$ for all $i \in [N]$, $\tilde u_1 = \tilde u_2$, and $\tilde u_i^\top \tilde v_j = -1/(N-2)$ for all $i \ne j$ with $(i, j) \notin \{(1, 2), (2, 1)\}$. Once again, note that this is equivalent to placing $\tilde u_2, \ldots, \tilde u_N$ on a simplex ETF of $N-1$ points and setting $\tilde u_1 = \tilde u_2$. Hence,
$$\sum_{i \in S_B} L(\widetilde U_{B_i}, \widetilde V_{B_i}) = \frac{1}{B} \sum_{i \in S_B} \sum_{j \in B_i} \log \Big( 1 + \sum_{k \in B_i, k \ne j} \exp\big( \tilde u_j^\top (\tilde v_k - \tilde v_j)/\tau \big) \Big) = \sum_{i \in S_B} \log \Big( 1 + (B-1) \exp\Big( -\frac{1}{\tau} \Big( \frac{1}{N-2} + 1 \Big) \Big) \Big), \quad (31)$$
where for the final equality, note the following: the only pair for which $\tilde u_j^\top \tilde v_k \ne -1/(N-2)$ is $(j, k) \in \{(1, 2), (2, 1)\}$, and since there is no $i \in S_B$ such that $\{1, 2\} \subseteq B_i$, this term never appears in our loss. From Equations (30) and (31), we have that $\sum_{i \in S_B} L(\widetilde U_{B_i}, \widetilde V_{B_i}) < \sum_{i \in S_B} L(U^\star_{B_i}, V^\star_{B_i})$, which contradicts Equation (29). Therefore, we conclude that the optimal solution of the contrastive loss over such a set $S_B$ of batches is not the simplex ETF.

Proposition 3. Suppose $B \ge 2$, and let $S_B \subseteq \left[\binom{N}{B}\right]$ be a set of mini-batch indices satisfying $B_i \cap B_j = \emptyset$ for all $i \ne j \in S_B$ and $\bigcup_{i \in S_B} B_i = [N]$, i.e., $\{B_i\}_{i \in S_B}$ forms non-overlapping mini-batches that cover all data samples. Then, the minimizer of the mini-batch loss optimization problem in Equation (4) is different from the minimizer of the full-batch loss optimization problem in Equation (1).

Proof.
Applying Lemma 1 to a single batch $B_i$ gives us that the optimal solution for just the loss over this batch is the simplex ETF over $B$ points. In the case of non-overlapping batches, the objective function can be separated across batches, and therefore the optimal solution for the sum of the losses is equal to the solution of minimizing each term independently. More precisely, we have
$$\min_{U, V} \sum_{i \in S_B} L^{\mathrm{con}}(U_{B_i}, V_{B_i}) = \sum_{i \in S_B} \min_{U_{B_i}, V_{B_i}} L^{\mathrm{con}}(U_{B_i}, V_{B_i}),$$
where $U_{B_i} = \{u_j : j \in B_i\}$ and $V_{B_i} = \{v_j : j \in B_i\}$, and the equality follows from the fact that the $B_i$'s are disjoint.

C.2 Proofs of Results From Section 5

Lemma 2. The contrastive loss $L^{\mathrm{con}}(U, V)$ is a geodesically non-quasi-convex function of $U, V$ on $T = \{(U, V) : \|u_i\| = \|v_i\| = 1, \; i \in [N]\}$.

Proof. Without loss of generality, we focus on the case where $\tau = 1$. The contrastive loss function $L^{\mathrm{con}}$ is geodesically quasi-convex if for any two points $(U, V)$ and $(U', V')$ in the domain and for all $t \in [0, 1]$:
$$L^{\mathrm{con}}\big( t(U, V) + (1-t)(U', V') \big) \le \max\{ L^{\mathrm{con}}(U, V), L^{\mathrm{con}}(U', V') \}.$$
We provide a counter-example for geodesic quasi-convexity, namely a triplet of points $(U^1, V^1), (U^2, V^2), (U^3, V^3)$, where $(U^3, V^3)$ is on the geodesic between the other two points and satisfies $L^{\mathrm{con}}(U^3, V^3) > \max\{L^{\mathrm{con}}(U^1, V^1), L^{\mathrm{con}}(U^2, V^2)\}$. Let $N = 2$. Now, define $U^3 = \mathrm{normalize}((U^1 + U^2)/2)$ and $V^3 = \mathrm{normalize}((V^1 + V^2)/2)$, which is the midpoint of the geodesic between $(U^1, V^1)$ and $(U^2, V^2)$. By direct calculation, we obtain
$$L^{\mathrm{con}}(U^3, V^3) \approx 2.798 > 2.773 \approx \max\big( L^{\mathrm{con}}(U^1, V^1), L^{\mathrm{con}}(U^2, V^2) \big),$$
which indicates that $L^{\mathrm{con}}$ is geodesically non-quasi-convex.

Theorem 8 (Theorem 6 restated). Consider $N = 4$ samples and their embedding vectors $\{u_i\}_{i=1}^N, \{v_i\}_{i=1}^N$ with dimension $d = 2$. Suppose the $u_i$'s are parametrized by $\theta^{(t)} = [\theta_1^{(t)}, \theta_2^{(t)}, \theta_3^{(t)}, \theta_4^{(t)}]$ as in the setting described in Section 5.1 (see Figure 2(a)).
Consider initializing $u_i^{(0)} = v_i^{(0)}$ and $\theta_i^{(0)} = \epsilon > 0$ for all $i$, then updating $\theta^{(t)}$ via OSGD and SGD with batch size $B = 2$ as described in Section 5.1. Let $T_{\mathrm{OSGD}}, T_{\mathrm{SGD}}$ be the minimal time required for the OSGD and SGD algorithms, respectively, to have $\mathbb{E}[\theta^{(T)}] \in (\pi/4 - \rho, \pi/4)^N$. Suppose there exist $\bar\epsilon, \bar T$ such that for all $t$ satisfying $B^{(t)} = \{1, 3\}$ or $\{2, 4\}$, $\|\nabla_{\theta^{(t)}} L^{\mathrm{con}}(U_{B^{(t)}}, V_{B^{(t)}})\| \le \bar\epsilon$, and $T_{\mathrm{OSGD}}, T_{\mathrm{SGD}} < \bar T$. Then,
$$T_{\mathrm{OSGD}} \ge \frac{\tau(\pi/4 - \rho - \epsilon) + O(\eta^2 \epsilon + \eta \epsilon^3)}{\eta \epsilon}, \qquad T_{\mathrm{SGD}} \ge \frac{3(e^{2/\tau} + 1)}{e^{2/\tau} - 1} \cdot \frac{\tau(\pi/4 - \rho - \epsilon) + O(\eta^2(\epsilon + \bar\epsilon))}{\eta \epsilon + O(\eta(\epsilon^3 + \bar\epsilon))}.$$

Proof. We begin with the proof of the lower bound on $T_{\mathrm{OSGD}}$. Assume that the parameters are initialized at $(\theta_1^{(0)}, \theta_2^{(0)}, \theta_3^{(0)}, \theta_4^{(0)}) = (\epsilon, \epsilon, \epsilon, \epsilon)$. Then, there are six batches with batch size $B = 2$, and we can categorize the batches according to the mini-batch contrastive loss:
1. $B = \{1, 2\}$ or $\{3, 4\}$: $L^{\mathrm{con}}(U_B, V_B) = -2/\tau + 2\log(e^{1/\tau} + e^{\cos 2\epsilon/\tau})$;
2. $B = \{1, 3\}$ or $\{2, 4\}$: $L^{\mathrm{con}}(U_B, V_B) = -2/\tau + 2\log(e^{1/\tau} + e^{-1/\tau})$;
3. $B = \{1, 4\}$ or $\{2, 3\}$: $L^{\mathrm{con}}(U_B, V_B) = -2/\tau + 2\log(e^{1/\tau} + e^{-\cos 2\epsilon/\tau})$.
Without loss of generality, we assume that the OSGD algorithm described in Algorithm 5 chooses the mini-batch $B = \{1, 2\}$ corresponding to the highest loss at time $t = 0$, and updates the parameters as $\theta_1^{(1)} = \epsilon - \eta \nabla_{\theta_1} L^{\mathrm{con}}(U_B, V_B)$ and $\theta_2^{(1)} = \epsilon - \eta \nabla_{\theta_2} L^{\mathrm{con}}(U_B, V_B)$. For the next update, OSGD chooses $u_3, u_4$, which are now closer to each other than the updated $u_1, u_2$, and $u_3, u_4$ are updated exactly as $u_1, u_2$ were before. Thus, $\theta_1$ updates only at even times and stays fixed at odd times, i.e.,
$$\theta_1^{(t+1)} = \begin{cases} \theta_1^{(t)} - \eta \nabla_{\theta_1} L^{\mathrm{con}}(U_B, V_B) & \text{if } t \text{ is even}, \\ \theta_1^{(t)} & \text{if } t \text{ is odd}. \end{cases}$$
Iterating this procedure, we can view the OSGD algorithm as a one-parameter algorithm in $\phi^{(t)} = \theta_1^{(2t)}$:
$$\phi^{(0)} = \epsilon, \qquad \phi^{(t)} = \phi^{(t-1)} + \eta\, g\big( \phi^{(t-1)} \big), \qquad \phi^{(T_{\mathrm{half}})} \ge \frac{\pi}{4} - \rho,$$
where $g(\phi) = 2\sin(2\phi) \big/ \big( \tau \big( 1 + e^{(1 - \cos(2\phi))/\tau} \big) \big)$ and $T_{\mathrm{half}} := T_{\mathrm{OSGD}}/2$. In the course of the updates, we may assume that $\phi^{(t)} \in (0, \pi/4)$ for all $t$.
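The one-parameter OSGD recursion above is easy to simulate. The toy sketch below (our own illustration; the constants $\tau, \eta, \epsilon, \rho$ are arbitrary illustrative choices) iterates $\phi \leftarrow \phi + \eta\, g(\phi)$ with the drift $g$ defined above and counts the iterations needed to reach the target $\pi/4 - \rho$; doubling the count gives the corresponding OSGD time, since each OSGD step moves only one of the two angle pairs.

```python
import numpy as np

tau, eta, eps, rho = 1.0, 0.05, 0.01, 0.05

def g(phi):
    # drift of the effective one-parameter OSGD dynamics
    return 2 * np.sin(2 * phi) / (tau * (1 + np.exp((1 - np.cos(2 * phi)) / tau)))

phi, t = eps, 0
while phi < np.pi / 4 - rho:
    phi += eta * g(phi)       # phi^(t) = phi^(t-1) + eta * g(phi^(t-1))
    t += 1
T_osgd = 2 * t                # each OSGD step updates only one angle pair
assert phi >= np.pi / 4 - rho and T_osgd > 0
```

Since $g > 0$ on $(0, \pi/4)$, the iterate increases monotonically toward the target, matching the qualitative picture used in the proof.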
To analyze the drift of $\phi^{(t)}$, we first study the smoothness of $g$:
$$g'(\phi) = \frac{4}{\tau} \cdot \frac{\cos(2\phi)\big( 1 + e^{(1 - \cos(2\phi))/\tau} \big) - \frac{1}{\tau}\sin^2(2\phi)\, e^{(1 - \cos(2\phi))/\tau}}{\big( 1 + e^{(1 - \cos(2\phi))/\tau} \big)^2}.$$
We can observe that $\max_{\phi \in [0, \pi/4]} |g'(\phi)| = 2/\tau$; hence $g(\phi)$ has Lipschitz constant $2/\tau$, i.e.,
$$\big| g(\phi^{(t-1)}) - g(\phi^{(0)}) \big| \le \frac{2}{\tau} \big| \phi^{(t-1)} - \phi^{(0)} \big|.$$
Therefore,
$$\phi^{(t)} - \phi^{(t-1)} = \eta\, g(\phi^{(t-1)}) \le \eta |g(\epsilon)| + \frac{2\eta}{\tau} \big( \phi^{(t-1)} - \epsilon \big) \le \frac{2\eta}{\tau} \phi^{(t-1)} + O(\eta\epsilon^3),$$
where the first inequality is from the Lipschitz continuity of $g(\phi)$, and the second follows from the Taylor expansion of $g$ at $\epsilon = 0$:
$$g(\epsilon) = \frac{2}{\tau}\epsilon - \frac{4\tau + 6}{3\tau^2}\epsilon^3 + \cdots.$$
Hence, $\phi^{(t)} \le (1 + 2\eta/\tau)\,\phi^{(t-1)} + O(\eta\epsilon^3)$ indicates that
$$\phi^{(T_{\mathrm{half}})} \le (1 + 2\eta/\tau)^{T_{\mathrm{half}}} \epsilon + \bar T \cdot O(\eta\epsilon^3) \le \big( 1 + (2\eta/\tau) T_{\mathrm{half}} \big)\epsilon + O(\eta^2\epsilon + \eta\epsilon^3),$$
for some constant $\bar T > T_{\mathrm{OSGD}}$. Moreover, $\pi/4 - \rho < \phi^{(T_{\mathrm{half}})}$ implies that
$$T_{\mathrm{half}} \ge \frac{\tau}{2\eta\epsilon} \Big( \frac{\pi}{4} - \rho - \epsilon + O(\eta\epsilon^3 + \eta^2\epsilon) \Big).$$
We obtain the lower bound on $T_{\mathrm{OSGD}}$ by doubling $T_{\mathrm{half}}$. This completes the estimate of $T_{\mathrm{OSGD}}$.

Now, we study the convergence rate of the SGD algorithm. We claim that
$$T_{\mathrm{SGD}} \ge \frac{3(e^{2/\tau} + 1)}{e^{2/\tau} - 1} \cdot \frac{\tau(\pi/4 - \rho - \epsilon) + O(\eta^2(\epsilon + \bar\epsilon))}{\eta\epsilon + O(\eta(\epsilon^3 + \bar\epsilon))}.$$
Without loss of generality, we first focus on the drift of $\theta_1$. Since the batch selection is random, given $\theta^{(t)} = (\theta_1^{(t)}, \theta_2^{(t)}, \theta_3^{(t)}, \theta_4^{(t)})$:
1. $B = \{1, 2\}$ with probability $1/6$. Then, $L^{\mathrm{con}}(U_B, V_B) = -2/\tau + 2\log(e^{1/\tau} + e^{\cos(\theta_1^{(t)} + \theta_2^{(t)})/\tau})$ implies
$$\theta_1^{(t+1)} = \theta_1^{(t)} + \eta \cdot \frac{2\sin(\theta_1^{(t)} + \theta_2^{(t)})/\tau}{1 + e^{(1 - \cos(\theta_1^{(t)} + \theta_2^{(t)}))/\tau}}.$$
2. $B = \{1, 3\}$ with probability $1/6$. At $t = 0$, the initial batch selections can be categorized into three distinct sets: closely positioned vectors $\{u_1, u_2\}$ or $\{u_3, u_4\}$; vectors that form obtuse angles $\{u_1, u_4\}$ or $\{u_2, u_3\}$; and vectors diametrically opposed at $180°$, $\{u_1, u_3\}$ or $\{u_2, u_4\}$. Given that $\epsilon$ is substantially small, the probability of consistently selecting batches from the same category for subsequent updates is relatively low. As such, it is reasonable to infer that each batch is likely to maintain its position within the initially assigned category.
From this, one can deduce that vector sets such as $\{u_1, u_3\}$ or $\{u_2, u_4\}$ continue to sustain an angle close to $180°$. Given these conditions, it is feasible to postulate that if the selected batch $B$ is $\{1, 3\}$ or $\{2, 4\}$, the magnitude of the gradient of the loss function, $\|\nabla L^{\mathrm{con}}(U_B, V_B)\|$, is less than a particular threshold $\bar\epsilon$, i.e., $\|\nabla L^{\mathrm{con}}(U_B, V_B)\| < \bar\epsilon$. Then, $\theta_1^{(t+1)} = \theta_1^{(t)} + \eta\, O(\bar\epsilon)$.
3. $B = \{1, 4\}$ with probability $1/6$. Then, $L^{\mathrm{con}}(U_B, V_B) = -2/\tau + 2\log(e^{1/\tau} + e^{-\cos(\theta_1 + \theta_4)/\tau})$ implies
$$\theta_1^{(t+1)} = \theta_1^{(t)} - \eta \cdot \frac{2\sin(\theta_1^{(t)} + \theta_4^{(t)})/\tau}{1 + e^{(1 + \cos(\theta_1^{(t)} + \theta_4^{(t)}))/\tau}}.$$
Since there is no update on $\theta_1$ in the other cases, taking the expectation yields
$$\mathbb{E}\big[ \theta_1^{(t+1)} - \theta_1^{(t)} \,\big|\, \theta^{(t)} \big] = \frac{\eta}{6} F_1(\theta^{(t)}) + O(\eta\bar\epsilon),$$
where $F_1(\theta)$ is defined as
$$F_1(\theta) = \frac{2\sin(\theta_1 + \theta_2)/\tau}{1 + e^{(1 - \cos(\theta_1 + \theta_2))/\tau}} - \frac{2\sin(\theta_1 + \theta_4)/\tau}{1 + e^{(1 + \cos(\theta_1 + \theta_4))/\tau}}.$$
We study the smoothness of $F_1$ by writing $F_1(\theta) = f_-(\theta_1 + \theta_2) - f_+(\theta_1 + \theta_4)$, where
$$f_-(t) := \frac{2\sin t/\tau}{1 + e^{(1 - \cos t)/\tau}}, \qquad f_+(t) := \frac{2\sin t/\tau}{1 + e^{(1 + \cos t)/\tau}},$$
and
$$\max_{t \in [0, \pi/2]} |f_-'(t)| = 1/\tau, \qquad \max_{t \in [0, \pi/2]} |f_+'(t)| = C/\tau,$$
for some constant $C \in (0, 1)$. Then, for $\theta = (\theta_1, \theta_2, \theta_3, \theta_4)$ and $\theta' = (\theta_1', \theta_2', \theta_3', \theta_4')$,
$$|F_1(\theta') - F_1(\theta)| \le |f_-(\theta_1' + \theta_2') - f_-(\theta_1 + \theta_2)| + |f_+(\theta_1' + \theta_4') - f_+(\theta_1 + \theta_4)|$$
$$\le (1/\tau)\,|\theta_1' + \theta_2' - \theta_1 - \theta_2| + (C/\tau)\,|\theta_1' + \theta_4' - \theta_1 - \theta_4| \le \big( 2(1+C)/\tau \big) \|\theta' - \theta\|.$$
In the same way, we can define the functions $F_2, F_3, F_4$, all having Lipschitz constant $2(1+C)/\tau$. Defining $F(\theta) = (F_1(\theta), F_2(\theta), F_3(\theta), F_4(\theta))$, it has Lipschitz constant $4(1+C)/\tau$ and satisfies
$$\mathbb{E}[\theta' - \theta \mid \theta] = \frac{\eta}{6} F(\theta) + O(\eta\bar\epsilon),$$
where $O(\cdot)$ is applied elementwise to the vector, denoting that each element follows $O(\cdot)$ independently. From the Lipschitzness of $F$, for any $t \ge 1$,
$$\mathbb{E}\big[ \|\theta^{(t)} - \theta^{(t-1)}\| \,\big|\, \theta^{(t-1)} \big] \le \frac{\eta}{6} \|F(\theta^{(t-1)})\| + O(\eta\bar\epsilon) \le \frac{\eta}{6} \|F(\theta^{(0)})\| + \frac{\eta}{6} \|F(\theta^{(t-1)}) - F(\theta^{(0)})\| + O(\eta\bar\epsilon)$$
$$\le \frac{\eta}{6} \|F(\theta^{(0)})\| + \frac{2\eta(1+C)}{3\tau} \|\theta^{(t-1)} - \theta^{(0)}\| + O(\eta\bar\epsilon).$$
Taking expectations on both sides,
$$\mathbb{E}\big[ \|\theta^{(t)} - \theta^{(t-1)}\| \big] \le \frac{\eta}{6} \|F(\theta^{(0)})\| + \frac{2\eta(1+C)}{3\tau} \mathbb{E}\big[ \|\theta^{(t-1)} - \theta^{(0)}\| \big] + O(\eta\bar\epsilon).$$
Applying the triangle inequality, $\|\theta^{(t)} - \theta^{(0)}\| \le \|\theta^{(t)} - \theta^{(t-1)}\| + \|\theta^{(t-1)} - \theta^{(0)}\|$, we further deduce that
$$\mathbb{E}\big[ \|\theta^{(t)} - \theta^{(0)}\| \big] \le \Big( 1 + \frac{2\eta(1+C)}{3\tau} \Big) \mathbb{E}\big[ \|\theta^{(t-1)} - \theta^{(0)}\| \big] + \frac{\eta \|F(\theta^{(0)})\|}{6} + O(\eta\bar\epsilon).$$
Setting $\Gamma = \frac{3\tau}{2\eta(1+C)} \Big( \frac{\eta \|F(\theta^{(0)})\|}{6} + O(\eta\bar\epsilon) \Big)$, we can write
$$\mathbb{E}\big[ \|\theta^{(t)} - \theta^{(0)}\| + \Gamma \big] \le \Big( 1 + \frac{2\eta(1+C)}{3\tau} \Big) \mathbb{E}\big[ \|\theta^{(t-1)} - \theta^{(0)}\| + \Gamma \big].$$
Thus, with a constant $\bar T > T_{\mathrm{SGD}}$,
$$\mathbb{E}\big[ \|\theta^{(T_{\mathrm{SGD}})} - \theta^{(0)}\| + \Gamma \big] \le \Big( 1 + \frac{2\eta(1+C)}{3\tau} \Big)^{T_{\mathrm{SGD}}} \Gamma \le \Big( 1 + \frac{2\eta(1+C)}{3\tau} T_{\mathrm{SGD}} \Big) \Gamma + \bar T \cdot O(\eta^2 \Gamma).$$
By the Taylor expansion of $F_1$ near $\epsilon \approx 0$:
$$F_1(\epsilon, \epsilon, \epsilon, \epsilon) = \frac{2(e^{2/\tau} - 1)}{\tau(e^{2/\tau} + 1)} \epsilon + O(\epsilon^3), \qquad \|F(\theta^{(0)})\| = \frac{4(e^{2/\tau} - 1)}{\tau(e^{2/\tau} + 1)} \epsilon + O(\epsilon^3),$$
$$\Gamma = \frac{e^{2/\tau} - 1}{(1+C)(e^{2/\tau} + 1)} \epsilon + O(\epsilon^3 + \bar\epsilon) = O(\epsilon + \bar\epsilon).$$
Since $\mathbb{E}\big[ \|\theta^{(T_{\mathrm{SGD}})} - \theta^{(0)}\| \big] \ge 2(\pi/4 - \rho - \epsilon)$, we obtain
$$2\Big( \frac{\pi}{4} - \rho - \epsilon \Big) \le \frac{2\eta(1+C)}{3\tau} T_{\mathrm{SGD}} \cdot \Gamma + O(\eta^2(\epsilon + \bar\epsilon)),$$
and hence
$$T_{\mathrm{SGD}} \ge \frac{3(e^{2/\tau} + 1)}{e^{2/\tau} - 1} \cdot \frac{\tau(\pi/4 - \rho - \epsilon) + O(\eta^2(\epsilon + \bar\epsilon))}{\eta\epsilon + O(\eta(\epsilon^3 + \bar\epsilon))}.$$

Remark 2. To simply compare the convergence rates of the two algorithms, we assumed in Theorem 8 that there is some constant $\bar T$ such that $T_{\mathrm{SGD}}, T_{\mathrm{OSGD}} < \bar T$. Without this assumption, we can still obtain lower bounds for both algorithms at $\tau = 1$:
$$T_{\mathrm{OSGD}} \ge \frac{2}{\log(1 + 2\eta)} \log\Big( \frac{\pi/4 - \rho}{\epsilon + O(\epsilon^3)} \Big), \qquad T_{\mathrm{SGD}} \ge \frac{1}{\log\big( 1 + \frac{2(1+C)\eta}{3} \big)} \log\Big( \frac{\pi/4 - \rho}{C' \epsilon + O(\epsilon^3 + \bar\epsilon)} \Big),$$
where $C' = \frac{e^2 - 1}{2(C+1)(e^2+1)}$, $C := \max_{x \in [0, \pi/2]} \big[ 2\sin x / (1 + e^{1 + \cos x}) \big]'$, and their approximations are $C' \approx 0.265$ and $C \approx 0.436$. For small enough $\eta, \epsilon, \bar\epsilon$, we can observe that the OSGD algorithm converges faster than the SGD algorithm if the inequalities are tight.

Direct Application of OSGD and its Convergence

We now focus exclusively on the convergence of OSGD. We prove Theorem 7, which establishes the convergence of a direct application of OSGD to the mini-batch contrastive learning problem, with respect to the loss function $\widetilde L^{\mathrm{con}}$.

Algorithm 2: The direct application of OSGD to our problem
1: Parameters: $k$: the number of batches to be randomly chosen at each iteration; $q$: the number of batches with the largest losses to be chosen among the $k$ batches at each iteration; $T$: the number of iterations.
2: Inputs: an initial feature vector $(U^{(0)}, V^{(0)})$, the set of learning rates $\{\eta_t\}_{t=0}^{T-1}$.
3: for $t = 0$ to $T - 1$ do
4:   Randomly choose $S \subseteq \left[\binom{N}{B}\right]$ with $|S| = k$
5:   Choose $i_1, \ldots, i_q \in S$ having the largest losses $L^{\mathrm{con}}(U^{(t)}_{B_i}, V^{(t)}_{B_i})$, $i \in S$
6:   $g \leftarrow \frac{1}{q} \sum_{j=1}^q \nabla_{U,V} L^{\mathrm{con}}(U^{(t)}_{B_{i_j}}, V^{(t)}_{B_{i_j}})$
7:   $(U^{(t+1)}, V^{(t+1)}) \leftarrow (U^{(t)}, V^{(t)}) - \eta_t\, g$
8:   $(U^{(t+1)}, V^{(t+1)}) \leftarrow \mathrm{normalize}(U^{(t+1)}, V^{(t+1)})$

For ease of reference, we repeat the following definition:
$$\widetilde L^{\mathrm{con}}(U, V) := \frac{1}{q} \sum_{j=1}^{\binom{N}{B}} \gamma_j\, L^{\mathrm{con}}(U_{B^{(j)}}, V_{B^{(j)}}), \qquad \gamma_j = \frac{\sum_{l=0}^{q-1} \binom{j-1}{l} \binom{\binom{N}{B} - j}{k - l - 1}}{\binom{\binom{N}{B}}{k}}, \quad (32)$$
where $B^{(j)}$ represents the batch with the $j$-th largest loss among all possible $\binom{N}{B}$ batches, and $q, k$ are the parameters of OSGD.

Theorem 7 (Convergence results). Consider sampling $t^*$ from $\{0, \ldots, T\}$ with probability proportional to $\{\eta_t\}_{t=0}^T$, that is, $P(t^* = t) = \eta_t / \sum_{i=0}^T \eta_i$. Then, for $\rho > \rho_0 = \sqrt{8/(B\tau^2)} + 4e^{2/\tau}/(B\tau^2)$,
$$\mathbb{E}\Big[ \big\| \nabla \widetilde L^{\mathrm{con}}(U^{(t^*)}, V^{(t^*)}) \big\|^2 \Big] \le \frac{(\rho + \rho_0)^2}{\rho(\rho - \rho_0)} \cdot \frac{\widetilde L^{\mathrm{con}}(U^{(0)}, V^{(0)}) - \widetilde L^{\mathrm{con}}_* + (8\rho/\tau^2) \sum_{t=0}^T \eta_t^2}{\sum_{t=0}^T \eta_t},$$
where $\widetilde L^{\mathrm{con}}_*$ denotes the minimum value of $\widetilde L^{\mathrm{con}}$.

Proof. Define
$$(\widehat U^{(t^*)}, \widehat V^{(t^*)}) = \arg\min_{U', V'} \Big\{ \widetilde L^{\mathrm{con}}(U', V') + \frac{\rho}{2} \big\| (U', V') - (U^{(t^*)}, V^{(t^*)}) \big\|^2 \Big\}.$$
We begin by referring to Lemma 2.2 in Davis & Drusvyatskiy (2019), which provides the following relations:
$$(U^{(t^*)}, V^{(t^*)}) - (\widehat U^{(t^*)}, \widehat V^{(t^*)}) = \frac{1}{\rho} \nabla \widetilde L^{\mathrm{con}}_\rho(U^{(t^*)}, V^{(t^*)}), \qquad \big\| \nabla \widetilde L^{\mathrm{con}}(\widehat U^{(t^*)}, \widehat V^{(t^*)}) \big\| \le \big\| \nabla \widetilde L^{\mathrm{con}}_\rho(U^{(t^*)}, V^{(t^*)}) \big\|.$$
Furthermore, $\nabla \widetilde L^{\mathrm{con}}$ is $\rho_0$-Lipschitz on $((\mathcal{B}_d(0, 1))^N)^2$ by Theorem 11. This gives
$$\big\| \nabla \widetilde L^{\mathrm{con}}(U^{(t^*)}, V^{(t^*)}) - \nabla \widetilde L^{\mathrm{con}}(\widehat U^{(t^*)}, \widehat V^{(t^*)}) \big\| \le \rho_0 \big\| (U^{(t^*)}, V^{(t^*)}) - (\widehat U^{(t^*)}, \widehat V^{(t^*)}) \big\|,$$
and hence
$$\big\| \nabla \widetilde L^{\mathrm{con}}(U^{(t^*)}, V^{(t^*)}) \big\| \le \big\| \nabla \widetilde L^{\mathrm{con}}(\widehat U^{(t^*)}, \widehat V^{(t^*)}) \big\| + \rho_0 \big\| (U^{(t^*)}, V^{(t^*)}) - (\widehat U^{(t^*)}, \widehat V^{(t^*)}) \big\| \le \frac{\rho + \rho_0}{\rho} \big\| \nabla \widetilde L^{\mathrm{con}}_\rho(U^{(t^*)}, V^{(t^*)}) \big\|.$$
As a consequence of Theorem 9,
$$\mathbb{E}\Big[ \big\| \nabla \widetilde L^{\mathrm{con}}(U^{(t^*)}, V^{(t^*)}) \big\|^2 \Big] \le \frac{(\rho + \rho_0)^2}{\rho^2} \mathbb{E}\Big[ \big\| \nabla \widetilde L^{\mathrm{con}}_\rho(U^{(t^*)}, V^{(t^*)}) \big\|^2 \Big] \le \frac{(\rho + \rho_0)^2}{\rho(\rho - \rho_0)} \cdot \frac{\widetilde L^{\mathrm{con}}_\rho(U^{(0)}, V^{(0)}) - \widetilde L^{\mathrm{con}}_{\rho,*} + (8\rho/\tau^2) \sum_{t=0}^T \eta_t^2}{\sum_{t=0}^T \eta_t}$$
$$\le \frac{(\rho + \rho_0)^2}{\rho(\rho - \rho_0)} \cdot \frac{\widetilde L^{\mathrm{con}}(U^{(0)}, V^{(0)}) - \widetilde L^{\mathrm{con}}_* + (8\rho/\tau^2) \sum_{t=0}^T \eta_t^2}{\sum_{t=0}^T \eta_t}.$$
Note that $\widetilde L^{\mathrm{con}}_{\rho,*}$ is the minimum value of $\widetilde L^{\mathrm{con}}_\rho$, and the last inequality is due to
$$\widetilde L^{\mathrm{con}}_\rho(U^{(0)}, V^{(0)}) - \widetilde L^{\mathrm{con}}_{\rho,*} \le \widetilde L^{\mathrm{con}}(U^{(0)}, V^{(0)}) - \widetilde L^{\mathrm{con}}_*,$$
because
$$\widetilde L^{\mathrm{con}}_\rho(U^{(0)}, V^{(0)}) = \min_{U', V'} \Big\{ \widetilde L^{\mathrm{con}}(U', V') + \frac{\rho}{2} \big\| (U', V') - (U^{(0)}, V^{(0)}) \big\|^2 \Big\} \le \widetilde L^{\mathrm{con}}(U^{(0)}, V^{(0)})$$
by putting $(U', V') = (U^{(0)}, V^{(0)})$, and
$$\widetilde L^{\mathrm{con}}_* = \min_{U', V'} \widetilde L^{\mathrm{con}}(U', V') \le \min_{U', V'} \Big\{ \widetilde L^{\mathrm{con}}(U', V') + \frac{\rho}{2} \big\| (U', V') - (U, V) \big\|^2 \Big\} = \widetilde L^{\mathrm{con}}_\rho(U, V)$$
for any $U, V$, implying that $\widetilde L^{\mathrm{con}}_* \le \widetilde L^{\mathrm{con}}_{\rho,*}$. We provide the details, including proofs of the theorems and lemmas, in the sequel.

Theorem 9. Consider sampling $t^*$ from $\{0, \ldots, T\}$ with probability $P(t^* = t) = \eta_t / \sum_{i=0}^T \eta_i$. Then, for $\rho > \rho_0 = 2\sqrt{2/(B\tau^2)} + 4e^{2/\tau}/(B\tau^2)$, we have
$$\mathbb{E}\Big[ \big\| \nabla \widetilde L^{\mathrm{con}}_\rho(U^{(t^*)}, V^{(t^*)}) \big\|^2 \Big] \le \frac{\rho}{\rho - \rho_0} \cdot \frac{\widetilde L^{\mathrm{con}}_\rho(U^{(0)}, V^{(0)}) - \widetilde L^{\mathrm{con}}_{\rho,*} + (8\rho/\tau^2) \sum_{t=0}^T \eta_t^2}{\sum_{t=0}^T \eta_t},$$
where
$$\widetilde L^{\mathrm{con}}_\rho(U, V) := \min_{U', V'} \Big\{ \widetilde L^{\mathrm{con}}(U', V') + \frac{\rho}{2} \big\| (U', V') - (U, V) \big\|^2 \Big\},$$
and $\widetilde L^{\mathrm{con}}_{\rho,*}$ denotes the minimum value of $\widetilde L^{\mathrm{con}}_\rho$.

Proof. $\nabla \widetilde L^{\mathrm{con}}$ is $\rho_0$-Lipschitz on $((\mathcal{B}_d(0, 1))^N)^2$ by Theorem 11; hence, $\widetilde L^{\mathrm{con}}$ is $\rho_0$-weakly convex by Lemma 5. Furthermore, the gradient norm of a mini-batch loss, $\|\nabla_{U,V} L^{\mathrm{con}}(U_{B_i}, V_{B_i})\|$, is bounded by $L = 4/\tau$. Finally, (Kawaguchi & Lu, 2020, Theorem 1) states that the expected update direction of the OSGD algorithm is $\nabla_{U,V} \widetilde L^{\mathrm{con}}(U^{(t)}, V^{(t)})$ at each iteration $t$. Therefore, we can apply (Davis & Drusvyatskiy, 2019, Theorem 3.1) to the OSGD algorithm to obtain the desired result.

Roughly speaking, Theorem 7 shows that $(U^{(t^*)}, V^{(t^*)})$ is close to a stationary point of $\widetilde L^{\mathrm{con}}_\rho$. We refer readers to Davis & Drusvyatskiy (2019), which illustrates the role of the norm of the gradient of the Moreau envelope, $\|\nabla \widetilde L^{\mathrm{con}}_\rho(U^{(t^*)}, V^{(t^*)})\|$, being small in the context of stochastic optimization. We leave the statements of some auxiliary theorems and lemmas to Subsection C.3.

C.3 Auxiliaries for the Proof of Theorem 7

For a square matrix $A$, we denote its trace by $\mathrm{tr}(A)$.
If matrices $A$ and $C$ are of the same shape, we define the canonical inner product $\langle A, C\rangle := \sum_{i,j} A_{ij}C_{ij} = \mathrm{tr}(A^\top C)$. Following pythonic notation, we write $A_{i,:}$ and $A_{:,j}$ for the $i$-th row and $j$-th column of a matrix $A$, respectively. The Cauchy–Schwarz inequality for matrices is $\langle A, C\rangle \le \|A\|\,\|C\|$, where the norm is the Frobenius norm, i.e., $\|A\| = \big(\sum_{i,j} A_{ij}^2\big)^{1/2}$.

**Lemma 4.** Let $A \in \mathbb{R}^{m\times n}$ and $C \in \mathbb{R}^{n\times k}$. Then $\|AC\| \le \|A\|\,\|C\|$.

*Proof.* By a basic calculation, we have
$$\|AC\|^2 = \mathrm{tr}(C^\top A^\top A C) = \mathrm{tr}(CC^\top A^\top A) = \langle CC^\top, A^\top A\rangle \le \|CC^\top\|\,\|A^\top A\|.$$
Meanwhile, for any positive semidefinite matrix $D$, let $D = U\Lambda U^\top$ be a spectral decomposition of $D$. Since the eigenvalues $\lambda_i(D)$ of $D$ are nonnegative, we have
$$\mathrm{tr}(D^2) = \mathrm{tr}(U\Lambda^2 U^\top) = \mathrm{tr}(\Lambda^2 U^\top U) = \mathrm{tr}(\Lambda^2) = \sum_i \lambda_i(D)^2 \le \Big(\sum_i \lambda_i(D)\Big)^2 = (\mathrm{tr}(D))^2.$$
Invoking this fact, we have
$$\|CC^\top\|^2 = \mathrm{tr}((CC^\top)^2) \le (\mathrm{tr}(CC^\top))^2 = \|C\|^4,$$
or equivalently $\|CC^\top\| \le \|C\|^2$. Similarly, we have $\|A^\top A\| \le \|A\|^2$. Therefore, we obtain
$$\|AC\|^2 \le \|CC^\top\|\,\|A^\top A\| \le \|A\|^2\|C\|^2,$$
which means $\|AC\| \le \|A\|\,\|C\|$.

If $L : \mathbb{R}^{m\times n} \to \mathbb{R}$ is a function of a matrix $X \in \mathbb{R}^{m\times n}$, we write the gradient of $L$ with respect to $X$ as the matrix-valued function defined by $(\nabla_X L)_{ij} = \partial L/\partial X_{ij}$. The chain rule then gives $\frac{d}{dt}L(X) = \big\langle \nabla_X L, \frac{dX}{dt}\big\rangle$ for a scalar variable $t$. If $L(U, V)$ is a function of two matrices $U, V \in \mathbb{R}^{m\times n}$, we define $\nabla_{U,V} L$ as the horizontal stack of the two gradient matrices, i.e., $\nabla_{U,V} L = (\nabla_U L, \nabla_V L)$.

Now, we briefly review some necessary facts about Lipschitz functions.

**Lemma 5** (Weak convexity from a Lipschitz gradient). Let $f : \mathbb{R}^d \to \mathbb{R}$ be a $\rho$-smooth function, i.e., $\nabla f$ is $\rho$-Lipschitz. Then $f$ is $\rho$-weakly convex.

*Proof.* For the sake of simplicity, assume $f$ is twice differentiable. We claim that $\nabla^2 f \succeq -\rho I_d$, where $I_d$ is the $d\times d$ identity matrix and $A \succeq B$ means $A - B$ is positive semidefinite. It is clear that this claim renders $f + \frac{\rho}{2}\|\cdot\|^2$ convex. Let us assume, contrary to our claim, that there exists $x_0 \in \mathbb{R}^d$ with $\nabla^2 f(x_0) \not\succeq -\rho I_d$. Then $\nabla^2 f(x_0)$ has an eigenvalue $\lambda < -\rho$.
Denote the corresponding unit eigenvector by $u$, so that $\nabla^2 f(x_0)u = \lambda u$, and consider $g(\epsilon) = \nabla f(x_0 + \epsilon u)$; the (elementwise) Taylor expansion of $g$ at $\epsilon = 0$ gives
$$\nabla f(x_0 + \epsilon u) = \nabla f(x_0) + \epsilon\,\nabla^2 f(x_0)u + o(\epsilon),$$
which gives
$$\frac{\nabla f(x_0 + \epsilon u) - \nabla f(x_0)}{\epsilon} = \nabla^2 f(x_0)u + o(1).$$
Taking $\epsilon \to 0$, we obtain $\|\nabla f(x_0 + \epsilon u) - \nabla f(x_0)\|/\epsilon \to |\lambda| > \rho$, which contradicts the $\rho$-Lipschitzness of $\nabla f$.

For $X \in \mathbb{R}^{B\times B}$, let us define
$$L_M(X) := \frac{1}{B}\sum_{i=1}^{B}\left[-\frac{2}{\tau}X_{ii} + \log\sum_{j=1}^{B}\exp(X_{ij}/\tau) + \log\sum_{j=1}^{B}\exp(X_{ji}/\tau)\right].$$
Using this function, we can write the loss corresponding to a mini-batch $B$ of size $B$ as $L_{\mathrm{con}}(U_B, V_B) = L_M(U_B^\top V_B)$. We now claim the following:

**Lemma 6.** Consider $X \in \mathbb{R}^{B\times B}$ with $|X_{ij}| \le 1$ for all $1 \le i, j \le B$. Then $\|\nabla_X L_M(X)\|$ is bounded by $\sqrt{8/(B\tau^2)}$, and $\nabla_X L_M$ is $\frac{2e^{2/\tau}}{B^2\tau^2}$-Lipschitz.

*Proof.* With basic calculus rules, we obtain
$$\tau B\,\nabla_X L_M(X) = -2I_B + P_X + Q_X, \tag{33}$$
where $I_B$ is the $B\times B$ identity matrix and
$$(P_X)_{ij} = \frac{\exp(X_{ij}/\tau)}{\sum_{k=1}^{B}\exp(X_{ik}/\tau)}, \qquad (Q_X)_{ij} = \frac{\exp(X_{ij}/\tau)}{\sum_{k=1}^{B}\exp(X_{kj}/\tau)}.$$
Since $\sum_j (P_X)_{ij} = 1$ for all $i$, it is easy to see that $\|(I_B - P_X)_{i,:}\|^2 \le 2$. This gives $\|I_B - P_X\|^2 \le 2B$, and similarly $\|I_B - Q_X\|^2 \le 2B$. Therefore, we have
$$B\tau\|\nabla_X L_M(X)\| \le \|I_B - P_X\| + \|I_B - Q_X\| \le 2\sqrt{2B},$$
or equivalently,
$$\|\nabla_X L_M(X)\| \le \sqrt{8/(B\tau^2)}. \tag{35}$$
We now show that $\nabla_X L_M$ is $\frac{2e^{2/\tau}}{B^2\tau^2}$-Lipschitz. Define $p : \mathbb{R}^B \to \mathbb{R}^B$ by
$$(p(x))_i = \frac{\exp(x_i/\tau)}{\sum_{k=1}^{B}\exp(x_k/\tau)}.$$
Then we have $\nabla_x p(x) = \frac{1}{\tau}\left[\mathrm{diag}(p(x)) - p(x)p(x)^\top\right]$. For $x \in [-1, 1]^B$, we have $p(x)_i \le \frac{e^{2/\tau}}{B - 1 + e^{2/\tau}} < \frac{e^{2/\tau}}{B}$ for any $i$. Thus, since $0 \preceq \mathrm{diag}(p(x)) - p(x)p(x)^\top \preceq \mathrm{diag}(p(x))$, we have $\|\nabla_x p(x)\| \le \frac{1}{\tau}\|\mathrm{diag}(p(x))\| \le \frac{e^{2/\tau}}{B\tau}$, which means $p$ is $\frac{e^{2/\tau}}{B\tau}$-Lipschitz, i.e., $\|p(x) - p(y)\| \le \frac{e^{2/\tau}}{B\tau}\|x - y\|$ for any $x, y \in [-1, 1]^B$. Using this fact, we can bound $\|P_X - P_Y\|$ for $X, Y \in [-1, 1]^{B\times B}$ as follows:
$$\|P_X - P_Y\|^2 = \sum_{i=1}^{B}\|p(X_{i,:}) - p(Y_{i,:})\|^2 \le \left(\frac{e^{2/\tau}}{B\tau}\right)^2\sum_{i=1}^{B}\|X_{i,:} - Y_{i,:}\|^2 = \left(\frac{e^{2/\tau}}{B\tau}\right)^2\|X - Y\|^2.$$
Similarly, we have $\|Q_X - Q_Y\| \le \frac{e^{2/\tau}}{B\tau}\|X - Y\|$. Summing up,
$$B\tau\|\nabla_X L_M(X) - \nabla_X L_M(Y)\| \le \|P_X - P_Y\| + \|Q_X - Q_Y\| \le \frac{2e^{2/\tau}}{B\tau}\|X - Y\|,$$
which renders $\|\nabla_X L_M(X) - \nabla_X L_M(Y)\| \le \frac{2e^{2/\tau}}{B^2\tau^2}\|X - Y\|$.

Recall that $L_{\mathrm{con}}(U_B, V_B) = L_M(U_B^\top V_B)$ for $U_B, V_B \in \mathbb{R}^{d\times B}$ (the embedding matrices corresponding to a mini-batch $B$). Using this relation, we can calculate the gradient of $L_{\mathrm{con}}$ with respect to $U_B$.
Denote by $E_{ij} \in \mathbb{R}^{d\times B}$ the one-hot matrix whose entries are all zero except for a 1 at index $(i, j)$, and write $G = \nabla_X L_M(U_B^\top V_B)$. Then
$$\frac{\partial}{\partial (U_B)_{ij}} L_{\mathrm{con}}(U_B, V_B) = \left\langle \frac{\partial (U_B^\top V_B)}{\partial (U_B)_{ij}}, \nabla_X L_M(U_B^\top V_B)\right\rangle = \langle E_{ij}^\top V_B, G\rangle = \mathrm{tr}(V_B^\top E_{ij} G) = \mathrm{tr}(E_{ij} G V_B^\top) = (G V_B^\top)_{ji} = (V_B G^\top)_{ij}.$$
This elementwise relation means
$$\nabla_{U_B} L_{\mathrm{con}}(U_B, V_B) = V_B G^\top = V_B\left(\nabla_X L_M(U_B^\top V_B)\right)^\top, \tag{36}$$
and similarly,
$$\nabla_{V_B} L_{\mathrm{con}}(U_B, V_B) = U_B\,\nabla_X L_M(U_B^\top V_B). \tag{37}$$
We introduce a simple lemma for bounding the difference between two matrix products.

**Lemma 7.** For $A_1, A_2 \in \mathbb{R}^{m\times n}$ and $B_1, B_2 \in \mathbb{R}^{n\times k}$, we have
$$\|A_1 B_1 - A_2 B_2\| \le \|A_1 - A_2\|\,\|B_1\| + \|A_2\|\,\|B_1 - B_2\|.$$

*Proof.* This follows from a direct calculation and Lemma 4:
$$\|A_1 B_1 - A_2 B_2\| = \|A_1 B_1 - A_2 B_1 + A_2 B_1 - A_2 B_2\| \le \|(A_1 - A_2)B_1\| + \|A_2(B_1 - B_2)\| \le \|A_1 - A_2\|\,\|B_1\| + \|A_2\|\,\|B_1 - B_2\|.$$

**Theorem 10.** For any $U, V \in (\mathbb{B}^d(0,1))^N$ and any batch $B$ of size $B$, we have $\|\nabla_{U,V} L_{\mathrm{con}}(U_B, V_B)\| \le 4/\tau$.

*Proof.* Suppose $U_B, V_B \in (\mathbb{B}^d(0,1))^B$. We have
$$\nabla_{U_B, V_B} L_{\mathrm{con}}(U_B, V_B) = \left(V_B(\nabla_X L_M(U_B^\top V_B))^\top,\; U_B\,\nabla_X L_M(U_B^\top V_B)\right)$$
from Equations (36) and (37). Since $\|U_B\|, \|V_B\| \le \sqrt{B}$ and $\|\nabla_X L_M(X)\| \le \sqrt{8/(B\tau^2)}$ (see Lemma 6), Lemma 4 gives
$$\|V_B(\nabla_X L_M(U_B^\top V_B))^\top\| \le \|V_B\|\,\|\nabla_X L_M(U_B^\top V_B)\| \le \sqrt{8}/\tau, \qquad \|U_B\,\nabla_X L_M(U_B^\top V_B)\| \le \|U_B\|\,\|\nabla_X L_M(U_B^\top V_B)\| \le \sqrt{8}/\tau,$$
so that
$$\|\nabla_{U_B, V_B} L_{\mathrm{con}}(U_B, V_B)\| = \sqrt{\|V_B(\nabla_X L_M(U_B^\top V_B))^\top\|^2 + \|U_B\,\nabla_X L_M(U_B^\top V_B)\|^2} \le \sqrt{8/\tau^2 + 8/\tau^2} = 4/\tau.$$
Since $L_{\mathrm{con}}(U_B, V_B)$ is independent of $U_{[N]\setminus B}$ and $V_{[N]\setminus B}$, we have $\|\nabla_{U,V} L_{\mathrm{con}}(U_B, V_B)\| = \|\nabla_{U_B, V_B} L_{\mathrm{con}}(U_B, V_B)\| \le 4/\tau$.

**Theorem 11.** $\nabla\widetilde{L}_{\mathrm{con}}(U, V)$ is $\rho_0$-Lipschitz for $U, V \in (\mathbb{B}^d(0,1))^N$; to clarify,
$$\left\|\nabla\widetilde{L}_{\mathrm{con}}(U^1, V^1) - \nabla\widetilde{L}_{\mathrm{con}}(U^2, V^2)\right\| \le \rho_0\left\|(U^1, V^1) - (U^2, V^2)\right\|$$
for any $U^1, V^1, U^2, V^2 \in (\mathbb{B}^d(0,1))^N$, where $\rho_0 = \sqrt{8/(B\tau^2)} + 4e^{2/\tau}/(B\tau^2)$.

*Proof.* Denoting by $U^i_B, V^i_B$ the parts of $U^i, V^i$ that correspond to a mini-batch $B$, we first show that
$$\left\|\nabla_{U_B, V_B} L_{\mathrm{con}}(U^1_B, V^1_B) - \nabla_{U_B, V_B} L_{\mathrm{con}}(U^2_B, V^2_B)\right\| \le \rho_0\left\|(U^1_B, V^1_B) - (U^2_B, V^2_B)\right\|$$
holds. For any $U_B, V_B \in (\mathbb{B}^d(0,1))^B$, we have
$$\nabla_{U_B, V_B} L_{\mathrm{con}}(U_B, V_B) = \left(V_B(\nabla_X L_M(U_B^\top V_B))^\top,\; U_B\,\nabla_X L_M(U_B^\top V_B)\right)$$
from Equations (36) and (37).
Recall Lemma 6: for any $U^i_B, V^i_B \in (\mathbb{B}^d(0,1))^B$ ($i = 1, 2$), we have
$$\|\nabla_X L_M((U^i_B)^\top V^i_B)\| \le \sqrt{8/(B\tau^2)}, \qquad \|\nabla_X L_M((U^1_B)^\top V^1_B) - \nabla_X L_M((U^2_B)^\top V^2_B)\| \le \frac{2e^{2/\tau}}{B^2\tau^2}\|(U^1_B)^\top V^1_B - (U^2_B)^\top V^2_B\|.$$
We invoke Lemma 7 and obtain
$$\begin{aligned}
&\left\|U^1_B\,\nabla_X L_M((U^1_B)^\top V^1_B) - U^2_B\,\nabla_X L_M((U^2_B)^\top V^2_B)\right\| \\
&\quad\le \|U^1_B - U^2_B\|\,\|\nabla_X L_M((U^1_B)^\top V^1_B)\| + \|U^2_B\|\,\|\nabla_X L_M((U^1_B)^\top V^1_B) - \nabla_X L_M((U^2_B)^\top V^2_B)\| \\
&\quad\le \sqrt{8/(B\tau^2)}\,\|U^1_B - U^2_B\| + \frac{2e^{2/\tau}}{B^{3/2}\tau^2}\|(U^1_B)^\top V^1_B - (U^2_B)^\top V^2_B\| \\
&\quad\le \sqrt{8/(B\tau^2)}\,\|U^1_B - U^2_B\| + \frac{2e^{2/\tau}}{B^{3/2}\tau^2}\left(\|U^1_B - U^2_B\|\,\|V^1_B\| + \|U^2_B\|\,\|V^1_B - V^2_B\|\right) \\
&\quad\le \left(\sqrt{8/(B\tau^2)} + \frac{2e^{2/\tau}}{B\tau^2}\right)\|U^1_B - U^2_B\| + \frac{2e^{2/\tau}}{B\tau^2}\|V^1_B - V^2_B\|,
\end{aligned}$$
and similarly
$$\left\|V^1_B(\nabla_X L_M((U^1_B)^\top V^1_B))^\top - V^2_B(\nabla_X L_M((U^2_B)^\top V^2_B))^\top\right\| \le \frac{2e^{2/\tau}}{B\tau^2}\|U^1_B - U^2_B\| + \left(\sqrt{8/(B\tau^2)} + \frac{2e^{2/\tau}}{B\tau^2}\right)\|V^1_B - V^2_B\|.$$
Using the fact that $(ax + by)^2 + (bx + ay)^2 = (a^2 + b^2)(x^2 + y^2) + 4abxy \le (a + b)^2(x^2 + y^2)$ holds for any $a, b \ge 0$ and $x, y \in \mathbb{R}$, we obtain
$$\begin{aligned}
&\left\|\nabla L_{\mathrm{con}}(U^1_B, V^1_B) - \nabla L_{\mathrm{con}}(U^2_B, V^2_B)\right\|^2 \\
&\quad= \left\|V^1_B(\nabla_X L_M((U^1_B)^\top V^1_B))^\top - V^2_B(\nabla_X L_M((U^2_B)^\top V^2_B))^\top\right\|^2 + \left\|U^1_B\,\nabla_X L_M((U^1_B)^\top V^1_B) - U^2_B\,\nabla_X L_M((U^2_B)^\top V^2_B)\right\|^2 \\
&\quad\le \left(\sqrt{8/(B\tau^2)} + \frac{4e^{2/\tau}}{B\tau^2}\right)^2\left(\|U^1_B - U^2_B\|^2 + \|V^1_B - V^2_B\|^2\right) = \left(\sqrt{8/(B\tau^2)} + \frac{4e^{2/\tau}}{B\tau^2}\right)^2\left\|(U^1_B, V^1_B) - (U^2_B, V^2_B)\right\|^2.
\end{aligned}$$
Restating this with $\rho_0 = \sqrt{8/(B\tau^2)} + 4e^{2/\tau}/(B\tau^2)$, we have
$$\left\|\nabla L_{\mathrm{con}}(U^1_B, V^1_B) - \nabla L_{\mathrm{con}}(U^2_B, V^2_B)\right\| \le \rho_0\left\|(U^1_B, V^1_B) - (U^2_B, V^2_B)\right\|. \tag{38}$$
Recall the definition of $\widetilde{L}_{\mathrm{con}}$:
$$\widetilde{L}_{\mathrm{con}}(U, V) = \frac{1}{q}\sum_j \gamma_j\, L_{\mathrm{con}}(U_{B(j)}, V_{B(j)}), \qquad \gamma_j = \frac{\sum_{l=0}^{q-1}\binom{j-1}{l}\binom{\binom{N}{B}-j}{k-l-1}}{\binom{\binom{N}{B}}{k}},$$
with $\sum_j \gamma_j = q$. For any $U, V \in (\mathbb{S}^{d-1})^N$, we can find a neighborhood of $(U, V)$ in which the loss ranking of $L_{\mathrm{con}}(U_{B_i}, V_{B_i})$ over $i \in \{1, \dots, \binom{N}{B}\}$ does not change, since $L_{\mathrm{con}}$ is Lipschitz; more precisely, we can find a ranking that is valid at every point of the neighborhood. Therefore,
$$\nabla_{U,V}\widetilde{L}_{\mathrm{con}}(U, V) = \frac{1}{q}\sum_j \gamma_j\,\nabla_{U,V} L_{\mathrm{con}}(U_{B(j)}, V_{B(j)}),$$
and since $\|(U^1_{B(j)}, V^1_{B(j)}) - (U^2_{B(j)}, V^2_{B(j)})\| \le \|(U^1, V^1) - (U^2, V^2)\|$, Equation (38) together with $\sum_j \gamma_j = q$ shows that $\nabla_{U,V}\widetilde{L}_{\mathrm{con}}(U, V)$ is locally $\rho_0$-Lipschitz. Since $L_{\mathrm{con}}$ is smooth, this local property is equivalent to $-\rho_0 I \preceq \nabla^2_{U,V}\widetilde{L}_{\mathrm{con}}(U, V) \preceq \rho_0 I$ wherever the ranking is locally constant, where $I$ is the identity matrix of the appropriate dimension. Therefore, $\nabla\widetilde{L}_{\mathrm{con}}$ is $\rho_0$-Lipschitz on $((\mathbb{B}^d(0,1))^N)^2$.
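As a numerical sanity check on Lemma 6 and Equations (36)–(37), the sketch below computes $\nabla_X L_M$ in closed form via Equation (33), confirms the $\sqrt{8/(B\tau^2)}$ bound on random unit-norm embeddings, and compares the analytic gradient $V_B G^\top$ against a finite difference. The function names (`LM`, `grad_LM`) are ours, not from the paper's code.

```python
import numpy as np

def LM(X, tau):
    # mini-batch contrastive loss as a function of the similarity matrix X
    rows = np.log(np.exp(X / tau).sum(axis=1))
    cols = np.log(np.exp(X / tau).sum(axis=0))
    return (rows.sum() + cols.sum() - 2 * np.trace(X) / tau) / X.shape[0]

def grad_LM(X, tau):
    # Equation (33): tau * B * grad = -2 I_B + P_X + Q_X
    B = X.shape[0]
    P = np.exp(X / tau); P /= P.sum(axis=1, keepdims=True)  # row softmax
    Q = np.exp(X / tau); Q /= Q.sum(axis=0, keepdims=True)  # column softmax
    return (-2 * np.eye(B) + P + Q) / (B * tau)

rng = np.random.default_rng(0)
d, B, tau = 4, 8, 0.5
U = rng.standard_normal((d, B)); U /= np.linalg.norm(U, axis=0)
V = rng.standard_normal((d, B)); V /= np.linalg.norm(V, axis=0)
X = U.T @ V  # entries lie in [-1, 1] by Cauchy-Schwarz

# Lemma 6: the gradient norm never exceeds sqrt(8 / (B tau^2))
G = grad_LM(X, tau)
assert np.linalg.norm(G) <= np.sqrt(8 / (B * tau**2))

# Equation (36): grad wrt U_B equals V_B G^T; check one coordinate
eps, (i, j) = 1e-6, (1, 2)
Up = U.copy(); Up[i, j] += eps
Um = U.copy(); Um[i, j] -= eps
numeric = (LM(Up.T @ V, tau) - LM(Um.T @ V, tau)) / (2 * eps)
assert abs((V @ G.T)[i, j] - numeric) < 1e-6
```

The same finite-difference check applied to a perturbation of $V$ verifies Equation (37) analogously.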
D Algorithm Details

D.1 Stochastic Gradient Descent (SGD)

We consider two SGD algorithms:

1. SGD with replacement (Algorithm 3) with $k = 1$, used for the theoretical analysis in Section 5.1.
2. SGD without replacement (Algorithm 4), used for the experimental results in Section 6, as is widely employed in practical settings.

In the more practical setting where $u_i = f_\theta(x_i)$ and $v_i = g_\phi(y_i)$, SGD updates the model parameters $\theta, \phi$ using the gradient $\frac{1}{k}\sum_{i\in S_B}\nabla_{\theta,\phi} L_{\mathrm{con}}(U_{B_i}, V_{B_i})$ instead of explicitly updating $U$ and $V$.

Algorithm 3: SGD with replacement
Input: the number of positive pairs $N$, mini-batch size $B$, the number of mini-batches $k$, the number of iterations $T$, the learning rate $\eta$, initial embedding matrices $U, V$
1 for $t = 1$ to $T$ do
2   Randomly select $k$ mini-batch indices $S_B \subseteq [\binom{N}{B}]$ with $|S_B| = k$
3   Compute the gradient: $g \leftarrow \frac{1}{k}\sum_{i\in S_B}\nabla_{U,V} L_{\mathrm{con}}(U_{B_i}, V_{B_i})$
4   Update the weights: $(U, V) \leftarrow (U, V) - \eta\, g$
5   Normalize the column vectors of the embedding matrices $(U, V)$

Algorithm 4: SGD without replacement
Input: the number of positive pairs $N$, mini-batch size $B$, the number of mini-batches $k$, the number of epochs $E$, the learning rate $\eta$, initial embedding matrices $U, V$
1 for $e = 1$ to $E$ do
2   Randomly partition the $N$ positive pairs into $N/B$ mini-batches $\{B_i\}_{i=1}^{N/B}$
3   for $j = 1$ to $N/(Bk)$ do
4     Select $k$ mini-batch indices $S_B = \{k(j-1)+1, k(j-1)+2, \dots, kj\}$
5     Compute the gradient: $g \leftarrow \frac{1}{k}\sum_{i\in S_B}\nabla_{U,V} L_{\mathrm{con}}(U_{B_i}, V_{B_i})$
6     Update the weights: $(U, V) \leftarrow (U, V) - \eta\, g$
7     Normalize the column vectors of the embedding matrices $(U, V)$

D.2 Ordered SGD (OSGD)

We consider two OSGD algorithms:

1. OSGD (Algorithm 5) with $k = \binom{N}{B}$, used for the theoretical analysis in Section 5.1 and the experimental results on synthetic datasets in Section 6.1.
2. OSGD without replacement (Algorithm 6), used for the experimental results on real datasets in Section 6.2, as implemented in practical settings.
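For concreteness, the update loop of Algorithm 4 above (the embedding-space version, before moving to model parameters) can be sketched in a few lines of NumPy. The gradient oracle `loss_grad`, which returns the per-batch gradients $(\nabla_{U_B} L_{\mathrm{con}}, \nabla_{V_B} L_{\mathrm{con}})$, is an assumed helper, not part of the paper's code.

```python
import numpy as np

def sgd_without_replacement(U, V, loss_grad, B, k, E, eta, seed=0):
    """Embedding-space sketch of SGD without replacement (Algorithm 4):
    each epoch partitions the N pairs into N/B mini-batches, takes one
    update per group of k batches, and re-normalizes the columns."""
    rng = np.random.default_rng(seed)
    N = U.shape[1]
    for _ in range(E):
        perm = rng.permutation(N)
        batches = [perm[i * B:(i + 1) * B] for i in range(N // B)]
        for j in range(0, len(batches), k):
            gU, gV = np.zeros_like(U), np.zeros_like(V)
            for idx in batches[j:j + k]:
                bU, bV = loss_grad(U[:, idx], V[:, idx])
                gU[:, idx] += bU / k
                gV[:, idx] += bV / k
            U = U - eta * gU
            V = V - eta * gV
            # project each embedding back onto the unit sphere
            U /= np.linalg.norm(U, axis=0, keepdims=True)
            V /= np.linalg.norm(V, axis=0, keepdims=True)
    return U, V
```

Plugging in the per-batch gradients of $L_{\mathrm{con}}$ from Equations (36)–(37) recovers the setting of Section 5; any other per-batch gradient oracle works for experimentation.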
In the more practical setting where $u_i = f_\theta(x_i)$ and $v_i = g_\phi(y_i)$, OSGD updates the model parameters $\theta, \phi$ using the gradient $\frac{1}{q}\sum_{i\in S_q}\nabla_{\theta,\phi} L_{\mathrm{con}}(U_{B_i}, V_{B_i})$ instead of explicitly updating $U$ and $V$.

Algorithm 5: OSGD
Input: the number of positive pairs $N$, mini-batch size $B$, the numbers of mini-batches $k$ and $q$, the number of iterations $T$, the set of learning rates $\{\eta^{(t)}\}_{t=1}^{T}$, initial embedding matrices $U, V$
1 for $t = 1$ to $T$ do
2   Randomly select $k$ mini-batch indices $S_B \subseteq [\binom{N}{B}]$ with $|S_B| = k$
3   Choose the $q$ mini-batch indices $S_q := \{i_1, i_2, \dots, i_q\} \subseteq S_B$ having the largest losses $L_{\mathrm{con}}(U_{B_i}, V_{B_i})$
4   Compute the gradient: $g \leftarrow \frac{1}{q}\sum_{i\in S_q}\nabla_{U,V} L_{\mathrm{con}}(U_{B_i}, V_{B_i})$
5   Update the weights: $(U, V) \leftarrow (U, V) - \eta^{(t)} g$
6   Normalize the column vectors of the embedding matrices $(U, V)$

Algorithm 6: OSGD without replacement
Input: the number of positive pairs $N$, mini-batch size $B$, the numbers of mini-batches $k$ and $q$, the number of epochs $E$, the learning rate $\eta$, initial embedding matrices $U, V$
1 for $e = 1$ to $E$ do
2   Randomly partition the $N$ positive pairs into $N/B$ mini-batches $\{B_i\}_{i=1}^{N/B}$
3   for $j = 1$ to $N/(Bk)$ do
4     Select $k$ mini-batch indices $S_B = \{k(j-1)+1, k(j-1)+2, \dots, kj\}$
5     Choose the $q$ mini-batch indices $S_q := \{i_1, i_2, \dots, i_q\} \subseteq S_B$ having the largest losses $L_{\mathrm{con}}(U_{B_i}, V_{B_i})$
6     Compute the gradient: $g \leftarrow \frac{1}{q}\sum_{i\in S_q}\nabla_{U,V} L_{\mathrm{con}}(U_{B_i}, V_{B_i})$
7     Update the weights: $(U, V) \leftarrow (U, V) - \eta\, g$
8     Normalize the column vectors of the embedding matrices $(U, V)$

E Additional Experimental Results

In this section, we provide additional experimental results. First, we present histograms of mini-batch counts over loss values for models trained with different batch selection methods. Next, we provide the results for $N \in \{4, 16\}$ on the synthetic dataset.

E.1 Batch Counts: SC Method vs. Random Batch Selection

We provide additional results comparing the mini-batch counts produced by two batch selection algorithms: the proposed SC method and random batch selection.
The mini-batch counts are based on the mini-batch contrastive loss $L_{\mathrm{con}}(U_B, V_B)$ with $\tau = 1$. We measure mini-batch losses from ResNet-18 models trained on CIFAR-100 using the gradient descent algorithm with different batch selection methods: (i) SGD (Algorithm 4), (ii) OSGD (Algorithm 6), and (iii) the SC method (Algorithm 1). Figure 5 illustrates histograms of mini-batch counts for the $N/B$ mini-batches, where $N = 50{,}000$ and $B = 20$. The results show that mini-batches generated through the proposed spectral clustering method tend to contain a higher proportion of large loss values than those from random batch selection, regardless of the pre-trained model used.

Figure 5: Histograms of mini-batch counts for $N/B$ mini-batches, for the contrastive loss measured from ResNet-18 models trained on CIFAR-100 using different batch selection methods: (i) SGD (top), (ii) OSGD (middle), and (iii) the SC method (bottom), where $N = 50{,}000$ and $B = 20$. Each column of plots is derived from a distinct training epoch. Here we compare two batch selection methods: (i) randomly shuffling the $N$ samples and partitioning them into $N/B$ mini-batches of size $B$, and (ii) the proposed SC method given in Algorithm 1. The histograms show that mini-batches generated through the proposed spectral clustering method tend to contain a higher proportion of large loss values when compared to random batch selection, regardless of the pre-trained models used.

Figure 6: Heatmap of the $N \times N$ matrix visualizing the resulting values in the same settings as Figure 4, except $N = 4$. Panels: (a) solutions, (b) full-batch, (c) $\binom{N}{B}$-all, (d) $\binom{N}{B}$-sub, (e) norm differences over update steps.
E.2 Synthetic Dataset

Using the settings from Section 6.1, where each column of the embedding matrices $U, V$ is initialized as a multivariate normal vector and then normalized so that $\|u_i\| = \|v_i\| = 1$ for all $i$, we provide the results for $N \in \{4, 16\}$ with $d = 2N$ or $d = N/2$. Figures 6 and 7 show the results for $N = 4$ and $N = 16$, respectively. We additionally present results for theoretically unproven cases, specifically $N = 8$ and $d \in \{3, 5\}$ (see Figure 8). The results provide empirical evidence that using all combinations of mini-batches leads to the optimal solution of full-batch minimization even in these theoretically unproven cases.

F Computational Complexity

We compare the complexities of three batch selection algorithms: (i) SGD (Algorithm 4), (ii) OSGD (Algorithm 6), and (iii) the SC method (Algorithm 1). We first present the time complexity of selecting $N/B$ batches for each algorithm in big-O notation.

For SGD, batches are selected through random shuffling alone, resulting in $O(N)$.

Figure 7: Heatmap of the $N \times N$ matrix visualizing the resulting values in the same settings as Figure 4, except $N = 16$. Panels: (a) solutions, (b) full-batch, (c) $\binom{N}{B}$-all, (d) $\binom{N}{B}$-sub, (e) norm differences over update steps.

Figure 8: Theoretically unproven setting. Heatmap of the $N \times N$ matrix when $N = 8$ and $d < N - 1$. Panels: (a) full-batch, (b) $\binom{N}{B}$-all, (c) $\binom{N}{B}$-sub.

For OSGD, we randomly select $k$ batches, each of size $B$, and measure the loss of each. After measuring, we sort the results and select the top-$q$ batches by loss. The computational complexity of evaluating the losses and sorting to generate $q$ batches is $O(kB^2 + k\log k)$. Consequently, the complexity of producing $N/B$ batches becomes $O\big((kB^2 + k\log k)\cdot\frac{N}{Bq}\big) = O\big((\frac{kB}{q} + \frac{k\log k}{Bq})N\big)$.
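The OSGD selection step whose cost is counted above — score $k$ candidate batches, keep the top $q$, and average their gradients — can be sketched as follows. This is a simplified embedding-space version of one iteration of Algorithm 6; `loss_fn` and `grad_fn` are assumed per-batch helpers, not the paper's code.

```python
import numpy as np

def osgd_step(U, V, batches, loss_fn, grad_fn, q, eta):
    """One OSGD update over k candidate batches: the O(k B^2) loss
    evaluations plus an O(k log k) sort, then gradients for only the
    q highest-loss batches."""
    losses = [loss_fn(U[:, b], V[:, b]) for b in batches]  # O(k B^2)
    top_q = np.argsort(losses)[-q:]                        # O(k log k)
    gU, gV = np.zeros_like(U), np.zeros_like(V)
    for i in top_q:
        b = batches[i]
        bU, bV = grad_fn(U[:, b], V[:, b])
        gU[:, b] += bU / q
        gV[:, b] += bV / q
    U = U - eta * gU
    V = V - eta * gV
    # re-normalize the embedding columns
    U /= np.linalg.norm(U, axis=0, keepdims=True)
    V /= np.linalg.norm(V, axis=0, keepdims=True)
    return U, V
```

Only the $q$ selected batches incur a gradient computation, which is what makes the per-iteration cost depend on $q$ rather than $k$ after the scoring pass.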
For the SC method, the primary computational bottleneck is the Hungarian algorithm, which runs in $O(N^3)$ time; this term dominates the overall cost of our algorithm. To accelerate batch selection in the proposed SC approach, we randomly partition the $N$ positive pairs into clusters, each of size $kB$, and subsequently apply the SC method within every $kB$-sized cluster instead of applying it directly to the entire set of $N$ pairs. Each cluster yields $k$ mini-batches, at a computational cost of $O(k^3B^3)$ per cluster. Extending this to generate all $N/B$ batches, the complexity becomes $O\big(k^3B^3 \cdot \frac{N}{kB}\big) = O(k^2B^2N)$.

We further report the wall-clock time required to select and update $N/B$ batches with each algorithm on the CIFAR-100 dataset, where $N = 50{,}000$ and $B = 32$ (see Section 6.2). The SC method requires an additional 11 minutes per epoch for batch selection compared to SGD (2 minutes). As mentioned in Section 6.2, we suggest running SC batch selection in parallel with the gradient-descent model updates to mitigate this overhead in practice. This approach reuses the same batches for multiple epochs while the next batch selection proceeds, based on the assumption that the set of high-loss batches does not change rapidly.
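The accelerated selection scheme described above — partition into $kB$-sized clusters and run the cubic-cost SC partitioner only within each cluster — has a simple outer structure, sketched below. Here `sc_partition` stands in for the paper's spectral-clustering step; it is an assumed callback, not an implementation of it.

```python
import numpy as np

def clustered_batch_selection(N, B, k, sc_partition, seed=0):
    """Split the N pairs into chunks of size k*B and run the expensive
    partitioner per chunk, for a total cost of O((kB)^3 * N/(kB))
    = O(k^2 B^2 N) instead of O(N^3) on all N pairs at once.
    sc_partition(indices, B) must return k mini-batches of size B."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(N)
    batches = []
    for start in range(0, N, k * B):
        chunk = perm[start:start + k * B]
        batches.extend(sc_partition(chunk, B))  # k mini-batches per chunk
    return batches

# trivial stand-in partitioner: contiguous split of a chunk into size-B batches
split = lambda chunk, B: [chunk[i * B:(i + 1) * B] for i in range(len(chunk) // B)]
batches = clustered_batch_selection(N=64, B=4, k=2, sc_partition=split)
```

With the trivial stand-in, the scheme degenerates to random partitioning; substituting the spectral-clustering partitioner recovers the proposed method while keeping the same asymptotic structure.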