Statistical Mechanics of Min-Max Problems

Published in Transactions on Machine Learning Research (1/2025)

Yuma Ichikawa ichikawa-yuma1@g.ecc.u-tokyo.ac.jp, ichikawa.yuma@fujitsu.com
Department of Basic Science, University of Tokyo; Fujitsu Limited

Koji Hukushima k-hukushima@g.ecc.u-tokyo.ac.jp
Department of Basic Science, University of Tokyo

Reviewed on OpenReview: https://openreview.net/forum?id=qZqUFeTtuI

Min-max optimization problems, also known as saddle point problems, have attracted significant attention due to their applications in various fields, such as fair beamforming, generative adversarial networks (GANs), and adversarial learning. However, understanding the properties of these min-max problems has remained a substantial challenge. This study introduces a statistical mechanical formalism for analyzing the equilibrium values of min-max problems in the high-dimensional limit, while appropriately addressing the order of the min and max operations. As a first step, we apply this formalism to bilinear min-max games and simple GANs, deriving the relationship between the amount of training data and the generalization error and indicating the optimal ratio of fake to real data for effective learning. This formalism provides a groundwork for a deeper theoretical analysis of the equilibrium properties of various machine learning methods based on min-max problems and encourages the development of new algorithms and architectures.

1 Introduction

Min-max optimization problems, also known as saddle point problems, are well-known classical optimization problems extensively studied in the context of zero-sum games (Wald, 1945; Von Neumann & Morgenstern, 1947). These problems have diverse applications across various fields, such as game theory, machine learning, and signal processing. In game theory, min-max problems arise in zero-sum games where one player's gain corresponds to another's loss.
Several methods have been proposed to find the min-max value or equilibrium points in these games (Dem'yanov & Pevnyi, 1972; Maistroskii, 1977; Bruck, 1977; Lions, 1978; Nemhauser & Wolsey, 1988; Freund & Schapire, 1999). In machine learning, min-max games are relevant for training generative adversarial networks (GANs) (Goodfellow et al., 2020; Arjovsky et al., 2017). Additionally, in adversarial learning, these problems are employed to train models that are robust to adversarial attacks by optimizing a worst-case perturbed loss function (Szegedy et al., 2013; Goodfellow et al., 2014b; Papernot et al., 2016; Madry et al., 2017). Despite the widespread application of min-max optimization problems, several challenges still need to be addressed, including understanding the usefulness of these min-max formulations, evaluating the convergence properties of the algorithms, and conducting sensitivity analyses of min-max values. A promising approach to addressing these issues is to analyze the typical-case behavior of min-max problems by examining the min-max value averaged over random instances drawn from distributions that capture realistic settings, referred to as randomized instance ensembles. Statistical-mechanical approaches, which have demonstrated their effectiveness in analyzing the typical-case behavior of randomized instance ensembles of optimization and constraint-satisfaction problems (Mézard & Parisi, 1986; Fontanari, 1995), provide a powerful formalism for such analyses. Extending this formalism to analyze the typical-case behavior of min-max values thus presents a potential direction for further research, although this has not yet been fully explored. This study applies the statistical mechanical formalism to min-max problems, modeling them as a virtual two-temperature system. This formalism enables a sensitivity analysis of the typical-case min-max values in the high-dimensional limit.
Notably, this formalism properly addresses the order of the min and max operations, which is critical in non-convex scenarios where interchanging the order of min and max can lead to incorrect results (Razaviyayn et al., 2020). Using this formalism, we analyze typical-case min-max values of bilinear min-max games and simple GANs. In particular, we derive the relationship between the amount of training data and the generalization error and indicate the optimal ratio of fake data to real data for effective learning. Our main contributions are as follows:

- We introduce a statistical-mechanical formalism developed for sensitivity analysis of equilibrium values in high-dimensional min-max problems.
- Applying this approach, we conduct a detailed sensitivity analysis of a bilinear min-max game to verify the theoretical validity of our approach.
- Building on this formalism, we analyze the generalization performance of GANs and determine the optimal ratio between fake and real data for practical training.

2 Related Work

The replica method, which is employed in this study, is a non-rigorous but powerful heuristic approach in statistical physics (Edwards & Anderson, 1975; Mézard et al., 1987; Mezard & Montanari, 2009). It has been proven to be a valuable method for high-dimensional machine-learning problems. Previous studies have investigated the relationship between dataset size and generalization error in supervised learning, including single-layer (Gardner & Derrida, 1988; Opper & Haussler, 1991; Barbier et al., 2019; Aubin et al., 2020) and multi-layer (Aubin et al., 2018) neural networks, as well as kernel methods (Dietrich et al., 1999; Bordelon et al., 2020; Gerace et al., 2020).
In unsupervised learning, the replica method has also been applied to dimensionality reduction techniques such as principal component analysis (Biehl & Mietzner, 1993; Hoyle & Rattray, 2004; 2007), and to generative models such as energy-based models (Decelle et al., 2018; Ichikawa & Hukushima, 2022) and denoising autoencoders (Cui & Zdeborová, 2023). However, the dataset-size dependence of GANs has not been previously analyzed, which this study aims to address. Related to our work, a statistical mechanical formalism for addressing min-max problems has been proposed (Varga, 1998). However, its treatment of the inverse temperature limit differs from our approach, and it has limitations in accurately handling the order of the min and max operations. In the context of adversarial learning, which involves a non-convex-concave min-max problem, Tanner et al. (2024) analyze a tractable setting where the internal maximization can be solved. By reducing such cases to standard optimization problems, they apply the replica method and approximate message passing to explore the core phenomenology observed in adversarial robustness. Even in cases where the internal maximization cannot be explicitly solved, the formalism discussed here provides a basis for further analysis and potential extensions to more complex scenarios.

Notation Here, we summarize the notation used in this study. We use the shorthand $[N] = \{1, 2, \dots, N\}$ for $N \in \mathbb{N}$. $I_d \in \mathbb{R}^{d \times d}$ denotes the $d \times d$ identity matrix, $1_d$ denotes the vector $(1, \dots, 1)^\top \in \mathbb{R}^d$, and $0_d$ denotes the vector $(0, \dots, 0)^\top \in \mathbb{R}^d$. For a matrix $A = (A_{ij}) \in \mathbb{R}^{d \times k}$ and a vector $a = (a_i) \in \mathbb{R}^d$, we use the shorthand expressions $dA = \prod_{i=1}^{d} \prod_{j=1}^{k} dA_{ij}$ and $da = \prod_{i=1}^{d} da_i$, respectively. The notation $O_{d_x, d_y}(1)$ describes the asymptotic order of a function with respect to the parameters $d_x$ and $d_y$.
Specifically, a function $f(d_x, d_y)$ is said to be $O_{d_x, d_y}(1)$ if it remains bounded as $d_x$ and $d_y$ grow large (or tend toward some specified limit). The standard Gaussian measure is defined as $Dz = dz\, e^{-\|z\|^2/2}/(2\pi)^{n/2}$. The notation $\mathrm{extr}_x f(x)$ represents the evaluation of a function $f(x)$ at its extremum with respect to the variable $x$; specifically, this shorthand implies locating and evaluating $f(x)$ at points where its gradient $\partial_x f(x) = 0$.

3 Statistical Physics Formalism for Min-Max Optimization Problems

This section introduces a statistical-mechanical formalism that models min-max problems as a virtual two-temperature system from a statistical mechanics perspective. Min-max problems are formally expressed as
$$\Psi(A) = \min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} V(x, y; A), \quad \text{s.t. } \mathcal{X} \subseteq \mathbb{R}^{d_x},\ \mathcal{Y} \subseteq \mathbb{R}^{d_y}, \qquad (1)$$
where $V(\cdot, \cdot): \mathbb{R}^{d_x} \times \mathbb{R}^{d_y} \to \mathbb{R}$ is a bivariate function; $x \in \mathbb{R}^{d_x}$ and $y \in \mathbb{R}^{d_y}$ are the optimization variables; $\mathcal{X}$ and $\mathcal{Y}$ are the feasible sets; and $A$ is a parameter characterizing the problem, e.g., a graph $G$. To analyze min-max problems for a given bivariate function $V(x, y; A)$ in Eq. (1), we introduce the following Boltzmann distribution with virtual inverse temperatures $\beta_{\min} \in \mathbb{R}$ and $\beta_{\max} \in \mathbb{R}$:
$$p_{\beta_{\min}, \beta_{\max}}(x; A) = \frac{1}{Z(\beta_{\min}, \beta_{\max}, A)}\, e^{-\beta_{\min} \frac{1}{\beta_{\max}} \ln \int_{\mathcal{Y}} dy\, e^{\beta_{\max} V(x, y; A)}},$$
where $Z(\beta_{\min}, \beta_{\max}, A)$ is the normalization constant, also known as the partition function; hereafter, we refer to it as the partition function. In this context, the two-temperature system is particularly important because it allows us to distinguish between the opposing optimization objectives inherent in min-max problems, similar to the different thermal behaviors in statistical mechanics. By taking the limit $\beta_{\max} \to +\infty$ followed by $\beta_{\min} \to +\infty$, the distribution
$$\lim_{\beta_{\min} \to +\infty} \lim_{\beta_{\max} \to +\infty} p_{\beta_{\min}, \beta_{\max}}(x; A)$$
concentrates on a uniform distribution over the min-max optimizers, assuming that well-defined min-max values exist for $V(\cdot, \cdot)$ and the min-max value is bounded over the feasible sets $\mathcal{X}$ and $\mathcal{Y}$.
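Both the non-interchangeability of min and max and the behavior of the two-temperature distribution can be checked exactly on a small finite problem. The following sketch (an illustration on a random payoff grid, not the paper's setting; grid sizes and temperatures are arbitrary) verifies weak duality, $\max_y \min_x V \le \min_x \max_y V$, and confirms that with $\beta_{\max} \gg \beta_{\min} \gg 1$ the distribution $p_{\beta_{\min},\beta_{\max}}(x)$ concentrates on the minimizer of $\max_y V(x, y)$:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
V = rng.standard_normal((40, 40))   # V[i, j] = V(x_i, y_j) on finite grids X, Y

min_max = V.max(axis=1).min()       # min over x of max over y
max_min = V.min(axis=0).max()       # max over y of min over x
assert max_min <= min_max           # weak duality always holds; the gap is generic

def p_min(V, beta_min, beta_max):
    # inner "soft max" over y: (1/beta_max) * ln sum_y exp(beta_max * V(x, y))
    soft_max_y = logsumexp(beta_max * V, axis=1) / beta_max
    # outer Boltzmann weight at inverse temperature beta_min (log-domain for stability)
    logw = -beta_min * soft_max_y
    return np.exp(logw - logsumexp(logw))

p = p_min(V, beta_min=1e3, beta_max=1e5)  # beta_max sent to infinity first (>> beta_min)
i_star = int(np.argmin(V.max(axis=1)))    # exact min-max optimizer on the grid
print(np.argmax(p) == i_star)
```

Reversing the roles (a moderate $\beta_{\max}$) replaces the inner maximum by a soft average over $y$, so the outer weight no longer targets the min-max optimizer; this is the failure mode the ordered limits avoid.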
Note that the order of these limits is crucial because the min and max operations cannot be interchanged in non-convex, non-concave min-max problems (Razaviyayn et al., 2020), i.e., in general $\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} V(x, y; A) \neq \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} V(x, y; A)$. While a similar formulation has been used in previous work (Varga, 1998), that work takes the limits of $\beta_{\min}$ and $\beta_{\max}$ simultaneously with a fixed ratio $\beta_{\min}/\beta_{\max} = O_{\beta_{\min}, \beta_{\max}}(1)$, which does not fully capture the distinct effects of the min and max operations in non-convex settings. Such an approach generally does not yield accurate results when the function $V(x, y; A)$ is non-convex with respect to $x$ and $y$. Statistical-mechanical approaches have demonstrated their effectiveness in analyzing typical-case behavior, that is, the properties of the optimal value averaged over instances that follow a distribution $p(A)$, for optimization and constraint-satisfaction problems (Mézard & Parisi, 1986; Fontanari, 1995). These analyses have succeeded in providing insights into different aspects of combinatorial optimization, unlike worst-case analysis. This work likewise focuses on evaluating the typical case of min-max problems characterized by a random parameter $A$. Our main objective is to calculate the logarithm of $Z(\beta_{\min}, \beta_{\max}, A)$ averaged over the random variable $A$ in the limit $\beta_{\max} \to \infty$ followed by $\beta_{\min} \to \infty$:
$$\Omega = \lim_{\beta_{\min} \to \infty} \lim_{\beta_{\max} \to \infty} f(\beta_{\min}, \beta_{\max}), \quad f(\beta_{\min}, \beta_{\max}) = -\frac{1}{\beta_{\min} d_x}\, \mathbb{E}_A\left[\log Z(\beta_{\min}, \beta_{\max}, A)\right],$$
which is referred to as the free energy density. This free energy is a generating function with variables $x$ and $y$; Appendix C shows how to calculate functions of the optimal value $x$ through this free energy. Setting the ratio of the inverse temperatures to $p = \beta_{\min}/\beta_{\max}$, the free energy density can be rewritten as
$$f(\beta_{\min}, \beta_{\max}) = -\frac{1}{\beta_{\min} d_x}\, \mathbb{E}_A \log \int_{\mathcal{X}} dx\, e^{-\beta_{\min} \frac{1}{\beta_{\max}} \ln \int_{\mathcal{Y}} dy\, e^{\beta_{\max} V(x, y; A)}} = -\frac{1}{\beta_{\min} d_x}\, \mathbb{E}_A \log \int_{\mathcal{X}} dx \left( \int_{\mathcal{Y}} dy\, e^{\beta_{\max} V(x, y; A)} \right)^{-p}.
(2)

Although calculating the expectation of the logarithm is generally difficult, we begin by using the identity
$$\mathbb{E}_A[\log f(A)] = \lim_{\gamma \to +0} \frac{1}{\gamma} \log \mathbb{E}_A\left[(f(A))^\gamma\right] \qquad (3)$$
to expand the logarithm as follows:
$$\Omega = -\lim_{\beta_{\min} \to \infty} \lim_{\beta_{\max} \to \infty} \lim_{\gamma \to +0} \frac{1}{\beta_{\min} d_x \gamma} \log Z_\gamma(\beta_{\min}, \beta_{\max}), \qquad (4)$$
where
$$Z_\gamma(\beta_{\min}, \beta_{\max}) = \mathbb{E}_A\left[\left( \int_{\mathcal{X}} dx \left( \int_{\mathcal{Y}} dy\, e^{\beta_{\max} V(x, y; A)} \right)^{-p} \right)^\gamma\right]. \qquad (5)$$
At this point, note that these transformations are purely algebraic identities that do not assume the parameters $\gamma$ or $p$ to be integers. Following the idea of the replica method (Edwards & Anderson, 1975; Parisi, 1979; 1983; Zdeborová & Krzakala, 2016; Gabrié, 2020), we then proceed under the assumption that $\gamma$ and $p$ are natural numbers. Specifically, rather than addressing Eq. (4) directly for real values of $\gamma$ and $p$, one calculates the average of the $\gamma$-th and $p$-th powers for $\gamma, p \in \mathbb{N}$, then performs an analytic continuation of this expression to $\gamma, p \in \mathbb{R}$, and finally takes the limits $\gamma \to +0$, $\beta_{\min} \to +\infty$, and $\beta_{\max} \to +\infty$. Based on this "replica trick", the calculation reduces to the replicated partition function $\tilde Z_\gamma(\beta_{\min}, \beta_{\max})$ as an approximation of $Z_\gamma(\beta_{\min}, \beta_{\max})$:
$$Z_\gamma(\beta_{\min}, \beta_{\max}) \approx \tilde Z_\gamma(\beta_{\min}, \beta_{\max}) = \mathbb{E}_A\left[ \int_{\mathcal{X}} \prod_a dx^a \int_{\mathcal{Y}} \prod_{a,l} dy^{al}\, e^{\beta_{\max} \sum_{a,l} V(x^a, y^{al}; A)} \right] \qquad (6)$$
up to first order in $\gamma$, in order to take the $\gamma \to +0$ limit on the right-hand side of Eq. (4). This computation is a standard procedure in the statistical physics of interacting systems with random variables and is generally accepted as exact, although a rigorous proof has not yet been provided. Specifically, the mathematical rigor of the method remains limited owing to the unproven uniqueness of the analytic continuation, an issue noted for the moment problem (Tanaka, 2007). As noted in Section 2, the replica method has provided various results in high-dimensional statistics and machine learning as well.
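The identity in Eq. (3) can be sanity-checked numerically before any replica manipulation. The sketch below (an illustration with the arbitrary choice $f(A) = A^2$ for Gaussian $A$, not from the paper) estimates both sides from the same Monte Carlo samples; by Jensen's inequality the $\gamma$-side upper-bounds $\mathbb{E}_A[\log f(A)]$, and the gap closes as $\gamma \to +0$:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal(1_000_000)
f = A**2                                     # a positive function of the random instance A

lhs = np.mean(np.log(f))                     # E_A[log f(A)]
gaps = []
for gamma in (0.1, 0.01, 0.001):
    rhs = np.log(np.mean(f**gamma)) / gamma  # (1/gamma) * log E_A[f(A)^gamma]
    gaps.append(rhs - lhs)
print(gaps)  # non-negative, shrinking toward zero as gamma decreases
```

Because both sides are evaluated on the same empirical distribution, the gap is non-negative for every $\gamma > 0$ and decreases monotonically, so only the limit $\gamma \to +0$ recovers the averaged logarithm exactly.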
Additionally, before taking the limits $\beta_{\min} \to \infty$ and $\beta_{\max} \to \infty$, finite inverse temperatures $\beta_{\min}$ and $\beta_{\max}$ correspond to scenarios where neither the minimum nor the maximum is fully achieved, a common situation in min-max algorithms. This approach provides valuable insights into cases where neither extreme is fully realized or both are only partially optimized. Exploring novel algorithms based on this finite-temperature generalization of min-max problems represents an intriguing direction for future work. Furthermore, in game theory, this formalism can be interpreted as a framework for modeling games under relaxed assumptions of complete rationality, where players $x$ and $y$ are assumed to behave with bounded rationality rather than adhering strictly to classical models of fully rational behavior (Von Neumann & Morgenstern, 1947). In the following sections, we apply this formalism to a fundamental and significant bilinear min-max game, demonstrating that the analytic continuation of $p$ in the replica method is a rigorous operation. We then analyze a minimal model of GANs as a more practical example.

4 Bilinear Min-max Games

This min-max formalism introduces two replica parameters: $\gamma$, associated with the randomness of $A$, and $p$, related to the dual structure of min-max problems. The analytic continuation with respect to the replica parameter $\gamma$ is widely recognized as effective and is frequently employed in the statistical mechanics of optimization. However, the analytic continuation of the replica parameter $p$ has not yet been explored. While establishing its mathematical validity presents challenges, this study eliminates the influence of the replica parameter $\gamma$ associated with the randomness of $A$ and rigorously demonstrates that the analytic continuation with respect to $p$ holds for fundamental bilinear min-max games. Specifically, we show that the free energy density derived using the replica trick in Eq.
(6), as explained in Section 3, is equivalent to the exact expression in Eq. (5) derived without analytic continuation of $\gamma$ and $p$ for bilinear min-max games (Tseng, 1995; Daskalakis et al., 2017). Bilinear games are regarded as a fundamental example for studying new min-max optimization algorithms and techniques (Daskalakis et al., 2017; Gidel et al., 2019; 2018; Liang & Stokes, 2019). Mathematically, bilinear zero-sum games can be formulated as the following min-max problem:
$$\min_{x \in \{0,1\}^{d_x}} \max_{y \in \{0,1\}^{d_y}} V(x, y; W),$$
where $V(\cdot, \cdot)$ is given by
$$V(x, y; W) = \frac{1}{2 d_x} x^\top W_{xx} x + \frac{1}{2 d_y} y^\top W_{yy} y + \frac{1}{\sqrt{d_x d_y}} x^\top W_{xy} y + x^\top b_x + y^\top b_y,$$
where $W = (W_{xx}, W_{yy}, W_{xy}, b_x, b_y)$. For simplicity, we assume $W_{xx} = w_{xx} 1_{d_x \times d_x} \in \mathbb{R}^{d_x \times d_x}$, $W_{xy} = w_{xy} 1_{d_x \times d_y} \in \mathbb{R}^{d_x \times d_y}$, $W_{yy} = w_{yy} 1_{d_y \times d_y} \in \mathbb{R}^{d_y \times d_y}$, $b_x = b_x 1_{d_x} \in \mathbb{R}^{d_x}$, and $b_y = b_y 1_{d_y} \in \mathbb{R}^{d_y}$. The following results can readily be extended to matrices $W_{xx}$, $W_{yy}$, and $W_{xy}$ with a limited number of eigenvalues of $O_{d_x, d_y}(1)$; for a detailed discussion, refer to Appendix B. In this setting, the analytically continued free energy density $\tilde f(\beta_{\min}, \beta_{\max}; W)$ calculated using the replicated partition function $\tilde Z_\gamma(\beta_{\min}, \beta_{\max})$ in Eq. (6) coincides with the exact free energy density $f(\beta_{\min}, \beta_{\max}; W)$ from the partition function $Z_\gamma(\beta_{\min}, \beta_{\max})$ in Eq. (5).

Theorem 4.1 For any $\beta_{\min}, \beta_{\max} \in \mathbb{R}$ and $w_{xx}, w_{xy}, w_{yy}, b_x, b_y \in \mathbb{R}$, the following equality holds:
$$\tilde f(\beta_{\min}, \beta_{\max}; W) = f(\beta_{\min}, \beta_{\max}; W),$$
$$f(\beta_{\min}, \beta_{\max}; W) = \mathrm{extr}_{m_x, m_y} \left[ \frac{w_{xx}}{2} (m_x)^2 + \frac{\kappa w_{yy}}{2} (m_y)^2 + w_{xy} \kappa^{1/2} m_x m_y + b_x m_x + \kappa b_y m_y - \frac{1}{\beta_{\min}} H(m_x) + \frac{\kappa}{\beta_{\max}} H(m_y) \right],$$
where $\kappa = d_y/d_x$, $H(x) = -x \log(x) - (1-x) \log(1-x)$ denotes the binary entropy, and $\mathrm{extr}$ denotes the extremum operation. This theorem establishes the validity of the analytic continuation for the replica parameter $p$ using Eq. (6) for bilinear min-max games. The detailed proof of this theorem is provided in Appendix A.
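Because the couplings are rank-one, the value function depends on the binary vectors only through their sums $\sum_i x_i$ and $\sum_j y_j$ (equivalently, the magnetizations $m_x$, $m_y$ appearing in Theorem 4.1), so the $2^{d_x} \times 2^{d_y}$ game collapses to a $(d_x+1) \times (d_y+1)$ one. The following brute-force check (with small, arbitrary example coefficients) confirms this reduction:

```python
import itertools
import numpy as np

dx = dy = 6
wxx, wyy, wxy, bx, by = -1.0, -0.5, 2.0, 0.3, -0.2  # arbitrary example coefficients

def V(sx, sy):
    # the bilinear value as a function of the sums sx = sum(x), sy = sum(y)
    return (wxx * sx**2 / (2 * dx) + wyy * sy**2 / (2 * dy)
            + wxy * sx * sy / np.sqrt(dx * dy) + bx * sx + by * sy)

# exhaustive min-max over all binary x in {0,1}^dx and y in {0,1}^dy
minmax_full = min(max(V(sum(x), sum(y))
                      for y in itertools.product((0, 1), repeat=dy))
                  for x in itertools.product((0, 1), repeat=dx))

# reduced min-max over the achievable sums only
minmax_sums = min(max(V(sx, sy) for sy in range(dy + 1)) for sx in range(dx + 1))

assert np.isclose(minmax_full, minmax_sums)
```

This collapse onto the magnetizations is what makes the per-dimension free energy in Theorem 4.1 a finite-dimensional extremization, even in the high-dimensional limit.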
5 Generative Adversarial Networks

Generative adversarial networks (GANs) (Goodfellow et al., 2020) aim to model high-dimensional probability distributions based on training datasets. Despite significant progress in practical applications (Arjovsky et al., 2017; Lucic et al., 2018; Ledig et al., 2017; Isola et al., 2017; Reed et al., 2016), several issues are yet to be resolved, including how the amount of training data influences generalization performance and how sensitive GANs are to specific hyperparameters. This section analyzes the relationship between the amount of training data and the generalization error. Additionally, we conduct a sensitivity analysis on the ratio of fake data generated by the generator to the amount of training data, which is critical for the training of GANs. Our analysis employs a minimal setup that captures the intrinsic structure and learning dynamics of GANs (Wang et al., 2019). We consider the high-dimensional limit, where the numbers of real and fake samples, $n$ and $\tilde n$, respectively, and the dimension $d$ are large while remaining comparable. Specifically, we analyze the regime in which $n, \tilde n, d \to \infty$ while maintaining comparable ratios, i.e., $\alpha = n/d = \Theta(d^0)$ and $\tilde\alpha = \tilde n/d = \Theta(d^0)$, commonly referred to as sample complexities.

5.1 Settings

Generative model for the dataset We consider a training dataset $\mathcal{D} = \{x^\mu\}_{\mu=1}^{n}$, where each $x^\mu \in \mathbb{R}^d$ is drawn from the following distribution:
$$x^\mu = \frac{1}{\sqrt{d}}\, w^\star c^\mu + \sqrt{\eta}\, n^\mu, \qquad (7)$$
where $w^\star \in \mathbb{R}^d$ is a deterministic feature vector, $c^\mu \in \mathbb{R}$ is a random scalar drawn from the standard normal distribution $p(c) = \mathcal{N}(c; 0, 1)$, $n^\mu$ is a background noise vector whose components are i.i.d. from the standard normal distribution $\mathcal{N}(n; 0_d, I_d)$, and $\eta \in \mathbb{R}$ is a scalar parameter that controls the strength of the noise. We also assume that $\|w^\star\|_2^2/d = 1$.
This generative model, known as the spiked covariance model (Wishart, 1928; Potters & Bouchaud, 2020), has been studied in statistics to analyze the performance of unsupervised learning methods such as principal component analysis (Ipsen & Hansen, 2019; Biehl & Mietzner, 1993; Hoyle & Rattray, 2004), sparse PCA (Lesieur et al., 2015), deterministic autoencoders (Refinetti & Goldt, 2022), and variational autoencoders (Ichikawa & Hukushima, 2024; 2023).

GAN model Following Wang et al. (2019), we assume that the generator has the same linear structure as the dataset generative model described in Eq. (7):
$$g(z; w) = \frac{1}{\sqrt{d}}\, w z + \sqrt{\tilde\eta}\, \tilde n, \qquad (8)$$
where $w \in \mathbb{R}^d$ is a learnable parameter, $z \in \mathbb{R}$ is a latent variable drawn from the standard normal distribution $p(z) = \mathcal{N}(z; 0, 1)$, $\tilde n$ is a noise vector whose components are i.i.d. from the standard normal distribution $\mathcal{N}(\tilde n; 0_d, I_d)$, and $\tilde\eta \in \mathbb{R}$ is a scalar parameter that controls the strength of the noise. We also define the linear discriminator as
$$\psi(x; v) = f\left( \frac{1}{\sqrt{d}}\, v^\top x \right), \qquad (9)$$
where $x$ is an input vector, which can be either the real data $x^\mu$ from Eq. (7) or the fake data $g(\tilde z^\mu; w)$ from Eq. (8). The vector $v \in \mathbb{R}^d$ is a learnable parameter, and $f: \mathbb{R} \to \mathbb{R}$ can be any function.

Training algorithm The GAN is trained by solving the following min-max optimization problem:
$$\min_{w \in \mathbb{R}^d} \max_{v \in \mathbb{R}^d} V(w, v; \mathcal{D}, \tilde{\mathcal{D}}), \qquad (10)$$
$$V(w, v; \mathcal{D}, \tilde{\mathcal{D}}) = \sum_{\mu=1}^{n} \phi\left( \psi(x^\mu; v) \right) - \sum_{\mu=1}^{\tilde n} \tilde\phi\left( \psi(g(\tilde z^\mu; w); v) \right) - \frac{\lambda}{2} \|v\|^2 + \frac{\tilde\lambda}{2} \|w\|^2, \qquad (11)$$
where $\tilde{\mathcal{D}} = \{\tilde z^\mu\}_{\mu=1}^{\tilde n}$ are the latent values of the fake data. The last two terms are regularization terms, where $\lambda$ and $\tilde\lambda$ control the regularization strength. The value function defined in Eq. (11) is a general form that includes various types of GANs. Specifically, when $\phi = \tilde\phi$ with $\|\phi\|_L \le 1$, it represents a Wasserstein GAN (WGAN) (Arjovsky et al., 2017), and when $\phi(x) = \log \sigma(x)$ and $\tilde\phi(x) = -\log(1 - \sigma(x))$ with $\sigma$ the sigmoid function, it corresponds to the vanilla GAN, which minimizes the JS-divergence (Goodfellow et al., 2014a).
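The roles of the generator, the discriminator, and $\phi$ can be illustrated numerically. The sketch below (an illustration with the quadratic choice $\phi(x) = \tilde\phi(x) = x^2/2$ used later in Section 5.3; dimensions, sample sizes, and the normalization $\|w^\star\|^2 = d$ are assumptions of this toy check) samples real data from Eq. (7) and fake data from Eq. (8), and verifies that a perfectly matched generator ($w = w^\star$, $\tilde\eta = \eta$) makes the real and fake terms of the value function agree on average, leaving the discriminator no signal to exploit:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, eta = 30, 100_000, 1.0
w_star = rng.standard_normal(d)
w_star *= np.sqrt(d) / np.linalg.norm(w_star)   # normalize so that ||w*||^2 = d

def real_samples(n):
    # Eq. (7): x = (c/sqrt(d)) w* + sqrt(eta) * noise
    c = rng.standard_normal((n, 1))
    return c * w_star / np.sqrt(d) + np.sqrt(eta) * rng.standard_normal((n, d))

def fake_samples(w, eta_t, n):
    # Eq. (8): g(z; w) = (z/sqrt(d)) w + sqrt(eta_t) * noise
    z = rng.standard_normal((n, 1))
    return z * w / np.sqrt(d) + np.sqrt(eta_t) * rng.standard_normal((n, d))

phi = lambda t: t**2 / 2             # quadratic phi, as in Section 5.3
v = rng.standard_normal(d)           # an arbitrary discriminator direction

real_term = phi(real_samples(n) @ v / np.sqrt(d)).mean()
fake_term = phi(fake_samples(w_star, eta, n) @ v / np.sqrt(d)).mean()
print(abs(real_term - fake_term))    # small: matched generator, no discriminator signal
```

With a mismatched $w$ the two terms separate, which is exactly the gradient signal the min-max training in Eq. (10) uses to drive $w$ toward $w^\star$.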
Since we assume a linear discriminator, $V(w, v; \mathcal{D}, \tilde{\mathcal{D}})$ can be expressed as a function of the linear combinations $v^\top x^\mu/\sqrt{d}$ and $v^\top g(\tilde z^\mu; w)/\sqrt{d}$ as follows:
$$V(w, v; \mathcal{D}, \tilde{\mathcal{D}}) = \sum_{\mu=1}^{n} \phi\left( \frac{v^\top x^\mu}{\sqrt{d}} \right) - \sum_{\mu=1}^{\tilde n} \tilde\phi\left( \frac{v^\top g(\tilde z^\mu; w)}{\sqrt{d}} \right) - \frac{\lambda}{2} \|v\|^2 + \frac{\tilde\lambda}{2} \|w\|^2, \qquad (12)$$
where, for clarity in the subsequent analysis, we have redefined $\phi$ and $\tilde\phi$ as functions of the linear combinations $v^\top x^\mu/\sqrt{d}$ and $v^\top g(\tilde z^\mu; w)/\sqrt{d}$.

Generalization error In the ideal case where the generator perfectly learns the underlying true probability distribution, we have $w = w^\star$. Therefore, we define the generalization error $\varepsilon_g$ as
$$\varepsilon_g(\tilde w, w^\star) = \frac{1}{d}\, \mathbb{E}_{\mathcal{D}}\left[ \|\tilde w - w^\star\|^2 \right], \qquad (13)$$
where $\tilde w$ denotes the min-max optimal value in Eq. (10). The generalization error $\varepsilon_g$ quantifies the accuracy of signal recovery from the training data.

5.2 Replica Calculation

We apply the replica formalism sketched in Section 3 to derive a set of deterministic equations characterizing the typical behavior of GANs. In this problem setting, the replicated partition function $\tilde Z_\gamma$ in Eq. (6) can be expressed as
$$\tilde Z_\gamma(\beta_{\min}, \beta_{\max}) = \int \prod_{a} dw^a \prod_{a,l} dv^{al}\; \mathbb{E}_{\mathcal{D}, \tilde{\mathcal{D}}}\left[ e^{\beta_{\max} \sum_{a,l} V(w^a, v^{al};\, \mathcal{D}, \tilde{\mathcal{D}})} \right],$$
with $V$ given by Eq. (12). To take the average over $n$ and $\tilde n$, we note that since $n$ and $\tilde n$ follow the multivariate normal distribution $\mathcal{N}(0_d, I_d)$, the quantities $u = ((v^{al})^\top n/\sqrt{d})_{a,l}$ and $\tilde u = ((v^{al})^\top \tilde n/\sqrt{d})_{a,l}$ also follow a multivariate Gaussian distribution, $p(u) = p(\tilde u) = \mathcal{N}(0_{\gamma p}, Q)$, where $Q = (Q^{ab}_{ls}) \in \mathbb{R}^{\gamma p \times \gamma p}$ with $Q^{ab}_{ls} = \frac{1}{d} (v^{al})^\top v^{bs}$. To conduct further computations, we introduce auxiliary order parameters through the following identities:
$$1 = \prod_{a,b,l,s} d \int \delta\left( d Q^{ab}_{ls} - (v^{al})^\top v^{bs} \right) dQ^{ab}_{ls}, \quad 1 = \prod_{a,l} d \int \delta\left( d m^a_l - (v^{al})^\top w^\star \right) dm^a_l, \quad 1 = \prod_{a,l} d \int \delta\left( d b^a_l - (v^{al})^\top w^a \right) db^a_l.$$
The replicated partition function can then be expressed as Zγ(βmin, βmax) = Z d Qdmdbeβmind(S(Q,m,b)+T (Q,m,b)), where the entropic term S(Q, m, b) and energetic term T (Q, m, b) are defined as follows: S(Q, m, b) = 1 dβmin ln Z Y al dwadval Y abls d Z δ(d Qab ls (val) vbs) al d Z δ(dma l (val) w )d Z δ(dba l (val) wa)e al( λ wa 2 λ val 2), T (Q, m, b) = α βmin ln Z Dc Z dup(u)eβmax P al ϕ 1 d (val) w c+ η + α βmin ln Z Dz Z d up( u)e βmax P al ϕ 1 d (val) waz+ η d (val) n . Using the Fourier representation of the delta function, S(Q, m, b) is further expressed as S(Q, m, b) = 1 dβmin log Z d Qd md bed( 1 2 tr QQ m m b b) al dwadvale 1 abls Qab ls valvbs+w P al ma l val+P al ba l waval+ βmax al( λ(wa)2 λ(val)2) !d Replica symmetric ansatz Here, we assume the following symmetric structure: a, b [γ], l, s [p], Qab ls = q + βmin δab + χ βmax δlsδab, (15) a, b [γ], l, s [p], Qab ls = βmaxˆqδlsδab β2 max βmin ˆ δab β2 max ˆχ, (16) a [γ], l [p], ma l = m, ma l = βmax ˆm, (17) a [γ], l [p], ba l = b, ˆba l = βmaxˆb. (18) Published in Transactions on Machine Learning Research (1/2025) This replica symmetric (RS) structure restricts the integration of the replicated weight parameters {wa}, {val} across the entire R(d γp) R(d γp) to a subspace that satisfies the constraints in Eq. (15) (18). This structure, along with scaling by the maximum and minimum beta values, is similar to the standard one-step replica symmetry breaking (1RSB) (Mézard et al., 1987; Takahashi & Kabashima, 2022). We now turn to the entropic term S(Q, m, b). 
The terms that exclude the integrals with respect to {val} and {wa} can be expressed as 1 2tr QQ m m b b ˆq q + βmin + χ βmax ˆχ + ˆ βmin + ˆχ + ˆ q + ˆ + ˆmm + ˆbb The term that includes the integrals with respect to {val} and {wa} can be expressed as al dwa Dζadvale 1 2 βmax(ˆq+λ)P al(val)2+βmax P ˆχz+w ˆm+waˆb val λβmin 2 βmax(ˆq+λ)(va)2+βmax q ˆχz+w ˆm+waˆb va a dwa Dζae λβmin a(wa)2 βmin 2(ˆq+λ) P ˆχz+w ˆm+waˆb 2 Z dwdζe βmin 1 2 ζ2 λ 2 w2 ( ˆ χz+w ˆ m+wˆb)2 This can be derived using the identity, for any a R+ and any x R, e a 2 x2 = R Dze azx. Summarizing these results, the entropic term can be written as S(Q, m, b, Q, m, b) = γ ˆq q + βmin + χ βmax ˆχ + ˆ βmin + ˆχ + ˆ q + ˆ + ˆmm + ˆbb Z Dz log Z dwdζe βmin 1 2 ζ2 λ 2 w2 ( ˆ χz+w ˆ m+wˆb)2 By taking the limit as βmax followed by βmin , we obtain S(Q, m, b, Q, m, b) = γ q(ˆq + ˆ ) (χ )ˆχ 2m ˆm 2bˆb + λ( ˆm2 + ˆχ) ˆb2 + (ˆq + ˆ + λ) λ We next turn to the energetic term T (Q, m, b). Under the RS ansatz, u follows ual = r χ βmax xal + βmin ya + qξ, Published in Transactions on Machine Learning Research (1/2025) where xal, xal, yal, yal, ξ, and ξ follow the standard normal distribution N( ξ; 0, 1). Then, the energetic term T (Q, m, b) can be expand as T (Q, m, b) = α βmin ln Z Dc Z dup(u)eβmax P al ϕ 1 d (val) w c+ η + α βmin ln Z Dz Z d up( u)e βmax P al ϕ 1 d (val) waz+ η The term (a) can be simplified as (a) = α βmin ln Ec,ξ al Dya Dxaleβmax P al ϕ mc+ η χ βmax xal+p βmin ya+ qξ # = α βmin ln Ec,ξ Z Dy Z Dxeβmaxϕ mc+ η χ βmax x+p βmin y+ qξ p γ , = α βmin γEc,ξ log Z Dy Z Dxeβmaxϕ mc+ η χ βmax x+p βmin y+ qξ p + Oγ(γ2), = α βmin γEc,ξ log Z dye βmin 2 y2 Z dxe βmax 2 x2+βmaxϕ(mc+ η( χx+ y+ qξ)) p + Oγ(γ2). Taking the limit as βmax followed by βmin , we obtain: (a) = αγEc,ξ 2x2 + ϕ mc + η χx + Similarly, the term (b) is also expressed as (b) = α βminγ Ez, ξ ln Z d ye βmin 2 y2 Z d xe βmax 2 x2 βmax ϕ(bz+ η( χ x+ y+ q ξ)) p + Oγ(γ2). 
Taking the same limits, we find
$$\text{(b)} = \tilde\alpha \gamma\, \mathbb{E}_{z, \tilde\xi} \min_{\tilde y} \max_{\tilde x} \left[ \frac{\tilde y^2}{2} - \frac{\tilde x^2}{2} - \tilde\phi\left( b z + \sqrt{\tilde\eta}\left( \sqrt{\chi}\, \tilde x + \sqrt{\delta}\, \tilde y + \sqrt{q}\, \tilde\xi \right) \right) \right].$$
Putting the entropic term and the energetic term together, the free energy density is given by
$$f = \mathrm{extr}_{q, \delta, \chi, m, b,\; \hat q, \hat\chi, \hat m, \hat b} \left[ q \hat q - (\chi - \delta) \hat\chi - 2(m \hat m + b \hat b) + \frac{\lambda(\hat m^2 + \hat\chi)}{\hat b^2 + (\hat q + \lambda)\lambda} - 2\left( \alpha\, \Phi(q, \delta, \chi, m, b) + \tilde\alpha\, \tilde\Phi(q, \delta, \chi, m, b) \right) \right], \qquad (19)$$
$$\Phi(q, \delta, \chi, m, b) = \mathbb{E}_{c, \xi} \min_{y} \max_{x} \left[ \frac{y^2}{2} - \frac{x^2}{2} + \phi\left( m c + \sqrt{\eta}\left( \sqrt{\chi}\, x + \sqrt{\delta}\, y + \sqrt{q}\, \xi \right) \right) \right], \qquad (20)$$
$$\tilde\Phi(q, \delta, \chi, m, b) = \mathbb{E}_{z, \tilde\xi} \min_{\tilde y} \max_{\tilde x} \left[ \frac{\tilde y^2}{2} - \frac{\tilde x^2}{2} - \tilde\phi\left( b z + \sqrt{\tilde\eta}\left( \sqrt{\chi}\, \tilde x + \sqrt{\delta}\, \tilde y + \sqrt{q}\, \tilde\xi \right) \right) \right]. \qquad (21)$$
Note that the min and max operations are involved in the two-level optimization described in Eqs. (20) and (21).

5.3 Results: Application to Simple GANs

In this subsection, following Wang et al. (2019), we apply the formulation derived above to a simple WGAN to demonstrate its generalization properties and conduct a sensitivity analysis of the ratio $r$ of fake to real data.

Figure 1: (Left) Generalization error as a function of sample complexity $\alpha$ for different values of the ratio $r$. (Right) Asymptotic generalization error $\lim_{\alpha \to \infty} \varepsilon_g(\alpha)$ as a function of the ratio $r$.

Self-consistent Equations We consider the case where the functions $\phi(x)$ and $\tilde\phi(x)$ are both quadratic, defined as $\phi(x) = \tilde\phi(x) = x^2/2$. This setting allows for an explicit calculation of the free energy density, which is given by
$$f = \mathrm{extr}_{q, \chi, m, b,\; \hat q, \hat\chi, \hat m, \hat b} \left[ q \hat q - \chi \hat\chi - 2 m \hat m - 2 b \hat b + \frac{\lambda(\hat m^2 + \hat\chi)}{\hat b^2 + (\hat q + \lambda)\lambda} - \frac{\alpha(\eta q + m^2)}{\eta\chi - 1} - \frac{\tilde\alpha(\tilde\eta q + b^2)}{\tilde\eta\chi + 1} \right]. \qquad (22)$$
To find the extremum in Eq. (22), we require that the gradient with respect to each order parameter equals zero. This results in the following set of self-consistent equations:
$$q = \frac{\lambda^2(\hat m^2 + \hat\chi)}{(\hat b^2 + \lambda(\hat q + \lambda))^2}, \quad \chi = \frac{\lambda}{\hat b^2 + \lambda(\hat q + \lambda)}, \quad m = \frac{\hat m \lambda}{\hat b^2 + \lambda(\hat q + \lambda)}, \quad b = \frac{\hat b \lambda(\hat m^2 + \hat\chi)}{(\hat b^2 + \lambda(\hat q + \lambda))^2},$$
$$\hat q = \frac{\alpha\eta}{\eta\chi - 1} + \frac{\tilde\alpha\tilde\eta}{\tilde\eta\chi + 1}, \quad \hat\chi = \frac{\alpha\eta(q\eta + m^2\rho)}{(\eta\chi - 1)^2} + \frac{\tilde\alpha\tilde\eta(q\tilde\eta + b^2\rho)}{(\tilde\eta\chi + 1)^2}, \quad \hat m = \frac{\alpha m}{\eta\chi - 1}, \quad \hat b = \frac{\tilde\alpha b}{\tilde\eta\chi + 1}.$$

Learning Curve For simplicity, we set $\tilde\alpha = r\alpha$ and $\lambda = \tilde\lambda = \eta = \tilde\eta = 1$.
Our analysis focuses on how the generalization error depends on $\alpha$ while varying the ratio $r$, as generating fake data from the generator is generally much easier than collecting real data. Fig. 1 (Left) shows the dependence of the generalization error on the sample complexity $\alpha$ for various values of the ratio $r$. The results demonstrate a sharp decline in the generalization error as the ratio $r$ increases. However, when $r$ becomes large, the generalization error increases in the region where $\alpha$ is large, eventually leading to a phase where no learning occurs and the generalization error equals 1. This implies that as $\alpha$ increases, the learning becomes dominated by fake data alone. In contrast, for smaller $r$, real data consistently dominates the objective function $V(w, v; \mathcal{D}, \tilde{\mathcal{D}})$, resulting in a steady decrease in the generalization error. However, the reduced influence of the fake-data component in the objective function, which drives the learning of the generator, means that a significantly larger amount of real data is required for effective generator training.

Asymptotic Generalization Error We next analyze the asymptotic behavior of the generalization error when the sample complexity $\alpha$ becomes sufficiently large. The asymptotic behavior of the generalization error as a function of $\alpha$ is given by
$$\varepsilon_g(\alpha) = \begin{cases} (r-1)r\,\alpha^{-1/2} + O_\alpha(\alpha^{-1}), & r \leq 1, \\ 1 + O_\alpha(\alpha^{-1}), & r > 1. \end{cases}$$
The results for $\alpha \to \infty$ are shown in Fig. 1 (Right). The optimal ratio is $r = 1/2$, indicating that using an amount of fake data approximately equal to half of the real data is effective when the dataset approaches infinity. At $r = 1$, a phase transition occurs, suggesting that the model changes from an effective-learning phase to one where fake data becomes dominant. Beyond this point, for $r \geq 1$, the model fails to learn any meaningful signal $w^\star$, and the generalization error is 1.
Furthermore, when $r = 1/2$, the generalization error scales as $\varepsilon_g \sim \alpha^{-1}$, which represents the optimal asymptotic behavior for a model-matched scenario. These results demonstrate the critical role of the ratio $r$ in determining learning performance. Therefore, tuning the ratio $r$ according to the available real data is crucial for achieving optimal performance. In practice, it is known that in training GANs, the stability of learning can deteriorate depending on the ratio $r$ of fake to real data. This theoretical analysis provides insights into the importance of the ratio $r$ and is expected to contribute to improving learning algorithms.

6 Conclusion

This study introduces a statistical mechanical formalism to analyze high-dimensional min-max optimization problems, focusing on the critical order of min and max operations in non-convex scenarios. Our goal was to perform a sensitivity analysis of equilibrium values, providing new insights into their properties and generalization performance. We applied this approach to a simple min-max game, evaluated the generalization performance of GANs, and derived the optimal ratio of fake to real data for effective learning. This successful application not only validates the approach but also opens the way for extending this formalism to more complex min-max problems and broader applications, suggesting a promising direction for significant advancements in machine learning and optimization.

Acknowledgments

We thank T. Takahashi, K. Okajima, Y. Nagano, and K. Nakaishi for useful discussions and suggestions. This work was supported by JST Grant Number JPMJPF2221 and JSPS Grant-in-Aid for Scientific Research Number 23H01095. Additionally, YI was supported by the WINGS-FMSP program at the University of Tokyo.

References

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223. PMLR, 2017.
Benjamin Aubin, Antoine Maillard, Florent Krzakala, Nicolas Macris, and Lenka Zdeborová. The committee machine: Computational to statistical gaps in learning a two-layers neural network. Advances in Neural Information Processing Systems, 31, 2018.

Benjamin Aubin, Florent Krzakala, Yue Lu, and Lenka Zdeborová. Generalization error in high-dimensional perceptrons: Approaching Bayes error with convex optimization. Advances in Neural Information Processing Systems, 33:12199–12210, 2020.

Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences, 116(12):5451–5460, 2019.

Michael Biehl and Andreas Mietzner. Statistical mechanics of unsupervised learning. Europhysics Letters, 24(5):421, 1993.

Blake Bordelon, Abdulkadir Canatar, and Cengiz Pehlevan. Spectrum dependent learning curves in kernel regression and wide neural networks. In International Conference on Machine Learning, pp. 1024–1034. PMLR, 2020.

Ronald E Bruck. On the weak convergence of an ergodic iteration for the solution of variational inequalities for monotone operators in Hilbert space. J. Math. Anal. Appl., 61(1):159–164, 1977.

Hugo Cui and Lenka Zdeborová. High-dimensional asymptotics of denoising autoencoders. arXiv preprint arXiv:2305.11041, 2023.

Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. arXiv preprint arXiv:1711.00141, 2017.

Aurélien Decelle, Giancarlo Fissore, and Cyril Furtlehner. Thermodynamics of restricted Boltzmann machines and related learning dynamics. Journal of Statistical Physics, 172:1576–1608, 2018.

Vladimir Fedorovich Dem'yanov and Aleksandr Borisovich Pevnyi. Numerical methods for finding saddle points. USSR Computational Mathematics and Mathematical Physics, 12(5):11–52, 1972.
Rainer Dietrich, Manfred Opper, and Haim Sompolinsky. Statistical mechanics of support vector networks. Physical Review Letters, 82(14):2975, 1999.
Samuel Frederick Edwards and Phil W Anderson. Theory of spin glasses. Journal of Physics F: Metal Physics, 5(5):965, 1975.
José Fernando Fontanari. A statistical analysis of the knapsack problem. Journal of Physics A: Mathematical and General, 28(17):4751, 1995.
Yoav Freund and Robert E Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.
Marylou Gabrié. Mean-field inference methods for neural networks. Journal of Physics A: Mathematical and Theoretical, 53(22):223002, 2020.
Elizabeth Gardner and Bernard Derrida. Optimal storage properties of neural network models. Journal of Physics A: Mathematical and General, 21(1):271, 1988.
Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Generalisation error in learning with random features and the hidden manifold model. In International Conference on Machine Learning, pp. 3452–3462. PMLR, 2020.
Gauthier Gidel, Hugo Berard, Gaëtan Vignoud, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial networks. arXiv preprint arXiv:1802.10551, 2018.
Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Rémi Le Priol, Gabriel Huang, Simon Lacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1802–1811. PMLR, 2019.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014a.
Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014b.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
David C Hoyle and Magnus Rattray. Principal-component-analysis eigenvalue spectra from data with symmetry-breaking structure. Physical Review E, 69(2):026124, 2004.
David C Hoyle and Magnus Rattray. Statistical mechanics of learning multiple orthogonal signals: asymptotic theory and fluctuation effects. Physical Review E, 75(1):016101, 2007.
Yuma Ichikawa and Koji Hukushima. Statistical-mechanical study of deep Boltzmann machine given weight parameters after training by singular value decomposition. Journal of the Physical Society of Japan, 91(11):114001, 2022.
Yuma Ichikawa and Koji Hukushima. Dataset size dependence of rate-distortion curve and threshold of posterior collapse in linear VAE. arXiv preprint arXiv:2309.07663, 2023.
Yuma Ichikawa and Koji Hukushima. Learning dynamics in linear VAE: Posterior collapse threshold, superfluous latent space pitfalls, and speedup with KL annealing. In Sanjoy Dasgupta, Stephan Mandt, and Yingzhen Li (eds.), Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Research, pp. 1936–1944. PMLR, 02–04 May 2024.
Niels Ipsen and Lars Kai Hansen. Phase transition in PCA with missing data: Reduced signal-to-noise ratio, not sample size! In International Conference on Machine Learning, pp. 2951–2960. PMLR, 2019.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134, 2017.
Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681–4690, 2017.
Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová. Phase transitions in sparse PCA. In 2015 IEEE International Symposium on Information Theory (ISIT), pp. 1635–1639. IEEE, 2015.
Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 907–915. PMLR, 2019.
Pierre-Louis Lions. Une méthode itérative de résolution d'une inéquation variationnelle. Israel Journal of Mathematics, 31:204–208, 1978.
Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. Advances in Neural Information Processing Systems, 31, 2018.
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
D Maistroskii. Gradient methods for finding saddle points. Matekon, 13(3):22, 1977.
Marc Mezard and Andrea Montanari. Information, Physics, and Computation. Oxford University Press, 2009.
Marc Mézard and Giorgio Parisi. A replica analysis of the travelling salesman problem. Journal de Physique, 47(8):1285–1296, 1986.
Marc Mézard, Giorgio Parisi, and Miguel Angel Virasoro. Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications, volume 9. World Scientific Publishing Company, 1987.
George Nemhauser and Laurence Wolsey. Wiley-Interscience Series in Discrete Mathematics and Optimization. Integer and Combinatorial Optimization, pp. 765–766, 1988.
Manfred Opper and David Haussler. Generalization performance of Bayes optimal classification algorithm for learning a perceptron. Physical Review Letters, 66(20):2677, 1991.
Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387. IEEE, 2016.
Giorgio Parisi. Toward a mean field theory for spin glasses. Physics Letters A, 73(3):203–205, 1979.
Giorgio Parisi. Order parameter for spin-glasses. Physical Review Letters, 50(24):1946, 1983.
Marc Potters and Jean-Philippe Bouchaud. A First Course in Random Matrix Theory: For Physicists, Engineers and Data Scientists. Cambridge University Press, 2020.
Meisam Razaviyayn, Tianjian Huang, Songtao Lu, Maher Nouiehed, Maziar Sanjabi, and Mingyi Hong. Nonconvex min-max optimization: Applications, challenges, and recent theoretical advances. IEEE Signal Processing Magazine, 37(5):55–66, 2020.
Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning, pp. 1060–1069. PMLR, 2016.
Maria Refinetti and Sebastian Goldt. The dynamics of representation learning in shallow, non-linear autoencoders. In International Conference on Machine Learning, pp. 18499–18519. PMLR, 2022.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
Takashi Takahashi and Yoshiyuki Kabashima. Macroscopic analysis of vector approximate message passing in a model-mismatched setting. IEEE Transactions on Information Theory, 68(8):5579–5600, 2022.
Toshiyuki Tanaka. Moment problem in replica method. Interdisciplinary Information Sciences, 13(1):17–23, 2007.
Kasimir Tanner, Matteo Vilucchio, Bruno Loureiro, and Florent Krzakala. A high dimensional model for adversarial training: Geometry and trade-offs. arXiv preprint arXiv:2402.05674, 2024.
Paul Tseng. On linear convergence of iterative methods for the variational inequality problem. Journal of Computational and Applied Mathematics, 60(1-2):237–252, 1995.
Peter Varga. Minimax games, spin glasses, and the polynomial-time hierarchy of complexity classes. Physical Review E, 57(6):6487, 1998.
John Von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. 2nd rev. ed., 1947.
Abraham Wald. Statistical decision functions which minimize the maximum risk. Annals of Mathematics, 46(2):265–280, 1945.
Chuang Wang, Hong Hu, and Yue Lu. A solvable high-dimensional model of GAN. Advances in Neural Information Processing Systems, 32, 2019.
John Wishart. The generalised product moment distribution in samples from a normal multivariate population. Biometrika, pp. 32–52, 1928.
Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: Thresholds and algorithms. Advances in Physics, 65(5):453–552, 2016.

A Derivation of Theorem 4.1

In this section, we provide the derivation of Theorem 4.1. The derivation begins with the calculation of the free energy density without the analytic continuation of $p = \beta_{\min}/\beta_{\max}$ to $p \in \mathbb{N}$. The free energy density in Eq. (2) is connected to the effective Hamiltonian $H_{\mathrm{eff}}(x; W)$, which is defined through the relationship
$$
f(\beta_{\min}, \beta_{\max}; W) = -\frac{1}{\beta_{\min} d_x}\, \mathbb{E}_W \log \sum_{x} \exp\left(-\beta_{\min} H_{\mathrm{eff}}(x; W)\right).
$$
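The role of the two inverse temperatures can be illustrated on a finite problem. The following sketch (a minimal toy example, not from the paper; the 5×5 payoff matrix, temperatures, and the omitted dimension normalization are illustrative choices) evaluates the two-temperature free energy for a small discrete min-max problem and checks that it approaches the min-max value as the temperatures grow:

```python
import math
import random

def logsumexp(vals):
    # numerically stable log(sum(exp(v))) via max-subtraction
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

random.seed(0)
n = 5
V = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]

def free_energy(beta_min, beta_max):
    # H_eff(x) = (1/beta_max) log sum_y exp(beta_max V(x, y))
    H_eff = [logsumexp([beta_max * V[i][j] for j in range(n)]) / beta_max
             for i in range(n)]
    # f = -(1/beta_min) log sum_x exp(-beta_min H_eff(x))
    return -logsumexp([-beta_min * h for h in H_eff]) / beta_min

minmax = min(max(row) for row in V)   # exact min-max value
f = free_energy(500.0, 500.0)
print(abs(f - minmax))  # small: the free energy approaches the min-max value
```

The deviation is bounded by $\log n\, (1/\beta_{\min} + 1/\beta_{\max})$, so the limit $\beta_{\min}, \beta_{\max} \to +\infty$ recovers $\min_x \max_y V$ regardless of the instance.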
The effective Hamiltonian is given by
$$
H_{\mathrm{eff}}(x; W) = \frac{1}{\beta_{\max}} \log \sum_{y} e^{\beta_{\max} V(x, y; W)}
= \frac{1}{\beta_{\max}} \log \left[ e^{\beta_{\max} d_x \left( \frac{w_{xx}}{2} \left( \frac{x^\top \mathbf{1}_{d_x}}{d_x} \right)^2 + b_x \frac{x^\top \mathbf{1}_{d_x}}{d_x} \right)} \sum_{y} e^{\beta_{\max} d_y \left( \frac{w_{yy}}{2} \left( \frac{y^\top \mathbf{1}_{d_y}}{d_y} \right)^2 + w_{xy} \sqrt{\frac{d_x}{d_y}} \frac{x^\top \mathbf{1}_{d_x}}{d_x} \frac{y^\top \mathbf{1}_{d_y}}{d_y} + b_y \frac{y^\top \mathbf{1}_{d_y}}{d_y} \right)} \right]
$$
$$
= d_x \left( \frac{w_{xx}}{2} \left( \frac{x^\top \mathbf{1}_{d_x}}{d_x} \right)^2 + b_x \frac{x^\top \mathbf{1}_{d_x}}{d_x} \right) + \frac{1}{\beta_{\max}} \log \int dm^y d\hat{m}^y\, e^{\beta_{\max} d_y \left( \frac{w_{yy}}{2}(m^y)^2 + w_{xy} \sqrt{\frac{d_x}{d_y}} \frac{x^\top \mathbf{1}_{d_x}}{d_x} m^y + b_y m^y - \frac{1}{\beta_{\max}} m^y \hat{m}^y + \frac{1}{\beta_{\max}} \mathrm{Softplus}(\hat{m}^y) \right) + o(d_y)},
$$
where $\mathrm{Softplus}(x) = \log(1 + e^x)$. To evaluate the integral with respect to $\hat{m}^y$ and $m^y$, we take the limit $d_y \to \infty$ and apply the saddle-point approximation. The effective Hamiltonian can then be expressed as
$$
H_{\mathrm{eff}}(x; W) = d_x \left[ \frac{w_{xx}}{2} \left( \frac{x^\top \mathbf{1}_{d_x}}{d_x} \right)^2 + b_x \frac{x^\top \mathbf{1}_{d_x}}{d_x} + \kappa \operatorname*{extr}_{m^y, \hat{m}^y} \left\{ \frac{w_{yy}}{2}(m^y)^2 + w_{xy} \kappa^{-1/2} \frac{x^\top \mathbf{1}_{d_x}}{d_x} m^y + b_y m^y - \frac{1}{\beta_{\max}} m^y \hat{m}^y + \frac{1}{\beta_{\max}} \mathrm{Softplus}(\hat{m}^y) \right\} \right],
$$
where $\kappa = d_y/d_x$. By summing over $x$, the free energy density can be calculated as
$$
f(\beta_{\min}, \beta_{\max}; W) = -\frac{1}{\beta_{\min} d_x} \log \sum_x e^{-\beta_{\min} H_{\mathrm{eff}}(x; W)}
= -\frac{1}{\beta_{\min} d_x} \log d_x \int dm^x d\hat{m}^x\, e^{-\beta_{\min} d_x \left( \frac{w_{xx}}{2}(m^x)^2 + b_x m^x + \frac{1}{\beta_{\min}} m^x \hat{m}^x - \frac{1}{\beta_{\min}} \mathrm{Softplus}(\hat{m}^x) \right)}\, e^{-\beta_{\min} d_x \kappa \operatorname*{extr}_{m^y, \hat{m}^y} \left[ \frac{w_{yy}}{2}(m^y)^2 + w_{xy} \kappa^{-1/2} m^x m^y + b_y m^y - \frac{1}{\beta_{\max}} m^y \hat{m}^y + \frac{1}{\beta_{\max}} \mathrm{Softplus}(\hat{m}^y) \right]}
$$
$$
= \operatorname*{extr}_{m^x, \hat{m}^x, m^y, \hat{m}^y} \left[ \frac{w_{xx}}{2}(m^x)^2 + b_x m^x + \frac{1}{\beta_{\min}} m^x \hat{m}^x - \frac{1}{\beta_{\min}} \mathrm{Softplus}(\hat{m}^x) + \frac{\kappa w_{yy}}{2}(m^y)^2 + w_{xy} \kappa^{1/2} m^x m^y + \kappa b_y m^y - \frac{\kappa}{\beta_{\max}} m^y \hat{m}^y + \frac{\kappa}{\beta_{\max}} \mathrm{Softplus}(\hat{m}^y) \right].
$$
The final equality is obtained by applying the saddle-point method to evaluate the integral. From the saddle-point equations, the following expressions are obtained:
$$
m^x = \sigma(\hat{m}^x), \qquad m^y = \sigma(\hat{m}^y).
$$
Further transformation of the equation yields the following expression:
$$
f(\beta_{\min}, \beta_{\max}; W) = \operatorname*{extr}_{m^x, m^y} \left[ \frac{w_{xx}}{2}(m^x)^2 + \frac{\kappa w_{yy}}{2}(m^y)^2 + w_{xy} \kappa^{1/2} m^x m^y + b_x m^x + \kappa b_y m^y - \frac{1}{\beta_{\min}} H(m^x) + \frac{\kappa}{\beta_{\max}} H(m^y) \right],
$$
where $H(x) = -x \log x - (1-x) \log(1-x)$ is the binary entropy.
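The extremization above can be solved numerically. The sketch below (an illustrative fixed-point iteration, not from the paper; the coefficients, temperatures, and damping factor are arbitrary choices, and the stationarity conditions $\hat{m}^x = -\beta_{\min}(w_{xx} m^x + \kappa^{1/2} w_{xy} m^y + b_x)$ and $\hat{m}^y = \beta_{\max}(w_{yy} m^y + \kappa^{-1/2} w_{xy} m^x + b_y)$ are an assumption obtained by differentiating the extremized expression) iterates the saddle-point equations $m^x = \sigma(\hat{m}^x)$, $m^y = \sigma(\hat{m}^y)$; note the opposite signs, reflecting that the min player descends while the max player ascends:

```python
import math

def sigmoid(z):
    # numerically stable logistic function
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))

# illustrative coefficients (not values from the paper)
w_xx, w_yy, w_xy = 1.0, 1.0, 0.5
b_x, b_y = 0.1, -0.1
kappa = 1.0
beta_min, beta_max = 2.0, 2.0

mx, my, damping = 0.5, 0.5, 0.5
for _ in range(2000):
    hx = -beta_min * (w_xx * mx + math.sqrt(kappa) * w_xy * my + b_x)  # min player
    hy = beta_max * (w_yy * my + w_xy * mx / math.sqrt(kappa) + b_y)   # max player
    mx = (1 - damping) * mx + damping * sigmoid(hx)
    my = (1 - damping) * my + damping * sigmoid(hy)

print(mx, my)  # converged magnetizations, both in (0, 1)
```

Damping is used only to stabilize the iteration; any convergent fixed point satisfies the saddle-point equations exactly.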
Next, we proceed to evaluate the free energy density under the analytic continuation of the replica method, which is expressed as
$$
\hat{f}(\beta_{\min}, \beta_{\max}; W) = -\frac{1}{\beta_{\min} d_x} \log \sum_{x} \sum_{y^1, \ldots, y^p} \exp\left( \beta_{\max} \sum_{l=1}^{p} V(x, y^l; W) \right).
$$
We introduce the order parameters through the Fourier-transform representation of the delta function:
$$
\hat{f}(\beta_{\min}, \beta_{\max}; W) = -\frac{1}{\beta_{\min} d_x} \log \int \frac{d_x d_y^p}{(2\pi)^{p+1}}\, dm^x d\hat{m}^x \prod_{l=1}^{p} dm^y_l d\hat{m}^y_l\, e^{\beta_{\max} d_x \left( \frac{w_{xx} p}{2}(m^x)^2 + \frac{\kappa w_{yy}}{2} \sum_l (m^y_l)^2 + w_{xy} \kappa^{1/2} m^x \sum_l m^y_l + b_x p\, m^x + \kappa b_y \sum_l m^y_l - \frac{1}{\beta_{\max}} \hat{m}^x m^x - \frac{\kappa}{\beta_{\max}} \sum_l \hat{m}^y_l m^y_l + \frac{1}{\beta_{\max}} \mathrm{Softplus}(\hat{m}^x) + \frac{\kappa}{\beta_{\max}} \sum_l \mathrm{Softplus}(\hat{m}^y_l) \right)}.
$$
Under the assumption of replica symmetry, $\hat{m}^y_l = \hat{m}^y$ and $m^y_l = m^y$ for all $l \in [p]$, we can further reformulate the expression as follows:
$$
\hat{f}(\beta_{\min}, \beta_{\max}; W) = -\frac{1}{\beta_{\min} d_x} \log \int dm^x d\hat{m}^x dm^y d\hat{m}^y\, e^{-\beta_{\min} d_x \left( \frac{w_{xx}}{2}(m^x)^2 + \frac{\kappa w_{yy}}{2}(m^y)^2 + w_{xy} \kappa^{1/2} m^x m^y + b_x m^x + \kappa b_y m^y + \frac{1}{\beta_{\max} p} \hat{m}^x m^x - \frac{\kappa}{\beta_{\max}} \hat{m}^y m^y - \frac{1}{\beta_{\max} p} \mathrm{Softplus}(\hat{m}^x) + \frac{\kappa}{\beta_{\max}} \mathrm{Softplus}(\hat{m}^y) + o(1) \right)}
$$
$$
= \operatorname*{extr}_{m^x, m^y, \hat{m}^x, \hat{m}^y} \left[ \frac{w_{xx}}{2}(m^x)^2 + \frac{\kappa w_{yy}}{2}(m^y)^2 + w_{xy} \kappa^{1/2} m^x m^y + b_x m^x + \kappa b_y m^y + \frac{1}{\beta_{\max} p} \hat{m}^x m^x - \frac{\kappa}{\beta_{\max}} \hat{m}^y m^y - \frac{1}{\beta_{\max} p} \mathrm{Softplus}(\hat{m}^x) + \frac{\kappa}{\beta_{\max}} \mathrm{Softplus}(\hat{m}^y) \right].
$$
The final equality is derived by handling the integral with the saddle-point method. Consequently, the following saddle-point equations are obtained:
$$
m^x = \sigma(\hat{m}^x), \qquad m^y = \sigma(\hat{m}^y).
$$
Substituting these results and using $\beta_{\max} p = \beta_{\min}$, the following expression for the free energy density is derived:
$$
\hat{f}(\beta_{\min}, \beta_{\max}; W) = \operatorname*{extr}_{m^x, m^y} \left[ \frac{w_{xx}}{2}(m^x)^2 + \frac{\kappa w_{yy}}{2}(m^y)^2 + w_{xy} \kappa^{1/2} m^x m^y + b_x m^x + \kappa b_y m^y - \frac{1}{\beta_{\min}} H(m^x) + \frac{\kappa}{\beta_{\max}} H(m^y) \right].
$$
This result coincides with the exact free energy density $f(\beta_{\min}, \beta_{\max}; W)$, derived without the need for analytic continuation.
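The substitution of the saddle-point equations into the extremized expression uses the identity $-m\,\hat{m} + \mathrm{Softplus}(\hat{m}) = H(m)$ at $m = \sigma(\hat{m})$. The following quick check (illustrative, not from the paper) verifies this identity numerically over a few values of $\hat{m}$:

```python
import math

def softplus(z):
    # numerically stable log(1 + e^z)
    return math.log1p(math.exp(-abs(z))) + max(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))

def binary_entropy(m):
    return -m * math.log(m) - (1.0 - m) * math.log(1.0 - m)

for mhat in [-3.0, -0.7, 0.0, 1.2, 4.5]:
    m = sigmoid(mhat)
    lhs = -m * mhat + softplus(mhat)   # -m mhat + Softplus(mhat)
    assert abs(lhs - binary_entropy(m)) < 1e-12
print("identity verified")
```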
B Generalization of Theorem 4.1 to Finite Eigenmodes

In this section, we generalize Theorem 4.1 to cases where $W_{xx}$, $W_{xy}$, and $W_{yy}$ are decomposed into a finite number of eigenmodes as follows:
$$
W_{xx} = \sum_{l=1}^{L} \alpha_l a_l a_l^\top, \quad W_{yy} = \sum_{m=1}^{M} \beta_m b_m b_m^\top, \quad W_{xy} = \sum_{n=1}^{N} \gamma_n c_n c_n'^{\top}, \qquad \alpha_l, \beta_m, \gamma_n, L, M, N = O_{d_x, d_y}(1).
$$
In this formulation, $V(x, y; W)$ can be expanded as
$$
V(x, y; W) = d_x \left[ \sum_{l=1}^{L} \frac{\alpha_l}{2} \left( \frac{x^\top a_l}{d_x} \right)^2 + \kappa \sum_{m=1}^{M} \frac{\beta_m}{2} \left( \frac{y^\top b_m}{d_y} \right)^2 + \kappa^{1/2} \sum_{n=1}^{N} \gamma_n \frac{x^\top c_n}{d_x} \frac{y^\top c_n'}{d_y} \right].
$$
This expansion constitutes a direct extension of the calculations in Appendix A, with the overlaps $x^\top \mathbf{1}_{d_x}/d_x$ and $y^\top \mathbf{1}_{d_y}/d_y$ replaced by the overlaps with the eigenvectors, $\{x^\top a_l/d_x\}_{l=1}^{L}$, $\{y^\top b_m/d_y\}_{m=1}^{M}$, and $\{x^\top c_n/d_x,\, y^\top c_n'/d_y\}_{n=1}^{N}$. By performing analogous calculations, we find that the free energy density obtained without analytic continuation again coincides with the analytically continued free energy density. Furthermore, this free energy density is characterized by a saddle-point condition involving $L + M + N = O_{d_x, d_y}(1)$ variables, as in Eq. (23). If instead $L, M, N = O_{d_x, d_y}(d_x)$, the limit $d_x \to \infty$ becomes trivial: without appropriate rescaling, the free energy density diverges.

C Evaluation of Functions of the Optimal Value of Min-Max Problems

In this section, we present a method for evaluating the expected value $\mathbb{E}_A[G(x^*(A))]$ over a set of problem instances $A$, where $x^*(A) = \operatorname{argmin}_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} V(x, y; A)$ denotes the min-max optimal solution. This analysis shows how Eq. (2) functions as a generating function. We also demonstrate that this approach can be applied directly to evaluate the generalization error in Section 5, in particular Eq. (13). The key idea is to express $\mathbb{E}_A[G(x^*(A))]$ as
$$
\mathbb{E}_A\left[ G(x^*(A)) \right] = \mathbb{E}_A\left[ \lim_{\beta_{\min} \to +\infty} \lim_{\beta_{\max} \to +\infty} \int p_{\beta_{\min}, \beta_{\max}}(x; A)\, G(x)\, dx \right] = \lim_{\beta_{\min} \to +\infty} \lim_{\beta_{\max} \to +\infty} d_x \frac{\partial}{\partial \omega} f(\beta_{\min}, \beta_{\max}; \omega G(x)) \Big|_{\omega=0},
$$
where we extend the free energy of Eq. (2) by introducing a source parameter $\omega$ as follows:
$$
f(\beta_{\min}, \beta_{\max}; \omega G(x)) = -\frac{1}{\beta_{\min} d_x}\, \mathbb{E}_A \log \int dx \exp\left( -\frac{\beta_{\min}}{\beta_{\max}} \log \int dy \exp\left( \beta_{\max} V(x, y; A) \right) - \beta_{\min}\, \omega\, G(x) \right).
$$
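The generating-function construction can be checked numerically on a small discrete min-max problem: adding a source term proportional to $\omega\, G(x)$ to the exponent and differentiating the free energy at $\omega = 0$ recovers $G$ at the min-max solution. The sketch below (an illustrative finite-difference check, not from the paper; the payoff matrix, observable $G$, and temperatures are arbitrary choices, and the dimension normalization is dropped for this finite toy) follows this recipe:

```python
import math

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

# small fixed payoff matrix and an arbitrary observable G(x)
V = [[0.9, -0.2, 0.1],
     [0.5, 0.4, -0.6],
     [0.2, 0.8, 0.3]]
G = [0.3, 0.7, 0.2]
n = len(V)

beta_min = beta_max = 200.0

def f(omega):
    # H_eff(x) = (1/beta_max) log sum_y exp(beta_max V(x, y))
    H_eff = [logsumexp([beta_max * V[i][j] for j in range(n)]) / beta_max
             for i in range(n)]
    # free energy with source term: exponent -beta_min (H_eff(x) + omega G(x))
    return -logsumexp([-beta_min * (H_eff[i] + omega * G[i]) for i in range(n)]) / beta_min

eps = 1e-6
dG = (f(eps) - f(-eps)) / (2 * eps)              # d f / d omega at omega = 0
x_star = min(range(n), key=lambda i: max(V[i]))  # argmin_x max_y V(x, y)
print(abs(dG - G[x_star]))  # small: the derivative recovers G at the min-max solution
```

At finite temperature the derivative equals the thermal average of $G$; as both temperatures grow, the measure concentrates on the min-max solution and the average converges to $G(x^*)$.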
We employ this technique to derive the generalization error in Section 5. Specifically, the generalization error for GANs can be expressed as
$$
\varepsilon_g(\bar{w}, w^*) = \frac{1}{d}\, \mathbb{E}_D\, \| \bar{w}(D) - w^* \|^2.
$$
To compute this, we augment the free energy calculation by adding the term $\omega\left( \| \bar{w} \|^2 - 2\, \bar{w}^\top w^* \right)$. This adjustment is incorporated into the calculation by modifying the term $\lambda (w^a)^2$ in the exponent of the $d$-th power expression in Eq. (14) to $(\omega + \lambda)(w^a)^2 - 2\omega\, (w^{*\top} w^a)$. Since these terms are quadratic, the Gaussian integration remains straightforward. The remaining calculation follows the same procedure as in the main text and is thus omitted for brevity. Eventually, we obtain the following form:
$$
\lim_{d \to +\infty} \varepsilon_g = 1 - 2 A(\hat{q}, \hat{\chi}, \hat{m}, \hat{b}) + A_2(\hat{q}, \hat{\chi}, \hat{m}, \hat{b}),
$$
where $A(\hat{q}, \hat{\chi}, \hat{m}, \hat{b})$ is determined by Eq. (22), explicitly given as $A(\hat{q}, \hat{\chi}, \hat{m}, \hat{b}) = \hat{b}^2 + \lambda(\hat{q} + \lambda)$