# nonequilibrium_dynamics_of_hybrid_continuousdiscrete_groundstate_sampling__05adc4ab.pdf Published as a conference paper at ICLR 2025 NON-EQUILIBRIUM DYNAMICS OF HYBRID CONTINUOUS-DISCRETE GROUND-STATE SAMPLING Timothée Guillaume Leleu NTT Research, Stanford University timothee.leleu@ntt-research.com tleleu@stanford.edu Sam Reifenstein NTT Research samuel.reifenstein@ntt-research.com We propose a general framework for a hybrid continuous-discrete algorithm that integrates continuous-time deterministic dynamics with Metropolis-Hastings (MH) steps to combine search dynamics that either preserve or break detailed balance. Our purpose is to study the non-equilibrium dynamics that leads to the ground state of rugged energy landscapes in this general setting. Our results show that MHdriven dynamics reach easy ground states more quickly, indicating a stronger bias toward these solutions in algorithms using reversible transition probabilities. To validate this, we construct a set of Ising problem instances with a controllable bias in the energy landscape that makes certain degenerate solutions more accessible than others. The constructed hybrid algorithm demonstrates significant improvements in convergence and ground-state sampling accuracy, achieving a 100x speedup on GPU compared to simulated annealing, making it well-suited for large-scale applications. 1 INTRODUCTION The fact that non-equilibrium relaxation dynamics reach different degenerate ground states at different rates, contrary to what is expected at equilibrium, is a fundamental topic in optimization and statistical physics (Bernaschi et al., 2020; Lucas, 2014; Mohseni et al., 2022). In machine learning, recent works suggest that sampling from more easily accessible degenerate ground states favors finding models that generalize well during the training of overparameterized neural networks (Soudry et al., 2018; Baity-Jesi et al., 2018; Feng & Tu, 2021; Baldassi et al., 2022; 2023). Here, we study this issue from the general perspective of non-convex optimization, focusing on the Ising Hamiltonian. In particular, we study the task of sampling from degenerate ground states of a non-convex energy landscape. Ground-state sampling involves finding not just any ground state, as it is often defined in combinatorial optimization, but multiple degenerate ground states. In this context, non-equilibrium dynamics refers to the processes through which systems evolve over time toward steady states, with varying rates of convergence to distinct ground states that deviate from equilibrium expectations. The Metropolis-Hastings (MH) algorithm is one of the most widely used techniques for sampling from a reversible Markov Chain Monte Carlo (MCMC) (Zhang et al., 2020; Chib & Greenberg, 1995; Chen et al., 2014). This approach proposes new states and accepts them with a probability designed to ensure that the stationary distribution of the chain is the Boltzmann distribution. Despite its simplicity and versatility, MH can struggle with getting trapped in local energy minima, especially when dealing with complex energy landscapes. Examples of these energy landscapes often arise in systems exhibiting broken ergodicity (Bernaschi et al., 2020), such as spin glasses. In these cases, other MCMC algorithms that maintain detailed balance, like simulated annealing (Kirkpatrick et al., 1983) and more advanced methods (Hukushima & Nemoto, 1996; Surjanovic et al., 2022; Hukushima & Iba, 2003; Houdayer, 2001; Landau et al., 2004), are often favored for exploring rugged energy landscapes and finding optimal solutions. Interestingly, several algorithms do not satisfy detailed balance yet remain competitive (Leleu et al., 2019; 2021; Goto et al., 2021; Buesing et al., 2011), particularly those based on chaotic search (Leleu et al., 2019; 2021). Although these approaches do not sample in general from the exact Boltzmann distribution at equilibrium, it can be shown in special cases that non-reversible Markov chains mix faster than reversible ones (Kapfer & Krauth, Published as a conference paper at ICLR 2025 2017). However, their ability to reach degenerate ground states of a rugged energy landscape using non-equilibrium dynamics is still not well understood. In this work, we introduce a generalized algorithm that combines MCMC techniques with chaotic search dynamics via the chaotic amplitude control (CAC) algorithm. Tuning the hyperparameters of this hybrid approach allow for a smooth transition between the two strategies. Additionally, we create new planted instances with degenerate ground states, inspired by the Wishart planted models. A tunable parameter controls the bias towards different degenerate ground states, enabling us to investigate the non-equilibrium behavior of the hybrid MH/chaotic algorithm in locating solutions that vary in difficulty. We demonstrate that the algorithm s performance can be optimized based on the bias parameter, making it competitive with state-of-the-art methods. The code for our algorithm can be found on this online repository: Git Hub Repository. 2 RELATED WORK 2.1 DISCRETE STATE MCMC OPTIMIZERS MCMC methods for optimizing energy functions over discrete spaces such as simulated annealing (Kirkpatrick et al., 1983) and parallel tempering (Hukushima & Nemoto, 1996) are standard approaches in the field of optimization. Building upon these, recent schemes such as population annealing (Hukushima & Iba, 2003), variational parallel tempering (Surjanovic et al., 2022) and specialized techniques, including cluster updates (Houdayer, 2001), Wang-Landau sampling (Landau et al., 2004), and Replica Exchange Wang-Landau Sampling (Vogel et al., 2013), have been developed to enhance convergence and sampling efficiency. However, these methods often struggle in higherdimensional discrete spaces due to the complexity of the energy landscape and the tendency to become trapped in local energy minima caused by broken ergodicity. This is a behavior observed in many systems with frustrated interactions. (Bernaschi et al., 2020). 2.2 HYBRID CONTINUOUS-DISCRETE SAMPLERS For continuous space sampling, a plethora of methods have been developed based on the idea of combining continuous-time dynamics (typically via ODEs or SDEs) with discrete MCMC corrections. Some examples of this are Metropolis Adjusted Langevin Dynamics (MALA) (Roberts & Rosenthal, 1998), Hamiltonian Monte Carlo (HMC) (Neal et al., 2011; Xifara et al., 2014), as well as derivative methods such as the No-U-Turn Sampler (NUTS) (Hoffman et al., 2014), Generalized HMC (Chen et al., 2014), HMC with the Gaussian integral trick (Zhang et al., 2012), amortized Metropolis adjustments (Zhang et al., 2020), and sampling using collective variables (Schönle et al., 2024). These methods primarily target continuous spaces, and thus are not directly applicable to discretestate sampling. However, research has been conducted to extend continuous sampling techniques to discrete spaces, including methods like discontinuous HMC (Nishimura et al., 2020), Gibbs with Gradient (GWG) (Grathwohl et al., 2021), and the Hamming Ball Sampler (Titsias & Yau, 2017). Nonetheless, these approaches are not suitable for ground-state sampling. 2.3 HYBRID CONTINUOUS-DISCRETE OPTIMIZERS Recently, there has been renewed interest in using continuous-space ODEs to find ground states in discrete spaces. This effort has been driven primarily by the idea of leveraging a diverse array of physical hardware, including CMOS, memristors, spintronics, superconducting circuits, and photonics (see review (Mohseni et al., 2022)), all designed to solve optimization problems more efficiently than current high-performance digital computers. In particular, dynamical systems such as simulated bifurcation machines (Goto et al., 2021), analog iterative machines (AIM) (Kalinin et al., 2023), and chaotic amplitude control (CAC) (Leleu et al., 2019; 2021; Reifenstein et al., 2023) offer an alternative approach. Rather than relying on gradient descent of an energy function in systems with symmetric interactions, such as in the seminal work by Hopfield (Hopfield & Tank, 1985), it has been shown that Hamiltonian dynamics (Goto et al., 2021; Neal et al., 2011) or chaotic dynamics driven by asymmetric interactions (Leleu et al., 2019; 2021) can outperform standard heuristics like simulated annealing (Kirkpatrick et al., 1983), parallel tempering (Hukushima & Iba, 2003), and breakout local search (Benlic & Hao, 2013) on specific Published as a conference paper at ICLR 2025 benchmark datasets (Leleu et al., 2021). This has challenged the prevailing belief that MCMC heuristics maintain an advantage in discrete optimization over systems whose dynamics are relaxed into the continuous domain. Hybrid approaches that combine continuous-space ODEs with discrete MCMC probabilistic transitions remain underexplored in the context of ground-state search, as opposed to their more common application in Boltzmann sampling. 3.1 OVERVIEW In this section, we introduce a novel MCMC method that combines chaotic search with the MH algorithm (Hastings, 1970) (see Fig. 1 a). The algorithm is built on three main concepts: (1) relaxation from discrete to continuous space (or continuous embedding), (2) addition of auxiliary variables to escape from local minima (chaotic amplitude control (Leleu et al., 2019; 2021) and momentum (Kalinin et al., 2023)), (3) probabilistic jumps in the discrete space based on the Metropolis-Hastings criterion. In the next section, we construct a novel set of Ising problem instances inspired from the Wishart planted ensemble (Hamze et al., 2020) in order to benchmark this algorithm s ability to sample from optimal solutions (or ground-state energies). The constructed instances exhibit degenerate ground states that are sampled with unequal probabilities by local search algorithms (see Fig. 1 b). Ground-state energy Energy landscape Discrete configuration space More likely sampled Less likely sampled Continuous space Metropolis-Hastings jump Chaotic search Figure 1: a) Proposed hybrid approach generates deterministic chaotic trajectories that rapidly explores the energy landscape and probabilistic jumps accepted via the Metropolis-Hastings criterion. (b) Constructed planted instances with degenerate ground states with unequal sampling probabilities by heuristics. Some ground states are easier, and others harder, for local search heuristics to discover. 3.2 DETERMINISTIC CONTINUOUS TIME MODEL We define a continuous time dynamical system inspired by the CAC algorithm (Leleu et al., 2019; 2021) and generalize its formulation to draw a correspondence between its parameters and the inverse temperature β of the Boltzmann distribution P(σ) eβV (σ). The goal is to design a dynamical system capable of sampling from the ground states of V , specifically from the zero-temperature distribution P(σ) defined on the discrete space σ { 1, 1}N, where N is the number of variables or spins. First, discrete variables σ are relaxed to the continuous space x RN and we define the potential (or energy function) V (x) on the real space. The following dynamical system is used to direct the search towards lower energy states: dt = αu e x V, (1) where the symbol denote the Hadamard product; x V , the gradient of V with respect to the vector x; α, a positive parameter; and e, a vector of positive auxiliary variables. The variables x are often referred to as "soft spins." The variables u represent the internal states of these soft spins, while the auxiliary variable e accounts for variations in their amplitude (Leleu et al., 2019). The term proportional to γ represents the momentum which has been utilized for non-convex Published as a conference paper at ICLR 2025 optimization (Kalinin et al., 2023) and sampling (Neal et al., 2011; Xifara et al., 2014). Moreover, we set: xi = ϕ(βeffui), i {1, , N}, (2) and ϕ(x) = 2 1+e x 1 is a sigmoidal function normalized to the domain x [ 1, 1]. βeff is an effective temperature (see Appendix section S3.1 for a discussion about its influence). The auxiliary variables ei undergo dynamic changes over time, with the goal of rectifying amplitude heterogeneity, given as follows (Leleu et al., 2019): dt = ξ(x x a) e, (3) with a and ξ being parameters: a represents target amplitude for the variables x and ξ represents the speed of the error correction dynamics. Because this algorithm adds momentum dynamics to the original chaotic amplitude control equations (Leleu et al., 2019; 2021), we call it chaotic amplitude control with momentum (CACm). In the Appendix section S1, the role of the auxiliary variables e is analyzed and we show that their modulation can be interpreted as a dynamic pre-conditioning of the Hessian of V at the proximity of the discrete state. Because the auxiliary variables e remain positive, they do not change the inertia (number of positive, negative, and zero eigenvalues) of the Hessian of V but can change its directions and eigenvalues. Thus, CAC performs some dynamic deformation of the potential function V which affects its ability to sample local minima determined by the shape of V when compared to other algorithms, as it will be shown in the following. Note that the momentum and pre-conditioning play distinct roles in the dynamics. The momentum aggregates past gradients, while pre-conditioning modulates the effective curvature of V . 3.3 PROBABILISTIC DISCRETE TIME MODEL AND MCMC In order to utilize this dynamical system algorithmically, we discretize eqs. 1-2 using a straightforward Euler approximation. Unlike Hamiltonian Monte Carlo, which requires careful selection of the integration method, an accurate integration scheme is not essential in this context since the goal is to sample from a discrete distribution rather than precisely from a continuous one. The utilization of a simple integration method and large integration time step enables the discrete-time approximation of the system to function as computationally efficient solver for combinatorial optimization. The discrete-time model is defined as follows (after a suitable change of variables): u(m + 1) = u(m) αu(m) e x V (x(m)) + γ(u(m) u(m 1)), (4) e(m + 1) = e(m) ξ(x(m) x(m) a) e(m). (5) To maintain the effect of the deterministic dynamics when used for optimization, we formulate an MCMC as follows. The eqs. 4 and 5 are iterated for n time steps, subsequent to which we sample a discrete state σ { 1, 1}N as detailed below: P(σi = 1) = xi + 1 2 , σi = 1 otherwise. (6) It is worth noting that all spins are updated synchronously and independently. While the deterministic dynamics over n time-steps can be perceived as local updates, the generation of σ can leap to remote states. Note moreover that when we set ξ = 0, ei(0) = 0, α = 1, m = 1, and update one spin i at a time, we revert to the equation of the Boltzmann machine. 3.4 METROPOLIS ADJUSTED CHAOTIC SAMPLING When ξ = 0, detailed balance is violated, causing the stationary distribution of the chain to differ from the Boltzmann distribution. To design an MCMC method that converges (at least in principle) to the Boltzmann distribution, we introduce a Metropolis-Hastings acceptance criterion for the discrete state samples σ(kt), where k belongs to {1, 2, ..., K}. The inverse of the number of Metropolis-Hastings steps is denoted f MH = 1 K and the total number of steps is noted T with T = Kn. Published as a conference paper at ICLR 2025 Typically, the acceptance criterion of a new configuration τ, when the current state is σ is given as follows: A(τ/σ) = min(1, P(τ)Q(σ/τ) P(σ)Q(τ/σ)). (7) Here, Q(τ/σ) represents the probability of transitioning from σ to τ. The Monte Carlo step for our method is comprised of two substeps as follows: (1) A deterministic trajectory following eqs. 4 and 5 for n steps starting from u(kn) = 0, e(kn) = 0, and , x(kn) = σ(kn), (2) A sample τ generated with the probability p with p = y+1 2 with y defined as yi = ϕ( βeffui((k+1)n) αei((k+1)n) ), i.e., P(τi = 1) = pi, τi = 1 otherwise. Note that y depends exclusively on σ, since it is generated deterministically from it. Because of the synchronous updating of all spins, we can write the following definition for Q(τ/σ): log Q(τ/σ) = 2 log (pi) + 1 τi 2 log (1 pi) . (8) To calculate the transition probability in the reverse Monte Carlo step from τ to σ, we need to apply first the deterministic sub-step (1) from τ to obtain a candidate end point x1, from which we can calculate the probability to generate the random sample σ with probability p = x+1 2 . The acceptance criterion can then be rewritten as follows: log A(τ/σ) = min(0, β(V (σ) V (τ) 2 log (pi) + 1 τi 2 log(1 pi) 2 log( pi) + 1 σi 2 log (1 pi) ). (9) This acceptance criterion can be computed efficiently since it only requires the logarithm of terms already computed within the deterministic equations. The discrete state τ is accepted with probability A(τ/σ). The Markov chain using the Metropolis-Hastings step respects detailed balance condition and converges to our target Boltzmann distribution P(x) = e βV (x). In summary, the pseudo code of our approach is given in section S2 of the Appendix. Note that y is solely determined by σ. The deterministic path to y originates from σ with e = 1. Consequently, the transition probability from σ to τ depends only on σ (mapped to y by the deterministic flow), preserving the Markov property. The updates of all spins are done in parallel independently, resulting in a product of probabilities. This can be reduced to a sum of log probabilities for more stable numerical simulations as shown in eq. 9. We call the algorithm defined by eqs. 4-9 Metropolis Hastings Chaotic Amplitude Control with momentum (MHCACm). 3.5 SUMMARY OF ALGORITHMS A key property of new the algorithm proposed in this work is that can be interpreted as a generalization of many existing methods which exist as limiting cases of MHCACm. Some examples are: Simulated annealing (Kirkpatrick et al., 1983), Hopfield neural networks (Hopfield & Tank, 1985), analog iterative machines (Kalinin et al., 2023), chaotic amplitude control (Leleu et al., 2019), and chaotic amplitude control with momentum. These limiting cases are summarized in Table 1. The generalized algorithms tend to have more parameters which need to be accurately tuned in order to adequately utilize the dynamics provided by each aspect of the algorithm. When properly tuned to a given class of instances, the performance of the generalized algorithm performs at least as well as any of its constituent parts. Importantly, automatic parameter tuning methods have recently been developed which allow the many parameters to be quickly and accurately tuned (Reifenstein et al., 2024). 1 x is defined similarly to y using the end state of the deterministic path starting from τ given as xi = ϕ( βeffui((k+2)n) αei((k+2)n) ) Published as a conference paper at ICLR 2025 Table 1: Limit cases of the proposed algorithm MHCACm. SA: simulated annealing. HNN: Hopfield neural network. AIM: analog iterative machine. CAC: chaotic amplitude control. CACm: chaotic amplitude control with momentum. MHCACm: Metropolis Hastings Chaotic Amplitude Control with momentum. model f MH γ ξ model f MH γ ξ T = 0 = 0 CAC = 1 = 0 > 0 HNN = 1 = 0 = 0 CACm = 1 > 0 > 0 AIM = 1 > 0 = 0 MHCACm 1 T < f MH < 1 > 0 > 0 4 BENCHMARK 4.1 DEGENERATE PLANTED WISHART INSTANCES We construct a set of planted Ising problem instances that exhibit ground states with different tunable properties. The construction of these instances is inspired by the Wishart planted ensemble (WPE) (Hamze et al., 2020). WPE instances are particularly useful for benchmarking due to their tunable hardness parameter αWPE. For certain values of αWPE, the recovery of the planted solution becomes easier or harder. We introduce the degenerate Wishart planted ensemble (d WPE) to allow for planting of multiple degenerate ground states. Additionally, we add a tunable "bias", denoted b, to the energy landscape which allows us to make some ground states easier/harder to find than others for heuristic algorithms as depicted in Fig. 1 (b). Thus, we refer to these different ground states as "easy" and "hard" solutions. Details of the construction method are provided in the Appendix section S5. 4.2 COMPUTATION TIME TO FIND EASY AND HARD SOLUTIONS In this section, we evaluate the novel algorithm for finding low energy states of non-convex combinatorial optimization problems. A common metric for evaluating the performance of Ising solvers is the "time to solution" (TTS) which measures the number of steps needed to have 99% probability of finding any ground state (the smaller, the better the algorithm s performance). In case there are D degenerate ground states (i.e., solutions of same energy), there is equal probability of sampling them from the Boltzmann distribution. However, algorithms generally reach a ground state before fully equilibrating to the Boltzmann distribution and, consequently, each degenerate solution can have different probability of being sampled at finite sampling time. We define TTS, TTSeasy, and TTShard as the time to reach any, the easier, and harder to find degenerate solution (of same Ising energy) with 99% probability as follows: TTS = T log(1 0.99) log (1 P d Pd), TTSeasy = T log(1 0.99) log (1 maxd Pd), TTShard = T log(1 0.99) log (1 mind Pd), (10) where Pd is the probability to sample the ground state d. The algorithms used for comparison are all tuned to nearly optimal parameters by minimizing the TTS. We use a state-of-the-art autotuning method called dynamic anisotropic smoothing (Reifenstein et al., 2024) (DAS) which is particularly well fitted to optimizing the parameters of heuristics for combinatorial optimization (Reifenstein et al., 2024). The use of a automatic parameter tuning also guarantees a fair comparison between algorithms (see Appendix S4). Moreover, all algorithms are compared using the number of steps" to solution, where the algorithmic complexity of a single step is dominated by the matrix-vector multiplication of size N N 2. 4.3 GLOBAL OPTIMIZATION TO ANY OPTIMAL SOLUTION For the task of finding any ground state, Fig. 2 (a) shows that MHCACm performs almost as well as CACm and better than AIM, CAC, and SA3. In the case of sampling any ground state of biased 2For SA, a step represents a full sweep of N spin updates which also has complexity O(N 2). The same for PT, but divided by the number of replica. 3Results for SA and PT are collected using dwave-neal (dwavesystems, 2024) and py SA (Mandrà & Munoz Bauza, 2024) Published as a conference paper at ICLR 2025 (b = 12 ) d WPE problem instances (see section 4.1), MHCACm requires significantly fewer steps than other algorithm to find an optimal solution (see Fig. 2 (b)). 60 70 80 90 100 110 120 130 140 N TTS (any ground-state) a (unbiased Wishart) MHCACm CACm AIM CAC PT SA (in sweeps) 60 70 80 90 100 110 120 130 140 N TTS (any ground-state) b (biased Wishart, b=12) MHCACm CACm AIM CAC PT SA (in sweeps) Figure 2: The time to solution of (unbiased) Wishart planted instances and biased degenerate planted instance with b = 12 are shown in (a) and (b), respectively. f MH = 0.1. All other parameters tuned with DAS (Reifenstein et al., 2024) (see Appendix S4). Colored regions represent the 95% confidence interval calculated by generating 15 instances per data point. 4.4 SAMPLING IN THE CASE OF BIASED DEGENERATE OPTIMAL SOLUTIONS Next, we examine why MHCACm requires fewer steps to find any solution when there is a larger bias in the d WPE instances. Fig. 3 (a) and (b) show that the TTSeasy for easy ground states decreases as a function of the bias b of the generated Wishart instances. Conversely, the time to sampled the hard solution TTShard increases. This is unlike the case of CACm without the MH step, for which the TTSeasy and TTShard remain approximately equal for all bias b > 0. Thus, MHCACm is better at finding the easier solutions, which explains the smaller overall TTS (to any ground state) observed in Fig. 2. Table 2 indicates that MHCACm is the fastest algorithm to sample from easier ground states, whereas CACm is the better algorithm to reach the harder to find ones. This difference in complexity between finding the easy and hard solutions is also present in other MCMC algorithms such as SA and PT as can be seen in Tab. 2. The energy landscape contains a higher density of low-energy states near the easy solution. An algorithm that attempts to sample from a Boltzmann distribution will thus have a large bias towards this part of the solution space at finite temperature, because of the many low energy states that should be sampled with high probability. Although in the infinite time limit we would expect equal probability of finding all ground states, in practice the extremely slow relaxation time of these algorithms makes it very improbable for the "hard" ground state to be found. However, our hybrid algorithm is able to find both ground states with nonzero probability. We believe a reason for this is that the purely analog methods such as CAC, AIM and CACm do not sample from a Boltzmann distribution and thus can partially avoid these large basins of attraction. We see this in Tab. 2 in which the analog dynamics are able to find both ground states with reasonable success as well. We show the results of the Boltzmann sampling algorithm Gibbs with gradient (Grathwohl et al., 2021) (GWG), although its purpose is not combinatorial optimization. In appendix section S6, comparison to another recently proposed sampling algorithm (Sun et al., 2023) is shown. We also investigate the effect of changing the complexity parameter αWPE of biased Wishart planted instances (see Fig. 4) with a bias of b = 12. We observe that the relative reduction of the TTS due to the introduction of the MH step in biased instances is more pronounced for instances of higher complexity (i.e., smaller parameter αWPE as shown by the comparison of Fig. 4 (a) and (c)). 4.5 RESULTS ON GSET The GSET (Ye, 2024) is a set of Max-Cut problem instance which are commonly used to benchmark Ising and Max-Cut solvers (Goto et al., 2021; Leleu et al., 2019; Benlic & Hao, 2013). To test the ability of MHCACm to find optimal solutions on larger optimization problems, we briefly present some results on these instances in comparison with a state of the art algorithm known as d SBM (Goto Published as a conference paper at ICLR 2025 1000 2000 3000 4000 T 1000 2000 3000 4000 T 103 104 105 106 107 0 2 4 6 8 10 12 bias b TTSmin * (easy ground-state) CACm MHCACm AIM 0 2 4 6 8 10 12 bias b TTSmax * (hard ground-state) CACm MHCACm AIM Figure 3: Time to reach the easy solution TTSeasy (a) and hard solution TTShard (b) vs. the bias b of degenerate Wishart planted instances. (c) maxd Pd vs. runtime T for degenerate Wishart planted instances. The same optimal parameters as in Fig. 2 are used. (d) The same for TTSeasy. (a,b,c,d) N = 100. (c,d) b = 10. Table 2: Time to solution (in steps, i.e., matrix vector multiplication) using degenerate Wishart planted instances with bias b = 12. We show both TTS, TTSeasy and TTShard (as defined in the text) obtained for a variety of algorithms and two problem sizes. Since SA and GWG were not able to sample the hard ground state, TTShard is not defined for it and TTShard = TTS. Bold numbers denote the best algorithm. 10 3 TTS 10 3 TTSeasy 10 3 TTShard model/N 100 140 100 140 100 140 SA 27.56 1917.67 27.56 1917.67 N/A N/A PT 2437.92 N/A 287.71 N/A 4605.06 N/A GWG 6369.48 15779.94 6369.48 15779.94 N/A N/A AIM 22.31 906.96 37.06 958.13 390.72 16884.93 CAC 238.81 5833.26 586.26 46960.90 427.34 6634.85 CACm 9.41 22.91 18.54 37.84 25.08 76.88 MHCACm 1.40 2.44 1.47 2.44 318.16 248.63 et al., 2021). Table 3 shows that MHCACm reaches best known solutions faster than d SBM (Goto et al., 2021) on the first few instances of the GSET problem set (Ye, 2024). Table 3: TTS (in steps or MVM) for solving GSET (Ye, 2024) instances of MHCACm and d SBM (Goto et al., 2021). instance MHCACm d SBM N (id) p0 T TTS p0 T TTS 800 (1) 0.612000 3000 14592 0.026777 2000 339332 800 (2) 0.021000 3000 650949 0.010660 6000 2578124 800 (3) 0.533500 3000 18118 0.033920 3000 400343 800 (4) 0.140500 3000 91249 0.025144 2000 361673 800 (5) 0.044000 3000 307029 0.022099 3000 618221 800 (6) 0.070500 3000 188972 0.023856 1000 190728 800 (7) 0.300000 3000 38734 0.022552 1000 201889 800 (8) 0.166500 3000 75858 0.019060 1500 358947 4.6 PARALLELIZATION ON GPU The algorithm is particularly well-suited for deployment on highly parallel hardware, such as GPUs, due to its capacity for parallel spin updates. In particular, the computational bottleneck of MHCACm is the matrix vector multiplication (MVM), a basic linear algebra operation that can be done very efficiently on modern GPUs using frameworks such as Py Torch. In table 4 we compare the wall-clock TTS of our Py Torch implementation of MHCACm finding ground states of a degenerate WPE instance of size N = 100. We also show the wall-clock time of a standard implementation of SA (Tiosue, Published as a conference paper at ICLR 2025 0.70 0.80 0.90 Wishart complexity WPE CACm MHCACm 0.70 0.80 0.90 Wishart complexity WPE 10 3TTSeasy CACm MHCACm 0.70 0.80 0.90 Wishart complexity WPE 10 3TTShard CACm MHCACm Figure 4: TTS, TTSeasy and TTShard (as defined in the text) vs. complexity of Wishart planted instances αWPE (see definition in Appendix section S5). N = 100, b = 12. 2024) running an CPU. MHCACm on GPU exhibits a 100x speed up in real time against simulated annealing on CPU for sampling easy optimal solutions. Furthermore, MHCACm is compatible with an implementation on special-purpose analog hardware designed for linear algebra acceleration, such as a memresistor crossbar (Hu et al., 2014) or on-chip photonic mesh arrays (Bogaerts et al., 2020). Table 4: Wall clock times of algorithms on different computing platforms. TTS is time to find any ground state of a degenerate WPE instance with αWPE = 0.8, N = 100 and bias b = 12. The GPU used is a Nvidia V100 and the CPU wall time is calculated on a 2024 Mac Book Pro (Apple M3 Chip). TTS (s) (N = 100) unbiased WPE biased WPE (b = 12) SA (CPU) 0.0756 0.0567 AIM (CPU) 0.32 0.0287 CACm (CPU) 0.013 0.016 MHCACm (CPU) 0.0163 4.08 10 4 MHCACm (GPU) 0.00207 5.18 10 5 5 COMPARISON TO OTHER APPROACHES Methods such as HMC (Neal et al., 2011; Xifara et al., 2014) introduce auxiliary degrees of freedom (momentum) to the gradient dynamics, providing the system additional directions to move away from local minima of the error function. However, these methods do not force the dynamics to be close to the discrete state, making them ill-suited for discrete state sampling. CAC suggests a different mechanism for the auxiliary variables. It adjusts the amplitude of the continuous variables, ensuring proximity to the discrete (binary) state (Leleu et al., 2019; 2021). Conventionally, this can be interpreted as a dual-primal Lagrangian problem in which we want to minimize the error function (in the continuous embedding) while also satisfying a constraint (being close to the discrete state). Our method differs from the primal dual (Vadlamani et al., 2020) approach. With CAC, the error function is achieved by multiplying the gradient with auxiliary variables, akin to dynamically pre-conditioning the Hessian of the cost function. Unlike HMC however, CAC does not conserve (even approximately) energy. In HMC (Neal et al., 2011; Xifara et al., 2014), the MH step compensates for the discretization errors brought about by the leapfrog integration. However, the reason for adding an MH step to CAC is different. State-of-the-art algorithms that exploit relaxation to an continuous state for discrete combinatorial optimization such as memcomputing (Sheldon et al., 2019) chaotic amplitude control (Leleu et al., 2019), and simulated bifurcation machine (Goto et al., 2021) do not accurately sample from the Boltzmann distribution. Our proposed methodology bridges this gap, integrating continuous relaxation algorithms with fair sampling of a discrete distribution via the construction of a hybrid MCMC. 6 CONCLUSION The MHCACm algorithm effectively combines chaotic dynamics with the Metropolis-Hastings method to enhance ground-state sampling efficiency in discrete spaces. The introduction of the MH step increases the probability of finding more accessible ground states during non-equilibrium dynamics when a bias exists in the energy landscape. However, the MH step does not seem beneficial when such a bias among degenerate ground states is absent. This approach significantly improves the Published as a conference paper at ICLR 2025 speed and accuracy of finding optimal solutions for combinatorial optimization tasks, demonstrating superior performance and suitability for large-scale parallel processing on GPUs. The theoretical understanding of the sampling capabilities of MHCACm are out of the scope of this paper. However, numerical evidence seems to indicate that the algorithm not only achieves fair sampling from the Boltzmann distribution after sufficient relaxation time (see Appendix Fig. S2) but can exhibit faster relaxation times on NP-hard Ising problem instances with rugged energy landscape (see Fig. 3 and Tab. 1) when compared to traditional methods such as SA. Lastly, the connection with learning in over-parameterized neural networks is to be investigated. Finding the most reachable and common solution to a large scale optimization problem is particularly useful for training neural networks in machine learning, as it typically corresponds to a trained network with better generalization properties (Soudry et al., 2018; Baity-Jesi et al., 2018; Feng & Tu, 2021; Baldassi et al., 2022; 2023). Published as a conference paper at ICLR 2025 Marco Baity-Jesi, Levent Sagun, Mario Geiger, Stefano Spigler, Gérard Ben Arous, Chiara Cammarota, Yann Le Cun, Matthieu Wyart, and Giulio Biroli. Comparing dynamics: Deep neural networks versus glassy systems. In International Conference on Machine Learning, pp. 314 323. PMLR, 2018. Carlo Baldassi, Clarissa Lauditi, Enrico M Malatesta, Rosalba Pacelli, Gabriele Perugini, and Riccardo Zecchina. Learning through atypical phase transitions in overparameterized neural networks. Physical Review E, 106(1):014116, 2022. Carlo Baldassi, Enrico M Malatesta, Gabriele Perugini, and Riccardo Zecchina. Typical and atypical solutions in nonconvex neural networks with discrete and continuous weights. Physical Review E, 108(2):024310, 2023. Una Benlic and Jin-Kao Hao. Breakout local search for the max-cutproblem. Engineering Applications of Artificial Intelligence, 26(3):1162 1173, 2013. ISSN 0952-1976. doi: https://doi.org/ 10.1016/j.engappai.2012.09.001. URL https://www.sciencedirect.com/science/ article/pii/S0952197612002175. Massimo Bernaschi, Alain Billoire, Andrea Maiorano, Giorgio Parisi, and Federico Ricci-Tersenghi. Strong ergodicity breaking in aging of mean-field spin glasses. Proceedings of the National Academy of Sciences, 117(30):17522 17527, 2020. Wim Bogaerts, Daniel Pérez, José Capmany, David AB Miller, Joyce Poon, Dirk Englund, Francesco Morichetti, and Andrea Melloni. Programmable photonic circuits. Nature, 586(7828):207 216, 2020. Lars Buesing, Johannes Bill, Bernhard Nessler, and Wolfgang Maass. Neural dynamics as sampling: a model for stochastic computation in recurrent networks of spiking neurons. PLo S computational biology, 7(11):e1002211, 2011. Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient hamiltonian monte carlo. In International conference on machine learning, pp. 1683 1691. PMLR, 2014. Siddhartha Chib and Edward Greenberg. Understanding the metropolis-hastings algorithm. The american statistician, 49(4):327 335, 1995. dwavesystems. dwave-neal. https://github.com/dwavesystems/dwave-neal, 2024. Accessed: 2024-0815. Yu Feng and Yuhai Tu. The inverse variance flatness relation in stochastic gradient descent is critical for finding flat minima. Proceedings of the National Academy of Sciences, 118(9):e2015617118, 2021. Hayato Goto, Kotaro Endo, Masaru Suzuki, Yoshisato Sakai, Taro Kanao, Yohei Hamakawa, Ryo Hidaka, Masaya Yamasaki, and Kosuke Tatsumura. High-performance combinatorial optimization based on classical mechanics. Science Advances, 7(6):eabe7953, 2021. Will Grathwohl, Kevin Swersky, Milad Hashemi, David Duvenaud, and Chris Maddison. Oops i took a gradient: Scalable sampling for discrete distributions. In International Conference on Machine Learning, pp. 3831 3841. PMLR, 2021. Firas Hamze, Jack Raymond, Christopher A Pattison, Katja Biswas, and Helmut G Katzgraber. Wishart planted ensemble: A tunably rugged pairwise ising model with a first-order phase transition. Physical Review E, 101(5):052102, 2020. W Keith Hastings. Monte carlo sampling methods using markov chains and their applications. 1970. Matthew D Hoffman, Andrew Gelman, et al. The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo. J. Mach. Learn. Res., 15(1):1593 1623, 2014. John J Hopfield and David W Tank. neural computation of decisions in optimization problems. Biological cybernetics, 52(3):141 152, 1985. Published as a conference paper at ICLR 2025 Jérôme Houdayer. A cluster monte carlo algorithm for 2-dimensional spin glasses. The European Physical Journal B-Condensed Matter and Complex Systems, 22:479 484, 2001. Miao Hu, Hai Li, Yiran Chen, Qing Wu, Garrett S Rose, and Richard W Linderman. Memristor crossbar-based neuromorphic computing system: A case study. IEEE transactions on neural networks and learning systems, 25(10):1864 1878, 2014. Koji Hukushima and Yukito Iba. Population annealing and its application to a spin glass. In AIP Conference Proceedings, volume 690, pp. 200 206. American Institute of Physics, 2003. Koji Hukushima and Koji Nemoto. Exchange monte carlo method and application to spin glass simulations. Journal of the Physical Society of Japan, 65(6):1604 1608, 1996. Kirill P Kalinin, George Mourgias-Alexandris, Hitesh Ballani, Natalia G Berloff, James H Clegg, Daniel Cletheroe, Christos Gkantsidis, Istvan Haller, Vassily Lyutsarev, Francesca Parmigiani, et al. Analog iterative machine (aim): using light to solve quadratic optimization problems with mixed variables. ar Xiv preprint ar Xiv:2304.12594, 2023. Sebastian C Kapfer and Werner Krauth. Irreversible local markov chains with rapid convergence towards equilibrium. Physical review letters, 119(24):240603, 2017. Scott Kirkpatrick, C Daniel Gelatt Jr, and Mario P Vecchi. Optimization by simulated annealing. science, 220(4598):671 680, 1983. DP Landau, Shan-Ho Tsai, and M Exler. A new approach to monte carlo simulations in statistical physics: Wang-landau sampling. American Journal of Physics, 72(10):1294 1302, 2004. Timothée Leleu, Yoshihisa Yamamoto, Peter L Mc Mahon, and Kazuyuki Aihara. Destabilization of local minima in analog spin systems by correction of amplitude heterogeneity. Physical review letters, 122(4):040607, 2019. Timothée Leleu, Farad Khoyratee, Timothée Levi, Ryan Hamerly, Takashi Kohno, and Kazuyuki Aihara. Scaling advantage of chaotic amplitude control for high-performance combinatorial optimization. Communications Physics, 4(1):266, 2021. Andrew Lucas. Ising formulations of many np problems. Frontiers in physics, 2:74887, 2014. Salvatore Mandrà and Munoz-Bauza. Pysa: Fast simulated annealing in native python. https://github.com/nasa/Py SA, 2024. Accessed: 2024-08-15. Naeimeh Mohseni, Peter L Mc Mahon, and Tim Byrnes. Ising machines as hardware solvers of combinatorial optimization problems. Nature Reviews Physics, 4(6):363 379, 2022. Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo, 2(11):2, 2011. Akihiko Nishimura, David B Dunson, and Jianfeng Lu. Discontinuous hamiltonian monte carlo for discrete parameters and discontinuous likelihoods. Biometrika, 107(2):365 380, 2020. Dilina Perera, Inimfon Akpabio, Firas Hamze, Salvatore Mandra, Nathan Rose, Maliheh Aramon, and Helmut G Katzgraber. Chook a comprehensive suite for generating binary optimization problems with planted solutions. ar Xiv preprint ar Xiv:2005.14344, 2020. Sam Reifenstein, Timothee Leleu, Timothy Mc Kenna, Marc Jankowski, Myoung-Gyun Suh, Edwin Ng, Farad Khoyratee, Zoltan Toroczkai, and Yoshihisa Yamamoto. Coherent sat solvers: a tutorial. Advances in Optics and Photonics, 15(2):385 441, 2023. Sam Reifenstein, Timothee Leleu, and Yoshihisa Yamamoto. Dynamic anisotropic smoothing for noisy derivative-free optimization. ar Xiv preprint ar Xiv:2405.01731, 2024. Gareth O Roberts and Jeffrey S Rosenthal. Optimal scaling of discrete approximations to langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1): 255 268, 1998. Published as a conference paper at ICLR 2025 Christoph Schönle, Marylou Gabrié, Tony Lelièvre, and Gabriel Stoltz. Sampling metastable systems using collective variables and jarzynski-crooks paths. ar Xiv preprint ar Xiv:2405.18160, 2024. Forrest Sheldon, Fabio L Traversa, and Massimiliano Di Ventra. Taming a nonconvex landscape with dynamical long-range order: Memcomputing ising benchmarks. Physical Review E, 100(5): 053311, 2019. Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70):1 57, 2018. Haoran Sun, Katayoon Goshvadi, Azade Nova, Dale Schuurmans, and Hanjun Dai. Revisiting sampling for combinatorial optimization. In International Conference on Machine Learning, pp. 32859 32874. PMLR, 2023. Nikola Surjanovic, Saifuddin Syed, Alexandre Bouchard-Côté, and Trevor Campbell. Parallel tempering with a variational reference. ar Xiv preprint ar Xiv:2206.00080, 2022. Joshua Tiosue. Qubovert: A python library for representing and solving qubo problems. https://github.com/jtiosue/QUBOvert, 2024. Accessed: 2024-05-22. Michalis K Titsias and Christopher Yau. The hamming ball sampler. Journal of the American Statistical Association, 112(520):1598 1611, 2017. Sri Krishna Vadlamani, Tianyao Patrick Xiao, and Eli Yablonovitch. Physics successfully implements lagrange multiplier optimization. Proceedings of the National Academy of Sciences, 117(43): 26639 26650, 2020. Thomas Vogel, Ying Wai Li, Thomas Wüst, and David P Landau. Generic, hierarchical framework for massively parallel wang-landau sampling. Physical review letters, 110(21):210603, 2013. Tatiana Xifara, Chris Sherlock, Samuel Livingstone, Simon Byrne, and Mark Girolami. Langevin diffusions and the metropolis-adjusted langevin algorithm. Statistics & Probability Letters, 91: 14 19, 2014. Yinyu Ye. Gset: A collection of graphs, 2024. URL https://web.stanford.edu/~yyye/ yyye/Gset/. Accessed: 2024-05-22. Ruqi Zhang, A Feder Cooper, and Christopher De Sa. Amagold: Amortized metropolis adjustment for efficient stochastic gradient mcmc. In International Conference on Artificial Intelligence and Statistics, pp. 2142 2152. PMLR, 2020. Yichuan Zhang, Zoubin Ghahramani, Amos J Storkey, and Charles Sutton. Continuous relaxations for discrete hamiltonian monte carlo. Advances in Neural Information Processing Systems, 25, 2012. Published as a conference paper at ICLR 2025 - APPENDIX - NON-EQUILIBRIUM DYNAMICS OF HYBRID CONTINUOUS-DISCRETE GROUND-STATE SAMPLING S1 ANALYSIS OF THE DETERMINISTIC PATH In the following section, we analyze the deterministic dynamical system defined as chaotic amplitude control with momentum (CACm) in the main manuscript. For simplicity, we do not take into account the effect of momentum and set γ = 0. S1.1 DYNAMIC PRE-CONDITIONING OF THE HESSIAN OF V The equation (1) of the main manuscript is a straightforward gradient descent of the potential V . However, the gradient is modulated by the auxiliary variables. In this section, we give more context about the motivation for using the auxiliary variables e and explain that their modulation can be interpreted as a dynamic pre-conditioning of the Hessian of V at the proximity of the discrete state σ. First, we show in Appendix section S1.2 that the variables u in the system of eqs. (1) and (2) can be eliminated by a change of variables. Then, the Jacobian matrix Jxx in the space x defined as Jxx = { xi xj }ij with xi = dx dt can be expressed as follows at proximity of the fixed points of the system (see Appendix section S1.3): Jxx = D[ αf( aσ)] D[e g( aσ)] D[eh( aσ)]H( aσ) (S1) where D[x] is a diagonal matrix with elements given by the components of the vector x and H( aσ) is the Hessian of V defined as the Jacobian of x V and calculated at the point aσ. The functions f and g and defined in Appendix section S1.2. Equation S1 shows that the auxiliary variables e modulates the Hessian H on the slow time scale such that H(t, τ) = D[e(τ)h(xi)]H with h(x) = 1 dψ dx > 0 for x ] 1, 1[ and ψ(x) = ϕ 1(x) βeff (see lemma 1 in Appendix section S1.3). Given that the fixed points of eqs. (1) and (2) are given as ei = α a 1 σi V xi ( aσi), we can interpret the role of e as a normalization of each row of the Hessian by the inverse of the gradient in the corresponding direction (see theorem 1 in Appendix section S1.3). Because the auxiliary variables e remain positive, they do not change the inertia (number of positive, negative, and zero eigenvalues) of the Hessian H (see lemma 2 in Appendix section S1.4) but can change its directions and eigenvalues. Moreover, we show in the theorem 2 of in Appendix section S1.5 that this dynamical system can exhibit period cycles and chaos when the target amplitude a is chosen to be sufficiently small. In summary, the role of the auxiliary variables is to act as a pre-conditioning of the Hessian and introduce entropy in the search dynamics due to its chaotic dynamics. Similar dynamics have been shown to be competitive with state-of-the-art heuristics to solve Ising problems (Leleu et al., 2019; 2021). S1.2 CHANGE OF VARIABLE Consider xi = ϕ(2βui) = ui = ψβ(xi) with ψβ(xi) = ϕ 1(xi) In the case ϕ(x) = 2 1+e x 1, we have ϕ 1(x) = ln( 2 x+1 1), ψ β(x) = 1 β(x+1)(x 1) with ψ β(x) = ψβ(x) x . Moreover, ψ β(x) is positive on ] 1, 1[ and 1 ψ β(x) 0 for |x| 0. Because dui dt = ψ β(xi) dxi dt with ψ β = 1 β ϕ 1(xi) xi , the equations of motion can be rewritten by eliminating ui ( i {1, , N}) as follows: dt = αψβ(xi) ψ β(xi) ei ψ β(xi) x V, (S2) dt = ξ(x2 i a)ei, (S3) Published as a conference paper at ICLR 2025 S1.3 JACOBIAN MATRIX AT THE FIXED POINTS The fixed points dxi dt = 0 and dei dt = 0 of the dynamical system described by eqs. S2 and S3 are given as follows: xi = σi a, (S4) ei = αψβ(σi a) V xi (σ a) . (S5) Moreover, the Jacobian matrix J at the fixed points is defined as: J = Jxx Jxe Jex Jee Jxx = D[ αf( aσ)] D[e g( aσ)] D[e h( aσ)]H( aσ), (S7) V xi ( aσi) ψ β( aσi) }i], (S8) Jex = 2ξ a D[σ e], (S9) Jee = 0, (S10) with H(x) = J[ x V ] is the Hessian of V in x without its diagonal elements. The notation D[x] denote the diagonal matrix with diagonal elements given by the components of the vector x. The functions f, g, and h of eq. S7 are defined as follows: f(xi) = ψβ ψ β(xi) xi , (S11) g(xi) = 1 ψ β(xi) V xi xi , (S12) h(xi) = 1 ψ β(xi) > 0 for xi ] 1, 1[. (S13) In the case of ϕ(x) = 2 1+e x 1, we have: f(xi) = 2 + xiln( 1 + xi 1 + xi ), (S14) g(xi) = βxi V xi β(xi + 1)(xi 1) 2V x2 i , (S15) h(xi) = β(xi + 1)(xi 1). (S16) Note that f is an even function with f(x) = f( x). In the case ξ 0, the dynamics of ei variables becomes much slower than that of xi. We note τ the slower time scale of ei with τ = tξ. The lemma 1 is obtained using eq. S7. Lemma 1: When ξ 0, the Hessian H of V is modulated on a slow time scale τ with H = D[e g( aσ)]H as proximity of the fixed points x = aσ. We call H the effective Hessian matrix. Published as a conference paper at ICLR 2025 S1.4 EFFECTIVE HESSIAN Stability of the fixed points is determined by the eigenvalues of the Jacobian matrix J. The blocks Jex and Jxe are such that (see also (Leleu et al., 2019)): Jex Jxe = b I, (S17) with I the identity matrix and b = 2ξα a ψβ( a) ψ β( a) and Jee = 0. Consequently, the determinant P(λ) = |J λI| characterizing the eigenvalues of J has the following property: P(λ) = Jxx λI Jxe Jex λI = |Jxx λI|λ Jxe Jex = |Jxx λI|λ b I. (S18) If we note µi the eigenvalues of Jxx, the characteristic polynomial can be rewritten: P(λ) = (µ λ)λ b I = λ2 µλ b. (S19) The eigenvalues λj can come in pairs λ+ j and λ j depending on the sign of j = µ2 j + 4b and are expressed as follows by solving P(λ) = 0 ( j {1, , N}): λ j = (µj p 2 if j > 0, (S20) λ j = (µj i p 2 if j < 0. (S21) We define the pre-conditioned Hessian matrix H as H( aσ) = D[ σi V xi (σi a)]H( aσi) such that : D[eih( aσi)]H( aσ) = D[ α 1 V xi (σi a) ψβ(σi a) ψ β(σi a)]H( aσ), (S22) ψ β( a)D[ σi V xi (σi a)]H( aσ), (S23) ψ β( a) H( aσ), (S24) The last step uses the fact that ψβ(x) ψ β(x) is an odd function. Note that the factors σi V xi (σi a), i, multiplying the Hessian H are positive at local minima satisfying V xi (σi a)σi > 0, i. In this case, the eigenvalues of D[ σi V xi (σi a)]H( aσi) are the same as D[ σi V xi (σi a)]1/2H( aσi)D[ σi V xi (σi a)]1/2. By Sylvester s law of matrix inertia, this implies that the eigenvalues of H and H have the same signs. In matrix theory, the term inertia" refers to the number of positive, negative, and zero eigenvalues of a real symmetric matrix. Lemma 2: The signs of the eigenvalues of H and H are the same when σi V xi (σi a) > 0, i. Theorem 1: The effect of error variables at proximity of fixed points is to replace the Hessian H of V by an effective Hessian H with H( aσ) = D[ σi V xi (σi a)]H( aσi). The inertia of the Hessian and effective Hessian are the same for local minima of V verifying the condition σi V xi (σi a) > 0, i. Published as a conference paper at ICLR 2025 S1.5 EIGENVALUES OF H AND STABILITY OF THE FIXED POINTS The eigenvalues of the Jacobian Jxx, noted µj, can be expressed using the eigen spectrum of the pre-conditioned Hessian H, noted γj, as follows using eq. S7: µj = αf( aσi) eig( aσi) + αψβ( a) ψ β( a)γj. (S25) The fixed points become stable when the real parts of eigenvalues are all negative, which is verified when the eigenvalue with maximal real part becomes equal to 0 given by the following condition using eqs. S20 and S21: maxj(Re[µj q µ2 j + 4b]) = 0 if µ2 j 4b > 0, (S26) maxj(Re[µj i q µ2 j 4b]) = 0 if µ2 j 4b < 0. (S27) Noting γ the eigenvalue with maximal part Re[γj], eqs. S26 and S26 imply that the fixed point σi a becomes stable under the condition (see eq. S21): µ = 0, (S28) i.e., αf( aσi) eig( aσi) γ = 0. (S29) Assuming 2V x2 i = 0, we have (see eqs. S3 and S12): eig(xi) = ψ β(xi) (ψ β(xi))2 V xi V xi , (S30) = α ψ β(xi) (ψ β(xi))2 ψβ(xi). (S31) Moreover, we have (see eq. S11): f(xi) = ψβ ψ β(xi) xi (S32) = 2 ψβψ β (ψ β(xi))2 . (S33) Combining eqs. S31 and S33 , eq. S29 can be written as follows: γ( aσ) = 2 ψ β( a) ψβ( a). (S34) Theorem 2: For small values of the target amplitude a such that γ( aσ) > 2 ψ β( a) ψβ( a) where γ( aσ) is the eigenvalue of H( aσ) with H( aσ) = D[ σi V xi (σi a)]H( aσi), the system described by eqs. S2 and S3 does not have any stable fixed points assuming 2V Published as a conference paper at ICLR 2025 In general settings where the system is bounded, Theorem 2 implies that the dynamical system of eqs. S2 and S3 settles to either a periodic (quasi periodic) or chaotic attractor for sufficiently large runtime T. In the case of the Ising Hamiltonian V (x) = 1 ij ωijxjxi, the calculations are simplified because V xi (σi a) = a V xi (σi) and V (σ a) = a V (σ) and eq. S34 can be written: γ(σ) = 2 a ψ β( a) ψβ( a), (S35) where γ is independent of a. Published as a conference paper at ICLR 2025 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 =F( a) ,a=0.8 ,a=0.6 0 20 40 60 80 100 t 0 20 40 60 80 100 t 0 20 40 60 80 100 t 0 20 40 60 80 100 t Figure S1: (a) Bifurcation diagram of the system of eqs. S2 and S3 in the case of the V (x) = 1 ij ωijσiσj with N = 8 and ωij is randomly chosen to be equal to 1 or 1 for j > i. ωij = ωji and ωii = 0. The red curve corresponds to the set { a, F( a)} with a > 0 which denote the condition under which the real part of the eigenvalues µ of the Jacobian matrix Jxx is zero. The + and markers correspond to the maximum real part γ of the eigenvalues of D[ σi V xi (σi a)]H( aσi) for the x-axis and the value of a = 0.6 and a = 0.8, respectively. (b) and (c) Time series of soft spins xi vs. time in the case a = 0.8 and a = 0.6, respectively. (d) and (e) The same as (b) and (c) for the error variables ei. In Fig. S1, the stability of the dynamics described in eqs. S2 and S3 when V is Ising Hamiltonian with N = 8 is shown by the bifurcation diagram in the space {F( a), a}. The + and markers correspond to the position real part of the eigenvalue of γ of D[ σi V xi (σi a)]H( aσi) for the x-axis and the value of a = 0.6 and a = 0.8, respectively. For a = 0.8, the point { γ, a} is above the set defined by { a, F( a)} with a > 0 (shown in red). In this regime, the system settles to a stable fixed point as shown in Fig. S1 (b) and (d). For a = 0.6, the point { γ, a} is below the red curve. In this regime, the system does not possess any fixed points and the limit behavior is a periodic cycle. Published as a conference paper at ICLR 2025 S2 PSEUDO-CODE OF MHCACM The Pseudo-code of Metropolis-Hastings adjusted chaotic amplitude control with momentum (MHCACm) is shown in this section. Algorithm 1 Metropolis-Hastings adjusted chaotic amplitude control with momentum 1: function DETERMINISTIC PATH(σ) Iterate deterministic n steps of CACm 2: u 0 3: u 0 4: u 0 5: e 1 6: y σ Initialize 7: for t 0 to n do 8: u u 9: u u 10: u u αu + e Vx(y) + γ(u u ) Euler steps 11: e e ξ(y y a) e 12: e e Normalize e 13: y 2 1+e 2βeffu 1 14: end for 15: y 2 1+e 2βeffu/(αe) 1 16: return y 17: end function 18: function PROBABILISTIC JUMP(y) Generate random jump 19: τ = 1 with probability y 1 2 , τ = 1 otherwise 20: return τ 21: end function 22: function METROPOLIS-HASTINGS STEP(τ, σ, y, x) Metropolis-Hastings step 23: A(τ/σ) min(1, P (τ)Q(σ/τ) P (σ)Q(τ/σ)) = F(τ, σ, y, x) Function F defined in eq. (9) 24: B = TRUE with probability A(τ/σ), B = FALSE otherwise 25: if B then 26: y x Accept the new configuration τ 27: σ τ 28: end if 29: return τ, y 30: end function 31: procedure SAMPLE (V ) 32: y U(0, 1) Random initialization 33: σ PROBABILISTIC JUMP (y) Initial configuration 34: for k 0 to K do Generate K samples 35: τ PROBABILISTIC JUMP (y) New candidate configuration τ 36: x DETERMINISTIC PATH (τ) 37: σ, y METROPOLIS-HASTINGS STEP (τ, σ, y, x) 38: end for 39: end procedure Published as a conference paper at ICLR 2025 S3 BOLTZMANN SAMPLING S3.1 VERIFICATION OF BOLTZMANN SAMPLING AT EQUILIBRIUM To test our Markov chain approach, we sample from the Boltzmann distribution of the Ising Hamiltonian V (σ) = 1 ij ωijσiσj. We use the Wishart planted ensemble (WPE) (Hamze et al., 2020) of size N = 18 spins to generate the Ising coupling matrix. Using the WPE to generate instances of the Ising Hamiltonian allows us to tune the complexity of the energy landscape and also know the ground state a-priori (Hamze et al., 2020). Additionally, these instances are known to exhibit properties of NP-Hard problems making the ground state difficult to find by heuristic algorithms (Hamze et al., 2020). Because we use a small problem size (N = 18) the exact Boltzmann distribution can be determined by brute force. The sampled distribution obtained by MHCACm is compared to the exact distribution at inverse temperature β = 0.07 by calculating the symmetric KL divergence (see Appendix section S3.2). Note that CACm (without the Metropolis-Hastings step) does not sample from the target Boltzmann distribution (see Fig. S2 a). In contrast, MHCACm exhibits exactly the same distribution of energy as the Boltzmann distribution at thermal equilibrium. This is accomplished while retaining the global exploration properties of CACm, making it a potent hybrid heuristic for identifying lower energy states of non-convex energy functions. In order to evaluate the sampling characteristics of the constructed Markov chain, we aim at drawing samples from the Boltzmann distribution of the Ising Hamiltonian V (σ) = 1 ij ωijσiσj for a Wishart planted instance (Hamze et al., 2020) with a size of N = 18 spins at an inverse temperature of β. 0.2 0.4 0.6 = eff/ KL divergence CACm, =0.09 MHCACm, =0.09 CACm, =0.11 MHCACm, =0.11 4 2 0 2 H 104 CACm MHCACm (proposed) Exact Gibbs distribution 0.02 0.03 0.04 0.05 effective inv. temperature * eff inv. temperature CACm MHCACm Figure S2: (a) Exact Boltzmann distribution (green dotted line), sampled distribution using CACm (without MH step, see red line), and MHCACm (without MH step, see blue line). Gaussian filtering of the energy distribution is applied for the sake of readability. β = 0.07. (b) KL divergence vs. ratio of inverse temperatures κ = βeff β . (c) Effective inverse temperature β eff temperature at which the KL divergence is minimized vs. β. The states used for the sampled distribution are boolean configurations accepted by the MH criterion. In Figs. S2 (b) is shown that the KL divergence is minimized for a ratio of inverse temperatures κ = βeff β . Note that βeff is the effective temperature defined in eq. 2 of the main manuscript. The effective inverse temperature βeff at which the KL divergence is minimized temperature in noted β eff. Figure S2 (c) shows that β eff is a linear function of β within the range of parameters considered. Moreover, the addition of the MH step to MHCACm increases the similarity between β and β compared to that of CACm. S3.2 SIMULATIONS FOR SMALL N In Fig. S2, the distance to the Boltzmann distribution P(σ) = eβV (σ) σ eβV (σ)) is calculated using the symmetric KL divergence. The symmetric KL divergence DSKL(P Q) between the distributions P and Q is defined as follows: Published as a conference paper at ICLR 2025 DSKL(P Q) = 1 2(DKL(P Q) + DKL(Q P)), (S36) DKL(P Q) = X i P(i) log P(i) Q(i). (S37) In the case of Fig. S2, the convergence of the MCMC is measured by the KL divergence between the exact Boltzmann distribution of energy levels obtained by brute force calculations and the estimated distribution of energy levels obtained by simulation of the Metropolis-Hastings adjusted chaotic discrete space sampling. The former computation is possible because the problems size N = 18 is sufficiently small for the brute force search method. The instance used for Fig. 2 is a randomly generated instance from the Wishart planted set with hardness parameter αWPE = 0.8 (Perera et al., 2020). S3.3 SIMULATIONS FOR LARGE N In order to test the convergence of the MCMC in the case of larger N, we can utilize a more heuristic metric. Conventional Gibbs sampling algorithm is run for a sufficiently large number of steps for the chain to converge. In the case of N = 60 and αWPE = 0.8, numerical simulations indicate that the convergence occurs within about 1000 time step of the Gibbs sampling algorithm. To observe the convergence, we limit the characterization of the Boltzmann distribution to its first two moments: the mean energy < H > with < H >= P σ H(σ)P(σ) and standard deviation of the energy p < (H < H >)2 >. Figure S3 shows the convergence of the mean and standard deviation of the energy under Gibbs sampling. 0 200 400 600 800 1000 Monte Carlo steps 0 200 400 600 800 1000 Monte Carlo steps <(H )2 > Figure S3: Convergence of the mean and standard deviation of energy distribution of a N = 60 Wishart planted instance with αWPE = 0.8 under Gibbs sampling. Approximate convergence is reached after 1000 Monte Carlo steps. β = 2.5. The distance D between the mean and variance of the distribution obtained by the proposed Metropolis Hastings adjusted chaotic discrete space sampling and the one obtained by iterating the Gibbs sampling algorithm is then calculated as follows: D = | < Hchaos > < HGibbs > | < (Hchaos < Hchaos >)2 > p < (HGibbs < HGibbs >)2 >|, (S38) Published as a conference paper at ICLR 2025 where the subscript denote the cases of Gibbs and MH chaotic sampling, respectively. In Fig. S4 is shown distance D vs. the number of steps T in the algorithm. One step corresponds to one tentative spin flip and one Euler step in the Gibbs sampling and chaotic sampling case, respectively. The proposed sampling method exhibits a faster decrease of D indicating a quicker relaxation to the steady-state distribution. The use of auxiliary variables speed up the convergence compared to the case without. 100 101 102 103 Distance D (mean and variance) Proposed >0 Proposed =0 Gibbs sampling Figure S4: Distance D between the mean and variance of the distribution obtained by the proposed Metropolis-Hastings adjusted chaotic discrete space sampling and the one obtained by iterating the Gibbs sampling algorithm (see eq. S36) vs. the number of steps T in the algorithm. One step corresponds to one tentative spin flip and one Euler step in the Gibbs sampling and chaotic sampling case, respectively. For N larger than N = 60, measuring the convergence of the distribution to its steady state becomes non-trivial because of strong broken ergodicity (Bernaschi et al., 2020). Numerical simulation of the convergence to the steady state distribution in this case falls out of the scope of this paper. Note that the convergence to the steady-state distribution can be further accelerated by considering annealing of temperature or the use of the replica exchange Monte Carlo method in both cases of Gibbs sampling and proposed chaotic sampling. Published as a conference paper at ICLR 2025 S4 AUTOTUNING USING DYNAMIC ANISOTROPIC SMOOTHING The parameters of the solvers are described in Table S1. We have run the autotuning algorithm called dynamic anisotropic smoothing (DAS) (Reifenstein et al., 2024) for each problem setting (problem size N, bias b). The MH sampling rate is set to f MH = 0.1. Table S1: Parameters of the algorithms tuned using DAS Algorithm number of parameters parameter list SA 3 βini, βfin, T AIM 4 β, α, γ, T CAC 5 β, α, ξ, a, T CACm 6 β, α, ξ, a, γ, T HMCACm 7 β, κ, α, ξ, a, γ, T An example of the the evolution of parameters during the DAS autotuning is shown in Fig. S5. Figure S5: Evolution of parameters during the DAS autotuning for MHCACm. N = 100, b = 12, T = 1000. Here lamb denotes α and kappa is βeff Published as a conference paper at ICLR 2025 S5 CONTRUCTION OF DEGENERATE WISHART PLANTED ENSEMBLE The Wishart planted ensemble instances are generated by first choosing a planted ground state g { 1, 1}N. However, in our case we consider a set of planted ground states which are represented as columns in an N by D matrix G { 1, 1}N D. Let ˆG be a matrix with the same columns space as G whose columns are orthonormal. Then we generate an N by M matrix ˆW of i.i.d. Gaussian random variables. This matrix is then transformed so that the columns are orthogonal to all of the ground states to form the matrix given as follows: W = (I ˆG ˆG ) ˆW. The Ising coupling matrix is constructed as ˆJ = WW with the diagonal elements being set to zero. We can see that every column of G is a ground state since x ˆJx 0 and G ˆJG = 0. Note that in the Wishart planted ensemble, the parameter αWPE = M N where N is the number rows and M is the number of columns, quantifies the aspect ratio of the Wishart matrix used to define the interaction strengths of the Ising model. This parameter plays a central role in determining the ruggedness of the energy landscape. To add bias to the energy landscape, we want to somehow create the instance such that configurations close to one ground state tend to have a larger energy than the other ground state even though both ground states have the same energy. To do this, we can consider W = (I ˆG ˆG )A ˆW where A is a symmetric matrix and ˆW is the random matrix defined above. In this work we consider A to be defined as A = I + b N 11 where b is the "bias" away from the ferromagnetic solution. The point of this is that the residual energy (energy away from the ground state) will be amplified for states that are close to the ferromagnetic solution. Then, when choosing the planted ground state we can choose one which is close to the ferromagnetic solution and one that is far creating bias in the search towards one ground state over another. For the numerical results in this work we consider one ground state with a hamming distance of 3 from the ferromagnetic solution and one ground state chosen completely randomly from { 1, 1}N. A Gauge transform is applied to obfuscate the ground states at the end of this process. S6 COMPARISON TO RECENTLY PROPOSED SAMPLING ALGORITHM In this section, we compare the proposed algorithm to the sampling method described in (Sun et al., 2023). We have implemented the algorithm of (Sun et al., 2023), that we denote by the acronym i SCO, and compare it against MHCACm in table S2. Success probability is higher using MHCACm. p0 (N=60) i SCO MHCACm b=0 (no bias) 0.0003 0.0008 b=12 (bias) 0.51 0.15 Table S2: Probability p0 to find any ground-state from unbiased (b = 0) and biased (b = 12) Wishart planted instances using i SCO and MHCACm after T = 500 time steps. N = 60. S7 COMPUTING ENVIRONMENT All simulations are done in CPU using a Intel Core i9-11950H, 8 cores, 2.6 GHz and GPU using a NVIDIA RTX A300; except for Table 3 of the main manuscript for which the GPU used is a Nvidia V100 and the CPU wall time is calculated on a 2024 Mac Book Pro (Apple M3 Chip). The code is written in python and pytorch.