# Optimization Algorithm Design via Electric Circuits

Stephen P. Boyd (Stanford University) · Tetiana Parshakova (Stanford University) · Ernest K. Ryu (UCLA) · Jaewook J. Suh (Rice University)

Lead authors (author list ordered alphabetically). 38th Conference on Neural Information Processing Systems (NeurIPS 2024).

We present a novel methodology for convex optimization algorithm design using ideas from electric RLC circuits. Given an optimization problem, the first stage of the methodology is to design an appropriate electric circuit whose continuous-time dynamics converge to the solution of the optimization problem at hand. The second stage is an automated, computer-assisted discretization of the continuous-time dynamics, yielding a provably convergent discrete-time algorithm. Our methodology recovers many classical (distributed) optimization algorithms and enables users to quickly design and explore a wide range of new algorithms with convergence guarantees.

1 Introduction

In the classical optimization literature, algorithms are designed with the goal of establishing fast worst-case convergence guarantees. However, these methods, designed within the pessimistic framework of worst-case analysis, often exhibit slow practical performance. In the modern machine learning literature, optimizers are designed with the goal of obtaining fast empirical performance on a set of practical problems of interest. However, these methods, designed without consideration of the feasibility of a convergence analysis, tend to be much more difficult to analyze theoretically, and they sometimes even fail to converge under nice idealized assumptions such as convexity [92, 131].

In this work, we present a novel methodology for convex optimization algorithm design using ideas from electric RLC (resistor-inductor-capacitor) circuits and the performance estimation problem [54, 150]. (To clarify, our proposal does not involve building a physical circuit.) Specifically, our methodology provides a quick and systematic recipe for designing new, provably convergent optimization algorithms, including distributed optimization algorithms. The ease of the methodology enables users to quickly explore a wide range of algorithms with convergence guarantees.

Optimization problem formulation. We consider the standard-form optimization problem

$$\text{minimize}\;\; f(x) \quad \text{subject to}\;\; x \in \mathcal{R}(E^\top), \qquad (1)$$

where $x \in \mathbb{R}^m$ is the optimization variable, $f\colon \mathbb{R}^m \to \mathbb{R}\cup\{\infty\}$ is closed, convex, and proper, and $E \in \mathbb{R}^{n\times m}$. Assume we have $n$ nets $N_1, \dots, N_n$ forming a partition of $\{1, \dots, m\}$. More specifically, we let $E \in \mathbb{R}^{n\times m}$ be a selection matrix defined as

$$E_{ij} = \begin{cases} +1 & \text{if } j \in N_i \\ 0 & \text{otherwise.}\end{cases} \qquad (2)$$

Our goal is to find a primal-dual solution satisfying the KKT conditions [126, Theorem 28.3]

$$y \in \partial f(x), \qquad x \in \mathcal{R}(E^\top), \qquad y \in \mathcal{N}(E). \qquad (3)$$

As we show through examples, this standard-form problem (1) conveniently models many optimization problem setups of practical interest.

In the analysis and design of optimization algorithms, a standard approach is to consider a continuous-time model of a given algorithm, corresponding to the limit of small stepsizes [124, 79, 6, 156, 143, 93, 86, 137]. Our work is based on the key observation that such continuous-time models can be interpreted as RLC circuits connected to the subdifferential operator $\partial f$, which we interpret as a nonlinear resistor.
We expand on this observation and propose a general methodology for designing optimization algorithms by designing RLC circuits that relax to the nets defined by $E$.

Example. Problem (1) represents a general form of distributed optimization, where the constraints enforce consensus among the primal variables. An example is the so-called consensus problem [30, 7]

$$\begin{array}{ll}
\underset{x_1,\dots,x_N \in \mathbb{R}^{m/N}}{\text{minimize}} & f_1(x_1) + \cdots + f_N(x_N)\\
\text{subject to} & x_1 = \cdots = x_N,
\end{array}$$

where $x = (x_1, \dots, x_N)$ is the decision variable, the objective function $f(x) = f_1(x_1) + \cdots + f_N(x_N)$ is block-separable, and $E = [I \;\; \cdots \;\; I] \in \mathbb{R}^{(m/N)\times m}$. Refer to Sections E and F of the appendix for an overview of classical splitting methods and decentralized methods for solving (1).

1.1 Preliminaries and Notation

We generally follow the standard definitions and notation of convex optimization [31, 118, 23, 119, 129]. Consider the extended-valued function $f\colon \mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$. We say $f$ is closed if its epigraph is a closed set in $\mathbb{R}^{n+1}$ and proper if its value is finite somewhere. We say $f$ is CCP if it is closed, convex, and proper. For $R > 0$, we say $f$ is $R$-smooth if $f$ is finite and differentiable everywhere and $\|\nabla f(x) - \nabla f(y)\| \le R\|x - y\|$ for all $x, y \in \mathbb{R}^n$. For $\mu > 0$, we say $f$ is $\mu$-strongly convex if $f(x) - (\mu/2)\|x\|^2$ is convex. Let $f^*(y) = \sup_{x\in\mathbb{R}^n}\{\langle y, x\rangle - f(x)\}$ denote the Fenchel conjugate of $f$. For $R > 0$ and a CCP $f$, define the $R$-Moreau envelope of $f$ as

$${}^{R}\!f(x) = \inf_{z\in\mathbb{R}^n}\Big\{ f(z) + \tfrac{1}{2R}\|z - x\|^2 \Big\}.$$

One can show [23, Proposition 13.24] that the $R$-Moreau envelope satisfies ${}^{R}\!f = \big(f^* + \tfrac{R}{2}\|\cdot\|^2\big)^*$. If $f$ is $1/R$-smooth, we can define [23, Theorem 18.15] the $R$-pre-Moreau envelope $\tilde f$ of $f$, which is the convex function such that ${}^{R}\tilde f = f$. Due to the limited space, we defer the review of prior works to Section A of the appendix.

1.2 Contributions

Our work presents two technical novelties, one in continuous time and the other in discrete time. The first is the observation that many standard optimization algorithms can be interpreted as discretizations of electric RLC circuits connected to the subdifferential operator $\partial f$. The second is the use of the performance estimation problem to obtain an automated recipe for discretizing convergent continuous-time dynamics into convergent discrete-time algorithms, and we provide code implementing our automatic discretization methodology. By combining these two insights, we provide a quick and systematic methodology for designing new, provably convergent optimization algorithms, including distributed optimization algorithms. We provide an open-source package that implements automatic discretization of our circuits: https://github.com/cvxgrp/optimization_via_circuits

2 Continuous-time optimization with circuits

2.1 Interconnects

We now describe two types of electric circuits that we call static and dynamic interconnects. Both interconnects have $m$ terminals, and we will later connect them to the $m$ inputs of $\partial f$.

Static interconnect. The static interconnect is a set of (ideal) wires connecting $m$ terminals and forming $n$ nets. See Figure 1 for an example. Let $x \in \mathbb{R}^m$ be a vector of terminal potentials and $y \in \mathbb{R}^m$ be a vector of currents leaving the terminals. Using the matrix $E \in \mathbb{R}^{n\times m}$ as defined in (2), we can express Kirchhoff's voltage law (KVL) as $x \in \mathcal{R}(E^\top)$ and Kirchhoff's current law (KCL) as $y \in \mathcal{N}(E)$.

Figure 1: Example of a static interconnect, $m = 5$, $N_1 = \{1, 3\}$, $N_2 = \{2, 4\}$, $N_3 = \{5\}$.

In other words, the static interconnect enforces the V-I relationship

$$(x, y) \in \mathcal{R}(E^\top) \times \mathcal{N}(E). \qquad (4)$$
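To make the roles of $E$, KVL, and KCL concrete, the following small numerical sketch builds the selection matrix of (2) for the nets of Figure 1 and checks a candidate $(x, y)$ against the V-I relationship (4). It is purely illustrative (0-indexed terminals, arbitrary numbers) and is not part of the methodology itself.

```python
import numpy as np

# Selection matrix E of (2) for the static interconnect of Figure 1:
# m = 5 terminals partitioned into nets N1 = {1, 3}, N2 = {2, 4}, N3 = {5}
# (1-indexed in the text; 0-indexed here).
m, nets = 5, [[0, 2], [1, 3], [4]]
E = np.zeros((len(nets), m))
for i, net in enumerate(nets):
    E[i, net] = 1.0

# The static interconnect enforces (4): x in R(E^T) (equal potentials within
# each net) and y in N(E) (currents within each net sum to zero).
x = E.T @ np.array([3.0, -1.0, 2.0])        # potentials: one value per net
y = np.array([0.5, 1.0, -0.5, -1.0, 0.0])   # currents: sum to zero on each net
print(np.allclose(E @ y, 0))                # True: y in N(E)
print(np.linalg.matrix_rank(np.hstack([E.T, x[:, None]])) == np.linalg.matrix_rank(E.T))  # True: x in R(E^T)
```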
Dynamic interconnect. The dynamic interconnect is an RLC circuit with $m$ terminals and one ground node. We assume all inductances and capacitances have values in $(0, \infty)$ while the resistances have values in $[0, \infty)$. (A 0-ohm resistor is an ideal wire. We do not permit loops of ideal wires.) Each RLC component has two (scalar-valued) terminals: the $+$ and $-$ terminals. Denote the number of nodes in the RLC circuit by $\tau$. Connect nodes $1, 2, \dots, m$ to terminals $1, 2, \dots, m$, and let the last node, node $\tau$, be the ground node. (This implies $\tau \ge m + 1$.) Denote the number of RLC components by $\sigma$. We describe the topology with a reduced node incidence matrix (with the bottom row corresponding to the ground node removed) $A \in \mathbb{R}^{(\tau-1)\times\sigma}$ defined as

$$A_{ij} = \begin{cases} +1 & \text{if node } i \text{ connects to the } + \text{ terminal of component } j\\ -1 & \text{if node } i \text{ connects to the } - \text{ terminal of component } j\\ 0 & \text{otherwise.}\end{cases}$$

See Figure 2 for an example.

Figure 2: Example of a dynamic interconnect with $\tau = 8$ nodes, $\sigma = 7$ RLC components ($R_1$, $R_2$, $R_3$, $L_1$, $L_2$, $C_1$, $C_2$), $m = 5$ terminals, and one ground node. The reduced node incidence matrix $A$ is shown in the figure. ($R_2$ and $R_3$ are 0-ohm resistors.) This dynamic interconnect is admissible with respect to the static interconnect of Figure 1.

The ground node is designated to have 0 potential, and the potential of any node is the potential relative to ground. The voltage across a component is the difference of potentials between the $+$ and $-$ terminals. The current through a component is defined as the current flowing from the $+$ terminal to the $-$ terminal. Let $x \in \mathbb{R}^m$ be the potentials at the $m$ terminals, which are connected to nodes $1, \dots, m$, and let $y \in \mathbb{R}^m$ be the currents leaving the terminals. Denote the node potential vector with the ground node excluded (since the potential at ground is 0) by $(x, e)$, so $e \in \mathbb{R}^{\tau-1-m}$ denotes the potentials at the non-terminal nodes. Denote the vector of voltages by $v \in \mathbb{R}^\sigma$ and the vector of currents by $i \in \mathbb{R}^\sigma$. Then, the currents and voltages of the dynamic interconnect satisfy the following V-I relations:

(i) $Ai = \begin{bmatrix} -y \\ 0 \end{bmatrix}$ (KCL)
(ii) $v = A^\top \begin{bmatrix} x \\ e \end{bmatrix}$
(iii) $v_R = D_R i_R$ (Resistor)
(iv) $v_L = D_L \frac{d}{dt} i_L$ (Inductor)
(v) $i_C = D_C \frac{d}{dt} v_C$ (Capacitor)

where $D_R$, $D_L$, and $D_C$ are diagonal matrices with, respectively, the resistance, inductance, and capacitance values on their diagonals.

Admissibility. When an RLC circuit reaches equilibrium, the voltages across inductors and the currents through capacitors are 0. We say a dynamic interconnect is admissible if it relaxes to the static interconnect at equilibrium. Mathematically, this condition is expressed as

$$\Big\{(x, y)\;\Big|\; Ai = \begin{bmatrix} -y\\0\end{bmatrix},\; v = A^\top \begin{bmatrix} x\\e\end{bmatrix},\; v_R = D_R i_R,\; v_L = 0,\; i_C = 0 \Big\} = \mathcal{R}(E^\top) \times \mathcal{N}(E).$$

As an example, the dynamic interconnect of Figure 2 is admissible with respect to the static interconnect of Figure 1.
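As a sanity check of the admissibility condition, consider the simplest dynamic interconnect one can write down by hand (not the circuit of Figure 2, whose incidence matrix is given in the figure): two terminals forming a single net, $E = [1\;\;1]$, connected by a single inductor, with node 3 as the (unused) ground node. Because this toy circuit has no resistors and no internal nodes, the equilibrium relations reduce to $A^\top x = v_L = 0$ and $y = -A\,i_L$, so admissibility amounts to two subspace identities, which the sketch below verifies numerically. Everything here is an illustrative assumption.

```python
import numpy as np

# Toy admissibility check. Static interconnect: one net N1 = {1, 2}, so E = [[1, 1]].
# Dynamic interconnect: a single inductor between terminal 1 (+) and terminal 2 (-);
# node 3 is the ground node. A is the reduced node incidence matrix (ground row dropped).
E = np.array([[1.0, 1.0]])
A = np.array([[+1.0],    # node 1 is the + terminal of the inductor
              [-1.0]])   # node 2 is the - terminal of the inductor

def same_subspace(B1, B2, tol=1e-10):
    """Return True if the column spaces of B1 and B2 coincide."""
    r = np.linalg.matrix_rank
    return r(B1, tol) == r(B2, tol) == r(np.hstack([B1, B2]), tol)

def null_basis(M):
    """Columns form a basis of the null space of M."""
    _, s, Vh = np.linalg.svd(M)
    return Vh[np.linalg.matrix_rank(M):].T

# At equilibrium v_L = 0, so terminal potentials satisfy A^T x = 0, i.e. x ranges over
# null(A^T); terminal currents satisfy y = -A i_L, i.e. y ranges over range(A).
# Admissibility asks that these sets equal R(E^T) and N(E), respectively.
print("x-side:", same_subspace(null_basis(A.T), E.T))   # null(A^T) == R(E^T) ?
print("y-side:", same_subspace(A, null_basis(E)))       # range(A) == N(E)   ?
```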
2.2 Composing interconnects with $\partial f$

We view the subdifferential operator $\partial f$ as an $m$-terminal electric device that is also grounded. Let $x \in \mathbb{R}^m$ be the potentials at the $m$ terminals (excluding ground) and $y \in \mathbb{R}^m$ be the currents flowing into the $m$ terminals. The $\partial f$ operator enforces the V-I relation $y \in \partial f(x)$. We connect the $m$ terminals of $\partial f$ to the $m$ terminals of the static and dynamic interconnects.

Immediately, connecting the static interconnect with $\partial f$ enforces the V-I relations (4) and $y \in \partial f(x)$, which combine to be the optimality condition (3). Therefore, the vector of potentials at the $m$ terminals is an optimal $x \in \mathbb{R}^m$ solving (1). To clarify, connecting the static interconnect with $\partial f$ leads to a static circuit in the sense that the potential $x$ and the current $y$ do not depend on time.

Figure 3: The static interconnect of Figure 1 connected with $\partial f$. The potentials at the $m$ terminals form an optimal $x \in \mathbb{R}^m$ solving (1).

Next, we compose (connect) the dynamic interconnect with $\partial f$. Due to the capacitors and inductors, this circuit is dynamic in the sense that the voltages $v(t)$ and $x(t)$ and the currents $i(t)$ and $y(t)$ depend on time, although we often omit explicitly writing the $t$-dependence for notational convenience. The V-I relations of the dynamic interconnect combined with $y \in \partial f(x)$ lead to the V-I relations

$$y \in \partial f(x), \quad Ai = \begin{bmatrix} -y\\0\end{bmatrix}, \quad v = A^\top\begin{bmatrix} x\\e\end{bmatrix}, \quad v_R = D_R i_R, \quad v_L = D_L \tfrac{d}{dt} i_L, \quad i_C = D_C \tfrac{d}{dt} v_C, \quad t \in (0,\infty), \qquad (5)$$

where $v(t) = (v_R(t), v_L(t), v_C(t)) \in \mathbb{R}^\sigma$, $i(t) = (i_R(t), i_L(t), i_C(t)) \in \mathbb{R}^\sigma$, $e(t) \in \mathbb{R}^{\tau - m - 1}$, $x(t) \in \mathbb{R}^m$, and $y(t) \in \mathbb{R}^m$ for $t \in [0, \infty)$.

Figure 4: The dynamic interconnect of Figure 2 connected with $\partial f$. The potentials at the $m$ terminals satisfy $x(t) \to x^\star$ for an optimal $x^\star \in \mathbb{R}^m$ solving (1) under the conditions of Theorem 2.2.

Under appropriate conditions, the dynamics (5) are mathematically well-posed in the sense that there exist unique Lipschitz-continuous curves $v(t)$, $i(t)$, $x(t)$, and $y(t)$ satisfying the V-I relations (5), as formalized in the following Theorem 2.1. The proof, which utilizes the machinery of monotone operator theory [21, 23, 129], is provided in Section B of the appendix.

Theorem 2.1. Assume $f$ is $\mu$-strongly convex and $M$-smooth. Suppose $(v^0, i^0, x^0, y^0)$ satisfy

$$Ai^0 = \begin{bmatrix} -y^0\\0\end{bmatrix}, \qquad v^0 = A^\top\begin{bmatrix} x^0\\ e^0\end{bmatrix}, \qquad v_R^0 = D_R i_R^0, \qquad y^0 = \nabla f(x^0)$$

for some $e^0$. Then there is a unique Lipschitz continuous curve $(v, i, x, y)\colon [0,\infty) \to \mathbb{R}^\sigma \times \mathbb{R}^\sigma \times \mathbb{R}^m \times \mathbb{R}^m$ satisfying the conditions in (5) and the initial condition $(v(0), i(0), x(0), y(0)) = (v^0, i^0, x^0, y^0)$.

Equilibrium yields a primal-dual solution. With the dynamic interconnect composed with $\partial f$, we generically expect the circuit state $(v(t), i(t), x(t), y(t))$ to converge (relax) to an equilibrium state. The admissibility condition ensures that at such an equilibrium, $(x, y)$ is a primal-dual solution. We formally state this fact as Theorem C.2 of the appendix.

2.3 Energy dissipation

Let $(v^\star, i^\star, x^\star, y^\star)$ be an equilibrium of an admissible dynamic interconnect composed with $\partial f$. Since the voltages across resistors and inductors and the currents through capacitors are zero at equilibrium, we have $v^\star = (v_R^\star, v_L^\star, v_C^\star) = (0, 0, v_C^\star)$ and $i^\star = (i_R^\star, i_L^\star, i_C^\star) = (0, i_L^\star, 0)$. (We formally show this in Theorem C.2 of the appendix.) Define the energy of the circuit at time $t$ as

$$\mathcal{E}(t) = \tfrac12\|v_C(t) - v_C^\star\|_{D_C}^2 + \tfrac12\|i_L(t) - i_L^\star\|_{D_L}^2, \qquad (6)$$

which is a dissipative (non-increasing) quantity:

$$\frac{d}{dt}\mathcal{E} = \langle v_C - v_C^\star,\, i_C - i_C^\star\rangle + \langle i_L - i_L^\star,\, v_L - v_L^\star\rangle = -\|i_R\|_{D_R}^2 - \langle x - x^\star,\, y - y^\star\rangle \le 0. \qquad (7)$$

Here, we use $i_C^\star = 0$ and $v_L^\star = 0$ and the fact that the power dissipated by the resistors and $\partial f$ must come from the energy stored in the capacitors and inductors. This dissipativity property leads to the following continuous-time convergence result.

Theorem 2.2. Assume $f\colon \mathbb{R}^m \to \mathbb{R}$ is strongly convex and smooth. Assume the dynamic interconnect is admissible, and let $(x^\star, y^\star)$ be a primal-dual solution pair. Let $(v(t), i(t), x(t), y(t))$ be a curve satisfying (5). Then, $\lim_{t\to\infty} (x(t), y(t)) = (x^\star, y^\star)$.

Theorem 2.2 largely follows as a corollary of Theorem 2.1. The formal proof is provided in Section D of the appendix. In Section 4, we present a systematic framework for finding discretized versions of Theorem 2.2, i.e., convergence guarantees for the corresponding discretized algorithms.
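As a concrete illustration of the energy argument (6)-(7), the sketch below simulates a very simple admissible composition: one capacitor $C$ from each terminal to ground composed with $\nabla f$, for which KCL gives $C\,\dot x = -\nabla f(x)$ and the stored energy is $\mathcal{E}(t) = \tfrac{C}{2}\|x(t) - x^\star\|^2$. The quadratic test function, the crude forward-Euler integration, and all constants are illustrative assumptions; this is not the discretization methodology of Section 4.

```python
import numpy as np

# Energy dissipation (6)-(7) on a toy circuit: a capacitor C from each terminal
# to ground composed with grad f (here E = I, so y* = grad f(x*) = 0).
# KCL gives C*dx/dt = -grad f(x); the energy (C/2)*||x - x*||^2 should decrease.
rng = np.random.default_rng(0)
m, C, dt = 4, 2.0, 1e-2
M = rng.standard_normal((m, m))
Q = M.T @ M + 0.5 * np.eye(m)           # f(x) = 0.5*x'Qx + b'x, strongly convex (illustrative)
b = rng.standard_normal(m)
x_star = -np.linalg.solve(Q, b)         # unconstrained minimizer

x = rng.standard_normal(m)
energies = []
for _ in range(2000):                   # crude forward-Euler integration of the flow
    y = Q @ x + b                       # y(t) = grad f(x(t)), the current into grad f
    x = x - (dt / C) * y                # C * dx/dt = -y
    energies.append(0.5 * C * np.linalg.norm(x - x_star) ** 2)

print("energy nonincreasing:", np.all(np.diff(energies) <= 1e-12))
print("final distance to x*:", np.linalg.norm(x - x_star))
```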
3 Circuits for classical algorithms

In this section, we present circuits recovering the classical Nesterov acceleration, decentralized ADMM, and PG-EXTRA. For additional examples and detailed derivations, refer to Sections E and F of the appendix, where we provide circuits and analyses of classical algorithms such as gradient descent [35], the proximal point method [125], the proximal gradient method [42], primal decomposition [72, 140], dual decomposition [59, 96, 63, 142], Douglas-Rachford splitting [123, 53, 101], Davis-Yin splitting [46], decentralized gradient descent [116, 171], and diffusion [33, 34].

Figure 5: Multi-wire notation.

Multi-wire notation. We start by quickly introducing the multi-wire notation depicted in Figure 5. When optimizing $f\colon \mathbb{R}^m \to \mathbb{R}$ and using the $m$-terminal device $\partial f$, we will often use dynamic interconnects that have the same RLC circuit across each net, i.e., the dynamic interconnect consists of $m$ identical copies of the same RLC circuit for the $m$ coordinates of $x \in \mathbb{R}^m$. In this case, we use the diagonal-line notation depicted in Figure 5.

Moreau envelope. We use the following simple identity throughout this work: $\partial f$ composed in series with a resistor $R$ is equivalent to the device $\nabla({}^{R}\!f)$. To clarify, the equivalence means the two circuits impose the same V-I relation on the $m$ pins of $x$. To see this, let $\tilde x$ denote the potential of the node between the resistor and $\partial f$, so that the current satisfies $\frac{1}{R}(x - \tilde x) \in \partial f(\tilde x)$, which holds if and only if $\tilde x = \mathrm{prox}_{Rf}(x)$; the identity for the gradient of the Moreau envelope then gives $\frac{1}{R}\big(x - \mathrm{prox}_{Rf}(x)\big) = \nabla({}^{R}\!f)(x)$. See Section E.1 of the appendix for further details.

3.1 Nesterov acceleration

Let $f\colon \mathbb{R}^m \to \mathbb{R}$ be a $1/R$-smooth convex function. The circuit corresponding to the classical Nesterov acceleration is given below. [circuit diagram] The use of a negative resistor $-R$ may seem unconventional, but the fact that this circuit is stable is easier to see if we consider the equivalent circuit with the pre-Moreau envelope $\tilde f$, i.e., the convex function $\tilde f$ such that ${}^{R}\tilde f = f$. To clarify, negative resistors satisfy the same V-I relation as standard resistors but with a negative slope. Negative resistors have also been considered in [153]. The V-I relations of this circuit lead to a second-order ODE in $x$ involving $R$, $L$, $C$, and $\nabla f$. If we set $R = \sqrt{L/C}$, which can be interpreted as an instance of critical damping [164, 174, 40], and choose $L$ and $C$ appropriately in terms of the strong convexity parameter $\mu$ (as specified in Section E.3 of the appendix), we recover the Nesterov ODE [162]

$$\frac{d^2}{dt^2}x + 2\sqrt{\mu}\,\frac{d}{dt}x + \nabla f(x) = 0.$$

We also quickly point out that other choices of parameters lead to the high-resolution ODE introduced in [136]. See Section E.3 of the appendix for further details.

3.2 Decentralized ADMM

Let $f_1, \dots, f_N\colon \mathbb{R}^m \to \mathbb{R}\cup\{\infty\}$ be CCP functions. Consider a decentralized optimization setup with graph $G$. We provide the full description of the decentralized setup and notation in Section F of the appendix. Define $\Gamma_j$ to be the set of neighbors of node $j$ in the graph $G$. For simplicity, we only illustrate the circuit related to nodes $j$ and $l$, where $j$ and $l$ are directly connected through an edge of $G$. The circuit corresponding to decentralized ADMM [74, 71, 70, 157, 139] is given below. [circuit diagram]

In the following, the first set of equations presents the dynamics of the continuous-time circuit, and the second presents the discretization with stepsize $L/R$, recovering the standard decentralized ADMM. Continuous-time dynamics:

$$a_j = \frac{1}{|\Gamma_j|}\sum_{l\in\Gamma_j}\big(R\, i_{L,jl} + e_{jl}\big), \qquad x_j = \mathrm{prox}_{(R/|\Gamma_j|) f_j}(a_j), \qquad \frac{d}{dt} i_{L,jl} = \frac{1}{L}\big(e_{jl} - x_j\big).$$

Discretization (decentralized ADMM):

$$a_j^{k+1} = \frac{1}{|\Gamma_j|}\sum_{l\in\Gamma_j}\big(R\, i_{L,jl}^k + e_{jl}^k\big), \qquad x_j^{k+1} = \mathrm{prox}_{(R/|\Gamma_j|) f_j}\big(a_j^{k+1}\big),$$
$$e_{jl}^{k+1} = \tfrac12\big(x_j^{k+1} + x_l^{k+1}\big), \qquad i_{L,jl}^{k+1} = i_{L,jl}^k + \tfrac{1}{R}\big(e_{jl}^{k+1} - x_j^{k+1}\big),$$

for every node $j = 1, \dots, N$ and every edge $(j, l)$ in the graph $G$.
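To see the discrete updates above in action, here is a minimal, self-contained sketch that runs them on a toy problem: a 4-cycle of agents with quadratic local objectives $f_j(x) = \tfrac12\|x - b_j\|^2$, whose proximal operator has the closed form $\mathrm{prox}_{t f_j}(a) = (a + t\,b_j)/(1 + t)$. The graph, the data, and the value of $R$ are illustrative choices, not taken from the paper.

```python
import numpy as np

# Decentralized ADMM updates from Section 3.2 on a toy 4-cycle.
# Local objectives f_j(x) = 0.5*||x - b_j||^2 (illustrative), so that
# prox_{t f_j}(a) = (a + t*b_j)/(1 + t) in closed form.
N, d, R = 4, 5, 1.0
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
nbrs = {j: [] for j in range(N)}
for (j, l) in edges:
    nbrs[j].append(l)
    nbrs[l].append(j)

rng = np.random.default_rng(0)
b = rng.standard_normal((N, d))                             # local data b_j
x = np.zeros((N, d))                                        # primal iterates x_j
e = {tuple(sorted(ed)): np.zeros(d) for ed in edges}        # edge potentials e_{jl} = e_{lj}
i_L = {(j, l): np.zeros(d) for j in nbrs for l in nbrs[j]}  # directed inductor currents

for k in range(200):
    for j in range(N):                                      # a_j^{k+1} and x_j^{k+1}
        a_j = np.mean([R * i_L[(j, l)] + e[tuple(sorted((j, l)))] for l in nbrs[j]], axis=0)
        t = R / len(nbrs[j])
        x[j] = (a_j + t * b[j]) / (1 + t)                   # prox of the quadratic f_j
    for (j, l) in edges:                                    # e_{jl}^{k+1}
        e[tuple(sorted((j, l)))] = 0.5 * (x[j] + x[l])
    for (j, l) in i_L:                                      # i_{L,jl}^{k+1}
        i_L[(j, l)] += (e[tuple(sorted((j, l)))] - x[j]) / R

x_opt = b.mean(axis=0)                                      # minimizer of sum_j f_j under consensus
print("consensus spread:", np.max(np.abs(x - x.mean(axis=0))))
print("distance to optimum:", np.linalg.norm(x[0] - x_opt))
```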
3.3 PG-EXTRA

Let $f_1, \dots, f_N\colon \mathbb{R}^m \to \mathbb{R}\cup\{\infty\}$ be CCP functions and $h_1, \dots, h_N\colon \mathbb{R}^m \to \mathbb{R}$ be convex $M$-smooth functions. Consider a decentralized optimization setup with graph $G$. The circuit corresponding to PG-EXTRA [138] is given below. [circuit diagram] Define the mixing matrix $W \in \mathbb{R}^{N\times N}$ with

$$W_{jl} = \begin{cases} 1 - \displaystyle\sum_{l'\in\Gamma_j} \frac{R}{R_{jl'}} & \text{if } j = l\\[4pt] \dfrac{R}{R_{jl}} & \text{if } j \ne l,\ l \in \Gamma_j\\[2pt] 0 & \text{otherwise.}\end{cases}$$

In the following, the first set of equations presents the V-I relations of the continuous-time circuit, and the second presents the discretization with stepsize $1/2$, recovering the standard PG-EXTRA. Continuous-time relations:

$$x_j = \mathrm{prox}_{R f_j}\Big(\sum_{l=1}^{N} W_{jl} x_l - R\,\nabla h_j(x_j) - w_j\Big), \qquad \frac{d}{dt} w_j = x_j - \sum_{l=1}^{N} W_{jl} x_l.$$

Discretization (PG-EXTRA):

$$x_j^{k+1} = \mathrm{prox}_{R f_j}\Big(\sum_{l=1}^{N} W_{jl} x_l^k - R\,\nabla h_j(x_j^k) - w_j^k\Big), \qquad w_j^{k+1} = w_j^k + \frac12\Big(x_j^k - \sum_{l=1}^{N} W_{jl} x_l^k\Big),$$

for every node $j = 1, \dots, N$ and every edge $(j, l)$ in the graph $G$.

4 Automatic discretization

We discretize the continuous-time dynamics given by the circuit with an admissible dynamic interconnect using a two-stage Runge-Kutta method with parameters $\alpha$, $\beta$ and stepsize $h > 0$. The explicit form of the discretization is stated in Section G of the appendix. Let $\{(v^k, i^k, x^k, y^k)\}_{k=1}^{\infty}$ be the iterates generated by the discretized algorithm. Then the energy stored in the circuit at time $t = kh$ is

$$\mathcal{E}^k = \tfrac12\|v_C^k - v_C^\star\|_{D_C}^2 + \tfrac12\|i_L^k - i_L^\star\|_{D_L}^2.$$

To guarantee convergence of the discretized algorithm, we search for discretization parameters that ensure the sequence $\mathcal{E}^1, \mathcal{E}^2, \dots$ is dissipative in the following sense. Specifically, we say the algorithm (or the discretization) is sufficiently dissipative if there is an $\eta > 0$ such that

$$\mathcal{E}^{k+1} - \mathcal{E}^k + \eta\,\langle x^k - x^\star,\, y^k - y^\star\rangle \le 0 \qquad (8)$$

holds for all $k = 1, 2, \dots$. This requirement is analogous to the sufficient decrease conditions in optimization [31, 121]. The following Lemma 4.1, whose proof we provide in Section G of the appendix, states that sufficient dissipativity ensures convergence under suitable conditions.

Lemma 4.1. Assume $f\colon \mathbb{R}^m \to \mathbb{R}\cup\{\infty\}$ is a strictly convex function and the dynamic interconnect is admissible. If the two-stage Runge-Kutta discretization, as explicitly stated in Section G of the appendix, generates a discrete-time sequence $\{(v^k, i^k, x^k, y^k)\}_{k=1}^{\infty}$ satisfying the sufficient dissipativity condition (8), then $x^k$ converges to a primal solution.

We find such a discretization with the following automated methodology. Given a discretization characterized by $(\alpha, \beta, h)$, the dissipativity condition (8) for a given $\eta > 0$ is implied if the optimal value of the following optimization problem is non-positive:

$$\begin{array}{ll}
\text{maximize} & \mathcal{E}^2 - \mathcal{E}^1 + \eta\,\langle x^1 - x^\star,\, y^1 - y^\star\rangle\\
\text{subject to} & \mathcal{E}^s = \tfrac12\|v_C^s - v_C^\star\|_{D_C}^2 + \tfrac12\|i_L^s - i_L^\star\|_{D_L}^2, \quad s \in \{1, 2\}\\
& (v^1, i^1, x^1, y^1) \text{ is a feasible initial point}\\
& (v^2, i^2, x^2, y^2) \text{ is generated by the discretized method from the initial point}\\
& f \in \mathcal{F},
\end{array} \qquad (9)$$

where $f, v^1, i^1, x^1, y^1, v^\star, i^\star, x^\star, y^\star$ are the decision variables and $\mathcal{F}$ is a family of functions (e.g., $L$-smooth convex) that the algorithm is to be applied to. Here, we are using the fact that (8) is homogeneous with respect to $k$ (i.e., (8) essentially has no $k$-dependence), and therefore it is sufficient to verify the condition for $k = 1$ but for all feasible initial points $(v^1, i^1, x^1, y^1)$.

It turns out that (9) can be solved exactly as a semidefinite program (SDP) for many commonly considered function classes $\mathcal{F}$. This technique was initially proposed as the performance estimation problem (PEP) [54, 150], a computer-aided methodology for constructing convergence proofs of first-order optimization methods. See, e.g., the PEPit package [76], which implements PEP in Python. Further, (9) can be posed as a nonconvex quadratically constrained quadratic program (QCQP) with only a few tens of variables, and such problems can be solved exactly with spatial branch-and-bound algorithms [2, 102, 80, 98, 45].
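Before the exact SDP/QCQP certification, condition (8) can also be sanity-checked empirically for a candidate discretization. The sketch below does this for the simple capacitive circuit used earlier ($C\dot x = -\nabla f(x)$, so $v_C = x$, there is no inductor, and $\mathcal{E}^k = \tfrac{C}{2}\|x^k - x^\star\|^2$), discretized by a plain explicit Euler step in place of the paper's two-stage Runge-Kutta discretization. All constants and the random quadratic test functions are illustrative assumptions; a finite set of random instances is of course no substitute for solving (9) exactly.

```python
import numpy as np

# Empirical check of the sufficient dissipativity condition (8) for an
# illustrative discretization: explicit Euler on C*dx/dt = -grad f(x),
# i.e. x^{k+1} = x^k - (h/C)*grad f(x^k), with E^k = (C/2)*||x^k - x*||^2.
# Test functions: random quadratics normalized to be 1-smooth.
rng = np.random.default_rng(1)
C, h, eta = 1.0, 0.1, 0.05
worst = -np.inf
for _ in range(1000):
    d = 10
    M = rng.standard_normal((d, d))
    Q = M.T @ M + 0.1 * np.eye(d)
    Q /= np.linalg.eigvalsh(Q)[-1]          # normalize so f(x) = 0.5*x'Qx is 1-smooth
    x_star = np.zeros(d)                    # minimizer; here y* = grad f(x*) = 0
    x = rng.standard_normal(d)
    for k in range(50):
        y = Q @ x                           # y^k = grad f(x^k)
        x_next = x - (h / C) * y
        E_k  = 0.5 * C * np.linalg.norm(x - x_star) ** 2
        E_k1 = 0.5 * C * np.linalg.norm(x_next - x_star) ** 2
        worst = max(worst, E_k1 - E_k + eta * float(np.dot(x - x_star, y)))
        x = x_next
print("largest observed left-hand side of (8):", worst)  # expected to be <= 0
```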
In conclusion, we can solve a nonconvex QCQP to find a provably convergent discretization of the continuous-time circuit with an admissible dynamic interconnect. We use the Ipopt solver [155, 9]. Further details are provided in Section G of the appendix.

Example. Consider the following example circuit for the minimization of a convex function $f$. [circuit diagram] Let $R_1 = R_2 = R_3 = 1$ and $C_1 = C_2 = 10$. With our automatic discretization methodology, we find the sufficiently dissipative parameters $\eta = 6.66$, $h = 6.66$, $\alpha = 0$, $\beta = 1$. The resulting algorithm,

$$x^k = \mathrm{prox}_{(1/2)f}(z^k), \qquad y^k = 2(z^k - x^k),$$
$$w^{k+1} = w^k - 0.33\,(y^k + 3w^k), \qquad z^{k+1} = z^k - 0.16\,(5y^k + 3w^k),$$

is provably convergent^2 under the condition that $f$ is strictly convex; see Section H of the appendix for details.

^2 Our pipeline has a final verification stage that numerically checks whether the point returned by the Ipopt solver is indeed feasible for the small QCQP. Strictly speaking, our theoretical convergence guarantee relies on the correctness of this numerical verification of feasibility.

5 Experiments

Figure 8: Underlying graph $G$.

In this section, we use our methodology to obtain a new algorithm and experiment with it on a specific problem instance. Consider a decentralized optimization problem with a communication graph $G$ with $N = 6$ nodes and 7 edges, as shown in Figure 8. Specifically, we consider the optimization problem

$$\underset{x\in\mathbb{R}^{100}}{\text{minimize}} \;\; \sum_{i\in\{4,5\}} \big(\|x - b_i\|_2 + \|x - b_i\|_2^2\big) + \sum_{i\notin\{4,5\}} \|x - b_i\|_2,$$

where each agent $i \in \{1, \dots, 6\}$ holds the vector $b_i \in \mathbb{R}^{100}$. To leverage the strong convexity of $f_4$ and $f_5$, we propose a modification of the DADMM circuit described in Section F.3. Given that a circuit with a capacitor and an inductor corresponds to a momentum method (see Section 3.1), and momentum is known to accelerate convergence for strongly convex functions [124], we add a capacitor to the edge $e_{45}$ of the DADMM circuit, as shown in the left column of Figure 9. We then discretize the circuit and refer to the resulting algorithm as DADMM+C. We apply DADMM+C to the decentralized optimization problem and observe a speedup, as shown in the right column of Figure 9. The relative error decreases to $10^{-10}$ in 66 iterations for DADMM+C, in 87 iterations for DADMM, and in 294 iterations for P-EXTRA. For further details, see Section I.1 of the appendix.

Figure 9: (Left) Circuit of DADMM+C. Compared to the DADMM circuit of Section F.3, the DADMM+C circuit has an additional capacitor. (Right) Relative error $|f(x^k) - f^\star|/f^\star$ vs. iteration count for P-EXTRA, DADMM, and DADMM+C.

Further, we define a general version of the DADMM+C method for any connected graph and establish a general convergence proof in Lemma I.1 in Section I.1.1 of the appendix. This convergence analysis demonstrates how our methodology can be used to discover a new family of methods with a classical convergence proof. Finally, we provide another set of similar experiments in Section I.2 of the appendix.

6 Conclusion

In this work, we present a novel approach to optimization algorithm design using ideas from electric RLC circuits. The continuous-time RLC circuit models, combined with the automatic discretization method, provide a foundation for designing algorithms that inherently possess convergence guarantees. Further, we provide code implementing the automatic discretization. Our framework opens the door to future research applying this methodology to a broader range of optimization problems and extending it to other setups, such as the stochastic optimization setting.
Acknowledgments and Disclosure of Funding This work was supported by the Samsung Science and Technology Foundation (Project Number SSTF-BA2101-02), the National Research Foundation of Korea (NRF) grant funded by the Korean government (No.RS-2024-00421203, RS-2024-00406127), and the Oliger Memorial Fellowship. We thank Hangjun Cho for the helpful discussions on the continuous-time analysis. We also thank anonymous reviewers for the highly constructive feedback. [1] B. Abbas and H. Attouch. Dynamical systems and forward backward algorithms associated with the sum of a convex subdifferential and a monotone cocoercive operator. Optimization. A Journal of Mathematical Programming and Operations Research, 64(10):2223 2252, 2015. [2] T. Achterberg and E. Towle. Non-Convex Quadratic Optimization: Gurobi 9.0. 2020. https: //www.gurobi.com/resource/non-convex-quadratic-optimization/. [3] S. Adly and H. Attouch. Finite convergence of proximal-gradient inertial algorithms combining dry friction with Hessian-driven damping. SIAM Journal on Optimization, 30(3):2134 2162, 2020. [4] A. Agarwal, C. Fiscko, S. Kar, L. Pileggi, and B. Sinopoli. An equivalent circuit workflow for unconstrained optimization. ar Xiv preprint ar Xiv:2305.14061, 2023. [5] A. Agarwal and L. Pileggi. An equivalent circuit approach to distributed optimization. ar Xiv preprint ar Xiv:2305.14607, 2023. [6] F. Alvarez. On the minimizing property of a second order dissipative system in hilbert spaces. SIAM Journal on Control and Optimization, 38(4):1102 1119, 2000. [7] F. Alvarez and H. Attouch. An inertial proximal method for maximal monotone operators via discretization of a nonlinear oscillator with damping. Set-Valued Analysis, 9(1):3 11, 2001. [8] F. Alvarez, H. Attouch, J. Bolte, and P. Redont. A second-order gradient-like dissipative dynamical system with Hessian-driven damping : Application to optimization and mechanics. Journal de Mathématiques Pures et Appliquées, 81(8):747 779, 2002. [9] J. Andersson, J. Gillis, G. Horn, J. B. Rawlings, and M. Diehl. Cas ADi: a software framework for nonlinear optimization and optimal control. Mathematical Programming Computation, 11:1 36, 2019. [10] V. Apidopoulos, J.-F. Aujol, and C. Dossal. The differential inclusion modeling FISTA algorithm and optimality of convergence rate in the case b 3. SIAM Journal on Optimization, 28(1):551 574, 2018. [11] H. Attouch and F. Alvarez. The heavy ball with friction dynamical system for convex constrained minimization problems. Belgian-French-German Conference on Optimization, 1998. [12] H. Attouch, Z. Chbani, J. Fadili, and H. Riahi. First-order optimization algorithms via inertial systems with Hessian driven damping. Mathematical Programming, 2020. [13] H. Attouch, Z. Chbani, J. Peypouquet, and P. Redont. Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Mathematical Programming, 168(1):123 175, 2018. [14] H. Attouch, Z. Chbani, and H. Riahi. Fast proximal methods via time scaling of damped inertial dynamics. SIAM Journal on Optimization, 29(3):2227 2256, 2019. [15] H. Attouch, Z. Chbani, and H. Riahi. Rate of convergence of the Nesterov accelerated gradient method in the subcritical case α 3. ESAIM: Control, Optimisation and Calculus of Variations, 25:2, 2019. [16] H. Attouch and M.-O. Czarnecki. Asymptotic control and stabilization of nonlinear oscillators with non-isolated equilibria. Journal of Differential Equations, 179(1):278 310, 2002. [17] H. Attouch, X. Goudou, and P. Redont. 
The heavy ball with friction method, I. The continuous dynamical system: Global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Communications in Contemporary Mathematics, 02(01):1 34, 2000. [18] H. Attouch and S. C. László. Newton-like inertial dynamics and proximal algorithms governed by maximally monotone operators. SIAM Journal on Optimization, 30(4):3252 3283, 2020. [19] H. Attouch and J. Peypouquet. Convergence of inertial dynamics and proximal algorithms governed by maximally monotone operators. Mathematical Programming, 174(1):391 432, 2019. [20] H. Attouch, J. Peypouquet, and P. Redont. A dynamical approach to an inertial forwardbackward algorithm for convex minimization. SIAM Journal on Optimization, 24(1):232 256, 2014. [21] J.-P. Aubin and A. Cellina. Differential Inclusions: Set-Valued Maps and Viability Theory, volume 264. Springer Science & Business Media, 2012. [22] M. Barré, A. B. Taylor, and F. Bach. Principled analyses and design of first-order methods with inexact proximal operators. Mathematical Programming, 201(1):185 230, 2023. [23] H. Bauschke and P. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer International Publishing, second edition, 2017. [24] M. Betancourt, M. I. Jordan, and A. C. Wilson. On symplectic optimization. ar Xiv:1802.03653, 2018. [25] R. I. Bo t and E. R. Csetnek. Convergence rates for forward backward dynamical systems associated with strongly monotone inclusions. Journal of Mathematical Analysis and Applications, 457(2):1135 1152, 2018. [26] R. I. Bo t, E. R. Csetnek, and D.-K. Nguyen. Fast Optimistic Gradient Descent Ascent (OGDA) method in continuous and discrete time. Foundations of Computational Mathematics, 2023. [27] R. I. Bo t and D. A. Hulett. Second order splitting dynamics with vanishing damping for additively structured monotone inclusions. Journal of Dynamics and Differential Equations, 2022. [28] S. Boyd. Distributed optimization: Analysis and synthesis via circuits. Lecture Note EE364b, Stanford University, 2010. [29] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1 122, 2011. [30] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine learning, 3(1):1 122, 2011. [31] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. [32] R. E. Bruck. Asymptotic convergence of nonlinear contraction semigroups in Hilbert space. Journal of Functional Analysis, 18(1):15 26, 1975. [33] F. S. Cattivelli, C. G. Lopes, and A. H. Sayed. A diffusion rls scheme for distributed estimation over adaptive networks. In Signal Processing Advances in Wireless Communications, 2007. [34] F. S. Cattivelli and A. H. Sayed. Diffusion LMS strategies for distributed estimation. IEEE Transactions on Signal Processing, 58(3):1035 1048, 2010. [35] A.-L. Cauchy. Méthode générale pour la résolution des systémes d équations simultanées. Comptes Rendus Hebdomadaires des Séances de l Académie des Sciences, 25:536 538, 1847. [36] T. Chaffey, S. Banert, P. Giselsson, and R. Pates. Circuit analysis using monotone+skew splitting. European Journal of Control, 74:100854, 2023. [37] T. Chaffey and A. Padoan. 
Circuit model reduction with scaled relative graphs. Conference on Decision and Control (CDC), pages 6530 6535, 2022. [38] T. Chaffey and R. Sepulchre. Monotone RLC circuits. European Control Conference, 2021. [39] T. Chaffey and R. Sepulchre. Monotone one-port circuits. IEEE Transactions on Automatic Control, 69(2):783 796, 2024. [40] S. Chen, B. Shi, and Y.-X. Yuan. On underdamped Nesterov s acceleration. ar Xiv:2304.14642, 2023. [41] L. Chua and G.-N. Lin. Nonlinear programming without computation. IEEE Transactions on Circuits and Systems, 31(2):182 188, 1984. [42] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4):1168 1200, 2005. [43] E. R. Csetnek, Y. Malitsky, and M. K. Tam. Shadow Douglas Rachford splitting for monotone inclusions. Applied Mathematics & Optimization, 80(3):665 678, 2019. [44] S. Cyrus, B. Hu, B. Van Scoy, and L. Lessard. A robust accelerated optimization algorithm for strongly convex functions. American Control Conference (ACC), pages 1376 1381, 2018. [45] S. Das Gupta, B. P. G. Van Parys, and E. K. Ryu. Branch-and-bound performance estimation programming: A unified methodology for constructing optimal optimization methods. Mathematical Programming, 2023. [46] D. Davis and W. Yin. A three-operator splitting scheme and its optimization applications. Set-Valued and Variational Analysis, 25(4):829 858, 2017. [47] G. B. De Luca and E. Silverstein. Born-infeld (BI) for AI: Energy-conserving descent (ECD) for optimization. International Conference on Machine Learning, 162, 2022. [48] J. B. Dennis. Mathematical Programming and Electrical Networks. Ph D thesis, Massachusetts Institute of Technology, 1959. [49] C. A. Desoer and J. Katzenelson. Nonlinear RLC networks. Bell System Technical Journal, 44(1):161 198, 1965. [50] C. A. Desoer and E. S. Kuh. Basic Circuit Theory. Electronic Engineering. Mc Graw-Hill, 1969. [51] C. A. Desoer and F. F. Wu. Nonlinear monotone networks. SIAM Journal on Applied Mathematics, 26(2):315 333, 1974. [52] J. Diakonikolas and M. I. Jordan. Generalized momentum-based methods: A Hamiltonian perspective. SIAM Journal on Optimization, 31(1):915 944, 2021. [53] J. Douglas and H. H. Rachford. On the numerical solution of heat conduction problems in two and three space variables. Transactions of the American Mathematical Society, 82(2):421 439, 1956. [54] Y. Drori and M. Teboulle. Performance of first-order methods for smooth convex minimization: A novel approach. Mathematical Programming, 145(1):451 482, 2014. [55] R. J. Duffin. Nonlinear networks. I. Bulletin of the American Mathematical Society, 52(10):833 838, 1946. [56] J. Eckstein and D. P. Bertsekas. On the Douglas Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1):293 318, 1992. [57] L. Euler. Institutiones Calculi Differentialis. Petropolis, 1755. [58] M. Even, R. Berthier, F. Bach, N. Flammarion, H. Hendrikx, P. Gaillard, L. Massoulié, and A. Taylor. Continuized accelerations of deterministic and stochastic gradient descents, and of gossip algorithms. Neural Information Processing Systems, 2021. [59] H. Everett. Generalized Lagrange multiplier method for solving problems of optimum allocation of resources. Operations Research, 11(3):399 417, 1963. [60] M. Fazlyab, A. Ribeiro, M. Morari, and V. M. Preciado. Analysis of optimization algorithms via integral quadratic constraints: Nonstrongly convex problems. 
SIAM Journal on Optimization, 28(3):2654 2689, 2018. [61] K. Feng. On difference schemes and symplectic geometry. Proceedings of the 5th International Symposium on Differential Geometry and Differential Equations, pages 42 58, 1984. [62] S. Fiori. Quasi-geodesic neural learning algorithms over the orthogonal group: A tutorial. Journal of Machine Learning Research, 6(26):743 781, 2005. [63] M. L. Fisher. The Lagrangian relaxation method for solving integer programming problems. Management Science, 50(12_supplement):1861 1871, 2004. [64] G. França, M. I. Jordan, and R. Vidal. On dissipative symplectic integration with applications to gradient-based optimization. Journal of Statistical Mechanics: Theory and Experiment, 2021(4):043402, 2021. [65] G. França, D. Robinson, and R. Vidal. ADMM and accelerated ADMM as continuous dynamical systems. International Conference on Machine Learning, 2018. [66] G. França, D. P. Robinson, and R. Vidal. Gradient flows and proximal splitting methods: A unified view on accelerated and stochastic optimization. Physical Review E: Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics, 103(5):053304, 2021. [67] G. França, D. P. Robinson, and R. Vidal. A nonsmooth dynamical systems perspective on accelerated extensions of ADMM. IEEE Transactions on Automatic Control, 68(5):2966 2978, 2023. [68] G. França, J. Sulam, D. Robinson, and R. Vidal. Conformal symplectic and relativistic optimization. Neural Information Processing Systems, 2020. [69] G. França, J. Sulam, D. P. Robinson, and R. Vidal. Conformal symplectic and relativistic optimization. Journal of Statistical Mechanics: Theory and Experiment, 2020(12):124008, 2020. [70] D. Gabay. Chapter IX applications of the method of multipliers to variational inequalities. In M. Fortin and R. Glowinski, editors, Augmented Lagrangian Methods: Applications to the Numerical Solution of Boundary-Value Problems, volume 15 of Studies in Mathematics and Its Applications, pages 299 331. Elsevier, 1983. [71] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers and Mathematics with Applications, 2(1):17 40, 1976. [72] A. M. Geoffrion. Primal resource-directive approaches for optimizing nonlinear decomposable systems. Operations Research, 18(3):375 403, 1970. [73] B. Gharesifard and J. Cortés. Distributed continuous-time convex optimization on weightbalanced digraphs. IEEE Transactions on Automatic Control, 59(3):781 786, 2014. [74] R. Glowinski and A. Marroco. Sur l approximation, par éléments finis d ordre un, et la résolution, par pénalisation-dualité d une classe de problèmes de Dirichlet non linéaires. Revue Française d Automatique, Informatique, Recherche Opérationnelle. Analyse Numérique, 9(2):41 76, 1975. [75] E. Gorbunov, N. Loizou, and G. Gidel. Extragradient method: O(1/K) last-iterate convergence for monotone variational inequalities and connections with cocoercivity. International Conference on Artificial Intelligence and Statistics, 2022. [76] B. Goujaud, C. Moucer, F. Glineur, J. M. Hendrickx, A. B. Taylor, and A. Dieuleveut. PEPit: Computer-assisted worst-case analyses of first-order optimization methods in Python. Mathematical Programming Computation, 16(3):337 367, 2024. [77] E. Hairer, C. Lubich, and W. Gerhard. Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations. Springer, 2 edition, 2006. [78] S. Hassan-Moghaddam and M. R. Jovanovi c. 
Proximal gradient flow and Douglas Rachford splitting dynamics: Global exponential stability via integral quadratic constraints. Automatica, 123:109311, 2021. [79] U. Helmke and J. Moore. Optimization and dynamical systems. Proceedings of the IEEE, 84(6):907 , 1996. [80] R. Horst and H. Tuy. Global Optimization: Deterministic Approaches. Springer Science & Business Media, 2013. [81] B. Hu and L. Lessard. Dissipativity theory for Nesterov s accelerated method. International Conference on Machine Learning, 2017. [82] T. H. Hughes. Passivity and electric circuits: A behavioral approach. IFAC-Papers On Line, 50(1):15500 15505, 2017. [83] A. Iserles. A First Course in the Numerical Analysis of Differential Equations. Cambridge University Press, 2009. [84] U. Jang, S. D. Gupta, and E. K. Ryu. Computer-assisted design of accelerated composite optimization methods: Opt ISTA. ar Xiv:2305.15704, 2023. [85] H. K. Khalil. Nonlinear Systems. Pearson Education. Prentice Hall, 2002. [86] S. S. Kia, J. Cortés, and S. Martínez. Distributed convex optimization via continuous-time coordination algorithms with discrete-time communication. Automatica, 55:254 264, 2015. [87] D. Kim. Accelerated proximal point method for maximally monotone operators. Mathematical Programming, 190(1 2):57 87, 2021. [88] D. Kim and J. A. Fessler. Optimized first-order methods for smooth convex minimization. Mathematical Programming, 159(1-2):81 107, 2016. [89] D. Kim and J. A. Fessler. Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions. Journal of Optimization Theory and Applications, 188(1):192 219, 2021. [90] J. Kim and I. Yang. Convergence analysis of ODE models for accelerated first-order methods via positive semidefinite kernels. Neural Information Processing Systems, 2023. [91] J. Kim and I. Yang. Unifying Nesterov s accelerated gradient methods for convex and strongly convex objective functions. International Conference on Machine Learning, 2023. [92] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015. [93] W. Krichene, A. Bayen, and P. L. Bartlett. Accelerated mirror descent in continuous and discrete time. Neural Information Processing Systems, 2015. [94] W. Kutta. Beitrag Zur Näherungsweisen Integration Totaler Differentialgleichungen. Teubner, 1901. [95] S. Lee and D. Kim. Fast extra gradient methods for smooth structured nonconvex-nonconcave minimax problems. Neural Information Processing Systems, 2021. [96] C. Lemaréchal. Lagrangian relaxation. In M. Jünger and D. Naddef, editors, Computational Combinatorial Optimization: Optimal or Provably Near-Optimal Solutions, Lecture Notes in Computer Science, pages 112 156. Springer, Berlin, Heidelberg, 2001. [97] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57 95, 2016. [98] L. Liberti. Introduction to global optimization. Ecole Polytechnique, 2008. [99] F. Lieder. On the convergence rate of the Halpern-iteration. Optimization Letters, 15(2):405 418, 2021. [100] P. Lin, W. Ren, and J. A. Farrell. Distributed continuous-time optimization: Nonuniform gradient gains, finite-time convergence, and convex constraint set. IEEE Transactions on Automatic Control, 62(5):2239 2253, 2017. [101] P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964 979, 1979. [102] M. 
Locatelli and F. Schoen. Global Optimization: Theory, Algorithms, and Applications. SIAM, 2013. [103] H. Lu. An O(sr)-resolution ODE framework for understanding discrete-time algorithms and applications to the linear convergence of minimax problems. Mathematical Programming, 194(1):1061 1112, 2022. [104] J. Lu and C. Y. Tang. Zero-gradient-sum algorithms for distributed convex optimization: The continuous-time case. IEEE Transactions on Automatic Control, 57(9):2348 2354, 2012. [105] C. J. Maddison, D. Paulin, Y. W. Teh, B. O Donoghue, and A. Doucet. Hamiltonian descent methods. ar Xiv:1809.05042, 2018. [106] J. C. Maxwell. A Treatise on Electricity and Magnetism, volume 1. Oxford: Clarendon Press, 1873. [107] R. Mc Lachlan and M. Perlmutter. Conformal Hamiltonian systems. Journal of Geometry and Physics, 39(4):276 300, 2001. [108] A. Megretski and A. Rantzer. System analysis via integral quadratic constraints. IEEE Transactions on Automatic Control, 42(6):819 830, 1997. [109] W. Millar. CXVI. Some general theorems for non-linear systems possessing resistance. Philosophical Magazine and Journal of Science, 42(333):1150 1160, 1951. [110] G. Minty. Solving steady-state nonlinear networks of monotone elements. IRE Transactions on Circuit Theory, 8(2):99 104, 1961. [111] G. J. Minty. Monotone networks. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 257(1289):194 212, 1960. [112] G. J. Minty. On the maximal domain of a monotone function. The Michigan Mathematical Journal, 8(2):135 137, 1961. [113] C. Moucer, A. Taylor, and F. Bach. A systematic approach to Lyapunov analyses of continuoustime models in convex optimization. SIAM Journal on Optimization, 33(3):1558 1586, 2023. [114] M. Muehlebach and M. Jordan. A dynamical systems perspective on Nesterov acceleration. International Conference on Machine Learning, 2019-06-09/2019-06-15. [115] M. Muehlebach and M. I. Jordan. Optimization with momentum: Dynamical, control-theoretic, and symplectic perspectives. Journal of Machine Learning Research, 22(73):1 50, 2021. [116] A. Nedic and A. Ozdaglar. Distributed Subgradient Methods for Multi-Agent Optimization. IEEE Transactions on Automatic Control, 54(1):48 61, 2009. [117] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k2). Doklady Akademii Nauk SSSR, 269(3):543 547, 1983. [118] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004. [119] Y. Nesterov. Lectures on Convex Optimization. Springer, 2 edition, 2018. [120] R. Nishihara, L. Lessard, B. Recht, A. Packard, and M. Jordan. A general analysis of the convergence of ADMM. International Conference on Machine Learning, 2015. [121] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999. [122] J. Park and E. K. Ryu. Exact optimal accelerated complexity for fixed-point iterations. International Conference on Machine Learning, 2022. [123] D. W. Peaceman and J. Rachford, H. H. The numerical solution of parabolic and elliptic differential equations. Journal of the Society for Industrial and Applied Mathematics, 3(1):28 41, 1955. [124] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1 17, 1964. [125] R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877 898, 1976. [126] T. Rockafellar. Convex analysis, volume 11. Princeton University Press, 1997. [127] C. Runge. 
Ueber die numerische Auflösung von Differentialgleichungen. Mathematische Annalen, 46(2):167 178, 1895. [128] R. D. Ruth. A canonical integration technique. IEEE Transactions on Nuclear Science, 30(4):2669 2671, 1983. [129] E. Ryu and W. Yin. Large-Scale Convex Optimization via Monotone Operators. Cambridge University Press, 2022. [130] E. K. Ryu, A. B. Taylor, C. Bergeling, and P. Giselsson. Operator splitting performance estimation: Tight contraction factors and optimal parameter selection. SIAM Journal on Optimization, 30(3):2251 2271, 2020. [131] S. K. Sashank J. Reddi, Satyen Kale. On the convergence of Adam and beyond. International Conference on Learning Representations, 2018. [132] K. Sawant, D. Nguyen, A. Liu, J. Poon, and S. Dhople. A hybrid-computing solution to nonlinear optimization problems. IEEE Transactions on Circuits and Systems I: Regular Papers, 71(12):6555 6568, 2024. [133] J. Schropp and I. Singer. A dynamical systems approach to constrained minimization. Numerical Functional Analysis and Optimization, 21(3-4):537 551, 2000. [134] D. Scieur, V. Roulet, F. Bach, and A. d Aspremont. Integration methods and optimization algorithms. Neural Information Processing Systems, 2017. [135] S. Seshu and M. B. Reed. Linear Graphs and Electrical Networks. Addison-Wesley Series in Behavioral Science: Quantitative Methods. Addison-Wesley Publishing Company, 1961. [136] B. Shi, S. Du, W. Su, and M. Jordan. Acceleration via symplectic discretization of highresolution differential equations. Neural Information Processing Systems, 2019. [137] B. Shi, S. S. Du, M. I. Jordan, and W. J. Su. Understanding the acceleration phenomenon via high-resolution differential equations. Mathematical Programming, 2021. [138] W. Shi, Q. Ling, G. Wu, and W. Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013 6023, 2015. [139] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750 1761, 2014. [140] G. J. Silverman. Primal decomposition of mathematical programs by resource allocation: I. Basic theory and a direction-finding procedure. Operations Research, 20(1):58 74, 1972. [141] C. S. Simon Michalowsky and C. Ebenbauer. Robust and structure exploiting optimisation algorithms: An integral quadratic constraint approach. International Journal of Control, 94(11):2956 2979, 2021. [142] D. Sontag, A. Globerson, and T. Jaakkola. Introduction to dual decomposition for inference. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 219 254. The Massachusetts Institute of Technology Press, 2011. [143] W. Su, S. Boyd, and E. J. Candès. A differential equation for modeling Nesterov s accelerated gradient method: Theory and insights. Neural Information Processing Systems, 2014. [144] W. Su, S. Boyd, and E. J. Candès. A differential equation for modeling Nesterov s accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17(153):1 43, 2016. [145] J. J. Suh, J. Park, and E. K. Ryu. Continuous-time analysis of anchor acceleration. Neural Information Processing Systems, 2023. [146] J. J. Suh, G. Roh, and E. K. Ryu. Continuous-time analysis of AGM via conservation laws in dilated coordinate systems. International Conference on Machine Learning, 2022. [147] A. Sundararajan, B. Van Scoy, and L. Lessard. 
Analysis and design of first-order distributed optimization algorithms over time-varying graphs. IEEE Transactions on Control of Network Systems, 7(4):1597 1608, 2020. [148] A. Taylor and F. Bach. Stochastic first-order methods: Non-asymptotic and computer-aided analyses via potential functions. Conference on Learning Theory, 2019-06-25/2019-06-28. [149] A. Taylor and Y. Drori. An optimal gradient method for smooth strongly convex minimization. Mathematical Programming, 199(1-2):557 594, 2023. [150] A. B. Taylor, J. M. Hendrickx, and F. Glineur. Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Mathematical Programming, 161(1):307 345, 2017. [151] K. Ushiyama, S. Sato, and T. Matsuo. A unified discretization framework for differential equation approach with Lyapunov arguments for convex optimization. Neural Information Processing Systems, 2023. [152] B. Van Scoy, R. A. Freeman, and K. M. Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Systems Letters, 2(1):49 54, 2018. [153] S. Vichik and F. Borrelli. Solving linear and quadratic programs with an analog circuit. Computers & Chemical Engineering, 70:160 171, 2014. [154] R. D. Vogelaere. Methods of integration which preserve the contact transformation property of the hamilton equations. 1956. [155] A. Wächter and L. T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106:25 57, 2006. [156] J. Wang and N. Elia. Control approach to distributed optimization. Annual Allerton Conference on Communication, Control, and Computing, 2010. [157] E. Wei and A. Ozdaglar. On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers. Global Conference on Signal and Information Processing, 2013. [158] A. Wibisono, A. C. Wilson, and M. I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351 E7358, 2016. [159] J. C. Willems. The Generation of Lyapunov Functions for Input-Output Stable Systems. SIAM Journal on Control, 9(1):105 134, 1971. [160] J. C. Willems. Dissipative dynamical systems part I: General theory. Archive for Rational Mechanics and Analysis, 45(5):321 351, 1972. [161] A. C. Wilson, L. Mackey, and A. Wibisono. Accelerating rescaled gradient descent: Fast optimization of smooth functions. Neural Information Processing Systems, 2019. [162] A. C. Wilson, B. Recht, and M. I. Jordan. A Lyapunov analysis of accelerated methods in optimization. Journal of Machine Learning Research, 22(113):1 34, 2021. [163] G. Wilson. Quadratic programming analogs. IEEE Transactions on Circuits and Systems, 33(9):907 911, 1986. [164] L. Yang, R. Arora, V. braverman, and T. Zhao. The physical systems behind optimization algorithms. Neural Information Processing Systems, 31, 2018. [165] T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, and K. H. Johansson. A survey of distributed optimization. Annual Reviews in Control, 47:278 305, 2019. [166] T. Yoon, J. Kim, J. J. Suh, and E. K. Ryu. Optimal acceleration for minimax and fixed-point problems is not unique. International Conference on Machine Learning, 2024. [167] T. Yoon and E. K. Ryu. Accelerated algorithms for smooth convex-concave minimax problems with O(1/k2) rate on squared gradient norm. International Conference on Machine Learning, 2021. [168] Y. Yu and B. 
Açıkme se. RC circuits based distributed conditional gradient method. ar Xiv:2003.06949, 2020. [169] Y. Yu and B. Açıkme se. RLC circuits-based distributed mirror descent method. IEEE Control Systems Letters, 4(3):548 553, 2020. [170] H. Yuan, Y. Zhou, C. J. Li, and Q. Sun. Differential inclusions for modeling nonsmooth ADMM variants: A continuous limit theory. International Conference on Machine Learning, 2019. [171] K. Yuan, Q. Ling, and W. Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835 1854, 2016. [172] J. Zhang, A. Mokhtari, S. Sra, and A. Jadbabaie. Direct Runge Kutta discretization achieves acceleration. Neural Information Processing Systems, 2018. [173] J. Zhang, S. Sra, and A. Jadbabaie. Acceleration in first order quasi-strongly convex optimization by ODE discretization. Conference on Decision and Control, 2019. [174] P. Zhang, A. Orvieto, and H. Daneshmand. Rethinking the variational interpretation of accelerated optimization methods. Neural Information Processing Systems, 2021. A Prior works Distributed optimization as RLC circuits. This work started as a lecture for the Stanford University EE 364b class given in 2010 [28]. The lecture proposed the idea of relating distributed optimization algorithms to the dynamics of RLC circuits. Different from the prior studies [48, 163, 41, 153, 132], that consider solving specific optimization problems through implementing physical circuits, our focus is on using insights from circuit theory to design new algorithms, without any consideration of implementing physical circuits. The follow-up works [168, 169, 4, 5], have built upon this setup [28]. Optimization algorithms from continuous-time dynamics. Relating continuous-time dynamics described by ordinary differential equation (ODE) with optimization algorithm is a technique with a long history [32, 79, 6, 133, 62]. The continuous-time dynamics related to Polyak s heavy ball method [124] were studied by [11, 17, 8, 16]. The ODE model for Nesterov acceleration [117] was introduced by [143, 144], analyses for generalized cases were followed by [10, 13, 15], and the ODE model for Nesterov acceleration for strongly convex function (NAG-SC) was introduced in [162]. Together with [93], the studies by [143, 144] initiated continuous-time analyses of accelerated first-order methods and inspired much follow-up works such as [158, 19, 115, 52, 58, 162, 18, 27, 146, 91, 145, 26]. As a further refined continuous-time model preserving more information from the discretization, the high-resolution ODE for NAG-SC was introduced in [137], and was further developed by [103]. In addition to accelerated methods, various topics and methods in optimization have been studied in a continuous-time framework. Continuous-time dynamics related to splitting methods were studied by [20, 1, 25, 43, 66, 78]. [65] studied continuous-time dynamics of ADMM [74, 71, 70, 56, 29], and provided an accelerated ADMM by discretizing the ODE model combined with [143]. The analyses were furthermore generalized to differential inclusions by [170, 67]. There are numerous works of continuous-time analyses for distributed optimization, [156, 104, 73, 86, 100] to name a few, and we refer the readers to the survey paper [165] for a comprehensive overview. Computer-assisted analysis of optimization algorithms. There has been lines of work automating the analysis of optimization methods using semidefinite programs (SDP). 
One line of work is performance estimation problems (PEP) introduced by [54], which provides a systematic way to obtain worst-case performance guarantees of a given fixed-step first-order method. The range and technique of utilizing PEP have been further developed by [150, 148, 130, 113, 90], and many efficient algorithms with tight analyses utilizing PEP are discovered [88, 99, 87, 89, 167, 95, 122, 75, 149, 84, 22, 166]. Another line of work is an approach adapting integral quadratic constraints (IQC) [108]. IQCs are a powerful analysis method in control theory for analyzing interconnected dynamical systems with nonlinear feedback. This approach was first adapted for analyzing first-order optimization algorithms by [97] and followed by [60]. Analyses based on IQC have lead to tight bounds for well-known algorithms [120, 81]. IQC has also been utilized to develop new fast algorithms with tight convergence rates [152, 44, 147, 141]. Recently, an extension of PEP to leveraging quadratic constrained quadratic programs (QCQP) was introduced by [45]. Treating the step-sizes as optimization variables, this work furthermore provides systematic computer-assisted methodology to optimize the step-sizes. Our work adapts this approach to finding appropriate discretizations. To the best of our knowledge, our proposal is the first instance of using computer-assisted methodologies to find discretizations of continuous-time dynamics. Physics-bases approaches to designing optimization algorithms. Optimization methods obtained by discretizing conformal Hamiltonian dynamics [107] were considered by [105]. Studying structure-preserving discretizations for conformal (dissipative) Hamiltonian systems, [68, 69] analyzed symplectic structure of Nesterov and heavy ball, and introduced Relativistic Gradient Descent (RGD) by adopting ideas from special relativity. Based on relativistic Born-Infeld (BI) dynamics, [47] considered a class of frictionless, energy-conserving system and introduced Bouncing BI (BBI) algorithm as a discretization. Our work is based on nonlinear resistive electric circuits, the study of which dates back to [55]. The stationary condition for nonlinear networks were considered by [109], generalizing theorems of Maxwell [106] for linear networks. The study of nonlinear resistive networks influenced the refinement of the concept of maximal monotonicity [111, 112, 110], which is now a fundamental concept in convex optimization. Well-posedness of the solutions for nonlinear networks was studied by [49, 51], but only for one-descent nonlinear resistors. Recently, the study of nonlinear electrical circuits was revisited by [38, 37, 36, 39] using contemporary methods of convex optimization. However, their main focus was on circuits, not on designing new optimization algorithms. To the best of our knowledge, our work is the first to introduce a generalized framework for designing optimization methods based on electric circuits. Discretization. Continuous-time analyses of optimization algorithms must eventually contend with the issue of discretizing the dynamics into a discrete-time algorithm. Discretization of differential equations is a subject of numerical analysis, and it has a long history, even dating back to Euler [57]. Standard discretization schemes such as Euler, Runge Kutta [127, 94] and symplectic integrators [154, 128, 61], have a rich body of research analyzing their convergence [77, 83] for example. 
However, these theories in numerical analysis primarily focus on the convergence of the discretized sequence to the trajectory of the solution flow of the differential equation over a finite time interval, which differs from the focus of optimization. Therefore, directly applying standard discretization schemes from numerical analysis does not ensure convergence with respect to the optimality criteria of interest in optimization, such as function-value or optimal-point convergence. In optimization, the study of discretization can broadly be divided into two categories. One involves applying standard discretization schemes or their variants, and the other provides special rules tailored to the specific dynamics of interest. As previously discussed, the former can only guarantee convergence up to certain errors [24, 64], or covers only specific and limited cases [134, 172, 136, 114, 173, 146, 151]. The latter line of work does provide discretization rules with analytic proofs for certain families of ODEs [7, 144, 158, 14, 161, 3, 12, 52, 26], but cannot be applied to general cases. Of course, both approaches have brought significant advances in obtaining new methods from continuous-time dynamics; however, previous approaches cannot be immediately applied to the new ODEs that emerge from our framework. To the best of our knowledge, our work is the first to propose automating the process of finding a discretized method from an ODE using computer-assisted tools.

B Proof of Theorem 2.1

To prove Theorem 2.1, it is sufficient to consider the case without 0-ohm resistors in which, furthermore, all resistance, inductance, and capacitance values are 1. We first state the theorem for such cases, which implies Theorem 2.1.

Theorem B.1. Let f : Rm → R be a µ-strongly convex and M-smooth function and let B : RJ → RK be a matrix. Suppose (v0, i0, x0, y0) satisfies
i0 ∈ R(B⊤), v0_R = i0_R, y0 = ∇f(x0). (10)
Then there is a uniquely determined Lipschitz continuous curve (v, i, x, y) : [0, ∞) → R2K that satisfies
(i, y) ∈ R(B⊤), y = ∇f(x), v_R = i_R, v_L = (d/dt) i_L, i_C = (d/dt) v_C, (11)
for all t ∈ (0, ∞) and the initial condition (v(0), i(0), x(0), y(0)) = (v0, i0, x0, y0).

Lemma B.2. Theorem B.1 implies Theorem 2.1.

Proof. (i) KCL, KVL, and V-I relations for the equivalent dynamics without 0-ohm resistors. We first consider the equivalent dynamic interconnect without 0-ohm resistors. As 0-ohm resistors are ideal wires, from basic circuit theory we know the nodes connected by 0-ohm resistors can be considered as a single node. We find the expressions for KCL, KVL, and the V-I relations for the equivalent dynamic interconnect composed with ∇f. The equivalent expressions for KCL and KVL can be considered a consequence of Tellegen's theorem [50, 10.2.3]; however, we write out the details here to make the argument self-contained. Observe that KCL and KVL can be equivalently written as a linear system involving the block matrix [A Im 0]. We furthermore restrict the values to satisfy Ohm's law for 0-ohm resistors, i.e., the potential values of two nodes connected by a 0-ohm resistor are identical. Let us first focus on KCL, the left equation. Suppose nodes j and l are connected by a 0-ohm resistor named Rjl. Suppose the k-th column of A corresponds to Rjl. Eliminating Rjl corresponds to eliminating the k-th column of A and eliminating i_Rjl from i. However, if we just directly eliminate them, as i_Rjl may not be zero, the equation will no longer be satisfied. We need to keep the information that the currents flowing into node j (except for i_Rjl) flow to node l.
As we do not permit ideal wire loop, without loss of generality we may assume node j is not the ground node. To preserve the information, when node l is not the ground node, we add the j th row of A Im 0 the l th row. Then k th component of the l th row becomes 0, thus the equation corresponding to the l th row will still be satisfied after eliminating the k th column and Rjl. When node l is the ground node, skip the row addition. Now eliminate the j th row. Note that column is eliminated only from A. We now move on to KVL. Eliminating a column of A and a component in i corresponds to eliminating a row of A and a component in v. This conserves the validity of the equation. Next, the row operation corresponds to column operation for A Im 0 . Recall we ve restricted the potential values of the nodes connected with 0-ohm resistor to be same, values in x e corresponding to column j and l coincide. Thus when node l is not the ground node, adding j th column to the l th column and eliminating j th component in x e , will not change the values on the left hand side. When node l is the ground node, the same argument holds by skipping the column addition. Repeat this process until there is no 0-ohm resistors. Name the reduced matrix as B and reduced current as i. Then KCL reduces to B i y = 0 and KVL reduces to v x , or equivalently v x Now name the reduced diagonal matrices DR, DL and DC as the reduced matrices that without the entries corresponding to eliminated components. Note DR has no zero diagonal entries. Then KCL, KVL and V-I relations for the equivalent dynamic interconnect composed with f become as follows i y R( B ), y = f(x), v R = DR i R, v L = DL d dt i L, i C = DC d dt v C. As an equivalent dynamics, it is enough to prove the curve that satisfies (12) and the initial condition ( v(0), i(0), x(0), y(0)) = ( v0, i0, x0, y0) with condition i0 R( B ), v0 R = DR i0 R, y0 = f(x0) (13) is unique and Lispchitz continuous. (ii) Sufficient to consider only the cases with DR, DL and DC are identity matrices. For a dynamic interconnect composed with f, consider the equivalent dynamics without 0-ohm resistors. Let B be the matrix in (12) for the dynamics, and let K be the number of columns of B. Suppose ( v0, i0, x0, y0) satisfy (13). Define the diagonal matrix and define B = BP. Define i0 and v0 to satisfy i0 (v0, i0, x0, y0) satisfies (10) since = ( BP) P 1 i0 = B z0 z0, v0 = P B z = Bz0, v0 R = DR i0 R q DR v0 R = q D 1 R i0 R v0 R = i0 R. Then by Theorem B.1, there is a Lipschitz continuous curve (v, i, x, y): [0, ) R2K that satisfies (11) for all t (0, ) and the initial condition (v(0), i(0), x(0), y(0)) = (v0, i0, x0, y0). Define i and v to satisfy i y . Then i and v are Lipschitz continuous as well, as they are composition of linear operators and Lipschitz continuous curves. Furthermore, we can check (12) and the initial condition ( v(0), i(0), x(0), y(0)) = ( v0, i0, x0, y0) is satisfied, with the similar argument above. Reversing the arguments, the uniqueness can be obtained since P is invertible and thus (v, i) 7 ( v, i) is bijective. This concludes the proof. By Lemma B.2, our goal has reduced to Theorem B.1. We will establish the well-posedness for v C and i L first, then extended them to whole curve. The well-posedness of v C, i L can be obtained by reducing the dynamics to a differential inclusion with a maximal monotone operator. We first restate the theorem in [21] and its immediate implication as a remark, which we use in the proof. Theorem B.3. 
[21, Thm 3.2.1] Let 𝕄: Rn Rn be a maximal monotone operator, consider the differential inclusion X(t) 𝕄(X(t)), (14) with initial condition X(0) = X0 dom 𝕄. Then there is a unique solution X : [0, ) Rn that is absolutely continuous and satisfies (14) for almost all t. Moreover, if we denote T = {t [0, ) | X is differentiable at t}, then followings are true. (i) Let X( ), Y ( ) are the solutions issued from X0, Y0 dom 𝕄respectively. Then X(t) Y (t) X0 Y0 for all t 0. (ii) For all t 0, X+(t) := limh 0+ X(t+h) X(t) h is well-defined and continuous from the right. Note, X(t) = X+(t) for all t T . (iii) t 7 X+(t) is nonincreasing. (iv) X+(t) = m(𝕄(X(t))) holds for all t 0. Here m(K) is the element of K Rn with minimal norm, that is, m(K) = ΠK(0) = argmin k K k . Therefore X(t) = m(𝕄(X(t))) holds for all t T , and so (14) is satisfied almost everywhere. Remark. From (iii) we have X+(t) X+(0) = m(𝕄(X0)) for all t 0, thus for t1, t2 0 we have X(t1) X(t2) = t2 X+(s)ds Z t1 t2 m(𝕄(X0)) ds = |t1 t2| m(𝕄(X0)) . Therefore the theorem implies that X is Lipschitz-continuous, in particular with parameter m(𝕄(X0)) . Thus our first goal is to prove the condition (11) can be equivalently written as d dt for some maximal monotone operator 𝔸: R|C|+|L| R|C|+|L|. We first establish an efficient reformulation of KCL and KVL. Lemma B.4. There is a skew-symmetric matrix ˆH : Rσ+m Rσ+m and a corresponding diagonal matrix J : Rσ+m Rσ+m with entries 0 of 1 that satisfies i y R(B ) ˆu = ˆH ˆw, where ˆu and ˆw are defined as ˆw = [J Iσ+m J] , ˆu = [Iσ+m J J] Moreover, let Q: Rσ+m Rσ+m be a permutation matrix, define w = Q ˆw, u = Qˆu. Then there is a skew-symmetric matrix H that satisfies i y R(B ) u = Hw. Remark. The diagonal matrix J determines whether to select voltage or current for each component, to construct ˆw. To clarify, w, u Rσ+m are the vectors that {wl, ul} becomes a current and voltage pair of a component for l = 1, 2, . . . , σ + m. Such partitions of current, voltages values w, u and skew-symmetric matrix H were also considered in [82] with different notation. However, we introduce our method of constructing them here, as we will consider H with a special property in Corollary B.4.1 that plays a key role in the proof. Proof. Define N and B be matrices consisted with basis of N(B) and Row(B) respectively. Then KCL and KVL can be shortly rewritten as We now show there is a diagonal matrix J : Rσ+m Rσ+m with entries 0 or 1, that makes the below square matrix invertible N 0 0 B J Iσ+m J R2(σ+m) 2(σ+m). Name N0 = N 0 and B0 = 0 B . We will attach the standard basis vectors or 0 below, and increase the index with attached number of rows. We proceed induction on the index, until the index becomes σ + m. Suppose, for 0 k σ + m 1, Nk and Bk satisfy the form N 0 j1e1 ... jkek 0 B (1 j1)e1 ... (1 jk)ek where jl {0, 1} and el Rσ+m is a standard basis (row) vector for 1 l k. We claim either ek+1 / Row(Nk) or ek+1 / Row( Bk) is true. Proof by contradiction. Suppose not. That is, suppose ek+1 Row(Nk) and ek+1 Row( Bk). Then there are n = (n1, . . . , nσ+m) Row(N), r = (r1, . . . , rσ+m) Row( B) and coefficients al, bl such that l=1 aljlel = r + l=1 bl(1 jl)el. Taking inner product with ep, 1 p σ + m, we have ap if 1 p k, jp = 1 0 if 1 p k, jp = 0 1 if p = k + 1 0 if k + 1 < p σ + m, 0 if 1 p k, jp = 1 bp if 1 p k, jp = 0 1 if p = k + 1 0 if k + 1 < p σ + m. p=1 nprp = nk+1rk+1 = 1. By the way, since n Row(N) = N(B), r Row( B) = R(N ) we have n r and so n, r = 0. 
A contradiction, we conclude either ek+1 / Row(Nk) or ek+1 / Row( Bk) is true. From the proved claim, we can extend N0, B0 to Nσ+m, Bσ+m with keeping the form of (15) by repeating the process below. Recall, the desired form of the matrix was N 0 0 B J Iσ+m J with diagonal matrix J R(σ+m) (σ+m) with entries 0 or 1. By the construction, we see matrix Nσ+m Bσ+m satisfies the desired form. Moreover, we know the nonzero rows of Nσ+m and Bσ+m are linearly independent respectively, by their construction. By the form of G we see if l-th row of Nσ+m is nonzero then l-th row of Bσ+m is zero and vice-versa, we conclude the rows of G are linearly independent. Therefore, G is invertible. N 0 0 B J Iσ+m J We know above equation holds for arbitrarily chosen (v, x), (i, y) that satisfies KVL and KCL respectively. Observe dom(G) = R(B ) N(B) and from dimension theorem we know dim(R(B ) N(B)) = dim(R(B )) + dim(N(B)) = σ + m. As G is invertible, we have dim(R(G)) = dim(dom(G)) = σ + m. Therefore the values of the components of ˆw can be arbitrary values in R. Rearranging the rows of G 1, from (16) we obtain H R2(σ+m) 2(σ+m) that satisfies ˆw ˆu = Hw 0 Hw w Hu 0 Hu w where the block matrices are in R(σ+m) (σ+m). Now, naming ˆH = Hu w we get ˆu = ˆH ˆw. Now to show H is skew-symmetric, recall from (v, x) R(B ) and (i, y) N(B) we have (v, x), (i, y) = 0. Thus for all ˆw Rσ+m, we have D ˆw, ˆH ˆw E = ˆw, ˆu = (v, x), (i, y) = 0. Therefore ˆH is skew-symmetric. Finally, let Q: Rσ+m Rσ+m be a permutation matrix. Define H = Q ˆHQ . Since H = Q ˆH Q = Q( ˆH)Q = H, H is skew-symmetric. And from Q Q = Iσ+m we have ˆu = ˆH ˆw u = Qˆu = Q ˆH ˆw = Q ˆHQ Q ˆw = Hw, we conclude the proof. Corollary B.4.1. Recall ˆw is composed with voltage or current values of each component. Integrate the values of resistors and denote as r, integrate v C, i L as p and integrate i C, v L as p . Then we may rearrange the elements of ˆw with certain permutation matrix Q, that w = Q ˆw can be decomposed as following order , where wp = wv C wi L , wp = wi C wv L , wr = wvr wir wv R wx wi R wy Consider rewriting u = Hw in the decomposed way as "up up ur Hp p Hp p Hp r Hp p Hp p Hp r Hr p Hr p Hr r Then there is a diagonal matrix J satisfies the properties considered in Lemma B.4, that corresponding H satisfies Hr p = 0, Hp p = 0, Hp r = 0. Proof. Name the indices as Cl, Lk {1, . . . , σ + m} for l {1, . . . , |C|}, k {1, . . . , |L|} that satisfy = v Cl, e Lk First we put e Cl s and e Lk s in J as many as possible. That is, determine the values of j Cl s and j Lk s to satisfy {e Cl | j Cl = 1} is linearly independent to Row(N). {e Cl | j Cl = 1} e Cl is linearly dependent to Row(N) for any l {j Cl = 1}. {e Ls | j Lk = 0} is linearly independent to Row( B). {e Ls | j Lk = 0} e Ls is linearly dependent to Row( B) for any s {j Ls = 0}. Next, fill the remaining j s as we ve done in Lemma B.4. Since the proof can be applied using the same argument to other cases, we will focus on a specific case. Focusing on the last row of (17), we can furthermore decompose and write as following ur = Hr p Hr p Hr r " wp wp wr = 0 Hir i L Hir i C 0 0 Hir ir Hvr v C 0 0 Hvr v L Hvr vr 0 wv C wi L wi C wv L wvr wir Note, since above equations origin from KCL and KVL (which are linear equations only with current values or voltage values), Hβ α = 0 if α is current and β is voltage, and vice-versa. Refer [135, Theorem 6.3]. Observe Hr p = Hir i C 0 0 Hvr v L , here we show Hir i C = 0. 
Focusing on arbitrary k th row of Hir i C, from above equality we get uirk = Hir i L Hir i C Hir ir "wi L wi C wir 0 = Hir i L Hir i C Hir ir "wi L wi C wir where the subscript k means the k th row of the block matrix. As this is a linear equation of current values, it origins from KCL, thus there is a vector r Row( B) corresponding to this equation, i.e. = Hir i L Hir i C Hir ir "wi L wi C wir On the other hand, as wi L, wi C, wir are consisted with the components of i, y that corresponds to jl = 0, there are coefficient vectors a R|L|, b R|C|, c R|R|+m that satisfies Hir i L Hir i C Hir ir "wi L wi C wir s {j Ls=0} ase Ls + X l {j Cl=0} ble Cl + X q {jrq =0} cqerq erk Note bl s correspond to components of Hir i C k. Organizing, we have s {j Ls=0} ase Ls + X l {j Cl=0} ble Cl + X q {jrq =0} cqerq erk. Observe that from right hand side, we can see r is orthogonal to e Cl | j Cl = 1 . By the way, as e Cl | j Cl = 1 {e Cl} is linearly dependent to Row(N) for all l {j Cl = 0}, we see X l {j Cl=0} ble Cl span e Cl | j Cl = 1 Row(N) , so there is some coefficient vector d R|C| and n Row(N) that satisfies X l {j Cl=0} ble Cl = X l {j Cl =1} dl e Cl + n. However, as r Row( B) and Row( B) Row(N), we have r, n = 0. Moreover, as r is orthogonal to e Cl | j Cl = 1 , we conclude l {j Cl =1} dl e Cl + n l {j Cl=0} ble Cl l {j Cl=0} ble Cl l {j Cl=0} b2 l . Therefore, as bl s corresponds to components of Hir i C k, we conclude Hir i C k = 0. As k was arbitrary, we get Hir i C = 0. Similarly we can show Hvr v L = 0, and thus Hr p = 0. Repeating the same argument, we can show Hp p = 0. Finally, as H is skew-symmetric, we have Hp r = (Hr p ) = 0. We now move on to V-I relations of resistors. To express V-I relations in terms of w and u, we adopt partial inverse. Definition. [23, Definition 20.42] Let 𝕄: Rd Rd be a set-valued operator and let K be a closed linear subspace of Rd. Denote ΠK : Rd Rd the projection onto K as ΠK(z) = argmin k K z k . The partial inverse of 𝕄with respect to K is the operator 𝕄K : Rd Rd defined by gra 𝕄K = {(ΠKx + ΠK y, ΠKy + ΠK x) | (x, y) gra 𝕄} , u 𝕄K(w) x, y such that y 𝕄x and (w, u) = (ΠKx + ΠK y, ΠKy + ΠK x). We then prove important properties of the function related to V-I relations for resistors. Lemma B.5. Suppose f is µ-strongly convex and M-smooth function. Let Qr, Hr r , Jr : R|R|+m R|R|+m be a permutation matrix, a skew-symmetric matrix, a diagonal matrix with entries 1 or 0 respectively and let K = R(Jr). Define F : R|R|+m R as F(v R, x) = 1 2 v R 2 + f(x). Then the following holds. (i) dom(Qr( F)KQ r Hr r ) 1 = R|R|+m. (ii) (Qr( F)KQ r Hr r ) 1 is Lipschitz continuous monotone operator. Proof. Take (wl r, ul r) R|R|+m for l {1, 2}, such that ul r (Qr( F)KQ r)wl r. As Qr is permutation matrix, we know (Qr) 1 = (Qr) , and thus Q rul r ( F)K(Q rwl r). Then there are vl R xl R|R|+m such that = F vl R xl , Q rwl r, Q rul r = ΠK + ΠK il R yl + ΠK vl R xl (18) By [23, Proposition 20.44, (iii)], we have Q r(w1 r w2 r), Q r(u1 r u2 r) = v1 R x1 Moreover, we can check 2 + ΠK v1 R x1 2 + ΠK i1 R y1 2 = w1 r w2 r 2 + u1 r u2 r 2 . Define µmin = {µ, 1} and Mmin = {M, 1}. Then we can check F is µmin-strongly convex and Mmin-smooth, we see (i1 R, y1) (i2 R, y2), (v1 R, x1) (v2 R, x2) µmin (v1 R, x1) (v2 R, x2) 2 , (i1 R, y1) (i2 R, y2), (v1 R, x1) (v2 R, x2) Mmin (i1 R, y1) (i2 R, y2) 2 . w1 r w2 r, u1 r u2 r = Q r(w1 r w2 r), Q r(u1 r u2 r) = v1 R x1 µmin + Mmin = µmin + Mmin w1 r w2 r 2 + u1 r u2 r 2 µmin + Mmin w1 r w2 r 2 , we see Qr( F)KQ r is µmin+Mmin 2 -strongly monotone. 
Note we can check ( F)K is also strongly monotone, by considering the special case Qr = I|r|+m. Lastly, since Hr r is skew-symmetric, we know Hr r z, z = 0 for all z R|R|+m. Therefore for arbitrary ( wl r, ul r) R|R|+m with l {1, 2} such that ul r (Qr( F)KQ r Hr r ) wl r, since ul r = ul r + Hr r wl r (Qr( F)KQ r) wl r, w1 r w2 r, u1 r u2 r = w1 r w2 r, u1 r u2 r Hr r ( w1 r w2 r) = w1 r w2 r, u1 r u2 r µmin + Mmin w1 r w2 r 2 thus (Qr( F)KQ r Hr r ) is also µmin+Mmin 2 -strongly monotone. Now since F is maximal monotone, ( F)K is maximal monotone by [23, Proposition 20.44, (v)]. Since ( F)K is strongly monotone, we have dom( F)K = R|R|+m by [23, Proposition 22.11], and thus Qr( F)KQ r is maximal monotone by [129, Theorem 12]. Moreover, since both Qr( F)KQ r and Hr r have full domain, (Qr( F)KQ r Hr r ) is maximal monotone by [129, Theorem 10]. Organizing, (Qr( F)KQ r Hr r ) is maximal monotone and strongly monotone. Therefore by [23, Proposition 22.11], we conclude (ii). Finally, observe (Qr( F)KQ r Hr r ) 1 = Qr( F)KQ r Hr r µmin + Mmin 2 I|R|+m | {z } =:𝕄 +µmin + Mmin = 𝕁 2 µmin+Mmin 𝕄 2 µmin + Mmin I|R|+m. Since 𝕄is monotone, 2 µmin+Mmin 𝕄is also monotone, by [23, Corollary 23.9] we know 𝕁 2 µmin+Mmin 𝕄 is 1-Lipschitz continuous. Therefore (Qr( F)KQ r Hr r ) 1 is 2 µmin+Mmin -Lipschitz continuous. Finally it is monotone as it is an inverse of a monotone operator, we conclude (ii). Finally, we are ready to prove Theorem 2.1. Proof of Theorem 2.1. By Lemma B.2 it is suffices to show Theorem B.1. (i) Well-posedness and Lipschitz continuity of v C, i L. Existence of the whole curve (v, x, i, y). Define an operator 𝔸: R|L|+|C| R|L|+|C| as v = (v R, v L, v C), i = (i R, i L, i C), (x, y) (19) such that i y R(B ), y = f(x), v R = i R We prove 𝔸is maximal monotone by providing an explicit expression of 𝔸and apply Theorem B.3. From Corollary B.4.1, we know there is a diagonal matrix J : Rσ+m Rσ+m, a permutation matrix Q: Rσ+m Rσ+m and a corresponding skew-symmetric matrix H : Rσ+m Rσ+m that satisfies i y Hp p Hp p Hp r Hp p 0 0 Hr p 0 Hr r where w and u are defined as in Lemma B.4. Define F : R|R|+m R as F(v R, x) = 1 2 v R 2 + f(x). Then it is straight forward that y = f(x), v R = i R i R y By the construction of w, u, there is a diagonal matrix Jr : R|R|+m R|R|+m with entries 1 or 0 and a permutation matrix Qr : R|R|+m R|R|+m that satisfies Q 1 r wr Q 1 r ur = Q rwr Q rur = Jr I|R|+m I|R|+m Jr v R x i R y Define Kr = R(Jr). Note, we can check ΠKr(z) = Jrz and ΠK r (z) = (I|R|+m Jr)z for all z R|R|+m. Recalling (18) in Lemma B.5 we see i R y ur = (Qr( F)Kr Q r)wr. From (Qr( F)Kr Q r)wr = ur = Hr pwp + Hr r wr, we get expression for wr in terms of wp wr = (Qr( F)Kr Q r Hr r ) 1 Hr pwp. (21) Now focusing on the expression for up up , since Hr p = (Hp r ) as H is skew-symmetric, eliminat- ing wr by applying (21) we see up up = Hp p Hp p Hp r Hp p 0 0 = Hp p Hp p Hp p 0 + Hr p 0 (Qr( F)Kr Q r Hr r ) 1 Hr p 0 Since H is skew-symmetric, its principal minors Hp p Hp p Hp p 0 and Hr r are skew-symmetric and so maximal monotone. Furthermore, by Lemma B.5 we have (Qr( F)Kr Q r Hr r ) 1 is maximal monotone and dom(( F)1, 1 Hr r ) 1 = R|R|+m. Invoking [129, Theorem 11, 12], we conclude 𝔹is maximal monotone. 
Organizing, we see i C v L R(B ), y = f(x), v R = i R , where w u J Iσ+m J Iσ+m J J Therefore there is a diagonal matrix Jp,p : R|L|+|C| R|L|+|C| with entries 1 or 0 and a permutation matrix Qp,p : R|L|+|C| R|L|+|C| that satisfies i C v L where Q p,p 0 0 Q p,p wp wp up up = Jp,p I|L|+|C| Jp,p I|L|+|C| Jp,p Jp,p v C i L i C v L Therefore, for Kp,p = R(Jp,p ) we have 𝔸Kp,p = Q p,p 𝔹Qp,p . Since 𝔹is a maximal monotone operator with dom 𝔹= R|L|+|C|, we have Q p,p 𝔹Qp,p is maximal monotone. Finally from [23, Proposition 20.44, (v)], we conclude 𝔸is maximal monotone. By applying Theorem B.3 and its remark, we know there is a unique Lipschitz continuous curve (v C, i L): [0, ) R|L| R|C| that satisfies v C(t) i L(t) = m𝔸 v C(t) i L(t) 𝔸 v C(t) i L(t) for almost all t [0, ), where m𝔸is the minimum-norm selection of 𝔸. Let i C(t) v L(t) m𝔸 v C(t) i L(t) . Moreover, the definition of 𝔸implies the existence of accompanying curves v R, i R, x and y that satisfy KCL, KVL and V-I relations. This concludes the existence of the curve. (ii) The whole flow (v, x, i, y) is well-posed and Lipschitz continuous. Finally, we show other curves besides (v L, i C) are defined uniquely and Lipschitz continuous. To do so, we prove there is a Lipschitz continuous function G : R|C| R|L| Rσ Rσ Rm Rm that satisfies G(v L, i C) = (v, i, x, y). We prove the claim by finding the explicit expression of the component functions of G. We first show one key equation wp = (Hp p ) up . (23) From (20) we have Hp p wp = up . And as H is skew-symmetric, we have Hp p = (Hp p ) . The core information we additionally use here, is the V-I relations d dtwp = up . As differentiation is a linear operation, we get (23) by following Hp p up = Hp pup = Hp p dt Hp pwp = d dtup = wp . Recall, from (22) we have up = Hp p + Hp r (Qr( F)Kr Q r Hr r ) 1 Hr p wp + Hp p wp . Moving the last term of right hand side to left hand side, multiplying both sides by (Hp p ) and using (23), we have 𝕀+ (Hp p ) Hp p wp = (Hp p ) Hp p + Hp r (Qr( F)Kr Q r Hr r ) 1 Hr p wp. Since 𝕀+ (Hp p ) Hp p 0 its inverse exists, we conclude wp = 𝕀+ (Hp p ) Hp p 1 (Hp p ) Hp p + Hp r (Qr( F)Kr Q r Hr r ) 1 Hr p Organizing, we have " wp wp wr 0 0 (Qr( F)Kr Q r Hr r ) 1 Hr p From Lemma B.5 we know (Qr( F)Kr Q r Hr r ) 1 is Lipschitz continuous, and clearly linear operators are Lipschitz continuous, so ℂis Lipschitz continuous as it is composition and sum of Lipschitz continuous functions. Therefore wp 7 w is Lipschitz continuous. Finally since u = Hw and H is indeed Lipschitz continuous as a linear operator, mapping wp 7 u is also Lipschitz continuous. As (w, u) is rearrangement of (v, i, x, y) and wv C, wi L are component functions of v C and i L, we get the desired result. For (v, i, x, y) that satisfies (11) with proper initial value, we know (v C, i L) is uniquely defined by previous observation and (v, i, x, y) = G(v C, i L) should hold, we conclude (v, i, x, y) is uniquely determined since G is single valued. Furthermore, as v C(t) and i L(t) are Lipschitz continuous with respect to t, we have (v(t), i(t), x(t), y(t)) = G(v C(t), i L(t)) is also Lipschitz continuous as it is composition of Lipschitz continuous functions. This concludes the proof. C Equilibrium condition Define the set of voltages and currents in the equilibrium of the interconnect Dx,y = (v, i) | Ai = y 0 , v = A x e , v R = DRi R, v L = 0, i C = 0 . Lemma C.1. Assume the dynamic interconnect is admissible. Then for all (v, i) Dx,y we have v R = i R = 0. Proof. 
First, note that for all (v, i) in the dynamic interconnect v, i = e, Ai = x e = x, y . (24) Now suppose (v, i) Dx,y, then we have v, i = v R, i R = i R 2 DR = x, y . From the admissibility assumption we have x R(E ) and y N(E). Thus i R 2 DR = x, y = 0, which concludes the proof. Theorem C.2. Assume the dynamic interconnect is admissible. If (x , y ) is a primal-dual solution pair (with zero duality gap) for the optimization problem, then there exist v C R|C| and i L R|L| such that ((0, 0, v C), (0, i L, 0)) Dx ,y . Conversely, if y f(x) and ((v R, v L, v C), (i R, i L, i C)) Dx,y, then v R = i R = 0 and (x, y) is a primal-dual solution pair (with zero-duality) for the optimization problem. Proof. First, observe the admissibility assumption can be rewritten as {(x, y) | (v, i) such that (v, i) Dx,y} = R(E ) N(E). Now suppose (x , y ) X Y . Then by Karush-Kuhn-Tucker (KKT) optimality conditions, we have (x , y ) R(E ) N(E). Thus there exists (v , i ) such that (v , i ) = ((v R, 0, v C), (i R, i L, 0)) Dx ,y . Furthermore from Lemma C.1 we have v R = i R = 0. Therefore v C, i L are the vectors that satisfiy the desired statement. Conversely, suppose (v, i) = ((v R, v L, v C), (i R, i L, i C)) Dx,y. From Lemma C.1, we have v R = i R = 0. Moreover, since there exists (v, i) such that (v, i) Dx,y, by admissibility assumption we have (x, y) R(E ) N(E). Finally, given the assumption that y f(x), by using KKT optimality conditions, we conclude (x, y) X Y . D Energy dissipation analysis In this section, we provide the proof of Theorem 2.2. The energy function (6) used in the proof is related to the Lyapunov function considered in [159]. However, the dissipativity theory presented in [159, 160] does not directly apply to our setup. In our setup, we allow cases where v C, i L oscillate, for example, a circuit with a disconnected L C loop. The proof is obtained by combining Barbalat s lemma [85, Lemma 8.2] with Theorem 2.1. Proof of Theorem 2.2. Let (x , y ) be a primal-dual solution pair. Then by Theorem C.2, there is (v , i ) Dx ,y that satisfies (v , i ) = ((0, 0, v C), (0, i L, 0)). (25) In particular, i C = 0 and v L = 0. Define the total energy at time t as 2 v C v C 2 DC + 1 2 i L i L 2 DL. Then the power at time t is given by d dt E(t) = v C v C, DC v C + i L i L, DL i L = v C v C, i C i C + i L i L, v L v L = v R 0 v R, i R 0 i R x x , y y (26) = i R 2 DR x x , y y | {z } 0 where we used (24) and the monotonicity of f. Therefore E( ) = limt E(t) exists. Now, integrating from 0 to we have 0 x(t) x , y(t) y dt E(0) E( ) < . From Theorem 2.1 we know the integrand is Lipschitz continuous, by Barbalat s lemma we get lim t x(t) x , y(t) y = 0. Since f is µ-strongly convex and M-smooth, f and ( f) 1 are strictly monotone, we conclude lim t x(t) = x , lim t y(t) = y , which is our desired result. E Centralized classical algorithms E.1 Resistors and Moreau envelope For R > 0, define the Moreau envelope of f : Rm R of parameter R as Rf(x) = inf z R m Then Rf is 1/R-smooth with gradient given by R(x prox Rf(x)). (27) In this section we show that composing linear resistors with f is equivalent to taking a Moreau envelope of f. See two circuits below. By KCL and Ohm s law for the first circuit, we have 1 R(x x) = i f( x), which is equivalent to x = prox Rf(x). Using identity for the gradient of the Moreau envelope, we get R(x prox Rf(x)) = 1 R(x x) = i. Therefore, the V-I relation on m pins of x in both circuits is identical. As a consequence, consider f to be 1/R-smooth. 
Let f be pre-Moreau envelope of f, i.e., R f = f. Note that f is a convex function. Then from the series connection of the resistors ( R in series with R is the same as 0-ohm resistor), we get the equivalence of the two circuits below. Note that for this circuit x = x R f( x). E.2 Gradient flow Let f : Rm R be a convex function. Consider the circuit below. Let x be the potentials at m pins of f, and y be the current entering those pins. Applying KCL and the V-I relations of the capacitor we get DC d dtv C = i C = y f(x). Since e is connected to ground, we have v C = x e = 0. The resulting differential inclusion is d dtx D 1 C f(x). For the automatic discretization of the continuous-time dynamics of this circuit, we define the energy at time k as Ek = 1 2 xk x 2 DC. E.3 Nesterov acceleration Let f : Rm R be a 1/R-smooth convex function. Consider the circuit below. Observe, by Ohm s law and y = f(x) we have x+ = x Ry = x R f(x). From KCL, KVL, and V-I relations we have d dti L = D 1 L (v C x+) d dtv C = D 1 C f(x). Applying KCL and Ohm s law at x, it follows f(x) = i L + 1 which implies that x = v C + Ri L. Differentiating the above equality twice and plugging in the V-I relations, we get d2 dt2 x = R(DLDC) 1 f(x) RD 1 L d dtx (D 1 C R2D 1 L ) d Reorganizing, we conclude d2 dt2 x + RD 1 L d dtx + (D 1 C R2D 1 L ) d dt f(x) + R(DLDC) 1 f(x) = 0. (28) Under the proper selection of parameters for µ-strongly convex and L-smooth function f, (28) corresponds to the high-resolution ODE for NAG-SC introduced in [136] d2 dt2 x + 2 µ d dt f(x) + (1 + µs) f(x) = 0. As an immediate consequence, if we set R = 1 4µ, Li = 1 8µ µ, Ci = 2 µ, we recover the lowresolution ODE of NAG-SC d2 dt2 x + 2 µ d dtx + f(x) = 0. For the automatic discretization of the continuous-time dynamics of this circuit, we define the energy at time k as Ek = 1 2 vk C x 2 DC + 1 2 ik L y 2 DL. E.4 Proximal point method Consider the circuit below. Then from the discussion in E.1, the above circuit is equivalent to the circuit below. According to E.2, the ODE for the above circuit is d dtx = D 1 C Rf(x). (29) Since from (27) we have prox Rf(x) = x R Rf(x), this circuit gives a continuous model for the proximal point method. Applying Euler discretization to (29) with a stepsize of Ci R for each ith coordinate, we recover proximal point method xk+1 = xk R Rf(xk) = xk (xk prox Rf(xk)) = prox Rf(xk). For the automatic discretization of the continuous-time dynamics of this circuit, we define the energy at time k as Ek = 1 2 xk x 2 DC. E.5 Proximal gradient method Let f : Rm R be 1/R-smooth convex function, and g: Rm R be convex function. Consider the circuit below. Observe, by the Ohm s law e = x R f(x). Using KCL at x and KVL at e, we get i C = f(x) Rg(e) v C = x R f(x). Applying V-I relation for the capacitor and eliminating e gives dt f(x) = d C f(x) + Rg(x R f(x)) . Organizing d dtx = 1 C Rg(I R f) + f (x) + R d We can show that R d dt f(x) < M for some M > 0, thus d dtx = 1 CR R Rg(I R f) + R f (x) + O (MCR) . (30) Applying Euler discretization with stepsize CR we have xk+1 xk CR R Rg(I R f) + R f (xk) + O (MCR) . Multiplying CR on both sides and reorganizing gives xk+1 = xk R Rg(I R f) + R f (xk) + O(MCR) = prox Rg I (I R f)(xk) + (I R f) (xk) + O(MCR) = prox Rg(I R f)(xk) + O(MCR). If we set C R, we recover the proximal gradient method. For the automatic discretization of the continuous-time dynamics of this circuit, we define the energy at time k as Ek = C 2 ek e 2 2, where e = x R f(x ). E.6 Primal decomposition Let f1, . . . 
, f N : Rm R be convex functions. Consider the circuit below. Let x1, . . . , x N Rm be vectors of potentials at pins of f1, . . . , f N respectively. From KVL, we have e = x1 = = x N = v C. Using KCL at e and the V-I relation of nonlinear resistors we get PN j=1 yj + i C = 0, where yj fj(xj). Using the V-I relation for capacitor we have Discretizing above V-I relations, we recover primal decomposition yk j fj(xk j ) ek+1 = ek h For the automatic discretization of the continuous-time dynamics of this circuit, we define the energy at time k as Ek = C 2 ek x 2 2. E.7 Dual decomposition Let f1, . . . , f N : Rm R be convex functions. Consider the circuit below. Using KCL at xj and V-I relation for nonlinear resistors we get xj f j (yj) = f j (i Lj). Using KCL at e yields PN j=1 yj = 0. Using KVL and V-I relation for inductors we get e xj = v Lj = L d Summing over j = 1, . . . , N gives d dti Lj = L d j=1 yj = 0, leading to e = (1/N) PN j=1 xj. Discretizing above V-I relations, we recover dual decomposition xk j f j (ik Lj) ik+1 Lj = ik Lj h L(ek xk j ). For the automatic discretization of the continuous-time dynamics of this circuit, we define the energy at time k as 2 ik Lj y j 2 2. E.8 Proximal decomposition Let f1, . . . , f N : Rm R be convex functions. Consider the circuit below. Let x1, . . . , x N Rm be vectors of potentials at pins of f1, . . . , f N respectively. Define e Rm to be a vector of potentials on the bottom of the circuit. Observe, by Ohm s law and KCL we have yj = i Lj + 1 R(e xj) fj(xj). It implies that xi = prox Rfi(e + Ri Li). The V-I relations for inductors are given by d dti L = v L/L = (E e x)/L, where E = (I, . . . , I) RNm m. Further, note that by KCL Ey = PN j=1 yj = 0, therefore E d dt Ey = 0. Using the above V-I relations for g = Ne Ex we get the following ODE dt RE(y i L) = RE d L E(E e x) = R We initialize circuit with Ei L(0) = 0 and EE = NI gives 0 = Ey(0) = E(i L(0) + (E e(0) x(0))/R) = 1 Thus the solution to an ODE (31) is g = 0 and we conclude e = 1 The V-I relations for the circuit are xj = prox Rfj(e + Ri Lj), j = 1, . . . , N d dti L = (E e x)/L. Discretizing above V-I relations we recover proximal decomposition xk+1 j = prox Rfj(ek + Rik Lj), j = 1, . . . , N ek+1 = 1 N Exk ik+1 L = ik L + h L(E ek+1 xk+1). For the automatic discretization of the continuous-time dynamics of this circuit, we define the energy at time k as 2 ik Lj y j 2 2 + γ ek x 2 2, where γ is a parameter that is being optimized, see G. E.9 Douglas Rachford splitting Let f, g: Rm R be convex functions. Consider the circuit below. Using KCL at x1 and Ohm s law we get 1 R(x2 x1) + i L g(x1), which implies x1 = prox Rg(x2 + Ri L). Similarly, using KCL at x2 and Ohm s law we get 1 R(x1 x2) i L f(x2), which implies x2 = prox Rf(x1 Ri L). From KVL and V-I relation for inductors we have d dti L = 1 Discretizing above V-I relations with R = L = 1 and stepsize h = 1, we recover Douglas Rachford splitting xk+1 1 = prox Rg(xk 2 + Rik L) xk+1 2 = prox Rf(xk+1 1 Rik L) ik+1 L = ik L + h L(xk+1 2 xk+1 1 ). For the automatic discretization of the continuous-time dynamics of this circuit, we define the energy at time k as Ek = L 2 ik L y 1 2 2 + γ xk 2 x 2 2, where γ is a parameter that is being optimized, see G, and y 1 g(x ). E.10 Davis-Yin splitting Let f, g, h: Rm R be convex functions, with h also being 1/S-smooth. Consider the circuit below. Using KCL at e and Ohm s law we have x1 = x2 Si S + Si S = x2. Applying KCL at x2, we get S h(x2). 
(32) Using KCL at x3 and (32) it follows R + i L = x1 x3 S h(x2) g(x3). Using KCL at x1, we get x3 x1 Organizing, and applying V-I relation for inductor x3 = prox Rg S e R h(x1) x1 = prox Rf d dti L = 1 L(x1 x3). Now we eliminate the term i L. Differentiating (32), applying L d dti L = x1 x3 we get 1 L(x1 x3) = d In other words, d dte = S L(x3 x1) + d Using alternating update and Euler discretization of e and x1, we have xk+1 3 = prox Rg S ek R h(xk 1) xk+1 1 = prox Rf S (ek xk 1) ek+1 = ek + Sh L (xk+1 3 xk+1 1 ) + xk+1 1 xk 1 Sh d dt h(xk 1). Set R = S = h = α and L = α2, then the above can be rewritten as xk+1 3 = proxαg 2xk 1 ek α h(xk 1) xk+1 1 = proxαf ek + xk+1 3 xk 1 ek+1 = ek + xk+1 3 xk 1 α2 d dt h(xk 1). dt h(xk) is bounded, then α2 d dt h(xk) = O(α2). For small α we may ignore this term and recover DYS. For the automatic discretization of the continuous-time dynamics of this circuit, we define the energy at time k as Ek = L 2 ik L y 1 2 2 + γ ek 1 e 2 2, where γ is a parameter that is being optimized, see G, and e = x R(y 1 + y 3), y 1 f(x ), y 3 g(x ). F Decentralized classical algorithms In a decentralized optimization setup, we are given a graph G = (V, A) which defines the communication pattern between agents. This means that each agent is constrained to communicate only to its direct neighbors for the edges of the graph. We define Γj as the neighbors of j in graph G. For simplicity, in each example we only illustrate the circuit between components indexed by j and l, where j and l are connected through an edge in the graph G. F.1 Decentralized gradient descent Let f1, . . . , f N : Rm R be differentiable convex functions. Decentralized gradient descent (DGD) is derived as a gradient descent of the appropriate penalty formulation of a decentralized problem. Similarly, to construct a DGD circuit we apply the gradient flow circuit of E.2 to appropriate nonlinear resistors and arrive at the following circuit. The right side of the circuit contains the graph with resistors Rjl connecting vectors of potentials xj Rm and xl Rm for every neighbors j and l in the given graph G. Using the KCL at xj we get 0 = fj(xj) + X Rjl + i Cj. Applying the V-I relation for the capacitors we get l Γj (xj xl)/Rjl Euler discretization recovers the DGD h CRjl xk l h C fj(xk j ), with gradient stepsize h/C and the mixing matrix l Γj h CRjl if j = l h CRjl if j = l, l Γj 0 otherwise. For the automatic discretization of the continuous-time dynamics of this circuit, we define the energy at time k as 2 xk j x j 2 2. F.2 Diffusion Let f1, . . . , f N : Rm R be 1/R-smooth convex functions. Decentralized gradient descent is derived as a forward-backward splitting fixed point iteration of the appropriate penalty formulation of a decentralized problem. Similarly, to construct a diffusion circuit we apply the proximal gradient circuit of E.5 to appropriate nonlinear resistors and arrive at the following circuit. The right side of the circuit is the graph with linear resistors Rjl connecting vectors of potentials ej Rm and el Rm for every neighbors j and l in the given graph G. By Ohm s law we have v Cj = ej = xj R fj(xj). Using the KCL at ej we get Rjl + i Cj = 0. 
Applying the V-I relation for the capacitors we get d dtej = d dtxj R d (xj R fj(xj)) (xl R fl(xl)) We can show that R d dt fj(xj) < M for some M > 0, thus (xj R fj(xj)) (xl R fl(xl)) Rjl + O(MC) Applying Euler discretization with stepsize CR gives (xk j R fj(xk j )) R Rjl (xk l R fl(xk l )) + O(MCR), with gradient stepsize R and the mixing matrix l Γj R Rjl if j = l R Rjl if j = l, l Γj 0 otherwise. If we set C R, we recover the diffusion method. For the automatic discretization of the continuous-time dynamics of this circuit, we define the energy at time k as 2 ek j e j 2 2, where e j = x j R fj(x j) for all j = 1, . . . , N. F.3 Decentralized ADMM Let f1, . . . , f N : Rm R be convex functions. Then the decentralized ADMM circuit is given below. R R R R L L L Note that this circuit is similar to the one in proximal decomposition in E.8 with the difference that instead of a single net e we now have a net ejl for each edge (j, l) in graph G. Denote currents on inductors to be i Ljl Rm and i Llj Rm. Using KCL at xj we get X i Ljl + ejl xj fj(xj). (33) We initialize the circuit such that i Ljl(0)+i Llj(0) = 0 for each edge (j, l) in graph G. Now consider KCL at ejl i Ljl + i Llj = (ejl xj) Using V-I relation for inductor we also have d dti Ljl + d dti Llj = 1 L(2ejl xj xl). Combining the two equalities above we get an ODE d dt i Ljl + i Llj = R L i Ljl + i Llj . Using initial conditions the solution of an ODE is i Ljl + i Llj = 0. From (34) we conclude that ejl = 1 2(xj + xl). Using (33), we get the V-I relations for the circuit xj = prox(R/|Γj|)fj l Γj (Ri Ljl + ejl) ejl = 1 2(xj + xl) d dti Ljl = 1 L(ejl xj), for every j = 1, . . . , N and every edge (j, l) in graph G. Discretizing the V-I relations with stepsize L/R, we recover decentralized ADMM, xk+1 j = prox(R/|Γj|)fj l Γj (Rik Ljl + ek jl) ek+1 jl = 1 2(xk+1 j + xk+1 l ) i L k+1 jl = i L k jl + 1 R(ek+1 jl xk+1 j ). For the automatic discretization of the continuous-time dynamics of this circuit, we define the energy at time k as 2 ik Ljl i Ljl 2 2 + L 2 ik Llj i Llj 2 2 + γ ek jl x 2 2 where γ is a parameter that is being optimized, see G, and i Ljl is the current through inductor at equilibrium. F.4 PG-EXTRA Let f1, . . . , f N : Rm R be convex functions, and h1, . . . , h N : Rm R be convex M-smooth functions. Then the PG-EXTRA circuit is given below. Denote current on inductor going from xj to m hj fj xj R ej R xj xl to be i Ljl Rm. Recall E.1 and apply Ohm s law to get R = ej prox Rfj(ej) R = Rfj(ej) = xj ej This yields xj = ej R Rfj(ej) = prox Rfj(ej). Using KCL at xj we get R = hj(xj) + X i Ljl + xj xl Define the mixing matrix 1 P l Γj R Rjl if j = l R Rjl if j = l, l Γj 0 otherwise. Rearranging the terms in (35) we get ej = xj R hj(xj) X Ri Ljl + xj xl l=1 Wjl(xj xl) R hj(xj) X l Γj Ri Ljl l=1 Wjlxl R hj(xj) X l Γj Ri Ljl. Using the V-I relation for inductor we also have d dti Ljl = 1 Ljl (xj xl). Set Ljl = Rjl for every edge (j, l) in graph G. Define wj = P l Γj Ri Ljl, then R Ljl (xj xl) Combining the above, we get the V-I relations for the circuit xj = prox Rfj l=1 Wjlxl R hj(xj) wj d dtwj = xj for every j = 1, . . . , N and every edge (j, l) in graph G. Discretizing the above V-I relations with stepsize 1/2, and following the decentralized notation of [129, 11.3], we recover PG-EXTRA, xk+1 = prox Rf Wxk R h(xk) wk wk+1 = wk + 1 We can simplify the circuit by eliminating potentials ej as shown below. 
For the automatic discretization of the continuous-time dynamics of this circuit, we define the energy at time k as 2 ik+1 Ljl i Ljl 2 2 + j=1 γ xk j x 2 2, where γ is a parameter that is being optimized, see G, and i Ljl is the current through inductor at equilibrium. G Automatic discretization In this paper, we discretize admissible dynamic interconnects corresponding to the following decentralizes setup with graph consensus minimize x1,...,x N R m/N f1(x1) + + f N(x N) subject to xj = xl, j = 1, . . . , N, l Γj, (38) where Γj contains the neighbors of agent j in the communication graph, see F. We assume that the communication graph is connected, ensuring that all agents can communicate with each other [129, 11.2]. The static interconnect for this problem corresponds to the consensus problem minimize x1,...,x N R m/N f1(x1) + + f N(x N) subject to x1 = = x N, which is a special case of (1) where E = (I, . . . , I) Rm m/N. Therefore, we have n = m/N nets each of size N with xj Rn for all j = 1, . . . , N. This setup generalizes the setup of the classical methods discussed in E and F. For automatic discretization, we focus on dynamic interconnects that have the same RLC circuit across each net, i.e., the dynamic interconnects represented with the multi-wire notation. Runge Kutta method. The capacitor and inductor ODEs are of the form d dtx(t) = F(x(t)). We discretize ODEs using the two-stage Runge Kutta method, with coefficients α, β, and stepsize h: xk+1/2 = xk + αh F(xk) xk+1 = xk + βh F(xk) + (1 β)h F(xk+1/2). We clarify that simpler one-stage discretization schemes can also be used. We chose two-stage Runge Kutta to demonstrate that multi-stage discretization schemes are compatible with our automatic discretization methodology. Energy descent. Let a discrete-time optimization algorithm generate a sequence {(vk, ik, xk, yk)} k=1 with vk, ik Rσ (voltages across and currents through the branches of interconnect) and xk, yk Rm (potentials at terminals and currents leaving terminals). Let the subscripts R, L, and C denote the components related to resistors, inductors, and capacitors, respectively. Then the energy stored in the circuit is given by Ek = 1 2 vk C v C 2 DC + 1 2 ik L i L 2 DL. Lemma G.1. Assume f : Rm R { } is a strictly convex function and the dynamic interconnect is admissible. Let a discrete-time optimization algorithm generate a sequence {(vk, ik, xk, yk)} k=1. If there exists η > 0 such that for all k = 1, 2, . . . the energy descent Dk = Ek+1 + η xk x , yk y Ek 0 (39) holds, then xk converges to a primal solution. Proof. Suppose there exists η > 0 for which (39) holds. Then we have 0 EK+1 EK η x K x , y K y k=0 η xk x , yk y . By monotonicity of subdifferential operator f, we have xk x , yk y 0, yk f(xk). Thus rearranging the terms we get k=0 η xk x , yk y E0. Sending K to infinity, by the summability argument it follows that xk x , yk y 0. (40) Define the Lagrangian function L(x, z, y) = f(x) y T (x E z). Since x R(E ), there exists some z such that x = E z . Then for fixed z and y , function L(x, z , y ) is strictly convex. Its subgradient is given by (y y ) x L(x, z , y ) for y f(x), therefore, 0 L(x , z , y ). Together with strict convexity this implies that L(x, z , y ) achieves a unique global minimum at x with L(x , z , y ) = f(x ). By the subgradient inequality, we have xk x , yk y L(xk, z , y ) f(x ) 0. Then the condition (40) implies L(xk, z , y ) f(x ). Therefore, xk x which concludes the proof. 
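To make the role of Lemma G.1 concrete, the following is a small self-contained sketch that runs an Euler discretization of the gradient-flow circuit of E.2 on a toy quadratic and numerically monitors the descent quantity Dk = Ek+1 + η⟨xk − x⋆, yk − y⋆⟩ − Ek from (39). The Euler step, the toy function, and the particular choice η = h − h²L/(2C) are illustrative assumptions made here for this instance only; in our methodology η, h, and the remaining parameters are instead produced by the automated procedure described in the rest of this section.

import numpy as np

# Toy quadratic: f(x) = 0.5 * x^T A x, so grad f(x) = A x, x_star = 0, y_star = grad f(x_star) = 0.
A = np.diag([1.0, 0.5, 2.0])          # f is L-smooth with L = 2
grad_f = lambda x: A @ x
L, C, h = 2.0, 1.0, 0.4
eta = h - h**2 * L / (2 * C)          # ad hoc eta > 0 that makes D_k <= 0 for this instance

x, x_star, y_star = np.array([3.0, -2.0, 1.0]), np.zeros(3), np.zeros(3)
energy = lambda z: 0.5 * C * np.linalg.norm(z - x_star) ** 2   # E_k for the gradient-flow circuit (E.2)

worst = -np.inf
for k in range(50):
    y = grad_f(x)                      # y^k = grad f(x^k), the current leaving the terminals
    x_next = x - (h / C) * y           # Euler step of C dx/dt = -grad f(x)
    D_k = energy(x_next) + eta * np.dot(x - x_star, y - y_star) - energy(x)
    worst = max(worst, D_k)
    x = x_next

print(worst <= 1e-9)                   # the energy descent condition (39) holds along the whole run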
By the descent lemma G.1, the discretization is dissipative if there exist value η > 0 such that Dk = Ek+1 + η xk x , yk y Ek 0 for all k = 1, 2, . . . Since the descent Dk is defined using a one-step transition, without loss of generality, it suffices to consider k = 1. Solver dissipative term. To provide more flexibility with the Ipopt [155, 9] solver, we also incorporate dissipation from the linear resistors as in the continuous-time energy dissipation (7), i.e., we try to establish Dk = Ek+1 + η xk x , yk y + ρR ik R 2 2 Ek (41) with η > 0 and ρ 0. Also see D. If there exist values η > 0 and ρ 0 such that Dk 0 holds, then the discretization is sufficiently dissipative and Lemma G.1 applies. G.1 Dissipative discretization In this section, we fix α, β, h, η, and ρ and describe a convex optimization problem that checks whether the discretization is dissipative. We focus on problem (38). Worst-case optimization problem. To verify if the dissipativity condition Dk 0 (41) is satisfied for a given discretization, we can alternatively solve a worst-case problem. Specifically, this entails determining if the optimal value of the following optimization problem is non-positive: maximize E2 E1 + η x1 x , y1 y + ρR i1 R 2 2 subject to Es = 1 2 vs C v C 2 DC + 1 2 is L i L 2 DL, s {1, 2} (v1, i1, x1, y1) is feasible initial point (v2, i2, x2, y2) is generated by discrete optimization method from initial point f F, where f, vk, ik, xk, yk, v , i , x , y are the decision variables and F is a family of functions (e.g., L-smooth convex) that the algorithm is to be applied to. Reformulated worst-case optimization problem. Recall that we assume that the RLC circuit across each net is the same. Thus we can define an operator mat(z) that reshapes vector z Rσ into a matrix of size n σ/n, where each row contains information (voltage or current) of the electric components that belong to the same net. Define index sets IK = {1, 1.5, 2, }, IN = {1, . . . , N}, IK IN = {(k, l) | l IN, k IK}, and matrices H = h mat(v1) mat(i1) yk l (k,l) IK IN i Rn (2σ/n+|IK|N), G = HT H S2σ/n+|IK|N + , (k,l) IK IN R|IK|N, where yk l fl(xk) and f k l = fl(xk) for all l IN. Note that we have |F| = |IK|N, |G| = (2σ/n + |IK|N)2, and for |IK| = 4 this simplifies to |F| = 4N, |G| = (2σ/n + 4N)2. Recall the circuit ODEs are discretized with the two-stage Runge Kutta method, leading to the variables mat(vk), mat(ik), xk, yk that are linear combinations of columns in H. The coefficients of these linear combinations are polynomials in α, β, and h. In other words, there exist matrices vk, ik, xk, yk l such that mat(vk) = Hvk, mat(ik) = Hik, xk = Hxk, yk l = Hyk l for all k IK, l IN. Similarly, we can find f k l such that f k l = Ff k l . For fixed parameters α, β, h, η, and ρ, the problem (42) can be reformulated as maximize f1,...,f N,H E2 E1 + η x1 x , y1 y + ρR i1 R 2 2 subject to Es = 1 2 vs C v C 2 DC + 1 2 is L i L 2 DL, s {1, 2} mat(vk) = Hvk, k IK mat(ik) = Hik, k IK xk = Hxk, k IK yk l = Hyk l , k IK, l IN fl Fµl,Ml(Rn), l IN. By the interpolation lemma ([149], Theorem 2), fl Fµl,Ml(Rn) if and only if 0 f j l f i l + gj l , xi xj + 1 2Ml gi l gj l 2 2 + µl 2(1 µl/Ml) xi xj 1/Ml(gi l gj l ) 2 2, i, j IK. Therefore, we can replace infinite dimensional decision variable fl Fµl,Ml(Rn) with |IK|(|IK| 1) inequalities. Grammian formulation. Now using Grammian formulation, the problem of finding the worst-case energy difference over a given family of functions reduces to solving an SDP, similar to [150]. 
This SDP can be presented compactly as maximize G,F [F T vec(G)T ]Dp subject to [F T vec(G)T ]Slijp 0, l = 1, . . . , N, i, j IK G 0, (43) where p is a vector with dummy variables that encode the monomials of α, β, h, η, ρ, and D R(|F |+|G|) |p| and S R(|IK|N) (|F |+|G|) |p| are some matrices with constant coefficients. Dualization. Define variables for the energy descent as VD = [F T vec(G)T ]Dp, and for interpolating inequality indexed by lij as (VS)lij = [F T vec(G)T ]Slijp. Let Z and λlij for all i, j IK, l = 1, . . . , N be the dual variables for problem (43). Vertically stack λlij and (VS)lij to form vectors λ and VS respectively. The Lagrangian that generates primal problem (43) is L(G, f, Z, λ) = VD λT VS + Tr(GZ), and the dual problem is given by minimize Z,λ 0 subject to DF p λT SF p = 0 DGp λT SGp + Z = 0 Z 0 λ 0. Algebraic proof. Using a weak duality we have p d . Let Z and λ be optimal dual variables with d = 0, then for all G S(|C|+|L|)/n+|IK|N + and F R|IK|N it follows that L(G, F, Z , λ ) = F T DF p (λ )T SF p G DGp (λ )T SGp + Z Therefore, having Z and λ gives us an algebraic proof for the worst-case one step energy difference l,i,j λ lij |{z} 0 (VS)lij | {z } 0 Tr(GZ ) | {z } 0 where G 0 because G is a Gram matrix and (VS)lij 0 for all fl Fµl,Ll. G.2 Optimizing over discretizations In this section we also optimize over the parameters α, β, h, η, and ρ. G.2.1 QCQP formulation We can formulate the dual problem (44) as QCQP following [45], minimize p,λ,P 0 subject to DF p λT SF p = 0 DGp λT SGp + PP T = 0 p T Qep + a T e p = 0, e = 1, . . . , |p| P is lower triangular diag(P) 0 λ 0, where relations for dummy variables are specified using quadratic or bilinear constraints p T Qep + a T e p = 0. G.2.2 Lifted nonconvex SDP Alternatively, we can formulate the dual problem (44) as lifted nonconvex semidefinite problem with respect to a variable w = (p, λ) R|p|+|λ|. Specifically, we have minimize w,W,Z 0 subject to DF (i, :)w Tr SF (:, i, :)W = 0, i = 1, . . . , |F| DG(i, :)w Tr SG(:, i, :)W + Z(i, i) = 0, i = 1, . . . , |G| Tr Qe W + a T e w = 0 W = ww T Z 0 λ 0, where SF = 0 1 2ST F 1 2SF 0 , DF = [ DF 0 ], SG = 0 1 2ST G 1 2SG 0 , DG = [ DG 0 ], Qe = Qe 0 0 0 and ae = (ae, 0). In the above the transpose for the third order tensors SF and SG is obtained by transposing the first and third dimensions. Note that with the exception of the rank-1 constraint W = ww T , the constraints define convex sets. G.2.3 SDP relaxation To find globally optimal solutions to the nonconvex optimization problem, methods like spacial branch-and-bound require good initial bounds on the variables. Following [45], an SDP relaxation of (45) is given by minimize w,W,Z 0 subject to DF (i, :)w Tr SF (:, i, :)W = 0, i = 1, . . . , |F| DG(i, :)w Tr SG(:, i, :)W + Z(i, i) = 0, i = 1, . . . , |G| Tr Qe W + a T e w = 0 W w w T 1 Problem (46) is now a convex optimization problem, since the rank-1 constraint W = ww T has been relaxed to W ww T . This constraint in turn can be represented equivalently using the Schur complement. H Package ciropt In this section, we present a simple problem instance to demonstrate the step-by-step process of obtaining a discretized algorithm with our methodology. Optimization problem. Consider a problem minimize f(x), where f is a convex function. Determine the static interconnect. Static interconnect is determined from the optimality conditions. The optimality condition for this problem is to find an x such that 0 f(x). 
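As a quick illustration of this optimality condition (a toy check, separate from the ciropt workflow below): for a proximable f, 0 ∈ ∂f(x) is equivalent to the fixed-point relation x = prox_{λf}(x) for any λ > 0, and the prox operator is also the basic primitive used by the discretized method below. The function and value of λ here are arbitrary choices for illustration.

import numpy as np

# Toy f(x) = |x| (convex, nondifferentiable); prox_{lam * f} is soft-thresholding.
prox_f = lambda v, lam: np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
lam = 0.7
print(np.isclose(prox_f(0.0, lam), 0.0))   # True:  x = 0 is a fixed point, i.e., 0 is in the subdifferential
print(np.isclose(prox_f(2.0, lam), 2.0))   # False: x = 2 is not a minimizer of |x|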
The corresponding static interconnect for this condition is provided below.

Admissible dynamic interconnect. An admissible dynamic interconnect with RLC components relaxes to the static interconnect in equilibrium. The following provides an example of such a dynamic interconnect. The V-I relations for the circuit and the convergent discretized method found by our methodology are displayed below.

V-I relations:
x = prox_{(R/2)f}(z)
d/dt e2 = -(1/(2CR)) (R y + 3 e2)
d/dt z = -(1/(4CR)) (5 R y + 3 e2)

Discretized method:
x^k = prox_{(R/2)f}(z^k)
e2^{k+1} = e2^k - (h/(2CR)) (R y^k + 3 e2^k)
z^{k+1} = z^k - (h/(4CR)) (5 R y^k + 3 e2^k).

Automatic discretization. Now we find discretization parameters for this dynamic interconnect that guarantee convergence of the algorithm, using the ciropt package.

Step 1. Define a problem.

import numpy as np
import ciropt as co
problem = co.CircuitOpt()

Step 2. Define the function class; in this example f is convex and nondifferentiable, i.e., µ = 0 and M = ∞.

f = co.def_function(problem, mu=0, M=np.inf)

Step 3. Define the optimal points.

x_star, y_star, f_star = f.stationary_point(return_gradient_and_function_value=True)

Step 4. Define values for the RLC components and discretization parameters; here for simplicity we take α = 0 and β = 1.

R, C = 1, 10
h, eta = problem.h, problem.eta

Step 5. Define the one-step transition in the discretized V-I relations.

z_1 = problem.set_initial_point()
e2_1 = problem.set_initial_point()
x_1 = co.proximal_step(z_1, f, R/2)[0]
y_1 = (2 / R) * (z_1 - x_1)
e1_1 = (e2_1 - R * y_1) / 2
v_C1_1 = e2_1 / 2 - z_1
v_C2_1 = e2_1
e2_2 = e2_1 - h / (2 * R * C) * (R * y_1 + 3 * e2_1)
z_2 = z_1 - h / (4 * R * C) * (5 * R * y_1 + 3 * e2_1)
x_2 = co.proximal_step(z_2, f, R/2)[0]
y_2 = (2 / R) * (z_2 - x_2)
v_C1_2 = e2_2 / 2 - z_2
v_C2_2 = e2_2

Step 6. Define the dissipative term E2 − E1 + η⟨x1 − x⋆, y1 − y⋆⟩ and solve the final problem.

E_1 = (C/2) * (v_C1_1 + x_star)**2 + (C/2) * (v_C2_1)**2
E_2 = (C/2) * (v_C1_2 + x_star)**2 + (C/2) * (v_C2_2)**2
Delta_1 = eta * (x_1 - x_star) * (y_1 - y_star)
problem.set_performance_metric(E_2 - (E_1 - Delta_1))
params = problem.solve()[:1]

This gives the discretization parameters b = 6.66, h = 6.66. The resulting provably convergent algorithm is

x^k = prox_{(1/2)f}(z^k)
y^k = 2 (z^k - x^k)
w^{k+1} = w^k - 0.33 (y^k + 3 w^k)
z^{k+1} = z^k - 0.16 (5 y^k + 3 w^k).

New algorithm. Solve your problem using the new algorithm. Consider the Huber penalty function ϕ : R → R,
ϕ(x) = x^2 if |x| ≤ 1, and ϕ(x) = 2|x| − 1 if |x| > 1.
We consider the primal problem
minimize f(x) = Σ_i ϕ(x_i − c_i) subject to Ax = b,
where A ∈ Rm×n, b ∈ Rm, and c ∈ Rn, and solve the dual problem
maximize g(y) = −f*(−A⊤y) − b⊤y.
We apply our algorithm to solve the dual problem. Note that the proximal operator prox_{αg}(y) can be evaluated as
x⁺ = argmin_x { f(x) + (α/2) ||Ax − b||² + y⊤(Ax − b) }, y⁺ = y + α(Ax⁺ − b).
Since f is CCP and 2-smooth (as a Huber loss), f* is (1/2)-strongly convex. We take m = 30, n = 100 and sample the entries of A, c, and b i.i.d. from a Gaussian distribution. Finally, we rescale the entries of A by λ_min(AA⊤) so that g is (1/2)-strongly convex. Figure 10 presents the results of the algorithm applied to a random problem instance.

Figure 10: Relative error |f(x) − f⋆|/|f⋆| across iterations when applying the new algorithm.

I Numerical experiments

I.1 Decentralized ADMM+C

Consider a decentralized optimization problem
minimize_{x ∈ Rm} Σ_{i=1}^N f_i(x),
where f1, . . . , f N are CCP. Suppose furthermore we know some of the functions are strongly convex, that is, suppose there is a subset S ⊆ {1, 2, . . . , N} such that f_j is strongly convex for j ∈ S.
We wish to find an efficient algorithm that fully exploits the additional information for fj s. To solve the problem in a decentralized manner, define a primal variable xi Rm for each agent function fi. To leverage the strong convexity of fj for each j S, we could consider implementing a specialized update rule for xj that is more effective for strongly convex functions. We consider a modification of the DADMM circuit in F.3. Recall from 3.1, a circuit with a capacitor and inductor corresponds to a method with momentum. It is known [124] that momentum accelerates the convergence of methods for strongly convex functions. Therefore, we propose to attach capacitors to the circuit in F.3, on the nets that are directly related to xj s in j S. We anticipate that the method derived by discretization of a new circuit (using our automatic discretization methodology) will outperform the DADMM. Consider a modified decentralized geometric median problem from [138]. Suppose each agent i {1, . . . , N} holds vector bi Rm, and consider the minimization problem minimize x R m P i S x bi 2 + x bi 2 2 + P i/ S x bi 2. (47) The minimization subproblem has an explicit solution, i.e., proxρfi(z) = bi bi z bi z 2 ( bi z 2 ρ)+, ( z i / S 1 1+2ρ(z + 2ρbi) i S, ρ = ( ρ i / S ρ 1+2ρ i S. We set m = 100, N = 6, S = {4, 5}, and sample vectors bi R100 from the uniform distribution over [ 100, 100]100. We use graph G provided in Figure 9. We initialize iterates to x0 i = bi for all i. We use a modified DADMM circuit F.3 for the graph in Figure 9. This modified version includes an extra capacitor connected at e45, to which we refer as DADMM+C. Note when N = 1, the DADMM+C circuit corresponds to the Nesterov acceleration circuit 3.1. Using KCL at e45, we have i L45 + i L54 = (e45 x4) Thus for {j, l} = {4, 5} the update rule of ejl is given by (34), while for e45 we get d dte45 = 1 i L45 + i L54 + 1 R (2e45 x4 x5) . Other V-I relations remain unchanged as in F.3. The resulting algorithm becomes xk+1 j = prox(R/|Γj|)fj l Γj (Rik Ljl + ek jl) ek+1 jl = ek 45 h CR R(ik L45 + ik L54) + 2ek 45 xk+1 4 xk+1 5 {j, l} = {4, 5} 1 2(xk+1 j + xk+1 l ) otherwise i L k+1 jl = i L k jl + h L(ek+1 jl xk+1 j ). We consider the circuit with R = 0.8, L = 2 and C = 15. To discretize the circuit, we take advantage of the fact that the strong convexity of fi is 2 for i S (47). Specifically, we apply our automatic discretization methodology to convex functions, setting µ = 0 for fi with i / S and µ = 2 for fi with i S, and using smoothness M = 100. The sufficiently dissipative parameters we find are η = 3.70, h = 3.52, ρ = 0, α = 0, β = 1, γ = 4.48. We compare DADMM+C with DADMM and P-EXTRA. Based on grid search, we set R = 0.6 for DADMM in F.3, and R = 1 and h1 = = h N = 0 for PG-EXTRA in F.4 to get P-EXTRA. Note that the parameters of the proximal operators for DADMM are scaled by 1/|Γj|, in contrast to P-EXTRA, where |Γj| is generally not equal to 1. We use Metropolis mixing matrix for P-EXTRA, 1 max{|Γi|,|Γj|}+1 if i Γj 1 P j Γj Wij if i = j 0 otherwise. The numerical results are illustrated in Figure 11. The relative error for DADMM+C decreases to 10 10 in 66 iterations, for DADMM in 87 iterations and for P-EXTRA in 294 iterations. 0 20 40 60 80 100 Iteration count k |f(xk) f |/f P-EXTRA DADMM Circuit DADMM+C Figure 11: (Left) Underlying graph G. (Right) Relative error f(xk) f /f vs. k. I.1.1 Convergence proof of decentralized ADMM+C We first review the meaning of the numerical values in the previous section. 
I.1.1 Convergence proof of decentralized ADMM+C

We first review the meaning of the numerical values in the previous section. Suppose $(x^\star, y^\star)$ is a primal-dual solution pair. Then by Theorem C.2, there is $(v^\star, i^\star) \in D_{x^\star, y^\star}$ that satisfies $(v^\star, i^\star) = ((0, 0, v_C^\star), (0, i_L^\star, 0))$. The numerical values imply that, for the energy function $E^k$ consisting of the inductor terms $\tfrac{L}{2}\|i_{Ljl}^k - i_{Ljl}^\star\|^2$ and the capacitor term at $e_{45}$, there exists $\tau > 0$ for which the following inequality is true:
\[
E^{k+1} \le E^k - \tau \sum_{j,\, l \in \Gamma_j} \big\| e_{jl}^{k+1} - x_j^{k+1} \big\|^2 - h\,\big\langle x^{k+1} - x^\star,\, y^{k+1} - y^\star \big\rangle.
\]

Proof. For notational simplicity, define
\[
y_{jl}^{k+1} = i_{Ljl}^k + \frac{1}{R}\big( e_{jl}^k - x_j^{k+1} \big).
\]
Note that from the first line of the algorithm we have
\[
x_j^{k+1} + \frac{R}{|\Gamma_j|}\, y_j^{k+1} = \frac{1}{|\Gamma_j|} \sum_{l \in \Gamma_j} \big( R\, i_{Ljl}^k + e_{jl}^k \big),
\]
where $y_j^{k+1} \in \partial f_j(x_j^{k+1})$, and therefore $y_j^{k+1} = \sum_{l \in \Gamma_j} y_{jl}^{k+1}$.

(i) Difference of $\sum \tfrac{L}{2}\|i_{Ljl}^k - i_{Ljl}^\star\|^2$. Using $i_{Ljl}^{k+1} = i_{Ljl}^k + \tfrac{h}{L}(e_{jl}^{k+1} - x_j^{k+1})$,
\[
\sum \tfrac{L}{2}\big\|i_{Ljl}^{k+1} - i_{Ljl}^\star\big\|^2 - \sum \tfrac{L}{2}\big\|i_{Ljl}^{k} - i_{Ljl}^\star\big\|^2
= \sum h\,\big\langle e_{jl}^{k+1} - x_j^{k+1},\, i_{Ljl}^k - i_{Ljl}^\star \big\rangle + \sum \tfrac{h^2}{2L}\,\big\| e_{jl}^{k+1} - x_j^{k+1} \big\|^2.
\]
Substituting $i_{Ljl}^k = y_{jl}^{k+1} - \tfrac{1}{R}(e_{jl}^k - x_j^{k+1})$ gives
\[
\sum \Big( \tfrac{h^2}{2L} - \tfrac{h}{R} \Big) \big\| e_{jl}^{k+1} - x_j^{k+1} \big\|^2
- \tfrac{h}{R} \sum \big\langle e_{jl}^{k+1} - x_j^{k+1},\, e_{jl}^k - e_{jl}^{k+1} \big\rangle
+ h \sum \big\langle e_{jl}^{k+1} - x_j^{k+1},\, y_{jl}^{k+1} - i_{Ljl}^\star \big\rangle,
\]
and splitting $e_{jl}^{k+1} - x_j^{k+1} = (e_{jl}^{k+1} - x_j^\star) - (x_j^{k+1} - x_j^\star)$ in the last sum yields the terms
\[
h \sum \big\langle e_{jl}^{k+1} - x_j^\star,\, y_{jl}^{k+1} - i_{Ljl}^\star \big\rangle
- h \sum \big\langle x_j^{k+1} - x_j^\star,\, y_{jl}^{k+1} - i_{Ljl}^\star \big\rangle.
\]
On the other hand, from Theorem C.2 we know $i_R^\star = 0$, so by KCL at $x_j$ we have $y_j^\star = \sum_{l \in \Gamma_j} i_{Ljl}^\star$. Therefore
\[
\sum_{l \in \Gamma_j} \big\langle x_j^{k+1} - x_j^\star,\, y_{jl}^{k+1} - i_{Ljl}^\star \big\rangle
= \big\langle x_j^{k+1} - x_j^\star,\, y_j^{k+1} - y_j^\star \big\rangle.
\]
Moreover, $i_{Ljl}^\star = -i_{Llj}^\star$ and $e_{jl}^{k+1} = e_{lj}^{k+1}$ hold by definition, and $x_j^\star = x_l^\star$ since $x^\star$ is a solution. Therefore
\[
\sum \big\langle e_{jl}^{k+1} - x_j^\star,\, i_{Ljl}^\star \big\rangle = 0.
\]
Combining these identities with the analogous computation for the capacitor term at $e_{45}$, and choosing $\tau > 0$ so that the resulting coefficient conditions (involving $L$, $CR$, and $h$) hold, we conclude the desired inequality.

I.2 PG-EXTRA + Parallel C

In this section, we introduce an additional pipeline for designing new optimization algorithms via circuits. The previous automated discretization pipeline has the advantage of guaranteeing convergence; however, it may produce a conservative step size, since it accounts for all worst-case scenarios. As a result, it may rule out efficient step sizes that work well for a particular optimization problem in practice. Our circuit-based approach has the advantage of quickly producing a variant of a prior method that is likely to converge and may work better, based on physical intuition. Furthermore, the variant method provides greater freedom in selecting parameters to tune. We provide an example of a new optimization method obtained by exploiting these advantages that outperforms PG-EXTRA on the problem considered in the paper that introduced PG-EXTRA [138].

We use a modified PG-EXTRA circuit of F.4 that includes extra capacitors connected in parallel with the inductors. (Circuit schematic: per-edge components $R_{jl}$, $C_{jl}$, $L_{jl}$; at each node $x_j$, a resistor $R$ together with the functions $f_j$ and $h_j$.) Recalling (6), we write down the energy for this circuit and derive the following method:
\[
x^{k+1} = \mathrm{prox}_{Rf}\big( Wx^k - R\,\nabla h(x^k) - w^k - u^k \big), \qquad
w^{k+1} = w^k + s\,(I - W)\,x^k, \qquad
u^{k+1} = \tilde{s}\,(I - W)\,\big( x^{k+1} - x^k \big). \tag{48}
\]

We now consider the decentralized quadratic programming problem from [138]. Suppose each agent $j \in \{1, \ldots, N\}$ holds a symmetric positive semidefinite matrix $Q_j \in \mathbb{R}^{m \times m}$, vectors $a_j, p_j \in \mathbb{R}^m$, and a scalar $b_j \in \mathbb{R}$. Consider the minimization problem
\[
\text{minimize}_{x \in \mathbb{R}^m} \ \frac{1}{N} \sum_{j=1}^N \big( x^\top Q_j x + p_j^\top x \big)
\quad \text{subject to } a_j^\top x \le b_j, \ j = 1, \ldots, N.
\]
Set $f_j(x) = \delta_{\{z \,:\, a_j^\top z \le b_j\}}(x)$ and $h_j(x) = x^\top Q_j x + p_j^\top x$, where $\delta_{\{z \,:\, a_j^\top z \le b_j\}}$ is the indicator function, equal to 0 if $a_j^\top x \le b_j$ and $\infty$ otherwise. Then the given optimization problem recasts to
\[
\text{minimize}_{x \in \mathbb{R}^m} \ \frac{1}{N} \sum_{j=1}^N \big( f_j(x) + h_j(x) \big).
\]
Since the minimization subproblem has an explicit solution,
\[
\mathrm{prox}_{Rf_j}(z) = \begin{cases}
z & \text{if } a_j^\top z \le b_j \\
z + \dfrac{b_j - a_j^\top z}{\|a_j\|_2^2}\, a_j & \text{otherwise},
\end{cases}
\]
this problem can be solved by PG-EXTRA.
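A minimal sketch of this proximal operator, i.e., the Euclidean projection onto the halfspace $\{z : a_j^\top z \le b_j\}$ (variable names are illustrative):

import numpy as np

def prox_halfspace(z, a, b):
    # Projection onto {z : a^T z <= b}; this is prox of the indicator for any step parameter.
    viol = a @ z - b
    if viol <= 0:
        return z
    return z - (viol / (a @ a)) * a

Because $f_j$ is an indicator function, the proximal operator is the same projection regardless of the value of $R$.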
We follow the same setting as [138]. We set $m = 50$. Each $Q_j$ is generated by taking the product of $\tilde{Q}_j$ and its transpose, where $\tilde{Q}_j \in \mathbb{R}^{m \times m}$ is a matrix with i.i.d. Gaussian entries. Each $p_j$ is also generated with i.i.d. Gaussian entries. The vectors $a_j$ and scalars $b_j$ are randomly generated as well; however, we conduct the experiment for the case in which the solution of the constrained problem differs from that of the unconstrained problem. We use the Metropolis mixing matrix as in I.1.

The numerical results are illustrated in Figure 12. We compare PG-EXTRA and the variant method (48) obtained from the modified circuit with additional parallel capacitors. We use $R = 0.05$ and $R = 0.07$ for PG-EXTRA, and $R = 0.07$, $C = 0.3$, and $s = 0.8$ for (48). The parameters for PG-EXTRA were obtained through a grid search. The parameters for (48) were hand-optimized starting from $C = 0$ and $s = 0.5$, the parameter selection that makes (48) coincide with PG-EXTRA when $u^0 = 0$. The relative error for (48) decreases to $10^{-8}$ in 147 iterations, while PG-EXTRA with $R = 0.05$ requires 214 iterations.

Figure 12: (Left) Underlying graph $G$. (Right) Relative error $|f(x^k) - f^\star|/f^\star$ vs. iteration count $k$ for PG-EXTRA ($R = 0.05$ and $R = 0.07$) and the circuit-derived PG-EXTRA + Parallel C method (48).
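For concreteness, the following is a minimal sketch of iteration (48) as written above, applied to the halfspace-constrained quadratic program. The Metropolis-weight construction, the problem-data shapes, and the coefficient s_tilde used in the u-update are illustrative assumptions, not the authors' reference implementation.

import numpy as np

def metropolis_weights(adj):
    # adj: list of neighbor lists; Metropolis mixing matrix as in I.1
    N = len(adj)
    W = np.zeros((N, N))
    for i in range(N):
        for j in adj[i]:
            W[i, j] = 1.0 / (max(len(adj[i]), len(adj[j])) + 1)
        W[i, i] = 1.0 - W[i].sum()
    return W

def pg_extra_parallel_c(Q, p, a, b, adj, R=0.07, s=0.8, s_tilde=0.3, iters=1000):
    # One row of x per agent; W mixes rows. Shapes: Q (N, m, m), p and a (N, m), b (N,).
    N, m = p.shape
    W = metropolis_weights(adj)
    x = np.zeros((N, m)); w = np.zeros((N, m)); u = np.zeros((N, m))
    for _ in range(iters):
        grad = np.stack([2.0 * Q[j] @ x[j] + p[j] for j in range(N)])    # gradient of h_j
        v = W @ x - R * grad - w - u
        x_new = np.empty_like(x)
        for j in range(N):
            viol = a[j] @ v[j] - b[j]                                    # projection onto a_j^T x <= b_j
            x_new[j] = v[j] if viol <= 0 else v[j] - (viol / (a[j] @ a[j])) * a[j]
        w = w + s * (x - W @ x)                                          # w^{k+1} = w^k + s (I - W) x^k
        u = s_tilde * ((x_new - x) - W @ (x_new - x))                    # assumed u-update: s~ (I - W)(x^{k+1} - x^k)
        x = x_new
    return x

Setting s_tilde = 0 with u^0 = 0 keeps u identically zero, consistent with the reduction of (48) to PG-EXTRA described above.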
NeurIPS Paper Checklist

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

Justification: We provide the methodology to design an appropriate electric circuit whose continuous-time dynamics converge to the solution of the optimization problem in Sections 2 and 3, and introduce the automated, computer-assisted discretization methodology in Sections 4 and 5.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [NA]

Justification: This work proposes a new methodology for optimization algorithm design. The theoretical nature of the work makes the discussion of such limitations not strongly relevant.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: Our paper provides the full set of assumptions and complete proofs for all theoretical results, with theorems and lemmas properly numbered and cross-referenced.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: We provide the code and thoroughly describe the details to ensure reproducibility of the experimental results.
5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: We provide our code and the instructions for using it in the README.md file.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: We provide the necessary details and code to understand and reproduce the experiments.
7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [NA]

Justification: Our work is theoretical and includes non-statistical simple experiments with synthetic data. Therefore, the notion of statistical significance is not applicable to our work.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: We use minimal CPU computation for toy experiments.

9. Code of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

Answer: [Yes]

Justification: We have reviewed the NeurIPS Code of Ethics and our paper conforms to it in every respect.
10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: We conduct theoretical research on convex optimization and do not anticipate any negative societal impacts from our results.

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: Our theoretical research on convex optimization does not involve releasing data or models, hence no safeguards are needed.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models) used in the paper properly credited, and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]

Justification: We explicitly mention the solvers and the packages we use in the paper and the code.

13. New Assets

Question: Are new assets introduced in the paper well documented, and is the documentation provided alongside the assets?

Answer: [NA]

Justification: This paper does not release new datasets or assets.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: This paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: This paper does not involve crowdsourcing nor research with human subjects.