# Symmetric Pruning in Quantum Neural Networks

Xinbiao Wang1,2, Junyu Liu3,4,5,6, Tongliang Liu7, Yong Luo1,8,9, Yuxuan Du2 (corresponding author), Dacheng Tao2

1Institute of Artificial Intelligence, School of Computer Science, Wuhan University, China; 2JD Explore Academy; 3Pritzker School of Molecular Engineering, The University of Chicago; 4Chicago Quantum Exchange; 5Kadanoff Center for Theoretical Physics; 6qBraid Co.; 7Sydney AI Centre, The University of Sydney; 8National Engineering Research Center for Multimedia Software, School of Computer Science, Institute of Artificial Intelligence and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China; 9Hubei Luojia Laboratory, Wuhan, China

Abstract. Many fundamental properties of a quantum system are captured by its Hamiltonian and ground state. Despite this significance, ground state preparation (GSP) is classically intractable for most large-scale Hamiltonians. Quantum neural networks (QNNs), which exploit the power of modern quantum machines, have emerged as a leading protocol to conquer this issue; enhancing the performance of QNNs is therefore central to GSP. Empirical evidence shows that QNNs with handcrafted symmetric ansätze generally enjoy better trainability than those with asymmetric ansätze, while theoretical explanations remain vague. To fill this knowledge gap, here we propose the effective quantum neural tangent kernel (EQNTK) and connect this concept with over-parameterization theory to quantify the convergence of QNNs towards the global optima. We uncover that the advantage of symmetric ansätze is attributed to their large EQNTK value with a low effective dimension, which requires few parameters and little circuit depth to reach the over-parameterization regime permitting a benign loss landscape and fast convergence. Guided by EQNTK, we further devise a symmetric pruning (SP) scheme to automatically tailor a symmetric ansatz from an over-parameterized and asymmetric one, greatly improving the performance of QNNs when the explicit symmetry information of the Hamiltonian is unavailable. Extensive numerical simulations are conducted to validate the analytical results of EQNTK and the effectiveness of SP.

1 Introduction

The laws of quantum mechanics dictate that any quantum system can be described by a Hamiltonian, and many important physical properties are reflected by its ground state. For this reason, the ground state preparation (GSP) of Hamiltonians is the key to understanding and fabricating novel quantum matter. Due to the intrinsic hardness of GSP (Poulin & Wocjan, 2009; Carleo et al., 2019), the computational resources required by classical methods become unaffordable when the size of the Hamiltonian grows large. Quantum computers, whose operations can harness the strength of quantum mechanics, promise to tackle this problem with potential computational merits. In the noisy intermediate-scale quantum (NISQ) era (Preskill, 2018), quantum neural networks (QNNs) (Farhi & Neven, 2018; Cong et al., 2019; Cerezo et al., 2021a) are leading candidates toward this goal. The building blocks of QNNs, analogous to deep neural networks, consist of variational ansätze (also called parameterized quantum circuits) and classical optimizers.
In order to enhance the power of QNNs in GSP, great efforts have been made to design advanced ansätze with varied circuit structures (Peruzzo et al., 2014; Wecker et al., 2015; Kandala et al., 2017). Despite these achievements, recent progress has shown that QNNs may suffer from severe trainability issues when the circuit depth of the ansatz is either shallow or deep. Namely, for deep ansätze, the magnitude of the gradients exponentially decays with the increased system size (McClean et al., 2018; Cerezo et al., 2021b). This phenomenon, dubbed the barren plateau, hints at the difficulty of optimizing deep QNNs, where an exponential runtime is necessitated for convergence. A common strategy to alleviate barren plateaus is to employ shallow ansätze to accomplish learning tasks (Grant et al., 2019; Skolik et al., 2021; Zhang et al., 2020; Pesah et al., 2021), while the price to pay is another serious trainability issue related to convergence (Boyd & Vandenberghe, 2004; Du et al., 2021): the trainable parameters may get stuck in sub-optimal local minima or saddle points with high probability because of the unfavorable loss landscape (Anschuetz, 2021; Anschuetz & Kiani, 2022). Orthogonal to these negative results, several studies pointed out that when the depth of the ansatz becomes overwhelmingly deep and surpasses a critical point, the over-parameterized QNN embraces a benign landscape and permits fast convergence towards good local minima (Kiani et al., 2020; Wiersema et al., 2020; Larocca et al., 2021b). Nevertheless, the criterion for reaching such a critical point is stringent, i.e., the number of parameterized gates or the circuit depth scales exponentially with the problem size, which hinders the application of over-parameterized QNNs in practice.

Figure 1: The critical point of the over-parameterized regime (training error versus the number of parameterized gates). When the number of parameters exceeds the critical point (the red circle), the training error exponentially converges to a nearly global minimum; below it, convergence is slow and may terminate in a local minimum. Symmetric ansätze (the blue curve) require fewer parameters to reach the critical point than asymmetric ansätze.

Empirical evidence sheds new light on exploiting over-parameterized QNNs to tackle GSP. QNNs with symmetric ansätze only demand a number of trainable parameters and a circuit depth that scale polynomially with the problem size to reach the over-parameterized region and achieve a fast convergence rate (Herasymenko & O'Brien, 2021; Gard et al., 2020; Zheng et al., 2021; 2022; Shaydulin & Wild, 2021; Mernyei et al., 2022; Marvian, 2022; Meyer et al., 2022; Larocca et al., 2022; Sauvage et al., 2022). A common feature of these symmetric ansätze is that they capitalize on the symmetry properties of the problem Hamiltonian to shrink the solution space and facilitate seeking near-optimal solutions. Unfortunately, current symmetric ansätze are inapplicable to a broad class of Hamiltonians whose symmetry is implicit, since their constructions rely on explicit information about the symmetry of the Hamiltonian. Besides, it is unknown whether the symmetry contributes to lowering the critical point to reach the over-parameterization regime. Here we fill the above knowledge gap from both theoretical and practical aspects.
Concretely, we develop a novel notion, the effective quantum neural tangent kernel (EQNTK), to capture the training dynamics of various ansätze via their effective dimension. In doing so, we show that, compared with asymmetric ansätze, symmetric ansätze possess dramatically lower effective dimensions, and the number of parameters and circuit depth required to reach over-parameterization may scale polynomially with the problem size (see Fig. 1 for an intuition). By leveraging EQNTK, we next prove that when the condition of over-parameterization is satisfied, the trainable parameters of QNNs with symmetric ansätze exponentially converge to the global optima with the increased number of iterations. Taken together, our analysis recognizes that over-parameterized QNNs with symmetric ansätze are a possible solution toward large-scale GSP tasks. Inspired by EQNTK and pruning techniques in deep neural networks (Han et al., 2015; Blalock et al., 2020; Frankle et al., 2020; Wang et al., 2022), we further devise a symmetric pruning (SP) scheme to automatically tailor a symmetric ansatz from an over-parameterized and asymmetric one with enhanced trainability and applicability. Conceptually, SP continuously eliminates redundant quantum gates from the given asymmetric ansatz and correlates parameters to assign different types of symmetries to the slimmed ansatz. In this way, SP generates a modest-depth symmetric ansatz with a fast convergence guarantee and thus improves the hardware efficiency. Extensive simulations on many-body physics and combinatorial problems validate the theory of EQNTK and the efficacy of SP. These results deepen our understanding of how to merge symmetry with over-parameterization theory and indicate the significance of designing symmetric ansätze.

Contributions. We summarize our main contributions as follows:

1. We propose the notion of EQNTK to quantify the training dynamics of QNNs with symmetric ansätze, which reconciles the QNTK theory with the symmetry of the problem Hamiltonian (see Sec. 2.2). As shown in Fig. 1, since the training dynamics of symmetric and asymmetric ansätze are evidently disparate, our results provide a deeper understanding of QNNs with symmetric ansätze, especially for unraveling how the structure information affects the convergence rate.

2. Our key technical contribution is achieving a tighter convergence bound for QNNs with various symmetric ansätze (see Theorem 2 and Lemma 1). Particularly, our bound yields γ = O(poly(LK, d_eff^{-1})), where LK is the number of parameters and d_eff is the effective dimension. The comparison with prior results is summarized in Table 1. Our results not only greatly reduce the threshold for reaching over-parameterization but also promise an improved convergence rate. These two conclusions are indispensable in applying over-parameterized QNNs to solve practical problems.

Table 1: A comparison of the convergence rate for over-parameterized QNNs. The labels C and T refer to the critical point and the ε-convergence rate, respectively. The label % denotes that the paper did not study that regime. Note that the results in Ref. (Larocca et al., 2021b) do not exhibit how the problem Hamiltonian affects d_eff.

|   | Larocca et al. (2021b) | Anschuetz (2021) | You et al. (2022) | Liu et al. (2022b) | Our results |
|---|---|---|---|---|---|
| C | Ω(poly(d_eff)) | Ω(exp(n)) | O(poly(d_eff)) | O(exp(n)) | O(poly(d_eff)) |
| T | % | % | O(log(d_eff) log(1/ε)) | O(4^n log(1/ε)/(LK)) | O(d_eff^2 log(1/ε)/(LK)) |
3. Our last contribution is devising SP, an automatic scheme to identify the implicit symmetries of the problem Hamiltonian and utilize them to design a symmetric ansatz (see Section 3). An attractive feature of SP is its flexibility: any heuristic that captures certain symmetries of the problem Hamiltonian can be seamlessly embedded into SP to further boost its performance.

2 Effective QNTK allows an improved convergence of QNNs

Here we establish the foundations of why symmetric ansätze are able to enhance the trainability of QNNs in ground state preparation (GSP) tasks. To do so, we propose a novel concept, the effective quantum neural tangent kernel (EQNTK), to reconcile the QNTK theory with the symmetry of the problem Hamiltonian. With EQNTK, we uncover that the advantage of symmetric ansätze originates from their ability to dramatically decrease the over-parameterization threshold. For clarity, we first introduce the necessary background in Sec. 2.1 and then present our main theoretical results in Sec. 2.2.

2.1 Problem setup

Ground state preparation. Given an n-qubit Hamiltonian H ∈ C^{2^n × 2^n}, GSP aims to find the eigenvector |ψ*⟩ ∈ C^{2^n} (i.e., the ground state) of H corresponding to its minimum eigenvalue. For any n-qubit state |ψ⟩, the variational principle ensures ⟨ψ*|H|ψ*⟩ ≤ ⟨ψ|H|ψ⟩, with equality iff |ψ⟩ = |ψ*⟩. Since the dimension of |ψ*⟩ scales exponentially with n, GSP is classically intractable for large n.

Quantum neural networks. A QNN can be described by a triplet (|ψ0⟩, U(θ), H). When it is applied to solve GSP, an ansatz U(θ) (i.e., a parameterized unitary) prepares a variational state |ψ(θ)⟩ = U(θ)|ψ0⟩ with a fixed input state |ψ0⟩. The parameters θ are optimized by minimizing the loss function

L(θ) = (1/2) (⟨ψ0| U(θ)† H U(θ) |ψ0⟩ − E0)²,   (1)

where E0 = ⟨ψ*|H|ψ*⟩ refers to the ground state energy of H. The optimization proceeds iteratively: the classical optimizer continuously leverages the output of the quantum circuit to update θ via the rule θ(t+1) = θ(t) − η ∂L(θ(t))/∂θ, where η refers to the learning rate. See Appendix A for details.

Remark. We adopt E0 to facilitate the convergence analysis; our results cover general loss functions in which E0 is replaced by any C ∈ R with C ≤ E0. See Appendix B for details.

Constructions of (symmetric) ansätze. The power of QNNs depends on the employed ansatz U(θ). A general form of U(θ), covering many typical ansätze such as the Hamiltonian variational ansatz (HVA) and the hardware-efficient ansatz (HEA) (Bharti et al., 2022; Qian et al., 2021), is

U(θ) = ∏_{ℓ=1}^{L} Uℓ(θℓ),   Uℓ(θℓ) = ∏_{k=1}^{K} e^{−i G_k θ_{ℓk}},   (2)

where L refers to the layer number, θ = (θ1, …, θL) ∈ Θ ⊆ R^{LK} are the trainable parameters living in the parameter space Θ, θℓ = (θℓ1, …, θℓK) are the trainable parameters at the ℓ-th layer, and A = {G1, …, GK} is a set of Hermitian traceless operators called an ansatz design. Given Θ and A, the set of ansätze forms a subgroup of SU(2^n), U_A = ∪_{L≥0} {U(θ) : θ ∈ Θ}. The differences among ansätze originate from the varied Θ and A. Given a Hermitian matrix Σ, the ansatz U(θ) is said to be symmetric with respect to Σ if every element in U_A commutes with Σ. Mathematically, denote Σ = Σ_{j=1}^{p} Σ_{k=1}^{s_j} λ_j |v_{jk}⟩⟨v_{jk}|, where λ_j is the j-th eigenvalue with λ_i ≠ λ_j for i ≠ j, |v_{jk}⟩ is the corresponding eigenvector, and Σ_{j=1}^{p} s_j = 2^n. The explicit form of Σ leads to a direct-sum decomposition H = ⊕_{j=1}^{p} V_j of the quantum state space, where V_j is the invariant subspace spanned by the eigenvectors {|v_{j1}⟩, …, |v_{j s_j}⟩}.
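To make the above setup concrete, the following minimal Python sketch (our illustration, not the authors' code) builds the layered ansatz of Eqn. (2) and evaluates the loss of Eqn. (1) with dense numpy matrices, so it only applies to small n; the helper names `embed`, `ansatz`, and `loss` are ours.

```python
import numpy as np
from functools import reduce
from scipy.linalg import expm

# Single-qubit Pauli matrices used to build the generators G_k in Eqn. (2).
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def embed(op, j, n):
    """n-qubit operator acting with `op` on qubit j and identity on the other wires."""
    return reduce(np.kron, [op if q == j else I2 for q in range(n)])

def ansatz(thetas, generators):
    """U(theta) = prod_l prod_k exp(-i * G_k * theta_{lk}); `thetas` has shape (L, K)."""
    dim = generators[0].shape[0]
    U = np.eye(dim, dtype=complex)
    for layer in thetas:                      # layer l = 1, ..., L
        for G, t in zip(generators, layer):   # gate  k = 1, ..., K
            U = expm(-1j * t * G) @ U
    return U

def loss(thetas, generators, H, psi0, E0):
    """L(theta) = 0.5 * (<psi0| U^dag H U |psi0> - E0)^2, as in Eqn. (1)."""
    psi = ansatz(thetas, generators) @ psi0
    return 0.5 * (np.real(np.vdot(psi, H @ psi)) - E0) ** 2
```

For example, `generators = [embed(Z, 0, 2) @ embed(Z, 1, 2), embed(X, 0, 2)]` together with `psi0 = np.eye(4)[:, 0]` gives a minimal two-qubit instance on which `loss` can be evaluated.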
Convergence of QNNs. A crucial metric to assess the performance of different QNNs is the ε-convergence rate towards the global minimum L(θ*) with θ* = arg min_{θ∈Θ} L(θ).

Definition 1 (ε-convergence). A QNN instance (|ψ0⟩, U(θ), H) achieves an ε-convergence if the trained parameters after T iterations, θ(T), satisfy L(θ(T)) ≤ ε with ε ∈ R.

This quantity measures the distance between the estimated and the optimal loss values, which can be derived via the quantum neural tangent kernel (QNTK)

Q(t) = ∇ε_t^⊤ ∇ε_t = Σ_{ℓ,k} (∂ε_t/∂θ_{ℓk})²,   (3)

where ε_t = ⟨ψ0| U(θ(t))† H U(θ(t)) |ψ0⟩ − E0 denotes the residual training error and ∇ε_t is the gradient of ε_t with respect to θ. The following theorem describes the ε-convergence of an over-parameterized QNN with an arbitrary ansatz.

Theorem 1 (Liu et al. (2022b)). Following the notations in Eqns. (1)-(3), when U(θ) matches the Haar distribution up to the fourth moment, the number of parameters satisfies LK ≫ 1, and the learning rate satisfies η ≪ 1, the training dynamics of a QNN instance (|ψ0⟩, U(θ), H) yield

ε_t ≤ (1 − η Q̄)^t ε_0 ≈ e^{−γ t} ε_0,   (4)

where γ = η Q̄ indicates the decay rate and Q̄ = O(LK Tr(H²)/4^n) refers to the expectation of Q on Haar average.

Theorem 1 indicates that the critical point to reach the over-parameterization region is |θ| ∼ O(4^n/(η Tr(H²))). In this setting, the exponent in Eqn. (4) meets γ = O(1) and promises an exponential convergence. Besides, Eqn. (4) hints that the convergence rate of QNNs is continuously enhanced by increasing the value of the QNTK, which can be achieved by growing the number of parameters or decreasing the system size.

2.2 Effective Quantum Neural Tangent Kernel

The exponential scaling with the number of qubits n in Theorem 1 makes the realization of over-parameterized QNNs impractical for large systems. Moreover, the corresponding convergence rate is independent of the refined structure information of the ansatz. This contradicts the empirical evidence that symmetric ansätze outperform asymmetric ansätze with fast convergence when training QNNs. New theories are therefore highly demanded to address these issues.

Figure 2: Training dynamics of QNNs with symmetric ansätze. The left and right panels illustrate the dynamics of the variational states (the optimization path) for the asymmetric and symmetric ansätze, respectively. The shaded region denotes the solution space, which is the whole Hilbert space for the asymmetric ansatz and the restricted invariant subspace for the symmetric ansatz.

Here we propose a novel concept, the effective QNTK (EQNTK), to resolve the above dilemma and exhibit how symmetry improves the trainability of QNNs. As shown in Fig. 2, given a QNN (|ψ0⟩, U(θ), H) whose input state |ψ0⟩ and ground state |ψ*⟩ live in the same subspace V′ ⊆ C^{2^n}, with V′ invariant under the employed symmetric ansatz U(θ), the training dynamics of |ψ(θ)⟩ = U(θ)|ψ0⟩ can be exactly captured within V′, a much smaller space than the whole state space. Suppose that the state space H under the symmetric ansatz design A can be decomposed into H = ⊕_{j=1}^{p} V_j and there exists j ∈ [p] such that V_j = V′ includes the input state |ψ0⟩, the ground state |ψ*⟩, and all possible variational states {|ψ(θ)⟩ | θ ∈ Θ}. Then the dynamics of |ψ(θ)⟩ is governed by the dimension of this subspace, d_eff = |V′|, dubbed the effective dimension of the QNN.

Definition 2 (Effective dimension). Consider a QNN instance (|ψ0⟩, U(θ), H) with symmetric ansatz design A. Suppose V′ is the invariant subspace covering |ψ0⟩, {|ψ(θ)⟩ | θ ∈ Θ}, and |ψ*⟩. Then the effective dimension of this QNN is d_eff = |V′|.

The projection onto this subspace is defined as Π′ = PP†, where P ∈ C^{d × d_eff} collects an arbitrary orthonormal basis of V′.
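As a rough numerical stand-in for Definition 2 (again a sketch under the dense-matrix assumption, reusing numpy from the previous snippet), one can grow an orthonormal basis of the smallest subspace that contains |ψ0⟩ and is closed under every generator G_k; its column count estimates d_eff, and projecting H onto it yields the H′ = P†HP appearing in Theorem 2. Whether the ground state actually lies inside this subspace still has to be verified separately.

```python
import numpy as np

def invariant_subspace(psi0, generators, tol=1e-10):
    """Orthonormal basis P of the smallest subspace containing |psi0> that is
    invariant under every generator G_k (a numerical stand-in for V' in Definition 2)."""
    basis = [psi0 / np.linalg.norm(psi0)]
    frontier = [basis[0]]
    while frontier:
        new_frontier = []
        for v in frontier:
            for G in generators:
                w = G @ v
                for b in basis:                      # Gram-Schmidt against current basis
                    w = w - np.vdot(b, w) * b
                if np.linalg.norm(w) > tol:
                    w = w / np.linalg.norm(w)
                    basis.append(w)
                    new_frontier.append(w)
        frontier = new_frontier
    return np.stack(basis, axis=1)                   # shape (2^n, d_eff)

# Usage sketch:
#   P = invariant_subspace(psi0, generators)
#   d_eff = P.shape[1]
#   H_eff = P.conj().T @ H @ P                       # H' = P^dag H P of Theorem 2
```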
As a result, for symmetric ansätze, Q(t) and Q̄ in Theorem 1 should be controlled by d_eff instead of 2^n. This integration of the effective dimension transforms the QNTK in Eqn. (3) into the EQNTK, which reduces the threshold to reach over-parameterization and accelerates the convergence (see Fig. 1). The following theorem establishes the convergence theory of QNNs with symmetric ansätze under EQNTK; the proof is deferred to Appendix C.

Theorem 2. Consider the QNN instance (|ψ0⟩, U(θ), H) with effective dimension d_eff. Following the notations in Eqns. (1)-(3), when the distribution of U(θ) constrained to the invariant subspace with projection Π′ = PP† matches the Haar distribution up to the fourth moment, the number of parameters satisfies LK ≫ 1, and the learning rate satisfies η ≪ 1, then, denoting H′ = P†HP, the training dynamics of the QNN instance (|ψ0⟩, U(θ), H) yield

ε_t ≤ (1 − η Q̄_S)^t ε_0 ≈ e^{−γ t} ε_0,   (5)

where Q̄_S = O(LK Tr((H′)²)/d_eff²) refers to the expectation of the EQNTK Q_S on Haar average.

The above results indicate that when the number of trainable parameters scales as LK ∼ O(d_eff²/(η Tr((H′)²))), the adopted symmetric ansatz reaches the over-parameterization regime. Compared with the QNTK, the reduction of parameters by a factor of order (2^n/d_eff)² not only ensures the practical utility of over-parameterized QNNs, but also explains the empirical observation that symmetric ansätze require fewer parameters to reach the critical point than asymmetric ansätze. More importantly, unlike prior results arguing that trainability can always be improved by over-parameterization, our bound suggests that adding more parameters beyond the critical point may degrade the convergence, since the underlying symmetry may be broken and the effective dimension d_eff could become large.

Remark. (i) The derived EQNTK can be used to diagnose the barren plateaus of QNNs. Particularly, the quantity Q_S/(LK) amounts to the variance of the gradient, whose average is zero under the 4-design assumption. In other words, when the number of parameters LK is fixed, a large EQNTK value is preferred to avoid barren plateaus. (ii) The effective dimension can be quantified by metrics other than the dimension of the invariant subspace. An alternative is the dynamical Lie algebra (DLA) (Larocca et al., 2021b), which measures the controllability of the quantum system. The following proposition shows the convergence of QNNs with symmetric ansätze under this measure; the proof is given in Appendix D.

Proposition 1 (Informal). Following the notations in Theorem 2, denote the dynamical Lie algebra of the QNN instance (|ψ0⟩, U(θ), H) as g, and assume that the number of parameters satisfies LK ≫ 1 and the learning rate satisfies η ≪ 1. If there exists an invariant subspace V_g with dimension d_g under the DLA g that includes the input state |ψ0⟩ and the ground state |ψ*⟩, the DLA-based EQNTK Q_D corresponding to the ansatz U(θ) leads to the training dynamics

ε_t ≤ (1 − η Q̄_D)^t ε_0 ≈ e^{−γ t} ε_0,   (6)

where Q̄_D = O(LK Tr((H′)²)/d_g²) refers to the expectation of Q_D on Haar average.
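For small instances, the exponential decay predicted by Theorems 1 and 2 can be checked directly. The sketch below is ours: it reuses the `ansatz` helper from Sec. 2.1 and uses plain finite differences instead of the parameter-shift rule, running gradient descent on L(θ) while recording the residual ε_t together with the kernel value Σ_{ℓ,k}(∂ε_t/∂θ_{ℓk})², so that ε_t can be compared against the (1 − ηQ̄)^t envelope.

```python
import numpy as np

def residual(thetas, generators, H, psi0, E0):
    """epsilon(theta) = <psi0| U^dag H U |psi0> - E0, the error entering Eqn. (3)."""
    psi = ansatz(thetas, generators) @ psi0        # `ansatz` from the Sec. 2.1 sketch
    return np.real(np.vdot(psi, H @ psi)) - E0

def train(thetas, generators, H, psi0, E0, eta=0.01, steps=200, delta=1e-5):
    """Gradient descent on L = 0.5 * eps^2; returns (eps_t, Q(t)) per iteration."""
    trace = []
    for _ in range(steps):
        eps = residual(thetas, generators, H, psi0, E0)
        grad = np.zeros_like(thetas)
        for idx in np.ndindex(thetas.shape):       # finite-difference gradient of eps
            shift = np.zeros_like(thetas)
            shift[idx] = delta
            grad[idx] = (residual(thetas + shift, generators, H, psi0, E0)
                         - residual(thetas - shift, generators, H, psi0, E0)) / (2 * delta)
        trace.append((eps, float(np.sum(grad ** 2))))   # (residual, kernel value)
        thetas = thetas - eta * eps * grad              # dL/dtheta = eps * d(eps)/d(theta)
    return trace
```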
3 Symmetric pruning with EQNTK

Algorithm 1: Symmetric pruning (SP)
Input: Problem Hamiltonian H̃ = (Σ_{j=1}^{q} α_j H_j) ⊗ I^{⊗m}, the ansatz design A, and the parameter space Θ in Eqn. (2).
Step 1. Initialize an over-parameterized and asymmetric ansatz via A and Θ.
Step 2. Symmetry identification:
  2-1. Remove the gates on wires corresponding to the redundant part of H̃ in A, i.e., I^{⊗m}.
  2-2. Remove gates such that the pruned ansatz design is A_pr = {H1, …, Hq}.
  2-3. Assign the spatial symmetry of A_pr by correlating some parameters and obtain Θ_pr ⊆ Θ.
Output: Pruned ansatz design A_pr and parameter space Θ_pr.

Beyond analyzing the convergence rate, another central topic in GSP is designing advanced ansätze to improve the trainability of QNNs. Although over-parameterization and contemporary symmetric ansätze partially address this problem, both of them have evident caveats. The former may request exponentially many parameters to satisfy the condition of over-parameterization, while the latter requires explicit information about the symmetries of the problem Hamiltonian. To compensate for these deficiencies, here we devise symmetric pruning (SP), an automatic scheme to design symmetric ansätze with enhanced trainability of QNNs. Conceptually, SP distills a symmetric over-parameterized ansatz from an asymmetric over-parameterized ansatz. Supported by the EQNTK theory, the extracted ansatz is resource-friendly in implementation, since it has a small effective dimension and only needs a few trainable parameters to reach over-parameterization. The pseudocode of SP is summarized in Alg. 1 and its schematic illustration is shown in Fig. 3.

Suppose the problem Hamiltonian is H̃ = H ⊗ I^{⊗m}, where H = Σ_{j=1}^{q} α_j H_j, α_j is a real coefficient, and H_j is a tensor product of Pauli matrices on n qubits. SP builds the symmetric ansatz of H̃ with two primary steps, i.e., initialization and symmetry identification. The initialization step chooses an initial over-parameterized QNN by specifying the ansatz design A and the parameter space Θ. Note that A should contain all Pauli terms in H and Θ should ensure an ε-convergence of QNNs; a possible choice is adopting a sufficiently deep hardware-efficient ansatz. Next, the symmetry identification step iteratively discovers the system symmetry, structure symmetry, and spatial symmetry, which is completed by three sub-steps. Step 2-1 symmetrically prunes the qubit wires: all quantum gates interacting with the redundant part of H̃, i.e., the identity term I^{⊗m}, are removed. Step 2-2 symmetrically prunes the structure: this step drops the parameterized single-qubit and two-qubit gates so that the pruned ansatz design A_pr can be block-diagonalized under the projection onto the eigenspace of H = Σ_{j=1}^{q} α_j H_j. A possible solution is setting A_pr = {H1, …, Hq}, and the pruned ansatz U_pr(θ) takes the form of Eqn. (2) with Uℓ(θℓ) = ∏_{k=1}^{q} e^{−i H_k θ_{ℓk}}. Step 2-3 correlates symmetric parameters to uncover the spatial symmetry of H, which is accomplished by a heuristic related to identifying the graph automorphism group (Stoichev, 2019). See Appendix E for more details; a schematic code-level sketch of these steps follows the remark below.

Remark. (i) We emphasize that although both SP and the pruning techniques used in deep neural networks aim to remove redundant parameters and (quantum) neurons, they are fundamentally different. This is because classical pruning methods generally leverage the magnitude of weights or the gradient information to recognize such redundancy, which is impermissible in QNNs (refer to Appendix F for elaborations). (ii) SP is a flexible framework. Besides the three symmetry properties in Alg. 1, SP can effectively integrate other symmetry identification methods into the second step.
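As referenced above, the following schematic Python sketch mirrors Alg. 1 for generators given as Pauli strings (e.g., "ZZIIII" on n + m qubits). It is only an illustration under simplifying assumptions: the circuit itself is not rebuilt, and the automorphism-based parameter correlation of Step 2-3 is reduced to a stub.

```python
def symmetric_pruning(hamiltonian_terms, ansatz_generators, n, m):
    """Schematic version of Alg. 1 over Pauli-string generators.
    hamiltonian_terms: Pauli strings H_j of H on n qubits, e.g. ["ZZIIII", "XIIIII"].
    ansatz_generators: Pauli strings of the initial ansatz design on n + m qubits."""
    total = n + m

    # Step 2-1 (system symmetry): drop gates acting non-trivially on the m redundant wires.
    pruned = [g for g in ansatz_generators
              if all(g[q] == "I" for q in range(n, total))]

    # Step 2-2 (structure symmetry): keep only generators matching Pauli terms of H,
    # so that A_pr = {H_1, ..., H_q} up to identity padding on the last m wires.
    target = {h + "I" * m for h in hamiltonian_terms}
    pruned = [g for g in pruned if g in target]

    # Step 2-3 (spatial symmetry): correlate parameters of gates related by a symmetry
    # of H (e.g. an automorphism of the interaction graph). Stub: every gate keeps its
    # own parameter group, i.e. no correlation is applied here.
    parameter_groups = {g: idx for idx, g in enumerate(pruned)}
    return pruned, parameter_groups
```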
Figure 3: Schematic of symmetric pruning. The proposed symmetric pruning (SP) distills the symmetric ansatz from a given asymmetric ansatz by removing the redundant gates (highlighted by the red dashed boxes) and correlating the parameters of the gates respecting the spatial symmetry (highlighted by the solid boxes with the same color); the panels correspond to the sub-steps SP1-SP3.

4 Experiments

We carry out numerical simulations to explore the theoretical properties of EQNTK and validate the effectiveness of the SP scheme in GSP. Two typical problem Hamiltonians in many-body physics and combinatorial optimization are considered. The omitted details are postponed to Appendix G.

Problem Hamiltonians. Let us first recap the two problem Hamiltonians. 1) Transverse-field Ising model. The transverse-field Ising model (TFIM) has been employed to explore many interesting quantum systems. The n-qubit Hamiltonian of the 1D TFIM with an open boundary condition is defined as H_TFIM = −Σ_{j=1}^{n−1} σ^z_j σ^z_{j+1} − h Σ_{j=1}^{n} σ^x_j, where σ^μ_j denotes the μ-Pauli matrix (with μ = x, z) acting on the j-th qubit and h is the strength of the transverse field. For simplicity, we set h = 1 in the following simulations. 2) Maximum cut. The maximum cut (MaxCut) problem aims to partition the set of nodes V of a graph G = (V, E) into two parts such that the number of edges spanning the two parts is maximized. The MaxCut problem can be recast as GSP: the objective for an n-node graph is encoded by the n-qubit Hamiltonian H_MC = −(1/2) Σ_{(u,v)∈E} (I − σ^z_u σ^z_v), and the optimal solution corresponds to the ground state of H_MC as formulated in Eqn. (1). Here we focus on Erdős-Rényi graphs, which are generated by randomly connecting any pair of nodes among n nodes with probability p = 0.6. To verify the effectiveness of SP, the above problem Hamiltonians are modified into the (n+m)-qubit Hamiltonian H̃ = H_M ⊗ I^{⊗m} (with H_M = H_TFIM or H_MC), n = 6, and m = 2 (a dense-matrix construction of both Hamiltonians is sketched at the end of this subsection).

Initialization of QNNs. The hardware-efficient ansatz (HEA) in the form of Eqn. (2) is used as the initial ansatz, which is over-parameterized and asymmetric. The layer number is set as L ∈ {4, 6, 8, …, 28} and L ∈ {4, 6, 10, …, 36} for TFIM and MaxCut, respectively. For each problem Hamiltonian, the input state is set as |ψ0⟩ = |0⟩ (the all-zero computational basis state). The parameters θ are sampled from the uniform distribution over [−π, π]. The variational ansatz is trained by the Adam optimizer with learning rate 0.001 and the remaining hyper-parameters at their default settings. The training of QNNs stops when the loss value is less than 10^{-8} or when the change in the loss function is less than 10^{-8} three times in a row. The maximum number of iterations is set as T = 10000. The ε value in Definition 1 is set as 10^{-5} for both TFIM and MaxCut. Each setting is repeated 5 times to collect the statistical results.

Evaluation metrics. We utilize three metrics to assess the convergence rate of QNNs: (1) the loss value L(θ(T)) at the convergence stage; (2) the number of iteration steps T(ε) ≤ T required to achieve the ε-convergence; (3) the minimum number of parameterized gates required to achieve ε-convergence, which can also be interpreted as the threshold to reach the over-parameterization regime. Additionally, we record the norm of the gradient at initialization to verify the effectiveness of EQNTK in predicting the convergence.
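As mentioned above, both problem Hamiltonians can be assembled with a few lines of dense-matrix Python (our sketch; the sign conventions follow the formulas given earlier, and the construction is only feasible for small n):

```python
import numpy as np
from functools import reduce

I2 = np.eye(2, dtype=complex)                    # single-qubit identity
X = np.array([[0, 1], [1, 0]], dtype=complex)    # Pauli sigma^x
Z = np.array([[1, 0], [0, -1]], dtype=complex)   # Pauli sigma^z

def pauli_string(ops, n):
    """Dense n-qubit operator acting with ops[q] on qubit q and identity elsewhere."""
    return reduce(np.kron, [ops.get(q, I2) for q in range(n)])

def h_tfim(n, h=1.0):
    """H_TFIM = -sum_j Z_j Z_{j+1} - h * sum_j X_j with open boundaries."""
    H = np.zeros((2 ** n, 2 ** n), dtype=complex)
    for j in range(n - 1):
        H -= pauli_string({j: Z, j + 1: Z}, n)
    for j in range(n):
        H -= h * pauli_string({j: X}, n)
    return H

def h_maxcut(edges, n):
    """H_MC = -0.5 * sum_{(u,v) in E} (I - Z_u Z_v); its ground state encodes a maximum cut."""
    H = np.zeros((2 ** n, 2 ** n), dtype=complex)
    for u, v in edges:
        H -= 0.5 * (np.eye(2 ** n) - pauli_string({u: Z, v: Z}, n))
    return H

# Idle wires are appended as in the experiments (n = 6, m = 2): H_tilde = H_M (x) I^(x)m.
H_tilde = np.kron(h_tfim(6), np.eye(2 ** 2))
```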
4.1 Simulation results

EQNTK at initialization. Fig. 4(b) and Fig. 5(b) show that each symmetric pruning step in Alg. 1 (i.e., the sub-steps 2-1, 2-2, and 2-3, referred to as SP1, SP2, and SP3, respectively) consistently improves the squared norm of the gradients (i.e., the EQNTK value Q_S) for both TFIM and MaxCut. This indicates that SP is capable of accelerating the convergence of QNNs and alleviating barren plateaus, in accordance with Theorem 2. Notably, the EQNTK Q_S of the pruned ansatz (labeled SP3) is improved by 20 (or 14) times compared to the initial over-parameterized ansatz (labeled SP0) for TFIM (or MaxCut).

Figure 4: Results for the TFIM model under symmetric pruning. The panels (a)-(c) plot the EQNTK value Q_S at initialization, the loss value after convergence L(θ(T)), and the iteration number to achieve the ε-convergence T(ε) versus the number of parameters LK, respectively. The labels SP0-SP3 refer to the initial ansatz and the pruned ansätze after system, structure, and spatial symmetric pruning, respectively.

Critical point of QNNs. Fig. 4(c) and Fig. 5(c) illustrate that when the number of parameters surpasses a threshold, QNNs experience a computational phase transition in which the loss value L(θ(T)) at the convergence stage sharply drops by an order of magnitude. Moreover, the minimum number of parameters required to reach the over-parameterization regime, highlighted by the critical point, is dramatically reduced by SP. Specifically, the number of parameters of the naive QNN at the critical point for TFIM (or MaxCut), i.e., the setting labeled SP0, scales exponentially with the system size. By contrast, it is gradually reduced from 800 (or 1000) to 300 (or 300) after SP1, then to 120 (or 150) after SP2, and finally to 50 (or 100) after SP3, which scales polynomially with the system size and is resource-friendly for modern quantum devices (see Appendix G for a hardware-efficiency analysis).

Convergence of QNNs. Fig. 4(d) and Fig. 5(d) show that SP dramatically improves the convergence of QNNs. In the common over-parameterization regime of SP1-SP3 (i.e., LK ≥ 300), the total number of iterations required to achieve ε-convergence can be reduced by up to 6000 steps for TFIM and 5000 steps for MaxCut. For the same ansatz design, increasing the number of parameters linearly improves the convergence, which echoes Theorem 2.

EQNTK and trainability of QNNs. Fig. 4 and Fig. 5 indicate the relation between the EQNTK value and the trainability of QNNs: both the number of iterations to converge and the number of parameters around the critical point decrease as the EQNTK value increases. In MaxCut, when the number of parameters LK ≥ 450 reaches the over-parameterization regime in the cases of SP1-SP3, the corresponding EQNTK values satisfy Q1 : Q2 : Q3 ≈ 1 : 2 : 4 and the iteration steps T(ε) follow T1 : T2 : T3 ≈ 4 : 2 : 1 (Tj and Qj refer to T(ε) and Q_S in SPj with j ∈ [3]). A similar phenomenon is observed in the TFIM task. These results accord with Theorem 2 in that EQNTK can guide the trainability of QNNs.

5 Related work

Prior literature related to our work can be cast into two categories: trainability theories of QNNs and the design of symmetric ansätze.

Trainability of QNNs. McClean et al. (2018) first discovered the barren plateau of QNNs. Since then, a line of research has been uncovering the intrinsic reasons behind this phenomenon.
Current progress has revealed that these reasons include the high entanglement of QNNs (Marrero et al., 2021), the use of global measurements (Cerezo et al., 2021b), and the presence of noise (Wang et al., 2021). To mitigate barren plateaus, two popular approaches are adopting shallow QNNs with local measurements (Uvarov & Biamonte, 2021; Pesah et al., 2021; Du et al., 2022a; Zhang et al., 2020) and correlating parameters (Volkoff & Coles, 2021).

Figure 5: Results for MaxCut under SP. The notations are identical to those in Fig. 4.

Another crucial line of research investigates the convergence rate of QNNs. Several empirical studies have observed that over-parameterized QNNs promise faster convergence and that the trained parameters are near-optimal (Kiani et al., 2020; Wiersema et al., 2020; Zhang & Cui, 2020). Afterward, initial attempts have been made to theoretically explain the superiority of over-parameterized QNNs. Specifically, Larocca et al. (2021b) and Anschuetz (2021) separately leveraged the tools of dynamical Lie algebra and random matrix theory to quantify the critical point of over-parameterized QNNs; You et al. (2022) extended the result of Xu et al. (2018) to the quantum regime and proved the exponential convergence rate of over-parameterized QNNs; Liu et al. (2022c;b;d;a) proposed the quantum neural tangent kernel to exhibit an exponential convergence rate of over-parameterized QNNs. A common caveat of the prior literature is that its convergence analysis either requires an exponential circuit depth to reach over-parameterization or omits the factor of the circuit depth. By contrast, EQNTK greatly relaxes the harsh requirement for reaching over-parameterization and allows a tighter convergence rate for the symmetric ansatz.

Ansätze with symmetric properties. Previous studies focus on unearthing the inherent symmetry behind the problem Hamiltonian to design problem-specific ansätze. The mainstream approaches include arranging the layout of ansätze (Liu et al., 2019; Seki et al., 2020; Gard et al., 2020; Zheng et al., 2021; 2022), correlating trainable parameters (Shaydulin et al., 2021; Shaydulin & Wild, 2021; Sauvage et al., 2022), and utilizing results from geometric deep learning (Shaydulin et al., 2021; Shaydulin & Wild, 2021; Sauvage et al., 2022; Meyer et al., 2022), where the symmetry comes from the training data. Compared with asymmetric ansätze, these ansätze enable better trainability in GSP. However, none of the previous proposals can identify the implicit symmetry of the problem Hamiltonian. Moreover, although there is numerical evidence that symmetric ansätze can accelerate convergence, theoretical analysis is still rare. EQNTK readily compensates for these issues: it provides an efficient measure to compare the trainability of various ansätze and allows an automatic method (symmetric pruning) to design symmetric ansätze with fast convergence.

6 Conclusions

In this study, we investigate the training performance of QNNs for the GSP problem by developing a novel tool, EQNTK, which is capable of capturing the training dynamics of various ansätze via their effective dimension. We prove that a symmetric ansatz design with a small effective dimension enables improved trainability of QNNs, including alleviating barren plateaus and reducing the number of parameters and the circuit depth required to reach the over-parameterization regime.
Besides, we propose a novel symmetric pruning algorithm to automatically extract the symmetric ansatz from an over-parameterized and asymmetric ansatz. Empirical results confirm the effectiveness of SP. A future research direction is extending the results of EQNTK from GSP to the regime of machine learning and exploring whether over-parameterized QNNs can simultaneously attain good trainability and generalization (Abbas et al., 2021; Du et al., 2022b; Caro et al., 2022). Published as a conference paper at ICLR 2023 7 Acknowledgements Yong Luo was supported by the Special Fund of Hubei Luojia Laboratory under Grant 220100014 and the National Natural Science Foundation of China under Grant 62276195. Tongliang Liu was partially supported by Australian Research Council Projects IC190100031, LP-220100527, DP-220102121, and FT-220100318. Amira Abbas, David Sutter, Christa Zoufal, Aurélien Lucchi, Alessio Figalli, and Stefan Woerner. The power of quantum neural networks. Nature Computational Science, 1(6): 403 409, 2021. Eric R Anschuetz. Critical points in hamiltonian agnostic variational quantum algorithms. ar Xiv preprint ar Xiv:2109.06957, 2021. Eric R Anschuetz and Bobak T Kiani. Beyond barren plateaus: Quantum variational algorithms are swamped with traps. ar Xiv preprint ar Xiv:2205.05786, 2022. Kishor Bharti, Alba Cervera-Lierta, Thi Ha Kyaw, Tobias Haug, Sumner Alperin-Lea, Abhinav Anand, Matthias Degroote, Hermanni Heimonen, Jakob S Kottmann, Tim Menke, et al. Noisy intermediate-scale quantum algorithms. Reviews of Modern Physics, 94(1): 015004, 2022. Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural network pruning? ar Xiv preprint ar Xiv:2003.03033, 2020. Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004. Giuseppe Carleo, Ignacio Cirac, Kyle Cranmer, Laurent Daudet, Maria Schuld, Naftali Tishby, Leslie Vogt-Maranto, and Lenka Zdeborová. Machine learning and the physical sciences. Reviews of Modern Physics, 91(4):045002, 2019. Matthias C Caro, Hsin-Yuan Huang, Marco Cerezo, Kunal Sharma, Andrew Sornborger, Lukasz Cincio, and Patrick J Coles. Generalization in quantum machine learning from few training data. Nature Communications, 13(1):1 11, 2022. Marco Cerezo and Patrick J Coles. Higher order derivatives of quantum neural networks with barren plateaus. Quantum Science and Technology, 6(3):035006, 2021. Marco Cerezo, Andrew Arrasmith, Ryan Babbush, Simon C Benjamin, Suguru Endo, Keisuke Fujii, Jarrod R Mc Clean, Kosuke Mitarai, Xiao Yuan, Lukasz Cincio, et al. Variational quantum algorithms. Nature Reviews Physics, pp. 1 20, 2021a. doi: https: //doi.org/10.1038/s42254-021-00348-9. Marco Cerezo, Akira Sone, Tyler Volkoff, Lukasz Cincio, and Patrick J Coles. Cost function dependent barren plateaus in shallow parametrized quantum circuits. Nature communications, 12(1):1 12, 2021b. Iris Cong, Soonwon Choi, and Mikhail D Lukin. Quantum convolutional neural networks. Nature Physics, 15(12):1273 1278, 2019. Paul T Darga, Mark H Liffiton, Karem A Sakallah, and Igor L Markov. Exploiting structure in symmetry detection for cnf. In Proceedings of the 41st Annual Design Automation Conference, pp. 530 534, 2004. Yuxuan Du, Min-Hsiu Hsieh, Tongliang Liu, Shan You, and Dacheng Tao. Learnability of quantum neural networks. PRX Quantum, 2(4):040337, 2021. Yuxuan Du, Tao Huang, Shan You, Min-Hsiu Hsieh, and Dacheng Tao. Quantum circuit architecture search for variational quantum algorithms. 
npj Quantum Information, 8(1): 1 8, 2022a. Published as a conference paper at ICLR 2023 Yuxuan Du, Zhuozhuo Tu, Xiao Yuan, and Dacheng Tao. Efficient measure for the expressivity of variational quantum algorithms. Physical Review Letters, 128(8):080506, 2022b. Edward Farhi and Hartmut Neven. Classification with quantum neural networks on near term processors. ar Xiv preprint ar Xiv:1802.06002, 2018. URL https://arxiv.org/abs/ 1802.06002. Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. A quantum approximate optimization algorithm. ar Xiv preprint ar Xiv:1411.4028, 2014. Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. ar Xiv preprint ar Xiv:1803.03635, 2018. Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Pruning neural networks at initialization: Why are we missing the mark? ar Xiv preprint ar Xiv:2009.08576, 2020. Motohisa Fukuda, Robert König, and Ion Nechita. Rtni a symbolic integrator for haarrandom tensor networks. Journal of Physics A: Mathematical and Theoretical, 52(42): 425303, 2019. Bryan T Gard, Linghua Zhu, George S Barron, Nicholas J Mayhall, Sophia E Economou, and Edwin Barnes. Efficient symmetry-preserving state preparation circuits for the variational quantum eigensolver algorithm. npj Quantum Information, 6(1):1 9, 2020. Edward Grant, Leonard Wossnig, Mateusz Ostaszewski, and Marcello Benedetti. An initialization strategy for addressing barren plateaus in parametrized quantum circuits. Quantum, 3:214, 2019. Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. ar Xiv preprint ar Xiv:1510.00149, 2015. Babak Hassibi and David Stork. Second order derivatives for network pruning: Optimal brain surgeon. Advances in neural information processing systems, 5, 1992. Y Herasymenko and TE O Brien. A diagrammatic approach to variational quantum ansatz construction. Quantum, 5:596, 2021. Tommi Junttila and Petteri Kaski. Engineering an efficient canonical labeling tool for large and sparse graphs. In 2007 Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 135 149. SIAM, 2007. Abhinav Kandala, Antonio Mezzacapo, Kristan Temme, Maika Takita, Markus Brink, Jerry M Chow, and Jay M Gambetta. Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets. Nature, 549(7671):242 246, 2017. Richard M Karp. Reducibility among combinatorial problems. In Complexity of computer computations, pp. 85 103. Springer, 1972. Bobak Toussi Kiani, Seth Lloyd, and Reevu Maity. Learning unitaries by gradient descent. ar Xiv preprint ar Xiv:2001.11897, 2020. Martin Larocca, Piotr Czarnik, Kunal Sharma, Gopikrishnan Muraleedharan, Patrick J Coles, and M Cerezo. Diagnosing barren plateaus with tools from quantum optimal control. ar Xiv preprint ar Xiv:2105.14377, 2021a. Martin Larocca, Nathan Ju, Diego García-Martín, Patrick J Coles, and M Cerezo. Theory of overparametrization in quantum neural networks. ar Xiv preprint ar Xiv:2109.11676, 2021b. Martin Larocca, Frederic Sauvage, Faris M Sbahi, Guillaume Verdon, Patrick J Coles, and Marco Cerezo. Group-invariant quantum machine learning. ar Xiv preprint ar Xiv:2205.02261, 2022. Published as a conference paper at ICLR 2023 Yann Le Cun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989. Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. 
Snip: Single-shot network pruning based on connection sensitivity. ar Xiv preprint ar Xiv:1810.02340, 2018. Jin-Guo Liu, Yi-Hong Zhang, Yuan Wan, and Lei Wang. Variational quantum eigensolver with fewer qubits. Physical Review Research, 1(2):023025, 2019. Junyu Liu, Zexi Lin, and Liang Jiang. Laziness, barren plateau, and noise in machine learning. ar Xiv preprint ar Xiv:2206.09313, 2022a. Junyu Liu, Khadijeh Najafi, Kunal Sharma, Francesco Tacchino, Liang Jiang, and Antonio Mezzacapo. An analytic theory for the dynamics of wide quantum neural networks. ar Xiv preprint ar Xiv:2203.16711, 2022b. Junyu Liu, Francesco Tacchino, Jennifer R. Glick, Liang Jiang, and Antonio Mezzacapo. Representation Learning via Quantum Neural Tangent Kernels. PRX Quantum, 3(3): 030323, 2022c. doi: 10.1103/PRXQuantum.3.030323. Junyu Liu, Changchun Zhong, Matthew Otten, Cristian L. Cortes, Chaoyang Ti, Stephen K. Gray, and Xu Han. Quantum Kerr Learning. 5 2022d. José Luis López-Presa, Luis F Chiroque, and Antonio Fernández Anta. Novel techniques to speed up the computation of the automorphism group of a graph. Journal of Applied Mathematics, 2014, 2014. Carlos Ortiz Marrero, Mária Kieferová, and Nathan Wiebe. Entanglement-induced barren plateaus. PRX Quantum, 2(4):040316, 2021. Iman Marvian. Restrictions on realizable unitary operations imposed by symmetry and locality. Nature Physics, 18(3):283 289, 2022. Jarrod R Mc Clean, Sergio Boixo, Vadim N Smelyanskiy, Ryan Babbush, and Hartmut Neven. Barren plateaus in quantum neural network training landscapes. Nature communications, 9(1):1 6, 2018. Brendan D Mc Kay and Adolfo Piperno. Practical graph isomorphism, ii. Journal of symbolic computation, 60:94 112, 2014. Brendan D Mc Kay et al. Practical graph isomorphism. 1981. Péter Mernyei, Konstantinos Meichanetzidis, and Ismail Ilkan Ceylan. Equivariant quantum graph circuits. In International Conference on Machine Learning, pp. 15401 15420. PMLR, 2022. Johannes Jakob Meyer, Marian Mularski, Elies Gil-Fuster, Antonio Anna Mele, Francesco Arzani, Alissa Wilms, and Jens Eisert. Exploiting symmetry in variational quantum machine learning. ar Xiv preprint ar Xiv:2205.06217, 2022. Kosuke Mitarai, Makoto Negoro, Masahiro Kitagawa, and Keisuke Fujii. Quantum circuit learning. Physical Review A, 98(3):032309, 2018. Alberto Peruzzo, Jarrod Mc Clean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J Love, Alán Aspuru-Guzik, and Jeremy L O brien. A variational eigenvalue solver on a photonic quantum processor. Nature communications, 5:4213, 2014. Arthur Pesah, M Cerezo, Samson Wang, Tyler Volkoff, Andrew T Sornborger, and Patrick J Coles. Absence of barren plateaus in quantum convolutional neural networks. Physical Review X, 11(4):041011, 2021. David Poulin and Pawel Wocjan. Preparing ground states of quantum many-body systems on a quantum computer. Physical review letters, 102(13):130503, 2009. Published as a conference paper at ICLR 2023 John Preskill. Quantum computing in the nisq era and beyond. Quantum, 2:79, 2018. doi: https://doi.org/10.22331/q-2018-08-06-79. Yang Qian, Xinbiao Wang, Yuxuan Du, Xingyao Wu, and Dacheng Tao. The dilemma of quantum neural networks. ar Xiv preprint ar Xiv:2106.04975, 2021. Frederic Sauvage, Martin Larocca, Patrick J Coles, and M Cerezo. Building spatial symmetries into parameterized quantum circuits for faster training. ar Xiv preprint ar Xiv:2207.14413, 2022. Maria Schuld, Ryan Sweke, and Johannes Jakob Meyer. 
Effect of data encoding on the expressive power of variational quantum-machine-learning models. Physical Review A, 103(3):032430, 2021. Kazuhiro Seki, Tomonori Shirakawa, and Seiji Yunoki. Symmetry-adapted variational quantum eigensolver. Physical Review A, 101(5):052340, 2020. Ruslan Shaydulin and Stefan M Wild. Exploiting symmetry reduces the cost of training qaoa. IEEE Transactions on Quantum Engineering, 2:1 9, 2021. Ruslan Shaydulin, Stuart Hadfield, Tad Hogg, and Ilya Safro. Classical symmetries and the quantum approximate optimization algorithm. Quantum Information Processing, 20(11): 1 28, 2021. Andrea Skolik, Jarrod R Mc Clean, Masoud Mohseni, Patrick van der Smagt, and Martin Leib. Layerwise learning for quantum neural networks. Quantum Machine Intelligence, 3(1):1 11, 2021. Stoicho D Stoichev. New exact and heuristic algorithms for graph automorphism group and graph isomorphism. Journal of Experimental Algorithmics (JEA), 24:1 27, 2019. AV Uvarov and Jacob D Biamonte. On barren plateaus and cost function locality in variational quantum algorithms. Journal of Physics A: Mathematical and Theoretical, 54(24): 245301, 2021. Tyler Volkoff and Patrick J Coles. Large gradients via correlation in random parameterized quantum circuits. Quantum Science and Technology, 6(2):025008, 2021. Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. ar Xiv preprint ar Xiv:2002.07376, 2020. Hanrui Wang, Yongshan Ding, Jiaqi Gu, Yujun Lin, David Z Pan, Frederic T Chong, and Song Han. Quantumnas: Noise-adaptive search for robust quantum circuits. In 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 692 708. IEEE, 2022. Samson Wang, Enrico Fontana, Marco Cerezo, Kunal Sharma, Akira Sone, Lukasz Cincio, and Patrick J Coles. Noise-induced barren plateaus in variational quantum algorithms. Nature communications, 12(1):1 11, 2021. Dave Wecker, Matthew B Hastings, and Matthias Troyer. Progress towards practical quantum variational algorithms. Physical Review A, 92(4):042303, 2015. John W Wheeler. An investigation of the max-cut problem. University of Iowa, 2004. Roeland Wiersema, Cunlu Zhou, Yvette de Sereville, Juan Felipe Carrasquilla, Yong Baek Kim, and Henry Yuen. Exploring entanglement and optimization within the hamiltonian variational ansatz. PRX Quantum, 1(2):020319, 2020. Zhiqiang Xu, Xin Cao, and Xin Gao. Convergence analysis of gradient descent for eigenvector computation. International Joint Conferences on Artificial Intelligence, 2018. Xuchen You, Shouvanik Chakrabarti, and Xiaodi Wu. A convergence theory for overparameterized variational quantum eigensolvers. ar Xiv preprint ar Xiv:2205.12481, 2022. Published as a conference paper at ICLR 2023 Kaining Zhang, Min-Hsiu Hsieh, Liu Liu, and Dacheng Tao. Toward trainability of quantum neural networks. ar Xiv preprint ar Xiv:2011.06258, 2020. Stephen Zhang and Wentao Cui. Overparametrization in qaoa. Written Report, 2020. Han Zheng, Zimu Li, Junyu Liu, Sergii Strelchuk, and Risi Kondor. Speeding up learning quantum states through group equivariant convolutional quantum ans {\" a} tze. ar Xiv preprint ar Xiv:2112.07611, 2021. Han Zheng, Zimu Li, Junyu Liu, Sergii Strelchuk, and Risi Kondor. On the Superexponential Quantum Speedup of Equivariant Quantum Machine Learning Algorithms with SU(d) Symmetry. 7 2022. Zeqiao Zhou, Yuxuan Du, Xinmei Tian, and Dacheng Tao. Qaoa-in-qaoa: solving large-scale maxcut problems on small quantum machines. 
arXiv preprint arXiv:2205.11762, 2022.

A Optimization of QNNs in GSP

In this section, we separately elaborate on the elementary notions of quantum computing, the preliminaries of Hamiltonians and ground state preparation (GSP), and the optimization strategy of QNNs in the task of GSP.

Basics of quantum computation. The elementary unit of quantum computation is the qubit (or quantum bit), which is the quantum-mechanical analogue of a classical bit. A qubit is a two-level quantum-mechanical system described by a unit vector in the Hilbert space C². In Dirac notation, a qubit state is defined as |ϕ⟩ = c0|0⟩ + c1|1⟩ ∈ C², where |0⟩ = [1, 0]^T and |1⟩ = [0, 1]^T specify two unit bases and the coefficients c0, c1 ∈ C satisfy |c0|² + |c1|² = 1. Similarly, the quantum state of n qubits is defined as a unit vector in C^{2^n}, i.e., |ψ⟩ = Σ_{j=1}^{2^n} c_j |e_j⟩, where |e_j⟩ ∈ R^{2^n} is the computational basis state whose j-th entry is 1 and other entries are 0, and Σ_{j=1}^{2^n} |c_j|² = 1 with c_j ∈ C. Besides Dirac notation, the density matrix can be used to describe more general qubit states. For example, the density matrix of the state |ψ⟩ is ρ = |ψ⟩⟨ψ| ∈ C^{2^n × 2^n}, where ⟨ψ| = |ψ⟩† refers to the complex conjugate transpose of |ψ⟩. For a set of qubit states {p_j, |ψ_j⟩}_{j=1}^{m} with p_j > 0, Σ_{j=1}^{m} p_j = 1, and |ψ_j⟩ ∈ C^{2^n} for j ∈ [m], the density matrix is ρ = Σ_{j=1}^{m} p_j ρ_j with ρ_j = |ψ_j⟩⟨ψ_j| and Tr(ρ) = 1.

A quantum gate is a unitary operator that evolves a quantum state ρ to another quantum state ρ′. Namely, an n-qubit gate U ∈ U(2^n) obeys UU† = U†U = I_{2^n}, where U(2^n) refers to the unitary group of dimension 2^n. Typical single-qubit quantum gates include the Pauli gates, which can be written as the Pauli matrices

X = [[0, 1], [1, 0]],   Y = [[0, −i], [i, 0]],   Z = [[1, 0], [0, −1]].   (7)

The more general quantum gates are the corresponding rotation gates RX(θ) = e^{−i θ X/2}, RY(θ) = e^{−i θ Y/2}, and RZ(θ) = e^{−i θ Z/2} with a tunable parameter θ, which can be written in matrix form as

RX(θ) = [[cos(θ/2), −i sin(θ/2)], [−i sin(θ/2), cos(θ/2)]],
RY(θ) = [[cos(θ/2), −sin(θ/2)], [sin(θ/2), cos(θ/2)]],
RZ(θ) = [[e^{−i θ/2}, 0], [0, e^{i θ/2}]].   (8)

They are equivalent to rotating by a tunable angle θ around the x, y, and z axes of the Bloch sphere, and they recover the Pauli gates X, Y, and Z (up to a global phase) when θ = π. Moreover, a multi-qubit gate can be either an individual gate (e.g., the CNOT gate) or a tensor product of multiple single-qubit gates. The quantum measurement refers to the procedure of extracting classical information from a quantum state. It is mathematically specified by a Hermitian matrix H called the observable. Applying the observable H to the quantum state |ψ⟩ yields a random variable whose expectation value is ⟨ψ|H|ψ⟩.

Hamiltonian and GSP. In quantum computation, a Hamiltonian is a Hermitian matrix that is used to characterize the evolution of a quantum system or as an observable to extract classical information from the quantum system. Specifically, under the Schrödinger equation, a quantum gate has the mathematical form U = e^{−i t H}, where H is a Hermitian matrix, called the Hamiltonian of the quantum system, and t refers to the evolution time of the Hamiltonian. Typical single-qubit Hamiltonians include the Pauli matrices defined in Eqn. (7), in which case the evolution time t corresponds to the tunable parameter θ in Eqn. (8). Any single-qubit Hamiltonian can be decomposed as a linear combination of Pauli matrices, i.e., H = a1 I + a2 X + a3 Y + a4 Z with real coefficients a_j. In the same way, a multi-qubit Hamiltonian is denoted by H = Σ_{j=1}^{4^n} a_j P_j, where P_j ∈ {I, X, Y, Z}^{⊗n} is a tensor product of Pauli matrices.
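To connect Eqns. (7)-(8) with the exponential-map picture used throughout the paper, here is a small self-contained numpy check (ours, purely illustrative) that the rotation gates are matrix exponentials of the Pauli matrices, that RX(π) recovers the Pauli-X gate up to a global phase, and that an arbitrary Hermitian matrix decomposes into Pauli matrices with real coefficients.

```python
import numpy as np
from scipy.linalg import expm

# Pauli matrices of Eqn. (7).
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def rotation(pauli, theta):
    """R_P(theta) = exp(-i * theta / 2 * P), the rotation gates of Eqn. (8)."""
    return expm(-1j * theta / 2 * pauli)

# RX(pi) equals the Pauli-X gate up to the global phase -i.
assert np.allclose(rotation(X, np.pi), -1j * X)

# A single-qubit Hamiltonian decomposes as H = sum_j a_j P_j with a_j = Tr(P_j H) / 2.
H = np.array([[0.3, 0.1 - 0.2j], [0.1 + 0.2j, -0.5]])   # an arbitrary Hermitian matrix
coeffs = [np.trace(P @ H) / 2 for P in (I2, X, Y, Z)]
H_rebuilt = sum(a * P for a, P in zip(coeffs, (I2, X, Y, Z)))
assert np.allclose(H, H_rebuilt) and np.allclose(np.imag(coeffs), 0)
```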
In quantum chemistry and quantum many-body physics, the Hermitian matrix describing the quantum system to be solved is denoted as the problem Hamiltonian. When taking the problem Hamiltonian as the observable, the quantum state |ψ is said to be the ground state of problem Hamiltonian H if the expectation value ψ | H |ψ takes the minimum eigenvalue of H, which is called the ground energy. GSP refers to preparing the ground state of the problem Hamiltonian. A popular protocol for GSP is to employ a parameterized unitary U(θ) to prepare a variational quantum state |ψ(θ) = U(θ) |ψ0 with a fixed input state |ψ0 and then optimize the parameters θ by minimizing a predefined loss function such as the Eqn. (1). Optimization of QNNs. The optimization of the loss function L(θ) in Eqn. (1) can be completed by gradient-based methods. A plethora of optimizers has been designed to estimate the optimal parameters θ = minθ L(θ). Here we introduce the implementation of the first-order gradient-based optimizer for self-consistency. Refer to Cerezo et al. (2021a) for a comprehensive review. Based on Eqn. (2), the trainable parameters of QNNs are denoted by θ = (θ 1 , , θ L ) with θℓ= (θℓ1, , θℓK)T , where the subscript ℓk refers to the k-th parameter of the ℓ-th layer Uℓfor k [K] and ℓ [L]. Recall the loss function takes the form 2 ψ0| U(θ) HU(θ) |ψ0 E0 2 , and the corresponding update rule at the t-th iteration t [T] is = θ(t) η L(θ(t)) = θ(t) η ψ0| U(θ(t)) HU(θ(t)) |ψ0 E0 ψ0| U(θ(t)) HU(θ(t)) |ψ0 E0 where η refers to the learning rate. The derivative in the last equality can be calculated via the parameter shift rule Mitarai et al. (2018). Mathematically, the derivative with respect to the parameter θℓk for ℓ [L] and k [K] is ψ0| U(θ) HU(θ) |ψ0 E0 = 1 2 sin α ψ0| U(θ+) HU((θ+) |ψ0 E0 ψ0| U((θ ) HU((θ ) |ψ0 E0 , where θ+ = θ + αeℓk, θ = θ αeℓk, eℓk is the unit vector along the θℓk axis and α can be any real number but the multiple of π because of the diverging denominator. B Equivalent training dynamics under the mean square error loss In the main text, to facilitate the convergence analysis, the loss function L(θ) in Eqn. (1) adopts the term E0, as the ground state energy of the problem Hamiltonian. Here we elucidate how to extend our results to a more general loss function in which E0 is replaced Published as a conference paper at ICLR 2023 by any C R with C E0. More specifically, the general mean square error loss function is defined as L(θ, C) = 1 2 ψ0| U(θ) HU(θ) |ψ0 C 2 1 2ε(θ, C)2, (9) where ε(θ, C) = ψ0| U(θ) HU(θ) |ψ0 C refers to the training error associated with C. For clarity, we denote the training error at the t-th iteration as εt(θ(t), C). Given two loss functions L(θ, C) and L(θ, C ) with C, C E0, their convergence behavior or training dynamics is said to be equivalent if the variational quantum state |ψ(θ) converges to a same quantum state with the same convergence rate. More concretely, for the same initial state |ψ(0) , the evolved state |ψ(t) at each iteration t for t [T] is the same. The following lemma indicates the equivalent training dynamic of QNNs under the loss functions L(θ, C) and L(θ, C ) with C, C E0. Lemma 1. Under the framework of the quantum neural tangent kernel in Eqn. (3), given any loss function L(θ, C) in Eqn. (9) with C E0, QNN obeys the same convergence rate with L(θ, E0) and the optimized variational quantum state |ψ(θ) converges to the ground state of the problem Hamiltonian H. Proof of Lemma 1. In the same manner with Eqn. (3), the QNTK of the loss function in Eqn. 
(9) can be denoted by QC = ε(θ, C) ε(θ, C). According to the explicit form of ε(θ, C) in Eqn. (9), the QNTK QC and QC is the same for any given constant C and C as ε(θ, C) = ε(θ, C ). For this reason, in the following, we use Q to refer the QNTK with any constant C. Recall the results of Theorem 1, i.e., in the case of C = E0, the training error ε(θ, E0) decays as εt(θ(t), E0) e ηQtε0(θ(0), E0). Moreover, due to εt(θ(t), E0) = εt(θ(t), C) + (C E0) for t [T], the training error of QNNs under the loss L(θ, C) is εt(θ(t), C) + (C E0) e ηQt ε0(θ(t), C) + (C E0) . (10) Since the right-hand side tends to be zero with a sufficiently large number t, this suggests εt(θ(t), C)+(C E0) 0. In other words, ε(θ, C) converges to the minimal value of E0 C with the decay rate ηQ. Supported by the variational principle, the optimized variational quantum state U(θ(T )) |ψ0 at the converging stage is exactly the ground state in which the corresponding energy estimates E0. C Proof of Theorem 2 The proof of Theorem 2 employs the following two lemmas whose proofs are given in the subsequent two subsections. Lemma 2 (Adapted from You et al. (2022), Lemma D.1). Following Definition 2, let UA be a matrix subgroup of SU(d) where each element in UA commutes with a Hermitian matrix Σ. The corresponding direct decomposition is denoted by V = Pp j=1 Vj with projection Πj. Let V be the subspace of interest which includes the input state |ψ0 and ground state |ψ . Denote Π = P P as the projection on V . Then for any Hermitian W and any unitary matrix U in the group UA, we have Π UWU Π = Π UΠ Π WΠ Π U Π . (11) Lemma 3. Following notations in Lemma 2, denote ℓ =1 Uℓ (θℓ ) k =1 e iθℓk Gk , U+,ℓk k =k e iθℓk Gk L Y ℓ =ℓ+1 Uℓ (θℓ ), (12) Published as a conference paper at ICLR 2023 the EQNTK takes the form D ψ 0 (U +,ℓk) G k, (U ,ℓk) H U ,ℓk U +,ℓk ψ 0 E2 , (13) where |ψ 0 = P |ψ and A = PAP with A {U+,ℓk, Gk, U ,ℓk, H}. Proof of Theorem 2. Following the gradient descent optimizer in Eqn. (1) with the learning rate η 1, the change of the training error of QNN can be expressed as ε θℓk đθℓk = η X ε θℓk ε = ηQSε. (14) where đε = εt+1 εt and đθℓk = θ(t+1) ℓk θ(t) ℓk , the second equality comes from the update rule with đθℓk = ηε ε/ θℓk, and the third equality uses the definition of QNTK in Eqn. (3). Following the results in Liu et al. (2022c, Theorem 1), when the EQNTK value Q(t) S is a constant, the training error decays with εt (1 ηQ(t) S )tε0 e ηQ(t) S tε0, (15) which guarantees an exponential convergence towards the global optima. To this end, the proof of Theorem 2 amounts to proving that when the number of parameters satisfies |θ| = LK 1, the EQNTK can be regarded as a constant. This can be achieved by deriving an analytical solution of Q(t) S on average as well as the fluctuations around the average for all iterations. Following the above explanations, we next analyze the average of Q(t) S . When no confusion arises, the superscript (t) of Q(t) S and U(θ(t)) are dropped in the subsequent analysis. By leveraging Lemmas 2 and 3, the Haar average of EQNTK yields Z d U+,ℓkd U ,ℓk D ψ 0 (U +,ℓk) G k, (U ,ℓk) H U ,ℓk U +,ℓk ψ 0 E2 , Z d U +,ℓkd U ,ℓk Tr ρ 0(U +,ℓk) M ,ℓk U +,ℓkρ 0(U +,ℓk) M ,ℓk U +,ℓk , Tr2 (M ,ℓk) Tr (ρ 0)2 d2 eff 1 + Tr (M ,ℓk)2 Tr (ρ 0)2 + Tr2 (M ,ℓk) Tr2 (ρ 0) deff d3 eff + Tr (M ,ℓk)2 Tr (ρ 0)2 deff d3 eff Z d U +,ℓk Tr (M ,ℓk)2 d2 eff + deff 2 Tr G k(U ,ℓk) H U ,ℓk 2 2 Tr (G k)2((U ,ℓk) H U ,ℓk)2 d2 eff + deff = 2 d2 eff + deff deff Tr((H )2 Tr2(H )) k=1 (G k)2 ! LK Tr ((H )2) d2 eff . 
C.1 Proof of Lemma 2

Proof of Lemma 2. The two conditions in the lemma, i.e., (i) any unitary $U$ in $\mathcal{U}_{\mathcal{A}}$ commutes with the Hermitian matrix $\Sigma$, and (ii) $\Sigma$ leads to the decomposition $V = \bigoplus_{j=1}^{p} V_j$ with projections $\Pi_j$, imply that $U$ is block-diagonal under the projections $\{\Pi_j\}_{j=1}^{p}$. In other words, we have $\Pi_{j'} U \Pi_j = 0$ for $j \neq j'$ and any $U \in \mathcal{U}_{\mathcal{A}}$. This observation gives the following result, i.e.,

$$\Pi' U W U^{\dagger} \Pi' = \sum_{j, j' \in [p]} \big(\Pi' U \Pi_j\big)\, W\, \big(\Pi_{j'} U^{\dagger} \Pi'\big) = \Pi' U \Pi'\, W\, \Pi' U^{\dagger} \Pi' = \Pi' U \Pi' \cdot \Pi' W \Pi' \cdot \Pi' U^{\dagger} \Pi', \qquad (19)$$

where the first equality employs the fact $I = \sum_{j=1}^{p} \Pi_j$, the second equality uses the block-diagonal structure of $U$ (the terms $\Pi' U \Pi_j$ vanish unless $V_j$ lies in $V'$), and the last equality uses the property of projections $\Pi_j^2 = \Pi_j$.

C.2 Proof of Lemma 3

Proof of Lemma 3. The explicit form of the training error $\varepsilon(\boldsymbol{\theta}) = \langle\psi_0| U(\boldsymbol{\theta})^{\dagger} H U(\boldsymbol{\theta}) |\psi_0\rangle - E_0$ leads to the explicit form of the QNTK, i.e.,

$$Q = \sum_{\ell=1}^{L}\sum_{k=1}^{K} \Big(\frac{\partial \varepsilon(\boldsymbol{\theta})}{\partial \theta_{\ell k}}\Big)^2 = \sum_{\ell=1}^{L}\sum_{k=1}^{K} \Big\langle \psi_0 \Big| U_{+,\ell k}^{\dagger} \big[ G_k,\, U_{-,\ell k}^{\dagger} H U_{-,\ell k} \big] U_{+,\ell k} \Big| \psi_0 \Big\rangle^2. \qquad (20)$$

Similarly, for the symmetric ansatz $U(\boldsymbol{\theta})$ with the projection $\Pi' = P^{\dagger}P$, the EQNTK yields

$$
\begin{aligned}
Q_S &= \sum_{\ell,k} \mathrm{Tr}\Big( |\psi_0\rangle\langle\psi_0|\, U_{+,\ell k}^{\dagger} \big[ G_k,\, U_{-,\ell k}^{\dagger} H U_{-,\ell k} \big] U_{+,\ell k} \Big)^2 \\
&= \sum_{\ell,k} \mathrm{Tr}\Big( \Pi' \rho_0 \Pi'\, U_{+,\ell k}^{\dagger} \big[ G_k,\, U_{-,\ell k}^{\dagger} H U_{-,\ell k} \big] U_{+,\ell k} \Big)^2 \\
&= \sum_{\ell,k} \mathrm{Tr}\Big( \Pi' \rho_0 \Pi'\, U_{+,\ell k}^{\dagger} \big[ G_k,\, U_{-,\ell k}^{\dagger} H U_{-,\ell k} \big] U_{+,\ell k} \Pi' \Big)^2 \\
&= \sum_{\ell,k} \mathrm{Tr}\Big( \Pi' \rho_0 \Pi'\, U_{+,\ell k}^{\dagger} \Pi' \big[ G_k,\, U_{-,\ell k}^{\dagger} H U_{-,\ell k} \big] \Pi' U_{+,\ell k} \Pi' \Big)^2 \\
&= \sum_{\ell,k} \mathrm{Tr}\Big( \Pi' \rho_0 \Pi'\, U_{+,\ell k}^{\dagger} \Pi' \big[ \Pi' G_k \Pi',\, \Pi' U_{-,\ell k}^{\dagger} \Pi' H \Pi' U_{-,\ell k} \Pi' \big] \Pi' U_{+,\ell k} \Pi' \Big)^2 \\
&= \sum_{\ell,k} \mathrm{Tr}\Big( P^{\dagger} \rho_0' P\, P^{\dagger} (U_{+,\ell k}')^{\dagger} P \big[ P^{\dagger} G_k' P,\, P^{\dagger} (U_{-,\ell k}')^{\dagger} P P^{\dagger} H' P P^{\dagger} U_{-,\ell k}' P \big] P^{\dagger} U_{+,\ell k}' P \Big)^2 \\
&= \sum_{\ell,k} \mathrm{Tr}\Big( \rho_0' (U_{+,\ell k}')^{\dagger} \big[ G_k',\, (U_{-,\ell k}')^{\dagger} H' U_{-,\ell k}' \big] U_{+,\ell k}' \Big)^2 \\
&= \sum_{\ell,k} \Big\langle \psi_0' \Big| (U_{+,\ell k}')^{\dagger} \big[ G_k',\, (U_{-,\ell k}')^{\dagger} H' U_{-,\ell k}' \big] U_{+,\ell k}' \Big| \psi_0' \Big\rangle^2, \qquad (21)
\end{aligned}
$$

where the second equality utilizes the assumption that $|\psi_0\rangle$ lies in $V'$; the third, fourth, and fifth equalities employ the property of the projection operator $(\Pi')^2 = \Pi'$ together with Lemma 2; and the remaining equalities follow from the definitions $|\psi_0'\rangle = P|\psi_0\rangle$ and $A' = P A P^{\dagger}$ with $A \in \{U_{-,\ell k}, U_{+,\ell k}, G_k, H, \rho_0\}$, together with $P P^{\dagger} = I_{d_{\rm eff}}$.
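The identity in Eqn. (11) is straightforward to sanity-check numerically. The sketch below, a toy example with arbitrary block sizes and random matrices, builds a unitary that commutes with a two-eigenspace $\Sigma$ (by exponentiating a block-diagonal Hermitian generator), picks a random Hermitian $W$ and the projector onto the first block, and verifies that both sides of Eqn. (11) agree.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d1, d2 = 2, 3                        # block sizes of the decomposition V = V_1 (+) V_2
d = d1 + d2

def random_hermitian(k):
    a = rng.normal(size=(k, k)) + 1j * rng.normal(size=(k, k))
    return a + a.conj().T

# A generator commuting with Sigma = diag(+1,..,+1,-1,..,-1) must be block diagonal.
G = np.zeros((d, d), dtype=complex)
G[:d1, :d1] = random_hermitian(d1)
G[d1:, d1:] = random_hermitian(d2)
U = expm(1j * G)                     # unitary in U_A

W = random_hermitian(d)              # arbitrary Hermitian observable
Pi = np.diag([1.0] * d1 + [0.0] * d2)   # projector Pi' onto the subspace of interest V_1

lhs = Pi @ U @ W @ U.conj().T @ Pi
rhs = (Pi @ U @ Pi) @ (Pi @ W @ Pi) @ (Pi @ U.conj().T @ Pi)
print(np.allclose(lhs, rhs))         # True, in agreement with Eqn. (11)
```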
D Proof of Proposition 1

Before elaborating on the proof of Proposition 1, we first briefly review the definition of the dynamical Lie algebra (DLA).

Definition 3 (Definition 3, Larocca et al. (2021a)). Given an ansatz design $\mathcal{A}$, the dynamical Lie algebra (DLA) $\mathfrak{g}$ is generated by the repeated nested commutators of the operators in $\mathcal{A}$. That is,

$$\mathfrak{g} = \mathrm{span}\,\langle iG_1, \ldots, iG_K \rangle_{\mathrm{Lie}}, \qquad (22)$$

where $\langle S \rangle_{\mathrm{Lie}}$ denotes the Lie closure, i.e., the set obtained by repeatedly taking commutators of the elements in $S$.

The proof of Proposition 1 employs the following lemma.

Lemma 4. Consider the QNN instance $(|\psi_0\rangle, U(\boldsymbol{\theta}), H)$ whose DLA is $\mathfrak{g}$. If there exists an invariant subspace $V_{\mathfrak{g}}$ under $\mathfrak{g}$ with dimension $d_{\mathfrak{g}}$ that includes the input state $|\psi_0\rangle$ and the ground state $|\psi^*\rangle$, then the effective dimension of this ansatz design $\mathcal{A}$ yields $d_{\rm eff} = d_{\mathfrak{g}}$.

Proof of Lemma 4. We first demonstrate the equivalence between the group $\mathcal{U}_{\mathcal{A}} = \bigcup_{L=0}^{\infty}\{U(\boldsymbol{\theta}) : \boldsymbol{\theta} \in \mathbb{R}^{LK}\}$ and the group generated by the elements in $\mathfrak{g}$, i.e.,

$$\mathcal{U}_{\mathcal{A}} = \{ e^{V},\, V \in \mathfrak{g} \}. \qquad (23)$$

$\mathcal{U}_{\mathcal{A}} \subseteq \{e^{V}, V \in \mathfrak{g}\}$. To facilitate understanding, we consider a single-layer unitary $U(\boldsymbol{\theta})$ with $L = 1$ and the ansatz design $\mathcal{A} = \{G_1, G_2\}$. From the Baker-Campbell-Hausdorff formula, we have

$$U(\boldsymbol{\theta}) = e^{i\theta_1 G_1} e^{i\theta_2 G_2} = e^{J_1(\boldsymbol{\theta})}, \qquad (24)$$

$$J_1(\boldsymbol{\theta}) = i\Big( \theta_1 G_1 + \theta_2 G_2 + \frac{i\theta_1\theta_2}{2}\,[G_1, G_2] - \frac{\theta_1^2\theta_2}{12}\,[G_1, [G_1, G_2]] + \cdots \Big). \qquad (25)$$

Eqn. (25) implies that, by merging $e^{i\theta_1 G_1}$ and $e^{i\theta_2 G_2}$ into a single term, the new evolution is generated by an operator $J_1(\boldsymbol{\theta})$ that depends on both $\theta_1$ and $\theta_2$ and contains nested commutators of $G_1$ and $G_2$. Therefore, we have $J_1(\boldsymbol{\theta}) \in \mathfrak{g}$ and $U(\boldsymbol{\theta}) \in \{e^{V}, V \in \mathfrak{g}\}$. For the case of multiple layers, i.e., $U(\boldsymbol{\theta}) = \prod_{\ell=1}^{L} e^{i\theta_{\ell 1} G_1} e^{i\theta_{\ell 2} G_2}$, we have $U(\boldsymbol{\theta}) = e^{J_L(\boldsymbol{\theta})} \in \{e^{V}, V \in \mathfrak{g}\}$ by recursively applying the Baker-Campbell-Hausdorff formula to rewrite $U(\boldsymbol{\theta})$ in terms of $J_L(\boldsymbol{\theta}) \in \mathfrak{g}$.

$\mathcal{U}_{\mathcal{A}} \supseteq \{e^{V}, V \in \mathfrak{g}\}$. Since each element in $\mathfrak{g}$ is a linear combination of the nested commutators in Eqn. (25), for any $V \in \mathfrak{g}$ there always exists $\boldsymbol{\theta} \in \mathbb{R}^{2L}$ such that $J_L(\boldsymbol{\theta}) = V$, and thus $e^{J_L(\boldsymbol{\theta})} = U(\boldsymbol{\theta}) \in \bigcup_{L=0}^{\infty}\{U(\boldsymbol{\theta}) : \boldsymbol{\theta} \in \mathbb{R}^{LK}\}$.

Taken together, we obtain Eqn. (23) in the case of $K = 2$. The results for an ansatz design $\mathcal{A}$ with more than two elements can be derived in the same manner; more details can be found in Section IV of Larocca et al. (2021b). The equivalence of $\mathcal{U}_{\mathcal{A}}$ and $\{e^{G} : G \in \mathfrak{g}\}$ indicates that for any $G \in \mathfrak{g}$ and $U \in \mathcal{U}_{\mathcal{A}}$, $G$ and $U$ commute with the same Hermitian matrix $\Sigma$, since $U$ can be expressed as $e^{G}$ and hence shares the same eigenspaces with $G$. This implies that the invariant subspace induced by $\mathcal{U}_{\mathcal{A}}$ is the same as the one induced by $\mathfrak{g}$, and thus $d_{\rm eff} = d_{\mathfrak{g}}$.

Proof of Proposition 1. Following Lemma 4, the DLA-based EQNTK is the same as the EQNTK discussed in Theorem 2, because the corresponding $U(\boldsymbol{\theta})$ induces the same invariant subspace. Hence, the results achieved in Theorem 2 can be applied to the DLA-based EQNTK by replacing the effective dimension $d_{\rm eff}$ with $d_{\mathfrak{g}}$.

E Implementation details of the symmetric pruning algorithm

In this section, we elucidate Steps 2-1, 2-2, and 2-3 of the proposed SP in Alg. 1. Recall that the considered problem Hamiltonian is expressed as $\tilde{H} = H \otimes I^{\otimes m}$ with $H = \sum_{j=1}^{q} \alpha_j H_j$, where $\alpha_j$ is a real coefficient and $H_j$ is a tensor product of Pauli matrices acting on $n$ qubits. A symmetry $S$ of a Hamiltonian $\tilde{H}$ is a unitary operator leaving $\tilde{H}$ invariant, i.e.,

$$S \tilde{H} S^{\dagger} = \tilde{H}. \qquad (26)$$

All of these symmetries form a symmetry group $\mathcal{S}$: for any two symmetries $S_1, S_2 \in \mathcal{S}$, their compositions $S_1 \circ S_2$ and $S_2 \circ S_1$ and their inverses $S_1^{-1}$ and $S_2^{-1}$ are also symmetries in $\mathcal{S}$. In SP, these symmetries are classified into three categories, namely, the system symmetry (Step 2-1), the structure symmetry (Step 2-2), and the spatial symmetry (Step 2-3).
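As a concrete illustration of Eqn. (26), the sketch below assembles a small toy problem Hamiltonian $\tilde{H} = H \otimes I^{\otimes m}$ from Pauli strings and tests whether a candidate unitary is a symmetry. The Hamiltonian, the qubit counts, and the candidate unitaries are arbitrary choices made only for this example and are not the instances used in our simulations.

```python
import numpy as np
from functools import reduce

I = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])

def kron(ops):
    return reduce(np.kron, ops)

# Toy effective Hamiltonian on n = 3 qubits, padded with m = 1 idle qubit.
n, m = 3, 1
H = (kron([Z, Z, I]) + kron([I, Z, Z])                          # ZZ couplings
     + kron([X, I, I]) + kron([I, X, I]) + kron([I, I, X]))     # transverse fields
H_tilde = np.kron(H, kron([I] * m))

def is_symmetry(S, observable, atol=1e-10):
    """Check the condition S @ observable @ S^dag == observable of Eqn. (26)."""
    return np.allclose(S @ observable @ S.conj().T, observable, atol=atol)

# Any unitary acting only on the idle wires is a system symmetry: I^{(x)n} (x) U.
phi = 0.7
U_idle = np.array([[np.cos(phi), -np.sin(phi)], [np.sin(phi), np.cos(phi)]])
print(is_symmetry(np.kron(kron([I] * n), U_idle), H_tilde))   # True

# A non-symmetry for comparison: the same rotation applied to the first system qubit.
print(is_symmetry(np.kron(U_idle, kron([I] * (n + m - 1))), H_tilde))   # False
```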
Suppose that the initialized asymmetric ansatz is $U(\boldsymbol{\theta})$. SP adopts the following methods to tailor this ansatz so that it obeys the above symmetries.

System symmetry. System symmetry considers the symmetry on the qubit wires. Specifically, since the problem Hamiltonian $\tilde{H}$ can be decomposed into a tensor product of Pauli terms, the symmetry condition in Eqn. (26) holds for any unitary of the form $S_{\rm sys} = I^{\otimes n} \otimes U$, where $U$ is an arbitrary unitary in $SU(2^m)$. All such unitaries are called system symmetries and form a subgroup of the symmetry group $\mathcal{S}$, i.e., $\mathcal{S}_{\rm sys} = \{S_{\rm sys} = I^{\otimes n} \otimes U : U \in SU(2^m)\}$. The system symmetry of a unitary $V$ can be recognized by checking whether $S_{\rm sys} V S_{\rm sys}^{\dagger} = V$. In this regard, SP assigns the system symmetry to $U(\boldsymbol{\theta})$ by removing the redundant parameterized gates and the two-qubit gates associated with the last $m$ qubit wires. In doing so, the pruned ansatz has the form $U_{\rm Pr}(\boldsymbol{\theta}) = U_1(\boldsymbol{\theta}) \otimes I^{\otimes m}$, which yields $S_{\rm sys}\big(U_1(\boldsymbol{\theta}) \otimes I^{\otimes m}\big)S_{\rm sys}^{\dagger} = U_1(\boldsymbol{\theta}) \otimes I^{\otimes m}$, where $U_1(\boldsymbol{\theta})$ is the unitary extracted from $U(\boldsymbol{\theta})$ (the gates applied on the first $n$ qubit wires).

Structure symmetry. The structure symmetry $S_{\rm str}$ refers to the symmetry of the effective Hamiltonian $H$, which satisfies $S_{\rm str} H S_{\rm str}^{\dagger} = H$. Moreover, an ansatz $V(\boldsymbol{\theta})$ is said to be structure symmetric with respect to the problem Hamiltonian $H$ if there exists a non-trivial symmetry $S_{\rm str}$ (i.e., not the identity operation) and $\boldsymbol{\theta} \in \Theta\backslash\{0\}$ such that $S_{\rm str} V(\boldsymbol{\theta}) S_{\rm str}^{\dagger} = V(\boldsymbol{\theta})$. A feasible way of constructing a structure-symmetric ansatz is to restrict the corresponding ansatz design so that it only contains the Pauli terms of $H$. Given the pruned ansatz $U_{\rm Pr}$ returned by Step 2-1, SP (Step 2-2) assigns the structure symmetry to it by removing specific single-qubit and two-qubit gates so that the pruned ansatz design follows $\mathcal{A} = \{H_1, \ldots, H_q\}$. The tailored ansatz returned by Step 2-2 coincides with the HVA, i.e., $U_1(\boldsymbol{\theta})$ transforms into the new ansatz whose $\ell$-th layer is expressed as $\prod_{j=1}^{q} e^{-i\theta_{\ell j} H_j}$ for $\ell \in [L]$.

Spatial symmetry. Spatial symmetry is a discrete symmetry that captures the permutation invariance of the sites of the problem Hamiltonian, and it tightly relates to the problem of graph automorphism, which has been widely studied in graph theory. For this reason, here we introduce the spatial symmetry from the graphical perspective and elucidate the implementation of Step 2-3 in Alg. 1. The key to this step is leveraging algorithms developed in graph theory to automatically identify the spatial symmetry of problem Hamiltonians. From the graphical view, an $n$-qubit Hamiltonian $H$ corresponds to a graph $G = (\mathcal{V}, \mathcal{E})$ with $n$ vertices, where the $j$-th node $v_j \in \mathcal{V}$ represents the $j$-th site (qubit) of $H$ and the edge $E_{i,j} \in \mathcal{E}$ characterizes the interaction strength between the $i$-th and the $j$-th sites (qubits). This graph can further be described by an adjacency matrix $D$. Recall that a spatial symmetry $\pi$ of a Hamiltonian $H$ is a permutation of the sites leaving $H$ invariant, i.e., $\pi H \pi^{-1} = H$ (or, equivalently, $[\pi, H] = 0$). In other words, the spatial symmetry $\pi$ preserves the topology of $G$ such that for any $(u, v) \in \mathcal{E}$ we have $(\pi(u), \pi(v)) \in \mathcal{E}$, and $\pi D \pi^{-1} = D$. In GSP, the action of $\pi$ on an $n$-qubit state, $|\psi\rangle \mapsto \pi|\psi\rangle$, means permuting the indices of the qubits. For instance, a permutation $\pi$ with $\pi(1) = 3$, $\pi(2) = 1$, $\pi(3) = 2$ acting on the state $|\psi_1\rangle \otimes |\psi_2\rangle \otimes |\psi_3\rangle$ yields $\pi(|\psi_1\rangle \otimes |\psi_2\rangle \otimes |\psi_3\rangle) = |\psi_3\rangle \otimes |\psi_1\rangle \otimes |\psi_2\rangle$. All these permutations form a discrete group of symmetries $S_n$ with cardinality $n!$. In particular, the spatial symmetries of the Hamiltonian form the automorphism group of its corresponding graph, defined as $\mathrm{Aut}(H) = \{\pi_a \in S_n \,|\, \pi_a H \pi_a^{-1} = H\}$ or, equivalently, $\mathrm{Aut}(H) = \{\pi_a \in S_n \,|\, \pi_a D \pi_a^{-1} = D\}$. The qubits (or qubit pairs) in the ansatz corresponding to nodes (or edges) that can be swapped by such an automorphism are called equivalent qubits (or qubit pairs). More precisely, for any node $u \in \mathcal{V}$ (or edge $(u, v) \in \mathcal{E}$), if there exists $\pi \in \mathrm{Aut}(H)$ such that $\pi(u) = x$ (or $\pi(u, v) = (x, y)$), then the qubits (or qubit pairs) corresponding to $u$ and $x$ (or to $(u, v)$ and $(x, y)$) are called equivalent qubits (or qubit pairs). Given the ansatz returned by Step 2-2, Step 2-3 assigns the spatial symmetry to it by correlating the single-qubit parameterized gates on equivalent qubits and the two-qubit parameterized gates on equivalent qubit pairs, as sketched below.
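A minimal sketch of this orbit computation for an open six-site chain is given below. It uses the generic VF2 matcher shipped with networkx purely for illustration; in SP we rely on nauty, which scales to much larger graphs, and the six-node chain here simply mirrors the TFIM example revisited in Appendix G.

```python
import networkx as nx
from networkx.algorithms import isomorphism

# Interaction graph of an open six-site chain.
G = nx.path_graph(6)

# Enumerate the automorphism group Aut(H) by matching the graph against itself.
matcher = isomorphism.GraphMatcher(G, G)
autos = list(matcher.isomorphisms_iter())
# -> the identity and the reflection j -> 5 - j

# Equivalent qubits: orbits of the nodes under Aut(H).
node_orbits = {}
for v in G.nodes:
    orbit = frozenset(a[v] for a in autos)
    node_orbits.setdefault(orbit, set()).add(v)
print(list(node_orbits.values()))     # [{0, 5}, {1, 4}, {2, 3}]

# Equivalent qubit pairs: orbits of the edges under Aut(H).
edge_orbits = {}
for (u, v) in G.edges:
    orbit = frozenset(frozenset((a[u], a[v])) for a in autos)
    edge_orbits.setdefault(orbit, set()).add((u, v))
print(list(edge_orbits.values()))     # [{(0, 1), (4, 5)}, {(1, 2), (3, 4)}, {(2, 3)}]
```

Gates whose wires fall in the same orbit share a single trainable parameter per layer, which is exactly the parameter-tying step that SP3 performs.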
Figure 6: Evolution of the ansatz structure during symmetric pruning. From left to right: the initial hardware-efficient ansatz and the ansatz structures at different stages of symmetric pruning, where SP1, SP2, and SP3 refer to the sub-steps 2-1, 2-2, and 2-3 in Alg. 1, respectively, and $L' < L$. The symbol RX (RY, RZ) refers to the single-qubit rotation around the $x$ ($y$, $z$)-axis and I refers to the identity gate. The rotation gates sharing the same color in the pruned ansatz are correlated by one individual parameter per layer.

The flexibility of SP. The automorphism group of the graph corresponding to a Hamiltonian with a complicated topological structure is hard to compute manually. In this work, we employ nauty (McKay et al., 1981) to automatically recognize the automorphism group of the graph corresponding to the Hamiltonian. Besides nauty, there are many heuristic algorithms for computing the automorphism group, including Traces (McKay & Piperno, 2014), saucy (Darga et al., 2004), Bliss (Junttila & Kaski, 2007), and conauto (López-Presa et al., 2014). All of them can be easily integrated into SP. Moreover, these heuristic algorithms are capable of handling most graphs with up to tens of thousands of nodes in less than a second (McKay & Piperno, 2014).

F The limitations of applying classical pruning methods to QNNs

Although both SP in Alg. 1 and classical pruning techniques distill a smaller network (or ansatz) from an over-parameterized one from the viewpoint of algorithmic implementation, the latter cannot be directly employed to enhance the power of QNNs. Recall that a common feature of classical pruning methods is scoring each parameter or network element and then removing those with low scores. Such scores generally correspond to the magnitude of the parameters (Frankle & Carbin, 2018), the gradient of the parameters (Lee et al., 2018; Wang et al., 2020), or the Hessian matrix (LeCun et al., 1989; Hassibi & Stork, 1992), evaluated either at initialization or at the end of training. Unfortunately, Cerezo & Coles (2021) proved that the gradient information in QNNs with random deep ansätze vanishes exponentially with the number of qubits; in other words, the gradient fails to provide any useful signal to guide pruning. Meanwhile, the output of QNNs can be regarded as a periodic function of the parameters (Schuld et al., 2021), which forbids employing the parameter magnitude as the metric to guide pruning. Therefore, it is inappropriate to straightforwardly apply classical pruning methods to QNNs, since the extracted ansatz may not promise enhanced trainability. In contrast to classical pruning methods, our proposal does not require any gradient information to construct the symmetric ansätze. Instead, it removes redundant gates and shrinks the solution space according to the information of the problem Hamiltonian.
G More numerical simulation details

In this section, we provide the simulation details omitted in the main text.

Hardware-efficient ansatz. As shown in the left panel of Fig. 6, HEA follows the layer-stacking structure of Eqn. (2), where each layer consists of multiple parameterized single-qubit Pauli rotation gates and fixed two-qubit CNOT gates. In our numerical simulations, the $\ell$-th layer of the employed HEA takes the form

$$U_{\ell}(\boldsymbol{\theta}) = \Big(\bigotimes_{j=1}^{n} R_X(\theta^{\ell}_{j,1}) R_Y(\theta^{\ell}_{j,2}) R_Z(\theta^{\ell}_{j,3})\Big)\, U^{(1)}_{\rm ent}\, \Big(\bigotimes_{j=1}^{n} R_X(\theta^{\ell}_{j,4}) R_Y(\theta^{\ell}_{j,5}) R_Z(\theta^{\ell}_{j,6})\Big)\, U^{(2)}_{\rm ent}, \qquad (27)$$

where $R_{\mu}(\theta^{\ell}_{j,k}) = e^{-i\theta^{\ell}_{j,k}\mu}$ with $\mu \in \{X, Y, Z\}$ denotes the parameterized single-qubit rotation gate, and $U^{(1)}_{\rm ent} = \prod_{j=1}^{\lfloor n/2 \rfloor} \mathrm{CNOT}_{2j-1, 2j}$ and $U^{(2)}_{\rm ent} = \prod_{j=1}^{\lfloor (n-1)/2 \rfloor} \mathrm{CNOT}_{2j, 2j+1}$ refer to the entangling layers, with $\lfloor a \rfloor$ being the greatest integer no larger than $a$.
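For completeness, a minimal NumPy construction of one such layer is sketched below. It assumes the ordering of rotation and entangling sub-layers as written in Eqn. (27) and is meant as an illustration rather than the simulator used in our experiments.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)

def rot(pauli, angle):
    # Single-qubit rotation exp(-i * angle * pauli), matching the convention of Eqn. (27).
    return np.cos(angle) * I2 - 1j * np.sin(angle) * pauli

def kron(ops):
    return reduce(np.kron, ops)

def embed_two_qubit(gate, q, n):
    # Place a two-qubit gate on wires (q, q+1) of an n-qubit register (0-indexed).
    return kron([I2] * q + [gate] + [I2] * (n - q - 2))

def hea_layer(theta, n):
    """One HEA layer of Eqn. (27): three rotations per qubit, CNOTs on odd pairs,
    three more rotations per qubit, CNOTs on even pairs.  theta has shape (n, 6)."""
    rot1 = kron([rot(X, theta[j, 0]) @ rot(Y, theta[j, 1]) @ rot(Z, theta[j, 2]) for j in range(n)])
    ent1 = reduce(np.matmul, [embed_two_qubit(CNOT, 2 * j, n) for j in range(n // 2)], np.eye(2 ** n))
    rot2 = kron([rot(X, theta[j, 3]) @ rot(Y, theta[j, 4]) @ rot(Z, theta[j, 5]) for j in range(n)])
    ent2 = reduce(np.matmul, [embed_two_qubit(CNOT, 2 * j + 1, n) for j in range((n - 1) // 2)], np.eye(2 ** n))
    return rot1 @ ent1 @ rot2 @ ent2

U = hea_layer(np.random.default_rng(0).uniform(0, 2 * np.pi, size=(4, 6)), n=4)
print(np.allclose(U @ U.conj().T, np.eye(16)))   # True: the layer is unitary
```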
Figure 7: Graph representations of the problem Hamiltonians. The left panel and the right panel depict the graph representations of the TFIM model and of the Erdős-Rényi graph with $p = 0.6$, respectively.

Transverse-field Ising model. A central problem in quantum many-body physics is predicting the properties of quantum systems from the first principles of quantum mechanics. The transverse-field Ising model (TFIM) has been employed to explore many interesting quantum systems. In our numerical simulation, we employ an $n$-qubit Hamiltonian of the 1D TFIM with an open boundary condition, i.e., $H_{\rm TFIM} = \sum_{j=1}^{n-1} \sigma^z_j \sigma^z_{j+1} + \sum_{j=1}^{n} \sigma^x_j$, where $\sigma^{\mu}_j$ denotes the $\mu$-Pauli matrix (with $\mu = x, z$) acting on the $j$-th qubit. The effective dimension for HVA under this Hamiltonian is given by $d_{\rm eff} = n^2$ (Larocca et al., 2021a). The Hamiltonian is graphically depicted in Fig. 7(a).

Max Cut. Although many important problems in statistical physics and operations research (Wheeler, 2004) can be formulated as Max Cut, finding the optimal solution of Max Cut has been proven to be NP-hard (Karp, 1972), and quantum computers are expected to attain better approximate solutions than classical computers (Farhi et al., 2014; Zhou et al., 2022). In this work, we consider the Max Cut problem on Erdős-Rényi graphs, whose topology is less structured. An Erdős-Rényi (ER) graph on the vertex set $\mathcal{V}$ is a random graph in which each pair of nodes $(u, v)$ is connected independently with probability $p$. Fig. 7(b) shows the instance of the ER graph used in the numerical simulation, generated with $p = 0.6$.

Evolution of ansätze. Here we present the evolution of the ansatz structure for the transverse-field Ising model during symmetric pruning in Fig. 6, which serves as an example for better understanding the learning dynamics of our proposal. Specifically, we adopt the hardware-efficient ansatz as the initial over-parameterized ansatz, as shown on the left side of Fig. 6. The gates on the last two wires, corresponding to $I^{\otimes 2}$, are first removed through Step 2-1 (referred to as SP1) in Alg. 1 to ensure the system symmetry. Subsequently, in Step 2-2, SP employs the symmetry information of the problem Hamiltonian $H_{\rm TFIM}$ to remove parameterized single-qubit and two-qubit gates on the first six wires, such that the pruned ansatz design is $\mathcal{A}_{\rm pr} = \{\sigma^x_1, \ldots, \sigma^x_6, \sigma^z_1\sigma^z_2, \ldots, \sigma^z_5\sigma^z_6\}$. Finally, in Step 2-3, the spatial symmetry pruning correlates the parameterized gates on equivalent qubits and qubit pairs through the automorphism returned by the package nauty. In the case of TFIM, the package nauty returns a non-trivial automorphism $\pi(j) = n + 1 - j$ that permutes the qubits from the two ends of the chain. This operation reduces the number of free parameters from 11 for the ansatz pruned by SP2 to 6 for the ansatz returned by SP3.

Figure 8: The quantum resources required for achieving $\epsilon$-convergence. The left, middle, and right panels depict the number of measurements required to complete one optimization step and the circuit depth required to achieve $\epsilon$-convergence in the tasks of TFIM, Max Cut on the Erdős-Rényi graph, and Max Cut on the 3-regular graph, respectively. In particular, the label $s$ indicates that the total number of shots $M$ is the product of the value displayed in the histogram and the number of shots $s$ used for updating a single parameter. The labels SP0-SP3 refer to the initial ansatz and to the pruned ansätze after system symmetric pruning, structure symmetric pruning, and spatial symmetric pruning, respectively.

Hardware efficiency analysis. We use two standard metrics to quantify the hardware efficiency: (1) the number of measurements (shots) $M$ required to complete one optimization step; and (2) the circuit depth $l(\epsilon)$ required to reach the over-parameterization criterion characterized by $\epsilon$-convergence with $\epsilon = 10^{-5}$. Namely, the first metric concerns the runtime cost of training QNNs, and the second metric evaluates the quantum resources required to construct the quantum circuit. To facilitate the comparison of various ansätze under the first metric, the number of shots for a single parameter update is set to $s$, so that the total number of measurements taken to complete one optimization step scales linearly with the number of parameters, i.e., $M = LKs$. As depicted in Figure 8, in the task of TFIM, compared with the initial over-parameterized asymmetric ansatz (labeled SP0), the required number of measurements $M$ for the pruned ansatz is reduced by a factor of 14, while the required circuit depth is reduced by a factor of 3. In the task of Max Cut, the hardware-efficiency improvement is problem-dependent. In particular, the required circuit depth is reduced by factors of about 1.8 and 2 for the Erdős-Rényi graph and the 3-regular graph, respectively. For the Erdős-Rényi graph, compared to the ansatz returned by SP1, the required circuit depth of the ansatz returned by SP2 slightly increases from 100 to 130. This increase originates from the fact that an Erdős-Rényi graph with a large number of edges requires a relatively deep circuit to construct the symmetric ansatz after SP1. Although the circuit depth is slightly increased, an evident benefit is the dramatic reduction of the number of measurements: compared with the initial ansatz, the required $M$ for the pruned ansatz is reduced by a factor of 10. The achieved numerical results confirm that the symmetric ansatz output by our proposal can effectively improve hardware efficiency.
As such, it can simultaneously reduce the quantum resources required to reach the over-parameterization regime, enable an efficient implementation on NISQ devices, and, more importantly, improve the convergence rate so as to reduce the number of accesses to the quantum devices.

Training dynamics analysis of the symmetric ansatz. Here we numerically demonstrate that the EQNTK captures the training dynamics of QNNs with symmetric ansätze. The hyperparameter settings are as follows. In the task of TFIM, the number of qubits of the Hamiltonian is set to $n = 6$. We employ the QNN with the symmetric ansatz processed by SP, with $L = 80$ layers, to optimize the loss function defined in Eqn. (1). The learning rate $\eta$ and the maximum number of iterations $T$ are set to $10^{-4}$ and $1000$, respectively. Figure 9 plots the theoretically predicted residual training error according to Theorem 2, the practical residual training error $\varepsilon$ for 30 independent random initializations, and their average, versus the number of gradient descent optimization steps. The numerical results show that the residual error $\varepsilon$ decays exponentially, which echoes the training dynamics derived in Theorem 2.

Figure 9: Residual training error $\varepsilon$ versus the gradient descent step $t$. The black dotted curves, the black solid curve, and the red dotted curve correspond to the training dynamics of $\varepsilon(t)$ for 30 different initializations, the theoretical prediction for the average dynamics of $\varepsilon(t)$, and the numerical values of the averaged $\varepsilon(t)$, respectively.