# training_free_guided_flowmatching_with_optimal_control__d3da5135.pdf Published as a conference paper at ICLR 2025 TRAINING FREE GUIDED FLOW MATCHING WITH OPTIMAL CONTROL Luran Wang1, Chaoran Cheng2, Yizhen Liao3, Yanru Qu2, Ge liu2 1University of Cambridge 2University of Illinois Urbana-Champaign 3Tsinghua University lw703@cam.ac.uk, chaoran7@illinois.edu liaoyz0711@gmail.com, yanruqu2@illinois.edu geliu@illinois.edu Controlled generation with pre-trained Diffusion and Flow Matching models has vast applications. One strategy for guiding ODE-based generative models is through optimizing a target reward R(x1) while staying close to the prior distribution. Along this line, some recent work showed the effectiveness of guiding flow model by differentiating through its ODE sampling process. Despite the superior performance, the theoretical understanding of this line of methods is still preliminary, leaving space for algorithm improvement. Moreover, existing methods predominately focus on Euclidean data manifold, and there is a compelling need for guided flow methods on complex geometries such as SO(3), which prevails in highstake scientific applications like protein design. We present OC-Flow, a general and theoretically grounded training-free framework for guided flow matching using optimal control. Building upon advances in optimal control theory, we develop effective and practical algorithms for solving optimal control in guided ODE-based generation and provide a systematic theoretical analysis of the convergence guarantee in both Euclidean and SO(3). We show that existing backprop-through-ODE methods can be interpreted as special cases of Euclidean OC-Flow. OC-Flow achieved superior performance in extensive experiments on text-guided image manipulation, conditional molecule generation, and all-atom peptide design. 1 INTRODUCTION AND RELATED WORK SDE and ODE-based generative models such as diffusion and continuous normalizing flow (CNF) have exhibited excellent performance on various domains such as images (Ho et al., 2020; Esser et al., 2024), audio (Zhang et al., 2023; Défossez et al., 2022), and discrete data (Lou et al., 2023; Cheng et al., 2024). Particularly, the simplicity of Riemannian Flow Matching on SO(3) manifold (Chen & Lipman, 2023) empowers de novo generation of small molecules (Song et al., 2024; Xu et al., 2023) and proteins (Yim et al., 2023; Bose et al., 2023; Li et al., 2024), leading to enormous advancement in biomedicine. Controlled generation from pre-trained diffusion and flow matching priors has gained growing interest in numerous fields, as it encompasses a wide range of practical tasks including constrained generation (Giannone et al., 2023), solving inverse problems (Liu et al., 2023; Ben-Hamu et al., 2024), and instruction alignment (Black et al., 2023; Esser et al., 2024). There are several lines of work for guiding diffusion and flow models. Classifier-free guidance (CFG) (Ho & Salimans, 2022) trains conditional generative models that take conditions as input. Reward fine-tuning approaches update the generative model parameters to align with certain target objective functions (Black et al., 2023). Both methods require specialized training routines, which are costly and not extendable to new tasks. Training-free guidance on diffusion (Kawar et al., 2022; Chung et al., 2024; Song et al., 2023) alters the scores in the SDE generation process with the gradients from the target function to achieve posterior sampling. These guidances often rely on strong assumptions of the denoising process and require estimating target function gradients w.r.t noised samples which are often intractable. Accurate posterior sampling is only guaranteed for a limited family of objective functions such as linear. Therefore, efforts that deploy such guidance to flow models by bridging the ODE path and SDE path share similar constraints (Pokle et al., 2023; Yim et al., 2024). Published as a conference paper at ICLR 2025 Notably, two recent works showed the effectiveness of guiding pre-trained flow models by differentiating through the ODE sampling process, outperforming popular guidance-based approaches. Particularly, Ben-Hamu et al. (2024) differentiates a loss R(x) through the forward-ODE w.r.t the initial noise x0, which induces implicit regularization by projecting the gradient onto the data manifold under Gaussian path assumption. This strong confinement to the prior might hinder optimization when the target reward function diverges from the prior distribution. Liu et al. (2023) formulates an optimal control problem where a control term ut at each timestep is solved to guide the ODE trajectory. However, the gradient decomposition trick used in Liu et al. (2023) ignores the running cost of control terms and thus could lead to suboptimal behavior. Despite the good performance, there is a lack of systematic theoretical analysis on the convergence behavior and explicit regularization of the differentiate-through-ODE approaches to better guide algorithm design in this space. Furthermore, existing works predominantly focus on the Euclidean manifold due to its simplicity, and there is a compelling need for a theoretically grounded guided flow matching framework on more complex geometries such as SO(3) which is heavily used in scientific applications. To fill the gap between the practical applications of guided generation and theoretical grounds, we propose OC-Flow, a general, practical, and theoretically grounded framework for training-free guided flow matching under optimal control formulation. Our key contributions are as follows: 1. We formulate controlled generation with pre-trained ODE-based priors as an optimal control problem with a control term ut and a running cost that regulates the trajectory proximity to the prior model while optimizing for target loss. Building upon advances in optimal control theory, we develop effective optimization algorithms for both Euclidean and SO(3) space through iterative updates of a co-state flow and control term, with theoretical guarantees under continuous-time formulation. 2. In Euclidean space, we show that running cost bounds the KL divergence between prior and OC-Flow-induced joint distribution. We develop a simple algorithm for OC-Flow through iterative gradient update and provide convergence analysis. We further demonstrate that Dflow and Flow-grad can be interpreted as special cases of Euclidean OC-Flow, providing a unified view of the problem. 3. We present one of the first guided flow-matching algorithm on the SO(3) manifold with theoretical grounds. Our approach extends the Extended Method of Successive Approximations (E-MSA) to SO(3) with a rigorous convergence analysis. Additionally, we propose approximation techniques to enable computationally efficient OC-Flow on SO(3). 4. We provide an efficient and practical implementation of OC-Flow, by introducing the vector Jacobian product and asynchronous update scheme. We show the effectiveness of our method with extensive empirical experiments, including text-guided image manipulation, controllable generation of small molecules on QM9, and energy optimization of flow-based all-atom peptide design. 2 PRELIMINARIES AND PERSPECTIVES ABOUT FLOW MATCHING Euclidean Flow Matching. Flow matching (Lipman et al., 2022; Liu et al., 2022) provides an efficient framework for training a generative model by approximating the time-dependent vector field associated with the flow represented as ψt : [0, 1] Rd Rd. This vector field ut : [0, 1] Rd Rd defines a probability path of the evolution of an initial noise distribution, denoted by p0, towards a target distribution, p1, with the pushforward probability as pt := (ψt) p0. The dynamics of the vector field that governs this flow can be described by the flow ordinary differential equation (ODE) of the form xt = ut(xt), where we follow the convention to use Newton s notation with respect to time t and the state at time t is given by xt := ψt(x0). Lipman et al. (2022) demonstrates that a tractable flow matching objective can be obtained by conditioning on the target data x1. The primary goal of conditional flow matching is to train a model, f p t : [0, 1] Rd Rd, such that it minimizes the difference between its output and the ground truth conditional vector field as LCFM = Et,p(x0,x1) f p t (xt) ut(xt | x1, x0) 2. The trained model f p can be employed as the marginal vector field during the inference phase. In this context, once a noise sample x0 is drawn, the system s evolution can be described by the following differential equation: xt = f p t (xt), x0 p0(x). (1) Rotation Group SO(3). The formulation of flow matching can naturally extend to Riemannian manifolds (Chen & Lipman, 2023). On the Riemannian manifold, a flow is defined as a timedependent diffeomorphisms φt : G G, which describe the continuous evolution of points on the Published as a conference paper at ICLR 2025 Figure 1: Comparison of backpropagation-through-ODE algorithms. For D-Flow (left) and Flow Grad (middle), the black curves represent the state trajectory at the k-th iteration, while the red curves show the updated trajectories at the (k + 1)-th iteration, using gradient updates (blue dashed arrows). DFlow updates the state at t0 only, while Flow Grad propagates updates across all time steps. OC-Flow (right) incorporates the running cost and reward-weighting factor α into the terminal reward. The co-state flow µt (green curves) combines gradient information and system dynamics to iteratively update the control terms {θt}, which in turn updates the states. manifold over time, generated by a vector field V , with V (p) Tp G for each p G where Tp is the tangent space at point p G. The flow evolves according to: d dtxt = V (xt) = Lxt V (e). The CFM objective in Equation 1 can also be adapted for Riemannian flow matching, in which the ground truth vector field can be calculated as the time derivative using the exponential and logarithm maps. Details about rotation group SO(3) can be found in Appendix A. In this work, we focus specifically on the SO(3), the Riemannian manifold of all 3D rotations equipped with the canonical Frobenius inner product. In previous flow-based protein design models, each amino acid is associated with a rotation that defines its orientation (Yim et al., 2023; Bose et al., 2023; Li et al., 2024). Guiding such pre-trained generative models toward the desirable protein properties can potentially have a profound impact on the pharmaceutical industry. 3 OPTIMAL CONTROL FRAMEWORK FOR GUIDED FLOW MATCHING The key challenge in guided generation is balancing optimization and faithfulness to the prior distribution. To address such a need, we propose the following framework. Given a pre-trained flow model, f p t (x), parameterized by a neural network, our goal is to determine the optimal control terms θt that maximize the reward R(x) while maintaining the proximity of the resulting vector field to the original vector field induced by f p t (x). The reward can be customized for diverse tasks such as inverse problem R = H(x) y 2, conditional generation R = (f(x) c)2, and constrained generation R = x y 2. To ensure proximity, we incorporate a penalty on the state trajectory or control terms R T 0 L(θt), also known as the running cost. Optionally, one may also introduce a metric d( , ) to penalize the deviation between the new terminal state xθ 1 and the prior terminal state xp 1. The modified terminal reward function is then defined as: Φ(xθ 1) = R(xθ 1) d(xθ 1, xp 1). and scaling the terminal reward by a constant α, we can formulate the problem as a standard optimal control task: J(θ) := αΦ(xθ T ) + Z T 0 L(θt) dt subject to xθ t = ht(xθ t, θt). (2) A fundamental result in optimal control theory is Pontryagin s Maximum Principle (PMP) (Pontryagin (2018)), which provides the necessary conditions for optimal solutions in control problems. Specifically, at the core of PMP is the introduction of the Hamiltonian function, H. This Hamiltonian is defined in terms of the state of the system, the control, and a new entity called the co-state µ, which resides in the cotangent space of the state manifold: H(t, x, µ, θ) = µ, ht(xθ t , θt) L(θ). (3) The co-state µt, also known as the adjoint variable, encodes the influence of the terminal cost function. Their evolution captures how the sensitivity of the system impacts the cost function, ensuring that the state variables evolve in accordance with the system dynamics. Consequently, in optimal control, the Hamiltonian must be maximized by jointly evolving the states and costates according to a system of coupled differential equations. The details of PMP conditions can be found in Appendix B.1. Published as a conference paper at ICLR 2025 3.1 OC-FLOW ON EUCLIDEAN MANIFOLD We first develop the algorithm and theoretical analysis for OC-flow in Euclidean space. One of the simplest choices for the control term is an additive control (Kobilarov & Marsden (2011)), which directly perturbs the prior trajectory. In fact, with the linear expansion, the additive control could be seen as a general case and is widely used in optimal control. Specifically, with θt representing the control input, the new state dynamics and the corresponding running cost can be defined as: xt = ht(xθ t, θt) = f p t (xt) + θt L(θt) = 1 2 θt 2. (4) The running cost effectively acts as a constraint on the trajectory to encourage proximity to the original prior distribution. To better understand the effect of running costs on the guided distribution, we provide the following proposition to formally prove that it can control deviation from the prior distribution measured by KL-divergence. Proposition 1. For Affine Gaussian Probability Path, the expectation of the running cost upper bounds the KL divergence between the prior joint distribution p1(xp, x1) = p1(xp|x1)pdata(x1) and the joint distribution after guidance p1(xθ, x1) = p1(xθ|x1)pdata(x1), with x1 pdata, xp induced by prior conditional vector field up t (x|x1) and xθ sampled by applying control θt(x1) on up t : Ex1 pdata(x1) 0 θt(x1) 2 dt C KL p1(xθ, x1) p1(xp, x1) . (5) Furthermore, for square-shaped data x with non-zero probability path, the expectation of the running cost, combined with the L1-distance between the prior sample xp 1 and the corresponding guided sample xθ 1, upper bounds the KL divergence between the marginal distributions of the prior model pp 1 and the guided model pθ 1: Exp 1 pp 1(x) A xp 1 xθ 1 + B Z 1 0 θt(xp 1) 2 dt KL(pp 1 pθ 1). (6) Algorithm 1 OC-Flow on Euclidean Space 1: Given: Pre-trained model: f p, initial state: x0 2: Initialize: Control terms θ0, learning rate η, weight decay β 3: for k = 0 to Max Iterations do 4: Solve for the state trajectory: Xθk t+ t = Xθk t + f p(t, Xθk t ) + θk t 5: Update control: θk+1 t = βθk t + η xtΦ(Xk 1 ) 6: end for One effective approach for directly applying PMP to optimal control tasks is the Extended Method of Successive Approximations (E-MSA) (Li et al., 2018). E-MSA builds upon the basic MSA algorithm (Chernousko & Lyubushin (1982)), which iteratively updates the terms in the PMP conditions (Appendix B.2). The primary enhancement of E-MSA over the basic MSA is its ability to extend convergence guarantees beyond a limited class of linear quadratic regulators (Aleksandrov (1968)). A key assumption is the global Lipschitz condition for the functions involved. However, note that this assumption can be relaxed to a local Lipschitz condition if we can demonstrate that xθ t is bounded, which can be safely assumed provided that appropriate regularization techniques are applied. Furthermore several prior work has shown the Lipschitz continuity for the deep learning models. (Gouk et al. (2021),Khromov & Singh (2024)) When the E-MSA algorithm is applied to the guided controlled generation task on Euclidean space, the trajectory of the co-states µt can be calculated in closed form. Specifically, they can be expressed as xtΦ(Xk 1 ) which enables us to derive the following update rule with convergence guarantees: Theorem 2: Assume that the reward function, the prior model, and their derivatives satisfy Lipschitz continuity, bounded by a Lipschitz constant L. Utilizing the E-MSA, for each iteration k, for each constant γ > 2C with C is a function of L, such that under the addictive control and the running cost defined in Equation 4, the optimal update is following: θk+1 t = γ 1 + γ θk t + α 1 + γ xtΦ(Xk 1 ). (7) This update rule for the control term θt guarantees an increase in the objective function defined in Equation 2: J(θk+1) J(θk) 1 2C ϵk γ, ϵk γ 0. (8) Published as a conference paper at ICLR 2025 In practice, solving continuous ODEs requires discretization. The discretized version of the proposed algorithm is outlined in Algorithm 1. In this formulation, the weight decay term is parameterized as β = γ 1+γ , and the learning rate is defined as η = α 1+γ . As demonstrated in Appendix C.3, the discretization error introduced by the Euler method is of the order O( t). This error diminishes as the number of ODE steps increases, ensuring the algorithm converges to the global optimum. 3.2 PRACTICAL IMPLEMENTATION AND ACCELERATION 3.2.1 ADJOINT METHOD AND VECTOR-JACOBIAN PRODUCT A significant portion of the computational time in the OC-Flow algorithm is spent on evaluating the gradient xtΦ(x1). The computational cost of directly back-propagating through ODE with vanilla Autograd requires saving all intermediate computation values, which results in a demanding memory complexity of O(ND2) (Pan et al., 2023; Chen et al., 2018) where N is the ODE steps. We instead employ the adjoint method where the gradient xtΦ(x1) is computed using a vector-Jacobian product by the double-backwards trick, which reduces the memory cost to O(D2). xk/N Φ = x(k+1)/N Φ Φx(k+1)/N xk/N . (9) 3.2.2 ASYNCHRONOUS SETTING FOR FLEXIBLE UPDATE SCHEDULING In practice, discretization techniques are employed to simulate the ODEs governing both the state trajectory xt and the co-states µt and operate in a synchronous setting, where the number of time steps for the state trajectory xt coincides with the number of control terms θt. Here we show that OC-Flow can be extended to an asynchronous framework, providing greater flexibility in scheduling. We subdivide the time interval t into N equally spaced subintervals. Let {xt} denote the state trajectory over the time interval [t, t + t], and {xθ t} represent the trajectory when the control term θt is applied in the first subinterval. The trajectory can be approximated as: xt+ t = xt + t l=1 f p xθ t+ l t N θt xt + t l=1 f p xt+ l t Consequently, the asynchronous setting allows the algorithm to be applied without modification while enabling finer updates to both the control terms and state trajectories by adjusting the frequency N of control term updates relative to the state trajectory simulation (see Appendix C.4 for the proof and justification of the approximations in Equation 10). In our peptide design experiment, the asynchronous setting is applied for efficient computing. Table 1: Comparison of runtime and memory complexity of different methods used in backpropthrough guided-ODE in Euclidean and SO(3) manifold. For complexity, N is the number of ODE steps, n is the number of effective control terms with synchronized and in the range [1, N] and D2 is the complexity of computing 1-step gradient (VJP or Autograd), D depends on data and model size. Number of Running Memory Runtime Convergence Generalization Control Terms Cost Complexity Complexity to Optimal to SO(3) OC-Flow n θ2 t O(D2) O(n D2) Flow Grad n 0 O(D2) O(n D2) D-Flow 1 Implicit O(ND2) O(ND2) 3.3 CONNECTION TO OTHER BACKPROP-THROUGH GUIDED-ODE APPROACHES Several previous works discussed backprop-through-ODE guidance in flow-matching models. Notable examples include D-Flow (Ben-Hamu et al., 2024) and Flow Grad (Liu et al., 2023). An illustration of their algorithms and ours is shown in Figure 1. In this section, we demonstrate that our framework is more general, and both of these methods can be viewed as special cases of our algorithm. Flow Grad formulates the optimization task in a manner similar to our optimal control target in Equation 2. Specifically, it directly applies gradient descent to the control variables: θk+1 t = θk t + α θtΦ(Xk 1 ) = θk t + α N ( xt Xt+ t) 1 xtΦ(Xk 1 ), (11) which can be interpreted as a limiting case of our algorithm in Equation 7, where γ and given with dt 0, xt Xt+ t I. However, as shown in Equation 8, the convergence rate is a complex Published as a conference paper at ICLR 2025 function of γ, so in practice, γ is treated as a tunable parameter. Flow Grad s setting γ may undermine the convergence speed. D-Flow optimizes the reward by applying gradient descent to the initial noise x0: Xk+1 0 = Xk 0 + LBFGS( x0Φ(Xk 1 )). (12) In fact, with the update rule xt+dt = x0 + f(x0) dt + θt dt, the update of θ0 can be seen as an increment to x0. The LBFGS algorithm provides a dynamic learning rate, aligning with our framework, where γ is allowed to vary across iterations. Hence, D-Flow can be viewed as a special case of our asynchronous algorithm when the number of control terms is 1. A more detailed comparison of the algorithms can be found in Table 6 and additional discussion on computation efficiency is in Appendix D. 4 OPTIMAL CONTROL FRAMEWORK FOR GUIDED FLOW MATCHING ON SO(3) Most optimization algorithms are primarily designed for Euclidean spaces and face significant challenges when applied to non-Euclidean settings, such as the SO(3) manifold, which plays a crucial role in drug discovery and peptide design (Huguet et al. (2024)). This section extends the E-MSA algorithm to the SO(3) manifold and presents a rigorous proof of its convergence. 4.1 OC-FLOW FOR SO3 To begin, we define the vector field governing the system dynamics. The state trajectory, influenced by control terms θt so(3), evolves according to the following differential equation: xθ t = xθ t ft(xθ t) + θt . (13) In this work, the left-invariant vector field is utilized, under which the Hamiltonian can be shown to reduce to a linear functional in so(3) (Jurdjevic (1996), Colombo & Dimarogonas (2020)). Given the co-state µt so(3) , the Hamiltonian function is redefined as: H : [0, T] SO(3) so(3) so(3) R, (t, x, µ, θ) 7 µt ft(xθ t) + θt L(θ). (14) A direct approach to apply PMP conditions on the SO(3) manifold involves iteratively updating the cotangent vector µt and the state trajectory xt as shown in Step 4 and Step 5 in Algorithm 2 and then apply an update rule to determine the control term θt for the subsequent iteration with weight decay β and learning rate η the update for θt can be written as: θk+1 t = βθk t + η µθk t , (15) where µθk t is defined by µθk t , v = µθk t (v) with µt so(3) for all v so(3). The existence of µθk t can be derived from the Riesz Representation Theorem (Goodrich (1970)). This formulation leads to the introduction of the OC-FLow algorithm on SO(3),as detailed in Algorithm 2. Algorithm 2 OC-Flow on SO(3) 1: Given: Pre-trained model: f p e , initial state: x0 2: Initialize: Control term θ0 so(3), learning rate η, weight decay β 3: for k = 0 to Max Iterations do 4: Solve for the state trajectory: Xθk t = Xθk t f p(t, Xθk t ) + θk , Xθk 0 = x0 5: Solve for the adjoint variables: µθk t = ad H µ µθk (d Lxθ T ) H x , µθk T = (d Lxθ T ) Φ(xθ T ) 6: Update control: θk+1 t = βθk t + η µθk t 7: end for 4.2 CONVERGENCE OF OC-FLOW ON SO3 To derive the proof of the convergence of our Algorithm 2, we first establish that under the PMP conditions on SO(3), the objective function J(θ), as defined in Equation 2, can be bounded. This is formalized in the following proposition: Published as a conference paper at ICLR 2025 Proposition 3: Assume that the reward function, the prior model, and their derivatives satisfy Lipschitz continuity, bounded by a Lipschitz constant L. Then, there exists a constant C > 0 such that for any θ, ϕ so(3), the following inequality holds: 0 ϕ,θH(t) dt C ϕt θt 2 dt J(ϕ), (16) where Xθ and P θ satisfy the PMP conditions on SO(3) manifold, and Hϕ,θ denotes the change in the Hamiltonian, defined as: Hϕ,θ(t) := H(t, xθ t, µθ t, ϕt) H(t, xθ t, µθ t , θt). (17) Proposition 2 provides a lower bound for the difference in the objective function values under two distinct control terms that satisfy the PMP conditions described by the Hamiltonian equations in Appendix B.1. However, applying this result directly as an optimization algorithm presents several challenges. First, the difference in the Hamiltonian Hϕ,θ(t) is not inherently bounded. Second, the term ||ϕ θ||2 is non-negative, which complicates the minimization process. To address these issues, inspired by the method of E-MSA, we introduce a positive constant γ and define an Extended Hamiltonian: H(t, x, µ, θ, ϕ) := H(t, x, µ, θ) γ 2 θ ϕ 2 = µ, ft(x) + θ 1 2 θ ϕ 2. (18) The introduction of the extended Hamiltonian enables the combination of the original Hamiltonian with the penalty term ϕ θ 2 into a unified expression that can be optimized jointly. A natural approach to achieve this is by updating θ to maximize the Extended-Hamiltonian. The resulting update rule is given by: θk+1 t = arg max θ H(t, xθk, µθk, θ, θk) = γ 1 + γ θk + 1 1 + γ µθ t = βθk + ηµθk t . (19) By performing this maximization step at each iteration, we ensure that the change in the Extended Hamiltonian, H, is non-negative, indicating that the algorithm progresses towards an optimal solution. Furthermore, we can show that when the update process converges, i.e., when H = 0 or equivalently H = 0, the algorithm has reached the optimal control point. These insights can be formalized in the following proposition: Proposition 4: Let Xθ and P θ satisfy the PMP conditions . If the update rule follows Algorithm 2, we define ϵk := R 1 0 θk+1,θk H(t) dt, and ϵk is bounded as: 0 θk+1,θk H(t) dt, lim k ϵk = 0. (20) Furthermore, when ϵk = 0, we have θ = θ := arg maxθ J(θ) With these results, we can now extend the result in E-MSA to the SO(3) manifold and establish a bound for the optimization algorithm based on the derived theoretical properties: Theorem 5: Assume that the reward function, the prior model, and their derivatives satisfy Lipschitz continuity, bounded by a Lipschitz constant L. Let θ0 so(3) be any initial measurable control with J(θ0) < + . Suppose also that infθ so(3) J(θ) > . If the update of θ satisfies equation 19, for sufficiently large γ, the following inequality holds for some constant D > 0: Dϵk J(θk+1) J(θk). (21) Therefore, by invoking Proposition 3, we can conclude that after each update, the target function is non-decreasing and when the update process terminates, the optimal solution has been attained. This establishes the convergence of the OC-Flow algorithm on the SO(3) manifold. 4.3 PRACTICAL IMPLEMENTATION In practice, directly optimizing Algorithm 2 using existing ODE methods is challenging due to the nature of the adjoint variable µt, which is a linear functional in the dual space so(3) . Instead, we can optimize µt as defined in Section 4.1. We can decompose µt into its projections onto a set of orthogonal bases within the so(3) group. Published as a conference paper at ICLR 2025 Figure 2: Visualization of text-guided generated faces with different expressions. A frequently used choice for the basis in so(3) is the canonical basis {E1, E2, E3} (Mc Cann et al. (2023)) satisfying the condition v, Ei = 2 for all v so(3). Thus we can decompose the time derivative of the adjoint variable µt as: µi = µ, Ei and µ = 1 2 P3 i=1 µi Ei. Thus, with the closed forms for the partial derivatives related to Hamiltonian, the practical update for µt in Algorithm 2 can be written as follows. The vector-Jacobian method can be applied to compute the term f p t x (xk t Ej) in Algorithm 2, which significantly reduces complexity from O(D4) to O(D2). Meanwhile, the method of asynchronous can also be applied. See Appendix C.4. µk t,i = µk t , [f p t + θt, Ei] µk t , f p t x xk t Ei , µt, f p t x xk t Ej = Tr µT t f p t x (xk t Ej) µk t t = µk t t i=1 µk t,i Ei, µk T,i = Φ(xk T ), xk T Ei . (22) 5 EXPERIMENTS 5.1 TEXT-GUIDED IMAGE MANIPULATION We first validate our OC-Flow on the traditional text-to-image generation task. Previous work has demonstrated the importance of alignment with the given text prompt using either automatic metrics or human preference as the reward (Black et al., 2023; Esser et al., 2024). In our text-guided image manipulation task, we want to guide the generative model pre-trained on the celebrity face dataset Celeb A-HQ (Karras, 2017) to text guidance {sad, angry, curly hair} showing different facial expressions or traits. Following the same setup in Liu et al. (2023), given an input image xg, the reward for alignment with the text prompt can be effectively evaluated by the CLIP model (Radford et al., 2021) pre-trained in a contrastive way to score the similarity between arbitrary image-text pairs. Following (Liu et al., 2023), we adopt the pre-trained Rectified Flow (RF) (Liu et al., 2022) as the generative prior. Inspired by Proposition 1, for this image task where the square-like assumption is satisfied, an extra terminal constraint xp 1 xθ 1 is added as part of the terminal reward function. Table 2: Comparison of methods on LPIPS, ID, and CLIP metrics. Lower LPIPS and ID indicate better performance, while higher ID and CLIP values are preferred. Method LPIPS ID CLIP CG + RF 0.346 0.643 0.292 CG + LDM 0.383 0.513 0.298 Diffusion CLIP 0.398 0.659 0.285 Style CLIP + e4e 0.359 0.704 0.267 Flow Grad + RF 0.302 0.737 0.299 OC-Flow (Ours) 0.207 0.732 0.302 We choose two state-of-the-art text-guided image manipulation baselines, Style CLIP (Patashnik et al., 2021) and Flow Grad (Liu et al., 2023). We run Style CLIP and Flow Grad with their official implementation and default parameter configurations. For ours, we set time step of 100, step size η = 2.5, weight decay of 0.995, the weight of the extra constraint of 0.4, and the number of optimization steps of 15. For qualitative comparison, we show generated examples of different text-guided expressions in Figure 2. Due to the large gap between reference and guided distributions, Style CLIP fails to manipulate with sad. Lacking in regularization, Flow Grad may change the content too much with curly hair. Our OC-Flow generally produces the best results with a good alignment with the text prompt while preserving the generative prior so as to produce reasonable faces that are not distorted much. 5.2 MOLECULE GENERATION FOR QM9 We further instantiate our OC-Flow for controllable molecule generation on the QM9 dataset (Ruddigkeit et al., 2012; Ramakrishnan et al., 2014), a commonly used molecular dataset containing small molecules with up to 9 heavy atoms from C, O, N, F. Following Hoogeboom et al. (2022); Ben-Hamu Published as a conference paper at ICLR 2025 Figure 3: Visualization of OC-Flow generated molecules with various dipole moments condition. et al. (2024), we target for conditional generation of molecules with specified quantum chemical property values including polarizability α, orbital energies εHOMO, εLUMO and their gap ε, dipole moment µ, and heat capacity cv. Such a conditional generation setting of molecules with desired properties has profound practical applications in drug design and virtual screening. To define the loss function, separate classifiers for each property are first trained to predict the property value for the generated molecule (Hoogeboom et al., 2022), and the loss can be then calculated as the mean absolute error (MAE) between the predicted and the reference property values. The pre-trained unconstrained generative model is taken from Song et al. (2024) (Equi FM), a flow-based generative model that uses an equivariant vector field parameterization for generating the atom coordinates and types via the learned flow dynamics. To demonstrate the zero-shot guidance performance on such a conditional generation task, we compare our approach with other gradient-based methods of D-Flow (Ben-Hamu et al., 2024) and Flow Grad (Liu et al., 2023) on the same pre-trained Equi FM. To be comparable to D-Flow, we follow its setting to use the L-BFGS optimizer with 5 optimization steps with linear search. We generate 1000 molecules for each property and report the MAE in Table 3. The unconditional Equi FM is also included as an upper bound for the guided models. It can be seen that our approach consistently outperforms both of them with lower MAEs, which better balances the reward optimization and the faithfulness to the prior. We provide guided generation samples in Figure 3 with respect to different target dipole moments. A clear trend from hydrocarbons with more symmetric structures to molecules with more high-electronegativity atoms of oxygen and nitrogen can be observed, indicating an increase in the dipole moment. Table 3: MAE for guided generations on QM9 (lower is better). Property α ε εHOMO εLUMO µ cv Unit Bohr³ me V me V me V D cal K mol OC-Flow(Ours) 1.383 367 183 342 0.314 0.819 D-Flow 1.566 355 205 346 0.330 0.893 Flow Grad 2.484 517 273 429 0.542 1.270 Equi FM 8.969 1439 622 1438 1.593 6.873 Classifier 0.095 64 40 35 0.046 0.041 As we have theoretically demonstrated the impact of the regularization strength from the optimal control perspective, we further experiment with a different γ and examine the quality of the conditionally generated molecules by evaluating additional metrics following Song et al. (2024). Specifically, we calculate the atom stability percentage (ASP), molecule stability percentage (MSP), and valid & unique percentage (VUP). Ideally, these metrics should not be greatly lower than the pre-trained model, and a higher strength of regularization should lead to higher scores. Indeed, as demonstrated in Table 4, in which we provide these scores for two different settings of γ = 0.01 and 10, all scores are higher with a higher strength of regularization at the cost of also a higher MAE. In this way, our OC-Flow effectively prevents exploitation from direct gradient descent that may hack the loss function and provides more flexible and fine-grained control over the guided generation. Table 4: MAE and other evaluation metrics for our approach with γ = 0.01 / γ = 10. Property α ε εHOMO εLUMO µ cv MAE 1.383 / 1.557 367 / 365 183 / 188 342 / 339 0.314 / 0.320 0.819 / 0.852 ASP 94.8 / 96.0 95.2 / 96.1 95.2 / 96.1 95.3 / 96.1 95.8 / 96.1 95.2 / 96.1 MSP 64.4 / 69.9 67.9 / 70.5 65.8 / 69.8 68.5 / 70.8 68.0 / 70.1 67.1 / 68.9 VUP 86.2 / 88.6 88.2 / 89.8 86.2 / 87.7 87.6 / 88.7 88.2 / 89.0 89.4 / 88.5 5.3 PEPTIDE DESIGN We evaluate our OC-Flow approach for peptide backbone design using a test set derived from (Li et al., 2024), which includes 162 complexes clustered based on 40% sequence identity using mmseqs2 (Steinegger & Söding, 2017). Our experiments focus on Pep Flow w/Bb, a model designed to exclusively sample peptide backbones while optimizing translations in Euclidean space and Published as a conference paper at ICLR 2025 Figure 4: Visualization of OC-Flow generated peptide and unconditional generated peptide (5djd_C). rotations in SO(3) space. The model employs the Madra X force field (Orlando et al., 2024) for energy optimization, and performance is evaluated using several key metrics. These metrics include Madra X energy, which assesses the total energy of the generated peptide structures, along with Rosetta-based measures of stability and affinity. Currently, our stability and affinity metrics are represented by their respective means: stability quantifies the energy states of peptide-protein complexes, while affinity measures the binding energies. In addition, we employ an existing metric, denoted as IMP, which measures the percentage of generated peptides that exhibit lower energy than the original ground truth. Additionally, we use the root-mean-square deviation (RMSD) to evaluate structural accuracy by aligning the generated peptides to their native structures and calculating the Cα RMSD. To further analyze structural characteristics, we compute the secondary-structure similarity ratio (SSR), which reflects the proportion of shared secondary structures, and the binding site ratio (BSR), which quantifies the overlap between the binding sites of the generated and native peptides on the target protein. Structural diversity is assessed using the average of one minus the pairwise TM-Score (Zhang & Skolnick, 2005) among the generated peptides, representing their dissimilarities. We compare our method to the pre-trained unconditional Pep Flow model (Li et al., 2024), serving as a baseline. We also include ablations where our model guides only translations (Euclidean) or rotations (SO(3)). As shown in Table 5, our OC-Flow method, applied to both Euclidean and SO(3) spaces, consistently outperforms the baseline, even though we only optimize for the Madrax target function. This indicates that our algorithm not only achieves higher target function scores but also captures more natural structural configurations. It generates peptide backbones that are more stable, energetically favorable, and diverse, while improving key metrics such as stability, affinity, IMP, diversity, SSR, and BSR. In comparison, optimizing in Euclidean space alone yields only marginal improvements in IMP, while optimizing rotations alone achieves comparable performance. More experimental details and ablation can be found in Appendix E.3. Table 5: Evaluation of OC-Flow peptide design. Madra X RMSD SSR % BSR % Stability Affinity Diversity imp(%) Ground-truth -0.588 - - - -84.893 -36.063 - - Pep Flow -0.195 1.645 0.794 0.874 -45.660 -26.538 0.310 14.3 OC-Flow(trans) -0.229 1.774 0.797 0.876 -48.380 -27.328 0.323 14.4 OC-Flow(rot) -0.221 1.643 0.794 0.872 -48.636 -27.211 0.310 14.5 OC-Flow(trans+rot) -0.263 2.127 0.797 0.869 -48.853 -27.468 0.338 15.0 6 CONCLUSIONS AND DISCUSSION In this paper, we propose OC-Flow, a general and theoretically grounded framework for trainingfree guided flow matching under optimal control formulation. Our framework provides a unified perspective on existing backprop-through-ODE approaches and lays the foundation for systematic analysis of the optimization dynamics of this setting. Extensive empirical experiments demonstrate the effectiveness of OC-Flow. Future extensions of OC-Flow include generalizing beyond additive control terms and bridging connection with fine-tuning regimes where control terms can be solved as learning updates to the model parameters. Another extension could be scaling up the SO(3) OC-Flow to guide generative tasks for larger molecular systems such as protein motif scaffolding. One potential limitation of backprop-through-ode approaches, despite its superior result, is the higher computation cost compared to posterior sampling approaches. Such tradeoff has been demonstrated in Dflow/Flow Grad as well. Our practical implementations of OC-Flow improve the time and memory complexity (see analysis in Appendix D), where sampling on the image takes 216s compared to 15 minutes in D-Flow. We hope that our findings can guide algorithm design and motivate further model improvement in guided flow matching. Published as a conference paper at ICLR 2025 Vladimir V Aleksandrov. On the accumulation of perturbations in the linear systems with two coordinates. Vestnik MGU, 3:67 76, 1968. Heli Ben-Hamu, Omri Puny, Itai Gat, Brian Karrer, Uriel Singer, and Yaron Lipman. D-flow: Differentiating through flows for controlled generation. ar Xiv preprint ar Xiv:2402.14017, 2024. Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. ar Xiv preprint ar Xiv:2305.13301, 2023. Avishek Joey Bose, Tara Akhound-Sadegh, Kilian Fatras, Guillaume Huguet, Jarrid Rector-Brooks, Cheng-Hao Liu, Andrei Cristian Nica, Maksym Korablyov, Michael Bronstein, and Alexander Tong. Se (3)-stochastic flow matching for protein backbone generation. ar Xiv preprint ar Xiv:2310.02391, 2023. Ricky TQ Chen and Yaron Lipman. Riemannian flow matching on general geometries. ar Xiv preprint ar Xiv:2302.03660, 2023. Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018. Chaoran Cheng, Jiahan Li, Jian Peng, and Ge Liu. Categorical flow matching on statistical manifolds. ar Xiv preprint ar Xiv:2405.16441, 2024. Felix L Chernousko and AA Lyubushin. Method of successive approximations for solution of optimal control problems. Optimal Control Applications and Methods, 3(2):101 114, 1982. Hyungjin Chung, Jeongsol Kim, Geon Yeong Park, Hyelin Nam, and Jong Chul Ye. Cfg++: Manifoldconstrained classifier free guidance for diffusion models. ar Xiv preprint ar Xiv:2406.08070, 2024. Leonardo Jesus Colombo and Dimos V. Dimarogonas. Symmetry reduction in optimal control of multiagent systems on lie groups. IEEE Transactions on Automatic Control, 65(11):4973 4980, 2020. doi: 10.1109/TAC.2020.3004795. Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. ar Xiv preprint ar Xiv:2210.13438, 2022. Floor Eijkelboom, Grigory Bartosh, Christian Andersson Naesseth, Max Welling, and Jan-Willem van de Meent. Variational flow matching for graph generation. ar Xiv preprint ar Xiv:2406.04843, 2024. Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024. Lawrence C Evans. An introduction to mathematical optimal control theory version 0.2. Lecture notes available at http://math. berkeley. edu/evans/control. course. pdf, 1983. Giorgio Giannone, Akash Srivastava, Ole Winther, and Faez Ahmed. Aligning optimization trajectories with diffusion models for constrained design generation. Advances in Neural Information Processing Systems, 36:51830 51861, 2023. Robert Kent Goodrich. A riesz representation theorem. Proceedings of the American Mathematical Society, 24(3):629 636, 1970. Henry Gouk, Eibe Frank, Bernhard Pfahringer, and Michael J Cree. Regularisation of neural networks by enforcing lipschitz continuity. Machine Learning, 110:393 416, 2021. Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. ar Xiv preprint ar Xiv:2207.12598, 2022. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840 6851, 2020. Published as a conference paper at ICLR 2025 Emiel Hoogeboom, Vıctor Garcia Satorras, Clément Vignac, and Max Welling. Equivariant diffusion for molecule generation in 3d. In International conference on machine learning, pp. 8867 8887. PMLR, 2022. Guillaume Huguet, James Vuckovic, Kilian Fatras, Eric Thibodeau-Laufer, Pablo Lemos, Riashat Islam, Cheng-Hao Liu, Jarrid Rector-Brooks, Tara Akhound-Sadegh, Michael Bronstein, et al. Sequence-augmented se (3)-flow matching for conditional protein backbone generation. ar Xiv preprint ar Xiv:2405.20313, 2024. Velimir Jurdjevic. Optimal problems on Lie groups, pp. 368 406. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 1996. Tero Karras. Progressive growing of gans for improved quality, stability, and variation. ar Xiv preprint ar Xiv:1710.10196, 2017. Bahjat Kawar, Roy Ganz, and Michael Elad. Enhancing diffusion-based image synthesis with robust classifier guidance. ar Xiv preprint ar Xiv:2208.08664, 2022. Grigory Khromov and Sidak Pal Singh. Some fundamental aspects about lipschitz continuity of neural networks. In The Twelfth International Conference on Learning Representations, 2024. Gwanghyun Kim and Jong Chul Ye. Diffusionclip: Text-guided image manipulation using diffusion models. 2021. Marin B. Kobilarov and Jerrold E. Marsden. Discrete geometric optimal control on lie groups. IEEE Transactions on Robotics, 27(4):641 655, 2011. doi: 10.1109/TRO.2011.2139130. Jiahan Li, Chaoran Cheng, Zuofan Wu, Ruihan Guo, Shitong Luo, Zhizhou Ren, Jian Peng, and Jianzhu Ma. Full-atom peptide design based on multi-modal flow matching. In Proceedings of the International Conference on Machine Learning (ICML), 2024. Qianxiao Li, Long Chen, Cheng Tai, and E Weinan. Maximum principle based algorithms for deep learning. Journal of Machine Learning Research, 18(165):1 29, 2018. Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. ar Xiv preprint ar Xiv:2210.02747, 2022. Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. ar Xiv preprint ar Xiv:2209.03003, 2022. Xingchao Liu, Lemeng Wu, Shujian Zhang, Chengyue Gong, Wei Ping, and Qiang Liu. Flowgrad: Controlling the output of generative odes with gradients. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24335 24344, 2023. Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion language modeling by estimating the ratios of the data distribution. ar Xiv preprint ar Xiv:2310.16834, 2023. Brennan S Mc Cann, Annika Anderson, Morad Nazari, David Canales, and Ryne Beeson. On-manifold pose optimization on se (3) for spacecraft coverage maximization. In AAS/AIAA Astrodynamics Specialists Conference, 2023. Gabriele Orlando, Luis Serrano, Joost Schymkowitz, and Frederic Rousseau. Integrating physics in deep learning algorithms: a force field as a pytorch module. Bioinformatics, 40(4):btae160, April 2024. doi: 10.1093/bioinformatics/btae160. Jiachun Pan, Jun Hao Liew, Vincent YF Tan, Jiashi Feng, and Hanshu Yan. Adjointdpm: Adjoint sensitivity method for gradient backpropagation of diffusion probabilistic models. ar Xiv preprint ar Xiv:2307.10711, 2023. Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Textdriven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2085 2094, 2021. Ashwini Pokle, Matthew J Muckley, Ricky TQ Chen, and Brian Karrer. Training-free linear image inversion via flows. ar Xiv preprint ar Xiv:2310.04432, 2023. Published as a conference paper at ICLR 2025 Lev Semenovich Pontryagin. Mathematical theory of optimal processes. Routledge, 2018. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748 8763. PMLR, 2021. Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1, 2014. Lars Ruddigkeit, Ruud Van Deursen, Lorenz C Blum, and Jean-Louis Reymond. Enumeration of 166 billion organic small molecules in the chemical universe database gdb-17. Journal of chemical information and modeling, 52(11):2864 2875, 2012. Alessandro Saccon, John Hauser, and A. Pedro Aguiar. Exploration of kinematic optimal control on the lie group so(3)*. IFAC Proceedings Volumes, 43(14):1302 1307, 2010. ISSN 1474-6670. doi: https://doi.org/10.3182/20100901-3-IT-2016.00237. URL https://www.sciencedirect. com/science/article/pii/S1474667015371457. 8th IFAC Symposium on Nonlinear Control Systems. Jun John Sakurai and Jim Napolitano. Modern quantum mechanics. Cambridge University Press, 2020. Jiaming Song, Qinsheng Zhang, Hongxu Yin, Morteza Mardani, Ming-Yu Liu, Jan Kautz, Yongxin Chen, and Arash Vahdat. Loss-guided diffusion models for plug-and-play controllable generation. In International Conference on Machine Learning, pp. 32483 32498. PMLR, 2023. Yuxuan Song, Jingjing Gong, Minkai Xu, Ziyao Cao, Yanyan Lan, Stefano Ermon, Hao Zhou, and Wei-Ying Ma. Equivariant flow matching with hybrid probability transport for 3d molecule generation. Advances in Neural Information Processing Systems, 36, 2024. Martin Steinegger and Johannes Söding. Mmseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature biotechnology, 35(11):1026 1028, 2017. Minkai Xu, Alexander S Powers, Ron O Dror, Stefano Ermon, and Jure Leskovec. Geometric latent diffusion models for 3d molecule generation. In International Conference on Machine Learning, pp. 38592 38610. PMLR, 2023. Jason Yim, Andrew Campbell, Andrew YK Foong, Michael Gastegger, José Jiménez-Luna, Sarah Lewis, Victor Garcia Satorras, Bastiaan S Veeling, Regina Barzilay, Tommi Jaakkola, et al. Fast protein backbone generation with se (3) flow matching. ar Xiv preprint ar Xiv:2310.05297, 2023. Jason Yim, Andrew Campbell, Emile Mathieu, Andrew YK Foong, Michael Gastegger, José Jiménez Luna, Sarah Lewis, Victor Garcia Satorras, Bastiaan S Veeling, Frank Noé, et al. Improved motif-scaffolding with se (3) flow matching. ar Xiv preprint ar Xiv:2401.04082, 2024. Chenshuang Zhang, Chaoning Zhang, Sheng Zheng, Mengchun Zhang, Maryam Qamar, Sung-Ho Bae, and In So Kweon. A survey on audio diffusion models: Text to speech synthesis and enhancement in generative ai. ar Xiv preprint ar Xiv:2303.13336, 2023. Yang Zhang and Jeffrey Skolnick. Tm-align: a protein structure alignment algorithm based on the tm-score. Nucleic acids research, 33(7):2302 2309, 2005. Published as a conference paper at ICLR 2025 A BACKGROUND OF RIEMANNIAN MANIFOLD AND SO(3) GROUP A Lie group G is a smooth manifold equipped with group operations, such as multiplication and inversion, which are smooth maps. Specifically, G is considered smooth when it possesses a C differential structure. When G is endowed with a left-invariant Riemannian metric, it becomes a Riemannian manifold, where the inner product of any two tangent vectors v, w Th G at a point h G is preserved under left multiplication. This property is expressed as: Lh(v), Lh(w) = v, w , (23) where Lh : G G is the left multiplication map, and , : TG TG R represents the inner product. Moreover, the tangent space at any point x G is given by Tx G = Lx Te G, where Te G is the tangent space at the identity element e, which is identified with the Lie algebra g. Consequently, the full tangent bundle TG can be written as G g. At each point x G, a tangent space Tx G is attached, representing the space of tangent vectors at that point. The collection of these tangent spaces forms the tangent bundle TG, which itself is a smooth manifold. Additionally, for any point h G, the cotangent space T h G is defined as the dual space of Th G, consisting of linear functionals (co-states) that act on the tangent vectors. Rotation Group SO(3): The special orthogonal group SO(3), describing 3D rotations, is a compact 3-dimensional Lie group. Its Lie algebra so(3) consists of skew-symmetric matrices. The group SO(3) is defined as: SO(3) = {r R3 3 : r r = rr = I, det(r) = 1}. (24) It is a matrix Lie group, and its Lie algebra is given by: so(3) = {r R3 3 : r = r}. (25) Parametrizations of SO(3): The skew-symmetric matrices r so(3) can be uniquely represented by a vector ω R3, such that for any v R3, rv = ω v, where denotes the cross product. This vector is known as the rotation vector, where its magnitude ω represents the angle of rotation, and its direction eω = ω/ ω defines the axis of rotation. The mapping from R3 to the skew-symmetric matrix is referred to as the hat operation, ˆ( ). Another common parametrization of SO(3) is through Euler angles, described using three angles (ϕ, θ, ψ). In the x-convention, the rotation is expressed as a sequence of three rotations: a rotation about the z-axis by ϕ, followed by a rotation about the updated x-axis by θ, and finally, a rotation about the updated z-axis by ψ. One common inner product on so(3) is the induced Frobenius inner product, given by: A, B = Tr(A B), A, B so(3), (26) which equips SO(3) with a Riemannian structure. The manifold SO(3) has constant Gaussian curvature and is diffeomorphic to a solid ball with antipodal points identified. The exponential map exp : so(3) SO(3), originating from the identity element, is defined as matrix exponentiation: k! , A so(3), (27) and can be represented more compactly via Rodrigues rotation formula: exp(A) = I + sin θ θ A + 1 cos θ θ2 A2, A so(3), (28) where θ = A so(3) = 1 2 A F is the rotation angle. Similarly, the logarithm map log : SO(3) so(3), also originating from the identity, is the matrix logarithm: k=1 ( 1)k+1 (R I)k or more compactly as: log(R) = θ sin θA, R SO(3), (30) where A = (R R ) 2 so(3) and θ = A so(3) is the rotation angle. In spherical geometry, the geodesic distance between two rotations is given by d(R1, R2) = log(R 1 R2) F , and interpolation between rotations can be performed using exp(t A). Published as a conference paper at ICLR 2025 B BACKGROUND OF PMP AND E-MSA In this section, we shall introduce the details of the Pontryagin s Maximum Principle on both Euclidean Space and SO(3) manifold as well as the algorithms of MSA and E-MSA. As described in section 3, PMP offers a set of necessary conditions for an optimal control strategy. PMP states that for an optimal trajectory, there exists a co-state trajectory µt such that the Hamiltonian (Equation 3) is maximized (or minimized, depending on the problem) with respect to the control at every time step. Additionally, the state and co-state evolve according to a system of coupled differential equations, where the co-state variables evolve in the cotangent space of the state variables. The governing differential equations are shown below: Pontryagin s Maximum Principle Let θ U be an essentially bounded optimal control, i.e. a solution to (2) with ess supt [0,T ] θ t < (ess sup denotes the essential supremum). Denote by X the corresponding optimally controlled state process. Then, there exists an absolutely continuous co-state process P : [0, T] Rd such that the Hamilton s equations X t = p H(t, X t , P t , θ t ), X 0 = x, (31) P t = x H(t, X t , P t , θ t ), P T = Φ(X T ), (32) are satisfied. Moreover, for each t [0, T], we have the Hamiltonian maximization condition H(t, X t , P t , θ t ) H(t, X t , P t , θ) for all θ Θ. (33) The PMP conditions can be naturally generalised to the Lie Group. Pontryagin s Maximum Principle (PMP) for Lie groups (Saccon et al. (2010)) provides the conditions that govern the flow of the state xt and its associated cotangent flow λ, describing the Hamiltonian equations that the optimal state-adjoint trajectory must satisfy. Specifically, the Hamiltonian equations are given by: X 1 t X t = p H(t, X t , P t , θ t ), X 0 = x, µθ t = ad H µ µθ t (d Lxθ t ) x H , µθ T = (d Lxθ T ) xΦ(xθ T ). (34) The dual map (d Lg) : T gh SO(3) T h SO(3) pulls back a cotangent vector at gh to a cotangent vector at h. The coadjoint representation ad X acts on so(3) and is defined as: ad Xµ, Y = µ, ad XY , (35) where ad XY = [X, Y ] = XY Y X for µ so(3) and X, Y so(3). Additionally, for each t [0, T], the Hamiltonian maximization condition is satisfied: H(t, X t , µ t , θ t ) H(t, X t , µ t , θ) for all θ Θ. (36) B.2 EXTENDED-METHOD OF SUCCESSIVE APPROXIMATIONS (E-MSA) B.2.1 METHOD OF SUCCESSIVE APPROXIMATIONS (MSA) One numerical method for solving the Pontryagin Maximum Principle (PMP) is the Method of Successive Approximations (MSA) Chernousko & Lyubushin (1982), an iterative approach that alternates between propagation and optimization steps based on the PMP conditions. We first present the simplest form of MSA. Consider the general state dynamics: X t = f(t, X t , θ ). (37) Given an initial guess θ0 U for the optimal control, for each iteration k = 0, 1, 2, . . ., we first solve the state dynamics: Xθk t = f(t, Xθk t , θk t ), Xθk 0 = x, (38) Published as a conference paper at ICLR 2025 to obtain Xθk, followed by solving the co-state equation: µθk t = x H(t, Xθk t , µθk t , θk t ), µθk T = Φ(Xθk T ), (39) to determine µθk. Finally, the control is updated using the maximization condition: θk+1 t = arg max θ Θ H(t, Xθk t , µθk t , θ), (40) for t [0, T]. This process is summarized in Algorithm 3. Algorithm 3 Basic MSA 1: Initialize: θ0 U; 2: for k = 0 to #Iterations do 3: Solve Xθk t = f(t, Xθk t , θk t ), Xθk 0 = x; 4: Solve µθk t = x H(t, Xθk t , µθk t , θk t ), µθk T = Φ(Xθk T ); 5: Set θk+1 t = arg maxθ Θ H(t, Xθk t , µθk t , θ) for each t [0, T]; 6: end for B.2.2 E-MSA E-MSA introduces the augmented Hamiltonian H(t, x, µ, θ, v, q) := H(t, x, µ, θ) 1 2ρ v f(t, x, θ) 2 1 2ρ q + x H(t, x, µ, θ) 2. (41) Then, they define the following set of alternative necessary conditions for optimality: Proposition 3 (Extended PMP) Suppose that θ is an essentially bounded solution to the optimal control problem (2). Then, there exists an absolutely continuous co-state process µ such that the tuple (X , µ , θ ) satisfies the necessary conditions X t = µ H(t, X t , µ t , θ t , X t , µ t ), X 0 = x P t = x H(t, X t , µ t , θ t , X t , µ t ), µ T = xΦ(X T ) H(t, X t , µ t , θ t , X t , µ t ) H(t, X t , µ t , 0, X t , µ t ), θ Θ, t [0, T] (42) The key contribution is that the control terms θ can be updated by iteration using the Extended-PMP as shown below. Meanwhile, it is proven that under this update rule, for each iteration, the target function J(θ) is non-decreasing. Algorithm 4 Extended MSA 1: Initialize: θ0 U. Hyper-parameter: ρ; 2: for k = 0 to #Iterations do 3: Solve Xθk t = f(t, Xθk t , θk t ), Xθk 0 = x; 4: Solve P θk t = x H(t, Xθk t , P θk t , θk t ), P θk T = Φ(Xθk T ); 5: Set θk+1 t = arg maxθ Θ H(t, Xθk t , P θk t , θ, Xθk t , P θk t ) for each t [0, T]; 6: end for C PROOFS AND THEOREMS C.1 PROOF FOR PROPOSITION 1 C.1.1 PART 1 Proposition 1. For Affine Gaussian Probability Path, the expectation of the running cost upper bounds the KL divergence between the prior joint distribution p1(xp, x1) = p1(xp|x1)pdata(x1) and Published as a conference paper at ICLR 2025 the joint distribution after guidance p1(xθ, x1) = p1(xθ|x1)pdata(x1), with x1 pdata, xp induced by prior conditional vector field ut(x|x1) and xθ sampled by applying control θt(x1) on ut. Ex1 pdata(x1) 0 θt(x1) 2 dt C KL(p1(xθ, x1) p1(xp, x1)). (43) Proof. For Affine Gaussian Probability Paths, the conditional flow distribution of the prior vector field can be written as: P p t (x|x1) = N(µt(x1), σt(x1)2I), the conditional vector field is written as: ϕt(x) = µt(x1) + σt(x1)x (44) with its dynamics: ϕt(x) = µt(x1) + σt(x1)x, (45) or ut(x|x1) = µt(x1) + σt σt (x µt(x1)). Although control terms θ(x1) are specifically derived for each ODE trajectory and are conditioned on the target point x1, we can define a marginalized control term θt(x) = R θ(x1)p(x1|x) dx1 to assemble aggregated control from all possible controls applied to each target point. Adding that to the marginal vector field can be thought of as effectively adding the target-conditioned controls into the conditional vector field, and then marginalizing: u(x) + θt(x) = Z (ut(x|x1) + θt(x1)) p(x1|x) dx1. (46) Given the definition of the Gaussian path, we can derive that the additive control terms only alter the mean of the distribution by denoting it as θt(x1) by noticing it is not proportional to x: ϕθ t (x) = ut(ϕ(x)|x1) + θt(x1) = θt(x1) + µt(x1) + σt(x1)x. (47) The resulting pushing-forward prob is equivalently: P θ t (x) = Z P θ t (x|x1)q(x1) dx1, P θ t (x|x1) = N(µθ t(x1), σt(x1)2I). (48) Usually, besides maximising the reward, we hope the new distribution is not too much away from the original distribution. One common constraint on the terminal state distribution is the Kullback-Leibler (KL) divergence. Inspired by Variational Flow Matching (Eijkelboom et al. (2024)), we can the KL divergence between the joint distributions of the data point x1 and the prior terminal point xp, denoted as p1(xp, x1) = p1(xp|x1)pdata(x1), and that of the data point x1 and the controlled terminal point xθ, denoted as p1(xθ, x1): KL(P p 1 (x, x1) P u 1 (x, x1)). The KL term can be simplified as: KL(P p 1 (x, x1)||P θ 1 (x, x1)) = ZZ P p 1 (x|x1)q(x1) log P p 1 (x|x1)pdata(x1) P θ 1 (x|x1)pdata(x1) dx ddx1 (49) = Ex1 pdata(x1)[KL(P p 1 (x|x1)||P θ 1 (x|x1))]. Given P p 1 (x|x1) and P θ 1 (x|x1) are two Gaussians with the same variance but different mean, and when we consider the constraint per sample, the constraint can be written as: 1 2 µθ 1(x1) µp 1(x1) 2 σ1(x1)2 , (50) µθ 1(x1) µp 1(x1) 2 = 0 θt 2 dt. (51) Therefore, we have the following inequality: Ex1 pdata(x1) 0 θt(x1) 2 dt 1 4σ1(x1)2 KL(p1(xθ, x1) p1(xp, x1)). (52) Published as a conference paper at ICLR 2025 C.1.2 PART 2 Proposition 1. Part 2 For square-shaped data x with non-zero probability path, the expectation of the running cost, combined with the L1-distance between the prior sample xp 1 and the corresponding guided sample xθ 1, upper bounds the KL divergence between the marginal distributions of the prior model pp 1 and the guided model pθ 1: Exp 1 pp 1(x) A xp 1 xθ 1 + B Z 1 0 θt(xp 1) 2 dt KL(pp 1 pθ 1). (53) Proof. Firstly, we can build a bound for the distance of terminal points of pre-trained model ϕp t (x) and our controlled model ϕθ t(x) given the same starting point. Given dϕp t (x) dt = f p(x) with f p( ) is the prior model and dϕθ t (x) dt = f p(x) + θt, meanwhile as the θt are defined per sample, we can see them as a function of x. Given the ODEs governing ϕθ t (x) and ϕp t (x) and use Lipschitz condition: ϕθ t (x) ϕp t (x) Z 1 0 f p(ϕθ t(x)) f p(ϕp t (x))+θt dt L Z 1 0 ϕθ t(x) ϕp t (x) dt+ Z 1 (54) By Gronwall s inequality: ϕθ 1(x) ϕp 1(x) e L Z 1 0 θt dt. (55) From the bound we build above, we can set ϕθ 1(x) ϕp 1(x) = g(θt, x) with g(θt, x) e L Z 1 0 θt dt, (56) and then by definition: ϕ 1 θ,1(x) = ϕ 1 p,1(x g). (57) Now we consider the push-forward functions and assume p is non-zero for all x: pp 1(x) = p0(ϕ 1 p,1(x))det " ϕ 1 p,1 x (x) pθ 1(x) = p0(ϕ 1 θ,1(x))det " ϕ 1 θ,1 x (x) Given the starting distribution is standard Gaussian, we can get the KL divergence as: KL(pp 1 pθ 1) = Z p1(x) 2 ϕ 1 p,1(x) 2 + 1 2 ϕ 1 θ,1(x) 2 + log det ϕ 1 p,1 x (x) det ϕ 1 θ,1 x (x) With the Mean Value Theorem, we can find a point x between x and x g so that ϕ 1 p,1(x) ϕ 1 p,1(x g) = (ϕ 1 p,1( x)) g. (60) Assume the gradient is bounded so that (ϕ 1 p,1( x)) k: 2 ϕ 1 p,1(x) 2 + 1 2 ϕ 1 θ,1(x) 2 k g + k2 2 g 2. (61) For the log term, with MVT again, there exists a point ξ on the line segment between x and x g such that: h(x) h(x g) = h(ξ) g. (62) Applying this to h(x) = log det f(x): log det f(x) log det f(x g) = (log det f)(ξ) g. (63) Thus, | (x)| = | (log det f)(ξ) g| (log det f)(ξ) g . (64) Published as a conference paper at ICLR 2025 To bound (log det f)(ξ) , the gradient of log det f(x) with respect to x is: (log det f(x)) = (det f(x)) 1 (det f(x)). (65) Given the derivative of the determinant of a matrix-valued function assuming f(x) is square: (det f(x)) = det f(x) Tr f (x) 1f (x) . (66) Therefore: (log det f(x)) = Tr f (x) 1f (x) . (67) Assuming that: f (x) 1 M for some constant M; f (x) N for some constant N. Then: (log det f(x)) f (x) 1 f (x) n MNn = K. (68) log det ϕ 1 p,1 x (x) det ϕ 1 θ,1 x (x) K g . (69) Overall, we can bound the KL divergence using the integration of θt and the integration of the square of θt: KL(pp 1 pθ 1) Ex1 pp 1(x) (k + K) g + k2 Ex1 pp 1(x) (k + K)e L x1 xθ 1(x1) + k2 0 θt 2 dt . (70) C.2 PROOF FOR THEOREM 2 Since the Extended Method of Successive Approximations (E-MSA) is applied in this context, its convergence properties are directly inherited. In this section, we focus on how our update rule is derived from the E-MSA update rule, specifically under the framework of additive control terms and running cost. Define Hamiltonian H and Extended Hamiltonian H: H(t, x, µ, θ) = µt f(x, t) + µt θt 1 H(t, x, µ, θ, x, µ) = µt f(x, t) + µt θk+1 t 1 2 θk+1 t 2 γ 2 θk+1 t θk t 2. (71) Apply Extended MSA: xk t = θk t + f(xk t , t), µk t = x H(t, xk, µk, θk) = xf(xk t , t)µk t , µk 1 = α x1Φ(xk 1), θk+1 t = argmaxθt H(t, x, µ, θ, x, µ). (72) µk t can be calculated in closed form as below, with T exp is the time-order exponential: µk t = T exp Z 1 t xf(xk s, s) ds α x1Φ(Xk 1 ) (73) Thus, the update rule of θt is: θk+1 t = γ 1 + γ θk t + α 1 + γ T exp Z 1 t xf(xk s, s) ds x1Φ(Xk 1 ). (74) Published as a conference paper at ICLR 2025 Further, the time-order exponential term can be simplified as follows. Evans (1983) provides a method to calculate p(t) = xtx(1) efficiently by defining the adjoint and using the following ODEs: p(t) = xf(xk t , t)p(t), p(1) = x(1)x(1) = I. (75) Sakurai & Napolitano (2020) provides an alternative viewpoint of the adjoint ODE, they show there would be a closed form solution: p(t) = A(t)p(t), p(t) = T exp Z 1 t A(s) ds p(1). (76) Thus, combining the closed form solution and the adjoint representation, we find that: xtx(1) = T exp Z 1 t xf(xk s, s) ds . (77) Given the linearity of the ODEs, multiply a constant to the terminal by change p(t) = xtx(1) into p(t) = xtx(1) x1Φ(x1), the conclusions would not change, we have: xtx(1) x1Φ(x1) = T exp Z 1 t xf(xk s, s) ds x1Φ(x1). (78) Therefore, our update rule is in fact: θk+1 t = γ 1 + γ θk t + α xtΦ(Xk 1 ). (79) The solutions can be naturally generalised to the space of RN if we use Tr(A B) to replace A B. C.3 DISCRETIZATION ERROR Given the discretization method we are using is Euler step, we can show the error in terminal states due to Euler method with turning continous setting xθ t = ht(xθ t , θt) into discrete setting xt+1 = xt + ht(xθ t, θt) t is bounded. Consider the first update of the states, define the local truncation error as: τk := x t+1 x t ht(xθ t , θt) t. (80) By definition: τk = Z tk+1 tk hs(xθ s, θs) ds ht(xθ t , θt) t. (81) Given t is small, with Taylor expansion: hs(xθ s, θs) = ht(xθ t , θt) + h x(xθ s xθ t) + h t (s t) + higher order terms. (82) Similarly: xθ s xθ t = ht(xθ t , θt)(s t) + higher order terms. (83) We can then obtain hs(xθ s, θs) = ht(xθ t, θt) + ht (s t) + higher order terms. (84) Therefore: Z tk+1 tk hs(xθ s, θs) ds = t ht(xθ t, θt) + t2 + higher order terms, + higher order terms. (85) Now we accumulate Local Errors to Global Error. Define error in kth step: ek = x (tk) xk, Published as a conference paper at ICLR 2025 ek+1 = x (tk+1) xk+1 = (x (tk) + t f(x (tk), u (tk), tk) + τk) (xk + t f(xk, u k, tk)) . (86) Simplifying: ek+1 = ek + t (f(x (tk), u (tk), tk) f(xk, u k, tk)) + τk. (87) Assuming f is Lipschitz continuous in x and u: f(x (tk), u (tk), tk) f(xk, u k, tk) Lf x (tk) xk = Lf ek . (88) Error Recurrence Inequality: ek+1 ek + t Lf ek + τk = (1 + t Lf) ek + τk . (89) Solving the Error Inequality given τk C( t)2, where C is a constant depending on f and its derivatives. We will solve the inequality: ϵk+1 αϵk + C( t)2, (90) where ϵk = ek and α = 1 + t Lf. Thus: ϵk+1 αk+1ϵ0 + C( t)2 k X j=0 αk j. (91) Since ϵ0 = 0 (assuming x0 is exact), the first term drops out. The sum becomes: j=0 αk j = αk+1 1 For small t: α e t Lf , αk+1 e(k+1) t Lf = e Lf tk+1. (93) Compute the Sum Sk: Sk e Lf tk+1 1 e t Lf 1 e Lf tk+1 1 t Lf . (94) Final Bound on ϵk+1: ϵk+1 C( t)2 e Lf tk+1 1 t Lf = C t(e Lf tk+1 1) At the Final Time tf: ϵN = x (tf) x N C t(e Lf tf 1) For multiple rounds of updates, the error would be accumulated, for the N+1th update, an additional local error is added to τk: εk,N+1 = x t,N xt,N + (ht(x t,N, θt, N ) ht(xt,N, θt N)) t. (97) As the magnitude of this additional term is still O( t2), the order of magnitude of τk does not change, thus our conclusion does not change. The case in the SO(3) manifold should be similar, noticing the discretization also uses the Euler step. C.4 ASYNCHRONOUS SETTING AND VJP IN SO(3) In practice, as described by equation 22, discretization techniques are employed to simulate the ordinary differential equations (ODEs) governing both the state trajectory xt and the corresponding cotangent vector µt. Most existing methods, as well as the algorithm presented earlier, operate under a synchronous setting, where the number of time steps for the state trajectory xt matches the number of control terms θt. However, OC-Flow can be extended to an asynchronous framework by approximation to allow greater flexibility in update scheduling. Published as a conference paper at ICLR 2025 Rather than employing the standard update rule xt+ t = xt + f(t, xt, θt) t, we subdivide the time interval t into N equally spaced subintervals, applying the control term θt only during the first subinterval. The update for xt+ t is given by: xt+ t = xt + t N f(xt, θt) + t N f p xθ t+ t N f p xθ (N 1) t Moreover, for intermediate steps, we define: N f(xt, θt) + N f p xθ t+ l t where xt+ i t N denotes the state at the i-th subinterval. Recall that f(xt, θt) = f p(xt)+θt, when we have t N is small enough, the update rule in Equation 10 can be approximated as a case in Equation 4 by considering θxt+ t: N θf p xθ t+ t N θf p xθ t+ (N 1) t 2 xf p xθ t+ t N xf p xθ t+ (N 1) t Therefore, if we denote xt as the trajectory of the state variable x over the time interval [t, t + t], and xθ t as the trajectory when the control term θt is applied in the first subinterval, it can be reasonably approximated as: xt+ t xt + t l=1 f p xt+ l t N θt. (101) As a result, for the asynchronous setting, the step 4 in Algorithm 1 should be modified as: l=1 f p(xt+ l t For the case on the SO(3) manifold, the asynchronous setting can be deployed using the Taylor expansion of the matrix exponential exp(A), and noting that when t N and t are sufficiently small, the terms become commutative. We can derive the approximation as follows: xt+ t = xt exp t N f(xt, θt) exp t N f p xθ t+ t N f p xθ (N 1) t l=1 f p xt+ l t xt exp t f p e (xt) + 1 The vector-Jacobian method can also be applied to compute the term f p t x (xk t Ej) in Algorithm 2: µt, f p t x xk t Ej = Tr µT t f p t x (xk t Ej) . (104) C.5 PROOF FOR PROPOSITION 3 The key inequality we used for the proof of Proposition 3 is following: h, v h v , x = 1. (105) Published as a conference paper at ICLR 2025 for h, v so(3) and x SO(3) Before proving the proposition 3, we first show that the cotangent vector {µt} is also bounded. Lemma 2: Assume all functions satisfy Lipschitz condition. Then, there exists a constant K > 0 such that for any θ, µθ t K , for all t [0, T]. Proof. Using necessary condition and setting τ := T t, µθ τ := µθ T τ we get µθ τ = ad H µ µθ τ + (d Lx) ( x µθ τ, f ), µθ 0 = (d Lxθ T ) Φ(xθ T ). (106) With Lipschitz condition, we have µθ 0 Φ(xθ T ) xθ T K, xf(t, xθ t, θt) K, and ad H µ µ K µ . Hence, µθ τ K µθ τ , (107) µθ τ µθ 0 µθ τ µθ 0 Z t 0 µθ s ds Z t 0 (K µθ τ ) ds, (108) and by Gronwall s inequality, µθ τ µθ 0 e KT =: K . (109) This proves the claim since it holds for any τ. Now given all related terms are bounded, we can prove Proposition 2. Proposition 3: Assume that the reward function, the prior model, and their derivatives satisfy Lipschitz continuity, bounded by a Lipschitz constant L. Then, there exists a constant C > 0 such that for any θ, ϕ so(3), the following inequality holds: 0 ϕ,θH(t) dt ϕt θt 2dt J(ϕ), (110) where Xθ and P θ satisfy the PMP conditions in Equation 34, and Hϕ,θ denotes the change in the Hamiltonian, defined as: Hϕ,θ(t) := H(t, xθ t, µθ t, ϕt) H(t, xθ t, µθ t , θt). (111) Proof. Firstly, by the definition of Hamiltonian and PMP conditions, we always have: I(xθ, µθ t, θ) := Z T 0 µθ t , f θ t H(t, xθ t , µθ t , θ) L(θ) dt 0. (112) Define δµt = µϕ t µθ t and δft = f ϕ t f θ t , the difference in I can be decomposed as: 0 I(xϕ, µϕ t , ϕ) I(xθ, µθ t, θ) 0 µθ t , δft + δµt, f θ t + δµt, δft dt 0 H(t, xϕ t , µϕ t , ϕ) H(t, xθ t, µθ t, θ) dt 0 (L(ϕt) L(θt)) dt. Now define U(t) = R t 0 fs ds = log(xtx 1 0 ) and by integrating by parts: Z T 0 µθ t, δft dt = µθ t , δUt |T 0 Z T 0 µθ t, δUt dt, Published as a conference paper at ICLR 2025 0 δµt, δft dt = δµt, δUt |T 0 Z T 0 δ µt, δUt dt. (114) Combine the two terms above: Z T 0 µθ t , δft + δµt, f θ t dt = µθ t, δUt |T 0 + Z T µ + µθ t, ad Hθ x , δftxθ t (115) Similarly, we get: 0 δµt, δft dt = 1 0 δµt, δft dt + 1 0 δµt, δft dt 2 δµt, δUt |T 0 1 0 δµt, δUt dt + 1 0 δµt, δft dt 2 δµt, δUt |T 0 + 1 µϕ t , ad Hϕ µθ t, ad Hθ (d Lxϕ t ) Hϕ x (d Lxθ t ) Hθ With mean value theorem and x, µ are bounded by constant L, we can always find xγ t between xϕ t and xθ t, µγ t between µϕ t and µθ t , γ between ϕ and θ, so that : x H(t, xϕ t , µϕ t , ϕ) (d Lxθ t ) x H(t, xθ t , µθ t , θ), δUt x H(t, xθ t, µθ t , ϕ) (d Lxθ t ) x H(t, xθ t, µθ t , θ), δUt D x((d Lxγ t ) )xγ t δUt x H(t, xϕ t , µϕ t , ϕ), δUt E dt D (d Lxθ t ) 2 x H(t, xγ t , µϕ t , ϕ)xγ t δUt, δUt E dt D (d Lxθ t ) µ x H(t, xθ t , µγ t , ϕ)δµt, δUt E dt x H(t, xθ t, µθ t , ϕ) (d Lxθ t ) x H(t, xθ t, µθ t , θ), δUt 0 δUt 2 + δµt δUt dt. Using the same method we get: 0 δµt, δft dt 1 2 δµt, δUt |T 0 + 1 µϕ t , ad Hϕ µθ t, ad Hθ x H(t, xθ t, µθ t, ϕ) x H(t, xθ t , µθ t , θ), δUtxθ t µH(t, xθ t, µθ t, ϕ) µH(t, xθ t , µθ t, θ), δµt 0 δUt 2 + δµt δUt dt. Published as a conference paper at ICLR 2025 With boundary conditions: |T 0 = µθ T + 1 = D (d Lxθ t ) Φ(xθ T ), δUt E + 1 D (d Lxϕ t ) Φ(xϕ T ) (d Lxθ t ) Φ(xθ T ), δUt E Φ(xϕ T ) Φ(xθ T ) + K δUT 2. Using same method to H(t, xϕ t , µϕ t , ϕ) H(t, xθ t , µθ t , θ), we obtain: Φ(xθ T ) + Z T Φ(xϕ T ) + Z T K δUT 2 Z T 0 Hϕ,θ(t) dt + 1 µϕ t , ad Hϕ µθ t , ad Hθ x H(t, xθ t, µθ t , ϕ) x H(t, xθ t, µθ t, θ), xθ tδUt µH(t, xθ t, µθ t , ϕ) µH(t, xθ t, µθ t, θ), δµt 0 ||δUt||2 + ||δµt||||δUt|| dt. By definition: 0 f(t, xϕ T , ϕ) f(t, xθ T , θ) dt, (121) 0 f(s, xϕ s , ϕ) f(s, xθ s, θ) ds 0 f(s, xϕ s , ϕ) f(s, xθ s, ϕ) ds + Z t 0 f(s, xθ s, ϕ) f(s, xθ s, θ) ds 0 f(s, xθ s, ϕ) f(s, xθ s, θ) ds + K Z t By Gronwall s inequality: δUt e KT Z T 0 f(s, xθ s, ϕ) f(s, xθ s, θ) ds. (123) To estimate δµ, we use the same substitution as in Lemma 6 with τ = T t, we get: δ µτ = δ µ0 + Z τ 0 (d L xϕ s ) x H(s, xϕ s , µϕ s , ϕ) (d L xθs) x H(s, xθ s, µθ s, θ) ds µ µϕ τ ad H µ µθ τ ds. (124) Published as a conference paper at ICLR 2025 Using Lemma 1 and Liptichitz conditions: δ µτ δ µ0 + Z τ 0 (d L xϕ s ) x H(s, xϕ s , µϕ s , ϕ) (d L xθs) x H(s, xθ s, µθ s, θ) ds µ µϕ τ ad H K δUT + KK Z T 0 δUt dt + K Z τ 0 (d L xθs) x H(s, xθ s, µθ s, ϕ) (d L xθs) x H(s, xθ s, µθ s, θ) ds δUT + K Z T + e KT K Z T 0 x H(s, xθ s, µθ s, ϕ) x H(s, xθ s, µθ s, θ) ds. Using the bound of Ut, we obtain: 0 f(s, xθ s, ϕ) f(s, xθ s, θ) ds + e KT K Z T 0 x H(s, xθ s, µθ s, ϕ) x H(s, xθ s, µθ s, θ) ds. Also we obtain: µϕ t , ad Hϕ µθ t, ad Hθ µ µϕ t ad Hθ µ µθ t , δUt 0 δµt δUt dt. Finally we get: Published as a conference paper at ICLR 2025 J(θ) J(ϕ) Z T 0 Hϕ,θ(t) dt δUt 2 + δµt 2 dt 0 δµt f(t, xθ t, ϕt) f(t, xθ t , θt) dt 0 δUt x H(t, xθ t, µθ t , ϕt) x H(t, xθ t, µθ t, θt) dt 0 Hϕ,θ(t) dt 0 f(t, xθ t , ϕt) f(t, xθ t, θt) dt 0 x H(t, xθ t , µθ t, ϕt) x H(t, xθ t , µθ t , θt) 2 dt 0 Hϕ,θ(t) dt 0 f(t, xθ t , ϕt) f(t, xθ t, θt) 2 dt 0 x H(t, xθ t , µθ t, ϕt) x H(t, xθ t , µθ t , θt) 2 dt. Therefore, given the form of the addictive control terms and the running cost, we can derive the final term of our claim: 0 ϕ,θH(t) dt C ϕt θt 2 dt J(ϕ). (129) C.6 PROOF FOR PROPOSITION 4 Due to the similarity between the bound derived in Proposition 3 and the bound obtained from the E-MSA method, the proofs of Proposition 4 and Theorem 5 follow the same reasoning as outlined in Section 3.3 of Li et al. (2018). For the sake of completeness, we provide the full derivations here. Proposition 4: Let Xθ and P θ satisfy the PMP conditions in Equation 34. If the update rule follows Equation 19, we define ϵk := R 1 0 θk+1,θk H(t) dt, and ϵk is bounded as: 0 θk+1,θk H(t) dt, lim k ϵk = 0. (130) Furthermore, when ϵk = 0, we have θ = θ := arg maxθ J(θ)To establish convergence, define 0 Hθk+1,θk(t) dt 0. (131) Proof. By definition, if ϵk = 0, then from the update rule which maximizes the Hamiltonian, we must have 0 θk+1 θk 2 dt 0, (132) and so max θ H(xθk t , µθk t , θ, xθk t , µθk t ) = H(xθk t , µθk t , θk t , xθk t , µθk t ). (133) Published as a conference paper at ICLR 2025 Therefore, we always have the quantity ϵk 0 and it measures the distance from the optimal solution, and if it equals 0, then we reach the optimum. C.7 PROOF FOR THEOREM 5 Theorem 5: Assume that the reward function, the prior model, and their derivatives satisfy Lipschitz continuity, bounded by a Lipschitz constant L. Let θ0 so(3) be any initial measurable control with J(θ0) < + . Suppose also that infθ so(3) J(θ) > . If the update of θ satisfies equation 19, for sufficiently large γ, the following inequality holds: Dϵk J(θk+1) J(θk) (134) for some constant D > 0 Proof. Using Proposition 3 we have J(θk) J(θk+1) ϵk + C Z T 0 θk+1 θk 2 dt. (135) From the Algorithm 2 maximizing step, we know that H(t, Xθk t , P θk t , θk t ) H(t, Xθk t , P θk t , θk+1 t ) γ 2 θk+1 θk 2. (136) Hence, we have J(θk) J(θk+1) 1 2C Pick γ > 2C, then we shall have J(θk) J(θk+1) Dϵk with D = (1 2C Moreover, we can rearrange and sum the above expression to get k=0 ϵk D 1 J(θM+1) J(θ0) D 1 inf θ U J(θ) J(θ0) , (138) and hence P k=0 ϵk < + , which implies ϵk 0 and the algorithm converges to the optimum. D COMPUTATIONAL EFFICIENCY Table 6: Comparison of runtime and memory complexity of different methods used in backpropthrough guided-ODE in Euclidean and SO(3) manifold. For complexity, N is the number of ODE steps, n is the number of effective control terms with synchronized and in the range [1, N] and D2 is the complexity of computing 1-step gradient (VJP or Autograd), D depends on data and model size. c is the deficiency introduced by L-BFGS optimizer. Number Of Memory Runtime Convergence Generalization Control Terms Complexity Complexity to Optimal to SO(3) OC-Flow n O(D2) O(n D2) Flow Grad n O(D2) O(n D2) D-Flow N O(ND2) O(c ND2) Red-Diff N/A O(D) O(LND) In terms of memory complexity, both our implementation and Flow Grad utilize the vector-Jacobian approach, which allows solving differentiation through ODE-integral with the adjoint method, significantly reducing memory from O(ND2) (D-Flow) to O(D2). In comparison, D-Flow does not employ the adjoint method, significantly increasing the memory complexity to O(ND2), as detailed in Table 6. In terms of runtime, both Flow Grad and OC-Flow are predominantly influenced by the number of control terms. Various strategies are implemented to reduce either the number of control terms (or Published as a conference paper at ICLR 2025 Table 7: Illustration how different methods decrease the runtime and memory complexity and enhance model capability Effective Memory Runtime Generalization Timestep Complexity Complexity to SO(3) OC-Flow n O(D2) O(n D2) w/o-asynchronous N O(D2) O(ND2) w/o-VJP (adjoint) N O(ND2) O(ND2) OC-Flow-SO(3) n O(D2) O(n D2) w/o-asynchronous N O(D2) O(ND2) w/o-VJP (adjoint) N O(ND4) O(ND4) the effective time steps requiring back-propagation). Specifically, Flow Grad applies a straightening technique, and in oc-flow, this concept is extended to an asynchronous setting. Consequently, the runtime complexity is expressed as O(n D2) for both Flow Grad and OC-Flow. In contrast, D-Flow does not imply an asynchronous setting, necessitating N steps of Autograd, resulting in O(ND2) complexity. In addition, D-Flow is heavily relying on L-BFGS optimizer (Table 4), which also adds a significant increase in runtime due to e.g., uncontrollable additional iteration of linear search, which we estimate with a factor of constant c. Table 7 provides a more direct comparison of the impact of the asynchronous setting and the vector Jacobian product (VJP) method. Compared to optimization-based algorithms, in some methods such as Red-Diff and DPS, the gradientguidance can be approximated directly. These methods demonstrate significantly lower memory complexity and faster runtime. However, their capability is notably constrained. As reported in the Dflow paper, Red-Diff encounters difficulties in handling noise, even in simple linear inverse problems involving images. When oc-flow is adapted to the SO(3) manifold, an additional calculation is introduced for each control term. Specifically, this involves computing the trace of the product of two D D matrices, which adds a computational cost of O(D2). Despite this additional burden, the overall complexity remains in the order of O(D2). The VJP method is also applied for oc-flow on the SO(3) manifold, as shown in Equation 15, for the term µk t , f p t x xk t Ei , where f p t x xk t is computed using VJP. This additional complexity is justified by oc-flow s improved convergence to optimal solutions on the SO(3) manifold. Notably, the OC-Flow-SO3 involves computing full Jacobian which could induce a complexity of O(D4) if directing computing it. Our Jacobian-Vector Product derivation significantly reduces the complexity of OC-Flow-SO3 from O(D4) to O(D2), which enabled our efficient implementation. Table 8: Memory Usage and Runtime on Text-guided Image (256*256) Manipulation. Note: ODE steps = 100, optimization steps = 15. Flow Grad DFlow OC-Flow Fast Simulation on off - on off Peak GPU Mem (GB) 5.2 5.2 OOM 5.4 5.4 Runtime per Sample (s) 114.7 206.2 - 115.7 216.7 In Table 8 we show memory usage and runtime on high-dimensional images, evaluated on a single A100 with 40G memory. We compare Flow Grad, DFlow, and OC-Flow on images. (Liu et al., 2023) proposed fast simulation through skipping some Euler steps if the relative velocity change is smaller than a threshold (1e-3 in (Liu et al., 2023)). We compare the 3 models with or without fast simulation. As for DFlow, we encountered Out-Of-Memory error even on 40G GPU. (Ben-Hamu et al., 2024) used gradient checkpoint to bypass the OOM error, however, due to the lack of code and implementation details of DFlow, we are unable to fully reproduce their implementation. For reference, DFlow takes 15 minutes per image (128*128) on a single 32GB V100 GPU, according to their paper. Generally speaking, Flow Grad and OC-Flow have similar memory usage and runtime except for DFlow. Published as a conference paper at ICLR 2025 Table 9: Memory Usage and Runtime Comparison of OC-Flow on Euclidean and SO(3). SO(3) Euclidean Peak GPU Mem (GB) 1.6 1.2 Runtime per Sample (s) 296.6 188.2 For peptides 9, since Flow Grad and DFlow are not applicable in SO(3), we only report results for OC-Flow using our proposed asynchronous algorithm for efficient simulation, evaluated on a single A100 with 40G memory. Comparing the results, rotation is computationally 0.5x more expensive than translation due to the additional cost introduced by Eq. 22. However, as shown in Table 7, the costs of rotation and translation remain within the same order of magnitude. It is important to note that the extra cost of rotation is justified, as the additional computations required by Eq. 22 are inherent to operations on the SO(3) manifold. Moreover, as demonstrated in Table 5, generation in SO(3) is crucial for the task. Table 10: Runtime Comparison on Molecule. Note: ODE steps = 50, SGD steps = 20, L-BFGS steps = 5 with inner steps = 5. Equi FM Flow Grad DFlow OC-Flow Optimizer N/A SGD SGD L-BFGS SGD L-BFGS Runtime per Sample (s) 2.4 37.6 31.3 102.8 38.1 103.7 In Table 10, we compare Flow Grad, DFlow, and OC-Flow on molecule, evaluated on a single A100 with 40G memory. For reference, we also list Equi FM model as used in our method. To align with DFlow, we also adopt L-BFGS optimizer and find it crucial in performance yet slows down the efficiency. E EXPERIMENTAL DETAILS E.1 TEXT-GUIDED IMAGE MANIPULATION In our text-to-image generation experiment, we adopted the pipeline presented in Liu et al. (2023), utilizing the generative prior from Liu et al. (2022). We employed standard evaluation metrics: LPIPS and ID (face identity similarity) as introduced in Kim & Ye (2021) to assess the differences between the original image and the manipulated image. Additionally, the CLIP score was used to evaluate the alignment between the generated image and the provided text prompt. To enforce consistency with the original image and inspired by Proposition 1, we introduced a constraint term to the terminal reward function to penalize significant deviations from the original image: Φ(x1) = λCLIP(x1, T) (1 η) x1 xp 1 . (139) Here, the hyperparameter λ was set to 0.7 across all experiments, and the Euler discretization step was set to N = 100 and the number of optimization iterations M = 15. As discussed in Theorem 2, increasing the learning rate η results in greater emphasis on the terminal reward, leading to a higher CLIP score but lower LPIPS and ID scores. The weight decay is a function of γ, which is tuned to maximize (1 2C γ )ϵk γ for iteration k. In this experiment, due to the limitation of storage, we set γ the same for all k. In our implementation, the learning rate η was set to 2.5, and the weight decay β was set to 0.995. Baseline configurations were aligned with those reported in Liu et al. (2023), and the results presented in Table 2 reflect the same experimental conditions. For quantitative comparison, we used the Celeb A dataset, randomly sampling 1,000 images, which were manipulated based on text guidance: {old, sad, smiling, angry, curly hair}. Published as a conference paper at ICLR 2025 E.2 MOLECULE GENERATION In our QM9 generation experiment, we mostly followed the conditional generation pipeline in Hoogeboom et al. (2022). An equivariant geometric GNN was trained for each property on half of the QM9 data as the classifier, which was then frozen during our training-free controlled generation. The Equi FM (Song et al., 2024) checkpoint, trained on the whole QM9 training data, was loaded as the generative prior. The test time properties were sampled from the whole training dataset, making it slightly different from the settings in Ben-Hamu et al. (2024). Therefore, we reimplemented the D-Flow algorithm with 5 optimizer steps and 5 inner steps each with a linear search using the L-BFGS optimizer. The results roughly matched those reported in the D-Flow paper with slightly worse MAEs as we included the whole training dataset for property sampling. Our proposed OC-Flow also used the same optimization hyperparameters and also almost the same running time as the D-Flow. For Flow Grad, we followed the suggestion in the original paper to use 20 SGD steps to update the learnable parts, which ran slightly faster than OC-Flow and D-Flow. For all properties, MAE was used as the optimization target and γ is the regularization coefficient such that γ R 1 0 θt 2 dt is the additional OC loss. For all optimization methods, we always used a fixed number of 50 Euler steps so θ can be indexed by discrete indices. As the integral is done with a step size of 1/50, any γ is effectively γ = 2γ/50 = 0.04 when taking the derivative with respect to θ or x. Therefore, γ = 10 effectively corresponds to γ = 0.4, which is still a valid optimization scheme. We noted the difference in the optimizer in these settings in order to be consistent with the D-Flow setup. We provide additional ablation studies on the effect of the optimizer, in which all guided generation approaches used the SGD optimizer with 20 iterations and a learning rate of 1, following the Flow Grad setup. The results are summarized in Table 11. It can be clearly demonstrated that D-Flow performance is significantly worse than OC-Flow and even Flow Grad, the latter of which is a special case of our OC-Flow. Indeed, we can safely conclude that the advantage of D-Flow came solely from its optimizer of using L-BFGS. Using the SGD optimizer caused it to perform even worse than the unconditional Equi FM baseline on the dipole moment µ. On the other hand, OC-Flow achieved consistent improvements, using either the L-BFGS or the simple SGD optimizer, demonstrating our superior performance. Table 11: Ablation on optimizer. MAE for guided generations on QM9 (lower is better). Property α ε εHOMO εLUMO µ cv Unit Bohr³ me V me V me V D cal K mol OC-Flow(Ours) 1.907 346 187 300 0.362 0.972 D-Flow-SGD 5.753 1241 571 1195 1.639 2.982 Flow Grad 2.484 517 273 429 0.542 1.270 OC-Flow-LBFGS(Ours) 1.383 367 183 342 0.314 0.819 D-Flow-LBFGS (Ben-Hamu et al., 2024) 1.566 355 205 346 0.330 0.893 Equi FM 8.969 1439 622 1438 1.593 6.873 E.3 PEPTIDE DESIGN In our peptide experiments, we adopted Pep Flow (Li et al., 2024) as the baseline model, utilizing the pre-trained checkpoint provided in the original Pep Flow paper. The test dataset split was also based on the one defined in the Pep Flow framework. For hyperparameter tuning, we randomly selected 10 complexes from the dataset. After tuning, the full set of 162 complexes was used for guided sampling and evaluation to ensure a comprehensive performance assessment. To enable flexible update scheduling, we adopted an asynchronous setting in OC-Flow. This design maintains the same ODE time steps as Pep Flow while utilizing fewer control terms. Specifically, 200 time steps are used for ODE simulation, with 10 control terms, each controlling 20 time steps. As shown in Table 7, this approach reduces memory and runtime complexity without compromising the accuracy of the ODE simulation. Furthermore, for consistency and comparability across experiments, Published as a conference paper at ICLR 2025 we strictly controlled the initial noise during reruns to ensure consistency and comparability across experiments. We used the pre-trained model as the initialization for our experiments, allowing us to build upon the pre-trained weights and achieve consistent performance improvements through hyperparameter adjustments. In OC-Flow(rot), we used α = 0.95 and β = 0.8; in OC-Flow(trans), α = 0.9 and β = 1.2; and in OC-Flow(both), α = 0.95 and β = 2.0. For all methods, we followed the same hyperparameter choices outlined in the Pep Flow paper to ensure fairness. For evaluation, in addition to Madra X, the reward function used and optimized during training, we included several key metrics not used for training to comprehensively assess the physical validity and overall performance of the generated structures: Stability: Calculated using Rosetta over five independent runs, averaged to reduce high variance. Affinity: Measured by Rosetta to determine the binding energy of designed peptides, also averaged over five runs. IMP (Improvement Percentage): The percentage of peptides with improved affinities (lower binding energies) compared to the native peptides, aligning with the definition used in Pep Flow. Diversity: Calculated as the average of 1 TM-Score among generated peptides, reflecting structural dissimilarities. SSR (Secondary-Structure Similarity Ratio): The proportion of shared secondary structures between the designed peptide and the native peptide. BSR (Binding Site Ratio): The overlapping ratio between the binding site of the generated peptide and the native binding site on the target protein. Due to the time-intensive nature of Rosetta evaluations, we drew 10 samples per pocket for our experiments, in contrast to Pep Flow s use of 64 samples. This approach enabled us to achieve reproducible and fair comparisons across methods without excessive computational costs. Moreover, by employing OC-Flow with guidance, we demonstrated that superior performance can be achieved with fewer samples while maintaining consistency in evaluation settings. As shown in Table 5, we demonstrated the importance of optimization on the SO(3) manifold for peptide design. To further evaluate the impact of using optimal control for updating rotations compared to standard gradient descent as in Euclidean space, we implemented Naive-SO(3). In this implementation, the gradient of the terminal reward with respect to the control terms is computed directly and mapped to so(3), and gradient descent is used to update the control terms. θk+1 t = βθk t + η[ θk t Φ(xθk 1 )]so(3), xθk+1 t = (f p(xθk+1 t ) + θk+1 t )xθk+1 t . (140) For the experimental setup, we conducted a comparative study using a randomly selected subset of 30 pockets. Each peptide was tested under identical conditions, with the primary difference being the parameterization and update method for rotations. Table 12: Comparison of OC-Flow-SO3 and naive SO3 gradient descent Madra X RMSD SSR % BSR % Stability Affinity Diversity imp(%) Ground-truth -0.610 - - - -91.107 -39.807 - - Pep Flow -0.157 1.932 0.788 0.882 -39.807 -28.080 0.322 11.6 Naive-SO(3) 0.275 5.206 0.769 0.748 75.842 -21.901 0.635 6.0 OC-Flow-SO(3) -0.191 1.943 0.794 0.874 -50.947 -29.027 0.332 14.0 As shown in Table 12, the results indicate that updating rotations on SO(3) using optimal control outperforms the naive method in terms of energy optimization and stability. The significant performance gap can be attributed to the accuracy loss during the projection of the gradient onto so(3), which may disrupt the delicate dynamics of the SO(3) space. Furthermore, due to the complexity of the SO(3) Published as a conference paper at ICLR 2025 manifold, gradient-based methods are more prone to becoming trapped in local optima, whereas the optimal control-based algorithm provides a pathway toward achieving the global optimum. Our efforts to reproduce the Pep Flow baseline involved extensive steps, including reaching out to the original authors to address the absence of certain scripts and obtaining partial instructions for the evaluation pipelines. In the original Pep Flow paper, affinity and stability are reported as percentages. However, to facilitate more fine-grained comparisons, we opted to report absolute energy values instead. Additionally, to ensure fairness in comparison, we included IMP (Interaction Metric for Peptides) as an evaluation metric, aligning it with the affinity measure used in Pep Flow. Furthermore, our experiments were conducted using the latest version of Madra X, ensuring that our results are both robust and reproducible. These updates provide a consistent and comprehensive framework for evaluating and comparing future methods in peptide design.