# GO Hessian for Expectation-Based Objectives

Yulai Cong, Miaoyun Zhao,* Jianqiao Li, Junya Chen, Lawrence Carin
Department of Electrical and Computer Engineering, Duke University
*Equal contribution. Correspondence to: Miaoyun Zhao, miaoyun9zhao@gmail.com.

An unbiased low-variance gradient estimator, termed GO gradient, was proposed recently for expectation-based objectives $\mathbb{E}_{q_\gamma(y)}[f(y)]$, where the random variable (RV) $y$ may be drawn from a stochastic computation graph (SCG) with continuous (non-reparameterizable) internal nodes and continuous/discrete leaves. Based on the GO gradient, we present for $\mathbb{E}_{q_\gamma(y)}[f(y)]$ an unbiased low-variance Hessian estimator, named GO Hessian, which contains the deterministic Hessian as a special case. Considering practical implementation, we reveal that the GO Hessian in expectation obeys the chain rule and is therefore easy to use with auto-differentiation and Hessian-vector products, enabling cheap and efficient exploitation of curvature information over deep SCGs. As representative examples, we present the GO Hessian for non-reparameterizable gamma and negative binomial RVs/nodes. Leveraging the GO Hessian, we develop a new second-order method for $\mathbb{E}_{q_\gamma(y)}[f(y)]$, with challenging experiments conducted to verify its effectiveness and efficiency.

## 1 Introduction

Many machine learning problems can be formulated as an optimization problem involving an expectation. A classic such setup (Robbins and Monro 1951) is of the form

$$\text{Framework I:} \quad \min_{\vartheta}\; \mathcal{J}(\vartheta) \triangleq \mathbb{E}_{q(x)}[h(x, \vartheta)], \tag{1}$$

where the random variable (RV) $x$ obeys a distribution $q(x)$ unrelated to the parameters $\vartheta$ of interest, and $h(x, \vartheta)$ is a continuous function wrt $\vartheta$. General assumptions making $\mathcal{J}(\vartheta)$ (and the following $\mathcal{L}(\gamma)$) a valid loss function are omitted for simplicity. In practice one often encounters its finite-sum form $\min_{\vartheta} \frac{1}{N}\sum_{i=1}^{N} h(x_i, \vartheta)$ with $q(x) = \frac{1}{N}\sum_{i=1}^{N}\delta(x - x_i)$, where $\delta(\cdot)$ is the Dirac delta function (this discrete form is typically an approximation, based on $N$ observed samples drawn from the true underlying data distribution). A popular example of Framework I is maximum-likelihood learning with the data distribution $q(x)$ and the negative log-likelihood $h(x, \vartheta) = -\log p(x; \vartheta)$, where $p(x; \vartheta)$ represents the model.

An alternative framework, attracting increasing attention recently, considers the form

$$\text{Framework II:} \quad \min_{\gamma}\; \mathcal{L}(\gamma) \triangleq \mathbb{E}_{q_\gamma(y)}[f(y)], \tag{2}$$

where the parameters $\gamma$ of interest determine the distribution $q_\gamma(y)$ that, for example, models a stochastic computation graph (SCG) (Schulman et al. 2015a; Parmas 2018; Weber et al. 2019).[1] Note in general the function $f(\cdot)$ may also be related to $\gamma$; however, as the generalization is straightforward, we focus on the setup in (2) for simpler derivations. Popular examples of Framework II include the ELBO in variational inference (Bishop 2006; Kingma and Welling 2014), the generator training objective of generative adversarial networks (Goodfellow et al. 2014), and many objectives associated with reinforcement learning (RL) (Schulman et al. 2015b; Finn, Abbeel, and Levine 2017; Foerster et al. 2018).

[1] When the RV $y$ is reparameterizable (rep) (Salimans, Knowles et al. 2013; Rezende, Mohamed, and Wierstra 2014), Framework II can be rewritten as Framework I for optimization. To distinguish, we focus on the challenging non-rep setup for Framework II by default, although the proposed GO Hessian also applies to rep RVs.

Many optimization methods have been proposed for Framework I, utilizing first-order gradient information (Allen-Zhu 2018; Jin et al. 2019) or exploiting second-order Hessian information (Tripuraneni et al. 2018; Zhou and Gu 2020).
Compared with first-order methods, second-order methods are often characterized by convergence in fewer training iterations to a second-order stationary point, less tweaking of meta-parameters (like the learning rate), invariance to linear parameter rescaling, and better navigation of the pathological curvature common in deep learning (Martens 2010; Tripuraneni et al. 2018). For computational and memory efficiency in high dimensions (as with deep neural networks), recent second-order methods often resort to Hessian-free techniques, i.e., Hessian-vector products (HVPs), which can be computed as efficiently as gradients (Pearlmutter 1994) and remove the need to construct the full Hessian (Martens 2010; Kohler and Lucchi 2017; Kasai and Mishra 2018).

In contrast to the classic Framework I, few optimization methods have been proposed for Framework II with a general SCG $q_\gamma(y)$, partially because of the significant challenge in even estimating its gradient with low variance and without bias in general/non-reparameterizable (subsequently abbreviated as "non-rep") situations (Cong et al. 2019; Weber et al. 2019).

For second-order optimization of Framework II, most existing works resort to the log-trick,[2] often suffering from high variance and poor sample efficiency, and therefore seeking help from variance-reduction control variates with a potential variance-bias trade-off (Heess et al. 2015; Rothfuss et al. 2019). Moreover, to facilitate implementation via auto-differentiation (AD), cumbersome designs of surrogate losses and control variates are often necessary (Foerster et al. 2018; Mao et al. 2019), which are challenging when derivatives of different orders are used simultaneously (Farquhar, Whiteson, and Foerster 2019), as in meta reinforcement learning (Finn, Abbeel, and Levine 2017). Therefore, an easy-to-use unbiased (gradient and) Hessian estimator for Framework II, with low variance and high sample efficiency, is highly appealing (Liu, Socher, and Xiong 2019).

[2] Also named the likelihood-ratio, score-function, or REINFORCE (Williams 1992) estimator. See Section 3.1 for details.

Different from existing methods that leverage the log-trick, we follow a different research path (Figurnov, Mohamed, and Mnih 2018; Jankowiak and Obermeyer 2018; Cong et al. 2019) that tries to generalize classic deterministic derivatives/back-propagation to Framework II. Specifically, based on the GO gradient (Cong et al. 2019), we show that its double application delivers an unbiased low-variance Hessian estimator, termed GO Hessian, for Framework II in (2), where $y$ may be drawn from an SCG consisting of continuous rep/non-rep internal nodes[3] and continuous/discrete leaves, with neural networks as links.

[3] Note discrete internal nodes are not supported.
Besides proposing the GO Hessian, our other contributions include:

- We reveal that the GO Hessian contains the deterministic Hessian as a special case; both show intuitively simple patterns obeying the chain rule, except that the GO Hessian exhibits slight differences due to SCG randomness.
- We reveal that the GO Hessian is easy to use with existing AD software and HVPs, enabling computation- and memory-efficient exploitation of curvature information over SCGs; specifically, one can simulate one sample per SCG node in the forward pass through that SCG, followed by two backward passes to form a one-sample-based Hessian estimate, the same as for the deterministic Hessian.
- We derive the GO Hessian for non-rep gamma and negative binomial RVs; we also propose a simple yet effective method to make optimization over gamma RVs more friendly to gradient-based methods.
- We empirically show that the GO Hessian often works well with one sample and without variance-reduction techniques.
- Combining the GO Hessian with an existing method for Framework I (Tripuraneni et al. 2018), we present a novel second-order method for Framework II, theoretically analyze its convergence, and empirically verify its effectiveness and efficiency with challenging experiments.

## 2 Preliminary

### 2.1 General and One-sample (GO) Gradient

Containing as special cases the low-variance reparameterization gradient (Salimans, Knowles et al. 2013; Rezende, Mohamed, and Wierstra 2014) and the pathwise derivative (Figurnov, Mohamed, and Mnih 2018; Jankowiak and Obermeyer 2018),[4] the GO gradient (Cong et al. 2019) serves as a general framework of unbiased low-variance gradient estimates for Framework II in (2), where the RV $y$ may be drawn from an SCG with continuous rep/non-rep internal nodes and continuous/discrete leaves. With the GO gradient, one can forward-pass through the SCG with one sample activated per node to estimate the objective, followed by back-propagating an unbiased low-variance gradient estimate through each node to the parameters for updating (see Theorem 3 of Cong et al. (2019)). The low-variance and one-sample properties make the GO gradient easy to use in practice, e.g., in variational inference with a complicated inference distribution.

[4] See Appendix A for detailed comparisons among these reparameterization-like gradient estimators.

To introduce the approach, the simplest setup, i.e., a single-layer RV $y$ satisfying conditional independence $q_\gamma(y) = \prod_v q_\gamma(y_v)$, is employed to demonstrate the GO gradient, i.e.,

$$\nabla_\gamma \mathcal{L}(\gamma) = \nabla_\gamma \mathbb{E}_{q_\gamma(y)}[f(y)] = \mathbb{E}_{q_\gamma(y)}\big[ G_\gamma^{q_\gamma(y)} D_y f(y) \big], \tag{3}$$

where $D_y f(y) = [\cdots, D_{y_v} f(y), \cdots]^T$, with $D_{y_v} f(y) \triangleq \nabla_{y_v} f(y)$ for continuous $y_v$ and $D_{y_v} f(y) \triangleq f(y_{v+}) - f(y)$ for discrete $y_v$, where $y_{v+} \triangleq [\cdots, y_{v-1}, y_v + 1, y_{v+1}, \cdots]^T$. $G_\gamma^{q_\gamma(y)} = [\cdots, g_\gamma^{q_\gamma(y_v)}, \cdots]$ gathers the variable-nablas $g_\gamma^{q_\gamma(y_v)} \triangleq -\frac{1}{q_\gamma(y_v)} \nabla_\gamma Q_\gamma(y_v)$,[5] each of which has the intuitive meaning of the derivative of an RV $y_v$ wrt its parameters $\gamma$; $Q_\gamma(y_v)$ is the CDF of $q_\gamma(y_v)$.

[5] The notations for the variable-nabla and the variable-hess below are optimized to improve clarity when deep SCGs are of interest; refer to the derivations in Appendix C.3 and Cong et al. (2019).

With the variable-nabla, one can informally interpret $G_\gamma^{q_\gamma(y)}$ as the gradient of the RV $y$ wrt the parameters $\gamma$. Similar intuitive patterns hold for deep SCGs with continuous internal nodes. As an informal summary, the GO gradient in expectation obeys the chain rule and acts like its special case, the classic back-propagation algorithm (Rumelhart and Hinton 1986; Cong et al. 2019).
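For intuition, consider a simple worked example (illustrative only, not one of the paper's derivations): an exponential node $y \sim \text{Exp}(\lambda)$ with rate $\lambda$ has $q_\lambda(y) = \lambda e^{-\lambda y}$ and $Q_\lambda(y) = 1 - e^{-\lambda y}$, so

$$g_\lambda^{q_\lambda(y)} = -\frac{1}{q_\lambda(y)} \nabla_\lambda Q_\lambda(y) = -\frac{y\, e^{-\lambda y}}{\lambda\, e^{-\lambda y}} = -\frac{y}{\lambda},$$

which coincides with $\partial y / \partial \lambda = -y/\lambda$ under the reparameterization $y = \epsilon/\lambda$, $\epsilon \sim \text{Exp}(1)$, consistent with reading the variable-nabla as the derivative of the RV wrt its parameter.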
### 2.2 Hessian-free Techniques

Developed for efficient implementation of second-order optimization in high dimensions (as with deep neural networks, where explicit construction of the full Hessian is prohibitive), Hessian-free techniques (Martens 2010; Byrd et al. 2011) exploit HVPs for implicit use of Hessian information. An example technique to calculate an HVP leverages

$$[\nabla^2_\vartheta \mathcal{J}(\vartheta)]\, p = \nabla_\vartheta \big[ [\nabla_\vartheta \mathcal{J}(\vartheta)]^T p \big], \tag{4}$$

where $p$ is a vector uncorrelated with the parameters $\vartheta$ of interest. For better efficiency than the above 2-backward technique, Pearlmutter (1994) proposed a faster HVP calculation that takes about the same amount of computation as a gradient evaluation. The low-cost HVP is essential because common subsolvers used to search for second-order directions exploit Hessian information merely via HVPs (Agarwal et al. 2017; Tripuraneni et al. 2018; Zhou and Gu 2020).

### 2.3 Stochastic Cubic Regularization (SCR)

As a second-order method for Framework I in (1), SCR (Tripuraneni et al. 2018) searches for a second-order stationary point by iteratively minimizing a local third-order Taylor expansion of the objective $\mathcal{J}(\vartheta)$, i.e.,

$$\vartheta_{t+1} = \arg\min_\vartheta\; \mathcal{J}(\vartheta_t) + g_t^T(\vartheta - \vartheta_t) + \tfrac{1}{2}(\vartheta - \vartheta_t)^T H_t (\vartheta - \vartheta_t) + \tfrac{\rho}{6}\|\vartheta - \vartheta_t\|^3, \tag{5}$$

where $g_t = \nabla_\vartheta \mathcal{J}(\vartheta_t)$ and $H_t = \nabla^2_\vartheta \mathcal{J}(\vartheta_t)$ are the stochastic gradient and Hessian at $\vartheta_t$, respectively (often estimated with sub-sampled data batches), $\rho$ is the cubic penalty coefficient, and (5) can be solved efficiently with gradient descent (Carmon and Duchi 2016). Since Newton-like methods are much more tolerant of Hessian estimation error than of gradient error (Byrd et al. 2011), one can often use significantly fewer data samples to calculate the stochastic Hessian for better efficiency (Tripuraneni et al. 2018).

## 3 GO Hessian for Framework II

Targeting efficient second-order optimization of Framework II in (2), we first propose for it an unbiased low-variance Hessian estimator, termed the General and One-sample (GO) Hessian, that is based on the GO gradient (Cong et al. 2019) and is easy to use in practice. We then combine the proposed GO Hessian with SCR (Tripuraneni et al. 2018) to form a novel second-order method for Framework II.

### 3.1 GO Hessian

A straightforward way to estimate the Hessian of Framework II lies in exploiting the log-trick $\nabla_\gamma q_\gamma(y) = q_\gamma(y) \nabla_\gamma \log q_\gamma(y)$, that is,

$$\nabla^2_\gamma \mathcal{L}(\gamma) = \mathbb{E}_{q_\gamma(y)}\Big[ f(y)\, [\nabla_\gamma \log q_\gamma(y)][\nabla_\gamma \log q_\gamma(y)]^T + f(y)\, \nabla^2_\gamma \log q_\gamma(y) \Big]. \tag{6}$$

However, such a log-trick estimate shows high Monte Carlo (MC) variance in both theory and practice (Rezende, Mohamed, and Wierstra 2014; Ruiz, Titsias, and Blei 2016), often seeking help from variance-reduction techniques (Grathwohl et al. 2017; Mao et al. 2019). Besides, for practical implementation with AD, cumbersome designs of surrogate losses and control variates are often necessary (Foerster et al. 2018; Farquhar, Whiteson, and Foerster 2019).
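For concreteness, the one-sample log-trick estimate in (6) can be sketched in a few lines of PyTorch (a minimal illustration, not code from the paper; the callables `log_q` and `f` are user-supplied placeholders):

```python
import torch

def log_trick_hessian(log_q, f, y, gamma):
    """One-sample log-trick Hessian estimate of E_{q_gamma}[f(y)], per (6).

    log_q(y, gamma): differentiable log-density of q_gamma; gamma is a 1-D
    tensor with requires_grad=True; y is one sample drawn from q_gamma.
    """
    score, = torch.autograd.grad(log_q(y, gamma), gamma, create_graph=True)
    # Rows of the Hessian of log q, built by differentiating the score.
    hess_log_q = torch.stack([
        torch.autograd.grad(score[i], gamma, retain_graph=True)[0]
        for i in range(gamma.numel())])
    return f(y) * (torch.outer(score, score) + hess_log_q)

# Example: q_gamma = N(mu, exp(s)^2) with gamma = (mu, s).
gamma = torch.tensor([0.0, 0.0], requires_grad=True)
log_q = lambda y, g: torch.distributions.Normal(g[0], g[1].exp()).log_prob(y)
y = torch.distributions.Normal(0.0, 1.0).sample()
print(log_trick_hessian(log_q, lambda y: y ** 2, y, gamma))
```

As noted above, this estimator typically needs control variates before it is usable in practice; avoiding that burden is precisely the motivation for the GO Hessian.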
Different from the above log-trick-based method, our GO Hessian doesn't require additional designs of surrogate losses, and it estimates the curvature of Framework II in a pathwise manner, like the classic deterministic Hessian (as detailed below); moreover, it empirically shows low variance, often working well in practice with one-sample-based estimation and without variance-reduction techniques (see Figure 1 and the experiments).[6] To simplify notation, we first employ the single-layer setup with $q_\gamma(y) = \prod_v q_\gamma(y_v)$ to demonstrate our main results, which are then generalized to deep SCGs with continuous rep/non-rep internal nodes and continuous/discrete leaves. Proofs are given in Appendix C.

[6] Discussion of the lower variance of the GO Hessian relative to the log-trick estimation is provided in Appendix E.4.

[Figure 1: One-sample-based variance comparisons of the log-trick estimation and the GO Hessian on (left) $\nabla^2_{\{\alpha,\beta\}}\mathbb{E}_{\text{Gam}(\alpha,\beta)}[\log \frac{\text{Gam}(\alpha,\beta)}{\text{Gam}(10,10)}]$ and (right) $\nabla^2_{\{r,p\}}\mathbb{E}_{\text{NB}(r,p)}[\log \frac{\text{NB}(r,p)}{\text{NB}(10,0.5)}]$. See Appendix E for details.]

Motivated by the GO gradient, which leverages integration by parts to form a gradient estimate of Framework II in a pathwise manner, one may try to naively apply integration by parts twice to form a Hessian estimate; however, as shown in Appendix B, such naive derivations lead to complex terms that are difficult to interpret or implement. Fortunately, we find that (i) the variable-nabla, e.g., $g_\gamma^{q_\gamma(y_v)}$ that constitutes $G_\gamma^{q_\gamma(y)}$ in (3), is often differentiable with a simple expression (see Table 3 of Cong et al. (2019)); and (ii) the GO gradient has an expectation-based expression similar to the objective $\mathcal{L}(\gamma)$. Accordingly, we propose to view the GO gradient of $\mathcal{L}(\gamma)$ as another expectation-based (vector) objective, followed by calculating the GO gradient of that vector objective to form our GO Hessian of $\mathcal{L}(\gamma)$. Note this pattern of double application of the GO gradient to yield the GO Hessian mirrors the deterministic Hessian, which is in fact a special case of our GO Hessian, as proved below.

Assuming a single-layer continuous setup with $q_\gamma(y) = \prod_v q_\gamma(y_v)$, the GO Hessian is defined as

$$\nabla^2_\gamma \mathcal{L}(\gamma) = \nabla_\gamma \mathbb{E}_{q_\gamma(y)}\big[ G_\gamma^{q_\gamma(y)} \nabla_y f(y) \big] = \mathbb{E}_{q_\gamma(y)}\Big[ G_\gamma^{q_\gamma(y)} [\nabla^2_y f(y)] \big(G_\gamma^{q_\gamma(y)}\big)^T + \mathcal{H}_{\gamma\gamma}^{q_\gamma(y)}\, \nabla_y f(y) \Big], \tag{7}$$

where $\mathcal{H}_{\gamma\gamma}^{q_\gamma(y)}$ is a three-dimensional tensor with elements

$$\big[\mathcal{H}_{\gamma\gamma}^{q_\gamma(y)}\big]_{b,a,v} = g_{\gamma_b}^{q_\gamma(y_v)}\, \nabla_{y_v} g_{\gamma_a}^{q_\gamma(y_v)} + \nabla_{\gamma_b} g_{\gamma_a}^{q_\gamma(y_v)} \triangleq h_{\gamma_b \gamma_a}^{q_\gamma(y_v)}, \tag{8}$$

and the tensor-vector product $\mathcal{H} a$ outputs a matrix with elements $[\mathcal{H} a]_{b,a} = \sum_v \mathcal{H}_{b,a,v}\, a_v$. We name $h_{\gamma_b \gamma_a}^{q_\gamma(y_v)}$ the variable-hess, because of its intuitive meaning as the second-order derivative of an RV $y_v$ wrt parameters $\{\gamma_a, \gamma_b\}$ (see below). Note three components are necessary for employing the GO Hessian, i.e., an accessible variable-nabla $g_{\gamma_b}^{q_\gamma(y_v)}$ and its two gradients $\nabla_{y_v} g_{\gamma_a}^{q_\gamma(y_v)}$ and $\nabla_{\gamma_b} g_{\gamma_a}^{q_\gamma(y_v)}$, for each node $y_v$.[7]

[7] This roughly means an analytical PDF/CDF for $y_v$ is accessible.

For better understanding, we draw parallel comparisons to deterministic optimization with objective $\hat{\mathcal{L}}(\gamma) = f[\hat{y}(\gamma)]$, which is a special case of Framework II with $q_\gamma(y) = \delta(y - \hat{y}(\gamma))$ and where

$$\nabla^2_\gamma \hat{\mathcal{L}}(\gamma) = \nabla_\gamma \big[ [\nabla_\gamma \hat{y}(\gamma)][\nabla_{\hat y} f(\hat y)] \big] = [\nabla_\gamma \hat{y}(\gamma)][\nabla^2_{\hat y} f(\hat y)][\nabla_\gamma \hat{y}(\gamma)]^T + [\nabla^2_\gamma \hat{y}(\gamma)]\, \nabla_{\hat y} f(\hat y). \tag{9}$$
By comparing (7) and (9), interesting patterns arise: (i) the interpretation of $G_\gamma^{q_\gamma(y)}$ as the gradient of the RV $y$ wrt parameters $\gamma$ (informally $G_\gamma^{q_\gamma(y)} \approx \nabla_\gamma y$) also holds in second-order settings; (ii) the newly-introduced $\mathcal{H}_{\gamma\gamma}^{q_\gamma(y)}$ can be intuitively interpreted as the Hessian of the RV $y$ wrt parameters $\gamma$ (informally $\mathcal{H}_{\gamma\gamma}^{q_\gamma(y)} \approx \nabla^2_\gamma y$). In fact, the GO Hessian contains the deterministic Hessian as a special case, as detailed in Theorem 1 (see Appendix C.4 for the proof). Note in general the GO Hessian has an additional term due to the RV randomness (i.e., the first term of the variable-hess in (8)).

**On Discrete RVs.** For a single-layer/leaf discrete RV $y$ with PMF $q_\gamma(y) = \prod_v q_\gamma(y_v)$, the GO Hessian is of the form

$$\nabla^2_\gamma \mathcal{L}(\gamma) = \nabla_\gamma \mathbb{E}_{q_\gamma(y)}\big[ G_\gamma^{q_\gamma(y)} D_y f(y) \big] = \mathbb{E}_{q_\gamma(y)}\Big[ G_\gamma^{q_\gamma(y)} [D^2_y f(y)] \big(G_\gamma^{q_\gamma(y)}\big)^T + \mathcal{H}_{\gamma\gamma}^{q_\gamma(y)}\, D_y f(y) \Big], \tag{10}$$

where $\mathcal{H}_{\gamma\gamma}^{q_\gamma(y)}\, D_y f(y)$ represents a matrix with elements

$$\big[\mathcal{H}_{\gamma\gamma}^{q_\gamma(y)}\, D_y f(y)\big]_{b,a} = \sum_v \Big( \big[g_{\gamma_b}^{q_\gamma(y_v)}\, D_{y_v} g_{\gamma_a}^{q_\gamma(y_v)}\big] D_{y_v} f(y_{v+}) + \big[\nabla_{\gamma_b} g_{\gamma_a}^{q_\gamma(y_v)}\big] D_{y_v} f(y) \Big). \tag{11}$$

It is clear that (7) for continuous RVs and (10) for discrete RVs show similar patterns, with slight differences such as the gradient versus the difference of $f(y)$ and the definition of the variable-hess. We leave discrete situations for future research and focus mainly on continuous non-rep cases in this paper.

**On Deep SCGs.** Based on the above derivations for the single-layer continuous/discrete setup, we prove in Appendix C.3 that similar patterns also hold for deep SCGs with continuous internal nodes and continuous/discrete leaves. The main results are summarized in Theorem 1, with the corresponding complexity analysis given in Appendix D.

**Theorem 1 (GO Hessian).** For an expectation-based objective $\mathbb{E}_{q_\gamma(y)}[f(y)]$, where $q_\gamma(y) = \prod_i q_{\gamma_i}(y_i\,|\,\text{pa}(y_i))$ models an SCG with continuous internal nodes and continuous/discrete leaves (possibly with neural networks as links), $y = \{y_1, y_2, \cdots\}$, $\gamma = \{\gamma_1, \gamma_2, \cdots\}$, and each component satisfies conditional independence, i.e.,

$$q_{\gamma_i}(y_i\,|\,\text{pa}(y_i)) = \prod_v q_{\gamma_i}(y_{i,v}\,|\,\text{pa}(y_i)), \tag{12}$$

with $y_{i,v}$ the $v$-th element of $y_i$, $\text{pa}(y_i)$ the parent RVs of node $y_i$, and $q_{\gamma_i}(y_{i,v}|\text{pa}(y_i))$ having an accessible variable-nabla/variable-hess, double application of the GO gradient (Cong et al. 2019) to that objective delivers an unbiased low-variance Hessian estimator, i.e., the GO Hessian. The GO Hessian in expectation obeys the chain rule, exhibiting an expression similar to the deterministic Hessian (e.g., see (7) versus (9)), except with the variable-nabla/variable-hess serving as the first-order/second-order derivative of an RV. Moreover, when limiting each component $q_{\gamma_i}(y_i|\text{pa}(y_i))$ to a Dirac delta function, the GO Hessian degrades to the deterministic Hessian.

The similarity between the GO Hessian and its special case, the deterministic Hessian, is exploited in Section 3.2 to reveal the easy-to-use property of the GO Hessian in practical implementations.

**On the Applicability of the GO Hessian.** The conditional independence assumption in (12) may appear limited at first glance. It's worth noting that the GO Hessian also applies to a correlated multivariate $q_\gamma(y)$ if (i) the factorization $q_\gamma(y) = q_\gamma(y_1)\, q_\gamma(y_2|y_1)$ is accessible, as $q_\gamma(y)$ is then a deep SCG; or (ii) $q_\gamma(y)$ can be reparameterized to some extent (such as a multivariate normal or a Dirichlet). Multivariate RVs beyond these two cases are rarely used in practice. From an alternative perspective, our setup generalizes deterministic deep learning with RVs, leveraging sampling to replace/reinforce traditional nonlinear activation functions. Accordingly, the GO Hessian is deemed widely applicable to most statistical deep learning problems.
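Continuing the exponential worked example from Section 2.1 (again purely illustrative), the reading $\mathcal{H}_{\gamma\gamma}^{q_\gamma(y)} \approx \nabla^2_\gamma y$ can be checked directly: with $g_\lambda^{q_\lambda(y)} = -y/\lambda$, the variable-hess in (8) is

$$h_{\lambda\lambda}^{q_\lambda(y)} = g_\lambda^{q_\lambda(y)}\, \nabla_y g_\lambda^{q_\lambda(y)} + \nabla_\lambda g_\lambda^{q_\lambda(y)} = \Big(\!-\frac{y}{\lambda}\Big)\Big(\!-\frac{1}{\lambda}\Big) + \frac{y}{\lambda^2} = \frac{2y}{\lambda^2},$$

which matches $\partial^2 y / \partial\lambda^2 = 2\epsilon/\lambda^3 = 2y/\lambda^2$ under the reparameterization $y = \epsilon/\lambda$, $\epsilon \sim \text{Exp}(1)$.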
**On Limited $f(\cdot)$-information.** For challenging situations where only zero-order $f(y)$-information is available at the current sample $y$ (as in many RL applications (Baxter and Bartlett 2001; Shen et al. 2019)), our GO Hessian can be combined with the LAX technique (Grathwohl et al. 2017) to address the issue. Specifically, with a surrogate function $c_\omega(y)$ (often a neural network) approximating $f(y)$, we unify the zero-order $f$-evaluation of the log-trick estimation and the low variance of our GO Hessian via

$$\mathcal{H}_{\text{LAX}}[f] = \mathcal{H}_{\text{log-trick}}[f] - \mathcal{H}_{\text{log-trick}}[c_\omega] + \mathcal{H}_{\text{GO}}[c_\omega],$$

where $\mathcal{H}_{\text{method}}[\text{func}]$ denotes the Hessian estimator of the objective $\mathbb{E}_{q_\gamma(y)}[\text{func}(y)]$ based on the given method. The surrogate parameters $\omega$ can be optimized by minimizing the MC variance of $\mathcal{H}_{\text{LAX}}[f]$ (Grathwohl et al. 2017). Note when $c_\omega(y) = f(y)$, $\mathcal{H}_{\text{LAX}}[f]$ delivers the same low variance as our GO Hessian.

### 3.2 GO Hessian Is Easy to Use

In practice, explicitly constructing/storing the GO Hessian may be prohibitively expensive, especially for SCGs with neural-network-based links. Fortunately, we find the GO Hessian is easy to use in practice with AD and HVPs. The key observations are that (i) the Framework II objective, its GO gradient, and its GO Hessian are all expectations over the same $q_\gamma(y)$; and (ii) when estimated with one shared sample, they all act the same as their counterpart/special case in deterministic optimization, except with the variable-nabla/variable-hess serving as the first-order/second-order derivative at each RV node. Therefore, based on one shared sample, one can simply manipulate well-developed AD software (like PyTorch (Paszke et al. 2017) or TensorFlow (Abadi et al. 2015)) to guarantee the correct variable-nabla/variable-hess for each node, which in turn delivers easy-to-use exploitation of the one-sample-estimated GO Hessian via existing AD and HVP techniques. Multi-sample-based estimation can be implemented with parallel computation.

Figure 2(a) shows a demonstrative example where, thanks to conditional independence, we zoom in on a scalar RV node $y$ of the SCG for illustration; $\alpha$ denotes the distribution parameters of that node (e.g., the shape and rate of a gamma RV), and one sample is stochastically activated for the subsequent forward pass. To exploit the one-sample-estimated GO Hessian over the SCG with AD/HVP, one may employ the general approach in Figure 2(b) to guarantee the correct variable-nabla/variable-hess for each node, which then delivers seamless (double) back-propagation through that SCG (as the remaining computations are deterministic and well-defined in existing AD software).

```python
# (b) PyTorch-like pseudo code
class GOActivation(Function):
    def forward(ctx, α):
        y ~ q_α(y)                   # stochastically activate one sample
        ctx.save_for_backward(α, y)
        return y

    def backward(ctx, grad_y):
        α, y = ctx.saved_tensors
        # Value equals grad_y * g_α; each paired "x*S[·] - S[x·]" term is
        # zero in value but exposes ∇_y g_α or ∇_α g_α to a second backward.
        grad_α = grad_y * ( S[g_α]
                            + y * S[∇_y g_α] - S[y ∇_y g_α]
                            + α * S[∇_α g_α] - S[α ∇_α g_α] )
        return grad_α
```

Figure 2: Illustration of the simplicity of jointly implementing the GO gradient and Hessian. (a) Forward/backward pass: a red circle is an RV node; black solid arrows indicate both the forward pass and the parameters $\gamma$ of the SCG; red dashed arrows show the backward pass through the current node. (b) PyTorch-like pseudo code, where $S[\cdot]$ is the stop-gradient operator and $g_\alpha$ abbreviates the variable-nabla $g_\alpha^{q_\alpha(y)}$.

Note simpler approaches than the one in Figure 2(b) potentially exist, e.g., one associated with the method of Pearlmutter (1994) or one designed for a specific RV.
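As a concrete instance of such an RV-specific approach, below is a minimal runnable sketch (an illustration, not the paper's released code) for a single exponential node, whose variable-nabla and variable-hess from the earlier worked example are $g = -y/\lambda$ and $h = 2y/\lambda^2$; a second-order stop-gradient surrogate attaches them to the sample so that two plain backward passes recover the one-sample GO gradient and GO Hessian:

```python
import torch

def go_sample_exp(lam, n):
    """One-sample GO surrogate for y ~ Exp(lam) (rate parameterization).

    The returned tensor equals the sample in value, while its first/second
    derivatives wrt lam equal the variable-nabla g = -y/lam and the
    variable-hess h = 2*y/lam**2, so plain autodiff of f(y) yields the
    one-sample GO gradient and GO Hessian estimates.
    """
    with torch.no_grad():
        y = torch.distributions.Exponential(lam).sample((n,))
        g = -y / lam                # variable-nabla
        h = 2.0 * y / lam ** 2      # variable-hess, cf. (8)
    d = lam - lam.detach()          # zero in value; carries gradients
    return y + g * d + 0.5 * h * d ** 2

lam = torch.tensor(2.0, requires_grad=True)
y = go_sample_exp(lam, 100000)
loss = (y ** 2).mean()              # f(y) = y^2, with E[y^2] = 2 / lam^2
grad, = torch.autograd.grad(loss, lam, create_graph=True)
hess, = torch.autograd.grad(grad, lam)
print(grad.item(), hess.item())     # ~ -4/lam^3 = -0.5 and 12/lam^4 = 0.75
```

The last two lines are the 2-backward pattern of (4) specialized to a scalar parameter; for vector-valued $\gamma$, one would contract the first gradient with a probe vector before the second backward pass, obtaining an HVP without ever forming the full Hessian.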
### 3.3 Second-order Optimization of Framework II

Benefiting from the low-variance and easy-to-use properties of the GO Hessian, one can readily combine it with existing second-order methods for Framework I to develop novel variants for Framework II in (2). Considering practical applications like variational inference, we employ a more common objective here for presentation, i.e., $\min_\gamma \mathcal{L}(\gamma) \triangleq \mathbb{E}_{q(x) q_\gamma(y|x)}[f(x, y)]$, where $q(x)$ is the data distribution. We employ stochastic cubic regularization (SCR) (Tripuraneni et al. 2018),[8] which exploits stochastic gradient/Hessian information within its subroutine (see (5)). Leveraging the GO gradient and our GO Hessian in place of the classic gradient/Hessian, we present in Algorithm 1 a new second-order method for Framework II, termed SCR-GO. The Cubic-Subsolver and Cubic-Finalsolver (given in Appendix F) minimize the local third-order Taylor approximation of Framework II, mimicking (5). A detailed convergence analysis is provided in Appendix G, where a gamma-related special case is discussed.

[8] One may instead consider SRVRC_free from Zhou and Gu (2020).

```text
Algorithm 1  SCR-GO for min_γ E_{q(x) q_γ(y|x)}[f(x, y)]
Input: batch sizes N_g, N_H; initialization γ_0; total iterations T; final tolerance ε.
Output: ε-second-order stationary point γ* or γ_{T+1}.
for t = 0, 1, ..., T do
    Sample {x_i}_{i=1..N_g} and {x'_j}_{j=1..N_H} from q(x)
    Sample y_i ~ q_{γ_t}(y|x_i) and y'_j ~ q_{γ_t}(y|x'_j)
    Estimate the GO gradient g_t with {x_i, y_i}_{i=1..N_g}
    Estimate the GO Hessian H_t[·] with {x'_j, y'_j}_{j=1..N_H}
    Δ, Δ_m ← Cubic-Subsolver(g_t, H_t[·], ε)
    γ_{t+1} = γ_t + Δ
    if Δ_m > −√(ε³/ρ)/100 then
        Δ ← Cubic-Finalsolver(g_t, H_t[·], ε)
        γ* = γ_t + Δ and break
```

## 4 GO Hessian for Common RVs

Based on the variable-nablas summarized in Table 3 of Cong et al. (2019) and the definitions in (8) and (11), one can derive the variable-hess (or its components) for many kinds of RVs,[9] which are essential for easy-to-use curvature exploitation over SCGs flexibly constructed from those RVs. Although the derivations for most RVs are straightforward, we highlight two challenging but interesting RVs for demonstration, following Cong et al. (2019): the continuous non-rep gamma RV and the discrete negative binomial RV.

[9] Note for each RV type, one need only derive the variable-hess/components once; the expressions are then reusable.

### 4.1 GO Hessian for Non-rep Gamma RVs

To demonstrate the effectiveness of the proposed techniques, we focus on situations with non-rep gamma RVs in our experiments. This concentration is motivated by their broad practical applications (Boland 2007; Mendoza-Parra et al. 2013; Al-Ahmadi 2014; Wright et al. 2014; Belikov 2017) and by their fundamental utility in statistics and machine learning. For example, many popular distributions can be reparameterized as gamma (Leemis and McQueston 2008), such as the exponential, chi-squared, inverse-gamma, log-gamma, beta, and Dirichlet; others can be mixed via gamma (Zhou et al. 2012; Zhou and Carin 2015), like the gamma-normal-mixed Student-t and the gamma-Poisson-mixed negative binomial. Accordingly, the presented techniques for gamma RVs readily extend to those gamma-related cases of Framework II.

As discussed following (8) and shown in Figure 2(b), three components are crucial in constructing a GO Hessian for a continuous RV $y$ with PDF $q_{\boldsymbol\alpha}(y)$, that is,

$$g_{\boldsymbol\alpha}^{q_{\boldsymbol\alpha}(y)}, \quad \nabla_y g_{\boldsymbol\alpha}^{q_{\boldsymbol\alpha}(y)}, \quad \text{and} \quad \nabla_{\boldsymbol\alpha} g_{\boldsymbol\alpha}^{q_{\boldsymbol\alpha}(y)}. \tag{13}$$

For a gamma RV $\hat{y} \sim \text{Gam}(\alpha, \beta)$, the distribution parameters $\boldsymbol\alpha$ in general contain both the shape $\alpha$ and the rate $\beta$.
However, noticing the reparameterization $\hat{y} = y/\beta$ with $y \sim \text{Gam}(\alpha, 1)$, one can leave the derivatives wrt $\beta$ to AD for simplicity and focus solely on the non-rep part associated with $\alpha$. Accordingly, we only need to calculate the three components in (13) for $q_\alpha(y) = \text{Gam}(y; \alpha, 1)$. Moving the detailed derivations to Appendix H for clarity, we obtain

$$\begin{aligned} g_\alpha^{q_\alpha(y)} &= -[\log y - \psi(\alpha)]\, \gamma(\alpha, y)\, y^{1-\alpha} e^{y} + y\, e^{y} \alpha^{-2}\, {}_2F_2(\alpha, \alpha; \alpha+1, \alpha+1; -y), \\ \nabla_y g_\alpha^{q_\alpha(y)} &= [\psi(\alpha) - \log y] + g_\alpha^{q_\alpha(y)}\, (y - \alpha + 1)/y, \\ \nabla_\alpha g_\alpha^{q_\alpha(y)} &= \psi^{(1)}(\alpha)\, \gamma(\alpha, y)\, y^{1-\alpha} e^{y} + [\log y - \psi(\alpha)]\, y\, e^{y} \alpha^{-2}\, {}_2F_2(\alpha, \alpha; \alpha+1, \alpha+1; -y) \\ &\quad - 2 y\, e^{y} \alpha^{-3}\, {}_3F_3(\alpha, \alpha, \alpha; \alpha+1, \alpha+1, \alpha+1; -y), \end{aligned} \tag{14}$$

where $\psi(\alpha)$ is the digamma function, $\psi^{(m)}(x)$ the polygamma function of order $m$, $\gamma(\alpha, y)$ the lower incomplete gamma function, and ${}_pF_q(a_1, \cdots, a_p; b_1, \cdots, b_q; x)$ the generalized hypergeometric function. Reparameterizing the rate $\beta$ first, and then substituting the components in (14) into the approach of Figure 2(b), enables easy-to-use exploitation of the GO Hessian with AD over a non-rep gamma node. Figure 1 illustrates the low variance of our GO Hessian.

[Figure 3: (a-b) Demonstration of reverse KL divergences and SGD trajectories in the α-β and µ-σ parameter spaces for Section 4.1. The subscript indicates the parameter space. SGD_{α,β} leverages a learning rate of 10, while the more efficient SGD_{µ,σ} only uses 10⁻³. The red star denotes the optimum. 2,000 iterations are used to generate both trajectories, which are adapted to the other side for clear comparisons. (c-d) Training objectives versus the number of oracle calls for Section 5.1. The same curve of SGD_{µ,σ} is shown in both plots for clear comparisons.]

**A Gradient-friendly Reparameterization.** To model a gamma node within an SCG, a naive method would parameterize the shape $\alpha = \text{softplus}(\gamma_\alpha)$ and rate $\beta = \text{softplus}(\gamma_\beta)$, where without loss of generality $\gamma = \{\gamma_\alpha, \gamma_\beta\}$ is considered the parameters of interest. However, we find empirically that such naive modeling may not be friendly to gradient-based methods, especially when the target shape and/or rate are large. Figure 3(a) shows an example with the reverse KL objective $\text{KL}[\text{Gam}(y; \alpha, \beta) \,\|\, \text{Gam}(y; 200, 200)]$; with that modeling, SGD (labeled SGD$_{\alpha,\beta}$) bounces between the two slopes at the bottom of the valley and advances slowly. Alternatively, noticing that the valley bottom is approximately located on a line where $\text{Gam}(y; \alpha, \beta)$ shares the same mean as the target, we propose to reparameterize via the mean $\mu$ and standard deviation $\sigma$, i.e., $q_\gamma(y) = \text{Gam}(y; \frac{\mu^2}{\sigma^2}, \frac{\mu}{\sigma^2})$ with $\mu = \text{softplus}(\gamma_\mu)$ and $\sigma = \text{softplus}(\gamma_\sigma)$. With this reparameterization, we obtain an approximately decorrelated objective surface (see Figure 3(b)) that is more friendly to gradient-based methods; SGD in the µ-σ space, termed SGD$_{\mu,\sigma}$, clearly converges to the optimum much faster.

### 4.2 GO Hessian for Discrete NB RVs

For a negative binomial (NB) RV $y \sim q_{\boldsymbol\alpha}(y) = \text{NB}(r, p)$, the distribution parameters $\boldsymbol\alpha = \{r, p\}$ contain both the number of failures $r$ and the success probability $p$. From the definition in (11), three components are necessary to calculate the GO Hessian, i.e., $g_{\boldsymbol\alpha}^{q_{\boldsymbol\alpha}(y)}$, $D_y g_{\boldsymbol\alpha}^{q_{\boldsymbol\alpha}(y)}$, and $\nabla_{\boldsymbol\alpha} g_{\boldsymbol\alpha}^{q_{\boldsymbol\alpha}(y)}$. Due to space constraints, the analytic expressions and detailed derivations are given in Appendix I. The low variance of the GO Hessian is demonstrated in Figure 1.
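Before moving to the experiments, the µ-σ parameterization of Section 4.1 can be made concrete with a short sketch (an illustration only; it assumes PyTorch's `Gamma(concentration, rate)` convention):

```python
import torch
import torch.nn.functional as F

def gamma_mu_sigma(gamma_mu, gamma_sigma):
    """Gradient-friendly gamma parameterization of Section 4.1:
    shape = mu^2/sigma^2 and rate = mu/sigma^2, so the resulting gamma RV
    has mean mu and standard deviation sigma."""
    mu = F.softplus(gamma_mu)
    sigma = F.softplus(gamma_sigma)
    return torch.distributions.Gamma(mu ** 2 / sigma ** 2, mu / sigma ** 2)

q = gamma_mu_sigma(torch.tensor(5.0), torch.tensor(1.0))
print(q.mean, q.variance)  # mean ≈ softplus(5.0), variance ≈ softplus(1.0)**2
```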
## 5 Experiments

The proposed techniques are verified with challenging experiments where non-rep gamma RVs are of interest. Generalizing Section 4.1, we first test our SCR-GO on minimizing the reverse KL divergence between two gamma RVs. Next, we consider mean-field variational inference for Poisson factor analysis (PFA, closely related to LDA (Blei, Ng, and Jordan 2003)) (Zhou and Carin 2015; Zhou, Cong, and Chen 2015). Finally, concerning deep neural networks, SCR-GO is tested on training variational encoders, mimicking the VAE (Kingma and Welling 2014), for PFA and for its deep generalization, the Poisson gamma belief network (PGBN) (Zhou, Cong, and Chen 2016; Cong et al. 2017).

**Experimental Settings.** We follow (Xu, Roosta-Khorasani, and Mahoney 2017; Tripuraneni et al. 2018; Yu, Xu, and Gu 2018; Roosta et al. 2018; Kasai and Mishra 2018) in showing training objectives versus the number of oracle calls (calculations of gradients and/or HVPs); this is deemed a fair metric because it is independent of implementation details and system configurations, and ideally an HVP takes about the same amount of computation as a gradient (Pearlmutter 1994).[10] We compare SCR-GO to standard SGD and the popular Adam (Kingma and Ba 2014), both of which leverage the GO gradient. For both SGD and Adam, learning rates from {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1} are tested, with the best-tuned results shown. Other settings are given in Appendix J. Code will be available at github.com/YulaiCong/GOHessian.

[10] This is the ideal/theoretical situation for evaluation. However, it may not hold for our current implementation, which uses the less efficient 2-backward technique in (4) (due to missing forward-mode AD) and calculates special functions with a surrogate library (see Appendix H). The implementation also precludes fair wall-clock-time comparisons. With our implementation/computer, a GO-HVP in the gamma-related experiments is about 3 times more expensive than a GO gradient.

### 5.1 Minimizing the Reverse KL Divergence Between Gamma RVs

To demonstrate the effectiveness of the µ-σ reparameterization introduced in Section 4.1 and the efficiency gained from exploiting curvature information via the GO Hessian, we first consider a simplified example with the objective $\text{KL}[\text{Gam}(y; \alpha, \beta) \,\|\, \text{Gam}(y; 200, 1)]$. SGD, Adam, and our SCR-GO are compared within both the α-β and µ-σ parameter spaces.

The training curves of the compared methods are given in Figures 3(c)-3(d). Comparing SGD$_{\alpha,\beta}$ with SGD$_{\mu,\sigma}$ in Figure 3(c), it is clear that the µ-σ reparameterization leads to much faster convergence with smoother training curves, similar to those from deterministic optimization. By contrast, SGD$_{\alpha,\beta}$ visits both high and low KL values frequently (bouncing within the valley bottom as shown in Figure 3(a)), with much slower convergence to the optimum. Thanks to the exploited curvature information, our SCR-GO$_{\alpha,\beta}$ shows clearly improved convergence relative to SGD$_{\alpha,\beta}$. When moving to the µ-σ space (see Figure 3(d)), our SCR-GO$_{\mu,\sigma}$ delivers even faster and more stable convergence than its counterpart SCR-GO$_{\alpha,\beta}$, as well as SGD$_{\mu,\sigma}$ and Adam$_{\mu,\sigma}$, demonstrating the effectiveness of both the µ-σ reparameterization and the curvature exploitation via the GO Hessian.
### 5.2 Mean-field Variational Inference for PFA

For practical applications, we leverage the proposed techniques to develop efficient mean-field variational inference for PFA, whose generative process is

$$p_\theta(x, z): \quad x \sim \text{Pois}(x \,|\, W z), \quad z \sim \text{Gam}(z \,|\, \alpha_0, \beta_0),$$

where $x$ is the count data variable, $W$ the topic matrix with each column/topic $w_k$ located in the simplex (i.e., $w_{vk} > 0$, $\sum_v w_{vk} = 1$), $z$ the latent code, and $\theta = \{W, \alpha_0, \beta_0\}$. For mean-field variational inference, we assume the variational approximation distribution $q_\phi(z) = \text{Gam}(z; \frac{\mu^2}{\sigma^2}, \frac{\mu}{\sigma^2})$ with $\phi = \{\mu, \sigma\}$. Accordingly, given the training dataset $\{x_1, \cdots, x_N\}$, the objective is to maximize

$$\text{ELBO}(\theta, \{\phi_i\}) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{q_{\phi_i}(z_i)}\big[ \log p_\theta(x_i, z_i) - \log q_{\phi_i}(z_i) \big].$$

[Figure 4: Training curves of mean-field variational inference (a) and variational encoders (b-c) for PFA on MNIST (axes: training objective versus the number of oracle calls in (a-b) and versus the number of processed observations in (c)). Variances are estimated based on 5 random seeds. See Appendix J.2 and J.3 for more results.]

The training curves from a well-tuned Adam optimizer and our SCR-GO are shown in Figure 4(a) (see Appendix J.2 for more results). It is clear that with the additional curvature information exploited via GO-HVPs, our SCR-GO exhibits faster convergence to a better local optimum, with lower variance than the well-tuned Adam optimizer.

### 5.3 Variational Encoders for PFA and PGBN

To test the effectiveness of the presented techniques when combined with deep neural networks, we consider developing a variational encoder for PFA mimicking the VAE, that is,

$$q_\phi(z|x) = \text{Gam}\Big(z; \frac{\mu^2}{\sigma^2}, \frac{\mu}{\sigma^2}\Big), \quad \mu = \text{NN}_\mu(x), \quad \sigma = \text{NN}_\sigma(x), \tag{15}$$

where $\text{NN}(\cdot)$ denotes a neural network and $\phi$ contains all the parameters of $\text{NN}_\mu(\cdot)$ and $\text{NN}_\sigma(\cdot)$. The objective is to maximize $\text{ELBO}(\theta, \phi) = \mathbb{E}_{q_\phi(z|x)}\big[ \log p_\theta(x, z) - \log q_\phi(z|x) \big]$.

Figures 4(b)-4(c) show the training objectives versus the number of oracle calls and processed observations. It is clear that the proposed SCR-GO performs better than a well-tuned Adam optimizer in terms of oracle calls and data efficiency when applied to a model with deep neural networks. The better performance of SCR-GO is attributed to its exploitation of curvature information via the GO Hessian, which takes into consideration the correlation among parameters within $p_\theta(x, z)$ and $q_\phi(z|x)$, and utilizes an (implicit) adaptive learning rate, mimicking the classical Newton's method.

For further testing under more challenging settings with a hierarchically-structured $q_\gamma(y)$ for Framework II in (2), we consider developing a variational encoder for a 2-layer PGBN. Specifically, with $z = \{z_1, z_2\}$,

$$p_\theta(x, z): \quad x \sim \text{Pois}(x \,|\, W_1 z_1), \quad z_1 \sim \text{Gam}(z_1 \,|\, W_2 z_2, c_2), \quad z_2 \sim \text{Gam}(z_2 \,|\, \alpha_0, \beta_0),$$
$$q_\phi(z|x): \quad z_2 \sim q_{\phi_2}(z_2 \,|\, z_1), \quad z_1 \sim q_{\phi_1}(z_1 \,|\, x),$$

where $\theta = \{W_1, W_2, c_2, \alpha_0, \beta_0\}$, $\phi = \{\phi_1, \phi_2\}$, and both $q_{\phi_2}(z_2|z_1)$ and $q_{\phi_1}(z_1|x)$ are constructed as in (15). Due to space constraints, the experimental details and results are given in Appendix J.3, where one observes plots similar to those in Figure 4, confirming the effectiveness and efficiency of the presented techniques.

## 6 Conclusions

An unbiased low-variance Hessian estimator, termed GO Hessian, is proposed to efficiently exploit curvature information for an expectation-based objective over an SCG with continuous rep/non-rep internal nodes and continuous/discrete leaves. Containing the deterministic Hessian as a special case, the GO Hessian is easy to use with AD and HVPs, enabling low-cost second-order optimization over deep SCGs. Based on the proposed GO Hessian, a new second-order optimization method is proposed for the expectation-based objective, which empirically performs better than a well-tuned Adam optimizer in challenging situations with non-rep gamma RVs.

## Acknowledgments

We thank the anonymous reviewers for their constructive comments that helped improve the paper. The research was supported in part by DARPA, DOE, NIH, NSF and ONR.
The Titan Xp GPU used was donated by the NVIDIA Corporation.

## References

Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Goodfellow, I.; Harp, A.; Irving, G.; Isard, M.; Jia, Y.; Jozefowicz, R.; Kaiser, L.; Kudlur, M.; Levenberg, J.; Mané, D.; Monga, R.; Moore, S.; Murray, D.; Olah, C.; Schuster, M.; Shlens, J.; Steiner, B.; Sutskever, I.; Talwar, K.; Tucker, P.; Vanhoucke, V.; Vasudevan, V.; Viégas, F.; Vinyals, O.; Warden, P.; Wattenberg, M.; Wicke, M.; Yu, Y.; and Zheng, X. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. URL https://www.tensorflow.org/. Software available from tensorflow.org.

Agarwal, N.; Allen-Zhu, Z.; Bullins, B.; Hazan, E.; and Ma, T. 2017. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, 1195–1199. ACM.

Al-Ahmadi, S. 2014. The gamma-gamma signal fading model: A survey [wireless corner]. IEEE Antennas and Propagation Magazine 56(5): 245–260.

Allen-Zhu, Z. 2018. Natasha 2: Faster non-convex optimization than SGD. In NeurIPS, 2675–2686.

Baxter, J.; and Bartlett, P. 2001. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research 15: 319–350.

Belikov, A. 2017. The number of key carcinogenic events can be predicted from cancer incidence. Scientific Reports 7(1): 1–8.

Bishop, C. 2006. Pattern Recognition and Machine Learning. Springer.

Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. JMLR 3: 993–1022.

Boland, P. 2007. Statistical and Probabilistic Methods in Actuarial Science. Chapman and Hall/CRC.

Byrd, R.; Chin, G.; Neveitt, W.; and Nocedal, J. 2011. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization 21(3): 977–995.

Carmon, Y.; and Duchi, J. 2016. Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547.

Cong, Y.; Chen, B.; Liu, H.; and Zhou, M. 2017. Deep latent Dirichlet allocation with topic-layer-adaptive stochastic gradient Riemannian MCMC. In ICML.

Cong, Y.; Zhao, M.; Bai, K.; and Carin, L. 2019. GO Gradient for Expectation-Based Objectives. In ICLR. URL https://openreview.net/forum?id=ryf6Fs09YX.

Farquhar, G.; Whiteson, S.; and Foerster, J. 2019. Loaded DiCE: Trading off Bias and Variance in Any-Order Score Function Gradient Estimators for Reinforcement Learning. In NeurIPS, 8149–8160.

Figurnov, M.; Mohamed, S.; and Mnih, A. 2018. Implicit Reparameterization Gradients. In NeurIPS.

Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 1126–1135. JMLR.org.

Foerster, J.; Farquhar, G.; Al-Shedivat, M.; Rocktäschel, T.; Xing, E.; and Whiteson, S. 2018. DiCE: The infinitely differentiable Monte-Carlo estimator. In ICML, 10204–10214.

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS, 2672–2680.

Grathwohl, W.; Choi, D.; Wu, Y.; Roeder, G.; and Duvenaud, D. 2017. Backpropagation through the Void: Optimizing control variates for black-box gradient estimation. arXiv preprint arXiv:1711.00123.

Heess, N.; Wayne, G.; Silver, D.; Lillicrap, T.; Erez, T.; and Tassa, Y. 2015. Learning continuous control policies by stochastic value gradients. In NeurIPS, 2944–2952.
Jankowiak, M.; and Obermeyer, F. 2018. Pathwise Derivatives Beyond the Reparameterization Trick. In ICML.

Jin, C.; Netrapalli, P.; Ge, R.; Kakade, S.; and Jordan, M. 2019. Stochastic Gradient Descent Escapes Saddle Points Efficiently. arXiv preprint arXiv:1902.04811.

Kasai, H.; and Mishra, B. 2018. Inexact trust-region algorithms on Riemannian manifolds. In NeurIPS, 4249–4260.

Kingma, D.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P.; and Welling, M. 2014. Auto-encoding variational Bayes. In ICLR.

Kohler, J.; and Lucchi, A. 2017. Sub-sampled cubic regularization for non-convex optimization. In ICML, 1895–1904. JMLR.org.

Leemis, L.; and McQueston, J. 2008. Univariate distribution relationships. The American Statistician 62(1): 45–53.

Liu, H.; Socher, R.; and Xiong, C. 2019. Taming MAML: Efficient unbiased meta-reinforcement learning. In ICML, 4061–4071.

Mao, J.; Foerster, J.; Rocktäschel, T.; Al-Shedivat, M.; Farquhar, G.; and Whiteson, S. 2019. A baseline for any order gradient estimation in stochastic computation graphs.

Martens, J. 2010. Deep learning via Hessian-free optimization. In ICML, volume 27, 735–742.

Mendoza-Parra, M.; Nowicka, M.; Van Gool, W.; and Gronemeyer, H. 2013. Characterising ChIP-seq binding patterns by model-based peak shape deconvolution. BMC Genomics 14(1): 834.

Parmas, P. 2018. Total stochastic gradient algorithms and applications in reinforcement learning. In NeurIPS, 10204–10214.

Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch.

Pearlmutter, B. 1994. Fast exact multiplication by the Hessian. Neural Computation 6(1): 147–160.

Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. In ICML.

Robbins, H.; and Monro, S. 1951. A stochastic approximation method. The Annals of Mathematical Statistics 400–407.

Roosta, F.; Liu, Y.; Xu, P.; and Mahoney, M. 2018. Newton-MR: Newton's Method Without Smoothness or Convexity. arXiv preprint arXiv:1810.00303.

Rothfuss, J.; Lee, D.; Clavera, I.; Asfour, T.; and Abbeel, P. 2019. ProMP: Proximal meta-policy search. In ICLR.

Ruiz, F. J. R.; Titsias, M. K.; and Blei, D. 2016. The generalized reparameterization gradient. In NIPS, 460–468.

Rumelhart, D.; and Hinton, G. 1986. Learning representations by back-propagating errors. Nature 323(9).

Salimans, T.; Knowles, D. A.; et al. 2013. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis 8(4): 837–882.

Schulman, J.; Heess, N.; Weber, T.; and Abbeel, P. 2015a. Gradient estimation using stochastic computation graphs. In NIPS, 3528–3536.

Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; and Abbeel, P. 2015b. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.

Shen, Z.; Ribeiro, A.; Hassani, H.; Qian, H.; and Mi, C. 2019. Hessian aided policy gradient. In ICML, 5729–5738.

Tripuraneni, N.; Stern, M.; Jin, C.; Regier, J.; and Jordan, M. 2018. Stochastic cubic regularization for fast nonconvex optimization. In NeurIPS, 2899–2908.

Weber, T.; Heess, N.; Buesing, L.; and Silver, D. 2019. Credit assignment techniques in stochastic computation graphs. arXiv preprint arXiv:1901.01761.

Williams, R. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4): 229–256.
Wright, M.; Winter, I.; Forster, J.; and Bleeck, S. 2014. Response to best-frequency tone bursts in the ventral cochlear nucleus is governed by ordered inter-spike interval statistics. Hearing Research 317: 23–32.

Xu, P.; Roosta-Khorasani, F.; and Mahoney, M. 2017. Second-order optimization for non-convex machine learning: An empirical study. arXiv preprint arXiv:1708.07827.

Yu, Y.; Xu, P.; and Gu, Q. 2018. Third-order smoothness helps: Faster stochastic optimization algorithms for finding local minima. In NeurIPS, 4525–4535.

Zhou, D.; and Gu, Q. 2020. Stochastic recursive variance-reduced cubic regularization methods. In AISTATS, 3980–3990.

Zhou, M.; and Carin, L. 2015. Negative binomial process count and mixture modeling. TPAMI 37(2): 307–320.

Zhou, M.; Cong, Y.; and Chen, B. 2015. The Poisson gamma belief network. In NIPS, 3025–3033.

Zhou, M.; Cong, Y.; and Chen, B. 2016. Augmentable gamma belief networks. JMLR 17(1): 5656–5699.

Zhou, M.; Hannah, L.; Dunson, D. B.; and Carin, L. 2012. Beta-Negative Binomial Process and Poisson Factor Analysis. In AISTATS, 1462–1471.