# DNA-SE: Towards Deep Neural-Nets Assisted Semiparametric Estimation

Qinshuo Liu 1, Zixin Wang 2, Xi-An Li 3, Xinyao Ji, Lei Zhang 3, Lin Liu 3 4, Zhonghua Liu 5

Abstract

Semiparametric statistics play a pivotal role in a wide range of domains, including but not limited to missing data, causal inference, and transfer learning. In many settings, semiparametric theory leads to (nearly) statistically optimal procedures that nonetheless involve numerically solving Fredholm integral equations of the second kind. Traditional numerical methods, such as polynomial or spline approximations, are difficult to scale to multi-dimensional problems. Alternatively, statisticians may choose to approximate the original integral equations by ones with closed-form solutions, resulting in computationally more efficient, but statistically suboptimal or even incorrect procedures. To bridge this gap, we propose a novel framework by formulating the semiparametric estimation problem as a bi-level optimization problem; and then we develop a scalable algorithm called Deep Neural-Nets Assisted Semiparametric Estimation (DNA-SE) by leveraging the universal approximation property of Deep Neural-Nets (DNN) to streamline semiparametric procedures. Through extensive numerical experiments and a real data analysis, we demonstrate the numerical and statistical advantages of DNA-SE over traditional methods. To the best of our knowledge, we are the first to bring DNN into semiparametric statistics as a numerical solver of integral equations in our proposed general framework.

The two corresponding authors are alphabetically ordered and contributed equally to this project. 1 Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong SAR, China; 2 School of Life Sciences, Shanghai Jiao Tong University, Shanghai, China; 3 Institute of Natural Sciences, MOE-LSC, School of Mathematical Sciences, CMA-Shanghai, Shanghai Jiao Tong University, Shanghai, China; 4 SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China; 5 Department of Biostatistics, Columbia University, New York, NY, USA. Correspondence to: Lin Liu, Zhonghua Liu. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Deep Neural-Nets (DNN) have made unprecedented progress in a variety of scientific and industrial problems, such as image classification (Krizhevsky et al., 2012), natural language processing (Vaswani et al., 2017), and scientific computing (E & Yu, 2018). Recently, the statistical literature has also garnered increasing interest in deploying DNN to solve statistical problems that baffled traditional methods, for example, nonparametric regression (Chen & White, 1999; Suzuki, 2019; Schmidt-Hieber, 2020; Padilla et al., 2022) and nonparametric density estimation (Liu et al., 2021b; Liang, 2021). A common theme in these applications is to view DNN as an alternative method to approximate infinite-dimensional nonparametric models, conventionally done by splines, polynomials, etc. (Giné & Nickl, 2016). Interpretability is at the core of many statistical problems. A conceptually appealing statistical paradigm is semiparametric statistics (including semiparametric statistical models, methods and related theory).
In semiparametric statistics, one models the part of the Data Generating Process (DGP) of scientific interest with low-dimensional parametric models to improve interpretability, while leaving the remaining part (termed nuisance parameters) either completely unspecified or modeled by infinite-dimensional nonparametric models to avoid unnecessary model misspecification bias (Newey, 1990; Bickel et al., 1998). As a result, semiparametric statistics have been very popular in problems calling for both interpretability and robustness against model misspecification, such as missing data (Robins et al., 1994), causal inference (van der Laan & Robins, 2003), and more recently, transfer learning (Qiu et al., 2023; Tian et al., 2023).

It is natural to employ DNN to estimate the infinite-dimensional nuisance parameters in semiparametric models, an extension of the aforementioned idea from nonparametric statistics. In fact, this is the route that most recent semiparametric statistics literature has taken (Farrell et al., 2021; Shi et al., 2021; Zhong et al., 2022; Xu et al., 2022; Kompa et al., 2022; Chernozhukov et al., 2022; Chen et al., 2023a; 2024). In this paper, we take a different path and instead propose to treat DNN as a numerical solver for semiparametric statistics. Here is our motivation. All the above cited works on semiparametric statistics are special cases, where the statistically optimal (i.e. semiparametric efficient) estimator for the parameter of scientific interest has a closed form. In general, however, semiparametric theory can lead to nearly statistically optimal estimators that depend on the solution of Fredholm integral equations of the second kind (hereafter simply called integral equations, to ease exposition) (Bickel et al., 1998). These integral equations often: (1) do not have closed-form solutions; and (2) depend on the unknown parameter of interest, calling for complicated iterative methods (Robins et al., 1994). As shown in our numerical experiments in Section 4, using traditional methods such as polynomials/splines (Bellour et al., 2016) or approximately inverting the integral kernel (Atkinson, 1967) to approximate the solution does not always lead to optimal or satisfactory estimators in practice. In certain cases, statisticians may derive alternative estimators with closed forms, but this strategy generally leads to statistically suboptimal or sometimes even inconsistent estimators (Zhao & Ma, 2022; Liu et al., 2021a; Yang, 2022).

To bridge this gap, inspired by recent works using DNN as a new-generation numerical method for scientific computing (E & Yu, 2018; Raissi et al., 2019; Li et al., 2020), we propose a novel framework by formulating the semiparametric estimation problem as a bi-level optimization problem; and then we develop a scalable algorithm called Deep Neural-Nets Assisted Semiparametric Estimation (DNA-SE). In a nutshell, DNA-SE uses multi-layer feed-forward neural networks to approximate the solution of the integral equations and then employs the alternating gradient descent algorithm to solve the integral equations and estimate the parameters of interest simultaneously (see Section 2). We summarize our main contributions below.
Main Contributions. We develop a new DNN-based algorithm called DNA-SE that streamlines solving the integral equations and estimating the parameter of interest simultaneously, by (1) formulating the semiparametric estimation problem as a bi-level optimization problem minimizing loss functions with a nested structure; and (2) solving this bi-level programming via Alternating Gradient Descent algorithms. To our knowledge, our paper is the first that employs DNN to numerically solve integral equations that arise in semiparametric problems and draws connections between semiparametric statistics and bi-level optimization (Chen et al., 2021). We investigate the empirical performance of DNA-SE, together with the issue of hyperparameter tuning, through numerical experiments and a real data analysis, which demonstrate DNA-SE as a promising practical tool for estimating parameters in semiparametric models, outperforming traditional methods (e.g. polynomials) (Atkinson, 1967; Ren et al., 1999). We have wrapped DNA-SE into a python package, which is accessible via this link. Semiparametric methods have been viewed as a relatively difficult subject for some practitioners (Hines et al., 2022). We envision that DNA-SE and its future versions, together with other related open-source software (Lee et al., 2023), can help popularize semiparametric methods in artificial intelligence and other related scientific disciplines (Pan et al., 2022; Min et al., 2023; Wang et al., 2023).

Other Related Works. To the best of our knowledge, our work is the first attempt to bring DNN into semiparametric statistics as a numerical solver for integral equations. Recently, there have been a few works using DNN to solve a variety of integral equations (Guan et al., 2022; Zappala et al., 2023). In semiparametric statistics, however, solving integral equations is only a means to an end (parameter estimation), and more importantly, the integral equations appearing in semiparametric problems often depend on the unknown parameters to be estimated, except in some special cases (e.g. Example 3.3). Therefore these prior works cannot be directly applied to semiparametric statistics in the most general form.

Organization of the Paper. The rest of our paper is organized as follows. In Section 2, we briefly review semiparametric statistics, describe the problem, and formulate parameter estimation under semiparametric models as a generic bi-level optimization problem that minimizes a set of nested loss functions. The DNA-SE algorithm is then proposed to solve this type of problem. Three concrete examples from missing data, causal inference and transfer learning are used to illustrate our algorithm in Section 3. As a proof-of-concept, numerical experiments and a real data analysis are, respectively, conducted in Section 4 and Section 5. We conclude our article and discuss a few future research avenues in Section 6. Minor details are deferred to the Appendix.

Before proceeding, we collect some notation used throughout. Capital letters are reserved for random variables and lower-case letters are used for their realized values. The true data generating distribution is denoted as P, the probability density function (p.d.f.) or probability mass function (p.m.f.) of which is denoted as p. Throughout the paper, β denotes the parameter of scientific interest (possibly vector-valued) and η the nuisance parameter.
We let $\mathcal{F}_{L,K}$ be the class of fully-connected feedforward DNN with depth L and width K, whose input and output dimensions will be clear later. This is the type of DNN that will be considered in this paper, although other more complex architectures (e.g. Transformers (Vaswani et al., 2017)) may also be explored in the future. ω is reserved for the neuron weight parameters. Finally, we denote $\{O_i\}_{i=1}^{N}$ as the N i.i.d. observed data drawn from P. In general, O has several components, e.g. O = (O1, O2). We denote the conditional distribution of O2 | O1 as $P_{O_2|O_1}$.

2. Main Results: A New Framework and the DNA-SE Algorithm

2.1. Problem Setup of Semiparametric Estimation

We begin by describing the generic setup for parameter estimation that arises ubiquitously from semiparametric problems. Concrete examples that facilitate understanding the high-level description here are provided later in Section 3. Let β ≡ β(P) ∈ R^q denote the parameter of scientific interest, with a fixed dimension q > 0, and let η ≡ η(P) denote the nuisance parameter, which can be infinite-dimensional. Semiparametric theory (Bickel et al., 1998; van der Vaart, 2002) often leads to the following system of moment estimating equations that the true β* must satisfy:

$$\mathbb{E}\left[\psi(O; b^*(\cdot), \beta^*)\right] = 0, \quad \text{where } b^* \text{ solves } \int K(s, t, O; \beta^*)\, b^*(s, O)\, \mathrm{d}s = C(t, O; \beta^*) + b^*(t, O). \tag{1}$$

Here ψ ∈ R^q denotes the estimating functions of the same dimension as β, and b*(·), with output dimension q, is the solution to the system of q integral equations in the second line of (1). K(·) is a kernel function known up to the parameter of interest β and the nuisance parameter η; C(·) is a known vector-valued function of the same dimension as β, also known up to β and η. To avoid clutter, we omit the dependence of these functions on η throughout. s and t denote variables lying in the same sample space as (possibly parts of) the data O, and they will be made explicit in the concrete examples later. In general the solution b*(·) implicitly depends on the unknown β*, but there are exceptions (e.g. Example 3.3 in Section 3). In certain cases, the nuisance parameter η can even be set arbitrarily, and semiparametric theory can help eliminate the dependence of $\hat\beta$ on η entirely: e.g., see Examples 3.1 and 3.2. An estimator $\hat\beta$ of β* can be obtained by solving the empirical analogue of (1) (van der Vaart & Wellner, 2023).

2.2. Formulating Semiparametric Estimation as a Bi-level Optimization Problem

Here, we propose a new framework by formulating the estimating equations (1) into a bi-level optimization problem, which is amenable to being solved by modern DNN training algorithms. Specifically, we propose to minimize the following nested loss function to simultaneously solve the integral equation and estimate the parameter of interest β:

$$\hat\beta := \arg\min_{\beta \in \mathbb{R}^q} L_\psi(\beta; \hat b) \quad \text{s.t.} \quad \hat b := \arg\min_{b \in \mathcal{F}} L_K(b), \tag{2}$$

where

$$L_\psi(\beta; b) := \left\| \frac{1}{N} \sum_{i=1}^{N} \psi(O_i; \beta, b) \right\|^2, \qquad L_K(b) := \frac{1}{N} \sum_{i=1}^{N} \int \left\{ \int K(s, t, O_i; \beta)\, b(s, O_i)\, \mathrm{d}s - C(t, O_i; \beta) - b(t, O_i) \right\}^2 \mathrm{d}\nu(t),$$

and ν(·) is an easy-to-sample probability distribution over the sample space of t (default: uniform over a sufficiently large but bounded domain). It is noteworthy that the loss function (2) converges to its population version (see Appendix B), and that K, C may depend on the estimate $\hat\eta$ of η, but again we do not make this explicit. Note that the loss function (2) has a distinctive nested structure, in the sense that the outer loss $L_\psi$ depends on the solution to the inner loss $L_K$.
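To make this nested structure concrete, the following is a minimal PyTorch-style sketch of how the two Monte Carlo losses in (2) could be assembled for a one-dimensional toy problem, with b(·) parameterized by a small feedforward network as described next. The estimating function `psi`, kernel `K_fn` and right-hand side `C_fn` below are illustrative placeholders supplied for this sketch; they are not part of the DNA-SE package, and in a real problem they come from the semiparametric derivation of the example at hand.

```python
# A minimal sketch (not the DNA-SE package API) of the two Monte Carlo losses in (2).
import torch
import torch.nn as nn

torch.manual_seed(0)
N, J1, J2, q = 100, 32, 32, 1            # sample size, Monte Carlo sizes, dim(beta)
O = torch.randn(N, 1)                    # observed data O_i (placeholder)
t_mc = torch.rand(J1, 1) * 4 - 2         # t_l drawn from nu (here Uniform[-2, 2])
s_mc = torch.rand(J2, 1) * 4 - 2         # s_j drawn uniformly over the domain of s
vol_s = 4.0                              # volume of the s-domain

b_net = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, q))  # b(t, O; omega)

def psi(O, beta, b_net):                 # toy estimating function (assumption)
    return O - beta + b_net(torch.cat([O, O], dim=1))

def K_fn(s, t, O, beta):                 # toy kernel K(s, t, O; beta) (assumption)
    return torch.exp(-(s - t) ** 2) * torch.ones_like(O)

def C_fn(t, O, beta):                    # toy C(t, O; beta) (assumption)
    return torch.sin(t) * torch.ones_like(O)

def L_psi(beta, b_net):                  # outer loss: squared norm of the empirical moment
    return psi(O, beta, b_net).mean(dim=0).pow(2).sum()

def L_K(beta, b_net):                    # inner loss: Monte Carlo integral-equation residual
    total = 0.0
    for tl in t_mc:
        # (vol_s / J2) * sum_j K(s_j, t_l, O_i; beta) b(s_j, O_i; omega)
        inner = sum(K_fn(sj, tl, O, beta) * b_net(torch.cat([O, sj.expand(N, 1)], dim=1))
                    for sj in s_mc) * vol_s / J2
        resid = inner - C_fn(tl, O, beta) - b_net(torch.cat([O, tl.expand(N, 1)], dim=1))
        total = total + (resid ** 2).mean()
    return total / J1

beta = torch.tensor([0.0], requires_grad=True)
print(L_psi(beta, b_net).item(), L_K(beta, b_net).item())
```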
Here $\mathcal{F}$ is some hypothesis class for approximating b(·); in this paper, we choose $\mathcal{F} \equiv \mathcal{F}_{L,K}$, the class of fully-connected feedforward DNN. Then b(·) can be parameterized by the neuron weights as b(·; ω), and thus $\hat b(\cdot) \equiv b(\cdot; \hat\omega)$, where $\hat\omega$ denotes the neuron weights trained by minimizing (2).

In general, the integrals involved in (2) need to be evaluated numerically. Here we use Monte Carlo integration (Liu, 2001), drawing J1 samples from ν to approximate the outer integral and J2 samples uniformly from the space of s to approximate the inner integral of $L_K(b)$, although other methods can also be used in principle. For example, for fixed $t_l$ and ω, the inner integral in $L_K(b)$ is approximated by

$$\int K(s, t_l, O_i; \beta)\, b(s, O_i; \omega)\, \mathrm{d}s \approx \frac{\mathrm{vol}_s}{J_2} \sum_{j=1}^{J_2} K(s_j, t_l, O_i; \beta)\, b(s_j, O_i; \omega),$$

where $\mathrm{vol}_s$ is the volume of the sample space of s, i.e. $\mathrm{vol}_s = \int_\Omega \mathrm{d}s$ with Ω the domain of s. As J1, J2 → ∞, the Monte Carlo approximations converge to the true integrals in probability by the law of large numbers. We recommend that they be set as large as the computational resource allows to minimize the numerical approximation error.

2.3. The Alternating Gradient-Descent Algorithm

Since minimizing the loss function (9) constitutes a bi-level optimization problem, we adopt the Alternating Gradient Descent (GD) algorithm and its extensions (e.g. its stochastic/Adam versions), one of the most natural algorithmic choices for bi-level programming (Finn et al., 2017; Chen et al., 2021). The algorithm is described in Algorithm 1. At the same time, the flow chart and network structure of the whole DNA-SE process are shown in Appendix A.

Algorithm 1 Pseudocode for DNA-SE
1: main hyperparameters: alternating frequency γ, Monte Carlo sample sizes J1 and J2 for approximating the outer and inner integrals of $L_K$, convergence tolerance δ, maximal iteration M, and other common hyperparameters for DNN such as width, depth, step size, learning rate, etc.;
2: input: data samples $\{O_i\}_{i=1}^N$; Monte Carlo samples for the outer and inner integrals of $L_K$: $\{t_l\}_{l=1}^{J_1}$, $\{s_j\}_{j=1}^{J_2}$;
3: set initial values $\beta_0$ and $\omega_0$ randomly;
4: for m = 1, ..., M do
5: fix $\omega_{m-1}$, update $\beta_{m-1}$ to $\beta_m$ by one-step minimization of $L_\psi$ via GD (or its variants);
6: fix $\beta_m$, update $\omega_{m-1}$ to $\omega_m$ by γ-step minimization of $L_K$ via GD (or its variants);
7: set $\hat\beta \leftarrow \beta_m$, $\hat\omega \leftarrow \omega_m$;
8: if $\|\beta_m - \beta_{m-1}\|_2 < \delta$ and $\|\omega_m - \omega_{m-1}\|_2 < \delta$ then
9: break the for loop;
10: end if
11: end for
12: output: $(\hat\beta, \hat\omega)$.

We highlight several key features. First, the algorithm alternates between two types of GD updates: one for minimizing the outer loss $\hat L_\psi$ and the other for minimizing the inner loss $\hat L_K$. Second, it involves a hyperparameter γ, the "alternating frequency", determining the number of GD updates for the inner loss before switching to GD updates for the outer loss. We end this section with a few remarks.

Remark 2.1. Although we ground our work in the context of semiparametric estimation, this new framework can go beyond semiparametric statistics and be adapted to models subject to general (conditional) moment restrictions, commonly encountered in the statistics and econometrics literature (Ai & Chen, 2003; Chen & Santos, 2018; Cui et al., 2023).

Remark 2.2. In later numerical experiments (Section 4), we adopt the Alternating Stochastic GD (Wilson et al., 2017; Zhou et al., 2020) to solve the optimization problem (2). However, we point out that the component $L_\psi$ corresponding to the parameter of interest β in (2) often has far fewer degrees of freedom than $L_K$ for solving the integral equations. Thus in practice, one could also exploit optimization algorithms with greater computational complexities to minimize $L_\psi$, to ensure convergence to the actual global minimum. Furthermore, since algorithms for bi-level optimization problems are generally difficult to analyze (Chen et al., 2021), we decide to leave the theoretical analysis of the proposed training algorithm to future work.

Remark 2.3. Before moving onto concrete examples, we briefly comment on the statistical guarantee of the DNA-SE algorithm. One could have established the convergence rate of $\hat b(\cdot)$ to $b^*(\cdot)$ using standard arguments from M-estimation theory (Schmidt-Hieber, 2020; van der Vaart & Wellner, 2023), and hence the asymptotic normality of $\sqrt{N}(\hat\beta - \beta^*)$. However, we decide not to take this route and only demonstrate the performance of our proposed algorithm via numerical experiments (see Section 4). This is because this proof strategy ignores the effect of the training algorithm (Neyshabur, 2017; Goel et al., 2020; Vempala & Wilmes, 2019), and thus the obtained guarantee is misleading (Xu et al., 2022).
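As a concrete illustration of the alternating scheme in Algorithm 1, below is a self-contained sketch of the update loop with toy stand-ins for $L_\psi$ and $L_K$; in DNA-SE these would be the Monte Carlo losses of (2). All names and hyperparameter values here are illustrative assumptions, not the defaults of the released package.

```python
# A self-contained sketch of the alternating GD updates of Algorithm 1,
# with toy stand-ins for L_psi and L_K.
import torch
import torch.nn as nn

torch.manual_seed(0)
b_net = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))  # b(.; omega)
beta = torch.zeros(1, requires_grad=True)                             # parameter of interest
t_grid = torch.linspace(-1.0, 1.0, 64).unsqueeze(1)                   # Monte Carlo points

def L_psi(beta, b_net):   # toy outer loss (stand-in for || N^{-1} sum_i psi ||^2)
    return ((beta - 0.5).squeeze() + b_net(t_grid).mean()) ** 2

def L_K(beta, b_net):     # toy inner loss (stand-in for the integral-equation residual)
    target = torch.sin(beta.detach() * t_grid)
    return ((b_net(t_grid) - target) ** 2).mean()

opt_beta = torch.optim.Adam([beta], lr=1e-2)
opt_omega = torch.optim.Adam(b_net.parameters(), lr=1e-3)
gamma, M, delta = 5, 2000, 1e-6   # alternating frequency, max iterations, tolerance

beta_old = beta.detach().clone()
for m in range(M):
    # Step 5 of Algorithm 1: one GD step on the outer loss with omega fixed
    opt_beta.zero_grad()
    L_psi(beta, b_net).backward()
    opt_beta.step()
    # Step 6 of Algorithm 1: gamma GD steps on the inner loss with beta fixed
    for _ in range(gamma):
        opt_omega.zero_grad()
        L_K(beta, b_net).backward()
        opt_omega.step()
    if (beta.detach() - beta_old).norm() < delta:   # simplified stopping rule
        break
    beta_old = beta.detach().clone()

print("beta_hat:", beta.item())
```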
3. Motivating Examples

We now provide three concrete motivating examples to illustrate the general formulation detailed in the previous section. These three examples are drawn from, respectively, regression parameter estimation under the Missing-Not-At-Random (MNAR) missing data mechanism, sensitivity analyses in causal inference, and average loss (risk) estimation in transfer learning under dataset shift. These examples demonstrate the generality of the generic estimation problem (1) in semiparametric statistics.

3.1. A Shadow Variable Approach to MNAR

Example 3.1. First, we consider the problem of parameter estimation when the response variable Y is MNAR, using the approach introduced in Zhao & Ma (2022). The observed data is comprised of N independent and identically distributed (i.i.d.) observations $\{X_i, A_i, Y_i\}_{i=1}^N$ drawn from the following DGP:

$$X \sim N(0.5, 0.25), \qquad Y \mid X \sim \beta_1^* + \beta_2^* X + N(0, 1), \qquad A \mid Y \sim \mathrm{Bern}\big(\mathrm{expit}(1 + Y)\big), \tag{3}$$

and Y is missing if A = 0. Here, the missing-indicator A depends on the potentially unobserved value of Y. The parameter of interest is the vector of coefficients $\beta = (\beta_1, \beta_2)^\top$, while the nuisance parameter is $\eta = (p_X, p_{A|Y})$. The difficulty lies in the fact that $p_{A|Y}$ is unidentified because one never observes Y when A = 0. Zhao & Ma (2022) showed that the following semiparametric estimator $\hat\beta$ of β is asymptotically unbiased by positing an arbitrary working model $\tilde p_{A|Y}$ for $p_{A|Y}$. Let $\eta_2(y) := \tilde P(A = 1 \mid Y = y)$; then $\hat\beta$ is defined as

$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^2} \left\| \frac{1}{N} \sum_{i=1}^N \psi(X_i, Y_i, A_i; \beta, \hat b) \right\|^2, \qquad \hat b = \arg\min_{b} \frac{1}{N} \sum_{i=1}^N \int \left\{ \int b(s, X_i)\, K(s, t, X_i; \hat\beta)\, \mathrm{d}s - C(t, X_i; \hat\beta) + p_{Y|X}(t, X_i; \hat\beta)\, b(t, X_i) \right\}^2 \mathrm{d}\nu(t). \tag{4}$$

Here, we have

$$\psi(X_i, Y_i, A_i; \beta, b) \equiv A_i \frac{\partial_\beta p_{Y|X}(Y_i, X_i; \beta)}{p_{Y|X}(Y_i, X_i; \beta)} - A_i\, b(Y_i, X_i) - (1 - A_i)\, \frac{\int \partial_\beta p_{Y|X}(t, X_i; \beta)\, \eta_2(t)\, \mathrm{d}t}{\int p_{Y|X}(t, X_i; \beta)\, \{1 - \eta_2(t)\}\, \mathrm{d}t} + (1 - A_i)\, \frac{\int p_{Y|X}(t, X_i; \beta)\, b(t, X_i)\, \eta_2(t)\, \mathrm{d}t}{\int p_{Y|X}(t, X_i; \beta)\, \{1 - \eta_2(t)\}\, \mathrm{d}t},$$

and $C(t, X_i; \beta)$ and $K(s, t, X_i; \beta)$ are known functions up to β, defined as follows:

$$C(t, X_i; \beta) \equiv \partial_\beta p_{Y|X}(t, X_i; \beta) + \frac{\int \partial_\beta p_{Y|X}(s, X_i; \beta)\, \eta_2(s)\, \mathrm{d}s}{\int p_{Y|X}(s, X_i; \beta)\, \{1 - \eta_2(s)\}\, \mathrm{d}s}\; p_{Y|X}(t, X_i; \beta), \qquad K(s, t, X_i; \beta) \equiv \frac{p_{Y|X}(s, X_i; \beta)\, p_{Y|X}(t, X_i; \beta)\, \eta_2(s)}{\int p_{Y|X}(s', X_i; \beta)\, \{1 - \eta_2(s')\}\, \mathrm{d}s'}.$$

The nuisance parameter $\eta_1 \equiv p_X$ is ancillary to the parameter of interest β, so it does not appear in the above estimation procedure. The proposed estimator $\hat\beta$ in (4) in fact solves the so-called semiparametric efficient score equation (Tsiatis, 2007), the semiparametric analogue of the score equation in classical parametric statistics. Derivations of the above estimator can also be found in Appendix C.1 for the sake of completeness.
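For concreteness, here is a minimal sketch of simulating the MNAR DGP (3), as used later in Section 4.2.1. Reading N(0.5, 0.25) as "mean 0.5, variance 0.25" is an assumption of this sketch.

```python
# A sketch of simulating the MNAR DGP (3) of Example 3.1.
import numpy as np

rng = np.random.default_rng(0)
N = 500
beta_star = np.array([0.25, 0.5])

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

X = rng.normal(0.5, np.sqrt(0.25), size=N)                  # X ~ N(0.5, 0.25)
Y = beta_star[0] + beta_star[1] * X + rng.normal(size=N)    # Y | X ~ beta1* + beta2* X + N(0, 1)
A = rng.binomial(1, expit(1.0 + Y))                         # A | Y ~ Bern(expit(1 + Y))  (MNAR)
Y_obs = np.where(A == 1, Y, np.nan)                         # Y is observed only when A = 1

print("missingness rate:", 1.0 - A.mean())
```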
3.2. Sensitivity Analysis in Causal Inference

Example 3.2. The second example is about sensitivity analysis in causal inference with unmeasured confounding (Robins et al., 2000). Specifically, we consider the DGP below, with X of higher dimension than that of Zhang & Tchetgen Tchetgen (2022):

$$X := (X_1, \ldots, X_{10}) \overset{\text{i.i.d.}}{\sim} \mathrm{Uniform}(0, 1), \quad U \sim X_1 - X_2^2 + N(0, 0.1), \quad A \mid X, U \sim \mathrm{Bern}\Big(\mathrm{expit}\Big(\textstyle\sum_{j=1}^{10} (-1)^{j+1} X_j + 2U\Big)\Big), \quad Y \mid X, A, U \sim \mathrm{Bern}\Big(\mathrm{expit}\Big(\textstyle\sum_{j=1}^{10} (-1)^{j+1} X_j + \beta^* A + 2U\Big)\Big). \tag{5}$$

Here the parameter of interest β ∈ R, with true value β*, encodes the causal effect of A on Y. Since U is unmeasured, β is unidentifiable. Sensitivity analyses shall be conducted to evaluate the potential bias resulting from some fictitious U specified by the statistician. Zhang & Tchetgen Tchetgen (2022) first posit an arbitrary working model for (one of the nuisance parameters) $\tilde p(U|X) := \eta(U, X)$, inducing a possibly misspecified working joint model $\tilde p$, and then define the following estimator $\hat\beta$ of β by minimizing the losses below:

$$\hat\beta = \arg\min_{\beta \in \mathbb{R}} \left\{ \frac{1}{N} \sum_{i=1}^N \psi(X_i, A_i, Y_i; \beta, \hat b) \right\}^2, \qquad \hat b = \arg\min_b \frac{1}{N} \sum_{i=1}^N \int \left\{ \int b(u', X_i)\, K(u', u, X_i; \hat\beta)\, \mathrm{d}u' - C(u, X_i; \hat\beta) + \lambda\, b(u, X_i) \right\}^2 \mathrm{d}\nu(u), \tag{6}$$

with

$$\psi(X_i, A_i, Y_i; \beta, b) \equiv \frac{\int f(Y_i, A_i, X_i, u; \beta)\, p_{Y,A|X,U}(Y_i, A_i \mid X_i, u; \beta)\, \eta(u, X_i)\, \mathrm{d}u}{\int p_{Y,A|X,U}(Y_i, A_i \mid X_i, u; \beta)\, \eta(u, X_i)\, \mathrm{d}u}, \qquad f(Y_i, A_i, X_i, u; \beta) := \tilde s_\beta(Y_i, A_i, X_i, u; \beta) - b(u, X_i).$$

Here $\tilde s_\beta(y, a, X, u)$ is the score of the working joint model $\tilde p$ with respect to β, and $C(u, X; \beta)$ and $K(u', u, X; \beta)$ are known functions up to unknown parameters of the forms:

$$C(u, X; \beta) = \iint \frac{\int \tilde s_\beta(y, a, X, u')\, h(y, a, u', X; \beta)\, \mathrm{d}u'}{g(y, a, X; \beta)}\; p_{Y,A|X,U}(y, a \mid X, u; \beta)\, \mathrm{d}y\, \mathrm{d}a,$$

$$K(u', u, X; \beta) = \iint \frac{p_{Y,A|X,U}(y, a \mid X, u'; \beta)}{g(y, a, X; \beta)}\; p_{Y,A|X,U}(y, a \mid X, u; \beta)\, \eta(u', X)\, \mathrm{d}y\, \mathrm{d}a,$$

$$h(y, a, u', X; \beta) = p_{Y,A|X,U}(y, a \mid X, u'; \beta)\, \eta(u', X), \qquad g(y, a, X; \beta) = \int p_{Y,A|X,U}(y, a \mid X, u'; \beta)\, \eta(u', X)\, \mathrm{d}u'.$$

Derivations of the estimator (6) can also be found in Appendix C.2 for the sake of completeness.
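A minimal sketch of simulating the DGP (5) is given below. The reading of the unmeasured-confounder model as U ~ X1 − X2² + N(0, 0.1) (variance 0.1) is an assumption made for this sketch, since the extracted text is ambiguous on that point.

```python
# A sketch of simulating the sensitivity-analysis DGP (5) of Example 3.2.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
beta_star = 2.0

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

X = rng.uniform(0.0, 1.0, size=(N, 10))                        # X_1, ..., X_10 iid Uniform(0, 1)
U = X[:, 0] - X[:, 1] ** 2 + rng.normal(0.0, np.sqrt(0.1), N)  # unmeasured confounder (assumption)
lin = X @ ((-1.0) ** (np.arange(1, 11) + 1))                   # sum_j (-1)^{j+1} X_j
A = rng.binomial(1, expit(lin + 2.0 * U))                      # treatment given (X, U)
Y = rng.binomial(1, expit(lin + beta_star * A + 2.0 * U))      # outcome given (X, A, U)
# The analyst observes only (X, A, Y); U is discarded, so beta is not identified
# without the sensitivity-analysis working model eta(u, x).
```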
3.3. Transfer Learning: Average Loss Estimation under Dataset Shift

Example 3.3. One of the goals of transfer learning (Zhang et al., 2020; Gong et al., 2016; Scott, 2019; Li et al., 2023) is to improve estimation and inference of the parameter of interest in the target population by borrowing information from different but similar source populations. Here we consider the scenario where the target and source populations differ under the so-called Covariate Shift Posterior Drift (CSPD) mechanism introduced in Qiu et al. (2023), building upon Scott (2019). We consider the following DGP restricted by the CSPD mechanism (see Section 6 of Qiu et al. (2023)). The data consists of covariates X, outcome Y, and a binary data source indicator A, with A = 0/1 for the target/source population. Let φ(x) be any function with non-zero first derivative; the data is generated as follows:

$$X \sim \mathrm{Uniform}(1, 3), \qquad A \mid X = x \sim \mathrm{Bern}(\mathrm{expit}\, \pi(x)), \qquad Y \mid X = x, A = a \sim \mathrm{Bern}(\mathrm{expit}\, r_a(x)), \tag{7}$$

where $\pi(x) := \mathrm{logit}\, P(A = 1 \mid X = x)$, $r_a(x) := \mathrm{logit}\, P(Y = 1 \mid X = x, A = a)$, and $r_1(x) \equiv \varphi(r_0(x))$. The nuisance parameter is $\eta = (p_X, p_{A|X}, p_{Y|A,X})$, with $\hat\eta = (\hat p_X, \hat p_{A|X}, \hat p_{Y|A,X})$ denoting their estimates. In the procedure shown later, many quantities depend on η, and we attach hats to denote their estimates. η and those related quantities can be estimated by possibly nonparametric methods including DNN (Farrell et al., 2021; Xu et al., 2022; Chen et al., 2024) with sample splitting (Chernozhukov et al., 2018), but this is tangential to the main message of this paper.

The parameter of interest is the average squared error loss under the target population: $\beta^* := \mathbb{E}[\ell(X, Y) \mid A = 0] \equiv \mathbb{E}[(Y - f(X))^2 \mid A = 0]$, where f(x) is any prediction function for Y, possibly outputted from some machine learning algorithm. Qiu et al. (2023) proposed the following statistically optimal (i.e. semiparametric efficient) estimator $\hat\beta$ of β*:

$$\hat\beta = \frac{1}{N \hat P(A = 0)} \sum_{i=1}^N \Big[ (1 - A_i)\, \hat v(X_i) + A_i\, \{Y_i - \mathrm{expit}\, \varphi \circ \hat r_0(X_i)\}\, \hat g_1(X_i) + (1 - A_i)\, \{Y_i - \mathrm{expit}\, \hat r_0(X_i)\}\, \hat g_2(X_i) \Big], \qquad \hat b = \arg\min_b L_K(b), \tag{8}$$

where

$$\mathcal{K}(x) := \frac{\hat h(x, 1)}{\hat\mu(x)}\, \hat p_X(x), \qquad L_K(b) := \int \Big\{ \int_{x' \in S(x)} \mathcal{K}(x')\, b(x')\, \mathrm{d}x' + \hat C(x) - b(x) \Big\}^2 \mathrm{d}\nu(x).$$

We now unpack the above definitions: here $v(x) := \mathbb{E}[\ell(X, Y) \mid X = x, A = 0]$ is the average conditional squared error loss in the target population, with $\hat v$ its estimate; $S(x) := \{x' : \hat r_0(x') = \hat r_0(x)\}$ is a subset of the sample space of the covariate X; and we further define

$$\hat h(x, a) := \widehat{\mathrm{var}}(Y \mid A = a, X = x)\, \hat P(A = a \mid X = x), \qquad \hat C(x) := \frac{\hat h(x, 0) \int_{x' \in S(x)} \hat h(x', 1)\, \hat p_X(x')\, \mathrm{d}x'}{\hat h(x, 1)\, \{\varphi' \circ \hat r_0(x)\}^2}\; \frac{\ell(x, 1) - \ell(x, 0)}{\hat P(A = 0)},$$

$$\hat\mu(x) := \int_{x' \in S(x)} \hat h(x', 1)\, \hat p_X(x')\, \mathrm{d}x' \left( 1 + \frac{\hat h(x, 0)}{\hat h(x, 1)\, \{\varphi' \circ \hat r_0(x)\}^2} \right).$$

We finally turn to the definitions of $\hat g_1$ and $\hat g_2$: first let $\mathcal{A}_b(x) := \int_{x' \in S(x)} \mathcal{K}(x')\, b(x')\, \mathrm{d}x'$, and eventually define

$$\hat g_1(x) := \frac{\mathcal{A}_{\hat b}(x)\, \varphi' \circ \hat r_0(x)}{\mathcal{A}_{\hat\mu}(x)} + \frac{\varphi' \circ \hat r_0(x)\, \hat b(x)}{\hat\mu(x)}, \qquad \hat g_2(x) := \frac{\hat b(x)}{\hat\mu(x)}.$$

Derivations of the above estimator can also be found in Appendix C.3 for the sake of completeness. Though not stated as such, $\hat\beta$ in (8) can also be written as a minimizer of a loss function, but one with a closed form.
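For concreteness, below is a sketch of one DGP satisfying the CSPD restriction (7), using the concrete choices later adopted in Section 4.2.3: φ(r) = r + 1, r0(x) = π(x) = x, and f(x) = expit(x − 1/2). These choices, and the plug-in check at the end, are illustrative assumptions for the sketch, not part of the estimator itself.

```python
# A sketch of simulating a CSPD-restricted DGP as in Example 3.3.
import numpy as np

rng = np.random.default_rng(0)
N = 10000

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

phi = lambda r: r + 1.0        # outcome shift map, so that r_1(x) = phi(r_0(x))
r0 = lambda x: x               # logit P(Y = 1 | X = x, A = 0)
pi = lambda x: x               # logit P(A = 1 | X = x)
f = lambda x: expit(x - 0.5)   # prediction function whose target-population risk is beta*

X = rng.uniform(1.0, 3.0, size=N)
A = rng.binomial(1, expit(pi(X)))                 # A = 1: source, A = 0: target
logit_y = np.where(A == 1, phi(r0(X)), r0(X))     # CSPD: r_1 = phi(r_0)
Y = rng.binomial(1, expit(logit_y))

# naive plug-in value of beta* = E[(Y - f(X))^2 | A = 0] on the target samples
print("plug-in risk on target data:", np.mean((Y[A == 0] - f(X[A == 0])) ** 2))
```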
4. Numerical Experiments

In this section, we conduct three simulation studies for the three examples given in Section 3 to demonstrate the empirical performance of our proposed DNA-SE algorithm.

4.1. Hyperparameter Tuning

In Section 3, we present three examples that use semiparametric models for estimation and inference that involve solving integral equations. In the first two examples, our objective is to estimate the regression parameters/causal effects. In the third example, we aim to estimate the risk (average loss) function in transfer learning problems under dataset shift. The hyperparameters of the proposed DNA-SE model include: the alternating frequency γ, Monte Carlo sizes J1 and J2 for numerically approximating the integrals of the integral equations and the related loss $L_K$, and other common DNN tuning parameters such as depth, width, step sizes, batch sizes, etc. These hyperparameters are tuned via grid search using cross-validation. We employ the hyperbolic tangent (tanh) function as the activation function (Lu et al., 2022), and set the learning rate to $10^{-4}$. Detailed information regarding hyperparameter tuning can be found in Appendix D.

4.2. Simulation Results

4.2.1. Example 3.1: Regression Parameter Estimation under MNAR

First, we apply the DNA-SE algorithm to the problem of parameter estimation in regression models with the response variable Y being subject to MNAR, as described in (3) in Example 3.1. Here we are interested in estimating the regression coefficient $\beta = (\beta_1, \beta_2)^\top$, the true value of which is set to be $\beta^* = (0.25, 0.5)^\top$ in the simulation. The sample size N is 500. We choose $\eta_2(y) \equiv \tilde P(A = 1 \mid Y = y) = \mathrm{expit}(1 - y)$, which is completely misspecified compared to the truth, repeat the simulation experiments 100 times, and report numerical summary statistics of all estimators in Table 1 and boxplots in Figure 4 in Appendix E. We compare the estimation errors $\hat\beta - \beta^*$ over 100 simulated datasets for the DNA-SE algorithm against polynomials with degrees varying from two to five, together with the estimates from Zhao & Ma (2022) using the algorithm of Atkinson (1967) (Table 1). From Table 1, it is evident that the estimator $\hat\beta$ from DNA-SE has smaller bias than polynomials. Our result has similar bias to that of Zhao & Ma (2022), while our variance is smaller.

Table 1. This table displays the mean and standard error of $\hat\beta_1 - \beta_1^*$ and $\hat\beta_2 - \beta_2^*$ over 100 simulations. Our method is displayed as DNA-SE, compared to the other methods using d-th degree polynomials, with d ∈ {2, 3, 4, 5}, shown as quadratic, cubic, quartic and quintic in the table.

| Mean (Std) | DNA-SE | Zhao & Ma (2022) | Quintic | Quartic | Cubic | Quadratic |
|---|---|---|---|---|---|---|
| $\hat\beta_1 - \beta_1^*$ | 0.012 (0.142) | -0.016 (0.209) | -0.113 (0.070) | -0.081 (0.186) | -0.143 (0.133) | -0.033 (0.178) |
| $\hat\beta_2 - \beta_2^*$ | 0.004 (0.104) | 0.004 (0.122) | 0.182 (0.095) | 0.124 (0.112) | 0.130 (0.158) | 0.092 (0.124) |

4.2.2. Example 3.2: Sensitivity Analysis in Causal Inference

Next, we apply the DNA-SE algorithm to the problem of sensitivity analyses in causal inference with unmeasured confounding described in (5) in Example 3.2. Recall that in this problem we are interested in estimating the causal effect parameter β under a user-specified fictitious unmeasured confounder U and sensitivity analysis models. We set the true parameter value $\beta^* = 2$. The sample size N is 1000. We choose $\eta(u, x) = 1\{u \in [-0.5, 0.5]\}$ (defined in Example 3.2), which is completely misspecified compared to the true $p_{U|X}(u|x)$. Table 2 displays the numerical summary of $\hat\beta - \beta^*$ over 100 simulated datasets obtained from various methods. It is evident that our proposed DNA-SE algorithm is less biased and more efficient compared to polynomials with varying degrees. We also compared DNA-SE with a grid-based method adopted in Zhang & Tchetgen Tchetgen (2022) using their R code; the latter method has longer computation time (about 35 minutes per run) than DNA-SE (below 30 minutes per run) but larger bias and variance. Table 2 provides the summary statistics of the simulation results, supplemented with boxplots in Figure 5 in Appendix E.

Table 2. The table displays the mean and standard error of $\hat\beta - \beta^*$ over 100 simulations. Our method is displayed as DNA-SE, compared to the other methods, including a grid-based method used in Zhang & Tchetgen Tchetgen (2022) and d-th degree polynomials, with d ∈ {2, 3}, shown as quadratic and cubic in the table.

| Mean (Std) | DNA-SE | Grid-based method | Cubic | Quadratic |
|---|---|---|---|---|
| $\hat\beta - \beta^*$ | 0.036 (0.155) | 0.109 (0.426) | 0.142 (0.189) | 0.145 (0.188) |
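The polynomial baselines in Tables 1 and 2 approximate the solution b(·) by a low-degree polynomial, which reduces the integral equation to a finite linear system. The paper does not spell out the exact implementation of these baselines, so the following is only a generic least-squares collocation sketch of that idea for a one-dimensional equation of the form (1), with a toy kernel and right-hand side standing in for the problem-specific K and C.

```python
# A generic sketch of a degree-d polynomial collocation baseline for
# integral K(s, t) b(s) ds = C(t) + b(t); K and C here are toy placeholders.
import numpy as np

d = 3                                           # polynomial degree ("cubic")
t_nodes = np.linspace(-3.0, 3.0, 40)            # collocation points t_l
s_nodes = np.linspace(-3.0, 3.0, 200)           # quadrature points s_j
w = (s_nodes[-1] - s_nodes[0]) / len(s_nodes)   # simple quadrature weight

def K(s, t):                                    # toy kernel (assumption)
    return np.exp(-(s - t) ** 2)

def C(t):                                       # toy right-hand side (assumption)
    return np.sin(t)

# b(t) ~ sum_k c_k t^k  =>  for each t_l:
#   sum_k c_k [ w * sum_j K(s_j, t_l) s_j^k - t_l^k ] = C(t_l)
basis_t = np.vander(t_nodes, d + 1, increasing=True)   # t_l^k
basis_s = np.vander(s_nodes, d + 1, increasing=True)   # s_j^k
design = w * K(s_nodes[None, :], t_nodes[:, None]) @ basis_s - basis_t
coef, *_ = np.linalg.lstsq(design, C(t_nodes), rcond=None)

b_hat = lambda t: np.vander(np.atleast_1d(t), d + 1, increasing=True) @ coef
print(b_hat(0.0))
```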
4.2.3. Example 3.3: Average Loss Estimation in Transfer Learning

Next we apply DNA-SE to the transfer learning problem described in (7) in Example 3.3. Recall that in this problem we are interested in estimating the average target-population loss β* given in Example 3.3. Here we simply choose $f(x) \equiv \mathrm{expit}(x - 1/2)$ as the prediction function for Y, $\varphi(x) \equiv x + 1$ as the outcome shift mechanism function, and $r_0(x) \equiv \pi(x) \equiv x$ as the logit-transformed outcome and data-source conditional expectations. Note that, to focus on the main issue of this paper, in the simulation we use the true nuisance parameter η and related quantities such as v and h defined in Example 3.3, without using their estimates. The sample size N is 10000. The numerical summary statistics of the discrepancies between the estimated parameter $\hat\beta$ and the true parameter β* across 100 simulated datasets are presented in Table 3. The corresponding boxplots, which provide a visual representation of these results, have been relegated to Figure 6. It is evident that our proposed DNA-SE method outperforms the traditional polynomial method with different degrees, in particular in terms of variance (reduced by at least 80%).

Table 3. This table displays $\hat\beta - \beta^*$ over 100 simulations. Our method is displayed as DNA-SE, compared to methods using d-th degree polynomials, with d ∈ {5, 10, 15}, respectively displayed as quintic, 10th-degree and 15th-degree in the table.

| Mean (Std) | DNA-SE | Quintic | 10th-degree | 15th-degree |
|---|---|---|---|---|
| $\hat\beta - \beta^*$ | -0.0008 (0.0084) | 0.0009 (0.0186) | 0.0033 (0.0234) | 0.0038 (0.0385) |

5. Real Data Application

In this section, we further demonstrate the empirical performance of DNA-SE by reanalyzing a real dataset that was previously analyzed in Zhao & Ma (2022).

5.1. The Connecticut Children's Mental Health Study

We now apply our proposed DNA-SE algorithm to reanalyze a dataset from a comprehensive investigation conducted by Zahner & Daskalakis (1997) on children's mental health in Connecticut, USA. This study employed the Achenbach Child Behavior Checklist (CBCL) as a tool to evaluate the psychiatric status of children. The CBCL consists of three different forms, namely parent, teacher, and adolescent self-report. Here the response Y of interest is the teacher's report of the child's psychiatric status, where Y = 1 suggests the presence of clinical psychopathology and Y = 0 suggests normal psychological functioning. In the study, a logistic regression was conducted to examine the relationship between Y and covariates, including X1: father's presence in the household (X1 = 1/0: absence/presence), X2: the child's physical health (X2 = 1/0: poor/good health), and Z: parent's report (Z = 1/0: abnormal/normal clinical psychopathology):

$$\mathrm{logit}\, P(Y = 1 \mid Z, X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_z Z,$$

where the regression coefficients $\beta = (\beta_0, \beta_1, \beta_2, \beta_z)^\top$ are the parameters of interest. This dataset comprises 2486 subjects, of which 1061 (approximately 42.7%) exhibit missing values. The presence of such a large amount of missing data poses a significant challenge, as modeling with only the 1425 subjects with complete data may lead to highly biased estimates of β. The missingness A of the teacher's report Y is unlikely to be related to the parent's report Z, but Z may be highly correlated with Y. Hence Z can be viewed as a so-called shadow variable (Miao et al., 2024) to help identify β even under MNAR. The semiparametric estimator of β leveraging the shadow variable has been derived in Zhao & Ma (2022). A more detailed exposition can be found in Appendix F.

5.2. Data Analysis Results

Besides the estimator computed by DNA-SE, for comparison purposes, we also report the results (1) from the observed-only estimator that completely ignores the missing data and (2) from Zhao & Ma (2022) using the algorithm of Atkinson (1967). The observed-only estimator is biased under the MNAR assumption for the regression coefficients of the covariates except Z, the shadow variable. To estimate the standard errors of the DNA-SE estimators of β, we generate 100 bootstrap samples, each of size 2000, rerun the whole DNA-SE algorithm to obtain the point estimates, and then compute the mean and standard deviation of the point estimates over the 100 bootstrap samples.
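A short sketch of this nonparametric bootstrap procedure is given below; `run_dna_se` is a stand-in for one full run of Algorithm 1 on a resampled dataset, not a function exported by the package.

```python
# A sketch of the bootstrap used to attach standard errors to the point estimates.
import numpy as np

def bootstrap_se(data, run_dna_se, n_boot=100, m=2000, seed=0):
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_boot):
        idx = rng.choice(len(data), size=m, replace=True)   # resample m subjects with replacement
        estimates.append(run_dna_se(data[idx]))              # rerun the whole DNA-SE pipeline
    estimates = np.asarray(estimates)
    return estimates.mean(axis=0), estimates.std(axis=0, ddof=1)

# usage with a trivial stand-in estimator (the sample mean), for illustration only
data = np.random.default_rng(1).normal(size=(2486, 1))
mean, se = bootstrap_se(data, run_dna_se=lambda d: d.mean(axis=0))
print(mean, se)
```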
Table 4 displays the point estimates and the estimated standard errors from different methods, including DNA-SE, the reported estimates from Table 6 of Zhao & Ma (2022), and the potentially biased observed-only estimates. First, it is reassuring that all three methods produce roughly the same estimate for the regression coefficient $\beta_z$ of the shadow variable Z (parent's report), although our method, DNA-SE, has the smallest standard error (the $\hat\beta_z$ column of Table 4). For the other coefficients, the point estimates of DNA-SE are very close to those from Zhao & Ma (2022), but our (estimated) standard errors are smaller. Overall, the results of DNA-SE are consistent with those of Zhao & Ma (2022). The observed-only approach is more likely to produce biased results due to its stronger yet empirically unverifiable assumption on the missing data mechanism.

Table 4. Comparison of different estimators for β in the real data analysis.

| Mean (SE) | $\hat\beta_0$ | $\hat\beta_1$ | $\hat\beta_2$ | $\hat\beta_z$ |
|---|---|---|---|---|
| DNA-SE | -1.3289 (0.1402) | -0.0623 (0.1196) | -0.6159 (0.1366) | 1.4620 (0.0761) |
| Zhao & Ma (2022) | -1.3585 (0.1823) | -0.0718 (0.1470) | -0.9817 (0.1320) | 1.4623 (0.1194) |
| Observed-only | -1.9307 (0.1132) | 0.3652 (0.1480) | -0.0516 (0.1690) | 1.4621 (0.1583) |

6. Concluding Remarks

In this paper, we first proposed a general framework by formulating the semiparametric estimation problem as a bi-level optimization problem; we then developed a novel algorithm, DNA-SE, that can simultaneously solve the integral equations and estimate the parameter of interest by employing modern DNN as a numerical solver of the integral equations appearing in the estimation procedure. Compared to traditional approaches, DNA-SE can be scaled to higher-dimensional problems owing to: (1) the universal approximation property of DNN (Siegel & Xu, 2020); (2) more efficient optimization algorithms (Kingma & Ba, 2015); and (3) the increasing computing power for deep learning. We also developed a python package that implements the DNA-SE algorithm. In fact, this proposed framework goes beyond semiparametric estimation and can be adapted more broadly to models subject to general moment restrictions in the statistics and econometrics literature (Ai & Chen, 2003; Chen & Santos, 2018). There are some recent works introducing modern numerical methods, such as methods based on gradient flows (Crucinio et al., 2022) or tensor networks (e.g. tensor train decomposition) (Cichocki et al., 2016; 2017; Corona et al., 2017; Chen et al., 2023b), to the problem of purely solving integral equations. It will be interesting to explore how to incorporate these new generations of numerical toolboxes into semiparametric statistics.
Our future plan also includes: (1) providing convergence analysis of Algorithm 1 and possibly supplementing it with consensus-based methods (Carrillo et al., 2021) to accelerate the convergence towards global optimizer in practice; (2) characterizing statistical properties of the obtained estimator; and more importantly, (3) fully automating and democratizing semiparametric statistics by eliminating the need of deriving estimators by human, via techniques such as large language models, arithmetic formula learning, differentiable and symbolic computation (Frangakis et al., 2015; Carone et al., 2019; Polu & Sutskever, 2020; Garg et al., 2020; Trinh et al., 2024; Luedtke, 2024). Acknowledgements The authors would like to thank Bao Luo Sun for helpful discussions. Lin Liu is supported by NSFC Grant No.12101397 and No.12090024, Shanghai Science and Technology Commission Grant No.21ZR1431000 and No.21JC1402900, and Shanghai Municipal Science and Technology Major Project No.2021SHZDZX0102. Zixin Wang and Lei Zhang are also supported by Shanghai Science and Technology Commission Grant No.21ZR1431000. Lin Liu is also affiliated with Shanghai Artificial Intelligence Laboratory and the Smart Justice Lab of Koguan Law School of SJTU. Zhonghua Liu is also affiliated with Columbia University Data Science Institute. Impact Statement The goal of this paper is to advance the field of Machine Learning/Artificial Intelligence by using these tools to boost statistical sciences. There could be many potential positive societal consequences of our work, none of which we feel must be specifically highlighted here. Ai, C. and Chen, X. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795 1843, 2003. Atkinson, K. E. The numerical solution of Fredholm integral equations of the second kind. SIAM Journal on Numerical Analysis, 4(3):337 348, 1967. Bellour, A., Sbibih, D., and Zidna, A. Two cubic spline methods for solving Fredholm integral equations. Applied Mathematics and Computation, 276:1 11, 2016. Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins Series in the Mathematical Sciences. Springer New York, 1998. ISBN 9780387984735. Carone, M., Luedtke, A. R., and van der Laan, M. J. Toward computerized efficient estimation in infinite-dimensional models. Journal of the American Statistical Association, 114(527):1174 1190, 2019. Carrillo, J. A., Jin, S., Li, L., and Zhu, Y. A consensus-based global optimization method for high dimensional machine learning problems. ESAIM: Control, Optimisation and Calculus of Variations, 27:S5, 2021. Chen, J., Chen, X., and Tamer, E. Efficient estimation of average derivatives in NPIV models: Simulation comparisons of neural network estimators. Journal of Econometrics, 235(2):1848 1875, 2023a. Chen, S., Li, J., Li, Y., and Zhang, A. R. Learning polynomial transformations via generalized tensor decompositions. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pp. 1671 1684, 2023b. Chen, T., Sun, Y., and Yin, W. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. Advances in Neural Information Processing Systems, 34:25294 25307, 2021. Chen, X. and Santos, A. Overidentification in regular models. Econometrica, 86(5):1771 1817, 2018. Chen, X. and White, H. Improved rates and asymptotic normality for nonparametric neural network estimators. 
IEEE Transactions on Information Theory, 45(2):682 691, 1999. Chen, X., Liu, Y., Ma, S., and Zhang, Z. Causal inference of general treatment effects using neural networks with a diverging number of confounders. Journal of Econometrics, 238(1):105555, 2024. Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1 C68, 2018. Chernozhukov, V., Newey, W., Quintas-Martınez, V. M., and Syrgkanis, V. Riesz Net and Forest Riesz: Automatic debiased machine learning with neural nets and random forests. In International Conference on Machine Learning, pp. 3901 3914. PMLR, 2022. Cichocki, A., Lee, N., Oseledets, I., Phan, A.-H., Zhao, Q., and Mandic, D. P. Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions. Foundations and Trends in Machine Learning, 9(4-5):249 429, 2016. DNN-Assisted Semiparametric Estimation Cichocki, A., Phan, A.-H., Zhao, Q., Lee, N., Oseledets, I., Sugiyama, M., and Mandic, D. P. Tensor networks for dimensionality reduction and large-scale optimization: Part 2 applications and future perspectives. Foundations and Trends in Machine Learning, 9(6):431 673, 2017. Corona, E., Rahimian, A., and Zorin, D. A tensor-train accelerated solver for integral equations in complex geometries. Journal of Computational Physics, 334:145 169, 2017. Crucinio, F. R., De Bortoli, V., Doucet, A., and Johansen, A. M. Solving Fredholm integral equations of the first kind via Wasserstein gradient flows. ar Xiv preprint ar Xiv:2209.09936, 2022. Cui, Y., Pu, H., Shi, X., Miao, W., and Tchetgen Tchetgen, E. Semiparametric proximal causal inference. Journal of the American Statistical Association, 2023. E, W. and Yu, B. The deep Ritz method: a deep learningbased numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6 (1):1 12, 2018. Farrell, M. H., Liang, T., and Misra, S. Deep neural networks for estimation and inference. Econometrica, 89(1): 181 213, 2021. Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126 1135. PMLR, 2017. Frangakis, C. E., Qian, T., Wu, Z., and Diaz, I. Deductive derivation and Turing-computerization of semiparametric efficient estimation. Biometrics, 71(4):867 874, 2015. Garg, A., Kayal, N., and Saha, C. Learning sums of powers of low-degree polynomials in the non-degenerate case. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pp. 889 899. IEEE, 2020. Gin e, E. and Nickl, R. Mathematical foundations of infinitedimensional statistical models, volume 40. Cambridge University Press, 2016. Goel, S., Gollakota, A., Jin, Z., Karmalkar, S., and Klivans, A. Superpolynomial lower bounds for learning one-layer neural networks using gradient descent. In International Conference on Machine Learning, pp. 3587 3596. PMLR, 2020. Gong, M., Zhang, K., Liu, T., Tao, D., Glymour, C., and Sch olkopf, B. Domain adaptation with conditional transferable components. In International Conference on Machine Learning, pp. 2839 2848. PMLR, 2016. Guan, Y., Fang, T., Zhang, D., and Jin, C. Solving Fredholm integral equations using deep learning. International Journal of Applied and Computational Mathematics, 8 (2):87, 2022. Hines, O., Dukes, O., Diaz-Ordaz, K., and Vansteelandt, S. 
Demystifying statistical learning based on efficient influence functions. The American Statistician, 76(3): 292 304, 2022. Kingma, D. P. and Ba, J. L. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. Kompa, B., Bellamy, D., Kolokotrones, T., Robins, J. M., and Beam, A. Deep learning methods for proximal inference via maximum moment restriction. In Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 11189 11201, 2022. Krizhevsky, A., Sutskever, I., and Hinton, G. E. Image Net classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, pp. 1097 1105, 2012. Lee, J. J., Bhattacharya, R., Nabi, R., and Shpitser, I. Ananke: A python package for causal inference using graphical models. ar Xiv preprint ar Xiv:2301.11477, 2023. Li, X.-A., Xu, Z.-Q. J., and Zhang, L. A multi-scale DNN algorithm for nonlinear elliptic equations with multiple scales. Communications in Computational Physics, 28 (5):1886 1906, 2020. Li, Z., Cai, R., Chen, G., Sun, B., Hao, Z., and Zhang, K. Subspace identification for multi-source domain adaptation. In Advances in Neural Information Processing Systems, volume 36, 2023. Liang, T. How well generative adversarial networks learn distributions. Journal of Machine Learning Research, 22 (228):1 41, 2021. Liu, J. S. Monte Carlo strategies in scientific computing, volume 75. Springer, 2001. Liu, L., Shahn, Z., Robins, J. M., and Rotnitzky, A. Efficient estimation of optimal regimes under a no direct effect assumption. Journal of the American Statistical Association, 116(533):224 239, 2021a. Liu, Q., Xu, J., Jiang, R., and Wong, W. H. Density estimation using deep generative neural networks. Proceedings of the National Academy of Sciences, 118(15): e2101344118, 2021b. DNN-Assisted Semiparametric Estimation Lu, Y., Chen, H., Lu, J., Ying, L., and Blanchet, J. Machine learning for elliptic PDEs: Fast rate generalization bound, neural scaling law and minimax optimality. In International Conference on Learning Representations, 2022. Luedtke, A. Simplifying debiased inference via automatic differentiation and probabilistic programming. ar Xiv preprint ar Xiv:2405.08675, 2024. Miao, W., Liu, L., Li, Y., Tchetgen Tchetgen, E. J., and Geng, Z. Identification and semiparametric efficiency theory of nonignorable missing data with a shadow variable. ACM/JMS Journal of Data Science, 1(2):1 23, 2024. Min, S., Shi, W., Lewis, M., Chen, X., Yih, W.-t., Hajishirzi, H., and Zettlemoyer, L. Nonparametric masked language modeling. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 2097 2118, 2023. Newey, W. K. Semiparametric efficiency bounds. Journal of Applied Econometrics, 5(2):99 135, 1990. Neyshabur, B. Implicit regularization in deep learning. Ph D thesis, Toyota Technological Institute at Chicago (TTIC), 2017. Padilla, O. H. M., Tansey, W., and Chen, Y. Quantile regression with Re LU networks: Estimators and minimax rates. Journal of Machine Learning Research, 23(247): 1 42, 2022. Pan, X., Yao, W., Zhang, H., Yu, D., Yu, D., and Chen, J. Knowledge-in-context: Towards knowledgeable semiparametric language models. In International Conference on Learning Representations, 2022. Polu, S. and Sutskever, I. Generative language modeling for automated theorem proving. ar Xiv preprint ar Xiv:2009.03393, 2020. Qiu, H., Tchetgen, E. T., and Dobriban, E. 
Efficient and multiply robust risk estimation under general forms of dataset shift. ar Xiv preprint ar Xiv:2306.16406, 2023. Raissi, M., Perdikaris, P., and Karniadakis, G. E. Physicsinformed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686 707, 2019. Ren, Y., Zhang, B., and Qiao, H. A simple Taylor-series expansion method for a class of second kind integral equations. Journal of Computational and Applied Mathematics, 110(1):15 24, 1999. Robins, J. M., Rotnitzky, A., and Zhao, L. P. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846 866, 1994. Robins, J. M., Rotnitzky, A., and Scharfstein, D. O. Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In Statistical Models in Epidemiology, the Environment, and Clinical Trials, pp. 1 94. Springer, 2000. Schmidt-Hieber, J. Nonparametric regression using deep neural networks with Re LU activation function. Annals of Statistics, 48(4):1875 1897, 2020. Scott, C. A generalized Neyman-Pearson criterion for optimal domain adaptation. In Algorithmic Learning Theory, pp. 738 761. PMLR, 2019. Shi, C., Xu, T., Bergsma, W., and Li, L. Double generative adversarial networks for conditional independence testing. Journal of Machine Learning Research, 22(285):1 32, 2021. Siegel, J. W. and Xu, J. Approximation rates for neural networks with general activation functions. Neural Networks, 128:313 321, 2020. Suzuki, T. Adaptivity of deep Re LU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. In International Conference on Learning Representations, 2019. Tian, Q., Zhang, X., and Zhao, J. ELSA: Efficient label shift adaptation through the lens of semiparametric models. In International Conference on Machine Learning, pp. 34120 34142. PMLR, 2023. Trinh, T. H., Wu, Y., Le, Q. V., He, H., and Luong, T. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476 482, 2024. Tsiatis, A. Semiparametric theory and missing data. Springer Science & Business Media, 2007. van der Laan, M. J. and Robins, J. M. Unified methods for censored longitudinal data and causality. Springer Science & Business Media, 2003. van der Vaart, A. Part III: Semiparameric statistics. Lectures on Probability Theory and Statistics, pp. 331 457, 2002. van der Vaart, A. W. and Wellner, J. Weak Convergence and Empirical Processes: with Applications to Statistics. Springer Science & Business Media, 2023. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000 6010, 2017. DNN-Assisted Semiparametric Estimation Vempala, S. and Wilmes, J. Gradient descent for one-hiddenlayer neural networks: Polynomial convergence and SQ lower bounds. In Conference on Learning Theory, pp. 3115 3117. PMLR, 2019. Wang, H., Fu, T., Du, Y., Gao, W., Huang, K., Liu, Z., Chandak, P., Liu, S., Van Katwyk, P., Deac, A., et al. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47 60, 2023. Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The marginal value of adaptive gradient methods in machine learning. 
In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4151-4161, 2017. Xu, S., Liu, L., and Liu, Z. DeepMed: Semiparametric causal mediation analysis with debiased deep learning. In Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 28238-28251, 2022. Yang, S. Semiparametric estimation of structural nested mean models with irregularly spaced longitudinal observations. Biometrics, 78(3):937-949, 2022. Zahner, G. E. and Daskalakis, C. Factors associated with mental health, general health, and school-based service use for child psychopathology. American Journal of Public Health, 87(9):1440-1448, 1997. Zappala, E., Fonseca, A. H. d. O., Moberly, A. H., Higley, M. J., Abdallah, C., Cardin, J. A., and van Dijk, D. Neural integro-differential equations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37(9), pp. 11104-11112, 2023. Zhang, B. and Tchetgen Tchetgen, E. J. A semi-parametric approach to model-based sensitivity analysis in observational studies. Journal of the Royal Statistical Society Series A: Statistics in Society, 185(Supplement 2):S668-S691, 2022. Zhang, K., Gong, M., Stojanov, P., Huang, B., Liu, Q., and Glymour, C. Domain adaptation as a problem of inference on graphical models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 4965-4976, 2020. Zhao, J. and Ma, Y. A versatile estimation procedure without estimating the nonignorable missingness mechanism. Journal of the American Statistical Association, 117(540):1916-1930, 2022. Zhong, Q., Mueller, J., and Wang, J.-L. Deep learning for the partially linear Cox model. The Annals of Statistics, 50(3):1348-1375, 2022. Zhou, P., Feng, J., Ma, C., Xiong, C., Hoi, S., and E, W. Towards theoretically understanding why SGD generalizes better than ADAM in deep learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 21285-21296, 2020.

A. The flow chart of DNA-SE

[Figure 1 appears here.]

Figure 1. Depicted here is an illustration of our DNA-SE algorithm and the neural network architecture employed for modeling b, the solution to the integral equations. Left: We present a comprehensive flowchart that outlines the algorithmic procedure. Right: The architecture of b is chosen to be the standard Multilayer Perceptron (MLP), or feedforward neural network, in this paper, with tanh activation function. Notably, the MLP is constructed with a depth of 2L + 1, indicating a strategic layering to optimize performance.

B. Further remarks on the loss function (2)

As mentioned in Section 2.2, the (empirical) loss function (2) originates from the following population loss function:

$$\beta^* := \arg\min_{\beta \in \mathbb{R}^q} L_\psi(\beta, b^*) \quad \text{s.t.} \quad b^*(\cdot) := \arg\min_{b \in \mathcal{F}} L_K(b, \beta^*), \tag{9}$$

where

$$L_\psi(\beta, b) := \left\{ \mathbb{E}\left[\psi(O; b(\cdot), \beta)\right] \right\}^2, \qquad L_K(b, \beta) := \iint \left\{ \int K(s, t, o; \beta)\, b(s, o)\, \mathrm{d}s - C(t, o; \beta) - b(t, o) \right\}^2 \mathrm{d}\nu(t)\, \mathrm{d}P(o).$$

We first turn the estimating equation (1) into the above population loss and then replace the integral with respect to dP(o) by the empirical average. We would like to make another remark on this loss function.
One can in principle first draw many samples from the space where the parameter of interest β lies, and then solve the integral equations over all these possible values. However, the Alternating GD type of algorithms can explore the β space more efficiently. Finally, it is noteworthy that we are interested in the global minimizer of the loss (9). Since the DNA-SE algorithm relies on GD, when multiple local minimizers exist, one could in principle consider a trial-and-error approach and initialize multiple starting points for β and the function b( ) to ensure convergence to the global minimizer. One could also use the idea from consensus-based optimization (Carrillo et al., 2021) by initializing different starting points using Interacting Particle Systems to accelerate the trial-and-error approach by injecting benign correlations among those starting points. DNN-Assisted Semiparametric Estimation C. Derivations for Section 3 To make our paper relatively self-contained, we provide derivations of the estimators for the examples in Section 3. We want to remark that these derivations are not new and have appeared in the cited works in Section 3. C.1. Derivations for Example 3.1 From the model given by Example 3.1, the true missing-data mechanism is P(A = 1|Y = y) = expit(1 + y). Given a dataset consisting of N i.i.d. observations (Xi, Ai, Ai Yi)N i=1, the observed-data joint likelihood and the observed-data score are given below: p(x, a, y; β, η) = p X(x) p Y |X(y, x; β)P(A = 1|Y = y) a 1 Z p Y |X(t, x; β)P(A = 1|Y = t)dt 1 a , sβ(x, a, ay; β) = a β p Y |X(y, x; β) p Y |X(y, x; β) (1 a) R βp Y |X(t, x; β)P(A = 1|Y = t)dt 1 R p Y |X(t, x; β)P(A = 1|Y = t)dt, where p Y |X(y, x; β) is assumed to be known up to the parameter value β, but η is completely unknown to the statistician. From semiparametric theory, the nuisance tangent space is Λ = Λ1 Λ2, where Λ1 = {h(x) : Eh(X) = 0}, Λ2 = ag(y) (1 a) R p Y |X(t, x; β)g(t)η(t)dt 1 R p Y |X(t, x; β)η(t)dt : g( ) . By definition, one can obtain the observed-data efficient score seff,β(x, a, ay) with respect to β by subtracting from sβ(x, a, ay) its projection onto the observed-data nuisance tangent space Λ. The projection operation leads to the terms involving function b( ) (corresponding to the function g in Λ2) in the display below: seff,β(x, a, ay) = sβ(x, a, ay, β) ab(y) + (1 a) R p Y |X(t, x; β)b(t; β)π(t)dt 1 R p Y |X(t, x; β)π(t)dt , where b(y) satisfies the following function: Z ( β p Y |X(y, x; β) + R βp Y |X(t, x; β)η(t)dt R p Y |X(t, x; β)(1 η(t))dtp Y |X(y, x; β) = Z b(y; β)p Y |X(y, x; β) + R b(t; β)p Y |X(t, x; β)η(t)dt R p Y |X(t, x; β)(1 η(t))dt p Y |X(y, x; β) p X(x)dx. Zhao & Ma (2022) showed that a working model η(y) can be used to replace the true missing mechanism P(A = 1|Y = y), which eventually leads to the estimator bβ that solves the following empirical version of the estimating equation: sβ (Xi, Ai, Ai Yi) Aib (Yi; β) + (1 Ai) R p Y |X (t, Xi; β) b(t; β)η(t)dt R p Y |X(t, X; β) (1 η(t)) dt ( β p Y |X (y, Xi; β) + R βp Y |X (t, Xi; β) η(t)dt R p Y |X (t, Xi; β) (1 η(t)) dtp Y |X (y, Xi; β) b(y; β)p Y |X (y, Xi; β) + R b(t; β)p Y |X (t, Xi; β) η(t)dt R p Y |X (t, Xi; β) (1 η(t)) dt p Y |X (y, Xi; β) . Intuitively, this is because one eliminates the information from the nuisance parameters (in particular the guessed model for A|Y ) in the observed-data efficient score by projecting onto the orthocomplement of the observed-data nuisance tangent space. DNN-Assisted Semiparametric Estimation C.2. 
C.2. Derivations for Example 3.2

In this example, the likelihood of the complete data can be factorized as follows:
$$p(X = x, U = u, A = a, Y = y) = p_X(x)\, p_{U|X}(u \mid x)\, p_{A|X,U}(a \mid x, u)\, p_{Y|X,U,A}(y \mid x, u, a; \beta).$$
Here the parameter of interest is $\beta$ and the remaining pieces of the joint likelihood are treated as the nuisance parameters $\eta$. One can derive the observed-data score with respect to $\beta$ as the conditional expectation of the complete-data score with respect to $\beta$: $s_{\beta}(X, A, Y) = \mathbb{E}[s_{\beta}(X, U, A, Y) \mid X, A, Y]$ (Robins et al., 1994; van der Laan & Robins, 2003). However, one can obtain a more efficient estimator, which also relies on fewer modeling assumptions, by calculating the observed-data efficient score based on $s_{\beta}(x, a, y)$ as:
$$s_{\mathrm{eff}, \beta}(X, A, Y) = s_{\beta}(X, A, Y) - \mathbb{E}[b(U, X) \mid X, A, Y],$$
where the second term in the above display is simply the projection of $s_{\beta}(X, A, Y)$ onto the orthocomplement of the observed-data nuisance tangent space $\Lambda$, to be defined later, which further implies that $b(u, x)$ solves the following integral equation:
$$\mathbb{E}[s_{\beta}(X, A, Y) \mid X, U] = \mathbb{E}\big\{ \mathbb{E}[b(U, X) \mid X, A, Y] \,\big|\, X, U \big\}.$$
To derive the observed-data nuisance tangent space, one first obtains the complete-data nuisance tangent space $\Lambda^{F}$:
$$\Lambda^{F} = \Lambda^{F}_{1} \oplus \Lambda^{F}_{2}, \qquad \Lambda^{F}_{1} = \{ a_1(x) : \mathbb{E}[a_1(X)] = 0 \}, \qquad \Lambda^{F}_{2} = \{ a_2(u, x) : \mathbb{E}[a_2(U, X) \mid X] = 0 \}.$$
It is well known that the observed-data model tangent space is the conditional expectation of the complete-data model tangent space given the observed data (Robins et al., 1994; van der Laan & Robins, 2003). Hence the observed-data tangent space $\Lambda$ is:
$$\Lambda_1 \equiv \mathbb{E}[\Lambda^{F}_{1} \mid X, A, Y] = \Lambda^{F}_{1} = \{ a_1(x) : \mathbb{E}[a_1(X)] = 0 \},$$
$$\Lambda_2 \equiv \mathbb{E}[\Lambda^{F}_{2} \mid X, A, Y] = \big\{ a^{*}_{2}(x, a, y) \equiv \mathbb{E}[a_2(U, X) \mid X = x, A = a, Y = y] : \mathbb{E}[a_2(U, X) \mid X] = 0 \big\}.$$
Then computing the projection of $s_{\beta}(X, A, Y)$ onto $\Lambda$ amounts to finding the function $b(U, X)$ such that
$$\mathbb{E}\big[ s_{\beta}(X, A, Y) - \mathbb{E}[b(U, X) \mid X, A, Y] \,\big|\, X, U \big] \equiv 0$$
(for this step, see Equation (6) in the online supplementary materials of Zhang & Tchetgen Tchetgen (2022)), which is a Fredholm integral equation of the first kind, an ill-posed problem. However, one can turn this ill-posed problem into a well-posed one by introducing Tikhonov regularization, which is the route taken in Zhang & Tchetgen Tchetgen (2022). Since $p_{U|X}(u \mid x)$ cannot be identified from the observed-data distribution, one postulates a working model for $p_{U|X}(u \mid x)$ as $\tilde{p}(u \mid x) := \eta(u, x)\, \mathbb{1}\{ u \in [-B, B] \} / (2B)$, where $B$ is sufficiently large. Thus, under this working model, the integral equation (after Tikhonov regularization by a possibly very small tuning parameter $\lambda$) becomes
$$\int b^{*}(u', X)\, K(u', u, X)\,\mathrm{d}u' = C(u, X) - \lambda\, b^{*}(u, X),$$
where $C(u, X)$ and $K(u', u, X)$ are known functions up to the unknown parameters. In particular, $C(u, X)$ and $K(u', u, X)$ are of the forms
$$C(u, X) = \int\!\!\int \frac{\int \tilde{s}_{\beta}(y, a, X, u')\, h(y, a, u', X)\,\mathrm{d}u'}{g(y, a, X)}\; p_{Y,A|X,U}(y, a \mid X, u)\,\mathrm{d}y\,\mathrm{d}a,$$
$$K(u', u, X) = \int\!\!\int \frac{p_{Y,A|X,U}(y, a \mid X, u')\, p_{Y,A|X,U}(y, a \mid X, u)\, \eta(u', X)}{g(y, a, X)}\,\mathrm{d}y\,\mathrm{d}a,$$
$$h(y, a, u', X) = p_{Y,A|X,U}(y, a \mid X, u')\, \eta(u', X), \qquad g(y, a, X) = \int p_{Y,A|X,U}(y, a \mid X, u')\, \eta(u', X)\,\mathrm{d}u'.$$
Eventually, one obtains the estimator $\hat{\beta}$ of the parameter of interest $\beta$ by solving
$$\sum_{i=1}^{N} \psi(X_i, A_i, Y_i; \beta) = 0, \qquad \text{where} \qquad \psi(X_i, A_i, Y_i; \beta) = \frac{\int f(Y_i, A_i, X_i, u)\, p_{Y,A|X,U}(Y_i, A_i \mid X_i, u)\, \eta(u, X_i)\,\mathrm{d}u}{\int p_{Y,A|X,U}(Y_i, A_i \mid X_i, u)\, \eta(u, X_i)\,\mathrm{d}u},$$
$$f(Y_i, A_i, X_i, u) := \tilde{s}_{\beta}(Y_i, A_i, X_i, u) - b^{*}(u, X_i),$$
where we let $\tilde{p}(x, u, a, y)$ denote the joint likelihood induced by the working model:
$$\tilde{p}(x, u, a, y) = p_X(x)\, \eta(u, x)\, p_{A|X,U}(a \mid x, u)\, p_{Y|X,U,A}(y \mid x, u, a; \beta),$$
and the score of the working joint law with respect to $\beta$ is simply $\tilde{s}_{\beta}(X, U, A, Y) = \partial \log \tilde{p}(X, U, A, Y; \beta) / \partial \beta$.
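For comparison, the Tikhonov-regularized second-kind equation above can also be discretized on a grid and solved directly at a fixed $x$, in the spirit of grid-based approaches such as the comparator of Figure 5. Below is a minimal sketch under that assumption; `K_fn` and `C_fn` are hypothetical callables returning $K(u', u, x)$ and $C(u, x)$, and `lam` is the tuning parameter $\lambda$.

```python
import numpy as np

def solve_fredholm_2nd_kind(K_fn, C_fn, x, lam=1e-3, B=5.0, J=200):
    """Grid solve of  ∫ b(u') K(u', u, x) du' + lam * b(u) = C(u, x)  for b*(., x)."""
    u = np.linspace(-B, B, J)                  # grid on the support [-B, B] of the working model
    du = u[1] - u[0]
    Kmat = K_fn(u[:, None], u[None, :], x)     # Kmat[j, k] = K(u'_j, u_k, x)
    A = du * Kmat.T + lam * np.eye(J)          # row k: du * sum_j K(u'_j, u_k, x) b_j + lam * b_k
    return u, np.linalg.solve(A, C_fn(u, x))   # values of b*(., x) on the grid
```

A dense solve of this kind costs $O(J^{3})$ per value of $x$ and becomes difficult to scale when $u$ is multi-dimensional, which is precisely the regime in which DNA-SE instead minimizes the squared residual of the same equation over a neural network $b$.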
Zhang & Tchetgen Tchetgen (2022) showed that even when the working model is completely misspecified, the estimator based on the observed-data efficient score, rather than the observed-data score, is still consistent. Intuitively, this is because one eliminates the information from the nuisance parameters (in particular the guessed distribution for $U \mid X$) in the observed-data efficient score by projecting onto the orthocomplement of the observed-data nuisance tangent space. Proof details for the aforementioned equations are provided in Zhang & Tchetgen Tchetgen (2022).

C.3. Derivations for Example 3.3

In the example of target-population average loss estimation in the context of transfer learning, the main structural assumption of Qiu et al. (2023) is the following relationship between the logit conditional outcome models of the target ($A = 0$) and the source ($A = 1$) populations: letting $\varphi: \mathbb{R} \to \mathbb{R}$ be a strictly increasing function with non-zero first derivative, Qiu et al. (2023) impose
$$\operatorname{logit} P(Y = 1 \mid X = x, A = 1) = \varphi\big( \operatorname{logit} P(Y = 1 \mid X = x, A = 0) \big). \tag{10}$$
As mentioned in the main text, the parameter of interest is the average squared-error loss of predicting $Y$ using a (possibly arbitrary) prediction function $f(\cdot)$ in the target population (i.e., conditioning on $A = 0$): $\beta := \mathbb{E}[\ell(X, Y) \mid A = 0]$, where $\ell(x, y) = (y - f(x))^{2}$. The full derivation of the semiparametric efficient estimator $\hat{\beta}$ can be found in Section S9.6 of Qiu et al. (2023), but we recapitulate the main idea here for the reader's convenience. To avoid clutter, we do not repeat the definitions of the notation that already appeared in Example 3.3. Similar to the above two examples, the derivation starts from the observed-data joint likelihood:
$$p(X = x, A = a, Y = y) = p_X(x)\, p_{A|X}(a \mid x)\, P(Y = 1 \mid X = x, A = 1)^{ay}\, P(Y = 0 \mid X = x, A = 1)^{a(1 - y)}\, P(Y = 1 \mid X = x, A = 0)^{(1 - a)y}\, P(Y = 0 \mid X = x, A = 0)^{(1 - a)(1 - y)}$$
$$= p_X(x)\, p_{A|X}(a \mid x)\, \{\operatorname{expit} \varphi \circ r_0(x)\}^{ay}\, \{1 - \operatorname{expit} \varphi \circ r_0(x)\}^{a(1 - y)}\, \{\operatorname{expit} r_0(x)\}^{(1 - a)y}\, \{1 - \operatorname{expit} r_0(x)\}^{(1 - a)(1 - y)}.$$
In this example, since the parameter of interest $\beta$ and the nuisance parameters do not correspond to separate pieces of the joint likelihood decomposition, one needs to first derive one influence function $\mathrm{IF}$ of $\beta$ (a mean-zero random variable that is the functional first-order derivative of $\beta$ (van der Vaart, 2002)), then characterize the model tangent space $\mathcal{T}$, and finally obtain the efficient influence function $\mathrm{EIF}$ of $\beta$ by projecting $\mathrm{IF}$ onto the model tangent space $\mathcal{T}$ (van der Laan & Robins, 2003; Tsiatis, 2007). First, one can easily obtain one influence function of $\beta$ following the standard functional differentiation argument recently reviewed in Hines et al. (2022):
$$\mathrm{IF} = \frac{1 - A}{P(A = 0)}\, \big( \ell(X, Y) - \beta \big).$$
The model tangent space $\mathcal{T}$ under constraint (10) can be derived from the observed-data joint likelihood as the closure of the direct sum of the spaces of all scores with respect to each piece of the joint likelihood. Here, since one does not model the joint likelihood parametrically, the score with respect to each piece is defined via paths of parametric submodels that go through the underlying likelihood; for details, see van der Vaart (2002) or Section S9.6 of Qiu et al. (2023). With a constraint like (10), it is often easier to derive the orthocomplement $\mathcal{T}^{\perp}$ of $\mathcal{T}$. In particular, Qiu et al. (2023) showed the following:
$$\mathcal{T}^{\perp} = \left\{ A\, \big( Y - \operatorname{expit} \varphi \circ r_0(X) \big)\, f(X) - (1 - A)\, \big( Y - \operatorname{expit} r_0(X) \big)\, \frac{h(X, 1)\, \varphi' \circ r_0(X)}{h(X, 0)}\, f(X) \; : \; f \text{ s.t. } \mathbb{E}[h(X, 1)\, f(X) \mid r_0(X)] = 0 \right\}.$$
With $\mathcal{T}^{\perp}$ in hand, one can obtain the $\mathrm{EIF}$ by subtracting from $\mathrm{IF}$ the projection of $\mathrm{IF}$ onto $\mathcal{T}^{\perp}$, and thereby the final semiparametric efficient estimator $\hat{\beta}$. In particular, this projection operation, in order to determine the unique $f \equiv b^{*}$ in $\mathcal{T}^{\perp}$, leads to the following integral equation:
$$\int_{x' \in S(x)} b^{*}(x')\, \frac{v(x'; 1)}{\mu(x')}\, p_X(x')\,\mathrm{d}x' + C(x) = b^{*}(x),$$
where $S(x) = \{ x' : r_0(x') = r_0(x) \}$ and the definition of $C$ can be found in Example 3.3 (with all the hats removed in $\hat{C}$). As in the main text, for a generic function $g$, let
$$\int_{x' \in S(x)} g(x')\, \frac{v(x'; 1)}{\mu(x')}\, p_X(x')\,\mathrm{d}x'$$
denote the corresponding integral transform of $g$ evaluated at $x$. The efficient influence function $\mathrm{EIF}$ of $\beta$ is then
$$\mathrm{EIF} = \frac{1 - A}{P(A = 0)}\, \big( v(X) - \beta \big) + A\, \big( Y - \operatorname{expit} \varphi \circ r_0(X) \big)\, g_1(X) + (1 - A)\, \big( Y - \operatorname{expit} r_0(X) \big)\, g_2(X),$$
with $g_1, g_2$ defined as in the main text, except with all the hats removed.

D. Hyperparameter tuning

The tuning parameters that vary the most include the width and depth of the network, as well as the alternating frequency $\gamma$. Through grid search and 5-fold cross-validation, for MNAR (Example 3.1) the optimal network width and depth are 5 and 3, respectively, resulting in the lowest average MSE. Conversely, in the sensitivity analysis (Example 3.2), the optimal network width and depth are 33 and 23, respectively. In the case of transfer learning (Example 3.3), which differs from the other two examples in that the integral equation does not depend on the parameter of interest $\beta$, the optimal width and depth are 5 and 3. The effect of different choices of $\gamma$ is depicted in Figures 2 and 3. In all the examples, we choose tanh as the activation function, which is more commonly used in DNNs for problems in scientific computing (Lu et al., 2022). We choose the Monte Carlo sample sizes $J_1$ and $J_2$ to be 1000.

D.1. Hyperparameter tuning for Example 3.1

We find that in this example the alternating frequency $\gamma$ heavily affects the speed and stability of convergence; see Figure 2 for details. When $\gamma$ is set to 1, i.e., when the neural network parameters $\omega$ and the target parameter $\beta$ are updated in every other iteration, the training process fails to converge and oscillates with extremely high frequency (not shown in Figure 2). When we increase $\gamma$, the training process starts to stabilize after a certain number of iterations and DNA-SE eventually outputs $\hat{\beta} \approx \beta = (0.25, 0.5)$ (Figure 2). However, there exists a trade-off: when $\gamma = 5$, the training process converges quickly but also exhibits some instability as the iterations continue; whereas when $\gamma = 20$, it takes a much longer time for the training process to converge. For the time being, we recommend that practitioners try multiple values of $\gamma$ (higher than 5) and check whether they all eventually converge to the same values. It is an interesting research problem to study an adaptive procedure for choosing a $\gamma_{\mathrm{opt}}$ that optimally balances the speed and stability of training. Finally, we found that the results are not sensitive to the Monte Carlo sample sizes $J_1, J_2$ as long as they are sufficiently large.
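Following the recommendation above, one simple sanity check is to rerun DNA-SE for several values of $\gamma$ and verify that the resulting estimates agree. A small illustrative helper is given below; the tolerance is arbitrary, and `run_dna_se` is any user-supplied wrapper (for instance around the `dna_se` sketch in Appendix B) mapping $\gamma$ to an estimate of $\beta$.

```python
import torch

def check_gamma_robustness(run_dna_se, gammas=(5, 10, 20), atol=1e-2):
    """Run DNA-SE for several alternating frequencies gamma and check that the
    resulting estimates of beta agree up to the tolerance `atol`."""
    estimates = [run_dna_se(g) for g in gammas]
    return all(torch.allclose(estimates[0], est, atol=atol) for est in estimates[1:])
```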
Figure 2. The evolution of the training processes in one simulated dataset of Example 4.2.1, with the alternating frequency $\gamma$ set to 5, 10, or 20. The horizontal axis is the number of iterations. The upper panels show the evolution of $\beta_1$ (left) and $\beta_2$ (right) over iterations, while the lower panels show the evolution of the loss corresponding to the score equation (left) and the integral equation (right) over iterations.

D.2. Hyperparameter tuning for Example 3.2

Similar to the last example, it is imperative to examine whether various alternating frequencies impact the outcomes. However, in contrast to Example 3.1, all training processes stabilize, at different rates that are directly proportional to the alternating frequency. The tuning details for the simulations described in Section 4.2.2 are presented in Figure 3. The depicted figure provides a comprehensive analysis of the alternating frequencies $\gamma$ employed, namely 1, 5, 10, and 20, revealing their similar performance. Notably, the convergence rate appears to be faster when $\gamma$ is set to 1, as observed from the figure. For the regularization parameter $\lambda$, we simply choose $\lambda = 0.001$ in this paper.

Figure 3. The evolution of the training processes in one simulated dataset of Example 4.2.2, with the alternating frequency $\gamma$ set to 1, 5, 10, or 20. The horizontal axis is the number of iterations. Panel (a) shows the evolution of $\beta$ over iterations, while panels (b) and (c) show the evolution of the loss corresponding to the score equation and the integral equation, respectively, over iterations.

D.3. Hyperparameter tuning for Example 3.3

Compared to the two previous examples, the main distinctive feature of this example is that the integral equation does not depend on the parameter of interest $\beta$. Thus, in this special case, there is no need to use an Alternating-GD-type algorithm and a regular GD-type algorithm suffices. As a result, there is no need to tune the alternating frequency $\gamma$ in Algorithm 1.

E. Supplementary figures for Section 4

This section collects the supplementary figures (Figures 4 and 5) for the numerical experiments that were only referenced in Section 4 due to space limitations.

Figure 4. Boxplots (a) and (b) display $\hat{\beta}_1 - \beta_1$ and $\hat{\beta}_2 - \beta_2$, respectively, over 100 simulations. Our method is displayed as "DNA-SE", compared to the methods using $d$-th degree polynomials, with $d \in \{2, 3, 4, 5\}$, respectively displayed as "quadratic", "cubic", "quartic", and "quintic" on the horizontal axis of the plot.

Figure 5. The boxplot displays $\hat{\beta} - \beta$ over 100 simulations. Our method is displayed as "DNA-SE", compared to the grid-based method of Zhang & Tchetgen Tchetgen (2022) and the methods using $d$-th degree polynomials, with $d \in \{2, 3\}$, respectively displayed as "quadratic" and "cubic" on the horizontal axis of the plot.

F. The estimators used for real data analysis

Following Zhao & Ma (2022), the missing-data mechanism is posited to be $\eta^{*}(Y, X_1, X_2)$, a function of the response (teacher's report) and the covariates $X_1$ (father's presence) and $X_2$ (child's health):
$$\operatorname{logit} \eta^{*}(Y, X_1, X_2) = \operatorname{logit} P(A = 1 \mid Y, X_1, X_2) = 1.058 - 2.037\, Y - 0.002\, X_1 + 0.298\, X_2.$$
Zhao & Ma (2022) also posited the following model for the shadow variable $Z$:
$$\operatorname{logit} P(Z = 1 \mid X_1, X_2) = 2.106 + 0.623\, X_1 + 0.890\, X_2.$$
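For concreteness, the two posited working models can be coded directly from the displayed coefficients. The sketch below assumes, as in the display above, that $\eta^{*}$ denotes the missingness probability itself (i.e., the expit of the displayed linear predictor).

```python
from scipy.special import expit   # inverse logit

def eta_star(y, x1, x2):
    """Posited (working) missing-data mechanism P(A = 1 | Y = y, X1 = x1, X2 = x2)."""
    return expit(1.058 - 2.037 * y - 0.002 * x1 + 0.298 * x2)

def p_shadow(x1, x2):
    """Posited model for the shadow variable: P(Z = 1 | X1 = x1, X2 = x2)."""
    return expit(2.106 + 0.623 * x1 + 0.890 * x2)
```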
Figure 6. The boxplot displays $\hat{\beta} - \beta$ over 100 simulations. Our method is displayed as "DNA-SE", compared to the methods using $d$-th degree polynomials, with $d \in \{5, 10, 15\}$, respectively displayed as "quintic", "10th-degree", and "15th-degree" on the horizontal axis of the plot.

The parameters of interest are $\beta = (\beta_0, \beta_1, \beta_2, \beta_z)$ and the estimator $\hat{\beta}$ is given below:
$$\hat{\beta} = \arg\min_{\beta} \left\| \frac{1}{N} \sum_{i=1}^{N} \psi(X_i, Z_i, Y_i, A_i; \beta, \hat{b}) \right\|^{2},$$
$$\hat{b} = \arg\min_{b} \frac{1}{N} \sum_{i=1}^{N} \int\!\!\int \left\{ \int b(s, x)\, K(s, t, x, X_i, Z_i; \hat{\beta})\,\mathrm{d}s - C(t, x, Z_i; \hat{\beta}) - p_{Y|X,Z}(t \mid x, Z_i; \hat{\beta})\, b(t, x) \right\}^{2} \mathrm{d}t\,\mathrm{d}x,$$
$$\psi(X_i, Z_i, A_i, Y_i; \beta, b) = A_i\, \frac{\partial_{\beta}\, p_{Y|X,Z}(Y_i \mid X_i, Z_i; \beta)}{p_{Y|X,Z}(Y_i \mid X_i, Z_i; \beta)} - A_i\, b(Y_i, X_i) - \frac{\int \partial_{\beta}\, p_{Y|X,Z}(t \mid X_i, Z_i; \beta)\, \eta^{*}(t)\,\mathrm{d}t}{\int p_{Y|X,Z}(t \mid X_i, Z_i; \beta)\, (1 - \eta^{*}(t))\,\mathrm{d}t} + (1 - A_i)\, \frac{\int p_{Y|X,Z}(t \mid X_i, Z_i; \beta)\, b(t, X_i)\, \eta^{*}(t)\,\mathrm{d}t}{\int p_{Y|X,Z}(t \mid X_i, Z_i; \beta)\, (1 - \eta^{*}(t))\,\mathrm{d}t},$$
and $C(\cdot)$ and $K(\cdot)$ are defined as follows:
$$C(t, x, Z_i; \beta) := \partial_{\beta}\, p_{Y|X,Z}(t \mid x, Z_i; \beta) + \frac{\int \partial_{\beta}\, p_{Y|X,Z}(s \mid x, Z_i; \beta)\, \eta^{*}(s)\,\mathrm{d}s}{\int p_{Y|X,Z}(s \mid x, Z_i; \beta)\, (1 - \eta^{*}(s))\,\mathrm{d}s}\; p_{Y|X,Z}(t \mid x, Z_i; \beta),$$
$$K(s, t, x, X_i, Z_i; \beta) := \frac{p_{Y|X,Z}(s \mid x, Z_i; \beta)\, p_{Y|X,Z}(t \mid x, Z_i; \beta)\, \eta^{*}(s)}{\int p_{Y|X,Z}(s' \mid x, Z_i; \beta)\, (1 - \eta^{*}(s'))\,\mathrm{d}s'}\; \mathrm{Kernel}_h(X_i - x).$$
Here $\mathrm{Kernel}_h(\cdot) := \mathrm{Kernel}(\cdot / h) / h$ and the kernel function is chosen based on the criteria outlined in Section 4 of Zhao & Ma (2022). In the simulation, we choose the same kernel function as in Zhao & Ma (2022), with the same bandwidth $h = 3 N^{-1/5}$:
$$\mathrm{Kernel}(t) = \frac{45}{32} \left( \frac{7}{3} t^{2} - 1 \right) \left( t^{2} - 1 \right) \mathbb{1}\{ |t| \leq 1 \}.$$
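The kernel weights entering $K(s, t, x, X_i, Z_i; \beta)$ can be coded directly; the sketch below follows the kernel and bandwidth displayed above and includes a quick check that the kernel integrates to one.

```python
import numpy as np

def kernel(t):
    """Fourth-order kernel displayed above, supported on |t| <= 1."""
    t = np.asarray(t, dtype=float)
    return 45.0 / 32.0 * (7.0 / 3.0 * t**2 - 1.0) * (t**2 - 1.0) * (np.abs(t) <= 1.0)

def kernel_h(x, n):
    """Kernel_h(x) = Kernel(x / h) / h with bandwidth h = 3 * n**(-1/5)."""
    h = 3.0 * n ** (-0.2)
    return kernel(x / h) / h

# sanity check: the kernel integrates to (approximately) one
grid = np.linspace(-1.0, 1.0, 20001)
print(kernel(grid).sum() * (grid[1] - grid[0]))
```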