# interpretable_metalearning_of_physical_systems__85b2e27d.pdf

Published as a conference paper at ICLR 2024

INTERPRETABLE META-LEARNING OF PHYSICAL SYSTEMS

Matthieu Blanke Inria Paris, DI ENS, PSL Research University matthieu.blanke@inria.fr

Marc Lelarge Inria Paris, DI ENS, PSL Research University marc.lelarge@inria.fr

Machine learning methods can be a valuable aid in the scientiﬁc process, but they need to face challenging settings where data come from inhomogeneous experimental conditions. Recently, meta-learning approaches have made signiﬁcant progress in multi-task learning, but they rely on black-box neural networks, resulting in high computational costs and limited interpretability. Leveraging the structure of the learning problem, we argue that multi-environment generalization can be achieved using a simpler learning model, with an afﬁne structure with respect to the learning task. Crucially, we prove that this architecture can identify the physical parameters of the system, enabling interpretable learning. We demonstrate the competitive generalization performance and the low computational cost of our method by comparing it to state-of-the-art algorithms on physical systems, ranging from toy models to complex, non-analytical systems. The interpretability of our method is illustrated with original applications to parameter identiﬁcation and to adaptive control.

1 INTRODUCTION

Learning physical systems is an essential application of artificial intelligence that can unlock significant technological and societal progress. Physical systems are inherently complex, making them difficult to learn (Karniadakis et al., 2021). A particularly challenging and common scenario is multi-environment learning, where observations of a physical system are collected under inhomogeneous experimental conditions (Caruana, 1997). In such cases, the scarcity of training data necessitates the development of robust learning algorithms that can efficiently handle environmental changes and make use of all available data.

This multi-environment learning problem falls within the framework of multi-task learning, which has been widely studied in the ﬁeld of statistics since the 1990s (Caruana, 1997). The aim is to exploit task diversity to learn a shared representation of the data and thus improve generalization. With the rise of deep learning, several meta-learning approaches have attempted in recent years to incorporate multi-task generalization into gradient-based training of deep neural networks. In the seminal paper by Finn et al. (2017), and several variants that followed (Zintgraf et al., 2019; Raghu et al., 2020), this is done by integrating an inner gradient loop in the training process. Alternatively, Bertinetto et al. (2019) proposed adapting the weights using a closed-form solver. As far as physical systems are concerned, the majority of the proposed methods have focused on speciﬁc architectures oriented towards trajectory prediction (Wang et al., 2022a; Kirchmeyer et al., 2022).

When learning a physical system from data, a critical yet often overlooked challenge is model interpretability (Lipton, 2018; Grojean et al., 2022). Interpreting the learned parameters in terms of the system's physical quantities is crucial to making the model more explainable, allowing for scientific discovery and downstream model-based applications such as control. The above approaches benefit from the expressiveness of deep learning, but are costly in terms of computational time, both for learning and for inference. Furthermore, the complexity and the black-box nature of neural networks hinder the interpretability of the learned parameters, even when the physical system is linearly parametrized.

Recently, Wang et al. (2021) showed theoretically that the learning capabilities of gradient-based meta-learning algorithms could be matched by the simpler architecture of multi-task representation learning with hard parameter sharing, where the heads of a neural network are trained to adapt to multiple tasks (Caruana, 1997; Ruder, 2017). They also demonstrated empirically that this architecture is competitive against state-of-the-art gradient-based meta-learning algorithms for few-shot image classification. We propose to use multi-task representation learning for physical systems, and show how it can bridge the gap between the power of neural networks and the interpretability of the model, with minimal computational costs.

Contributions In this work, we study the problem of multi-environment learning of physical systems. We model the variability of physical systems with a multi-task representation learning architecture that is affine in task-specific parameters. By exploiting the structure of the learning problem, we show how this architecture lends itself to multi-environment generalization, with considerably lower cost than complex meta-learning methods. Additionally, we show that it enables identification of physical parameters for linearly parametrized systems, and local identification for arbitrary systems. Our method's generalization abilities and computational speed are experimentally validated on various physical systems and compared with the state of the art. The interpretability of our model is illustrated by applications to parameter identification and to adaptive control.

2 LEARNING FROM MULTIPLE PHYSICAL ENVIRONMENTS

In this section, we present the problem of multi-task learning as it occurs in the physical sciences and we summarize how it can be tackled with deep learning in a meta-learning framework.

2.1 THE VARIABILITY OF PHYSICAL SYSTEMS

In general, a physical system is not fixed from one interaction to the next, as experimental conditions vary, whether in a controlled or uncontrolled way. From a learning perspective, we assume a meta-dataset $D := \bigcup_{t=1}^{T} D_t$ composed of $T$ datasets, each dataset gathering observations of the physical system under specific experimental conditions. The goal is to learn a predictor from $D$ that is robust to task changes, in the sense that when presented with a new task, it can learn the underlying function from a few samples (Hospedales et al., 2021). Note that in practice the number of tasks $T$ is typically very limited, owing to the high cost of running physical experiments.

For simplicity, we assume a classical supervised regression setting where $D_t := \{x_t^{(i)}, y_t^{(i)}\}_{1 \le i \le N_t}$ and the goal is to learn an $x \mapsto y$ predictor, although the approaches presented generalize to other settings such as trajectory prediction of dynamical systems. We discuss two physical examples illustrating the need for multi-task learning algorithms, with different degrees of complexity.

Example 1 (Actuated pendulum). We begin with the pendulum, one of physics' most famous toy systems. Denoting its inertia and its mass by $I$ and $m$ and the applied torque by $u$, the angle $q$ obeys

$$I \ddot{q} + m g \sin q = u. \tag{2.1}$$

For example, we may want to learn the action $y = u$ as a function of the coordinates $x = (q, \dot{q}, \ddot{q})$. In a data-driven framework, the trajectories collected may show variations in the pendulum parameters: the same equation (2.1) holds true, albeit with different parameters $m$ and $I$.
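To make Example 1 concrete, here is a minimal sketch that generates a small meta-dataset from equation (2.1) and checks that the torque is linear in the parameters $(I, m)$. The parameter values, sampling ranges and the helper name `pendulum_task` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
g = 9.81  # gravitational acceleration

def pendulum_task(inertia, mass, n_samples=20):
    """Sample inputs x = (q, q_dot, q_ddot) and torques u obeying equation (2.1)."""
    q = rng.uniform(-np.pi, np.pi, n_samples)
    q_dot = rng.uniform(-1.0, 1.0, n_samples)
    q_ddot = rng.uniform(-1.0, 1.0, n_samples)
    x = np.stack([q, q_dot, q_ddot], axis=1)
    u = inertia * q_ddot + mass * g * np.sin(q)
    return x, u

# Hypothetical environments: each task has its own (I, m).
tasks = [(0.5, 1.0), (1.0, 2.0), (2.0, 0.5)]
meta_dataset = [pendulum_task(I, m) for I, m in tasks]

# The torque is linear in the parameters: u = [q_ddot, g sin q] @ [I, m],
# so ordinary least squares on these two features recovers (I, m) for a task.
x, u = meta_dataset[0]
features = np.stack([x[:, 2], g * np.sin(x[:, 0])], axis=1)
recovered, *_ = np.linalg.lstsq(features, u, rcond=None)
print(recovered)
```

In this toy setting the features are known analytically; the point of the meta-learning approach is to learn such features from data when they are not.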

A more complex, non-analytical example is that of learning the solution to a partial differential equation, which is rarely known in closed form and varies strongly according to the boundary conditions.

Example 2 (Electrostatic potential). The electrostatic potential $y$ in a space $\Omega$ devoid of charges solves Laplace's equation, with boundary conditions

$$\Delta y = 0 \ \text{ on } \Omega, \qquad y(x) = b(x) \ \text{ on } \partial\Omega. \tag{2.2}$$

A robust data-driven solver should be able to generalize to (at least small) changes of $\Omega$ and $b$.
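As an illustration of Example 2, the following sketch solves a discretized version of (2.2) on a square grid with Jacobi iterations, for two hypothetical environments that differ only in the potential imposed on one edge. The grid size, the boundary layout and the function names are assumptions made for this example; the paper does not specify this solver.

```python
import numpy as np

def solve_laplace(boundary, n_iter=5000):
    """Jacobi iterations for the discrete Laplace equation on a square grid.

    The edge values of `boundary` encode b(x); interior values are ignored.
    """
    y = boundary.astype(float).copy()
    y[1:-1, 1:-1] = 0.0
    for _ in range(n_iter):
        # Each interior point is replaced by the average of its four neighbours.
        y[1:-1, 1:-1] = 0.25 * (
            y[:-2, 1:-1] + y[2:, 1:-1] + y[1:-1, :-2] + y[1:-1, 2:]
        )
    return y

def edge_boundary(voltage, size=21):
    """Hypothetical boundary condition: one edge at `voltage`, the others grounded."""
    b = np.zeros((size, size))
    b[0, :] = voltage
    return b

# Two environments (tasks) that differ by the boundary condition.
solutions = {v: solve_laplace(edge_boundary(v)) for v in (1.0, 2.0)}
```

Because the equation is linear, the two solutions differ only by a factor of two here; changes in the geometry of $\Omega$ would not have such a simple effect.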

2.2 OVERVIEW OF MULTI-ENVIRONMENT DEEP LEARNING

Multi-task statistical learning has a long history, and several approaches to this problem have been proposed in the statistics community (Caruana, 1997). We will focus on the meta-learning paradigm (Hospedales et al., 2021), which has recently gained considerable importance and whose application to neural nets looks promising given the complexity of physical systems. We describe the generic structure of meta-learning algorithms for multi-task generalization.

Learning model Given the learning capabilities of neural networks, incorporating multi-task generalization into their gradient descent training algorithms is a major challenge. Since the seminal paper by Finn et al. (2017), several algorithms have been proposed for this purpose, with the common idea of finding a map adapting the weights of the neural network according to task data. A convenient point of view is to introduce a two-fold parametrization of a meta-model $F(x; \theta, w)$, with a task-agnostic parameter vector $\theta \in \mathbb{R}^p$ and task-specific weights $w$ (also called learning contexts). For each task $t$, the task-specific weight is computed based on some trainable meta-parameters $\pi$ and the task data currently being processed as $w_t := A(\pi, D_t)$, according to an adaptation rule $A$ that is differentiable with respect to $\pi$. The meta-parameters are trained to minimize the meta-loss function aggregated over the tasks, as we will see below.

We provide examples of recent architectures in Table 1. In MAML (Finn et al., 2017), the meta-parameter $\pi$ is simply $\theta$ and the adaptation rule is computed as a gradient step in the direction of the task-specific loss improvement, in an inner gradient loop. In CoDA (Kirchmeyer et al., 2022), the meta-parameter $\pi$ has a dimension growing with the number of tasks $T$ and the adaptation rule is computed directly from the meta-parameters, with task-specific low-dimensional context vectors $\xi_t \in \mathbb{R}^{d_\xi}$ and a linear hypernetwork $\Theta \in \mathbb{R}^{p \times d_\xi}$. Variants of MAML, CAVIA (Zintgraf et al., 2019) and ANIL (Raghu et al., 2020), fit into this scheme as well and correspond to the restriction of the adaptation inner gradient loop to a predetermined set of the network's weights. This framework also encompasses the CAMEL algorithm, which we introduce in Section 3.

Meta-training The training process is summarized in Algorithm 1. For each task $t$, the meta-learner computes a task-specific version of the model from the task dataset $D_t$, defining $f_t(x; \pi) := F(x; \theta, A(\pi, D_t))$. The error on the dataset $D_t$ is measured by the task-specific loss

$$\ell(D_t; \theta, w) = \sum_{(x, y) \in D_t} \frac{1}{2} \left| F(x; \theta, w) - y \right|^2. \tag{2.3}$$

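Parameters $\pi$ are trained by gradient descent in order to minimize the regularized meta-loss, defined as the aggregation of the task losses $L_t := \ell(D_t; \theta, w_t(\pi))$ and a regularization term $R(\pi)$:

$$L(\pi) = \sum_{t=1}^{T} \ell(D_t; \theta, w_t(\pi)) + R(\pi). \tag{2.4}$$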

Algorithm 1 Gradient-based meta-training

input: meta-model F(x; θ, w), adaptation rule A, initial meta-parameters π, learning rate η, task datasets D_1, ..., D_T
output: learned meta-parameters π
while not converged do
    for tasks 1 ≤ t ≤ T do
        compute θ from π
        adapt w_t(π) := A(π, D_t)
        compute ℓ(D_t; θ, w_t(π))
    end for
    compute L(π), as in (2.4)
    update π ← π − η ∇L(π)
end while

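Table 1: Structure of various meta-learning models. Here $h(x; \theta) \in \mathbb{R}$ and $v(x; \theta) \in \mathbb{R}^r$ denote arbitrary parametric models, such as neural networks; order stands for differentiation order.

| | MAML | CoDA | CAMEL |
|---|---|---|---|
| $\pi$ | $\theta$ | $\theta, \Theta, \{\xi_t\}$ | $\theta, \{\omega_t\}$ |
| $\dim(\pi)$ | $p$ | $p + p\, d_\xi + d_\xi T$ | $p + r T$ |
| $\dim(w)$ | $p$ | $p$ | $r$ |
| $A(\pi, D_t)$ | $-\alpha \nabla_\theta L_t$ | $\Theta \xi_t$ | $\omega_t$ |
| $F(x; \theta, w)$ | $h(x; \theta + w)$ | $h(x; \theta + w)$ | $w^\top v(x; \theta)$ |
| training order | 2 | 1 | 1 |
| adaptation order | 1 | 1 | 0 |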

Test-time adaptation At test time, the trained meta-model is presented with a dataset $D_{T+1}$ consisting of few samples (or shots) from a new task. Using this adaptation data and the learned meta-parameters $\pi$, the task-agnostic component $\theta$ of the meta-model is frozen, and the task-specific component is tuned (possibly in a constrained set) by minimizing the prediction error on the adaptation dataset:

$$w_{T+1} \in \operatorname*{argmin}_{w} \ \ell(D_{T+1}; \theta, w). \tag{2.5}$$

In all the above approaches, this minimization is performed by gradient descent. The resulting adapted predictor is defined as $F(x; \theta, w_{T+1})$, and is evaluated by its performance on a separate test set from task $T + 1$, averaged over the task distribution.

3 CONTEXT-AFFINE MULTI-ENVIRONMENT LEARNING

Physical systems often have a particular structure in the form of mathematical models and equations. The general idea behind model-based machine learning is to exploit the available structure to increase learning performance and minimize computational costs (Karniadakis et al., 2021). With this in mind, we adopt in this section a simpler architecture than those shown above, and show how it lends itself particularly well to learning physical systems.

Problem structure We note that many equations in physics exhibit an afﬁne task dependence, since the varying physical parameters often are linear coefﬁcients (as we see in Example 1, and we shall further explain in Section 4). By incorporating this same structure and hence mimicking physical equations, the model should be well-suited for learning them and for interpreting the physical parameters. Following these intuitions, we propose to learn multi-environment physical systems with afﬁne task-speciﬁc context parameters.

Definition 1 (Context-affine multi-task learning). The prediction is modeled as an affine function of low-dimensional task-specific weights $w \in \mathbb{R}^r$ with a task-agnostic feature map $v(x; \theta) \in \mathbb{R}^r$ and a task-agnostic bias $c(x; \theta) \in \mathbb{R}$:

$$F(x; \theta, w) = c(x; \theta) + w^\top v(x; \theta). \tag{3.1}$$

The dimension $r$ of the task weight must be chosen carefully. It must be larger than the estimated number of physical parameters varying from task to task but smaller than the number of training tasks, so as to observe the function $v$ projected over a sufficient number of directions. During training, the task-specific weights are directly trained as meta-parameters along with the shared parameter vector: $\pi = (\theta, \omega_1, \dots, \omega_T)$ and $w_t = A(\pi, D_t) = \omega_t$. The meta-parameters are jointly trained by gradient descent as in Algorithm 1. At test time, the minimization problem of adaptation (2.5) reduces to ordinary least squares.
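A minimal NumPy sketch of the training and adaptation steps just described, under strong simplifying assumptions: synthetic data, no bias ($c = 0$), no regularization, and a feature map $v(x; \theta) = \Theta\, b(x)$ that is linear in a fixed, hypothetical basis stored in `X`, rather than a neural network. All sizes, the learning rate and the variable names are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, d, n, r = 8, 20, 4, 2, 3  # tasks, samples per task, basis size, true params, task-weight dim

# Synthetic linearly parametrized system: y_t(x) = phi_t^T nu(x) with nu(x) = A b(x).
X = rng.normal(size=(N, d))        # basis b(x) evaluated at N shared inputs (assumed given)
A = rng.normal(size=(n, d))
Phi = rng.normal(size=(T, n))      # physical parameters of the training tasks
Y = Phi @ A @ X.T                  # observations, shape (T, N)

# Meta-parameters pi = (theta, omega_1, ..., omega_T), trained jointly by gradient descent.
Theta = 0.1 * rng.normal(size=(r, d))   # shared parameters of v(x; theta) = Theta b(x)
W = 0.1 * rng.normal(size=(T, r))       # one task weight omega_t per row
lr = 0.05
losses = []
for step in range(2000):
    R = W @ Theta @ X.T - Y             # residuals of the predictions w_t^T v(x_i)
    losses.append(0.5 * np.mean(R ** 2))
    grad_W = R @ X @ Theta.T / (T * N)
    grad_Theta = W.T @ R @ X / (T * N)
    W -= lr * grad_W
    Theta -= lr * grad_Theta

# Test-time adaptation: theta is frozen and (2.5) is an ordinary least-squares problem.
phi_new = rng.normal(size=n)
X_new = rng.normal(size=(5, d))         # 5 shots from a new environment
y_new = X_new @ A.T @ phi_new
V_new = X_new @ Theta.T                 # learned features at the adaptation points
w_new, *_ = np.linalg.lstsq(V_new, y_new, rcond=None)
```

No inner loop or second-order derivative appears in the training loop, and adaptation is a single linear solve.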

The architecture introduced in Definition 1 is equivalent to multi-task representation learning with hard parameter sharing (Ruder, 2017) and is proposed as a meta-learning algorithm in (Wang et al., 2021). We will refer to it in our physical system framework as Context-Affine Multi-Environment Learning (CAMEL). In this work, we show that CAMEL is particularly relevant for learning physical systems. Table 1 compares CAMEL with the meta-learning algorithms described above.

Computational benefits As the task weights $(\omega_t)_{t=1}^{T}$ are kept in memory during training instead of being computed in an inner loop, CAMEL can be trained at minimal computational cost. In particular, it does not need to compute Hessian-vector products as in MAML, or to propagate gradients through matrix inversions as in (Bertinetto et al., 2019). Adaptation at test time is also computationally inexpensive since ordinary least squares guarantees a unique solution in closed form, as long as the number of samples exceeds the dimension $r$ of the task weight. For real-time applications, the online least-squares formula (Kushner & Yin, 2003) ensures adaptation with minimal memory and compute requirements, whereas gradient-based adaptation (as in CoDA or in MAML) can be excessively slow.
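For illustration, online least-squares adaptation can be sketched with the standard recursive least-squares recursion. This is a generic textbook form, not necessarily the exact formula used by the authors; the features are random stand-ins for $v(x; \theta)$, and the initialization constant `delta` is an assumption.

```python
import numpy as np

def rls_update(w, P, v, y):
    """One recursive least-squares step for the model y ≈ w^T v.

    P tracks the inverse of the regularized Gram matrix of the features seen so far.
    """
    Pv = P @ v
    gain = Pv / (1.0 + v @ Pv)
    w = w + gain * (y - w @ v)
    P = P - np.outer(gain, Pv)
    return w, P

rng = np.random.default_rng(0)
r, n_shots, delta = 3, 50, 1e-3
V = rng.normal(size=(n_shots, r))        # stand-in for features v(x_i; theta)
w_true = np.array([1.0, -0.5, 2.0])
y = V @ w_true

# Process the observations one at a time, with O(r^2) memory and compute per step.
w, P = np.zeros(r), np.eye(r) / delta
for v_i, y_i in zip(V, y):
    w, P = rls_update(w, P, v_i, y_i)

# Batch solution of the same (slightly regularized) least-squares problem, for comparison.
w_ridge = np.linalg.solve(V.T @ V + delta * np.eye(r), V.T @ y)
```

With this initialization the recursion reproduces the batch ridge solution with regularization `delta`, so each new observation refines the task weight without revisiting past data.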

Applicability The meta-learning models described in Section 2.2 seek to learn multi-task data from a complex parametric model (typically a neural network), making the structural assumption that the weights vary slightly around a central value in parameter space: $f_t(x; \pi) = h(x; \theta_0 + \delta\theta_t)$, with $\|\delta\theta_t\| \ll \|\theta_0\|$. Extending this reasoning, the model should be close to its linear approximation:

$$h(x; \theta_0 + \delta\theta_t) \simeq h(x; \theta_0) + \delta\theta_t^\top \nabla h(x; \theta_0), \tag{3.2}$$

where we observe that the output is an affine function of the task-specific component $\delta\theta_t$. We believe that (3.2) explains the observation that MAML mainly adapts the last layer of the neural network (Raghu et al., 2020). In Definition 1, $v$ and $c$ are arbitrary parametric models, which can be as complex as a deep neural network and are trained to learn a representation that is linear in the task weights. Following (3.2), we expect CAMEL's expressivity to be of the same order as that of more complex architectures, with $c(x; \theta)$, $w_t$ and $v(x; \theta)$ playing the roles of $h(x; \theta_0)$, $\delta\theta_t$ and $\nabla h(x; \theta_0)$ respectively. Another key advantage of CAMEL is the interpretability of the model, which we describe next.
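The quality of the linear approximation (3.2) can be checked numerically on a toy model. The sketch below uses a hypothetical one-hidden-layer model $h(x; \theta) = a^\top \tanh(b\, x)$ with $\theta = (a, b)$, chosen only because its gradient is available in closed form; it is not an architecture from the paper.

```python
import numpy as np

def h(x, theta):
    """Toy scalar model h(x; theta) = a^T tanh(b x), with theta = (a, b)."""
    a, b = theta[:3], theta[3:]
    return a @ np.tanh(b * x)

def grad_h(x, theta):
    """Gradient of h with respect to theta, computed by hand."""
    a, b = theta[:3], theta[3:]
    t = np.tanh(b * x)
    return np.concatenate([t, a * (1.0 - t ** 2) * x])

rng = np.random.default_rng(0)
theta0 = rng.normal(size=6)
direction = rng.normal(size=6)
x = 0.7

# Error of the affine approximation for shrinking task perturbations delta_theta.
errors = []
for scale in (1e-1, 1e-2, 1e-3):
    delta = scale * direction
    linear = h(x, theta0) + delta @ grad_h(x, theta0)
    errors.append(abs(h(x, theta0 + delta) - linear))
print(errors)
```

The error is expected to shrink roughly quadratically with the perturbation size, which is the regime in which an affine dependence on the task weights loses little expressivity.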

4 INTERPRETABILITY AND SYSTEM IDENTIFICATION

The observations of a physical system are often known to depend on certain well-identiﬁed physical quantities that may be of critical importance in the scientiﬁc process. When modeling the system in a data-driven approach, it is desirable for the trained model parameters to be interpretable in terms of these physical quantities (Karniadakis et al., 2021), thus ensuring controlled and explainable learning (Linardatos et al., 2021). We here focus on the identiﬁcation of task-varying physical parameters, which raises the question of the identiﬁability of the learned task-speciﬁc weights. System identiﬁcation and model identiﬁability are key issues when learning a system (Ljung, 1998). Although deep neural networks are becoming increasingly popular for modeling physical systems, their complex structure makes them impractical for parameter identiﬁcation in general (Nelles, 2001).

Physical context identification In mathematical terms, the observed output is considered as an unknown function $y(x; \phi)$ of the input and a physical context vector $\phi \in \mathbb{R}^n$, gathering the parameters of the system. In our multi-environment setting, each task is defined by a vector $\phi_t$ as $y_t(x) = y(x; \phi_t)$. At test time, a new environment corresponds to an unknown underlying physical context $\phi_{T+1}$. While adaptation consists in minimizing the prediction error on the data as in (2.5), the interpretation goes further and seeks to identify $\phi_{T+1}$. This means mapping the learned task-specific weights $w$ to the physical contexts $\phi$, i.e. learning an estimator $\hat{\phi} : w \mapsto \phi$ using the training data and the trained model. Assuming that the physical parameters of the training data $\{\phi_t\}$ are known, this can be viewed as a regression problem with $T$ samples, where $\hat{\phi}$ is trained to predict $\phi_t$ from weights $w_t$ learned on the training meta-dataset.

4.1 LINEARLY PARAMETRIZED SYSTEMS

We are primarily interested in the case where the physical parameters are known to intervene linearly in the system equation, as

$$y(x; \phi) = \kappa(x) + \phi^\top \nu(x), \qquad \nu(x) \in \mathbb{R}^n. \tag{4.1}$$

This class of systems is of crucial importance: although simple, it covers a large number of problems of interest, as the following examples illustrate. Furthermore, it can apply locally to more general systems, as we shall see later.

Example 3 (Electric point charges). Point charges are a particular case of Example 2 with point boundary conditions, proportional to the charges $\phi = (\phi^{(1)}, \dots, \phi^{(n)})$. The resulting field can be computed using Coulomb's law and is proportional to these charges: $y(x; \phi) = \phi^\top \nu(x)$, with $\nu(x) \propto (1 / \|x - x^{(j)}\|)_j$. Although the solution is known in closed form, this example can illustrate more complex problems where an analytical solution is out of reach (and hence $\nu$ is unknown) but the linear dependence on certain well-identified parameters is postulated or known.

Example 4 (Inverse dynamics in robotics). One application in robotics for which our model is particularly well suited is inverse dynamics: it turns out that the Euler-Lagrange formulation for the rigid body dynamics is always linear with respect to the system's dynamic parameters (Nguyen-Tuong & Peters, 2010), and hence takes the form of (4.1). A simple, yet illustrative system with this structure is the actuated pendulum (2.1), where it is clear that the equation is linear in the inertial parameters $I$ and $m$. The inverse dynamics equation can be used for trajectory tracking (Spong et al., 2020), as it predicts $u$ from a target trajectory $\{q(s)\}$. We provide more details in Appendix B.3.
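The linear structure of Example 3 can be made explicit in a few lines. In this sketch the charge locations, the evaluation point and the charge values are hypothetical, and the physical proportionality constant is set to one.

```python
import numpy as np

# Hypothetical fixed locations x^(j) of n = 3 point charges in the plane.
charge_positions = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

def nu(x):
    """Feature vector (1 / ||x - x^(j)||)_j at a single point x."""
    return 1.0 / np.linalg.norm(x - charge_positions, axis=1)

def potential(x, phi):
    """Potential y(x; phi) = phi^T nu(x), up to a constant factor."""
    return phi @ nu(x)

x = np.array([0.5, 0.7])
phi_a = np.array([1.0, -2.0, 0.5])
phi_b = np.array([0.3, 0.1, -1.0])

# Superposition: the potential is linear in the charge vector phi.
lhs = potential(x, phi_a + phi_b)
rhs = potential(x, phi_a) + potential(x, phi_b)
```

Changing the environment only changes `phi`, while `nu` is shared across tasks, which is exactly the form (4.1) with $\kappa = 0$.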

4.2 LOCALLY LINEAR PHYSICAL CONTEXTS

In the absence of prior knowledge about the system under study, the most reasonable structural assumption for multi-task data is to postulate small variations in the system parameter: $\phi = \phi_0 + \delta\phi$. The learned function can then be expanded and found to be locally linear in physical contexts:

$$y(x; \phi) \simeq y(x; \phi_0) + \delta\phi^\top \nabla_\phi y(x; \phi_0), \tag{4.2}$$

which has the form (4.1) with $\kappa(x) = y(x; \phi_0)$ and $\nu(x) = \nabla_\phi y(x; \phi_0)$.

Example 5 (Identification of boundary perturbations). For a general boundary value problem such as (2.2), we may assume that the boundary conditions $\Omega(\phi)$, $b(x, \phi)$ vary smoothly according to parameters $\phi$ (such as angles or displacements). If these variations are small and the problem is sufficiently regular, the resulting solution $y(x; \phi)$ can be reasonably well approximated by (4.2).

4.3 PARAMETER IDENTIFICATION WITH CAMEL

We now study the problem of system identification under the assumption of parameter linearity (4.1) using the CAMEL meta-model (3.1). We study the identifiability of the model and therefore investigate the vanishing training loss limit, with $c = \kappa = 0$ for simplicity, yielding

$$\omega_t^\top v(x_t^{(i)}) = \phi_t^\top \nu(x_t^{(i)}) \quad \text{for all } 1 \le t \le T, \ 1 \le i \le N_t. \tag{4.3}$$

Identifiability Posed as it is, we can easily see that the physical parameters $\phi_t$ are not directly identifiable. Indeed, for any $P \in \mathrm{GL}_r(\mathbb{R})$, the weights $\omega$ and the feature map $v$ produce the same data as the weights $\omega' := P^\top \omega$ and the feature map $v' = P^{-1} v$, since $\omega'^\top v' = \omega^\top P P^{-1} v = \omega^\top v$. This problem is related to that of identification in matrix factorization (see for example Fu et al. (2018)). Now that we have recognized this symmetry of the problem, we can ask whether it characterizes the solutions found by CAMEL. The following result provides a positive answer.
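The symmetry described above is easy to verify numerically. In this short sketch the vectors and the invertible matrix `P` are arbitrary random draws, used only to illustrate the invariance.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 3
omega = rng.normal(size=r)                      # a task weight
v = rng.normal(size=r)                          # a feature vector v(x; theta)
P = np.eye(r) + 0.2 * rng.normal(size=(r, r))   # a well-conditioned invertible matrix

omega_prime = P.T @ omega                       # transformed weights
v_prime = np.linalg.solve(P, v)                 # transformed features P^{-1} v

# Both parametrizations yield the same prediction omega^T v.
same_prediction = np.isclose(omega_prime @ v_prime, omega @ v)
```

Any identification procedure therefore has to resolve this linear ambiguity, for instance by using known physical parameters of the training tasks.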

Proposition 1. Assume that the training points are uniform across tasks: $x_t^{(i)} = x^{(i)}$ and $N_t = N$ for all $1 \le t \le T$ and $1 \le i \le N$, with $n \le r < N, T$. Assume that both sets $\{\nu(x^{(i)})\}$ and $\{\phi_t\}$ span $\mathbb{R}^n$. In the limit of a vanishing training loss $L(\pi) = 0$, the trained meta-parameters recover the parameters of the system up to a linear transform: there exist $P, Q \in \mathbb{R}^{n \times r}$ such that $\phi_t = P \omega_t$ for every training task $t$ and $\nu(x^{(i)}) = Q v(x^{(i)})$ for all $1 \le i \le N$. Additionally, $Q P^\top = I_n$.

A proof is provided in Appendix A, along with the case $\kappa \neq 0$. Proposition 1 shows that CAMEL learns a meaningful representation of the system's features instead of overfitting the examples from the training tasks. Remarkably, the relationship between the learned weights and the system parameters is linear and can be estimated using ordinary least squares:

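$$\hat{\phi}(\omega) = \hat{P} \omega, \qquad \hat{P} \in \operatorname*{argmin}_{P \in \mathbb{R}^{n \times r}} \ \frac{1}{2} \sum_{t=1}^{T} \left\| P \omega_t - \phi_t \right\|_2^2. \tag{4.4}$$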
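The statement of Proposition 1 and the estimator (4.4) can be illustrated on synthetic data. In the sketch below, a zero-loss solution $(\omega_t, v)$ is obtained from a truncated SVD of the observation matrix instead of gradient training, with $r = n$ for simplicity; the sizes and random parameters are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, n = 6, 10, 3                 # tasks, shared training points, physical parameters
Phi = rng.normal(size=(T, n))      # true physical contexts phi_t (rows)
Nu = rng.normal(size=(N, n))       # true features nu(x_i) (rows)
Y = Phi @ Nu.T                     # observations y_t(x_i) = phi_t^T nu(x_i)

# A zero-loss factorization Y = W V^T found without access to Phi or Nu.
U, S, Vt = np.linalg.svd(Y, full_matrices=False)
W = U[:, :n] * np.sqrt(S[:n])      # learned task weights omega_t (rows)
V = Vt[:n].T * np.sqrt(S[:n])      # learned features v(x_i) (rows)

# Identification map (4.4): least-squares regression of phi_t on omega_t.
P_hat = np.linalg.lstsq(W, Phi, rcond=None)[0].T
# The analogous map on the feature side, used only to check Q P^T = I_n.
Q_hat = np.linalg.lstsq(V, Nu, rcond=None)[0].T

# New environment: adapt by least squares, then identify its physical context.
phi_new = rng.normal(size=n)
y_new = Nu @ phi_new
omega_new = np.linalg.lstsq(V, y_new, rcond=None)[0]
phi_estimate = P_hat @ omega_new
```

In this noiseless setting the estimate `phi_estimate` matches `phi_new` up to numerical precision, even though the learned weights themselves differ from the physical parameters by an unknown linear transform.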

For black-box meta-learning architectures, exhibiting the symmetries in model parameters and computing an identiﬁcation map seems out of reach, as the number of available tasks T can be very limited in practice (Pourzanjani et al., 2017).

5 EXPERIMENTING ON PHYSICAL SYSTEMS

The architecture that we have presented is expected to adapt efﬁciently to the prediction of new environments, and identify (locally or globally) their physical parameters, as shown in Section 4. In this section, we validate these statements experimentally on various physical systems: Sections 5.1 and 5.2 deal with systems with linear parameters (as in (4.1)), on which we evaluate the interpretability of the algorithms. We then examine a non-analytical, general system in Section 5.3. We compare the performances of CAMEL with state-of-the-art meta-learning algorithms. Our code and demonstration material are available at https://github.com/MB-29/meta-learning.

Baselines We have implemented the MAML algorithm of Finn et al. (2017), and its ANIL variant (Raghu et al., 2020), which is computationally lighter and more suitable for learning linearly parametrized systems (according to observation (3.2)). We have also adapted the ℓ1-CoDA architecture of Kirchmeyer et al. (2022) for supervised learning (originally designed for time series prediction). In all our experiments, the different meta-models share the same underlying neural network architecture, with the last layer of size $r \ge \dim(\phi)$. Additional details can be found in Appendix B. The linear regressor defined for CAMEL in (4.4) is computed after training for all architectures with their trained weights $w_t$, and is available at test time for identification.

5.1 INTERPRETABLE LEARNING OF AN ELECTRIC POINT CHARGE SYSTEM

As a first illustration of multi-environment learning, we are interested in a data-driven approach to electrostatics, where the experimenter has no knowledge of the theoretical laws (Maxwell's equations, as in Example 2) of the system under study. The electrostatic potential is measured at various points in space, under different experimental conditions. The observations collected are then used to train a meta-learning model to predict the electrostatic field from new experiments, based on very limited data. We start with the toy system described in Example 3, which provides a qualitative illustration of the behavior of various learning algorithms: $n = 3$ point charges placed in the plane at fixed locations. This experiment is repeated with varying charges $\phi \in \mathbb{R}^3$.

Figure 1: Few-shot adaptation on two out-of-domain environments of the point charge system in a dipolar setting (left) and the capacitor (right). The adaptation points are shown as markers. The vector fields are derived from the learned potential fields using automatic differentiation. (Panels: target, ANIL, CAMEL.)

Figure 2: Average relative error for the point charge identification. (Vertical axis: identification error; curves: CAMEL, ANIL, CoDA.)

Results For this system with linear physical parameters, CAMEL outperforms other baselines and can predict the electrostatic ﬁeld with few shots, as shown in Figure 1 and Table 2 (5-shot adaptation). Figure 2 shows the identiﬁcation error over 30 random test environments with standard deviations, as a function of the number of training tasks. Thanks to the sample complexity of linear regression, CAMEL accurately identiﬁes system charges, achieving less than 1% relative error with 10 training tasks.

5.2 MULTI-TASK REINFORCEMENT LEARNING AND ONLINE SYSTEM IDENTIFICATION

Another scientific field in which our theoretical framework can be applied is multi-task reinforcement learning, in which a control policy is learned using data from multiple environments of one system (Vithayathil Varghese & Mahmoud, 2020). We saw in Example 4 that robot joints obey the inverse dynamics equation, which turns out to be linear in the robot's inertial parameters. Consequently, our architecture lends itself well to the statistical learning of this equation from multiple environment data, as well as to the identification of the dynamic parameters. We may then exploit the learned model of the dynamics to perform adaptive inverse dynamics control (see Appendix B.4) of robots with unknown parameters, and identify the parameters simultaneously.

Systems We experiment with systems of increasing complexity, starting with 2D simulated systems: cartpole and acrobot. To make them more realistic, we add friction in their dynamics. The analytical equation (4) is hence inaccurate, which motivates the use of a data-driven learning method. We then experiment on the simulated 6-degree-of-freedom robot Upkie (Figure 3), for which we do not know the true inverse dynamics function and the wheel torque is learned from the ground position and the joint angles.

Experimental setup Learning algorithms are trained on trajectories (a more challenging setting than uniformly spaced data) obtained from multiple system environments. At test time, a new environment is instantiated and the model is adapted from a trajectory of few observations. The resulting adapted model is then used to predict control values for the rest of the trajectory. For the cartpole and the robot arm, the predicted values are used to track a reference trajectory using inverse dynamics control. For Upkie, we could not directly use the predicted controls for actuation, but we compare the open-loop predictions with the executed control law. The target motions are swing-up trajectories for the cartpole and the arm, and a 0.5 m displacement for Upkie. Since Upkie is a very unstable system, it is controlled in a 200 Hz model predictive control loop (Rawlings, 2000).

Online adaptive control We also investigate a challenging time-varying dynamics setting where the inertial parameters of the system change abruptly at a given time. This scenario is very common in real life and requires the development of control algorithms robust to these changes and fast enough to be adaptive (Åström & Wittenmark, 2013). In our case, we double the mass of the cart in the cartpole system, and we quadruple the mass of Upkie's torso. The learning models adapt their task weights online and adjust their control prediction. In an application to parameter identification, we also compute the estimated values of the varying parameter over time.

Figure 3: Upkie.

Figure 4: Tracking of a reference trajectory using the learned inverse dynamics controller. Left. 50-shot adaptation. Center and right. The model and the controller are adapted online. (Axis labels: time, tip position.)

Results The 100-shot adaptation error of the control values is reported in Table 2. The trajectories obtained with inverse dynamics control adapted from 50 shots are plotted in Figure 4 for CAMEL and for the best-performing baseline, ANIL, along with the analytical solution. Only CAMEL adapts well enough to track the target trajectory. The analytic solution underestimates the control as it does not account for friction, resulting in inaccurate tracking. In the adaptive control setting, the variation in the mass of the cart leads to a deviation from the target trajectory but CAMEL is able to adapt quickly to the new environment and identifies the new mass, unlike ANIL. Experimentation on Upkie shows that the computational time of adaptation can be crucial, as we found that the gradient-based adaptation of ANIL and CoDA was too slow to run in the 200 Hz model predictive control loop. On the other hand, CAMEL's gradient-free adaptation and interpretability allow it to track and identify changes in system dynamics, and to correctly predict the stabilizing control law.

5.3 BEYOND CONTEXT-LINEAR SYSTEMS

Figure 5: Adaptation and relative identification error for the ε-capacitor, with increasing ε. (Axis labels: dϕ, identification error; curves: CAMEL, ANIL, CoDA.)

In order to evaluate our method on general systems with no known parametric structure, we consider the following non-analytical electrostatic problem of the form shown in Example 2. The field is created by a capacitor formed by two electrodes that are not exactly parallel. The variability of the different experiments stems from the misalignment $\delta\phi \in \mathbb{R}^2$, in angle and position, of the upper electrode. We apply the same methodology as described in Section 5.1. The whole multi-environment learning experiment is repeated several times with varying magnitudes of misalignment, by replacing $\delta\phi$ with $\varepsilon\, \delta\phi$ for different values of $\varepsilon \in [0, 1]$. This allows us to move gradually from local perturbations when $\varepsilon \ll 1$ (as in Example 5) to arbitrary variations in the environment.

Results The 40-shot adaptation error for the ε-capacitor is reported in Table 2, with perturbation of full magnitude ε = 1 and with ε = 0.1. We also show the 5-shot adaptation of CAMEL and the best performing baseline, CoDA, for ε = 0.2 in Figure 1. When the system parameters are fully nonlinear, CAMEL and the baselines perform similarly, but CAMEL is much faster. In the second case (ε = 0.1), CAMEL outperforms them by an order of magnitude and accurately predicts the electrostatic field, whereas CoDA exhibits lower precision. Predictions and average identification error (with standard deviations) are plotted as a function of ε in Figure 5. For small ε, the system parameter perturbation is well identified, enabling a zero-shot adaptation.

Table 2: Average adaptation mean squared error (left) and computational time (right).

| System | Charges | Capacitor | ε-Capacitor | Cartpole | Arm | Upkie | Training time | Adaptation time |
|---|---|---|---|---|---|---|---|---|
| MAML | 1.6E-1 | N/A | N/A | 1.8E0 | 8.1E-1 | 1.5E-2 | 30 | 10 |
| ANIL | 9.2E-4 | 3.6E-2 | 1.1E-3 | 2.5E-2 | 7.5E-1 | 1.9E-2 | 10 | 3 |
| CoDA | 8.2E-2 | 2.6E-2 | 1.0E-3 | 8.1E-1 | 9.3E-1 | 2.1E-2 | 2 | 8 |
| R2-D2 | 1.2E-4 | 3.1E-4 | 4.2E-4 | 8.5E-3 | 3.5E-1 | 2.3E-2 | 20 | 1 |
| CAMEL | 1.0E-4 | 2.6E-2 | 1.9E-4 | 3.1E-3 | 2.4E-1 | 8.2E-3 | 1 | 1 |

6 RELATED WORK

Multi-task meta-learning Meta-learning algorithms for multi-task generalization have gained popularity (Hospedales et al., 2021), with the MAML algorithm of Finn et al. (2017) playing a fundamental role in this area. Based on the same principle, the variants ANIL (Raghu et al., 2020) and CAVIA (Zintgraf et al., 2019) have been proposed to mitigate training costs and reduce overﬁtting. Interpretability is addressed in the latter work, using a large number of training tasks. In a different line of work, Bertinetto et al. (2019) proposed the R2-D2 architecture where the heads of the network are adapted using the closed-form formula of Ridge regression. The similarities between multi-task representation learning and gradient-based learning are studied in (Wang et al., 2021) from a theoretical point of view, in the limit of a large number of tasks. Unlike our method, the approaches above rely on the assumption that the number of training tasks is large (in few-shot image classiﬁcation for example, where it can be in the millions (Wang et al., 2021; Hospedales et al., 2021)), while it is typically very limited for physical systems.

Meta-learning physical systems Meta-learning has been applied to multi-environment data for physical systems, with a focus on dynamical systems, where the target function is the flow of a differential equation. Recent algorithms include LEADS (Yin et al., 2021), in which the task dependence is additive in the output space, and CoDA (Kirchmeyer et al., 2022), where parameter identification is addressed briefly, but under strong assumptions of input linearity. Wang et al. (2022b) propose physical-context-based learning, but context supervision is required for training. From a broader point of view, the interpretability of the statistical model can be imposed by adding physical constraints to the loss function (Raissi et al., 2019).

Multi-task reinforcement learning Meta-learning has given rise to a number of fruitful new approaches in the ﬁeld of reinforcement learning. Sodhani et al. (2021) and Clavera et al. (2019) propose multi-task deep learning algorithms, but no structure is assumed on the dynamics and the learned weights can be interpreted only statistically, in the parameter space of a large black-box neural network. Multi-task learning of inverse dynamics with varying inertial parameters is studied in (Williams et al., 2008) using Gaussian processes, but parameter identiﬁcation is not addressed.

7 CONCLUSION

We introduced CAMEL, a simple multi-task learning algorithm designed for multi-environment learning of physical systems. For general and complex physical systems, we demonstrated that our method performs as well as the state-of-the-art, at a much lower computational cost. Moreover, when the learned system exhibits a linear structure in its physical parameters, our architecture is particularly effective, and enables the identification of these parameters with little supervision, independently of training. The identifiability conditions found in Proposition 1 are not very restrictive, and the effectiveness of the linear identification map is demonstrated in our experiments. We proposed a particular application in the field of robotics where our data-driven method enables concurrent adaptive control and system identification. We believe that enforcing more physical structure in the meta-model, using for example Lagrangian neural networks (Lutter et al., 2019), can improve its sample efficiency and extend its applicability to more complex robots. While we focused on classical regression tasks, our framework can be generalized to predict dynamical systems by combining it with a differentiable solver (Chen et al., 2018). Another interesting avenue for future research is the use of active learning, to make the most of the available training resources and enhance the efficiency of multi-task learning for static and dynamic systems (Wang et al., 2023; Blanke & Lelarge, 2023).


ACKNOWLEDGEMENTS

This work was partially supported by the French government under management of Agence Nationale de la Recherche as part of the Investissements d'avenir program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).

REFERENCES

Karl J Åström and Björn Wittenmark. Adaptive control. Courier Corporation, 2013.

L Bertinetto, J Henriques, P Torr, and A Vedaldi. Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations (ICLR), 2019.

Matthieu Blanke and Marc Lelarge. FLEX: an adaptive exploration algorithm for nonlinear systems. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 2577–2591. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/blanke23a.html.

Rich Caruana. Multitask learning. Machine learning, 28:41–75, 1997.

Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018.

Ignasi Clavera, Anusha Nagabandi, Simin Liu, Ronald S. Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HyztsoC5Y7.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pp. 1126–1135. PMLR, 2017.

Xiao Fu, Kejun Huang, and Nicholas D Sidiropoulos. On identifiability of nonnegative matrix factorization. IEEE Signal Processing Letters, 25(3):328–332, 2018.

Christophe Grojean, Ayan Paul, Zhuoni Qian, and Inga Strümke. Lessons on interpretable machine learning from particle physics. Nature Reviews Physics, 4(5):284–286, 2022.

Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. IEEE transactions on pattern analysis and machine intelligence, 44(9):5149–5169, 2021.

George Em Karniadakis, Ioannis G Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-informed machine learning. Nature Reviews Physics, 3(6):422–440, 2021.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015. URL http://dblp.uni-trier.de/db/conf/iclr/iclr2015.html#KingmaB14.

Matthieu Kirchmeyer, Yuan Yin, Jeremie Dona, Nicolas Baskiotis, Alain Rakotomamonjy, and Patrick Gallinari. Generalizing to new physical systems via context-informed dynamics model. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 11283–11301. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/kirchmeyer22a.html.

Martin Kretzschmar. Particle motion in a Penning trap. European Journal of Physics, 12(5):240, 1991.

H. Kushner and G.G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Stochastic Modelling and Applied Probability. Springer New York, 2003. ISBN 9780387008943. URL https://books.google.fr/books?id=_0bIieuUJGkC.


Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. Explainable AI: A review of machine learning interpretability methods. Entropy, 23(1), 2021. ISSN 1099-4300. doi: 10.3390/e23010018. URL https://www.mdpi.com/1099-4300/23/1/18.

Zachary C Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3):31–57, 2018.

Lennart Ljung. System identification. In Signal analysis and prediction, pp. 163–173. Springer, 1998.

Michael Lutter, Christian Ritter, and Jan Peters. Deep Lagrangian networks: Using physics as model prior for deep learning. arXiv preprint arXiv:1907.04490, 2019.

O. Nelles. Nonlinear System Identification: From Classical Approaches to Neural Networks and Fuzzy Models. Engineering online library. Springer, 2001. ISBN 9783540673699. URL https://books.google.fr/books?id=7qHDgwMRqM4C.

Duy Nguyen-Tuong and Jan Peters. Using model knowledge for learning inverse dynamics. In 2010 IEEE international conference on robotics and automation, pp. 2677–2682. IEEE, 2010.

Arya A Pourzanjani, Richard M Jiang, and Linda R Petzold. Improving the identifiability of neural networks for Bayesian inference. In NIPS Workshop on Bayesian Deep Learning, volume 4, pp. 31, 2017.

Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgMkCEtPB.

Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational physics, 378:686–707, 2019.

James B Rawlings. Tutorial overview of model predictive control. IEEE control systems magazine, 20(3):38–52, 2000.

Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.

Shagun Sodhani, Amy Zhang, and Joelle Pineau. Multi-task reinforcement learning with context-based representations. In International Conference on Machine Learning, pp. 9767–9779. PMLR, 2021.

Mark W Spong, Seth Hutchinson, and Mathukumalli Vidyasagar. Robot modeling and control. John Wiley & Sons, 2020.

Russ Tedrake. Underactuated Robotics. 2022. URL https://underactuated.csail.mit.edu.

Nelson Vithayathil Varghese and Qusay H. Mahmoud. A survey of multi-task deep reinforcement learning. Electronics, 9(9), 2020. ISSN 2079-9292. doi: 10.3390/electronics9091363. URL https://www.mdpi.com/2079-9292/9/9/1363.

Haoxiang Wang, Han Zhao, and Bo Li. Bridging multi-task learning and meta-learning: Towards efficient training and effective adaptation. In International conference on machine learning, pp. 10991–11002. PMLR, 2021.

Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Tao Qin, Wang Lu, Yiqiang Chen, Wenjun Zeng, and Philip Yu. Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering, 2022a.

Rui Wang, Robin Walters, and Rose Yu. Meta-learning dynamics forecasting using task inference. Advances in Neural Information Processing Systems, 35:21640–21653, 2022b.


Yiping Wang, Yifang Chen, Kevin Jamieson, and Simon Shaolei Du. Improved active multi-task representation learning via lasso. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 35548–35578. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/wang23b.html.

Christopher Williams, Stefan Klanke, Sethu Vijayakumar, and Kian Chai. Multi-task Gaussian process learning of robot inverse dynamics. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou (eds.), Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2008. URL https://proceedings.neurips.cc/paper_files/paper/2008/file/15d4e891d784977cacbfcbb00c48f133-Paper.pdf.

Yuan Yin, Ibrahim Ayed, Emmanuel de Bézenac, Nicolas Baskiotis, and Patrick Gallinari. LEADS: Learning dynamical systems that generalize across environments. Advances in Neural Information Processing Systems, 34:7561–7573, 2021.

Mohammad Asif Zaman. Numerical solution of the Poisson equation using finite difference matrix operators. Electronics, 11(15), 2022. ISSN 2079-9292. doi: 10.3390/electronics11152365. URL https://www.mdpi.com/2079-9292/11/15/2365.

Luisa Zintgraf, Kyriacos Shiarli, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. Fast context adaptation via meta-learning. In International Conference on Machine Learning, pp. 7693–7702. PMLR, 2019.


Lemma 1. Let $v_1, \dots, v_N$ and $w_1, \dots, w_T \in \mathbb{R}^r$, and let $r' \le r$ and $v'_1, \dots, v'_N$ and $w'_1, \dots, w'_T \in \mathbb{R}^{r'}$ be two sets of vectors of full rank, satisfying $\forall i, t,\; w_t^\top v_i = {w'_t}^\top v'_i$. Then there exist $P, Q \in \mathbb{R}^{r' \times r}$ such that $w'_t = P w_t$ and $v'_i = Q v_i$. Furthermore, $Q P^\top = I_{r'}$.

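Proof of Lemma 1. Denoting by $V \in \mathbb{R}^{N \times r}$, $V' \in \mathbb{R}^{N \times r'}$, $W \in \mathbb{R}^{T \times r}$ and $W' \in \mathbb{R}^{T \times r'}$ the matrix representations of the vectors, the scalar equalities $\forall i, t,\; w_t^\top v_i = {w'_t}^\top v'_i$ take the matrix form

$$V W^\top = V' {W'}^\top. \tag{A.1}$$

Since $V'$ is of full rank, the matrix $V'^{+} := (V'^\top V')^{-1} V'^\top \in \mathbb{R}^{r' \times N}$ is well defined and is a left inverse of $V'$. Multiplying (A.1) by $V'^{+}$ yields

$$W' = W P^\top \quad \text{with} \quad P := V'^{+} V \in \mathbb{R}^{r' \times r}. \tag{A.2}$$

Similarly,

$$V' = V Q^\top \quad \text{with} \quad Q := W'^{+} W \in \mathbb{R}^{r' \times r}. \tag{A.3}$$

Now compute

$$Q P^\top = W'^{+} W P^\top = W'^{+} W' = I_{r'}.$$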
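The construction in the proof can be checked numerically. The following is a minimal NumPy sketch under stated assumptions: the function name `lemma1_check`, the dimensions and the random construction of the factors are all hypothetical, and only the primed sets are generated with full rank, since those are the ones whose left inverses the proof uses.

```python
import numpy as np

def lemma1_check(N=50, T=12, r=5, r_prime=3, seed=0):
    """Build factors with V W^T = V' W'^T and recover P, Q as in the proof."""
    rng = np.random.default_rng(seed)
    # Primed factors: Gaussian matrices have full column rank almost surely.
    V_p = rng.normal(size=(N, r_prime))
    W_p = rng.normal(size=(T, r_prime))
    # Hypothetical construction of the unprimed factors: B C^T = I_{r'}
    # guarantees V W^T = V' B C^T W'^T = V' W'^T.
    B = rng.normal(size=(r_prime, r))
    C = np.linalg.pinv(B).T
    V, W = V_p @ B, W_p @ C
    assert np.allclose(V @ W.T, V_p @ W_p.T)  # matrix form (A.1)
    # Left inverses of the primed factors give P and Q, as in (A.2), (A.3).
    P = np.linalg.pinv(V_p) @ V
    Q = np.linalg.pinv(W_p) @ W
    return V, W, V_p, W_p, P, Q

V, W, V_p, W_p, P, Q = lemma1_check()
print(np.round(Q @ P.T, 6))  # close to the identity of size r'
```

Here `np.linalg.pinv` plays the role of the left inverse $(V'^\top V')^{-1} V'^\top$, which is valid because the primed matrices have full column rank.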

Proof of Proposition 1. Applying Lemma 1 to $v'_i := \nu(x^{(i)})$, $v_i := v(x^{(i)})$, and $w_t := \omega_t$, $w'_t := \varphi_t$ yields the stated result.

The case where $c, \kappa \neq 0$ can be handled as follows. We augment $\varphi$ and $\nu$, and $\omega$ and $v$, with an additional dimension, with the last components of $\varphi$ and $\omega$ equal to 1 and the last components of $\nu$ and $v$ equal to $\kappa$ and $c$ respectively. The augmented vectors satisfy the assumptions of Proposition 1 provided the augmented $v'_i$ and $w'_t$ span $\mathbb{R}^{n+1}$. The proposition then applies, and implies that the physical parameters $\varphi_t$ can be recovered with an affine transform. This case is tackled experimentally in the capacitor experiment (Section 5.3), where $\kappa \neq 0$ a fortiori since the electrostatic field is linearized around a nonzero value. The physical parameters are identified using an affine regression.
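The affine regression mentioned above can be sketched as an ordinary least-squares fit on weights augmented with a constant component. This is an illustration under assumptions, not the authors' implementation: the function names, the synthetic data and the explicit `rcond` cutoff are hypothetical, and the learned task weights are simulated as an exact affine function of the physical parameters.

```python
import numpy as np

def fit_affine_identification(omegas, phis):
    """Least-squares fit of phi ≈ A @ omega + b from T training tasks.

    omegas: (T, r) task weights, phis: (T, n) known physical parameters.
    """
    T = omegas.shape[0]
    X = np.hstack([omegas, np.ones((T, 1))])  # augment with a constant 1
    coef, *_ = np.linalg.lstsq(X, phis, rcond=1e-10)  # shape (r + 1, n)
    return coef[:-1].T, coef[-1]  # A: (n, r), b: (n,)

def identify(A, b, omega):
    """Estimate the physical parameters from adapted weights."""
    return A @ omega + b

# Synthetic check (hypothetical data): weights are affine in the parameters.
rng = np.random.default_rng(1)
n, r, T = 2, 3, 10
M_true, d_true = rng.normal(size=(r, n)), rng.normal(size=r)
phis = rng.uniform(0.0, 1.0, size=(T, n))
omegas = phis @ M_true.T + d_true
A, b = fit_affine_identification(omegas, phis)

phi_new = np.array([0.3, -0.2])
phi_hat = identify(A, b, M_true @ phi_new + d_true)
```

The choice `r = n + 1` mirrors the size of the last layer used in the capacitor experiment. In this noiseless setting the new parameters are recovered exactly, even outside the range seen in training.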

B EXPERIMENTAL DETAILS

B.1 ARCHITECTURES

All neural networks are trained with the ADAM optimizer (Kingma & Ba, 2015). For CoDA, we set $d_\xi = r$, chosen according to the system learned. For all the baselines, the adaptation minimization problem (2.5) is optimized with at least 10 gradient steps, until convergence. For training, the number of inner gradient steps of MAML and ANIL is chosen to be 1, to reduce the computational time. We have also experimented with larger numbers of inner gradient steps. This improved the stability of training, but at the cost of greater training time.

B.2 SYSTEMS

We provide further details about the physical systems on which the experiments of Section 5 are performed.

B.2.1 POINT CHARGES

The $n$ charges are placed at fixed locations in the plane. The training inputs are located in $\Omega = [-1, 1] \times [0, 1]$, which is discretized into a $20 \times 20$ grid, and the ground truth potential field is computed using Coulomb's law. The training data is generated by varying the charge values over $\{1, \dots, 5\}^n$, hence $T = 5^n$. We have experimented on different settings with various numbers of charges, and various locations. In Section 5.1, a dipolar configuration is investigated, where $n = 3$, one of the charges is far away on the left, and the two other charges, of opposite sign, are located near $x_2 = 0$. Gaussian noise of size $\sigma = 0.1$ is added to the field values revealed to the learner in the test dataset. The system is learned with a neural network of 4 hidden layers of width 16, with the last layer of size $r = n$. For evaluation, the test data is generated with random charges drawn from a uniform distribution in $[1, 5]^n$ and the data points are drawn uniformly in $\Omega$.
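The data generation described above can be sketched as follows. This is a minimal illustration under assumptions: the charge locations, the unit physical constants, the use of the $1/d$ Coulomb potential and the small regularizer `eps` are hypothetical choices, and the function names are not taken from the authors' code.

```python
import itertools
import numpy as np

def make_grid(nx=20, ny=20):
    """Discretize Omega = [-1, 1] x [0, 1] into nx * ny input points."""
    x1 = np.linspace(-1.0, 1.0, nx)
    x2 = np.linspace(0.0, 1.0, ny)
    X1, X2 = np.meshgrid(x1, x2, indexing="ij")
    return np.stack([X1.ravel(), X2.ravel()], axis=1)  # (nx * ny, 2)

def potential(points, locations, charges, eps=1e-6):
    """Coulomb-type potential sum_k q_k / |x - x_k| (unit constants assumed)."""
    d = np.linalg.norm(points[:, None, :] - locations[None, :, :], axis=-1)
    return (charges[None, :] / (d + eps)).sum(axis=1)

def training_tasks(n):
    """All charge vectors in {1, ..., 5}^n, i.e. T = 5^n environments."""
    return np.array(list(itertools.product(range(1, 6), repeat=n)), dtype=float)

# Hypothetical locations: one charge far on the left, two near x2 = 0.
locations = np.array([[-2.0, 0.5], [-0.2, -0.1], [0.2, -0.1]])
grid = make_grid()
tasks = training_tasks(n=3)
fields = np.stack([potential(grid, locations, q) for q in tasks])  # (T, 400)
```

The potential is linear in the charge vector, which is the structure that makes this system context-linear.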


B.2.2 CAPACITOR

The space is discretized into a $200 \times 300$ grid. The training environments are generated with 10 values of the physical context $\varphi := (\alpha, \eta) \in [0, 0.5] \times [-0.5, 0.5]$ containing the angular and the positional perturbation of the second plate, drawn uniformly. The ground truth electrostatic field is computed with the Poisson equation solver of Zaman (2022). For evaluation, 5 new environments are drawn with the same distribution. The system is learned with a neural network of 4 hidden layers of width 64, with the last layer of size $r = n + 1 = 3$.

B.2.3 CARTPOLE AND ARM

We have implemented the manipulator equations for the cartpole and the arm (or acrobot), following Tedrake (2022), and have added friction. The training data is generated by actuating the robots with sinusoidal inputs: for each environment, we generate 8 trajectories of 200 points with random initial conditions and periods. At test time, the trajectories are generated with sinusoidal inputs for evaluation, and with swing-up inputs for trajectory tracking.

Cartpole The pole's length is set to 1, and the varying physical parameters are the masses of the cart and of the pole: $\varphi_t \in \{1, 2\} \times \{0.2, 0.5\}$, so $T = 4$. For evaluation, the masses are drawn uniformly around $(2, 0.3)$, with an amplitude of $(1, 0.2)$. The system is learned with a neural network of 3 hidden layers of width 16, with the last layer of size $r = n + 2 = 4$.

Arm The arms' lengths are set to 1, and the varying physical parameters are the inertia and the mass of the second arm: $\varphi_t \in \{0.25, 0.3, 0.4\} \times \{0.9, 1.0, 1.3\}$, so $T = 9$. For evaluation, the inertial parameters are drawn uniformly around $(0.5, 1)$, with an amplitude of $(0.2, 0.3)$. The system is learned with a neural network of 4 hidden layers of width 64, with the last layer of size $r = n + 2 = 4$.

B.2.4 UPKIE

Information about the open-source robot Upkie can be found at https://github.com/tasts-robots/upkie. We trained the meta-learning algorithm on balancing trajectories of 1000 observations, with 10 different values for the mass of Upkie's torso, ranging from 0.5 to 10 kilograms. For evaluation, the mass is sampled in the same interval. The system is learned with a neural network of 4 hidden layers of width 64, with the last layer of size $r = n + 2 = 3$.

B.3 INVERSE DYNAMICS CONTROL

The Euler-Lagrange formulation for the rigid body dynamics has the form

$$M(q)\,\ddot q + C(q, \dot q)\,\dot q + g(q) = B u, \tag{B.1}$$

where $q$ is the generalized coordinate vector, $M$ is the mass matrix, $C$ is the Coriolis force matrix, $g(q)$ is the gravity vector and the matrix $B$ maps the input $u$ into generalized forces (Tedrake, 2022). Inverse dynamics control is a nonlinear control technique that aims at computing the control inputs of a system given a target trajectory $\{\bar q(s)\}$ (Spong et al., 2020). Using a model $\widehat{\mathrm{ID}}$ for the inverse dynamics equation (B.1), the feedforward predicted control signal is $\hat u = \widehat{\mathrm{ID}}(\bar q, \dot{\bar q}, \ddot{\bar q})$. These feedforward control values can then be combined with a low-gain feedback controller to ensure stability, as

$$u = \hat u + K(\bar q - q) + K'(\dot{\bar q} - \dot q). \tag{B.2}$$

For the cartpole, we used $K = K' = 0.5$. For the robot arm, we used $K = K' = 1$.
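The control law above can be written in a few lines. The sketch below is an illustration under assumptions: the function names are hypothetical, the system is taken to be fully actuated so that the left-hand side of (B.1) directly gives the generalized forces, and the numerical values are arbitrary.

```python
import numpy as np

def generalized_forces(M, C, g, dq_ref, ddq_ref):
    """Left-hand side of (B.1) evaluated on the target trajectory."""
    return M @ ddq_ref + C @ dq_ref + g

def control(u_ff, q_ref, q, dq_ref, dq, K=0.5, K_prime=0.5):
    """Feedforward term plus low-gain feedback on position and velocity errors."""
    return u_ff + K * (q_ref - q) + K_prime * (dq_ref - dq)

# Arbitrary two-degree-of-freedom example (hypothetical values).
M = 2.0 * np.eye(2)
C = np.zeros((2, 2))
g = np.array([0.0, 9.81])
u_ff = generalized_forces(M, C, g, dq_ref=np.zeros(2), ddq_ref=np.array([1.0, 0.0]))

# Scalar example of the feedback correction with the cartpole gains.
u = control(1.0, q_ref=2.0, q=1.0, dq_ref=0.5, dq=0.0)
```

In the learned setting, `u_ff` would come from the adapted model rather than from known matrices. When the state matches the target, the feedback terms vanish and the control reduces to the feedforward prediction.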

B.4 ADAPTIVE CONTROL

In a time-varying dynamics scenario, CAMEL can be used for adaptive control and system identiﬁcation. Given a target trajectory, the task-agnostic component v of the model predictions can be computed ofﬂine. In the control loop, the task-speciﬁc component ω is updated with the online least squares formula. The control loop is summarized in Algorithm 2, where we have assumed c = 0 for simplicity. The estimated inertial parameters are deduced from the task-speciﬁc weights with the identiﬁcation matrix (4.4).


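Algorithm 2 Adaptive trajectory tracking

input: trained feature map $v(x)$, target trajectory $s \mapsto \bar q_s$

Offline control
for time step $0 \le s \le H - 1$ do
    compute $\bar x_s = (\bar q_s, \dot{\bar q}_s, \ddot{\bar q}_s)$
    compute features $\bar v_s := v(\bar x_s)$
end for

Control loop
initialize $M_0 = I_r$, $\omega_0 = (0, \dots, 0)$
for time step $1 \le s \le H$ do
    compute $\hat u_s = \omega_s^\top \bar v_s$
    compute $e_s = \bar q_s - q_s$
    play $u_s := \hat u_s + K e_s$
    observe $q_{s+1}, \dot q_{s+1}$
    compute $v_s := v(x_s)$
    update $M_{s+1} = M_s - \dfrac{M_s v_s (M_s v_s)^\top}{1 + v_s^\top M_s v_s}$
    update $\omega_{s+1} = \omega_s - (v_s^\top \omega_s - u_s)\, M_{s+1} v_s$
end for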
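The two update lines of the control loop are a recursive least-squares step. The sketch below isolates them under stated assumptions: the function name `rls_step` is hypothetical, and the features and controls are synthetic and noiseless, generated from a fixed weight vector rather than from a robot.

```python
import numpy as np

def rls_step(M, omega, v, u):
    """One online least-squares update of (M, omega) from a pair (v, u)."""
    Mv = M @ v  # M stays symmetric, so (M v)(M v)^T = M v v^T M
    M_next = M - np.outer(Mv, Mv) / (1.0 + v @ Mv)
    omega_next = omega - (v @ omega - u) * (M_next @ v)
    return M_next, omega_next

# Synthetic stream (hypothetical): controls are linear in the features.
rng = np.random.default_rng(0)
r, H = 4, 200
omega_true = rng.normal(size=r)
features = rng.normal(size=(H, r))
controls = features @ omega_true

M, omega = np.eye(r), np.zeros(r)  # M_0 = I_r, omega_0 = 0
for v, u in zip(features, controls):
    M, omega = rls_step(M, omega, v, u)

# Batch ridge regression with unit regularization, for comparison.
ridge = np.linalg.solve(np.eye(r) + features.T @ features, features.T @ controls)
```

With the initialization $M_0 = I_r$ and $\omega_0 = 0$, the recursion reproduces the batch ridge solution with unit regularization at every step, which is why no gradient steps are needed inside the loop.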


B.5 ADDITIONAL NUMERICAL RESULTS

We provide details concerning Table 2.

Computational time For the computational times of Table 2, we arbitrarily chose the shortest time as the time unit, for a clearer comparison among the baselines. The computational times were measured and averaged over each experiment, with equal batch sizes and numbers of gradient steps across the different architectures. For training, the time was divided by the number of gradient steps.

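Table 3: Adaptation performances with standard deviations.

Charges (30 trials) and Capacitor (5 trials):

| System | Charges, 3-shot | Charges, 10-shot | Capacitor, 5-shot | Capacitor, 40-shot |
|---|---|---|---|---|
| MAML | 4.1E0 ± 2E0 | 1.6E-1 ± 5E-2 | N/A | N/A |
| ANIL | 3.5E0 ± 5E-1 | 9.2E-4 ± 5E-4 | 4.4E-2 ± 2E-2 | 3.6E-2 ± 1E-2 |
| CoDA | 1.0E-1 ± 9E-2 | 8.2E-2 ± 3E-2 | 4.7E-2 ± 5E-5 | 2.6E-2 ± 1E-2 |
| CAMEL | 2.0E-4 ± 1E-4 | 1.0E-4 ± 5E-5 | 3.6E-2 ± 2E-2 | 2.6E-2 ± 1E-2 |

ϕ-CAMEL: 3.0E-3 (Charges), 6.5E-2 (Capacitor).

ε-Capacitor, ε = 0.1 (5 trials):

| System | 3-shot | 30-shot |
|---|---|---|
| MAML | N/A | N/A |
| ANIL | 1.1E-3 ± 5E-5 | 1.1E-3 ± 5E-5 |
| CoDA | 1.2E-3 ± 5E-4 | 1.0E-3 ± 5E-4 |
| CAMEL | 4.2E-4 ± 1E-4 | 1.9E-4 ± 2E-5 |

ϕ-CAMEL: 1.9E-4.

Cartpole (50 trials) and Arm (50 trials):

| System | Cartpole, 50-shot | Cartpole, 100-shot | Arm, 50-shot | Arm, 100-shot |
|---|---|---|---|---|
| MAML | 4.3E0 ± 7E-1 | 3.5E0 ± 6E-1 | 1.0E0 ± 1E-1 | 8.1E-1 ± 5E-2 |
| ANIL | 3.8E-1 ± 1E-1 | 2.5E-2 ± 9E-2 | 8.5E-1 ± 1E-1 | 7.5E-1 ± 4E-2 |
| CoDA | 3.8E-1 ± 9E-3 | 8.1E-1 ± 1E-1 | 9.5E-1 ± 9E-2 | 9.3E-1 ± 6E-2 |
| CAMEL | 4.8E-2 ± 1E-2 | 3.1E-3 ± 5E-4 | 3.1E-1 ± 5E-2 | 2.4E-1 ± 1E-2 |

Upkie (15 trials):

| System | Adaptation error |
|---|---|
| MAML | 1.5E-2 ± 7E-3 |
| ANIL | 1.9E-2 ± 6E-3 |
| CoDA | 2.1E-2 ± 3E-3 |
| CAMEL | 8.2E-3 ± 5E-3 |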


[Figure 6 panels, left to right: target, ANIL, MAML, CAMEL, ϕ-CAMEL. Horizontal axis: x1.]

Figure 6: 5-shot adaptation for the 4 point charge system. Top. The four charges are positive, as in the training meta-dataset. Bottom. Two of the four charges are negative.

Figure 7: Capacitor, 40-shot adaptation.

[Figure 8 plot. Horizontal axis: time (ticks from 200 to 500). Curves: CAMEL, ANIL, CoDA, MAML.]

Figure 8: Upkie torque prediction, 100-shot adaptation.

B.6 ZERO-SHOT ADAPTATION AND SCIENTIFIC DISCOVERY

Zero-shot adaptation Looking at the problem from another angle, Proposition 1 also shows that ω can be estimated linearly as a function of ϕ, at least when r = n (which ensures that P is nonsingular). Computing an estimator of ω as a function of ϕ with the inverse regression to (4.4) enables a zero-shot (or physical parameter-induced) adaptation scenario: when an estimate of the physical parameters of the new environment is known a priori, a value for the model weights can be inferred. We call this adaptation method ϕ-CAMEL. In a data-driven approach, training CAMEL offers not only the ability to adapt from a small number of observations, but also to predict the system without any data for arbitrary values of its parameters.

We believe that the zero-shot adaptation algorithm ϕ-CAMEL can be used in the process of scientific discovery. In many cases, the experimenter knows (or has an estimate of) the physical quantities varying across experimental conditions, while not knowing the system itself accurately. Then, ϕ-CAMEL can be used to infer the target function for chosen values of the physical parameters ϕ, independently of the values observed for training. Of course, the predictions of ϕ-CAMEL are good only if the estimator ˆϕ of (4.4) is good, which requires a sufficient number of training tasks and an effective training of CAMEL. For nonlinear physical contexts, the values of ϕ that are investigated should be close to the reference value ϕ0 so that (4.2) holds.

We further illustrate this on the toy example of n = 4 point charges, for which the experimenter could only observe experiments with positive charges. Figure 6 shows the predictions after 5-shot adaptation of the different meta-models, along with the zero-shot adaptation of ϕ-CAMEL. We can see that only CAMEL and ϕ-CAMEL adapt well to negative charges. In particular, the zero-shot adaptation of ϕ-CAMEL enables estimating the system in an experiment whose numerical values are completely different from the training dataset, thanks to the structure of the model and of the equations in this case (since they are known to be linear in the charges). Importantly, evaluating ϕ-CAMEL for different values of ϕ is not costly, since the identification map is already computed using the training data.

We could imagine that this scenario might enable discovering new properties of complex physical systems by exploring the space of physical parameters, in a data-driven fashion. Regarding the simple example of Figure 6, knowing the form of the electrostatic field in this quadrupole setting underlies the understanding of Penning's ion trap (Kretzschmar, 1991).
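The zero-shot scenario described in this section can be sketched as a linear regression from physical parameters to task weights, followed by a prediction with the shared features. This is a hedged illustration, not the authors' implementation: the function names, the random stand-in for the feature map and the synthetic linear relation between ϕ and ω are all assumptions.

```python
import numpy as np

def fit_weights_from_context(phis, omegas):
    """Least-squares fit of omega ≈ L @ phi from the training tasks."""
    coef, *_ = np.linalg.lstsq(phis, omegas, rcond=None)  # (n, r)
    return coef.T  # L: (r, n)

def zero_shot_predict(L, phi, features):
    """Predict the target function from phi alone: f(x) ≈ (L phi)^T v(x)."""
    return features @ (L @ phi)

# Synthetic setting (hypothetical): r = n = 4, weights linear in the charges.
rng = np.random.default_rng(2)
n = r = 4
P_true = rng.normal(size=(r, n))
phis_train = rng.uniform(1.0, 5.0, size=(20, n))  # positive charges only
omegas_train = phis_train @ P_true.T
L = fit_weights_from_context(phis_train, omegas_train)

features = rng.normal(size=(100, r))  # stand-in for v(x) on 100 inputs
phi_new = np.array([1.0, -1.0, 1.0, -1.0])  # two negative charges
prediction = zero_shot_predict(L, phi_new, features)
```

Because the fitted map is linear, it extrapolates to negative charges even though only positive values appear in the synthetic training set, which mirrors the situation illustrated in Figure 6.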