# ChaCha for Online AutoML

Qingyun Wu, Chi Wang, John Langford, Paul Mineiro, Marco Rossi (Microsoft Research)

**Abstract.** We propose the ChaCha (Champion-Challengers) algorithm for making an online choice of hyperparameters in online learning settings. ChaCha handles the process of determining a champion and scheduling a set of live challengers over time based on sample complexity bounds. It is guaranteed to have sublinear regret after the optimal configuration is added into consideration by an application-dependent oracle based on the champions. Empirically, we show that ChaCha provides good performance across a wide array of datasets when optimizing over featurization and hyperparameter decisions.

## 1. Introduction

**Motivation.** Online learning services (for example, the Personalizer Azure cognitive service (Per, 2019)) have a natural need for "AutoML"-style learning which automatically chooses hyperparameter configurations over some set of possible choices. Well-studied offline AutoML strategies (Bergstra et al., 2015; Feurer et al., 2015; Li et al., 2017; Falkner et al., 2018; Huang et al., 2019; Elsken et al., 2019; Real et al., 2020) do not satisfy several natural constraints imposed by the online setting.

1. In online settings, computational constraints are more sharply bounded. Partly this is for budgetary reasons, and partly this is because any functioning online system must keep up with a constantly growing quantity of data. Hence, we seek approaches which require at most a constant factor more computation and, more generally, allow the online learning algorithm to keep up with the associated data stream.
2. Instead of having a fixed dataset, online learning settings naturally have a specified data source with unbounded, and sometimes very fast, growth. For example, it is natural to have datasets growing in the terabytes/day range. On the other hand, many data sources arrive at much lower volumes, so we naturally seek approaches which can handle vastly different scales of data volume.
3. Instead of evaluating the final quality of the model produced by a learning process, online learning settings evaluate a learning algorithm constantly, implying that an algorithm must be ready to answer queries at all times, and that the relevant performance of the algorithm is its performance at all of those evaluations.

It is not possible to succeed in online AutoML by naively applying an existing offline AutoML algorithm to the data collected from an online source. First, this approach does not address the computational constraint, as direct use of offline AutoML is impractical when the dataset is terascale or above. Operating on subsets of the data would be necessary, but the dramatic potential performance differences between learning algorithms given different dataset sizes suggest that the choice of subset size is critical and data-dependent. Automating that choice is non-trivial in general. Second, this approach does not address the issue of online evaluation, as offline AutoML algorithms are assessed only on the quality of the final configuration produced.
It is also not possible to succeed in online AutoML via naive application of existing regret-minimizing online algorithms (Cesa-Bianchi & Lugosi, 2006): instantiating a no-regret algorithm over all sets of possible configurations is computationally prohibitive. How then can we best design an efficient online automated machine learning algorithm? In the online setting there are no natural points for stopping training, evaluating a configuration, and trying the next configuration. If we keep evaluating a fixed set of configurations, other configurations are denied experience, which could lead to linearly increasing total regret. Exploration is nevertheless delicate, because an unknown quantity of data may be necessary for a configuration to exhibit superior performance. Given this, a simple exploration approach which periodically switches between configurations in a batched epsilon-greedy style may also incur large total regret. What is needed is a method that can allocate the limited computational power (at any time point) to learning models while maintaining good online performance (i.e., low regret) and working despite an unknown required example threshold.

*Figure 1. ChaCha improves online learning: a demonstration (progressive validation loss vs. number of data samples).*

### 1.1. What we do

In Section 2, we define a new online automated machine learning setting with tight computational constraints, consistent with standard online learning settings. A key element of this setting is a multiplicative constant on the amount of computation available, which corresponds to the maximum number of "live" models at any time point. In addition, we assume no model persistence is allowed. The other key element is the availability of a configuration oracle, which takes a configuration as input and outputs a set of possible configurations to try next. This oracle is designed to capture the ability of a domain expert or offline AutoML methods to propose natural alternatives which may yield greater performance. The oracle may propose more configurations than can be simultaneously assessed given the computational budget.

How can we efficiently discover a better configuration while respecting the computational budget? It is critical to allocate computation across the possibilities in a manner which neither starves potentially best possibilities nor wastes computation or experience. The ChaCha algorithm detailed in Section 3 is designed to do this using an amortized computation approach. Two other critical issues exist: How can a better configuration be reliably identified? And, when evaluating multiple (potentially disagreeing) alternatives, how should we make predictions? For the former, we use progressive validation sample complexity bounds, while for the latter we identify a model using these bounds and predict according to it, benefiting from a potentially better configuration before it is proven as such.

In Section 4 we analyze theoretical properties of ChaCha. We prove that as long as the oracle can successfully suggest the optimal configuration, ChaCha always achieves a sublinear regret bound, even in the worst case. In some ideal cases it achieves a regret bound that matches the regret bound of the optimal configuration. This sanity-check analysis provides some reassurance that the algorithm behaves reasonably in many settings.
We test the ChaCha algorithm on a suite of large regression datasets from OpenML (Vanschoren et al., 2014) for two online AutoML tasks. Figure 1 shows a demonstrative result obtained by ChaCha for tuning feature-interaction choices, eclipsing a widely used online learning algorithm. Further experimentation demonstrates that ChaCha is consistently near-best amongst plausible alternatives.

### 1.2. Related work

There are many interesting AutoML systems designed for the offline setting. A few representative examples from academia are Auto-sklearn (Feurer et al., 2015), Hyperopt (Bergstra et al., 2015), Hyperband (Li et al., 2017) and FLAML (Wang et al., 2021b). There are also many end-to-end commercial services such as Amazon AWS SageMaker, DataRobot, Google Cloud AutoML Tables, Microsoft Azure ML AutoML and H2O Driverless AI. Notable hyperparameter optimization methods include: random search (Bergstra & Bengio, 2012; Li et al., 2017), Bayesian optimization (Snoek et al., 2012; Bergstra et al., 2015; Feurer et al., 2015; Falkner et al., 2018), local search (Koch et al., 2018; Wu et al., 2021) and methods that combine local and global search (Eriksson et al., 2019; Wang et al., 2021a). In addition, progressively increasing resources and early stopping (Li et al., 2017; Falkner et al., 2018; Huang et al., 2019) are usually found useful when training is expensive. Despite this extensive study, none of these methods is directly applicable to performing efficient AutoML in the online learning setting.

An incremental data allocation strategy for learner selection, conceptually similar to the adaptive resource scheduling strategy in this work, is proposed in (Sabharwal et al., 2016). However, that method is designed to function only in the offline batch learning setting with a small fixed set of learners. Other research directions that share strong synergy with AutoML in the online learning literature include parameter-free online learning (Chaudhuri et al., 2009; Luo & Schapire, 2015; Orabona & Pál, 2016; Foster et al., 2017; Cutkosky & Boahen, 2017) and online model selection (Sato, 2001; Muthukumar et al., 2019). Many of these works provide theoretical foundations for online model selection in different learning environments, both probabilistic and adversarial. However, most of them only address specific hyperparameters, such as the learning rate. In addition, these works overlook the sharp constraints on computational power encountered in practice. (Dai et al., 2020) views a model in the context of a bigger system, with the goal of actively controlling the data collection process for online experiments, which is different from the goal of this work.

## 2. Learning Setting

Several definitions are used throughout. Examples are drawn i.i.d. from a data space $\mathcal{X} \times \mathcal{Y}$, with $\mathcal{X}$ the input domain and $\mathcal{Y}$ the output domain. A function $f: \mathcal{X} \to \mathcal{Y}$ maps input features to an output prediction. A learning algorithm $\mathcal{A}: \mathcal{X} \times (\mathcal{X} \times \mathcal{Y})^* \to \mathcal{Y}$ maps a set of input features and a dataset to a prediction. A loss function $l: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ defines a loss for any output and prediction. $L_f := \mathbb{E}_{(X,Y)}[l(f(X), Y)]$ denotes the true loss of hypothesis $f$ (under the i.i.d. assumption). $L^*_{\mathcal{F}} := \min_{f \in \mathcal{F}} \mathbb{E}[l(f(X), Y)]$ denotes the best loss achievable using the best fixed choice of function in a function class $\mathcal{F}$. $f^*$ is the best function given loss function $l$ and the data distribution.
We consider the following online learning setting: at each interaction $t$, the learner receives a data sample $X$ from the input domain, and then makes a prediction $\mathcal{A}(X, D_t)$ based on knowledge from the historical data samples $D_t$. After making the prediction, the learner receives the true answer $Y$ from the environment as feedback. Based on the feedback, the learner measures the loss and updates its prediction model by some strategy so as to improve predictive performance on future instances. In such an online learning setting, we want to minimize the cumulative loss $\sum_{t=1}^{T} L_{\mathcal{A}(\cdot, D_t)}$ of the online learner $\mathcal{A}$ over the whole interaction horizon $T$, compared to the loss of the best function $f^*$. The gap between them is called regret and is defined as $R(T) := \sum_{t=1}^{T} \big(L_{\mathcal{A}(\cdot, D_t)} - L_{f^*}\big)$.

As in offline machine learning, hyperparameters are prevalent in online learning. A naive choice of the configuration $c$ may lead to a function class that does not contain the best function. In this case
$$R(T) := \sum_{t=1}^{T}\big(L_{c,t} - L_{f^*}\big) \;\geq\; \sum_{t=1}^{T}\big(L^*_{\mathcal{F}_c} - L_{f^*}\big) \;=\; \Theta(T),$$
i.e., linearly increasing regret is inevitable if the wrong hyperparameter configuration is used. We propose online AutoML to make online choices of configurations drawn from a configuration space $\mathcal{C}$. For a particular configuration $c \in \mathcal{C}$, we denote by $\mathcal{A}_c$ the corresponding online learner instance and by $\mathcal{F}_c$ the corresponding function class for $\mathcal{A}_c$. We denote by $D_{t,c}$ the data samples received by $\mathcal{A}_c$ up to time $t$, and write $L_{c,t} := L_{\mathcal{A}_c(\cdot, D_{t,c})}$. While operating online AutoML, we choose $c_t \in \mathcal{C}$ for prediction at time $t$. Correspondingly, the regret can be rewritten as $R(T) := \sum_{t=1}^{T}\big(L_{c_t,t} - L_{f^*}\big)$.

To account for practical online learning scenarios, we target the online setting where (1) a maximum of $b$ "live" models are allowed to perform online learning at the same time; and (2) no model persistence or offline training is allowed, which means that once we decide to replace a live model with a new one, the replaced model can no longer be retrieved. After each model replacement, the new live model needs to learn from scratch, i.e., $D_{t,c}$ is empty whenever a model is not live.
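The interaction protocol above is simple to state in code. Below is a minimal sketch (not the paper's implementation) of the predict-then-learn loop and its cumulative loss bookkeeping; the `predict`/`learn` interface of `learner` is an illustrative assumption of ours.

```python
from typing import Callable, Iterable, Tuple

def run_online(learner, stream: Iterable[Tuple[dict, float]],
               loss: Callable[[float, float], float]) -> float:
    """Predict-then-learn loop of the online setting; returns cumulative loss.

    Subtracting T * L_{f*} from the returned value corresponds to the regret
    R(T) (up to the gap between expected and empirical losses).
    """
    cumulative = 0.0
    for x_t, y_t in stream:              # at interaction t, observe X_t
        y_hat = learner.predict(x_t)     # predict before the label is revealed
        cumulative += loss(y_hat, y_t)   # incur loss l(Y_hat_t, Y_t)
        learner.learn(x_t, y_t)          # update the model with the feedback
    return cumulative
```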
## 3. ChaCha

Solving the online AutoML problem requires finding a balance between searching over a large number of plausible choices and concentrating the limited computational budget on a few promising choices, such that we do not pay a high learning price (regret). ChaCha (short for Champion-Challengers) is designed to this end with two key ideas: (1) a progressive expansion of the search space according to the online performance of existing configurations; and (2) an amortized scheduling of the limited computational resources to the configurations under consideration. To realize these two ideas, we first categorize the configurations under consideration into one Champion, denoted by $C$, and a set of Challengers, denoted by $S$. Intuitively, the Champion is the best proven configuration at the concerned time point; the remaining candidate configurations are Challengers. ChaCha starts by setting the initial or default configuration, denoted by $c_{\text{init}}$, as the champion, with an empty challenger set, i.e., $S = \emptyset$ initially. As online learning proceeds, it (1) updates the champion when necessary and adds more challengers progressively; and (2) assigns one of the $b$ slots for live models to the champion, and performs amortized scheduling of the challengers over the remaining $b - 1$ slots when the number of challengers is larger than $b - 1$.

Under this framework, in the special case where $b = 1$, our method degenerates to a vanilla online learning algorithm using the default configuration. In the case where $b > 1$, we are able to evaluate more challengers, which gives us a chance to find potentially better configurations. With $b$ live models running, ChaCha at each iteration selects one of them to make the final prediction. We present the framework of ChaCha in Figure 2 and provide a pseudocode description in Algorithm 1.

*Figure 2. ChaCha online AutoML framework. The arrows in black are executed at every iteration of online learning. The arrows in green show the operations triggered once the Champion is updated, and the arrows in blue show the scheduling of live challengers. Each live challenger is represented by a blue rectangle, the length of which represents the assigned resource lease. The shaded area of each rectangle reflects how many samples have been consumed by the corresponding learner.*

### 3.1. Progressive search space construction in ChaCha

**Generating challengers with the champion and ConfigOracle.** ChaCha assumes the availability of a ConfigOracle. When provided with a particular input configuration $c$, the ConfigOracle should produce a candidate configuration set that contains at least one configuration significantly better than the input configuration $c$ each time a new configuration is given to it. Such a ConfigOracle is fairly easy to construct in practice: domain expertise or offline AutoML search algorithms can be leveraged. For example, when the configurations represent feature interaction choices, one typically effective way to construct the oracle is to add pairwise feature interactions as derived features based on the current set of both original and derived features (Luo et al., 2019). With such a ConfigOracle available, we use the Champion as the seed to the ConfigOracle to construct a search space, which is then expanded only when a new Champion is identified.

**A progressive update of the champion using statistical tests.** ChaCha updates the champion when a challenger is proven to be sufficiently better than it. We use statistical tests with sample complexity bounds to assess the true quality of a configuration and promote new champions accordingly. The central idea of the statistical tests is to use sample complexity bounds and the empirical loss to bracket the true performance $L^*_{\mathcal{F}_c}$ of configuration $c$ between a probabilistic lower and upper bound, denoted by $\underline{L}_{c,t}$ and $\overline{L}_{c,t}$ respectively. Since we consider the online learning setting, we use the progressive validation loss (Blum et al., 1999) as the empirical loss metric. For a given configuration $c$, we denote by $L^{PV}_{c,t}$ the progressive validation loss of $\mathcal{A}_c$ on the data sequence $D_{t,c}$:
$$L^{PV}_{c,t} := \frac{1}{|D_{t,c}|} \sum_{(X_i, Y_i) \in D_{t,c}} l\big(\mathcal{A}_c(X_i, D_{t,c}[{:}\,i{-}1]), Y_i\big)$$
Without loss of generality, we assume the learning algorithm $\mathcal{A}$ can ensure that, for any $\delta > 0$, with probability at least $1 - \delta$,
$$\forall t,\quad \big|L^{PV}_{c,t} - L^*_{\mathcal{F}_c}\big| \;\leq\; \mathrm{comp}(\mathcal{F}_c)\,\log(|D_{t,c}|/\delta)\,|D_{t,c}|^{p-1}$$
where $\mathrm{comp}(\mathcal{F}_c)$ denotes a term related to the complexity of the function class $\mathcal{F}_c$, and $0 < p < 1$ characterizes the estimation error's dependency on the number of data samples. For example, one typical result is $p = \frac{1}{2}$ and $\mathrm{comp}(\mathcal{F}) = d$ for $d$-dimensional linear regression functions (Shalev-Shwartz & Ben-David, 2014).
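For concreteness, the progressive validation loss can be maintained incrementally, one update per interaction. The class below is illustrative bookkeeping under the squared loss used in our experiments, not the paper's released code.

```python
class PVLoss:
    """Running progressive validation loss L^{PV}_{c,t} for one live model.

    Each prediction is scored *before* the model trains on that example,
    so no held-out set is needed.
    """
    def __init__(self):
        self.total = 0.0
        self.n = 0                       # n = |D_{t,c}|

    def update(self, y_hat: float, y: float) -> None:
        self.total += (y_hat - y) ** 2   # squared loss; swap in any l(y_hat, y)
        self.n += 1

    @property
    def value(self) -> float:
        return self.total / self.n if self.n else float("inf")
```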
Based on the bound specified above and a union bound, we can obtain a probabilistic error bound $\epsilon_{c,t}$ which holds for all $t$ and for all $c \in S_t$. More formally, with probability at least $1 - \delta$,
$$\forall t,\ \forall c \in S_t,\quad \big|L^{PV}_{c,t} - L^*_{\mathcal{F}_c}\big| \leq \epsilon_{c,t} \quad\text{with}\quad \epsilon_{c,t} := \mathrm{comp}(\mathcal{F}_c)\,\log\!\big(|D_{t,c}|\,|S_t|/\delta\big)\,|D_{t,c}|^{p-1} \quad (1)$$
$\overline{L}_{c,t} := L^{PV}_{c,t} + \epsilon_{c,t}$ and $\underline{L}_{c,t} := L^{PV}_{c,t} - \epsilon_{c,t}$ are thus probabilistic upper and lower bounds on $L^*_{\mathcal{F}_c}$ for all $t \in [T]$, $c \in S_t$. Based on these probabilistic bounds, we define the following two tests to help distinguish which configuration leads to better (true) performance:
$$\text{Better}(c, C, t) := \mathbb{1}\{\overline{L}_{c,t} < \underline{L}_{C,t} - \epsilon_{C,t}\} \quad (2)$$
$$\text{Worse}(c, C, t) := \mathbb{1}\{\underline{L}_{c,t} > \overline{L}_{C,t}\} \quad (3)$$

ChaCha eliminates a challenger from consideration once the Worse test is positive, and promotes a challenger to be the new champion once the Better test is positive (lines 11-15 in Algorithm 1). When a new champion is promoted, a series of subsequent operations is triggered, including (a) an update of ChaCha's champion and (b) a call to the ConfigOracle, which generates a new set of challengers to be considered. These two operations are reflected in line 8 and lines 16-17 of Algorithm 1, and by the green arrows in Figure 2. Note that when testing whether a challenger $c$ should be promoted to the new champion using the Better test in Eq. (2), we require the gap between the lower and upper bounds to be at least $\epsilon_{C,t}$. This ensures that a challenger is promoted only when it is sufficiently better than the old champion, a strategy which avoids repeatedly switching to champions that are only slightly better than the old ones. That situation is undesirable for two reasons: (a) it does not guarantee any lower bound on the loss reduction, so the gap in true loss between the champion and the true best configuration may remain larger than a constant, which causes linearly increasing regret in the worst case; and (b) since new challengers are generated and added into consideration at each promotion, it makes the challenger pool unnecessarily large.

In summary, with the help of the ConfigOracle and the statistical tests, ChaCha maintains a set of challengers that are not yet proven sufficiently better than the champion, but also not significantly worse than it.

```
Algorithm 1: ChaCha
 1: Inputs: initial configuration c_init, budget b
 2: Initialization: C <- c_init; S <- ConfigOracle(C); c_hat <- C; B <- {}
 3: for t = 1, 2, ... do
 4:   Observe X_t
 5:   B <- Schedule(b, B, S)
 6:   Predict Y_hat_t <- A_{c_hat}(X_t, D_{t,c_hat}),
      where c_hat <- argmin_{c in B u {C}} Lbar_{c,t}
 7:   Receive Y_t and incur loss l(Y_hat_t, Y_t)
 8:   for c in B u {C} do
 9:     D_{t+1,c} <- D_{t,c} + (X_t, Y_t) and update A_c(., D_{t+1,c})
10:   C_old <- C
11:   for c in S do
12:     if Better(c, C, t) then
13:       C <- c
14:     if Worse(c, C, t) then
15:       S <- S - c
16:   if C != C_old then
17:     S <- S + ConfigOracle(C)
```
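To make the tests used in lines 11-15 of Algorithm 1 concrete, here is a small sketch of Eq. (1) and the Better/Worse tests of Eqs. (2)-(3). The `Stats` container and parameter names (`comp`, `p`, `delta`) are our own illustrative choices, not names from the paper.

```python
import math
from dataclasses import dataclass

@dataclass
class Stats:
    pv: float    # progressive validation loss L^{PV}_{c,t}
    eps: float   # deviation bound eps_{c,t} from Eq. (1)

def epsilon(comp: float, n: int, pool_size: int, delta: float,
            p: float = 0.5) -> float:
    """eps_{c,t} = comp(F_c) * log(|D_{t,c}| |S_t| / delta) * |D_{t,c}|^(p-1)."""
    if n == 0:
        return float("inf")              # no data yet: vacuous bound
    return comp * math.log(n * pool_size / delta) * n ** (p - 1)

def better(c: Stats, C: Stats) -> bool:
    # Eq. (2): challenger's upper bound beats the champion's lower bound by
    # an extra margin eps_{C,t}, so promotions are "sufficiently better".
    return c.pv + c.eps < (C.pv - C.eps) - C.eps

def worse(c: Stats, C: Stats) -> bool:
    # Eq. (3): challenger's lower bound exceeds the champion's upper bound.
    return c.pv - c.eps > C.pv + C.eps
```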
### 3.2. Live challenger scheduling

Now that we have a set of challengers, if the number of live model slots exceeds the number of challengers (either because we have a large $b$ or because $|S|$ is small), we can evaluate all the challengers simultaneously. Otherwise we need to schedule carefully. The scheduling problem is challenging because (1) no model persistence is allowed in the online learning setting, so frequent swapping of the live challengers is costly in terms of learning experience; and (2) a blind commitment of resources to particular choices may fail because those choices yield poor performance. One principled way to amortize this cost is to use the doubling trick when allocating the sample resource: assign each challenger an initially small lease and successively double the lease over time.

We adopt this amortized resource allocation principle, and additionally take the challengers' empirical performance into account while scheduling. The scheduling step is realized through a Schedule function in ChaCha. Specifically, the Schedule function takes as input the budget $b$, the current live challenger set $B$, and the candidate set $S$, and outputs a new set of live challengers (which can overlap with the input $B$). It is designed to eventually provide any configuration with any necessary threshold of examples required for a regret guarantee. We call this resource threshold a resource lease, denoted by $n_c$. Initially, every configuration is assigned a particular minimum resource lease $n_c = n_{\min}$ (for example, $n_{\min} = 5 \times \#\text{features}$). When a configuration has been trained on $n_c$ examples, i.e., reaches its assigned resource lease (line 6 in Algorithm 2), we double this resource lease (line 7 in Algorithm 2). To avoid starving a challenger under consideration indefinitely, we remove the challenger which just reached its assigned resource lease from the live challenger pool, and add the challenger with the minimum resource lease into the live challenger pool (line 9 and lines 11-13 of Algorithm 2, and Algorithm 3). In addition, to avoid throwing away valuable experience on a promising challenger, we adopt a more conservative exploration: we replace a live challenger which reaches its assigned resource lease only if it is not among the top-performing (according to the loss upper bound) live challengers (line 8 in Algorithm 2). In other words, we use half of the compute resources to exploit the candidates that perform well now, and the other half to explore alternatives that may perform better if given more resources. With the $b$ live models running, at each interaction ChaCha selects one of them to make the prediction (line 6 of Algorithm 1), following the structural risk minimization principle (Vapnik, 2013).

```
Algorithm 2: Schedule(b, B, S)
 1: Notation: n_c denotes the resource lease assigned to c; n_{c,t} denotes
    the resources consumed by A_c by time t, e.g., n_{c,t} = |D_{t,c}|
 2: for c in B do
 3:   if c not in S and |S| > b then
 4:     B <- B - c
 5: for c in B do
 6:   if n_{c,t} >= n_c then
 7:     n_c <- 2 n_c
 8:     if Lbar_{c,t} > median({Lbar_{c',t}}_{c' in B}) and |S| > b then
 9:       B <- B - c
10: while |B| < b - 1 do
11:   c <- Choose(S \ B)
12:   B <- B + c
13:   D_{t+1,c} <- {}   {assuming no persistence of models}
14: return B

Algorithm 3: Choose(S)
 1: C_pending <- {c in S : n_c has not been set}
 2: if |C_pending| != 0 then
 3:   c <- Random(C_pending) and set n_c <- n_min
 4:   return c
 5: return argmin_{c in S} n_c
```
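A compact Python rendering of Algorithms 2-3 may help. Variable names (`lease` for $n_c$, `consumed` for $n_{c,t}$, `upper` for the loss upper bounds) are our own, and this is a sketch under those assumptions rather than the released implementation.

```python
import random

def schedule(b, live, S, lease, consumed, upper, n_min):
    """One call of Schedule(b, B, S): returns the new live challenger list."""
    # Lines 2-4: drop live models whose configuration left the candidate set.
    live = [c for c in live if c in S or len(S) <= b]
    # Lines 5-9: double the lease of any model that reached it; swap it out
    # unless it is among the top-performing half (by loss upper bound).
    for c in list(live):
        if consumed[c] >= lease[c]:
            lease[c] = 2 * lease[c]
            med = sorted(upper[x] for x in live)[len(live) // 2]
            if upper[c] > med and len(S) > b:
                live.remove(c)
    # Lines 10-13 plus Algorithm 3 (Choose): refill the b-1 challenger slots,
    # preferring configurations that never received a lease, then the one
    # with the minimum lease. No model persistence: new models start fresh.
    while len(live) < b - 1:
        rest = [x for x in S if x not in live]
        if not rest:
            break
        pending = [x for x in rest if x not in lease]
        c = random.choice(pending) if pending else min(rest, key=lambda x: lease[x])
        lease.setdefault(c, n_min)
        consumed[c] = 0
        live.append(c)
    return live
```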
## 4. Theoretical Analysis

For convenience of analysis, we split the total time horizon $T$ into $M + 1$ phases, indexed from $0$ to $M$. Phase 0 starts at $t = 1$, and a new phase starts once a new champion is found. We denote by $t_m$ and $t_{m+1} - 1$ the starting and ending time indices of phase $m$ respectively; thus $N_m := t_{m+1} - t_m$ is the length of phase $m$. We denote by $S_m$ the set of candidate configurations and by $C_m$ the champion configuration in phase $m$:
$$\underbrace{1, 2, \ldots, t_1 - 1}_{\text{phase } 0},\ \ldots,\ \underbrace{t_m, \ldots, t_{m+1} - 1}_{\text{phase } m},\ \ldots,\ \underbrace{t_M, \ldots, T}_{\text{phase } M}$$
We denote by $c^*$ the optimal configuration, so that $L_{f^*} = L^*_{\mathcal{F}_{c^*}}$. To make the analysis meaningful, we assume $c^*$ is added to the candidate configuration pool after the ConfigOracle has been called a constant number of times.

**Lemma 1.** With $\epsilon_{c,t}$ set as in Eq. (1) and $0 < \delta < 1$, for all $m \in [M]$:

*Claim 1.* For all $c \in S_m$, if $L^*_{\mathcal{F}_c} - L^*_{\mathcal{F}_{C_m}} + 2\epsilon_{c,t} + 3\epsilon_{C_m,t} < 0$, then with probability at least $1 - \delta$, $c$ can pass the Better test described in Eq. (2) when compared with $C_m$ at time $t$.

*Claim 2.* When the Better test is positive, with probability at least $1 - \delta$, $L^*_{\mathcal{F}_{C_m}} - L^*_{\mathcal{F}_{C_{m+1}}} > \epsilon_{C_m, t_{m+1}}$.

*Claim 3.* If $L^*_{\mathcal{F}_c} < L^*_{\mathcal{F}_{C_m}}$, then with probability at least $1 - \delta$, $c$ will not pass the Worse test when compared with $C_m$.

**Proposition 1.** With a base learning algorithm having a sample-complexity-based error bound of $\mathrm{comp}(\mathcal{F}_c)\log(|D_{t,c}|/\delta)\,|D_{t,c}|^{p-1}$ and $\epsilon_{c,t}$ set as in Eq. (1), with high probability, ChaCha obtains
$$\sum_{m=0}^{M}\sum_{t=t_m}^{t_{m+1}-1}\big(L^*_{\mathcal{F}_{C_m}} - L^*_{\mathcal{F}_{c^*}}\big) = O\Big(\max_{m \in [M]}\tfrac{|S_m|}{b}\,\mathrm{comp}(\mathcal{F}_{c^*})\,T^{p} + \sum_{m=0}^{M}\mathrm{comp}(\mathcal{F}_{C_m})\,N_m^{p}\Big) = O\Big(\max_{m \in [M]}\tfrac{|S_m|}{b}\,\mathrm{comp}(\mathcal{F}_{c^*})\,T^{p} + \max_{m \in [M]}\mathrm{comp}(\mathcal{F}_{C_m})\,T^{\frac{1}{2-p}}\Big) \quad (4)$$

*Proof intuition of Proposition 1.* The proof is mainly based on the first two claims of Lemma 1. Claim 1 ensures that, for every phase $m \in [M]$, the gap in true loss between the champion and the best configuration in the pool is, with high probability, upper bounded in terms of $\epsilon_{c^*,t}$ and $\epsilon_{C_m,t}$. Since $\epsilon_{c,t}$ shrinks as $c$ accumulates data samples, and our scheduling strategy ensures that the optimal configuration $c^*$ receives at least $\frac{b}{4\max_{m \in [M]}|S_m|}T$ data samples, we can obtain an upper bound of $O\big(\max_{m \in [M]}\frac{|S_m|}{b}\mathrm{comp}(\mathcal{F}_{c^*})\,T^{p}\big)$ for the summation of $\epsilon_{c^*,t}$ over $T$ iterations. The summation over $\epsilon_{C_m,t}$, i.e., $\sum_{m=0}^{M}\sum_{t=t_m}^{t_{m+1}-1}\epsilon_{C_m,t}$, is delicate because we must account for the case where the champion is updated frequently, which makes $M$ and $\epsilon_{C_m,t}$ large. Fortunately, we designed the Better test such that we switch to a new champion only when it is sufficiently better than the old one, which ensures an upper bound on the number of switches $M$. Specifically, with Claim 2 of Lemma 1 and the fact that $\sum_{m=0}^{M} N_m = T$, we can prove $M = O\big(T^{\frac{1-p}{2-p}}\big)$. With that we can prove $\sum_{m=0}^{M}\sum_{t=t_m}^{t_{m+1}-1}\epsilon_{C_m,t} = O\big(\sum_{m=0}^{M} N_m^{p}\big) = O\big(T^{\frac{1}{2-p}}\big)$, which contributes the second term of the bound in Eq. (4).

**Remark 1 (Special cases of Proposition 1).** Proposition 1 provides an upper bound on $\sum_{m=0}^{M}\sum_{t=t_m}^{t_{m+1}-1}\big(L^*_{\mathcal{F}_{C_m}} - L^*_{\mathcal{F}_{c^*}}\big)$ in the most general case. It provides guarantees even in the worst cases where the algorithm must pay high switching costs, i.e., with a large number of phases $M$. Such bad cases happen when the algorithm frequently finds new champions among newly added candidates, whose accumulated sample count is still small, so that $|D_{t_{m+1},C_m}|$ can only be lower bounded by 1 for all $m \in [M]$. In the special case where $|D_{t_{m+1},C_m}| \geq z|D_{t_m,C_{m-1}}|$, with $z$ a constant larger than 1 for all $m$, we have $\sum_{m=0}^{M} N_m^{p}\log N_m = O(T^{p}\log T)$, which reduces the upper bound in Proposition 1 to $O(T^{p})$ up to logarithmic factors. In this special case, the bound matches the order of regret (w.r.t. $T$) of using the optimal configuration. Without further assumptions, Proposition 1 is tight: for any small $\eta > 0$, we can construct a sequence $N_m = \Theta(m^{q})$, where
$$q = \frac{1-p}{\frac{1}{2-p} - p - \eta} - 1 = \frac{(1-p)(2-p)}{1 - (2-p)(p+\eta)} - 1 > \frac{2-p}{1-p} - 1 = \frac{1}{1-p}.$$
Such a sequence satisfies Eq. (9) and makes $\sum_{m=0}^{M} N_m^{p}\log N_m = \Omega\big(T^{\frac{1}{2-p}-\eta}\big)$.

**Theorem 1.** Under the same conditions as specified in Proposition 1, ChaCha can achieve the following regret bound:
$$\sum_{t=1}^{T}\big(L_{\hat{c}_t,t} - L^*_{\mathcal{F}_{c^*}}\big) = O\Big(\max_{m \in [M]}\tfrac{|S_m|}{b}\,\mathrm{comp}(\mathcal{F}_{c^*})\,T^{p} + \max_{m \in [M]}\mathrm{comp}(\mathcal{F}_{C_m})\,T^{\frac{1}{2-p}}\Big).$$

*Proof intuition of Theorem 1.* We decompose the regret as
$$\sum_{t=1}^{T}\big(L_{\hat{c}_t,t} - L^*_{\mathcal{F}_{c^*}}\big) = \sum_{m=0}^{M}\sum_{t=t_m}^{t_{m+1}-1}\big(L_{\hat{c}_t,t} - L^*_{\mathcal{F}_{C_m}}\big) + \sum_{m=0}^{M}\sum_{t=t_m}^{t_{m+1}-1}\big(L^*_{\mathcal{F}_{C_m}} - L^*_{\mathcal{F}_{c^*}}\big),$$
whose bound is straightforward to prove once the conclusion of Proposition 1 is obtained.
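As a quick arithmetic check of the $T^{\frac{1}{2-p}}$ term appearing in Proposition 1 and Theorem 1, the following short calculation (our restatement of the proof intuition, via Hölder's inequality and the stated bound on $M$) verifies the exponent:

```latex
\begin{align*}
\sum_{m=0}^{M} N_m^{p}
  &\le (M+1)^{1-p}\Big(\sum_{m=0}^{M} N_m\Big)^{p}
   = (M+1)^{1-p}\,T^{p}
  && \text{(H\"older, using $\textstyle\sum_{m} N_m = T$)} \\
  &= O\Big(T^{\frac{(1-p)^2}{2-p}+p}\Big)
   = O\Big(T^{\frac{1}{2-p}}\Big)
  && \text{with } M = O\big(T^{\frac{1-p}{2-p}}\big),\
     \tfrac{(1-p)^2+p(2-p)}{2-p} = \tfrac{1}{2-p}.
\end{align*}
```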
## 5. Empirical Evaluation

ChaCha is general enough to handle online tuning of various types of hyperparameters of online learning algorithms in a flexible and extensible way. It can be tailored to satisfy customized needs through customized implementations of the ConfigOracle.

**Out-of-the-box ChaCha with a default ConfigOracle.** We provide a default implementation of the ConfigOracle such that ChaCha can be used as an out-of-the-box solution for online AutoML. This default ConfigOracle leverages an existing offline hyperparameter optimization method (Wu et al., 2021) to suggest numerical and categorical hyperparameters. In addition to these two commonly tuned types of hyperparameters, we further extend the ConfigOracle with another important type, whose configuration space can be expressed as a set of polynomial expansions generated from a fixed set of singletons. One typical example of this type of hyperparameter is the choice of feature interactions, or feature crossing, on tabular data (Luo et al., 2019), where raw features (or groups of raw features) are interacted, e.g., through a cross product, to generate an additional set of more informative features. In this example, the raw feature units are the singletons, and each resulting feature set with additional feature interactions can be treated as one polynomial expansion of the original singletons. This featurization-related type of hyperparameter is of particular importance in machine learning practice: it is well recognized that the performance of machine learning methods, in both offline batch learning and online learning, depends greatly on the quality of features (Domingos, 2012; Agarwal et al., 2014). Tuning this type of hyperparameter is notoriously challenging because of the doubly-exponentially large search space: for a problem with $m$ raw singletons, each interaction (up to a maximum order of $m$) involves some subset of the $m$ raw singletons, implying a $\big(\sum_{i=0}^{m}\binom{m}{i} - m - 1 = 2^{m} - m - 1\big)$-dimensional binary vector space defining the set of possible interactions. Since any subset of the polynomial expansions is allowed, there are $2^{2^{m}-m-1}$ possible choices. For this type of hyperparameter, only heuristic search approaches such as beam search (Medress et al., 1977) are available, even in the offline learning setting. Inspired by these offline greedy search approaches and by domain knowledge from online learning practice, the default ConfigOracle in ChaCha handles this type of hyperparameter greedily: given the input configuration, the ConfigOracle generates all configurations that add one second-order interaction to the input configuration. For example, given an input configuration $C = \{e_1, e_2, e_3\}$, ConfigOracle($C$) outputs the set of configurations $\{\{e_1, e_2, e_3, e_1e_2\}, \{e_1, e_2, e_3, e_1e_3\}, \{e_1, e_2, e_3, e_2e_3\}\}$. Under this design, when provided with an input configuration containing $k$ groups of features, the ConfigOracle generates a candidate set of $\frac{k(k-1)}{2}$ configurations.
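A minimal sketch of this default ConfigOracle for interaction tuning is below. The frozenset/tuple encoding of configurations is our own illustrative choice, not the FLAML data structures.

```python
from itertools import combinations

def config_oracle(config: frozenset) -> list:
    """Given a configuration (a set of feature groups and interactions),
    propose every configuration with one additional pairwise interaction.

    With k items in `config`, this yields up to k(k-1)/2 proposals.
    """
    proposals = []
    for a, b in combinations(sorted(config, key=str), 2):
        interaction = (a, b)             # encode the crossed feature group
        if interaction not in config:
            proposals.append(config | {interaction})
    return proposals

# Example: config_oracle(frozenset({"e1", "e2", "e3"})) returns the three
# expansions {e1,e2,e3,e1e2}, {e1,e2,e3,e1e3}, {e1,e2,e3,e2e3}.
```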
### 5.1. AutoML tasks

**Online learning with Vowpal Wabbit.** Our evaluation is performed using Vowpal Wabbit (VW, https://vowpalwabbit.org), an open-source online machine learning library. Users can tune various hyperparameters of the online learning algorithms in VW depending on their needs, for example namespace interactions, the learning rate, and L1 regularization.

**Target hyperparameters to tune.** We perform the evaluation in two tuning scenarios to demonstrate both the effectiveness and the flexibility of ChaCha: tuning namespace interactions (a namespace is essentially a group of features), and tuning both namespace interactions and the learning rate. Automatic tuning of namespace interactions is of significant importance in VW because (a) it can greatly affect performance; (b) the challenge exists no matter which online learning algorithm one uses; and (c) it is not well handled by any of the recently developed parameter-free online learning algorithms. For these two tuning tasks, we use the default implementation of the ConfigOracle described earlier in this section. We use the default configuration in VW as the initial configuration $c_{\text{init}}$: no feature interactions, and a learning rate of 0.5. We use the VW default learning algorithm (a variant of online gradient descent) as the base learner. We perform the main evaluation under the constraint that at most 5 live learners are allowed, i.e., $b = 5$.
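For illustration, the live models in these tasks can be instantiated roughly as follows with the `vowpalwabbit` Python bindings. The exact constructor has changed across releases (`pyvw.vw` in older versions, `vowpalwabbit.Workspace` in newer ones), so treat this as a sketch rather than version-exact usage.

```python
from vowpalwabbit import pyvw  # older releases; newer: vowpalwabbit.Workspace

# c_init: the VW defaults, i.e., no namespace interactions, learning rate 0.5.
c_init = pyvw.vw("--quiet --learning_rate 0.5")
# A challenger proposed by the ConfigOracle: interact namespaces a and b.
challenger = pyvw.vw("--quiet --learning_rate 0.5 -q ab")

example = "2.5 |a price:0.23 sqft:0.25 |b age:0.05"  # label |namespace feats
for model in (c_init, challenger):
    y_hat = model.predict(example)  # progressive validation: predict first,
    model.learn(example)            # then update on the same example
```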
**Baselines and comparators.** No existing AutoML algorithm is designed to handle the online learning scenario, so we compare our method with several natural alternatives.

- **Random:** Run learners built on the initial configuration and $(b-1)$ randomly selected configurations from the first batch of candidate configurations generated by the same ConfigOracle used in ChaCha.
- **Exhaustive:** Exhaust all the configurations generated from ConfigOracle($c_{\text{init}}$), ignoring the computational limit on the number of live models. This should provide better performance than ChaCha on the first batch of configurations generated from the ConfigOracle with the initial configuration, since it removes ChaCha's computational limits. It is nevertheless possible for ChaCha to perform better despite a smaller computational budget, by moving beyond the initial set of configurations.
- **Vanilla:** Run the learner built on the initial configuration $c_{\text{init}}$, without further namespace interactions.

Note that Vanilla and Exhaustive are comparators intended to better contextualize the performance of the proposed method, rather than baselines operating in the context of online AutoML with the same computational constraints.

**Datasets.** We evaluate our method on a set of large-scale (10K to 1M instances) regression datasets from OpenML (40 in total). On these datasets, we group the raw features into up to 10 namespaces. Among the 40 large-scale OpenML datasets, 25 do not allow performance superior to Vanilla amongst the configurations given by the ConfigOracle. We therefore focus on the 15 meaningful datasets where Exhaustive is better than Vanilla.

### 5.2. Results

**Performance on OpenML datasets.**

*Figure 3. Real-time progressive validation loss over time on (a) a small dataset (#41506) and (b) a large dataset (#5648). The shaded area shows the standard deviation of the loss over the 5 runs. The suffixes :NI and :NI+LR on ChaCha and Random denote the tuning tasks of namespace interactions, and of both namespace interactions and learning rate, respectively. This legend syntax applies to all figures in this paper.*

Figure 3 shows typical results in terms of the progressive validation loss (mean squared error) on two of the OpenML datasets. ChaCha is able to outperform all the competitors except Exhaustive. To show aggregated results over all the OpenML datasets, we normalize the progressive validation loss of a particular algorithm using the following formula:
$$\text{Score}(\text{alg}) := \frac{L^{PV}(\text{Vanilla}) - L^{PV}(\text{alg})}{L^{PV}(\text{Vanilla}) - L^{PV}(\text{Exhaustive})}$$
By this definition, Score(Vanilla) = 0 and Score(Exhaustive) = 1. The score is undefined when $L^{PV}(\text{Exhaustive}) = L^{PV}(\text{Vanilla})$, so we only report results on the meaningful datasets where this difference is nonzero. As the serving order of configurations from the candidate configuration pool is randomized, for all experiments we run each method 5 times with different random seeds (except Vanilla and Exhaustive, as they are not affected by the random seed) and report aggregated results. Figures 4(a) and 4(b) show the final normalized scores for the two tuning tasks on the 15 datasets after running for up to 100K data samples (or the maximum number of data samples if it is smaller than 100K). We include the final normalized scores on datasets with more than 100K samples in the appendix. The results in Figure 4 show that ChaCha performs significantly better than Random on over half of the 15 meaningful datasets.

**Analysis and ablation.** We now investigate several important components of ChaCha to better understand its effectiveness.

(1) *Search space construction in ChaCha.* ChaCha uses the champion and the ConfigOracle to gradually expand the search space, which leads to a provable improvement in the overall quality of configurations while keeping the size manageable. Random and Exhaustive take advantage of the ConfigOracle in a straightforward way: they consider the configurations generated by the ConfigOracle from the initial configuration. Random works reasonably well if the configuration pool contains many configurations better than the initial configuration; however, as expected, it has very large variance. In addition, it is not able to improve beyond Exhaustive. The fact that ChaCha in many cases outperforms Exhaustive indicates that there is indeed a need to expand the search space. In addition to the performance boost over Exhaustive on almost half of the datasets, ChaCha shows almost no performance degradation compared to either Vanilla or Random.

(2) *The scheduling of live models.* Another important component of ChaCha is the scheduling of live models. ChaCha always keeps the champion live and schedules $b - 1$ live challengers in an amortized manner. When scheduling the challengers, to avoid throwing away useful experience due to live challenger swapping, ChaCha keeps the top-performing half of the live challengers running.

*Figure 4. Normalized scores on OpenML datasets for the two tuning tasks ((a) namespace interactions; (b) namespace interactions and learning rate), with error bars showing the standard deviation over the 5 runs.*

*Figure 5. Ablations of ChaCha in namespace interactions tuning (normalized score).*

Figure 5 shows results for two variants of ChaCha (for the tuning of namespace interactions) without these two designs.
ChaCha-AggressiveScheduling denotes the variant of ChaCha in which scheduling is done purely based on the doubling resource lease, without regard for the challengers' empirical performance. ChaCha-w/o-Champion is a variant of ChaCha where we additionally do not intentionally keep the champion running. The poor results of ChaCha-w/o-Champion show the necessity of keeping a learner with a sufficiently large number of training samples live. The comparison between ChaCha-AggressiveScheduling and ChaCha shows the benefit of balancing the long-term and short-term utility of configuration exploration. We include more evaluation details, the settings of ChaCha, and additional results in the appendix.

## 6. Conclusion

In this work, we propose ChaCha, a novel solution for online AutoML. ChaCha is a first-of-its-kind AutoML solution that can operate in an online learning manner: it respects sharp computational constraints and optimizes for good online learning performance, i.e., low cumulative regret. The sharp computational constraint and the need for learning solutions with a good guarantee on the cumulative regret are unique properties of online learning and are not addressed by any existing AutoML solution. ChaCha is theoretically sound and shows robustly good performance in practice. ChaCha provides a flexible and extensible framework for online AutoML, which opens up many promising directions for future work; for example, it is worth studying how to make ChaCha handle online learning scenarios with bandit feedback.

**Software and Data.** Our method is open-sourced in the AutoML library FLAML (https://github.com/microsoft/FLAML/tree/main/flaml/onlineml). A demonstration of usage can be found in this notebook: https://github.com/microsoft/FLAML/blob/main/notebook/flaml_autovw.ipynb. All datasets are publicly available on OpenML (https://www.openml.org/search?type=data).

**Acknowledgements.** The authors would like to thank Akshay Krishnamurthy for discussions about this work and the anonymous reviewers for their revision suggestions.

## References

Personalizer Azure cognitive service, 2019.

Agarwal, A., Beygelzimer, A., Hsu, D. J., Langford, J., and Telgarsky, M. J. Scalable non-linear learning with adaptive polynomial expansions. In Advances in Neural Information Processing Systems, 2014.

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281-305, February 2012.

Bergstra, J., Komer, B., Eliasmith, C., Yamins, D., and Cox, D. D. Hyperopt: a Python library for model selection and hyperparameter optimization. Computational Science & Discovery, 8(1):014008, July 2015.

Blum, A., Kalai, A., and Langford, J. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory (COLT), 1999.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006. ISBN 978-0-521-84108-5.

Chaudhuri, K., Freund, Y., and Hsu, D. A parameter-free hedging algorithm. arXiv preprint arXiv:0903.2851, 2009.

Cutkosky, A. and Boahen, K. A. Stochastic and adversarial online learning without hyperparameters. In Advances in Neural Information Processing Systems, 2017.

Dai, Z., Chandar, P., Fazelnia, G., Carterette, B., and Lalmas, M. Model selection for production system via automated online experiments. In Advances in Neural Information Processing Systems, 2020.

Domingos, P. A few useful things to know about machine learning.
Communications of the ACM, 55(10):78-87, 2012.

Elsken, T., Metzen, J. H., and Hutter, F. Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1-21, 2019.

Eriksson, D., Pearce, M., Gardner, J., Turner, R. D., and Poloczek, M. Scalable global optimization via local Bayesian optimization. In Advances in Neural Information Processing Systems, 2019.

Falkner, S., Klein, A., and Hutter, F. BOHB: Robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning (ICML), 2018.

Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, 2015.

Foster, D. J., Kale, S., Mohri, M., and Sridharan, K. Parameter-free online learning via model selection. In Advances in Neural Information Processing Systems, 2017.

Huang, S., Wang, C., Ding, B., and Chaudhuri, S. Efficient identification of approximate best configuration of training in large datasets. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.

Koch, P., Golovidov, O., Gardner, S., Wujek, B., Griffin, J., and Xu, Y. Autotune: A derivative-free optimization framework for hyperparameter tuning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. In The International Conference on Learning Representations (ICLR), 2017.

Luo, H. and Schapire, R. E. Achieving all with no parameters: AdaNormalHedge. In Conference on Learning Theory, 2015.

Luo, Y., Wang, M., Zhou, H., Yao, Q., Tu, W.-W., Chen, Y., Dai, W., and Yang, Q. AutoCross: Automatic feature crossing for tabular data in real-world applications. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.

Medress, M. F., Cooper, F. S., Forgie, J. W., Green, C., Klatt, D. H., O'Malley, M. H., Neuburg, E. P., Newell, A., Reddy, D., Ritea, B., et al. Speech understanding systems: Report of a steering committee. Artificial Intelligence, 9(3):307-316, 1977.

Muthukumar, V., Ray, M., Sahai, A., and Bartlett, P. Best of many worlds: Robust model selection for online supervised learning. In The 22nd International Conference on Artificial Intelligence and Statistics, 2019.

Orabona, F. and Pál, D. Coin betting and parameter-free online learning. arXiv preprint arXiv:1602.04128, 2016.

Real, E., Liang, C., So, D., and Le, Q. AutoML-Zero: Evolving machine learning algorithms from scratch. In International Conference on Machine Learning (ICML), 2020.

Sabharwal, A., Samulowitz, H., and Tesauro, G. Selecting near-optimal learners via incremental data allocation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.

Sato, M.-A. Online model selection based on the variational Bayes. Neural Computation, 13(7):1649-1681, 2001.

Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, USA, 2014. ISBN 1107057132.

Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, 2012.

Vanschoren, J., van Rijn, J. N., Bischl, B., and Torgo, L. OpenML: Networked science in machine learning. SIGKDD Explorations Newsletter, 15(2):49-60, June 2014.

Vapnik, V.
The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.

Wang, C., Wu, Q., Huang, S., and Saied, A. Economical hyperparameter optimization with blended search strategy. In The International Conference on Learning Representations (ICLR), 2021a.

Wang, C., Wu, Q., Weimer, M., and Zhu, E. FLAML: A fast and lightweight AutoML library. In The Conference on Machine Learning and Systems (MLSys), 2021b.

Wu, Q., Wang, C., and Huang, S. Frugal optimization for cost-related hyperparameters. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.