# learning_with_adaptive_resource_allocation__d383cc87.pdf Learning with Adaptive Resource Allocation Jing Wang 1 2 Miao Yu 1 2 Peng Zhao 1 2 Zhi-Hua Zhou 1 2 The study of machine learning under limited resources has gathered increasing attention, considering improving the learning efficiency and effectiveness with budgeted resources. However, previous efforts mainly focus on single learning task, and a common resource-limited scenario is less explored: to handle multiple time-constrained learning tasks concurrently with budgeted computational resources. In this paper, we point out that this is a very challenging task because it demands the learner to be concerned about not only the progress of the learning tasks but also the coordinative allocation of computational resources. We present the Learning with Adaptive Resource Allocation (LARA) approach, which comprises an efficient online estimator for learning progress prediction, an adaptive search method for computational resource allocation, and a balancing strategy for alleviating prediction-allocation compounding errors. Empirical studies validate the effectiveness of our proposed approach. 1. Introduction The impact of limited computational resources on machine learning (ML) is of increasing attention due to its significant influence on the efficiency and effectiveness of the model training process in various real-world scenarios. This is particularly noticeable in situations involving large models, such as large language models (Achiam et al., 2023), which always demand the use of thousands of GPUs for several months to achieve effective training, as well as the situation of tiny training devices such as certain Io T devices or microcontrollers (Lin et al., 2023), which are incapable of supporting even regular sized models. All these situations highlight the diverse challenges posed by computational resource limitations in different contexts of learning tasks. 1National Key Laboratory for Novel Software Technology, Nanjing University, China 2School of Artificial Intelligence, Nanjing University, China. Correspondence to: Zhi-Hua Zhou . Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). Computational facility ... ... Upload time Deadline time Figure 1. Multiple users upload diverse learning tasks to a shared computational facility, all intended to be completed before userdefined deadlines. Due to computational resource limitation, we aim to complete as many tasks as possible within these constraints. In order to improve training efficiency and effectiveness with limited computational resources, various significant strides have been made in prior research, which include: improving resource usage efficiency (Dean et al., 2012; Li et al., 2014), reducing model size (Frankle & Carbin, 2019; Mirzadeh et al., 2020), designing memory-efficient optimization algorithms (Anil et al., 2019; Hu et al., 2022), etc. Notably, previous efforts mainly focus on the computational resource issue of single learning task. However, a common resource limitation scenario has received less attention: multiple time-constrained learning tasks competing for budgeted computational resources, as illustrated in Figure 1. Consider a practical example of a quantitative finance startup, where there can be multiple analysts or traders who aim to train many different learning models to help forecast the market or manage risks, and these model training tasks require a large amount of computational resources. Given their limited budgets, these startups often face challenges in acquiring advanced computational resources necessary for concurrently handling everyone s computational needs. This limitation typically results in a first-come, first-served basis in resource allocation. Such an approach can be highly inefficient, particularly when earlier uploaded tasks use resources poorly, which not only leads to computational power wastage but also causes delays in executing other critical tasks. In the dynamic and fast-paced domain of finance, such delays can incur significant costs, such as missed opportunities or delayed responses to emerging risks. A similar challenge arises when a user purchases or leases cloud computational services. As the user always needs to execute multiple tasks concurrently, each corresponding to different model structures or hyperparameter tunings, these Learning with Adaptive Resource Allocation computational resources are often incapable of supporting parallel training for these learning tasks. Consequently, the user has to adopt a sequential approach for model training. To this end, Zhou (2023) advocated that the concept of timesharing should be introduced to machine learning. On one hand, this will enable us to take into account the computational resource constraints in real-world issues mentioned above and recognize that the performance of machine learning depends not only on the amount of data received but also on the computational resources available to process it. On the other hand, current intelligent supercomputing facilities generally operate in an exclusive manner, allocating a fixed amount of resources to each user, which can either be insufficient or excessive, leading to inefficiencies. Such a working style can be improved if time-sharing concept is taken into account. For this purpose, Zhou (2023) formally proposed the Computational Resource Efficient Learning (Co RE-Learning) paradigm where the key is to consider the influence of resource scheduling during learning process, aiming to complete as many tasks on time as possible within resource limits. In this paper, we present the first practical Co RE-Learning approach. We identify that the key challenge of Co RE-Learning lies in the coupling of learning progress and resource allocation: the learning progress of each task guides resource allocation, while resource allocation affects the learning progress of each task. To tackle the challenge of Co RE-Learning problem, we introduce the Learning with Adaptive Resource Allocation (LARA) approach, which periodically predicts each task s resource requirement and allocates resources accordingly. LARA consists of three components: (i) Resource prediction: this component adopts an efficient online estimator to fit the learning progress curve from historical data and predicts the resources needed for each task to meet success criteria; (ii) Resource allocation: based on each task s resource prediction, this component models the resource allocation problem and employs an adaptive searching method guaranteed to find the optimal solution; (iii) Prediction-allocation balancing: this component employs an exploration strategy to reduce compounding error caused by the prediction inaccuracies and their cascading effects. Experimental results validate the effectiveness of our LARA approach, demonstrating its capacity to significantly improve resource utilization efficiency in the context of Co RE-Learning. Notation. In this paper, we let {ak}K k=1 {a1, ..., a K}; [T] {1, 2, ..., T}; [a, b] {a + 1, ..., b}; I( ) is indicator function; |A| denotes the cardinality of set A. 2. Problem Formulation and Key Challenge In this section, we provide the problem formulation of Co RE-Learning (Zhou, 2023) and discuss its key challenge. task bundle N2 = 12 N3 = 18 thread 1 thread 2 thread 3 η1,t = η2,t = η3,t = 1/3 unit time t Figure 2. An example of uniform allocation within unit time t. 2.1. Problem Formulation According to Zhou (2023), we consider a task bundle {Tk}K k=1 during the time period [T]. The k-th thread Tk represents a machine learning training task uploaded by the user, which is defined by: the task beginning time bk and the user-defined deadline time dk such that 1 bk < dk T, signifying the time constraint of k-th thread; the data budget Nk represents the upper limit of the data amount that k-th thread can learn within any unit time t [T], reflecting the computational resource capability. At unit time t, the learner observes a set of active threads At = {k | bk t dk, k [K]}. The budget computational resources cannot simultaneously support all threads k At to meet their maximum data capabilities Nk, so the learner needs to select the data throughput {ηk,t}k At such that k-th thread can learn ηk,t Nk data within unit time t. Figure 2 provides an example of allocation for three threads. The computational budgets bring constraints: k At, ηk,t 0; P k At ηk,t 1. Then, the learner observes the new training loss value ℓk(s) of k-th thread, where s represents the amount of accumulated usage data. The success criteria for k-th thread is that its training loss reaches a user-defined threshold within the time constraint [bk 1, dk]: ℓk(Pdk t=bk ηk,t Nk) ϵk. The learner s goal is to maximize the number of tasks that meet their success criterion within the time constraints. 2.2. Key Challenge We can model the Co RE-learning problem for task bundle {Tk}K k=1 as solving the following optimization problem max {ηk,t}k [K],t [T ] t=bk ηk,t Nk s.t. t [T], X k At ηk,t 1, k [K], t [T], ηk,t 0. The key challenge of Problem (2.1) arises from the coupling of learning progress and resource allocation. Specifically, the loss function ℓk( ) is unknown, which requires esti- Learning with Adaptive Resource Allocation mating ℓk based on early data and solving the allocation problem with the extrapolated function bℓk( ). The predictive error in bℓk( ) directly impacts the effectiveness of resource allocation. Additionally, even with a known ℓk( ), solving Problem (2.1) remains difficult. The non-convex and non-continuous nature of the objective function and the extensive number of parameters make classic methods like dynamic programming computationally intensive and timeconsuming, potentially affecting the learning process. One may also consider using reinforcement learning techniques to deal with the learning-allocation coupling issue, similar to the exploration-exploitation dilemma. However, these techniques are not directly applicable to Co RE-Learning due to the need for efficient, high-frequency re-estimation and reallocation to minimize the impact on learning tasks. 3. Our Approach In this section, we present Learning with Adaptive Resource Allocation (LARA), an innovative approach developed to manage budgeted computational resources in the context of Co RE-Learning. LARA comprises three components: resource prediction, resource allocation, and predictionallocation balancing. At each unit time, LARA predicts the data amount required for each thread to succeed and then allocates resources based on these predictions. The balancing component is vital in minimizing compounded errors from prediction and allocation during training. The following sections provide detailed discussions: resource prediction in Section 3.1, resource allocation in Section 3.2, and prediction-allocation balancing in Section 3.3. 3.1. Resource Prediction In this part, we introduce our resource prediction method. Briefly, this process involves analyzing the previously observed training loss values and the corresponding cumulative training data volume. By applying regression to fit the training loss curve, LARA extrapolates the fitted loss curve to predict when it is likely to fall below the success threshold. Many existing studies indicate that the training loss curves of various models generally follow the form of a negative power function (Hestness et al., 2017; Kaplan et al., 2020). Leveraging this insight, our approach utilizes a negative power function to fit the training loss curve. Specifically, for the k-th thread Tk, we model this relationship between the loss and the training data amount as ℓk(s) = aks bk + ξk,s, (3.1) where ℓk(s) denotes the training loss value of the cumulative data amount s for the k-th thread, ak and bk are unknown positive constants, and ξk,s represents noise. To facilitate a more efficient regression, we transform it into a linear form by taking ln( ) on both sides of Eq. (3.1): rk,s = X s θk + ξ k,s, (3.2) where rk,s ln ℓk(s), Xs [ln s; 1], θk [ bk; ln ak] and ξ k,s is the surrogate noise. It is critical to note that LARA needs to use early samples from the training loss function to estimate the true trajectory of the loss function curve. Then, we extrapolate this curve to predict the future data amount required for reducing the loss function below a specific success threshold. This extrapolation process requires paying more attention on minimizing future errors in the function estimation. To enhance accuracy, recent data is given priority, as it more accurately reflects upcoming trends. Consequently, we adopt a weighted regularized least squares method (Guo et al., 1993), which gives increased weight to recent data points while reducing the influence of older ones. For the k-th thread Tk, after observing n data pairs {si, ℓk(si)}n i=1, the estimator bθk,n is obtained by solving the following problem: i=1 γn i(X siθ rk,si)2, (3.3) where γ (0, 1) is the discounted factor, measuring the degree of discounting to early data. Problem (3.3) admits a closed-form solution bθk,n = V 1 n i=1 γn irk,si Xsi where Vn Pn i=1 γn i Xsi X si is the covariance matrix. Moreover, we can reformulate the solution (3.4) into an online update format (Haykin, 2002, Chapter 10.3) bθk,n = bθk,n 1 + V 1 n Xsn(rk,sn X sn bθk,n 1) V 1 n 1 V 1 n 1Xsn X sn V 1 n 1 γ + X sn V 1 n 1Xsn This new format does not require matrix inversion and is one-pass, in the sense that it processes each data only once, hence eliminating the need to store historical data and significantly enhancing the efficiency of the estimation. Upon estimating the unknown parameters by Eq. (3.5) as bθk,n = [ bbk,n, ln bak,n], we derive the estimated negative power function bℓk(s) = bak,ns bbk,n. By applying the success criterion bℓk(s) ϵk, we can compute the data volume needed for the k-th thread to achieve its success criterion as s = exp (1/bbk,n ln bak,n/ϵk). By subtracting the already used data amount sn, we obtain the remaining data requirement 1 bbk,n ln bak,n Learning with Adaptive Resource Allocation Remark 1. Recent studies on curve extrapolation and scaling laws for various models have yielded notable advancements. For instance, Domhan et al. (2015) employ an ensemble of diverse structural models, leveraging their combined strengths to approximate the loss curve more accurately. Alabdulmohsin et al. (2022) suggest using an extrapolation loss instead of the traditional interpolation loss for better prediction accuracy. Additionally, Shen & Meinshausen (2023) introduce engression as an advanced extrapolation technique. These approaches have demonstrated improved performance in curve extrapolation tasks. However, these advanced techniques typically require substantial time and computational resources for managing multiple models or optimizing the extrapolation or engression losses. Such extensive resource requirements are infeasible in the context of Co RE-Learning, where an efficient estimation algorithm is essential for frequent updates. 3.2. Resource Allocation At time τ, let Aτ denote the current active thread set and Kτ |Aτ|. We reorder and re-label the threads in Aτ such that their deadlines satisfying d0 d1 d2 . . . d Kτ where d0 τ 1. Let Sk denotes the data amount still required for k-th thread to succeed, which can be estimated by estimator (3.6), and the success criteria becomes Pdk t=d0+1 ηk,t Nk Sk. The learner s goal at time τ is to maximize the number of successful threads within the active thread set Aτ, which can be formulated as max {ηk,t}k Aτ ,t [d0,d Kτ ] t=d0+1 ηk,t Nk Sk s.t. t [d0, d Kτ ], X k Aτ ηk,t 1, k Aτ, t [d0, d Kτ ], ηk,t 0. (3.7) The indicator function introduces non-convexity to Problem (3.7). A typical solution might be convex relaxation followed by gradient descent. Here instead, we identify that the nature of non-continuous allows for both dimensionality reduction and discretization of the solution space, while guaranteeing that the optimal solution is encompassed within this space. This advantage allows us to develop efficient search methods to effectively locate optimal solutions. Parameter Reduction. The following lemma presents an optimal solution representation that reduces the number of parameters from P k Aτ (dk d0) to merely Kτ. Lemma 1. At time τ, there exists a set of time-invariant parameters {eηk}k Aτ , where k Aτ, 0 eηk 1, such that the optimal solution {η k,t}k Aτ ,t [d0,d Kτ ] to the allocation Problem (3.7) can be characterized by {eηk}k Aτ . Specifically, for each time interval [dj 1, dj], j [Kτ], η3,t = !η3 η3,t = (1 !η1)(1 !η2)!η3 η3,t = (1 !η2)!η3 η2,t = (1 !η1)!η2 η2,t = !η2 d1 d2 d3 d0 Figure 3. An example of parameter reduction for three threads. ( eηj, k = j, Qk 1 i=j (1 eηi)eηk, j < k Kτ. (3.8) Figure 3 provides an example to illustrate Eq. (3.8). Lemma 1 shows that we only needs to determine the set {eηk}Kτ k=1 to find the optimal solution of Problem (3.7). Adaptive Binary Tree Search. After parameter reduction, the cumulative data amount Pdk t=d0+1 ηk,t Nk allocated to thread k can be expressed as Dk(eηk) Nk, where we define D1(eη1) 1eη1, and k 2, i=j (1 eηi) j + k where j dj dj 1 is the time interval width. A critical insight is that for each thread, resource allocation has only two outcomes: success (i.e. Dk(eηk) Nk Sk) or failure. Therefore, eηk can take only two possible values: a value that ensures the thread s success or 0. Starting with eη1, the decision is either to allocate no resources (eη1 = 0) or just enough to meet its requirement (eη1 = S1/N1 1). Once eη1 is set, a similar decision is made for eη2. This process is then iteratively applied to each subsequent thread, such that when eη1, . . . , eηk 1 are determined, we can set eηk accordingly. Based on this allocation method, we can construct a complete binary tree of Kτ layers, such that at each node in the k-th layer, we make the following allocation ( 0, if αk > 1, αk, otherwise, (3.10) where αk Sk/Nk( Pk 1 j=1 Qk 1 i=j (1 eηi) j+ k). The binary tree with 2Kτ leaf nodes represents all possible allocation methods for {eηk}Kτ k=1, and our goal is to find the optimal path within this tree. However, a full tree traversal would have an impractical time complexity of O(2Kτ ). To this end, more efficient search methods are required. We find that prioritizing threads based on earlier deadline times dk or a lower ratio of required resources to work Sk/Nk can have a better effect. With this strategy in mind, we have developed an adaptive search algorithm for the binary tree. This algorithm allocates resources preferentially to tasks with earlier deadlines and smaller resource requirements. Learning with Adaptive Resource Allocation Algorithm 1 Adaptive Binary Tree Search Input: active thread set A, current time τ, required resource Sk, candidate thread set C = 1: Set d0 = τ 1, k A, set eηk = 0 2: Sort and re-label threads in A such that d0 ... d|A| 3: for k = 1, 2, ..., |A| do 4: Add k into C 5: if Sk/Nk dk d0 P i C Si/Ni+Sk/Nk > 1 then 6: Choose i = arg maxi C Si Ni , remove i from C 7: end if 8: end for 9: for k C do 10: Set eηk = Sk/Nk dk d0 P i C,i k Si/Ni+Sk/Nk 11: end for 12: return {eηk}k A Our algorithm maintains a set C to keep track of candidate threads where each k C has a non-zero eηk, then the resource allocation rate αk in the allocation condition (3.10) can be recalculated as αk = Sk/Nk dk d0 P i C,i k Si/Ni+Sk/Nk . Initially, C is empty. Starting from k = 1 and progressing to Kτ, the algorithm first adds thread k to C and if resource allocation to the k-th thread breaches the constraints (αk > 1), it then identifies and removes the most resource-demanding thread from C, determined by i = arg maxi C Si/Ni. Once all Kτ threads have been considered, the algorithm assigns eηk = 0 for k / C and eηk = αk for all k C. The overall algorithm is summarized in Algorithm 1. Algorithm 1 significantly reduces the time complexity of binary tree search from O(2Kτ ) to O(Kτ). Additionally, the subsequent theorem demonstrates that Algorithm 1 effectively attains the optimal solution for the Problem (3.7). Theorem 1 (Optimality). For the allocation Problem (3.7), the results {eηk}Kτ k=1 generated by Algorithm 1 can be used to construct the optimal resource allocation {η k,t}k Aτ ,t [d0,d Kτ ], as outlined in Eq. (3.8). 3.3. Prediction-Allocation Balancing In previous sections, we introduced a method to predict the resource requirements for each task to complete each task based on their learning progress. This prediction is vital for resource allocation, which in turn influences the learning progress and future predictions. Initially, our dataset of training loss observations for each thread is limited, leading to significant uncertainty in predictions. Early allocation based on these imprecise predictions could introduce compounding errors, potentially misdirecting resources from feasible tasks to less feasible ones. We note that it is important to address this issue early in the process of regular prediction and allocation. Otherwise, such compounded errors might Algorithm 2 Learning with Adaptive Resource Allocation Input: Task bundle {Tk}K k=1, exploration period Hk 1: Set exploration set Et = , active thread set At = 2: for t = 1, 2, ..., T do 3: Update active thread set At and exploration set Et 4: Calculate resource estimation b Sk by Eq. (3.6) 5: if Et = then 6: for all k Et do 7: Allocate ηk,t = 1/|Et| to thread k 8: end for 9: else 10: Run Algo. 1 with At, t, b Sk and return {eηk}k At 11: Set η1,t = eη1 and k 2, ηk,t = Qk 1 i=1 (1 eηi)eηk 12: Rescale to make sure P k At ηk = 1 13: end if 14: Learn ηk,t Nk data for thread k At and receive new loss value ℓk for k At 15: Update regression bθk by Eq. (3.5) 16: end for accumulate over time, reducing the algorithm s overall effectiveness. This challenge requires a prediction-allocation balancing, where resources are allocated widely enough to ensure accurate predictions, while also being strategically focused on tasks with the greatest likelihood of success. To achieve this, we design our balancing strategy based on the idea of Explore-Then-Exploit (ETE) strategy from bandits theory (Lattimore & Szepesv ari, 2020). Explore-then-Exploit. At each time τ, we maintain an exploration set Eτ = {k | Pτ t=1 ηk,t Nk Hk, k At}, where Hk is a predefined exploration threshold. This threshold indicates that if a thread s cumulative allocated data amount is below Hk, it requires more data for accurate evaluation. If Eτ = , implying there are still exists underexplored active threads, we then allocate the resources of time unit τ uniformly among all threads in Eτ, such that k Ek, ηk,τ = 1/|Eτ |. This allocation is vital for gathering sufficient data for precise training loss estimation, thus significantly reducing errors in resource prediction. It is important to clarify that this approach differs from pretraining a model for loss curve fitting, as the exploration period is a critical part of each thread s time allocation. Setting the exploration threshold Hk appropriately is crucial: too long a period can use up excessive time, affecting subsequent resource allocation, while too short a period may cause significant prediction errors, exacerbating compounded errors. The complete LARA algorithm with the Explore-then-Exploit strategy, is detailed in Algorithm 2. Remark 2. The ETE strategy relies on a exploration threshold, which still depends on an empirical adjustment. This suggests the need for more adaptive algorithms, such as the Upper Confidence Bound (UCB). We could adopt UCB Learning with Adaptive Resource Allocation here by adding a bonus to resource prediction, for example, at time τ, set corrected resource prediction as e Sk,n = b Sk,n β q ln(τ bk)/Pτ t=bk ηk,t, where β > 0 is a UCB con- trol constant, ln(τ bk) represents the potential data budget, and Pτ t=bk ηk,t is the actual data received. Less exploration leads to a larger bonus and a smaller e Sk,n, encouraging more resource allocation to the thread. However, we have conducted preliminary experiments and found that UCB-type strategy may not outperform the simple ETE. This UCBtype strategy may often allocate fewer resources than required. This is because UCB adopts an optimistic way to encourage exploration: e Sk,n = b Sk,n optimism. So the allocated resource e Sk,n may be fewer than the actual resource required by the task. In the future, we might consider integrating other adaptive strategies such as Thompson sampling to improve the exploration phase of our approach. 4. Experiments In this section, we evaluate the empirical performance of our proposed LARA approach. We begin with an experiment involving a pure task bundle, where five different models are trained concurrently on the same dataset to demonstrate our approach s efficiency and effectiveness. Next, we conduct an experiment with a mixed task bundle, where different models are trained concurrently on different datasets. We then perform a scalability experiment to assess how our algorithm performs as the number of threads increases. Finally, we conduct experiments to evaluate the performance of specific components of our algorithm. In each experiment, each thread represents a model training task with a specific beginning time bk, deadline dk, and predetermined success threshold for the loss ϵk. Each time unit t [T] is set to one second. At the start of each time unit, the learner observes the training loss of the previous time unit and allocates the resources for the current time unit as needed. 4.1. Pure Task Bundle Experiment In this part, we consider a pure task bundle, where each task trains a different model but uses the same dataset. Setting. We focus on image classification of CIFAR-10 dataset (Krizhevsky et al., 2009). We train five distinct models concurrently, each featuring unique hyperparameters and structures. The training loss for all models is calculated as the average cross-entropy loss over the past training batches. All five models begin training concurrently, and the beginning time for each thread is set to 0-th second. The specific settings for each of the five models, including model type, data budget (Nk), deadline time (dk), and success threshold (ϵk), are outlined in Table 1. Nk is the maximum data processable per second calculated by the average time to process unit data during the exploration, and unit data is de- fined by a batch with 64 data points. For the deadline dk and success criteria ϵk, we initially designed a series of tasks that could be completed in short priority order. We then adjusted certain tasks, creating scenarios with both short and challenging tasks (small dk, large ϵk) as well as long and simple ones. For the subsequent experiment over mixed task bundle, we employed the same generation strategy. Model Vi T LSTM CNN Res Net18 Res Net34 Nk 163 220 375 97 55 dk 500 570 590 610 630 ϵk 1.35 1.18 0.15 0.35 0.45 Table 1. The setting of pure task bundle. We evaluate the performance of our proposed LARA approach against several classic resource allocation strategies: (a) Uniform Allocation (UA), which distributes computational resources equally across all active threads regardless of their specific needs or deadlines; (b) Shortest Thread First (STF), giving priority to threads with the nearest deadlines and allocating all resources to them first; (c) Least Resources First (LRF), allocating resources primarily to threads requiring the smallest amount of training data Sk, thereby aiming to quickly complete tasks with lower demands; and (d) Easiest Thread First (ETF), focusing on the ratio of required data to time constraints Sk/dk bk, thus prioritizing threads that are relatively easier to complete within their available time. Both Least Resources First and Easiest Thread First use the same resource prediction method as ours, as specified in (3.6). For exploration we set all exploration threshold Hk = 8000, this takes a total of about 335 seconds to explore all threads. Allocation Result. Figure 4(a) illustrates the results of successfully completed thread number using various resource allocation strategies. Both UA and STF methods allocate excessive resources to the challenging thread 1, leading to none of the threads being completed. LRF and ETF strategies, which employ our provided resource prediction method, manage to avoid dedicating resources to some of the more challenging threads. However, their inherently greedy allocation methods result in the completion of only two threads. In contrast, the LARA algorithm, by strategically giving up on the difficult-to-complete thread 1 early, efficiently allocates resources and successfully completes the remaining four threads. 4.2. Mixed Task Bundle Experiment In this part, we consider a mixed task bundle, where each task trains a different model on a different dataset. Setting. We consider 10 threads across four different types of learning tasks: computer vision (CV) with CIFAR-10, natural language processing (NLP) with IMDB (Maas et al., Learning with Adaptive Resource Allocation UA STF LRF ETF LARA 0 Success Thread Number (a) Pure task bundle STF LRF ETF UA LARA 0 Success Thread Number (b) Mixed task bundle with ϵ1 k ETF UA LRF STF LARA 0 Success Thread Number (c) Mixed task bundle with ϵ2 k Figure 4. Effectiveness experiments of resource allocation for pure and mixed task bundles. 2011), reinforcement learning (RL) with Montezuma s Revenge1, and audio processing with Yesno2, each using different models. All ten models begin training concurrently, with the starting time for each thread set to the 0-th second. The specific settings, including model type, data budget (Nk), deadline time (dk), and success threshold (ϵk), are outlined in Table 2. We provide two sets of success thresholds, ϵ1 k and ϵ2 k, to characterize task bundles of varying difficulties. Num Task Model dk ϵ1 k ϵ2 k Nk 1 CV Vi T 530 1.1 0.99 176 2 CV Vi T 540 1.35 1.215 176 3 RL DAgger+CNN 545 0.0011 0.0011 69 4 Audio Transformer 585 0.06 0.09 293 5 NLP Attention+LSTM 625 7 10 4 6.3 10 4 139 6 CV Res Net18 655 0.55 0.46 107 7 CV Res Net18 685 0.55 0.48 107 8 NLP Attention+LSTM 690 9 10 4 4.5 10 4 139 9 Audio Transformer 700 0.08 0.108 293 10 RL DAgger+CNN 710 0.0012 0.0006 69 Table 2. The setting of mixed task bundle. Allocation Result. Figures 4(b) and 4(c) illustrate the comparative results of successfully completed threads using various resource allocation strategies. The task bundle with ϵ1 k contains difficult short tasks (threads 1 and 2) and easier longer tasks. Figure 4(b) shows that this scenario is more suited for UA, which fails only in the short threads but completes the six easier long threads. STF allocates excessive resources to challenging threads 1 and 2, leading to only two tasks in the task bundle being completed. LRF and ETF, employing our resource prediction method, manage to avoid dedicating resources to some of the more challenging threads. However, their inherently greedy allocation methods result in the completion of only four threads. In contrast, the LARA algorithm, by strategically giving up on the difficult-to-complete threads 1 and 2 early, efficiently allocates resources and successfully completes the remaining eight threads. For the task bundle with ϵ2 k, threads 4 and 9 are made easier while the rest become harder, which is 1Details can be found at Montezuma s Revenge. 2Information about the Yesno dataset is available at Yesno. more suitable for STF. Figure 4(c) shows that, under this setup, UA finishes only three tasks. STF finishes five tasks by completing the two easier tasks. LRF and ETF continue to perform poorly due to their greedy allocation strategy. However, our LARA algorithm still manages to complete six tasks. These scenarios show that our LARA strategy performs well in different environments, demonstrating its robustness and effectiveness. 4.3. Scalability Experiment To more comprehensively validate our experiment, we aimed to investigate how increasing the number of training tasks influences the effectiveness of our approach. Setting. In this part, we focus on image classification of MNIST dataset (Deng, 2012). To generate task bundles, we start by randomly selecting a model from those listed in Table 1 and randomly setting its success threshold ϵ within a certain range. We run each model to completion with the full budgeted computational resources, recording the time taken as its deadline time. This deadline is then used as the beginning time of the next thread. After all threads generated, we let all beginning time be 0-second, ensuring that a series of learning tasks is created in a way that allows them to be completely solvable using the Shortest Thread First algorithm. To add complexity, we modify 1/3 of these tasks by reducing their success thresholds to 1/10 of the original values, thereby increasing the difficulty of the entire task bundle. In this experiment, we create task bundles with 10, 25, 50, 75, and 100 threads to analyze how the number of threads affects the performance of the scheduling algorithm. Result. Figure 6(a) demonstrates the changes in the number of successful threads for different resource allocation strategies as the number of threads increases. It is evident that LARA consistently outperforms other strategies with an increasing thread count. Uniform Allocation and Shortest Thread First, lacking thorough evaluation of threads, show declining success rates as the number of threads grows. In contrast, Least Resources First and Easiest Thread First, utilizing our resource prediction method, perform better Learning with Adaptive Resource Allocation 75 100 125 150 175 200 Data volume Average prediction error LS M4 Engression WLS (a) Prediction error over pure task bundle 75 100 125 150 175 200 Data volume Average prediction error LS M4 Engression WLS (b) Prediction error over mixed task bundle LS M4 Engression WLS Average Time (seconds) (c) Average time of single round prediction Figure 5. Effectiveness and efficiency experiments of resource prediction. than the former two strategies. This highlights the effectiveness of our resource prediction. However, their reliance on a greedy allocation strategy limits their effectiveness compared to LARA. Under the same resource prediction conditions, LARA continues to outperform them as the number of threads increases, demonstrating the superior performance of our allocation algorithm. Figure 6(b) shows the average time cost for a single round of resource prediction and allocation with our LARA algorithm as the number of threads increases. It is observed that the time taken for each update grows linearly with the number of threads. Notably, even with 100 threads, a single update period requires only 0.0007 seconds. Since LARA performs a periodical resource prediction and allocation every second, this time cost has a negligible impact on the model training process. Moreover, all model training tasks are executed on the GPU, while prediction and allocation tasks are handled on the CPU, further minimizing LARA s impact on model training. 4.4. Component Performance Evaluation In this section, we evaluate the performance of different resource prediction methods on a pure task bundle in Table 1 and on a mixed task bundle in Table 2. Setting. We compare the prediction performance of our proposed Weighted Least Squares (WLS) method with several extrapolation methods mentioned in Remark 1: (a) Least Squares (LS); (b) the M4 estimator (Alabdulmohsin et al., 2022); and (c) Engression (Shen & Meinshausen, 2023). We evaluate these four extrapolation methods based on two measures: (i) the average prediction error, calculated for each task using the formula |b S S|/S, where b S is the predicted resource and S is the true resource, then averaging these errors across the entire task bundle; and (ii) the average time cost of a single round of prediction. For the WLS, we set the discounted factor γ to 0.9. Both the M4 and Engression methods are applied with their default settings. To perform the extrapolations, we use varying amounts of data points (specifically, 60, 80, up to 200 data points) for each task. 20 40 60 80 100 Number of threads Success Number Uniform Allocation Shortest Thread First Least Resources First Easiest Thread First LARA (a) Scalability experiment of effectiveness 10 25 50 75 100 Number of threads Average Time (seconds) (b) Scalability experiment of efficiency Figure 6. Scalability experiments. Prediction Result. Figure 5(a) illustrates how the average prediction error changes as more data points are used for the pure task bundle. Surprisingly, the results reveal that the WLS method results in a lower prediction error compared to methods specifically designed for extrapolation. The LS method, which is originally for interpolation, shows the worst performance in this scenario. Figure 5(b) explores performance over the mixed task bundle. Here, WLS performs slightly worse than the two extrapolation-specific strategies but significantly better than LS. Among these methods, Engression outperforms the others for extrapolation. Additionally, Figures 5(c) demonstrate that WLS incurs significantly lower time costs compared to both the M4 estimator and Engression. This suggests that WLS offers a good balance between accuracy and efficiency, achieving relatively pre- Learning with Adaptive Resource Allocation cise outcomes efficiently. We also conducted parameter sensitivity tests on some components of our algorithm. For more experimental details, please refer to Appendix B. 5. Related Topics and Discussion In this section, we will discuss some topics related to combining machine learning with resource scheduling. Distributed Machine Learning. Distributed Machine Learning (DML) (Dean et al., 2012; Li et al., 2014) aims to improve the training efficiency of a single large-scale model through distributed computing strategies, especially in scenarios with sufficient computational resources. During the learning process, DML promotes parallel processing by dividing data and model training tasks, and achieves load balancing through reasonable resource allocation methods to accelerate the training process. Although DML also considers the problem of combining machine learning and resource allocation, it diverges significantly from our study. In DML, multiple training tasks are usually coordinated, and the efficiency of each training task is ensured through resource allocation to maximize the efficiency and effect of the entire model training. In contrast, our focus is on training multiple independent learning tasks concurrently with budgeted computing resources. Here, each learning task competes for these budgeted resources, which requires careful evaluation of each learning task and making informed task-wise trade-offs in resource allocation. Tiny/Edge Machine Learning. Tiny/Edge ML (Lin et al., 2023) focuses on processing learning tasks on devices with extremely limited resources. A line of studies (Wang et al., 2020a;b; Zhou et al., 2021) explored parallel training of multiple learning tasks on power-constrained edge devices, aiming to ensure the effectiveness of each learning task through power resource allocation. Although these works also study the scheduling of multiple learning tasks under resource limitations, they have yet to consider the key challenge caused by the coupling of learning and allocation, as raised in our paper. In their approach, each task s learning progress is known in advance, escaping the uncertainty inherent in the learning process and reducing the issue to a purely computational problem. Furthermore, this also exhibits many setup differences, including the absence of deadlines and varying objectives for scheduling, etc. In contrast, in our study, we cannot obtain any prior before the start of each task. This requires us to predict the learning progress of each task during the resource allocation process. The uncertainty caused by the prediction will significantly affect the allocation decisions, requiring us to balance the accuracy of prediction and the effectiveness of allocation. Related Studies in Operation System. Recently, a line of operating system research focuses on designing clus- ter schedulers to improve the efficiency and effectiveness of parallel training of multiple learning tasks. Efforts devoted to maximizing the efficiency of CPU or GPU clusters by Zhang et al. (2017) and Peng et al. (2018) also involve integrating learning task management with resource allocation. However, our research diverges fundamentally in both its objectives and challenges. Unlike these studies, which often rely on pre-training estimations of learning progress, our approach requires the real-time estimation for learning progress of each task during the training phase. Our objective is not just to minimize time or maximize performance; instead, we are faced with the problem of making strategic decisions about which tasks to prioritize. Sometimes, this means making the difficult choice to sacrifice the completion of certain tasks to ensure the successful execution of others. This trade-off between the completion of individual tasks and the overall success of the learning objectives is a unique aspect of our research, distinguishing it fundamentally from existing studies on combining learning and scheduling. 6. Conclusion and Future Work In this paper, we consider the Computational Resource Efficient Learning (Co RE-Learning) problem which aims to manage multiple time-constrained learning tasks with budgeted computational resources. We identify that the key challenge of this problem lies in the coupling of learning progress and resource allocation. To address this issue, we propose the LARA approach which periodically predicts the learning progress of each learning task, and allocates resources accordingly. LARA consists of three essential components: an efficient online estimator to predict the resource requirements of each thread, an adaptive searching method for resource allocation guaranteed to find the optimal solution and an exploration strategy for predictionallocation balancing. Through a series of experiments, we validate the effectiveness of our approach in handling multiple learning tasks with limited computational resources. Currently, our LARA approach is in a preliminary stage, with several directions open for future research in this field. Methodologically, we use WLS for resource prediction, which is essentially an interpolation strategy for extrapolating loss curves. Therefore, future work could explore efficient and extrapolation-specific strategies. For predictionallocation balancing, we employ the ETE approach, which depends on predefined exploration thresholds. Future research could investigate more adaptive exploration methods, such as ideas based on Thompson sampling. Theoretically, while the optimality has been proved for our resource allocation method, it remains open on how to provide generalization guarantees for the overall process. Future work could introduce curve extrapolation or scaling law theories to provide the theoretical basis and establish the learnability guarantee in the context of Co RE-Learning framework. Learning with Adaptive Resource Allocation Acknowledgements This research was supported by National Science and Technology Major Project (2022ZD0114800) and Collaborative Innovation Center of Novel Software Technology and Industrialization. Impact Statement This paper presents a preliminary study on multiple learning tasks training concurrently with limited computational resources. High-performance computational resources are often expensive and inaccessible, creating resource bottlenecks. Our research aims to enhance resource utilization efficiency and fairness while reducing energy waste. This line of research benefits budget-constrained users and supports better intelligent supercomputing resource management. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. Ar Xiv preprint, arxiv:2303.08774, 2023. Alabdulmohsin, I. M., Neyshabur, B., and Zhai, X. Revisiting neural scaling laws in language and vision. In Advances in Neural Information Processing Systems 35 (Neur IPS), pp. 22300 22312, 2022. Anil, R., Gupta, V., Koren, T., and Singer, Y. Memory efficient adaptive optimization. In Advances in Neural Information Processing Systems 32 (Neur IPS), pp. 9746 9755, 2019. Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M. Z., Ranzato, M., Senior, A. W., Tucker, P. A., Yang, K., and Ng, A. Y. Large scale distributed deep networks. In Advances in Neural Information Processing Systems 25 (NIPS), pp. 1232 1240, 2012. Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141 142, 2012. Domhan, T., Springenberg, J. T., and Hutter, F. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), pp. 3460 3468, 2015. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021. Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019. Guo, L., Ljung, L., and Priouret, P. Performance analysis of the forgetting factor RLS algorithm. International Journal of Adaptive Control and Signal Processing, 7(6): 525 537, 1993. Haykin, S. S. Adaptive Filter Theory. Pearson Education India, 2002. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the 29th (IEEE) Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770 778, 2016. Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. Ar Xiv preprint, ar Xiv:1712.00409, 2017. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lo RA: Low-rank adaptation of large language models. In Proceedings of the 10th International Conference on Learning Representations (ICLR), 2022. Kaplan, J., Mc Candlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. Ar Xiv preprint, arxiv:2001.08361, 2020. Krizhevsky, A., Hinton, G., et al. Learning Multiple Layers of Features from Tiny Images. Toronto, ON, Canada, 2009. Lattimore, T. and Szepesv ari, C. Bandit Algorithms. Cambridge University Press, 2020. Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 583 598, 2014. Lin, J., Zhu, L., Chen, W.-M., Wang, W.-C., and Han, S. Tiny machine learning: Progress and futures. IEEE Circuits and Systems Magazine, 23(3):8 34, 2023. Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Learning with Adaptive Resource Allocation Association for Computational Linguistics: Human Language Technologies (ACL-HLT), pp. 142 150, 2011. Mirzadeh, S., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., and Ghasemzadeh, H. Improved knowledge distillation via teacher assistant. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), pp. 5191 5198, 2020. Peng, Y., Bao, Y., Chen, Y., Wu, C., and Guo, C. Optimus: An efficient dynamic resource scheduler for deep learning clusters. In Proceedings of the 13th European Conference on Computer Systems (Euro Sys), pp. 1 14, 2018. Shen, X. and Meinshausen, N. Engression: Extrapolation for nonlinear regression? Ar Xiv preprint, arxiv:2307.00835, 2023. Wang, S., Wang, R., Hao, Q., Wu, Y., and Poor, H. V. Learning centric power allocation for edge intelligence. In Proceedings of the IEEE International Conference on Communications (ICC), pp. 1 6, 2020a. Wang, S., Wu, Y., Xia, M., Wang, R., and Poor, H. V. Machine intelligence at the edge with learning centric power allocation. IEEE Transactions on Wireless Communications, 19(11):7293 7308, 2020b. Zhang, H., Stafman, L., Or, A., and Freedman, M. J. Slaq: Quality-driven scheduling for distributed machine learning. In Proceedings of the 8th Symposium on Cloud Computing (So CC), pp. 390 404, 2017. Zhou, L., Hong, Y., Wang, S., Han, R., Li, D., Wang, R., and Hao, Q. Learning centric wireless resource allocation for edge computing: Algorithm and experiment. IEEE Transactions on Vehicular Technology, 70(1):1035 1040, 2021. Zhou, Z.-H. Learnability with time-sharing computational resource concerns. Ar Xiv preprint, ar Xiv:2305.02217, 2023. Learning with Adaptive Resource Allocation In this section, we provide the analysis for our resource allocation strategy. In Appendix A.1, we analyze our parameter reduction method and prove that optimal solutions to Problem 3.7 can still be found after reduction, which is captured in Lemma 1. In Appendix A.2, we provide a proof for the optimality of our adaptive binary tree search method for Problem 3.7. A.1. Proof of Lemma 1 Proof. For Problem 3.7, we let set C, comprising KC threads, represent those successfully completed under the optimal allocation solution denoted by {η k,t}k C,t [d0,d KC ]. The threads in C are sorted and labeled such that the deadline time satisfies that d0 < d1 < d2 < ... < d KC, where d0 represents the current unit time. Then the optimal allocation should satisfy the following three optimal conditions, (i) k C, Dk t=d0+1 η k,t Sk (ii) t [d0, d KC], X k C η k,t 1; (iii) k C, t [d0, d KC], η k,t 0. At the same time, it can also be found that the allocation method that satisfies these three conditions is the optimal allocation method. We can first modify the optimal allocation {η k,t}k C,t [d0,d KC ] such that in any interval [di 1, di] satisfying i C, di 1 = di, η k,t = η k,[di 1,di] is time-invariant. For the interval [di 1, di], we let η k,[di 1,di] = Pdi t=di 1+1 η k,t di di 1 , (A.1) we can find that the allocation (A.1) still satisfies the optimal condition: i=1 (di di 1)η k,[di 1,di] = t=di 1+1 η k,t = t=d0+1 η k,t = Dk Sk (ii) i C, X k C η k,[di 1,di] = k C Pdi t=di 1+1 η k,t di di 1 = Pdi t=di 1+1 P k C η k,t di di 1 di di 1 di di 1 = 1, (iii) k C, i C, η k,[di 1,di] = Pdi t=di 1+1 η k,t di di 1 0. Based on condition (ii), we know that Pk j=1 η j,[di 1,di] P j C η j,[di 1,di] 1, then we have i C, k C, η k,[di 1,di] 1 j=1 η j,[di 1,di], then k C, we have i=1 η k,[di 1,di](di di 1) = Dk j=1 η j,[di 1,di] (di di 1) Dk i=1 (di di 1) j=1 η j,[di 1,di](di di 1) Dk Learning with Adaptive Resource Allocation i=1 η j,[di 1,di](di di 1) Dk So we have k C, Dk dk d0 Pk 1 j=1 Dj 1. (A.2) For the 1-st thread, we define eη1,[d0,d1] η 1,[d0,d1], then we have (d1 d0)eηk,[d0,d1] = D1, then for the interval [d0, d1], we can define eη2,[d0,d1] η 2,[d0,d1] (1 eη1,[d0,d1]) eη3,[d0,d1] η 3,[d0,d1] (1 eη1,[d0,d1])(1 eη2,[d0,d1]) eηk,[d0,d1] η k,[d0,d1] Qk 1 i=1 (1 eηi,[d0,d1]) Similarly, for any interval [di 1, di], i C, we can define eηi,[di 1,di] η i,[di 1,di] ... eηk,[di 1,di] η k,[di 1,di] Qk 1 j=i (1 eηj,[di 1,di]) Based on the success criteria and definition of eηi,[di 1,di], we have eη1,[d0,d1](d1 d0) = D1 (1 eη1,[d0,d1])eη2,[d0,d1](d1 d0) + eη2,[d1,d2](d2 d1) = D2 ... i=j (1 eηi,[di 1,di])eηk,[dj 1,dj](dj dj 1) + eηk,[dk 1,dk](dk dk 1) = Dk Next, we try to prove that for any k, eηk,[d0,d1], ..., eηk,[dk 1,dk] can be replace by a fixed eηk, which still satisfies these three optimal conditions. We let eηk = Dk Pk 1 j=1 Qk 1 i=j (1 eηi)(dj dj 1) + (dk dk 1) , (A.3) Learning with Adaptive Resource Allocation And for any interval [di 1, di], i C, the allocation becomes η i,[di 1,di] = eηi η k,[di 1,di] = j=i (1 eηj)eηk then we satisfies condition (i) as follows, i=j (1 eηi)eηk(dj dj 1) + eηk(dk dk 1) = Dk. Moving eηk on the left side of the equation to the right, we have i=j (1 eηi)(dj dj 1) + (dk dk 1) i=j (1 eηi)(dj dj 1)(1 eηk 1) + (1 eηk 1)(dk 1 dk 2) + (dk dk 1) i=j (1 eηi)(dj dj 1) + (dk 1 dk 2) + (dk dk 1) eηk 1 Dk 1 + (dk dk 1) i=1 (di di 1) Then we still have eηk = Dk dk d0 Pk 1 i=1 Di , based on property (A.2), we have k C, eηk 1. Then we have k C η k,[di 1,di] = eηi + j=i (1 eηj)eηk j=i (1 eηj)eηk + j=i (1 eηj) j=i (1 eηj)eηk + j=i (1 eηj) eηi + 1 eηi = 1, which satisfies the condition (ii). Building upon the conditions that k C, eηk 1, the sequence of deadlines d0 < . . . < d KC, and the allocation method as described by equation (A.3), we can obtain that k C, eηk 0. Based on this observation, η k,[di 1,di] further satisfies the condition (iii), thus we complete the proof. Learning with Adaptive Resource Allocation !η1 = α1, !η2 = α2 !η1 = α1, !η2 = 0 !η1 = 0, !η2 = α2 !η1 = α1, !η2 = α2, !η3 = 0 !η1 = α1, !η2 = 0, !η3 = 0 !η1 = 0, !η2 = α2, !η3 = 0 !η1 = 0, !η2 = 0, !η3 = 0 !η1 = α1, !η2 = α2, !η3 = α3 !η1 = α1, !η2 = 0, !η3 = α3 !η1 = 0, !η2 = α2, !η3 = α3 !η1 = 0, !η2 = 0, !η3 = α3 !η1 = 0, !η2 = 0 Figure 7. Example of Binary Search Tree with 3-layers. A.2. Proof of Theorem 1 Proof. For the Problem 3.7, let set C, comprising KC threads, denote those successfully completed under the optimal allocation solution denoted by {η k,t}k C,t [d0,d KC ]. According to Lemma 1, we know that there exists a set of {eη k}k Aτ , where k Aτ, 0 eη k 1, can formulate a solution with Eq. (3.8) also achieve set C, such that k C, Dk(eη k) Nk Sk, where Dk is defined in Eq. (3.9), representing accumulated data throughput. Since Dk(eηk) is a linear monotonically increasing function with respect to eηk, which means there exists set of {eηk}k Aτ , where k C, 0 eηk eη k satisfying Dk(eηk) Nk = Sk and k / C, eηk = 0. This means {eηk}k Aτ will also achieve the optimal solution, since the indicator function in Problem 3.7 only need Dk(eηk) Nk = Sk to success. Now for eηk for each thread k Aτ only has two outcomes: success (Dk(eηk) Nk Sk) or failure (eηk = 0), which can be formulated as 0, if Sk Nk( Pk 1 j=1 Qk 1 i=j (1 eηi)(dj dj 1)+1) > 1, Sk Nk( Pk 1 j=1 Qk 1 i=j (1 eηi)(dj dj 1)+1), otherwise. (A.4) We can observe that, eηk is determined by the previous eη1, . . . , eηk 1. Based on this allocation method, we can construct a complete binary tree of Kτ layers, Figure 7 provides a example for illstration, where αk Sk/Nk( Pk 1 j=1 Qk 1 i=j (1 eηi)(dj dj 1)+1). Now the optimal solution of Problem 3.7 is the path contains maximum number of eηk = αk, αk 1 within this tree. So now, we just need to prove that, the output of Algorithm 1, is the optimal path of this tree such that contains maximum number of eηk = αk, αk 1. We present an exchange argument to prove the optimality of Algorithm 1 for finding the optimal path. Let A and B denote two sequences of threads, ordered by their duration, representing the choices made by our algorithm and a hypothetical alternative, respectively. In these sequences, threads are either chosen to be allocated just enough resources, or not included at all. The success count for A is #succ A, and we assume that B has a higher success count #succ B > #succ A. Defining A and B. Sequence A represents the threads chosen by our algorithm, while sequence B represents an alternate set of threads with a presumed higher success rate. Both A and B are ordered based on thread deadline time, with shorter-duration threads prioritized. Identifying the First Divergence. By comparing sequences A and B, we identify the first point of divergence at thread k. Two scenarios are considered: (i) A includes thread k but B does not; (ii) B includes thread k but A does not. (i) If A chooses thread k and B does not, we examine the subsequent threads in sequence. If B selects a later thread k + i that A omits, we can replace thread k + i in B with k because A s choice reflects a more resource-efficient selection. We continue this process until we reach a thread k + j where up to k + j, #succ B equals #succ A. Since A makes more resource-efficient choices, we can align B s choices with A without reducing #succ B. (ii) If B includes thread k, it implies that at thread k, A also considers k but ultimately replaces it with a later thread i where Di < Dk. If B selects k but not i, we can directly replace k in B with i, maintaining #succ B. If B selects both k and i, and a thread j exists with Dj < Dk not chosen by B, then we can replace k in B with j. If no such j exists and all threads in A with D < Dk are included in B, it becomes impossible for B to have #succ B > #succ A. Conclusion. By evaluating each thread from 1 to K and adjusting B to match A, a contradiction arises if we assume #succ B > #succ A. Therefore, no better choice exists than the output of Algorithm 1, confirming its optimality. Learning with Adaptive Resource Allocation 0 1 2 3 4 Data volume 104 Training Loss data point true loss regular least square weighted least square (a) Curve fitting performance of thread 1 1 2 3 4 Data volume 104 Prediction Error Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 (b) Convergence of prediction error Average Time (seconds) 2.87e-04 2.75e-04 2.84e-04 2.97e-04 (c) Average time of single round prediction Figure 8. Effectiveness and efficiency experiments in resource prediction. 350 400 450 500 550 600 Time (seconds) Data throughput Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 (a) Adaptive changes in data throughput of LARA 450 500 550 600 650 Time(seconds) Success Threads Number (0/5) Uniform Allocation (0/5) Shortest Thread First (2/5) Least Resources First (2/5) Easiest Thread First (4/5) LARA (b) Algorithm comparation Figure 9. Performance of computational resource allocation. B. Experimental Supplement In this section, we provide supplementary information for our experiments. Appendix B.1 outlines the specific settings for the model structures used in our experiments. Appendix B.2 presents detailed illustrations of the resource prediction and resource allocation process for the pure task bundle experiments. Additionally, we conduct parameter sensitivity tests in Appendix B.3 to explore the performance of the LARA algorithm under different exploration thresholds Hk and various discounted factors γ, respectively. B.1. Implementation Details In this section, we will discuss the implementation of the model architectures detailed in Table 1 and Table 2. For Table 1, the first model is a Vision Transformer (Vi T) (Dosovitskiy et al., 2021), implemented as a Simple Vi T model.3 The second model is a Long Short-Term Memory (LSTM) network, comprising two LSTM layers followed by three fully connected layers. The third model is a Convolutional Neural Network (CNN), including three convolutional layers, two pooling layers, and two fully connected layers. The fourth and fifth models are Res Net18 and Res Net34, respectively (He et al., 2016). For Table 2, the implementations are as follows: Tasks 1 and 2 use Vi T, and tasks 6 and 7 use Res Net18, both of which follow the implementations described in Table 1. Tasks 3 and 10 use a DAgger+CNN model, involving five iterations of convolutional and pooling layers, followed by flattening and three fully connected layers. Tasks 5 and 8 use an Attention+LSTM model, where the embedding results are fed into a self-attention layer, followed by LSTM layers. The final hidden state is passed through four fully connected layers. Tasks 4 and 9 use a Transformer model, which processes the input through a Transformer Encoder and then passes it through a fully connected layer. B.2. Details of Pure Task Bundle Experiments In this section, we present additional details for the pure task bundle experiments described in Section 4.1. First, we illustrate the loss function curve fitting performance of the weighted least squares method for each threads, as well as the associated time costs. Next, we provide a detailed view of the real-time resource allocation process over the running of the five-thread 3https://github.com/lucidrains/vit-pytorch?tab=readme-ov-file Learning with Adaptive Resource Allocation 0 1 2 3 4 Data volume 104 Training Loss data point true loss regular least square weighted least square (a) Curve fitting of thread 2 0 1 2 3 4 Data volume 104 Training Loss data point true loss regular least square weighted least square (b) Curve fitting of thread 3 0 1 2 3 4 Data volume 104 Training Loss data point true loss regular least square weighted least square (c) Curve fitting of thread 4 0 1 2 3 4 Data volume 104 0 Training Loss data point true loss regular least square weighted least square (d) Curve fitting of thread 5 Figure 10. Loss function curve fitting of weighted least square on different models. experiment and show how the completion status of each thread changes over time. Prediction Result. Figure 8(a) shows the fitting and extrapolation of the loss function for thread 1, using the loss values observed during the exploration phase. The performance across the remaining four threads is shown in Figure 10. We compare our weighted least squares method, with a discounted factor γ = 0.9, against the regular least squares approach. The weighted least squares method provides a closer approximation to the actual loss function during the extrapolation phase and outperforms the regular method, despite slightly inferior performance during the interpolation phase. This indicates that weighted least squares is more suitable for predicting future trends of the loss function. Figure 8(b) presents the estimation error between the estimated and actual data amounts needed to meet the success criteria. The estimation error progressively diminishes and eventually converges to zero as the volume of data increases. Figure 8(c) shows the average time taken for resource prediction across different tasks. It demonstrates that the time cost of a single round prediction across different model training tasks is generally consistent, indicating minor impact from model differences. This time cost is negligible compared to the time unit (e.g., one second), and therefore does not impact the model training process. Allocation Result. Figure 9(a) illustrates the dynamic changes in data throughput ηk,t for LARA, highlighting the process of ongoing iterative adjustments in resource prediction and resource allocation. The first 335 seconds is the exploration period, such that all the threads have equal data throughput 0.2. Initially, LARA fully explores thread 1 during the exploration period and abandons it directly after the exploration period due to its low likelihood of completion. Subsequently, between 335 and 500 seconds, LARA continually refines its resource allocation strategy, and pauses the thread 5 (has the most ample time) multiple times to prioritize resources for other more urgent tasks around 450-second. Around 500-second, it prioritizes allocating more resources to thread 2 and thread 3 due to their more urgent deadlines. The remaining resources are then directed towards completing threads 4 and 5. Figure 9(b) illustrates the comparative results of successfully completed threads over time using various resource allocation strategies. Both Shortest Thread First and Uniform Allocation methods allocate excessive resources to the challenging thread 1, leading to none of the threads being completed. Least Resources First and Easiest Thread First strategies, employing our provided resource prediction method, manage to avoid dedicating resources to some of the more challenging threads. However, their inherently greedy allocation methods result in the Learning with Adaptive Resource Allocation 1000 2000 3000 4000 5000 6000 7000 8000 Exploration threshold Hk Success Thread Number Task bundle 1 k Task bundle 2 k (a) Parameter sensitivity test of Hk 0.70 0.75 0.80 0.85 0.90 0.95 1.00 Discounted factor Success Thread Number Task bundle 1 k Task bundle 2 k (b) Parameter sensitivity test of γ Figure 11. Parameter sensitivity tests of exploration threshold Hk and discounted factor γ. completion of only two threads. In contrast, the LARA algorithm, by strategically giving up on the difficult-to-complete thread 1 early, efficiently allocates resources and successfully completes the remaining four threads. B.3. Parameter Sensitivity Tests In this section we conduct the sensitivity experiments to study the LARA s performance under different exploration thresholds Hk and dicounted factors γ, repectively. Both experiments focus on the mixed task bundle over different sets of success criteria (ϵ1 k and ϵ2 k), as outlined in Table 2. Exploration threshold Hk. We test the success thread number of our LARA algorithm with different choices of the exploration threshold Hk with a fixed discounted factor γ = 0.9. Figure 11(a) shows that LARA s performance decreases when Hk is either too small or too large, and becomes relatively stable when lies in an appropriate region 4000 6000. When Hk is too small, the precision of resource predictions is very low, leading to poor resource allocation. Conversely, a too large Hk results in excessive exploration time, leaving inadequate time for effective resource allocation. Discounted factor γ. We evaluate the successful thread number of LARA employing weighted LS estimators with different discounted factors γ with a fixed Hk = 6000. Figure 11(b) shows that when γ is either too large (close to 1) or too small (close to 0.5), there is a significant increase in prediction error, leading to bad performance of LARA. Both results over task bundles with success criteria ϵ1 k and ϵ2 k indicate that when discounted factor γ in an appropriate region 0.85 0.95, the performance of LARA approaches its optimum.