# The CLRS Algorithmic Reasoning Benchmark

Petar Veličković, Adrià Puigdomènech Badia, David Budden, Razvan Pascanu, Andrea Banino, Misha Dashevskiy, Raia Hadsell, Charles Blundell (DeepMind)

Correspondence to: Petar Veličković.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

## Abstract

Learning representations of algorithms is an emerging area of machine learning, seeking to bridge concepts from neural networks with classical algorithms. Several important works have investigated whether neural networks can effectively reason like algorithms, typically by learning to execute them. The common trend in the area, however, is to generate targeted kinds of algorithmic data to evaluate specific hypotheses, making results hard to transfer across publications, and increasing the barrier to entry. To consolidate progress and work towards unified evaluation, we propose the CLRS Algorithmic Reasoning Benchmark, covering classical algorithms from the Introduction to Algorithms textbook. Our benchmark spans a variety of algorithmic reasoning procedures, including sorting, searching, dynamic programming, graph algorithms, string algorithms and geometric algorithms. We perform extensive experiments to demonstrate how several popular algorithmic reasoning baselines perform on these tasks, and, consequently, highlight links to several open challenges. Our library is readily available at https://github.com/deepmind/clrs.

## 1. Introduction

Neural networks and classical algorithms are two techniques that operate on diametrically opposite (and complementary) sides of problem-solving: neural networks can adapt and generalise to raw inputs, automatically extracting appropriate features, and a single neural network setup is often applicable to many separate tasks (Zamir et al., 2018). However, they are hard to interpret, notoriously unreliable when extrapolating outside of the dataset they have been trained on, and rely on massive quantities of training data. On the other hand, algorithms trivially strongly generalise to inputs of arbitrary sizes, and can be verified or proven to be correct, with interpretable step-wise operations. Their shortcoming is that inputs must be made to conform to a particular algorithm specification, and tackling a separate task often requires coming up with an entirely new algorithm (Veličković & Blundell, 2021). Bringing the two sides closer together can therefore yield the kinds of improvements to performance, generalisation and interpretability that are unlikely to occur through architectural gains alone.

Accordingly, algorithmic modelling as a domain for testing neural networks has been gaining popularity over the last few years (Zaremba & Sutskever, 2014; Kaiser & Sutskever, 2015; Trask et al., 2018; Vinyals et al., 2015; Kool et al., 2018; Freivalds et al., 2019; Dwivedi et al., 2020; Chen et al., 2020; Tang et al., 2020; Veličković et al., 2019; Yan et al., 2020; Deac et al., 2020), due to its ability to highlight various reasoning limitations of existing architectures. Earlier work (Zaremba & Sutskever, 2014; Kaiser & Sutskever, 2015) focused on the need for long-term memory capabilities when executing algorithms, which offered a good test-bed for various recurrent and memory architectures.
Recently, algorithmic tasks have been used to highlight the efficiency of graph neural networks (Dwivedi et al., 2020; Chen et al., 2020; Veličković et al., 2019; Yan et al., 2020; Corso et al., 2020; Tang et al., 2020; Georgiev & Liò, 2020; Veličković et al., 2020) and to distinguish between different variations of them, typically through the lens of algorithmic alignment: architectures that align better with the underlying algorithm can be proven to have better sample complexity (Xu et al., 2019). Unfortunately, many of these works remain disconnected in terms of the algorithms they target, how the data is presented to the model, or the training and testing protocols they use, making direct comparison somewhat difficult.

To make a first step towards a unified benchmark for algorithmic reasoning tasks, we propose a comprehensive dataset which we will refer to as the CLRS Algorithmic Reasoning Benchmark, in homage to the Introduction to Algorithms textbook by Cormen, Leiserson, Rivest and Stein (Cormen et al., 2009). Within this benchmark, we propose and evaluate on CLRS-30: a dataset containing trajectories (a trajectory is formed of inputs, the corresponding outputs, and optional intermediary targets) of 30 classical algorithms covering various forms of reasoning, including sorting, searching, dynamic programming, geometry, graphs and strings. Some of these algorithms are depicted in Figure 1. The appeal and motivation for such a benchmark goes beyond unifying or providing a common ground for previous works, as we will describe. We believe that CLRS-30 is well positioned to explore out-of-distribution (OOD) generalization and transfer (potentially as part of a meta-learning setting), given the explicit and known relationship between different algorithms (e.g. which subroutines are shared, and so forth).

## 2. Motivation

Figure 1. Example of four algorithms within CLRS-30: A) insertion sort; B) string matching; C) greedy task scheduling; D) shortest paths.

Timely posed benchmarks have led to significant progress in the field, from the impact of ImageNet (Russakovsky et al., 2015) on the vision community, to that of Wikipedia and Penn Treebank in popularizing neural networks for language modelling (Merity et al., 2016; Mikolov et al., 2011), or Atari-2600 for deep reinforcement learning (Bellemare et al., 2013). The prevalence of recent works focusing on algorithmic reasoning,[^1] as well as a history of disparate work on a variety of bespoke benchmarks (Graves et al., 2014; Zaremba & Sutskever, 2014; Kaiser & Sutskever, 2015; Trask et al., 2018), suggests significant utility in a benchmark covering a wide range of classical CS algorithms. Learning to mimic an algorithm also provides an opportunity to extensively test the limitations of architectures, both in terms of their representation capacity and their processing. This can then be related back directly to the underlying operations and qualities of the well-studied CS algorithms being mimicked, as we are aware of both the process used to generate the inputs and the specifics of the underlying function producing the corresponding outputs.

[^1]: Concurrent works published at the same venue include (Xu et al., 2019; Veličković et al., 2019) at ICLR'20 and (Veličković et al., 2020; Corso et al., 2020; Tang et al., 2020) at NeurIPS'20.
Hence, benchmarking in this area can be used to better understand the limitations of current architectures and the optimisation schemes used. This benchmarking can come in many forms:

- Data can be easily generated, allowing neural network behaviour to be probed under different regimes: from few-shot learning all the way to infinite data.
- Algorithms can be used to understand the efficiency of different inductive biases and neural components. For example, a recent study (Tang et al., 2020) has demonstrated the direct benefits of choosing inductive biases that align well with iterative algorithms. Algorithms have also been used to highlight the importance of attention mechanisms (Graves et al., 2014) or to disambiguate various message-passing mechanisms for graph neural networks (Richter & Wattenhofer, 2020; Joshi et al., 2020; Veličković et al., 2019).
- Algorithms can require repeated computation, recursion, or performing very different forms of computation conditioned on the input, providing an excellent test-bed for evaluating compositionality, i.e. whether an algorithm executor can effectively exploit these repeated computations.
- One can control the amount of memory required to solve a problem instance, and hence test the memorization ability of neural networks. Moreover, one can build a curriculum of tasks of increasing memory requirements (Zaremba & Sutskever, 2014).
- Control over the difficulty of problem instances also allows the behaviour of a trained model to be tested on OOD samples. While neural networks are highly effective at solving complex perceptual tasks, current theoretical understanding suggests that their power relies on their ability to interpolate (Liu et al., 2020; Belkin et al., 2019; Jacot et al., 2018), limiting them to in-distribution generalisation. General reasoning systems, however, need to be able to expand beyond this type of generalization. OOD generalization (Li et al., 2020) is paramount, as generally one cannot control the distribution a model will face over time when deployed.

Understanding how algorithms operate on corner cases is a standard approach for analysing their correctness. Similarly, understanding the behaviour of a trained model on larger instances of the problem, or instances that expose corner cases that were not covered in the training set, can elucidate to what degree the model has truly learned the algorithm (as opposed to overfitting to specific statistics of the training data). In particular, we can control how far from the training distribution a test instance is, potentially allowing us to understand to what extent the model generalizes OOD, and under which circumstances. In turn, this can offer insight into the effectiveness of different inductive biases, highlighting what kinds of inductive biases are useful for mimicking reasoning processes.

One would also expect a general reasoning system to be able to reuse parts of learned computations when learning a new task, and to compose learnt computational subroutines (Lake, 2019; Griffiths et al., 2019; Alet et al., 2018). These forms of generalization have been the aim of several learning paradigms, from transfer learning to meta-learning, continual learning and domain adaptation. However, many of these paradigms rely on the concept of a task, and measuring or understanding the ability of a learned system to reuse or compose requires the ability to decompose a task into sub-tasks and to relate tasks among themselves.
In many scenarios, such decompositions are ambiguous. Without a clear segmentation into sub-tasks, there can be no clearly defined distance metric between tasks (Du et al., 2018). Conversely, algorithms are built from subroutines that tend to be extensively shared, providing a good playground for formalizing and measuring reuse and composition, making an algorithmic reasoning benchmark potentially attractive to meta-learning practitioners.

Lastly, and fundamentally, computer scientists rely on a relatively small[^2] number of algorithms to address an extremely vast set of problems. They can be seen as a very powerful basis that spans most forms of reasoning processes. On one hand, this means that any generic reasoning system likely has to be able to reproduce all such kinds of procedures; hence, building a system that properly learns all of them is a major stepping stone towards generic reasoning. On the other hand, this means that they can be used to discover inductive biases that will enable tackling more complex problems. This is either because these complex problems can be seen as a combination of several algorithms, or because learning certain algorithms can provide a reliable way for the model to learn how to access its own memory, how to attend to its input, or other such internal mechanisms. So, by first training on algorithms (potentially controlling the difficulty of training instances), one can pre-train for tasks where full trajectories may not be available (Veličković et al., 2021). One such example is discovering novel polynomial-time heuristics for combinatorial optimisation (Bengio et al., 2020; Cappart et al., 2021; Khalil et al., 2017) or reinforcement learning (Deac et al., 2021).

[^2]: The entire Introduction to Algorithms textbook (Cormen et al., 2009) proposes and discusses 100 algorithms in total.

Note that our focus with this benchmark lies in learning the basic algorithms themselves: this in itself proves sufficiently challenging for neural networks, and is itself a useful outcome for the reasons highlighted above. However, we speculate that once a neural network can learn not only individual algorithms but novel combinations of multiple algorithms, or even discover new algorithms, such networks will be useful in a wide variety of problems, from scientific problems such as protein folding and genomics to simulated environments such as those used by reinforcement learning and control, much as classic CS algorithms already make in-roads into these domains but lack the ability to learn from data.

Guided by these observations, we regard CLRS-30 as a first step towards a pragmatic setting to test many of these different aspects of current architectures. While we do not directly target all of the scenarios outlined above, the benchmark was built with ease of expansion in mind, enabling extensive tweaking of training/testing setups and of the kinds of information captured in algorithm trajectories, as well as including additional algorithms, which we aim to do consistently over time.

## 3. CLRS Algorithmic Reasoning Benchmark

Owing to its name, CLRS-30 consists only of algorithms which may be encountered in the CLRS textbook (Cormen et al., 2009). Further, all algorithm trajectories and relevant variables have been designed to match the pseudocode in the textbook as closely as possible. We begin by describing the selection criteria we applied when determining which algorithms to include in CLRS-30. Our initial survey of the textbook yielded 94 algorithms and data structures of interest.
From this point, we set out to filter this set to algorithms suitable for inclusion in the initial version of our benchmark. The criteria we applied, with justification and remarks, are as follows:

- We want to be able to reliably generate ground-truth outputs for large inputs. As such, NP-hard tasks (and approximation algorithms thereof) have been excluded. Our decision is backed up by theoretical work suggesting the impossibility of accurately modelling NP-hard problems using polynomial-time samplers, unless NP = co-NP (Yehuda et al., 2020).
- Tasks requiring numerical outputs have been excluded. Evaluating their performance is ambiguous, and may be dependent on the way architectures choose to represent numbers. For example, Yan et al. (2020) (which represents numbers in binary) and Veličković et al. (2019) (which represents them in floating point) report different metrics on predicting shortest-path lengths. This excludes most number-theoretic algorithms, linear programming, and max-flow.[^3] It does not exclude shortest-path algorithms: we can treat them as tasks of finding the edges belonging to the shortest path, as was done in Veličković et al. (2019); Tang et al. (2020). The numerical values of path lengths are then treated as intermediate parts of the trajectory, and not directly evaluated on.
- Standalone data structures do not directly represent a task.[^4] Rather, their target is appropriately updating the internal state of the data structure. Hence, we don't include their operations, unless they appear as components of algorithms. We, of course, look forward to including them in subsequent versions of the dataset, as they can provide useful building blocks for learning complex algorithms.
- Lastly, there are representational issues associated with dynamically allocated memory: it may be unclear what is the best way to represent the internal memory storage and its usage in algorithm trajectories. One example of this ambiguity is in asking whether the algorithm executor should start with a scratch space, defined by the space complexity of the problem, that gets filled up, or dynamically generate such space[^5] (Strathmann et al., 2021). As such, we for now exclude all algorithms that require allocating memory which cannot be directly attached to the set of objects provided at input time. This excludes algorithms like merge sort, Hierholzer's algorithm for finding Euler tours (Hierholzer & Wiener, 1873), or string matching using finite automata.

[^3]: It should be noted that, by the max-flow min-cut theorem (Ford Jr & Fulkerson, 2015), any max-flow problem can be cast as finding the minimum cut containing the source vertex. This is a discrete decision problem over input vertices, which hence doesn't violate our constraints, and could be included in future iterations.
[^4]: In programming language terms, their algorithms tend to be of the void type.
[^5]: Akin to malloc-like calls in C++.

All of the above applied, we arrive at the 30 algorithms that are selected into CLRS-30, which we categorize as follows:

- Sorting: Insertion sort, bubble sort, heapsort (Williams, 1964), quicksort (Hoare, 1962).
- Searching: Minimum, binary search, quickselect (Hoare, 1961).
- Divide and Conquer (D&C): Maximum subarray (Kadane's variant (Bentley, 1984)).
- Greedy: Activity selection (Gavril, 1972), task scheduling (Lawler, 1985).
- Dynamic Programming: Matrix chain multiplication, longest common subsequence, optimal binary search tree (Aho et al., 1974).
- Graphs: Depth-first and breadth-first search (Moore, 1959), topological sorting (Knuth, 1973), articulation points, bridges, Kosaraju's strongly-connected components algorithm (Aho et al., 1974), Kruskal's and Prim's algorithms for minimum spanning trees (Kruskal, 1956; Prim, 1957), Bellman-Ford and Dijkstra's algorithms for single-source shortest paths (Bellman, 1958; Dijkstra et al., 1959) (plus the directed acyclic graph version), and the Floyd-Warshall algorithm for all-pairs shortest paths (Floyd, 1962).
- Strings: Naïve string matching, Knuth-Morris-Pratt (KMP) string matcher (Knuth et al., 1977).
- Geometry: Segment intersection; convex hull algorithms: Graham scan (Graham, 1972) and Jarvis' march (Jarvis, 1973).

The chosen algorithms span a wide variety of reasoning procedures, and hence can serve as a good basis for algorithmic reasoning evaluation, as well as extrapolation to more challenging problems.

### 3.1. Implementation, probes and representation

We have implemented the selected 30 algorithms in an idiomatic way, which aligns as closely as possible to the original pseudocode from Cormen et al. (2009). This allows us to automatically generate input/output pairs for all of them, enabling full control over the input data distribution, so long as it conforms to the preconditions of the algorithm. Further, we capture the intermediate algorithm trajectory in the form of hints (detailed in Section 3.2), which allow insight into the inner workings of the algorithm. Such trajectories have already been extensively used in related work (Veličković et al., 2019; 2020; Georgiev & Liò, 2020; Deac et al., 2020) and are typically crucial for OOD generalisation.

In the most generic sense, algorithms can be seen as manipulating sets of objects, along with any relations between them (which can themselves be decomposed into binary relations). If the sets are (partially) ordered (e.g. arrays or rooted trees), this can be imposed by including predecessor links. Therefore, algorithms generally operate over graphs. Motivated by existing theoretical results showing that graph neural networks align well with dynamic programming-style computations (Xu et al., 2019; Dudzik & Veličković, 2022), we propose a graph-oriented way to encode the data.

Generally, our data is represented as a set of n vertices,[^6] where n is a hyperparameter that is provided as part of the dataset generation process. When the semantics of these nodes are not immediately clear from the task (e.g. graph algorithms naturally operate over a graph of n nodes), we make an appropriate modification to derive nodes. For example, in sorting algorithms, we treat every input list element as a separate node, and in string matching, we treat each character of the two input strings as a separate node.

[^6]: Edges are only present to represent the predecessor vertex if the input is partially ordered.
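To make this node-centric encoding concrete, the following minimal sketch shows how an array to be sorted could be recast as a set of nodes with features. It is illustrative only: the feature names ("pos", "key") and the helper function are assumptions for exposition, not the released library's API.

```python
# Illustrative sketch only: feature names ("pos", "key") and the helper below are
# assumptions for exposition, not the benchmark's actual API.
import numpy as np

def array_to_nodes(values):
    """Recast an input array as n nodes, each carrying its value and a position scalar."""
    n = len(values)
    return {
        "pos": np.linspace(0, 1, n),       # uniquely indexes each node in [0, 1]
        "key": np.asarray(values, float),  # the list element stored at that node
    }

nodes = array_to_nodes([5, 2, 4, 3, 1])
# For string matching, one node per character of the two strings would be derived
# analogously, with an additional feature marking which string each node belongs to.
```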
All information over these graphs falls under the following categorisation:

- Stage: Every feature, i.e. observation in the trajectory, is either part of the input, the output, or the hints. As we do not cover algorithms that perform on-line querying, for all 30 algorithms there will be exactly one snapshot of the input and output values, whereas hints will be a time-series of intermediate algorithm states.
- Location: Every feature is either present within the nodes, the edges (pairs of nodes) or the graph.[^7]
- Type: Every feature can be of five possible types, which determine the appropriate method for encoding/decoding it, and the appropriate loss function to use when learning to predict it:
  - scalar: floating-point scalar feature.[^8] This would typically be fit using mean squared error.
  - categorical: categorical feature over K possible classes. This type typically corresponds to a cross-entropy loss over the classes.
  - mask: categorical feature over two classes. This can be fit using binary cross-entropy.
  - mask_one: categorical feature over two classes, where exactly one node is active ("one-hot"). One would generally optimise this argmax operation using categorical cross-entropy.
  - pointer: categorical feature over the n nodes, predicted by computing a similarity score against every node, and typically optimised using categorical cross-entropy (as introduced in Pointer Graph Networks (PGN) (Veličković et al., 2020)).

[^7]: This also determines the shape of each feature: node features are of shape n × f, edge features of shape n × n × f, and graph features of shape f, where f is the dimension of the feature (excluding the batch axis).
[^8]: Given our current restriction on numerical predictions, scalar types will never be given in the output stage.

Specifying a feature's stage, location and type fully determines its role in the dataflow. A tuple (stage, loc, type, values) is referred to as a probe. Each of the 30 algorithms has a static (w.r.t. stage, location and type) set of probes, which are considered to be a spec for the algorithm. We will later describe how these specs may be used to construct baseline architectures for the benchmark.

Every node is always endowed with a position scalar input probe, which uniquely indexes it: the values are linearly spaced between 0 and 1 along the node index. This allows not only representing the data sequentially (when this is appropriate), but also serves as a useful tie-breaker when algorithms could make an arbitrary choice on which node to explore next; we force the algorithms to favour nodes with smaller position values. To illustrate these concepts further, at the end of this section we will describe the probes in detail for a popular algorithm (insertion sort).

Note that, while we format the data in a way that clearly favours graph neural network executors, it can be easily adapted for different types of neural architectures, for example sequence-to-sequence models (Sutskever et al., 2014).

Overall, CLRS-30 requires about one hour to generate, and occupies 4.5 GB when uncompressed, across all 30 tasks.
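As a concrete illustration of a spec, here is a minimal sketch for insertion sort. The probe names and the exact set of hints shown are assumptions for exposition and may not match the identifiers used in the released library.

```python
# Illustrative sketch of a spec: probe name -> (stage, location, type).
# The names below ("key", "pred_h", ...) are assumptions for exposition,
# not necessarily the identifiers used in the released CLRS library.
INSERTION_SORT_SPEC = {
    "pos":    ("input",  "node", "scalar"),    # position tie-breaker in [0, 1]
    "key":    ("input",  "node", "scalar"),    # the values to be sorted
    "pred":   ("output", "node", "pointer"),   # predecessor pointers of the sorted list
    "pred_h": ("hint",   "node", "pointer"),   # predecessor pointers at every step
    "i":      ("hint",   "node", "mask_one"),  # the insertion slot at every step
    "j":      ("hint",   "node", "mask_one"),  # the iterator at every step
}

# The type alone suggests which loss one would typically fit the probe with.
TYPICAL_LOSS = {
    "scalar": "mean squared error",
    "categorical": "categorical cross-entropy",
    "mask": "binary cross-entropy",
    "mask_one": "categorical cross-entropy over nodes",
    "pointer": "categorical cross-entropy over nodes",
}

for name, (stage, loc, typ) in INSERTION_SORT_SPEC.items():
    print(f"{name}: {stage}/{loc}/{typ} -> {TYPICAL_LOSS[typ]}")
```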
### 3.2. Hints

Hints are an important component of our benchmark, which we find fundamental in order to make progress on algorithmic reasoning. As we previously argued, the advantage of algorithms as a task is our understanding of their behaviour, and our ability to decompose them into useful subroutines that can be shared or repeatedly applied. While, implicitly, we hope that such a decomposition would happen in any learned system, even when trained just using inputs and outputs (as studied in Xu et al. (2019)), the degree to which we can measure or encourage this is limited in the typical end-to-end learning process, and often most of the generalisation happens only in-distribution (as observed by Veličković et al. (2019); Xu et al. (2020); Bevilacqua et al. (2021)). The underlying algorithm may not be statistically identifiable from a small set of input/output pairs.

Conversely, a perfect decomposition of a task into small subtasks can be generated for algorithmic problems. Then, individual models for each subtask may be trained and recomposed into a solution. Such an approach will, by construction, provide strong decompositional benefits: as studied by Yan et al. (2020), perfect OOD generalisation can be observed with such models, and they can even generalise zero-shot to test algorithms that reuse their modules. However, the downstream applicability of this is potentially limited: when faced with a novel task which cannot be easily decomposed into subtasks, it can be hard to decide how to reuse the learnt modules.

We believe hints to lie in-between these two approaches. On one hand, they represent intermediate targets which the network should be able to predict if it performs reasoning similar[^9] to the ground-truth algorithm it is supposed to mimic. Indeed, several lines of recent work (Veličković et al., 2019; Georgiev & Liò, 2020; Veličković et al., 2020; Deac et al., 2020) make favourable conclusions about using them when it comes to achieving stronger OOD generalisation. Furthermore, models leveraging hints are still end-to-end models; when faced with a novel task at test time, we don't need explicit knowledge of that task's hints in order to re-use the weights learnt on a task which had them.

[^9]: Note that architectures supervised in this way usually don't model the hints perfectly, and will deviate from the target algorithm in subtle ways; Veličković et al. (2020) perform a qualitative study which shows that GPU-specialised data structures could emerge as a result of such setups.

Algorithms specify one way of attacking a problem, which is explicitly detailed through the hints. In this sense, insertion sort (to be presented shortly) is one way of implementing a sorting function: all sorting algorithms model sorting functions, and will hence have identical outputs for identical inputs. The aspects that set the different sorting algorithms apart are exposed through their hints.

Being mindful of the fact that neural networks commonly run on parallelisable architectures, we have made efforts to compress the hints as much as possible. For example, if a single for loop is used to sweep the data and detect the node which optimises a certain quantity (without doing any order-sensitive computations), that for loop can typically be skipped entirely when recording hints, as parallel architectures may typically examine all the nodes at once. Further, we make every effort possible that the hint at step t + 1 will be predictable from the hints at step t using only a single step of message passing.
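To make the distinction between the three stages concrete, the sketch below lays out one hypothetical training trajectory for a 5-node sorting instance and the kind of per-step loss a hint-supervised executor would accumulate. The field names and shapes are illustrative assumptions, not the on-disk format of the released dataset.

```python
# Illustrative layout of a single trajectory (not the dataset's on-disk format):
# inputs and outputs are single snapshots, hints carry a leading time dimension.
import numpy as np

n, T = 5, 5  # number of nodes and number of hint steps; both are illustrative

trajectory = {
    "input": {
        "pos": np.linspace(0, 1, n),             # scalar, located in the nodes
        "key": np.array([5., 2., 4., 3., 1.]),   # scalar, located in the nodes
    },
    "hint": {
        "pred_h": np.zeros((T, n), dtype=int),   # pointer, one snapshot per step
        "i": np.zeros((T, n)),                   # mask_one, one snapshot per step
        "j": np.zeros((T, n)),                   # mask_one, one snapshot per step
    },
    "output": {
        "pred": np.zeros(n, dtype=int),          # pointer, single snapshot
    },
}

def step_cross_entropy(logits, one_hot_target):
    """Categorical cross-entropy for one mask_one/pointer hint step."""
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return -np.sum(one_hot_target * log_probs)

# A hint-supervised executor would be asked to predict the hints at step t + 1
# from its state at step t, accumulating one such loss term per hint and per step.
```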
### 3.3. Worked example: insertion sort

To illustrate all of the concepts outlined above, we observe the trajectories extracted by our data collection procedure on an example: insertion sorting the array [5, 2, 4, 3, 1]. Insertion sort uses one pointer (j) to scan through the array, and another pointer (i) to slot the j-th item into the correct place within [0..j]. This maintains the invariant that, after k steps, the subarray of the first k elements is completely sorted. Hence the trajectory (with i and j marked as subscripts) is:

[5_{i,j}, 2, 4, 3, 1] → [2_i, 5_j, 4, 3, 1] → [2, 4_i, 5_j, 3, 1] → [2, 3_i, 4, 5_j, 1] → [1_i, 2, 3, 4, 5_j]

Here, at each step, j scans along the array, and i indicates the correct place for the element that was j-th at the start of each iteration.

Converting this trajectory into a graph representation requires some consideration. Requiring the model to perform explicit swapping of node values would, ultimately, require numerical predictions. To avoid this, we ask the model to predict the predecessor pointer of each node (by convention, the head of the array points to itself). Hence the actual recorded trajectory can be realised as depicted in Figure 2. In this figure, green pointers correspond to the predecessor pointers, red ones point to j, and blue ones point to i. i and j are realised as type mask_one, whereas predecessors are of type pointer, and all three are stored in the nodes. The red and blue pointers represent the hints for this task.

Figure 2. A sequence of hints for insertion sorting the list [5, 2, 4, 3, 1]. Green pointers correspond to the predecessor pointers (specifying the list's state throughout the algorithm's execution). Note how the head of the list always points to itself, by convention. Further, note how, at every step, the list is rewired such that the node selected by the blue pointer (slot) will point to the current iterator (pointed to in red).

Finally, note that the original insertion sort pseudocode mandates that, at each iteration, i starts at position j and shifts backward until the right position is found. However, this procedure can be performed in one step by a GNN, as it can locate the correct position by examining all relevant positions at once, and we can omit all of those intermediate steps.

In order to further illustrate how these hints are collected, we also provide informal pseudocode for collecting hints for insertion sort in Algorithm 1:

```
Algorithm 1: Hint updates for insertion sort
  Input: input array val, positions pos
  Hints: predecessors pred, iterator iter, swap slot slot

  pred[i] ← 0 if i = 0, else i − 1      // initialise the list
  slot ← 0, iter ← 0
  while iter < n do
      iter ← iter + 1
      max_node ← argmax_j pos[j] …
```
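To complement the pseudocode, here is one possible Python realisation of such a hint-collection loop. It follows the conventions described above and in the Figure 2 caption (predecessor pointers over all nodes, with the slot being the node whose predecessor is rewired to the iterator), but it is an illustrative sketch: the probe names and the exact slot convention in the released library may differ.

```python
# One possible Python realisation of the hint-collection loop for insertion sort.
# Illustrative sketch only: probe names ("pred_h", "iter", "slot") and the exact
# slot convention are assumptions, and may differ from the released library.
import numpy as np

def insertion_sort_hints(val):
    """Record (pred_h, iter, slot) hints while insertion-sorting the values in val.

    Nodes are the input elements. pred_h[k] is the predecessor of node k in the
    current linked list (the head points to itself, by convention). iter marks the
    node being inserted at each step; slot marks the node whose predecessor pointer
    gets rewired to point at iter, matching the Figure 2 description.
    """
    n = len(val)
    order = list(range(n))  # node indices, in current list order
    hints = []

    def one_hot(k):
        mask = np.zeros(n)
        mask[k] = 1.0
        return mask

    def pred_pointers():
        pred = np.zeros(n, dtype=int)
        pred[order[0]] = order[0]  # head points to itself
        for prev, cur in zip(order, order[1:]):
            pred[cur] = prev
        return pred

    hints.append({"pred_h": pred_pointers(), "iter": one_hot(0), "slot": one_hot(0)})
    for t in range(1, n):
        node = order[t]  # the node being inserted (red pointer)
        place = t
        while place > 0 and val[order[place - 1]] > val[node]:
            place -= 1
        order.insert(place, order.pop(t))  # splice the node into its sorted place
        # The slot (blue pointer) is the node now following `node` in the list,
        # i.e. the node whose predecessor pointer was rewired to `node`.
        slot = order[place + 1] if place + 1 < n else node
        hints.append({"pred_h": pred_pointers(),
                      "iter": one_hot(node),
                      "slot": one_hot(slot)})
    return hints

# The final snapshot's predecessor pointers coincide with the desired output pointers.
for step in insertion_sort_hints([5, 2, 4, 3, 1]):
    print(step["pred_h"])
```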
… µ_A(B). If X = {A}, then model A wins on algorithm A. Otherwise, if A ∉ X, then model A loses on algorithm A. Otherwise, model A is tied on algorithm A. The win/tie/loss counts are then aggregated across all algorithms A to obtain a metric for each model. As already mentioned, the details of this on a per-algorithm level are given in Table 3.

## D. Validation results: individual plots

Validation performance for all 30 algorithms in CLRS-30 may be found in Figure 4. For convenience, we also report the early-stopped validation performance in Table 4.

Table 2. Test performance of all models on all algorithms.

| Algorithm | Deep Sets | GAT | Memnet | MPNN | PGN |
|---|---|---|---|---|---|
| Activity Selector | 66.09% ± 1.67 | 73.23% ± 1.37 | 24.10% ± 2.22 | 80.66% ± 3.16 | 66.80% ± 1.62 |
| Articulation Points | 39.06% ± 4.04 | 37.76% ± 1.62 | 1.50% ± 0.61 | 50.91% ± 2.18 | 49.53% ± 2.09 |
| Bellman-Ford | 51.33% ± 0.85 | 87.91% ± 1.19 | 40.04% ± 1.46 | 92.01% ± 0.28 | 92.99% ± 0.34 |
| BFS | 98.63% ± 0.38 | 99.04% ± 0.21 | 43.34% ± 0.04 | 99.89% ± 0.05 | 99.63% ± 0.29 |
| Binary Search | 47.97% ± 0.88 | 23.50% ± 3.12 | 14.37% ± 0.46 | 36.83% ± 0.26 | 76.95% ± 0.13 |
| Bridges | 32.43% ± 2.65 | 25.64% ± 6.60 | 30.26% ± 0.05 | 72.69% ± 4.78 | 51.42% ± 7.82 |
| Bubble Sort | 50.73% ± 3.24 | 9.91% ± 1.77 | 73.58% ± 0.78 | 5.27% ± 0.60 | 6.01% ± 1.95 |
| DAG Shortest Paths | 73.21% ± 2.42 | 81.14% ± 1.37 | 66.15% ± 1.92 | 96.24% ± 0.56 | 96.94% ± 0.16 |
| DFS | 7.44% ± 0.73 | 11.78% ± 2.04 | 13.36% ± 1.61 | 6.54% ± 0.51 | 8.71% ± 0.24 |
| Dijkstra | 36.12% ± 3.10 | 58.01% ± 0.79 | 22.48% ± 2.39 | 91.50% ± 0.50 | 83.45% ± 1.75 |
| Find Max. Subarray | 12.48% ± 0.39 | 24.43% ± 0.43 | 13.05% ± 0.08 | 20.30% ± 0.49 | 65.23% ± 2.56 |
| Floyd-Warshall | 7.22% ± 0.90 | 16.66% ± 3.14 | 14.17% ± 0.13 | 26.74% ± 1.77 | 28.76% ± 0.51 |
| Graham Scan | 64.71% ± 2.75 | 77.89% ± 2.70 | 40.62% ± 2.31 | 91.04% ± 0.31 | 56.87% ± 1.61 |
| Heapsort | 28.94% ± 12.57 | 10.35% ± 1.83 | 68.00% ± 1.57 | 10.94% ± 0.84 | 5.27% ± 0.18 |
| Insertion Sort | 40.98% ± 4.65 | 29.52% ± 1.87 | 71.42% ± 0.86 | 19.81% ± 2.08 | 44.37% ± 2.43 |
| Jarvis March | 50.25% ± 0.81 | 51.51% ± 10.25 | 22.99% ± 3.87 | 34.86% ± 12.39 | 49.19% ± 1.07 |
| KMP Matcher | 3.22% ± 0.54 | 3.03% ± 0.36 | 1.81% ± 0.00 | 2.49% ± 0.86 | 2.00% ± 0.12 |
| LCS Length | 50.10% ± 5.25 | 57.88% ± 1.02 | 49.84% ± 4.34 | 53.23% ± 0.36 | 56.82% ± 0.21 |
| Matrix Chain Order | 78.36% ± 3.58 | 78.19% ± 3.31 | 81.96% ± 1.03 | 79.84% ± 1.40 | 83.91% ± 0.49 |
| Minimum | 80.19% ± 2.08 | 84.20% ± 2.95 | 86.93% ± 0.11 | 85.34% ± 0.88 | 87.71% ± 0.52 |
| MST-Kruskal | 60.58% ± 4.71 | 65.72% ± 0.99 | 28.84% ± 0.61 | 70.97% ± 1.50 | 66.96% ± 1.36 |
| MST-Prim | 12.17% ± 5.47 | 38.20% ± 4.34 | 10.29% ± 3.77 | 69.08% ± 7.56 | 63.33% ± 0.98 |
| Naïve String Match | 2.05% ± 0.29 | 3.01% ± 1.20 | 1.22% ± 0.48 | 3.92% ± 0.30 | 2.08% ± 0.20 |
| Optimal BST | 69.71% ± 1.36 | 65.49% ± 1.75 | 72.03% ± 1.21 | 62.23% ± 0.44 | 71.01% ± 1.82 |
| Quickselect | 3.21% ± 1.33 | 4.36% ± 0.95 | 1.74% ± 0.03 | 1.43% ± 0.69 | 3.66% ± 0.42 |
| Quicksort | 37.74% ± 2.16 | 7.60% ± 0.98 | 73.10% ± 0.67 | 11.30% ± 0.10 | 6.17% ± 0.15 |
| Segments Intersect | 77.29% ± 0.60 | 90.41% ± 0.04 | 71.81% ± 0.90 | 93.44% ± 0.10 | 77.51% ± 0.75 |
| SCC | 17.81% ± 2.61 | 12.70% ± 3.12 | 16.32% ± 4.78 | 24.37% ± 4.88 | 20.80% ± 0.64 |
| Task Scheduling | 84.84% ± 0.70 | 84.69% ± 2.09 | 82.74% ± 0.04 | 84.11% ± 0.32 | 84.89% ± 0.91 |
| Topological Sort | 15.84% ± 3.57 | 27.03% ± 6.92 | 2.73% ± 0.11 | 52.60% ± 6.24 | 60.45% ± 2.69 |
| Overall average | 43.36% | 44.69% | 38.03% | 51.02% | 52.31% |

Figure 4. Validation results on all 30 algorithms in CLRS-30, averaged over three seeds.

Table 3. Win/tie/loss counts of all models on all algorithms. Legend: W: win, T: tie, L: loss.

| Algorithm | Deep Sets | GAT | Memnet | MPNN | PGN |
|---|---|---|---|---|---|
| Activity Selector | L | L | L | W | L |
| Articulation Points | L | L | L | T | T |
| Bellman-Ford | L | L | L | L | W |
| BFS | L | L | L | W | L |
| Binary Search | L | L | L | L | W |
| Bridges | L | L | L | W | L |
| Bubble Sort | L | L | W | L | L |
| DAG Shortest Paths | L | L | L | L | W |
| DFS | L | T | T | L | L |
| Dijkstra | L | L | L | W | L |
| Find Max. Subarray | L | L | L | L | W |
| Floyd-Warshall | L | L | L | L | W |
| Graham Scan | L | L | L | W | L |
| Heapsort | L | L | W | L | L |
| Insertion Sort | L | L | W | L | L |
| Jarvis March | T | T | L | L | L |
| KMP Matcher | T | T | L | L | L |
| LCS Length | L | W | L | L | L |
| Matrix Chain Order | L | L | L | L | W |
| Minimum | L | L | L | L | W |
| MST-Kruskal | L | L | L | W | L |
| MST-Prim | L | L | L | T | T |
| Naïve String Match | L | L | L | W | L |
| Optimal BST | L | L | T | L | T |
| Quickselect | L | T | L | L | T |
| Quicksort | L | L | W | L | L |
| Segments Intersect | L | L | L | W | L |
| SCC | L | L | L | T | T |
| Task Scheduling | T | T | L | L | T |
| Topological Sort | L | L | L | L | W |
| Overall counts | 0/3/27 | 1/5/24 | 4/2/24 | 8/3/19 | 8/6/16 |

Table 4. Early-stopped validation results of all models on all algorithms.

| Algorithm | Deep Sets | GAT | Memnet | MPNN | PGN |
|---|---|---|---|---|---|
| Activity Selector | 83.50% ± 0.17 | 92.40% ± 0.50 | 34.59% ± 2.15 | 93.89% ± 0.39 | 82.26% ± 0.19 |
| Articulation Points | 99.63% ± 0.31 | 100.00% ± 0.00 | 16.84% ± 1.03 | 100.00% ± 0.00 | 100.00% ± 0.00 |
| Bellman-Ford | 81.12% ± 0.14 | 99.28% ± 0.14 | 68.75% ± 0.42 | 99.48% ± 0.05 | 99.35% ± 0.05 |
| BFS | 100.00% ± 0.00 | 100.00% ± 0.00 | 70.70% ± 0.09 | 100.00% ± 0.00 | 100.00% ± 0.00 |
| Binary Search | 93.34% ± 0.41 | 95.72% ± 0.17 | 20.33% ± 0.28 | 94.19% ± 0.12 | 94.17% ± 0.08 |
| Bridges | 99.36% ± 0.05 | 100.00% ± 0.00 | 96.46% ± 1.13 | 100.00% ± 0.00 | 100.00% ± 0.00 |
| Bubble Sort | 81.51% ± 1.02 | 95.44% ± 1.01 | 92.64% ± 0.14 | 94.53% ± 1.84 | 87.17% ± 5.46 |
| DAG Shortest Paths | 92.25% ± 0.28 | 96.81% ± 0.05 | 81.90% ± 0.05 | 99.93% ± 0.05 | 99.80% ± 0.00 |
| DFS | 62.76% ± 1.26 | 99.22% ± 0.64 | 47.72% ± 0.45 | 100.00% ± 0.00 | 100.00% ± 0.00 |
| Dijkstra | 80.34% ± 0.42 | 99.22% ± 0.40 | 67.38% ± 0.70 | 99.67% ± 0.14 | 99.28% ± 0.05 |
| Find Max. Subarray | 91.41% ± 0.22 | 95.00% ± 0.32 | 27.91% ± 0.08 | 95.13% ± 0.37 | 95.30% ± 0.16 |
| Floyd-Warshall | 35.79% ± 0.04 | 87.28% ± 0.09 | 31.29% ± 0.04 | 89.14% ± 0.03 | 88.70% ± 0.15 |
| Graham Scan | 87.66% ± 0.24 | 97.85% ± 0.11 | 53.53% ± 1.58 | 98.45% ± 0.15 | 89.06% ± 0.27 |
| Heapsort | 81.84% ± 0.33 | 87.24% ± 2.23 | 54.04% ± 0.28 | 94.27% ± 0.11 | 90.36% ± 0.67 |
| Insertion Sort | 89.58% ± 0.28 | 95.18% ± 0.58 | 94.40% ± 0.14 | 96.74% ± 0.19 | 84.57% ± 0.82 |
| Jarvis March | 72.82% ± 0.42 | 98.38% ± 0.16 | 37.92% ± 6.61 | 97.94% ± 0.25 | 88.34% ± 0.36 |
| KMP Matcher | 98.03% ± 0.21 | 99.76% ± 0.08 | 9.67% ± 0.00 | 99.87% ± 0.05 | 94.14% ± 0.99 |
| LCS Length | 69.24% ± 0.36 | 77.00% ± 0.19 | 67.69% ± 0.24 | 77.88% ± 0.42 | 69.19% ± 0.04 |
| Matrix Chain Order | 94.46% ± 0.02 | 99.37% ± 0.03 | 93.91% ± 0.10 | 99.12% ± 0.04 | 99.21% ± 0.03 |
| Minimum | 97.59% ± 0.11 | 97.74% ± 0.21 | 95.56% ± 0.10 | 97.64% ± 0.05 | 97.07% ± 0.14 |
| MST-Kruskal | 83.79% ± 2.01 | 97.93% ± 0.25 | 64.65% ± 0.95 | 99.71% ± 0.17 | 99.12% ± 0.08 |
| MST-Prim | 74.61% ± 0.32 | 98.37% ± 0.14 | 74.09% ± 0.28 | 99.02% ± 0.09 | 97.79% ± 0.14 |
| Naïve String Match | 49.80% ± 0.15 | 100.00% ± 0.00 | 9.91% ± 0.20 | 100.00% ± 0.00 | 50.33% ± 0.08 |
| Optimal BST | 92.02% ± 0.14 | 93.30% ± 0.49 | 90.86% ± 0.40 | 93.88% ± 0.11 | 93.20% ± 0.27 |
| Quickselect | 42.30% ± 0.92 | 83.82% ± 1.86 | 6.56% ± 0.25 | 88.74% ± 0.78 | 54.02% ± 0.17 |
| Quicksort | 79.69% ± 1.12 | 92.97% ± 0.40 | 93.16% ± 0.24 | 95.70% ± 0.40 | 54.30% ± 1.42 |
| Segments Intersect | 77.49% ± 0.12 | 90.82% ± 0.16 | 71.57% ± 1.08 | 93.84% ± 0.20 | 78.32% ± 0.18 |
| SCC | 89.52% ± 1.23 | 100.00% ± 0.00 | 70.57% ± 1.43 | 100.00% ± 0.00 | 99.93% ± 0.05 |
| Task Scheduling | 99.16% ± 0.04 | 99.80% ± 0.04 | 84.80% ± 0.09 | 100.00% ± 0.00 | 99.06% ± 0.08 |
| Topological Sort | 47.23% ± 0.81 | 100.00% ± 0.00 | 8.30% ± 0.50 | 100.00% ± 0.00 | 100.00% ± 0.00 |
| Overall average | 80.93% | 95.66% | 57.92% | 96.63% | 89.47% |