# Synthesizing Tasks for Block-based Programming

Umair Z. Ahmed¹, Maria Christakis², Aleksandr Efremov², Nigel Fernandez², Ahana Ghosh², Abhik Roychoudhury¹, Adish Singla²
¹National University of Singapore, {umair, abhik}@comp.nus.edu.sg; ²MPI-SWS, {maria, aefremov, nfernand, gahana, adishs}@mpi-sws.org

Abstract

Block-based visual programming environments play a critical role in introducing computing concepts to K-12 students. One of the key pedagogical challenges in these environments is in designing new practice tasks for a student that match a desired level of difficulty and exercise specific programming concepts. In this paper, we formalize the problem of synthesizing visual programming tasks. In particular, given a reference visual task T^in and its solution code C^in, we propose a novel methodology to automatically generate a set {(T^out, C^out)} of new tasks along with solution codes such that tasks T^in and T^out are conceptually similar but visually dissimilar. Our methodology is based on the realization that the mapping from the space of visual tasks to their solution codes is highly discontinuous; hence, directly mutating reference task T^in to generate new tasks is futile. Our task synthesis algorithm operates by first mutating code C^in to obtain a set of codes {C^out}. Then, the algorithm performs symbolic execution over a code C^out to obtain a visual task T^out; this step uses the Monte Carlo Tree Search (MCTS) procedure to guide the search in the symbolic tree. We demonstrate the effectiveness of our algorithm through an extensive empirical evaluation and user study on reference tasks taken from the Hour of Code: Classic Maze challenge by Code.org and the Intro to Programming with Karel course by CodeHS.com.

1 Introduction

Block-based visual programming environments are increasingly used nowadays to introduce computing concepts to novice programmers, including children and K-12 students. Led by the success of environments like Scratch [29], initiatives like the Hour of Code by Code.org [24] (HOC), and online platforms like CodeHS.com [21], block-based programming has become an integral part of introductory computer science education. Considering HOC alone, over one billion hours of block-based programming activity has been performed so far by over 50 million unique students worldwide [24, 35].

The societal need for enhancing K-12 computing education has led to a surge of interest in developing AI-driven systems for pedagogy of block-based programming [33, 26, 27, 34, 16]. Existing works have studied various aspects of intelligent support, including providing real-time next-step hints when a student is stuck solving a task [20, 36, 18, 17, 9], giving data-driven feedback about a student's misconceptions [31, 19, 28, 30, 35], and demonstrating a worked-out solution for a task when a student lacks the required programming concepts [37]. An underlying assumption when providing such intelligent support is that the student can afterwards practice new, similar tasks to finally learn the missing concepts. However, this assumption is far from reality in existing systems: the programming tasks are typically hand-curated by experts/tutors, and the available set of tasks is limited.
Consider HOC's Classic Maze challenge [23], which provides a progression of 20 tasks: millions of students have attempted these tasks, yet when students fail to solve a task and receive assistance, they cannot practice similar tasks, hindering their ability to master the desired concepts. We seek to tackle this pedagogical challenge by developing techniques for synthesizing new programming tasks.

Authors listed alphabetically. Correspondence to: Ahana Ghosh.

Figure 1: Illustration of our methodology for task Maze 16 from the Hour of Code: Classic Maze challenge by Code.org [23]; the complete list of tasks with their specifications is in Fig. 6. (a) Visual puzzle for T^in. (b) Solution code C^in: def Run(){ RepeatUntil(goal){ move; If(pathLeft){ turnLeft } } }. (c) Visual puzzle for T^out. (d) Solution code C^out: def Run(){ move; turnLeft; RepeatUntil(goal){ move; If(pathRight){ turnRight } } }.

Figure 2: Illustration of our methodology for task Diagonal from the Intro to Programming with Karel course by CodeHS.com [22]; the complete list of tasks with their specifications is in Fig. 6. (a) Visual puzzle for T^in. (b) Solution code C^in: def Run(){ putMarker; While(pathAhead){ move; turnLeft; move; turnRight; putMarker } }. (c) Visual puzzle for T^out. (d) Solution code C^out: def Run(){ putMarker; While(pathAhead){ move; move; turnRight; move; turnLeft; putMarker } }.

We formalize the problem of synthesizing visual programming tasks of the kind found in popular learning platforms like Code.org (see Fig. 1) and CodeHS.com (see Fig. 2). As input, we are given a reference task T^in, specified as a visual puzzle, and its solution code C^in. Our goal is to synthesize a set {(T^out, C^out)} of new tasks along with their solution codes that are conceptually similar but visually dissimilar to the input. This is motivated by the need for practice tasks that, on the one hand, exercise the same concepts while, on the other hand, looking fresh in order to maintain student engagement.

When tackling the problem of synthesizing new tasks with the above desirable properties, three key challenges emerge. First, we are generating problems in a conceptual domain with no well-defined procedure that students follow to solve a task; consequently, existing work on educational problem generation in procedural domains does not apply in our setting [3, 11]. Second, the mapping from the space of visual tasks to their solution codes is highly discontinuous; hence, template-based problem generation techniques [32, 25] that rely on directly mutating the input to generate new tasks are ineffective (see Section 5, where we use this approach as a baseline). Furthermore, such a direct task-mutation approach would require access to an automated solution synthesizer; however, state-of-the-art program synthesis techniques are not yet on par with experts and their minimal solutions [5, 8, 6]. Third, the space of possible tasks and their solutions is potentially unbounded, and thus, any problem generation technique that relies on exhaustive enumeration is intractable [32, 1, 2].

To overcome these challenges, we propose a novel methodology that operates by first mutating the solution code C^in to obtain a set of codes {C^out}, and then performing symbolic execution over a code C^out to obtain a visual puzzle T^out. Mutation is made efficient by creating an abstract representation of C^in along with appropriate constraints and querying an SMT solver [4]; any solution to this query is a mutated code C^out. During symbolic execution, we use Monte Carlo Tree Search (MCTS) to guide the search over the (unbounded) symbolic execution tree.
We demonstrate the effectiveness of our methodology by performing an extensive empirical evaluation and user study on a set of reference tasks from the Hour of Code challenge by Code.org and the Intro to Programming with Karel course by CodeHS.com. In summary, our main contributions are:

- We formalize the problem of synthesizing block-based visual programming tasks (Section 2).
- We present a novel approach for generating new visual tasks along with solution codes such that they are conceptually similar but visually dissimilar to a given reference task (Section 3).
- We demonstrate the effectiveness of our approach through an extensive empirical evaluation and user study on reference tasks from real-world programming platforms (Section 4 and Section 5).

2 Problem Formulation

The space of tasks. We define a task as a tuple T := (T_vis, T_store, T_size), where T_vis denotes the visual puzzle, T_store the available block types, and T_size the maximum number of blocks allowed in the solution code. For instance, considering the task T := T^in in Fig. 1a, T_vis is illustrated in Fig. 1a, T_store = {move, turnL, turnR, RepeatUntil, If}, and T_size = 4.

The space of codes. The programming environment has a domain-specific language (DSL), which defines the set of valid codes C and is shown in Fig. 4a. A code C ∈ C is characterized by several properties, such as the set C_blocks of block types in C, the number of blocks C_size, the depth C_depth of the corresponding Abstract Syntax Tree (AST), and the nesting structure C_struct representing the programming concepts exercised by C. For instance, considering the code C := C^in in Fig. 1b, C_blocks = {move, turnL, RepeatUntil, If}, C_size = 4, C_depth = 3, and C_struct = {Run{RepeatUntil{If}}}. Below, we introduce two useful definitions relating the task and code space.

Definition 1 (Solution code). C is a solution code for T if the following holds: C successfully solves the visual puzzle T_vis, C_blocks ⊆ T_store, and C_size ≤ T_size. C_T denotes the set of all solution codes for T.

Definition 2 (Minimality of a task). Given a solvable task T with |C_T| ≥ 1 and a threshold δ ∈ ℕ, the task is minimal if ∄C ∈ C_T such that C_size < T_size − δ.

Next, we introduce two definitions formalizing the notion of conceptual similarity. Definition 3 formalizes conceptual similarity of a task T along with one solution code C. Since a task can have multiple solution codes, Definition 4 provides a stricter notion of conceptual similarity of a task T for all its solution codes. These definitions are used in our objective of task synthesis in conditions (I) and (V) below.

Definition 3 (Conceptual similarity of (T, C)). Given a reference (T^in, C^in) and a threshold δ ∈ ℕ, a task T along with a solution code C is conceptually similar to (T^in, C^in) if the following holds: T_store = T^in_store, |T_size − T^in_size| ≤ δ, and C_struct = C^in_struct.

Definition 4 (Conceptual similarity of (T, ·)). Given a reference (T^in, C^in) and a threshold δ ∈ ℕ, a task T is conceptually similar to (T^in, C^in) if the following holds: T_store = T^in_store, |T_size − T^in_size| ≤ δ, and ∀C ∈ C_T, C_struct = C^in_struct.
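To make these code-space properties concrete, the following is a small, self-contained Python sketch (our own illustration, not the paper's implementation) that computes C_blocks, C_size, C_depth, and C_struct for the code of Fig. 1b. The nested-tuple representation and the convention that only control-flow blocks contribute to the depth are assumptions chosen so that the example values quoted above are reproduced.

```python
# Illustrative only: a tiny nested-tuple representation of the code in Fig. 1b,
#   C^in = Run{ RepeatUntil(goal){ move; If(pathLeft){ turnLeft } } }
# Each block is (name, children); basic actions have no children.
CODE_IN = ("Run", [
    ("RepeatUntil", [
        ("move", []),
        ("If", [("turnLeft", [])]),
    ]),
])

CONSTRUCTS = {"Run", "Repeat", "RepeatUntil", "While", "If", "IfElse"}

def blocks(code):
    """C_blocks: the set of block types used in the code (excluding Run)."""
    name, children = code
    out = set() if name == "Run" else {name}
    for child in children:
        out |= blocks(child)
    return out

def size(code):
    """C_size: the number of blocks, excluding the Run block itself."""
    name, children = code
    return (0 if name == "Run" else 1) + sum(size(c) for c in children)

def depth(code):
    """C_depth: nesting depth counting Run and control-flow constructs only
    (one way of counting that reproduces C_depth = 3 for this example)."""
    name, children = code
    return (1 if name in CONSTRUCTS else 0) + max((depth(c) for c in children), default=0)

def struct(code):
    """C_struct: the nesting structure of control-flow constructs."""
    name, children = code
    inner = "".join(struct(c) for c in children)
    if name in CONSTRUCTS:
        return name + ("{" + inner + "}" if inner else "")
    return ""  # basic actions do not contribute to the structure

print(blocks(CODE_IN))  # {'move', 'turnLeft', 'RepeatUntil', 'If'} (order may vary)
print(size(CODE_IN))    # 4
print(depth(CODE_IN))   # 3
print(struct(CODE_IN))  # Run{RepeatUntil{If}}
```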
Environment domain knowledge. We now formalize our domain knowledge about the block-based environment in order to measure the visual dissimilarity of two tasks and to capture some notion of interestingness and quality of a task. Given tasks T and T′, we measure their visual dissimilarity by an environment-specific function F_diss(T_vis, T′_vis) ∈ [0, 1]. Moreover, we measure the generic quality of a task with a function F_qual(T_vis, C) ∈ [0, 1]. We provide specific instantiations of F_diss and F_qual in our evaluation.

Objective of task synthesis. Given a reference task T^in and a solution code C^in ∈ C_{T^in} as input, we seek to generate a set {(T^out, C^out)} of new tasks along with solution codes that are conceptually similar but visually dissimilar to the input. Formally, given parameters (δ_size, δ_diss, δ_qual), our objective is to synthesize new tasks meeting the following conditions:

(I) (T^out, C^out) is conceptually similar to (T^in, C^in) with threshold δ_size in Definition 3.
(II) T^out is visually dissimilar to T^in with margin δ_diss, i.e., F_diss(T^in_vis, T^out_vis) ≥ δ_diss.
(III) T^out has a quality score above threshold δ_qual, i.e., F_qual(T^out_vis, C^out) ≥ δ_qual.

In addition, depending on the use case, it is desirable that the new tasks satisfy the following criteria:

(IV) C^out is different from the input solution code, i.e., C^out ≠ C^in.
(V) T^out is conceptually similar to (T^in, C^in) with threshold δ_size in Definition 4.
(VI) T^out is minimal as per Definition 2 for a desired value of δ_mini (e.g., δ_mini = 0 or δ_mini = 1).
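To make conditions (I)-(III) concrete, here is a minimal Python sketch of the resulting acceptance check. It is our own illustration: the dictionary-based task/code representation is assumed, and the arguments f_diss and f_qual stand in for the environment-specific functions instantiated in Section 4.

```python
# Illustrative acceptance check for conditions (I)-(III); not the paper's code.
def accept(task_out, code_out, task_in, code_in,
           f_diss, f_qual, delta_size=2, delta_diss=0.33, delta_qual=0.2):
    """task_*: dicts with keys 'vis', 'store', 'size'; code_*: dicts with key 'struct'."""
    # (I) Conceptual similarity (Definition 3): same block store, similar size,
    #     and identical nesting structure of the solution codes.
    similar = (task_out["store"] == task_in["store"]
               and abs(task_out["size"] - task_in["size"]) <= delta_size
               and code_out["struct"] == code_in["struct"])
    # (II) Visual dissimilarity of at least delta_diss.
    dissimilar = f_diss(task_out["vis"], task_in["vis"]) >= delta_diss
    # (III) Quality score of at least delta_qual.
    good_enough = f_qual(task_out["vis"], code_out) >= delta_qual
    return similar and dissimilar and good_enough

# Example usage with toy stand-ins for F_diss and F_qual:
ok = accept(
    {"vis": "grid-B", "store": {"move", "turnL"}, "size": 5}, {"struct": "Run{Repeat}"},
    {"vis": "grid-A", "store": {"move", "turnL"}, "size": 4}, {"struct": "Run{Repeat}"},
    f_diss=lambda a, b: 0.5, f_qual=lambda vis, c: 0.4)
print(ok)  # True
```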
3 Our Task Synthesis Algorithm

Figure 3: Stages in our task synthesis algorithm: (a) input task T^in and code C^in; (b) sketch and constraints; (c) output task T^out and code C^out.

We now present the pipeline of our algorithm (see Fig. 3), which takes as input a reference task T^in and its solution code C^in, and generates a set {(T^out, C^out)} of new tasks with their solution codes. The goal is for this set to be conceptually similar to (T^in, C^in), but for the new tasks {T^out} to be visually dissimilar to T^in. This is achieved by two main stages: (1) mutation of C^in to obtain a set {C^out}, and (2) symbolic execution of each C^out to create a task T^out. The first stage, presented in Section 3.1, converts C^in into an abstract representation restricted by a set of constraints (Fig. 3(a)), which must be satisfied by any generated C^out (Fig. 3(b)). The second stage, described in Section 3.2, applies symbolic execution on each code C^out to create a corresponding visual task T^out (Fig. 3(c)), while using Monte Carlo Tree Search (MCTS) to guide the search in the symbolic execution tree.

Figure 4: Illustration of the key steps in code mutation. Fig. 4d shows code C^in from Fig. 1b. The code mutation stage, when applied to C^in, generates many output codes, including C^out in Fig. 1d.

(a) Code DSL:
code C := def Run() do y
rule y := s | g | s; g
rule s := a | s; s | If (b) do s | If (b) do s Else s | While (b) do s | Repeat (x) do s
rule g := RepeatUntil (goal) do s
action a := move | turnL | turnR | putM | pickM
bool b := pathA | noPathA | pathL | noPathL | pathR | noPathR | marker | noMarker
iter x := 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10

(b) Sketch DSL:
sketch Q := def Run() do Y
rule Y := S | G | S; G
rule S := A | S; S | If (B) do S | If (B) do S Else S | While (B) do S | Repeat (X) do S
rule G := RepeatUntil (goal) do S
Comments: A may be φ (empty) or take values of action a; a sequence variable A denotes A1, ..., An.

(c) Types of sketch constraints. Input: code C, sketch Q = Ω(C), map ω(· | C), δ_size, δ_iter.
(0) The size of the generated code may be at most C_size + δ_size.
(1) Edit action sequences: ACTIONEDITS({A ∈ Q}, ω(· | C)).
(2) For each X ∈ Q: |X − ω(X | C)| ≤ δ_iter.
(3) Constraints induced by the structure {A_before; Repeat{A}; A_after}: (i) A is not a suffix of A_before; (ii) A is not a prefix of A_after.
(4) For each B ∈ Q: (i) ω(B | C) ∈ {pathA, noPathA} ⇒ B ∈ {pathA, noPathA}; (ii) ω(B | C) ∈ {pathL, noPathL, pathR, noPathR} ⇒ B ∈ {pathL, noPathL, pathR, noPathR}; (iii) ω(B | C) ∈ {marker, noMarker} ⇒ B ∈ {marker, noMarker}.
(5) Constraints induced on an action sequence A nested inside a conditional B.
(6) For each A ∈ Q, constraints ensuring minimality of A.

(d) Code C^in:
def Run(){ RepeatUntil(goal){ move; If(pathLeft){ turnLeft } } }

(e) Sketch Q^in:
def Run(){ A1; RepeatUntil(goal){ A2; If(B1){ A3 } } }
where A1 = (A1_1, A2_1), A2 = (A1_2, ..., A5_2), and A3 = (A1_3, ..., A5_3).

(f) Q^in-constraints. Input: C^in, Q^in, ω(· | C^in), δ_size = 2.
(0) Up to 2 new actions may be added in total to A1, A2, A3.
(1) Edit action sequences: ACTIONEDITS({A1, A2, A3}, ω(· | C^in)).
(4) B1 = pathL ∨ B1 = pathR.
(5) If B1 = pathL: ∃i ∈ [5] s.t. A^i_3 = turnL ∧ (∀j < i, A^j_3 ∉ {move, turnR}); if B1 = pathR: ∃i ∈ [5] s.t. A^i_3 = turnR ∧ (∀j < i, A^j_3 ∉ {move, turnL}).
(6) A1, A2, A3 are minimal.

3.1 Code Mutation

This stage in our pipeline mutates code C^in of task T^in such that its conceptual elements are preserved. Our mutation procedure consists of three main steps. First, we generate an abstract representation of C^in, called a sketch. Second, we restrict the sketch with constraints that describe the space of its concrete instantiations. Although this formulation is inspired by work on generating algebra problems [32], we use it in the entirely different context of generating conceptually similar mutations of C^in. This is achieved in the last step, where we use the sketch and its constraints to query an SMT solver [4]; the query solutions are mutated codes {C^out} such that C^out_struct = C^in_struct (see Definition 3).

Step 1: Sketch. The sketch of code C, denoted by Q, is an abstraction of C capturing its skeleton and generalizing C to the space of conceptually similar codes. Q, expressed in the language of Fig. 4b, is generated from C with mapping Ω. In particular, the map exploits the AST structure of the code: the AST is traversed in a depth-first manner, and all values are replaced with their corresponding sketch variables, i.e., action a, bool b, and iter x are replaced with A, B, and X, respectively. In the following, we also use the mapping ω(· | C), which takes a sketch variable in Q and returns its value in C. In addition to the above, we may extend a variable A to an action sequence A1, ..., An, since any Ai is allowed to be empty (φ). We may also add an action sequence of length δ_size at the beginning and end of the obtained sketch. As an example, consider the code in Fig. 4d and the resulting sketch in Fig. 4e. Notice that, while we add an action sequence at the beginning of the sketch (A1), no action sequence is appended at the end because the construct RepeatUntil renders any succeeding code unreachable.

Step 2: Sketch constraints. Sketch constraints restrict the possible concrete instantiations of a sketch by encoding the required semantics of the mutated codes. All constraint types are shown in Fig. 4c. In particular, constraint (0) restricts the size of the mutated code to within δ_size of the original. Constraint (1) specifies the allowed mutations to an action sequence based on its value in the code, given by ω(A | C); for instance, this constraint could result in converting all turnLeft actions of a sequence to turnRight. Constraint (2) restricts the possible values of the Repeat counter to within threshold δ_iter. Constraint (3) ensures that the Repeat counter is optimal, i.e., action subsequences before and after this construct are not nested in it. Constraint (4) specifies the possible values of the If condition based on its value in the code, given by ω(B | C). Constraint (5) refers to constraints imposed on action sequences nested within conditionals. As an example, consider constraint (5) in Fig. 4f, which states that if B1 = pathLeft, then the nested action sequence must have at least one turnLeft action, and the first occurrence of this action must not be preceded by a move or turnRight, thus preventing invalid actions within the conditional. Constraint (6) ensures minimality of an action sequence, i.e., optimality of the constituent actions to obtain the desired output. This constraint would, for instance, eliminate redundant sequences such as turnLeft, turnRight, which does not affect the output, or turnLeft, turnLeft, turnLeft, whose output could be achieved by a single turnRight. All employed elimination sequences can be found in the supplementary material. The entire list of constraints applied on the solution code in Fig. 4d is shown in Fig. 4f.
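Before turning to the SMT query of Step 3, the following small, self-contained Python sketch illustrates the abstraction of Step 1. It is our own simplification, not the paper's implementation: it traverses an assumed nested-tuple code representation depth-first, replaces concrete actions, booleans, and iteration counts with fresh sketch variables, and records the map ω(· | C); the padding of each A into a longer action sequence is omitted here.

```python
# Illustrative only: abstracting a code into a sketch (Step 1), not the paper's code.
# A code is a nested tuple (block_name, children); values to be abstracted are leaves.
ACTIONS = {"move", "turnLeft", "turnRight", "putMarker", "pickMarker"}
BOOLS   = {"pathAhead", "pathLeft", "pathRight", "marker", "noMarker",
           "noPathAhead", "noPathLeft", "noPathRight"}

def make_sketch(code, omega=None, counter=None):
    """Return (sketch, omega) where omega maps sketch variables to original values."""
    omega = {} if omega is None else omega
    counter = {"A": 0, "B": 0, "X": 0} if counter is None else counter

    def fresh(kind, value):
        counter[kind] += 1
        var = f"{kind}{counter[kind]}"
        omega[var] = value
        return var

    name, children = code
    if name in ACTIONS:
        return fresh("A", name), omega
    if name in BOOLS:
        return fresh("B", name), omega
    if isinstance(name, int):                      # Repeat counter
        return fresh("X", name), omega
    abstracted = [make_sketch(child, omega, counter)[0] for child in children]
    return (name, abstracted), omega

# C^in from Fig. 4d: Run{ RepeatUntil(goal){ move; If(pathLeft){ turnLeft } } }
code_in = ("Run", [("RepeatUntil", [("move", []),
                                    ("If", [("pathLeft", []), ("turnLeft", [])])])])
sketch, omega = make_sketch(code_in)
print(sketch)  # ('Run', [('RepeatUntil', ['A1', ('If', ['B1', 'A2'])])])
print(omega)   # {'A1': 'move', 'B1': 'pathLeft', 'A2': 'turnLeft'}
```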
Step 3: SMT query. For a sketch Q generated from code C and its constraints, we pose the following query to an SMT solver: (sketch Q, Q-constraints). As a result, the solver generates a set of instantiations, which are conceptually similar to C. In our implementation, we used the Z3 solver [7]. For the code in Fig. 4d, Z3 generated 66 mutated codes in 0.8s from an exhaustive space of 2,997 possible codes with δ_size = 2. One such mutation is shown in Fig. 1d.

While this approach generates codes that are devoid of most semantic irregularities, it has its limitations. Certain irregularities continue to exist in some generated codes: an example of such a code included the action sequence move, turnLeft, move, turnLeft, move, turnLeft, move, turnLeft, which results in the agent circling back to its initial location in the task space. This kind of undesirable behaviour is eliminated in the symbolic execution stage of our pipeline.
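To give a flavor of how such a query can be posed, the following is a minimal, self-contained example using Z3's Python bindings (requires pip install z3-solver). It is our own toy encoding rather than the paper's: it enumerates instantiations of a single padded action sequence with three slots under a size constraint in the spirit of constraint (0) and a cancellation-elimination constraint in the spirit of constraint (6).

```python
# Toy SMT query in the spirit of Step 3 (illustrative encoding, not the paper's).
from z3 import Ints, Solver, And, Or, If, Implies, Sum, sat

EMPTY, MOVE, TURN_L, TURN_R = 0, 1, 2, 3           # integer encoding of action slots
slots = Ints("a1 a2 a3")                           # a padded action sequence A of length 3

s = Solver()
for a in slots:
    s.add(And(a >= EMPTY, a <= TURN_R))            # each slot holds a valid action or is empty
# Empty slots may only appear as a suffix (the sequence has no "holes").
for prev, nxt in zip(slots, slots[1:]):
    s.add(Implies(prev == EMPTY, nxt == EMPTY))
# Size constraint in the spirit of (0): at most 2 non-empty actions.
s.add(Sum([If(a != EMPTY, 1, 0) for a in slots]) <= 2)
# Minimality in the spirit of (6): no adjacent turnLeft/turnRight pair (it cancels out).
for prev, nxt in zip(slots, slots[1:]):
    s.add(Or(prev != TURN_L, nxt != TURN_R))
    s.add(Or(prev != TURN_R, nxt != TURN_L))

solutions = []
while s.check() == sat:                            # enumerate all satisfying instantiations
    model = s.model()
    values = [model.eval(a, model_completion=True).as_long() for a in slots]
    solutions.append(values)
    s.add(Or([a != v for a, v in zip(slots, values)]))   # block this model and continue
print(len(solutions), "instantiations, e.g.:", solutions[:3])  # 11 for this toy encoding
```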
3.2 Symbolic Execution

Symbolic execution [13] is an automated test-generation technique that symbolically explores execution paths in a program. During exploration of a path, it gathers symbolic constraints over program inputs from statements along the path. These constraints are then mutated (according to a search strategy), and an SMT solver is queried to generate new inputs that explore another path.

Obtaining visual tasks with symbolic execution. This stage in our pipeline applies symbolic execution on each generated code C^out to obtain a suitable visual task T^out. The program inputs of C^out are the agent's initial location/orientation and the status of the grid cells (unknown, free, blocked, marker, goal), which is initially unknown. Symbolic execution collects constraints over these inputs from code statements. As illustrated in Fig. 5 for one path, symbolic execution generates a visual task for each path in C^out. However, not all of these tasks are suitable. For instance, if the goal is reached after the first move in Fig. 1d, all other statements in C^out are not covered, rendering the task less suitable for this code.

Figure 5: Illustration of symbolic execution on C^out from Fig. 1d. (b) shows the initial configuration of the agent's location and orientation as well as the status of the grid cells (unknown, free, blocked, goal). (c)-(e) show the symbolic execution steps where conditions goal and pathRight are False. (f) shows the step where goal is True. (g) shows the post-processing step where a puzzle T^out_vis is obtained.

Naïvely, symbolic execution could first enumerate all paths in C^out and their corresponding tasks, and then rank them in terms of suitability. However, solution codes may have an unbounded number of paths, which leads to path explosion, that is, the inability to cover all paths with tractable resources.

Guiding symbolic execution using Monte Carlo Tree Search (MCTS). To address this issue, we use MCTS [14] as a search strategy in symbolic execution with the goal of generating more suitable tasks with fewer resources; we define task suitability next. Symbolic execution has been previously combined with MCTS in order to direct the exploration towards costly paths [15]. In the supplementary material, we provide an example demonstrating how MCTS could guide the symbolic execution in generating more suitable tasks. As previously observed [12], a critical component of effectively applying MCTS is to define an evaluation function that describes the desired properties of the output, i.e., the visual tasks. Tailoring the evaluation function to our unique setting is exactly what differentiates our approach from existing work. In particular, our evaluation function, F_score, distinguishes suitable tasks by assigning a score (∈ [0, 1]) to them, which guides the MCTS search. A higher F_score indicates a more suitable task. Its constituent components are: (i) F_cov(T^out_vis, C^out) ∈ {0, 1}, which evaluates to 1 in the event of complete coverage of code C^out by task T^out_vis and 0 otherwise; (ii) F_diss(T^out_vis, T^in_vis) ∈ [0, 1], which evaluates the dissimilarity of T^out to T^in (see Section 2); (iii) F_qual(T^out_vis, C^out) ∈ [0, 1], which evaluates the quality and validity of T^out; (iv) F_nocrash(T^out_vis, C^out) ∈ {0, 1}, which evaluates to 0 in case the agent crashes into a wall and 1 otherwise; and (v) F_nocut(T^out_vis, C^out) ∈ {0, 1}, which evaluates to 0 if there is a shortcut sequence of actions (a in Fig. 4a) smaller than C^out_size that solves T^out and 1 otherwise. F_qual and F_nocut also resolve the limitations of our mutation stage by eliminating codes and tasks that lead to undesirable agent behavior. We instantiate F_score in the next section.
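To illustrate the path-to-puzzle correspondence described above, here is a small, self-contained Python toy (our own simplification, not the paper's symbolic executor): it interprets C^out from Fig. 1d on an initially unknown grid, with a sequence of branch decisions playing the role of one symbolic path that lazily fixes cell statuses. Different decision sequences yield different candidate puzzles, which is exactly the space that the MCTS-guided search explores.

```python
# Toy path-to-puzzle illustration for C^out of Fig. 1d (not the paper's implementation).
UNKNOWN, FREE, GOAL = "?", ".", "G"
DIRS = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}
LEFT = {"N": "W", "W": "S", "S": "E", "E": "N"}
RIGHT = {v: k for k, v in LEFT.items()}

def run_path(decisions, n=8, start=(6, 1), facing="E"):
    """Interpret C^out = move; turnLeft; RepeatUntil(goal){ move; If(pathRight){ turnRight } }.
    `decisions` answers the boolean queries (goal?, pathRight?) along one path."""
    grid = [[UNKNOWN] * n for _ in range(n)]
    (r, c), d, answers = start, facing, iter(decisions)
    grid[r][c] = FREE

    def ask():                      # one branch decision = one step down the symbolic path
        return next(answers)

    def move():
        nonlocal r, c
        r, c = r + DIRS[d][0], c + DIRS[d][1]
        grid[r][c] = FREE           # a non-crashing path forces the cell ahead to be free

    move()
    d = LEFT[d]
    while not ask():                # RepeatUntil(goal)
        move()
        if ask():                   # If(pathRight)
            d = RIGHT[d]
    grid[r][c] = GOAL               # the path ended because `goal` was answered True
    return ["".join(row) for row in grid]   # post-processing would fix remaining '?' cells

# One particular path: goal? F, pathRight? F, goal? F, pathRight? T, goal? T.
for row in run_path([False, False, False, True, True]):
    print(row)
```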
4 Experimental Evaluation

In this section, we evaluate our task synthesis algorithm on HOC and Karel tasks. Our implementation is publicly available.² While we give an overview of key results here, a detailed description of our setup and additional experiments can be found in the supplementary material.

²https://github.com/adishs/neurips2020_synthesizing-tasks_code

4.1 Reference Tasks and Specifications

Reference tasks. We use a set of ten reference tasks from HOC and Karel, shown in Fig. 6. The HOC tasks were selected from the Hour of Code: Classic Maze challenge by Code.org [23] and the Karel tasks from the Intro to Programming with Karel course by CodeHS.com [22]. The DSL of Fig. 4a is generic in that it includes both HOC and Karel codes, with the following differences: (i) the construct While, the marker-related actions putM and pickM, and the conditions noPathA, noPathL, noPathR, marker, and noMarker are specific to Karel only; (ii) the construct RepeatUntil and the condition goal are specific to HOC only. Furthermore, the puzzles for HOC and Karel are of different styles (see Fig. 1 and Fig. 2). For all tasks, the grid size of the puzzles is fixed to 10 × 10 cells (grid-size parameter n = 10).

Figure 6: Datasets for HOC and Karel tasks.

| Task T | T_store | T_size (= C_size) | C_depth | Type: Source |
|---|---|---|---|---|
| H1 | move, turnL, turnR | 5 | 1 | HOC: Maze 4 [23] |
| H2 | move, turnL, turnR, Repeat | 3 | 2 | HOC: Maze 7 [23] |
| H3 | move, turnL, turnR, Repeat | 5 | 2 | HOC: Maze 8 [23] |
| H4 | move, turnL, turnR, RepeatUntil | 5 | 2 | HOC: Maze 12 [23] |
| H5 | move, turnL, turnR, RepeatUntil, If | 4 | 3 | HOC: Maze 16 [23] |
| H6 | move, turnL, turnR, RepeatUntil, IfElse | 4 | 3 | HOC: Maze 18 [23] |
| K7 | move, turnL, turnR, pickM, putM | 5 | 1 | Karel: Our first [22] |
| K8 | move, turnL, turnR, pickM, putM, Repeat | 4 | 2 | Karel: Square [22] |
| K9 | move, turnL, turnR, pickM, putM, Repeat, IfElse | 5 | 3 | Karel: One ball in each spot [22] |
| K10 | move, turnL, turnR, pickM, putM, While | 7 | 2 | Karel: Diagonal [22] |

Specification of scoring functions. F_qual(T^out_vis, C^out) ∈ [0, 1] was approximated as the sum of the normalized counts of "moves", "turns", "segments", and "long-segments" in the grid; segments and long-segments are sequences of 3 and 5 move actions, respectively. More precisely, for HOC tasks, we used the following function, where the features are computed by executing C^out on T^out_vis:

F^HOC_qual(T^out_vis, C^out) = 1/4 · ( #moves/(2n) + #turns/n + #segments/(n/2) + #long-segments/(n/3) ).

Furthermore, in our implementation, the value of F_qual(·) was set to 0 when F_nocrash(·) = 0. For Karel tasks, F_qual additionally included the normalized counts of putM and pickM, and is provided in the supplementary material. F_diss(T^out_vis, T^in_vis) ∈ [0, 1] was computed based on the dissimilarity of the agent's initial location/orientation w.r.t. T^in_vis, and the grid-cell-level dissimilarity based on the Hamming distance between T^out_vis and T^in_vis. More precisely, we used the following function:

F_diss(T^out_vis, T^in_vis) = 1/3 · ( diss(loc | T^out_vis, T^in_vis) + diss(dir | T^out_vis, T^in_vis) + diss(grid-cells | T^out_vis, T^in_vis) ),

where diss(loc | ·) ∈ {0, 1}, diss(dir | ·) ∈ {0, 1}, and diss(grid-cells | ·) ∈ [0, 1] (after the Hamming distance is normalized with a factor of 2/n²).

Next, we define the evaluation function F_score(T^out, C^out, T^in, C^in) ∈ [0, 1] used by MCTS:

F_score(T^out, C^out, T^in, C^in) = 𝟙( F_qual(T^out_vis, C^out) ≥ δ_qual, F_nocrash(T^out_vis, C^out) = 1, F_nocut(T^out_vis, C^out) = 1 )   [component (i)]
  · ( α1·F_cov(T^out_vis, C^out) + α2·F_qual(T^out_vis, C^out) + α3·F_diss(T^out_vis, T^in_vis) )   [component (ii)]

where 𝟙 is an indicator function and each constant α = 1/3. Component (ii) in the above function supplies the gradients for guiding the search in MCTS; component (i) is applied at the end of the MCTS run to pick the output. More precisely, the best task (i.e., the one with the highest F_score value) is picked only from the pool of generated tasks which have F_score(·) > 0 and satisfy F_cov(·) = 1.
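As a worked rendering of the scoring functions above (our own illustration; the feature extraction, the Karel variant of F_qual, and the components F_cov, F_nocrash, and F_nocut are abstracted into plain inputs), the following Python sketch shows how the pieces compose into F_score:

```python
# Illustrative instantiation of the scoring functions of Section 4.1 (not the paper's code).
def f_qual_hoc(features, n=10):
    """features: counts obtained by executing C^out on T^out_vis."""
    return 0.25 * (features["moves"] / (2 * n)
                   + features["turns"] / n
                   + features["segments"] / (n / 2)
                   + features["long_segments"] / (n / 3))

def f_diss(diss_loc, diss_dir, diss_cells):
    """Each component is assumed to be already normalized to [0, 1]."""
    return (diss_loc + diss_dir + diss_cells) / 3.0

def f_score(f_cov, f_qual, f_dissim, f_nocrash, f_nocut, delta_qual=0.2):
    """Component (i): hard validity gate; component (ii): weighted sum guiding MCTS."""
    gate = 1.0 if (f_qual >= delta_qual and f_nocrash == 1 and f_nocut == 1) else 0.0
    return gate * (f_cov + f_qual + f_dissim) / 3.0    # alpha_1 = alpha_2 = alpha_3 = 1/3

# Example: a fully covered, crash-free task of moderate quality and dissimilarity.
q = f_qual_hoc({"moves": 8, "turns": 4, "segments": 2, "long_segments": 1})
print(round(q, 3), round(f_score(1, q, f_diss(1, 0, 0.4), 1, 1), 3))  # 0.375 0.614
```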
Specification of task synthesis and MCTS. As per Section 2, we set the following thresholds for our algorithm: (i) δ_size = 2, (ii) δ_diss = 0.33, and (iii) δ_qual = 0.2 for codes with While or RepeatUntil, and 0.05 otherwise. We run MCTS 10 times per code, with each run generating one task. We set the maximum number of iterations of a run to 2 million (M) and the exploration constant to 2 [14]. Even when considering a tree depth of 2n (= 20), there are millions of leaves for the difficult tasks H5 and H6, reflecting the complexity of task generation. For each code C^out, we generated 10 different visual tasks. To ensure sufficient diversity among the tasks generated for the same code, we introduced a measure F_diversity. This measure not only ensures visual task dissimilarity, but also ensures sufficient diversity in the entire symbolic paths during generation (for details, see the supplementary material).

4.2 Results

Performance of the task synthesis algorithm. Fig. 7 shows the results of our algorithm. The second column illustrates the enormity of the unconstrained space of mutated codes; here we only impose the size constraint (0) from Fig. 4c. We then additionally impose constraint (1), resulting in a partially constrained space of mutated codes (column 3), and finally apply all constraints from Fig. 4c to obtain the final set of generated codes (column 4). This reflects the systematic reduction in the space of mutated codes achieved by our constraints. Column 5 shows the total running time for generating the final codes, which denotes the time taken by Z3 to compute solutions to our mutation query. As discussed in Section 3.1, a few codes with semantic irregularities still remain after the mutation stage. The symbolic execution stage eliminates these to obtain the reduced set of valid codes (column 6). Column 7 shows the final number of generated tasks, and column 8 is the average time per output task (i.e., one MCTS run).

Figure 7: Results on HOC and Karel tasks (columns 2-5: code mutation; columns 6-8: symbolic execution; columns 9-11: fraction of T^out satisfying the stated criteria); details are provided in Section 4.

| Task T^in (1) | #C^out with (0) (2) | #C^out with (0),(1) (3) | #C^out with all constraints (4) | Time (5) | #C^out valid (6) | #T^out (7) | Time per task (8) | Fraction with (V) (9) | Fraction with (VI), δ_mini=1 (10) | Fraction with (VI), δ_mini=0 (11) |
|---|---|---|---|---|---|---|---|---|---|---|
| H1 | 3,159 | 112 | 64 | 0.6s | 28 | 272 | 68s | 1.00 | 1.00 | 1.00 |
| H2 | 8,991 | 594 | 138 | 1.7s | 48 | 428 | 61s | 1.00 | 1.00 | 1.00 |
| H3 | 798,255 | 13,122 | 720 | 13.3s | 196 | 1,126 | 60s | 0.90 | 0.98 | 0.90 |
| H4 | 5,913 | 152 | 108 | 1.0s | 44 | 404 | 167s | 1.00 | 1.00 | 0.50 |
| H5 | 2,997 | 294 | 66 | 0.8s | 46 | 444 | 348s | 0.98 | 0.59 | 0.27 |
| H6 | 1,728 | 294 | 54 | 0.6s | 48 | 480 | 347s | 0.80 | 0.45 | 0.07 |
| K7 | 96,875 | 150 | 122 | 1.3s | 122 | 1,196 | 61s | 1.00 | 1.00 | 1.00 |
| K8 | 484,875 | 4,506 | 990 | 11.6s | 469 | 4,506 | 63s | 1.00 | 1.00 | 1.00 |
| K9 | 8.595×10⁶ | 60,768 | 888 | 11.3s | 432 | 4,258 | 185s | 0.92 | 0.92 | 0.88 |
| K10 | 132.625×10⁶ | 19,328 | 1,404 | 17.1s | 532 | 5,032 | 158s | 1.00 | 1.00 | 1.00 |

Analyzing output tasks. We further analyze the generated tasks based on the objectives of Section 2. All tasks satisfy properties (I)-(III) by design. Objective (IV) is easily achieved by excluding generated tasks for which C^out = C^in. For a random sample of 100 of the generated tasks per reference task, we performed manual validation to determine whether objectives (V) and (VI) are met. The fraction of tasks that satisfy these objectives is listed in the last three columns of Fig. 7. We observe that the vast majority of tasks meet the objectives, even if not by design. For H6, the fraction of tasks satisfying (VI) is low because the corresponding codes are generic enough to solve several puzzles.

Deep dive into an MCTS run. To offer more insight into the task generation process, we take a closer look at an MCTS run for task H5, shown in Fig. 8. Fig. 8a illustrates the improvement in various components of F_score as the number of MCTS iterations increases. The best tasks at different iterations are shown in Figs. 8b, 8c, and 8d. As expected, the more iterations, the better the generated tasks.

Figure 8: Illustration of a single MCTS run on C^out from Fig. 1d, obtained from the solution code of task H5 by mutation. (a) shows the temporal trends of the different normalized feature values in F_score (coverage, no-crash, long-segments, moves, turns), averaged over a time window of 100 steps. (b)-(d) show the best, i.e., highest-scoring, tasks generated up to 2×10², 2×10⁴, and 2×10⁶ iterations, respectively. T^out_vis shown in Fig. 1c is the puzzle produced in (d).
Remarks. We also ran the mutation stage by enumerating the programs within the size constraints and then post-checking the other constraints without Z3. This implementation leads to a run-time increase by a factor of 10 to 100 for different tasks; so, Z3 seems to be very effective by jointly considering all the constraints. As a search method, although MCTS seems computationally expensive, the actual run-time and memory footprint of an MCTS run depend on the unique traces explored (i.e., the unique symbolic executions done); this number is typically much lower than the number of iterations (see also the discussion in the supplementary material). Considering the MCTS output in Figs. 8c and 8d, to obtain a comparable evaluation score through a random search, the corresponding number of unique symbolic executions required is at least 10 times more than executed by MCTS. We note that while we considered one I/O pair for Karel tasks, our methodology can be easily extended to multiple I/O pairs by adapting techniques designed for generating diverse tasks.

5 User Study and Comparison with Alternate Methods

In this section, we evaluate our task synthesis algorithm with a user study focusing on tasks H2, H4, H5, and H6. We developed an online app,³ which uses the publicly available toolkit of Blockly Games [10] and provides an interface for a participant to practice block-based programming tasks for HOC. Each practice session of the study involves three steps: (i) a reference task T^in ∈ {H2, H4, H5, H6} is shown to the participant along with its solution code C^in, (ii) a new task T^out is generated for which the participant has to provide a solution code, and (iii) a post-survey asks the participant to assess the visual dissimilarity of the two tasks on a 4-point Likert scale as used in [25]. Details on the app interface and questionnaire are provided in the supplementary material. Participants for the study were recruited through Amazon Mechanical Turk. We only selected four tasks due to the high cost involved in conducting the study (about 1.8 USD per participant). The number of participants and their performance are documented in Fig. 9.

³https://www.teaching-blocks.cc/

Baselines and methods evaluated. We evaluated four different methods, including three baselines (SAME, TUTOR, MUTTASK) and our algorithm (SYNTASK). SAME generates tasks such that T^out = T^in. TUTOR produces tasks that are similar to T^in and designed by an expert; we picked similar problems from the set of 20 Classic Maze challenge tasks [23] exercising the same programming concepts: Maze 6, 9 for H2; Maze 11, 13 for H4; Maze 15, 17 for H5; and Maze 19 for H6. MUTTASK generated tasks by directly mutating the grid-world of the original task, i.e., by moving the agent or the goal by up to two cells and potentially changing the agent's orientation; a total of 18, 20, 15, and 17 tasks were generated for H2, H4, H5, and H6, respectively. Fig. 10 shows two output tasks for H4 and illustrates the challenge in directly mutating the input task, given the high discontinuity in the mapping from the space of tasks to their codes: for H4, a total of 14 out of 20 new tasks were structurally very different from the input. SYNTASK uses our algorithm to generate tasks. We picked the generated tasks from three groups based on the size of the code mutations from which they were produced, differing from the reference solution code by +δ_size for δ_size ∈ {0, 1, 2}. For H2 and H4, we randomly selected 5 tasks from each group, for a total of 15 new tasks per reference task.
For H5 and H6, we selected 10 tasks from the first group (δ_size = 0) only, due to their complexity stemming from the nested constructs in their codes. We observed that the TUTOR tasks for H5 and H6 were also of δ_size = 0, i.e., C^out_size = C^in_size. All the generated tasks picked for SYNTASK adhere to properties (I)-(VI) in Section 2.

Figure 9: User study results for HOC tasks (H represents all tasks in the study); each cell reports the overall value followed by the per-task values for H2/H4/H5/H6; see Section 5.

| Method | Total participants (H; H2/H4/H5/H6) | Fraction of tasks solved | Time spent in secs | Visual dissimilarity |
|---|---|---|---|---|
| SAME | 96; 24/24/24/24 | .94; .92/1.00/.96/.88 | 89; 60/59/93/145 | 1.07; 1.12/1.04/1.00/1.12 |
| TUTOR | 170; 48/48/49/25 | .90; .90/.92/.88/.92 | 121; 107/113/118/169 | 2.90; 2.81/2.79/2.96/3.16 |
| MUTTASK | 278; 72/79/60/67 | .68; .76/.71/.65/.60 | 219; 135/299/219/215 | 2.17; 2.36/2.33/1.95/1.99 |
| SYNTASK | 197; 59/57/40/41 | .89; .92/.89/.92/.83 | 144; 85/183/130/189 | 2.63; 2.41/2.42/2.68/3.20 |

Figure 10: MUTTASK applied to H4. T^in_vis and C^in are shown in (a) and (d). (b) and (c) illustrate two tasks T^out_vis obtained via small mutations of T^in_vis. (e) is the smallest solution code for (b) and is structurally similar to C^in; (f) is the smallest solution code for (c) and is drastically different from C^in. (a) T^in_vis for H4. (b) 1st T^out_vis. (c) 2nd T^out_vis. (d) C^in for H4: def Run(){ RepeatUntil(goal){ move; turnLeft; move; turnRight } }. (e) 1st C^out: def Run(){ move; move; RepeatUntil(goal){ turnLeft; move; turnRight; move } }. (f) 2nd C^out: def Run(){ move; move; turnLeft; move; turnRight; move; turnLeft; ... (15 more actions) }.

Results on task solving. In terms of successfully solving the generated tasks, SAME performed best (mean success = 0.94) in comparison to TUTOR (mean = 0.90), SYNTASK (mean = 0.89), and MUTTASK (mean = 0.68); this is expected given the tasks generated by SAME. In comparison to TUTOR, the performance of SYNTASK was not significantly different (χ² = 0.04, p = 0.83); in comparison to MUTTASK, SYNTASK performed significantly better (χ² = 28.74, p < e−8). The complexity of the generated tasks is also reflected in the average time that participants spent on solving them: as shown in Fig. 9, they spent more time solving the tasks generated by MUTTASK.

Results on visual task dissimilarity. Visual dissimilarity was measured on a Likert scale ranging from 1 to 4, with 1 being highly similar and 4 highly dissimilar. Comparing the dissimilarity of the generated tasks w.r.t. the reference task, we found that the performance of SAME was worst (mean dissimilarity = 1.07), while that of TUTOR was best (mean = 2.90). SYNTASK (mean = 2.63) performed significantly better than MUTTASK (mean = 2.17), yet slightly worse than TUTOR. This is because TUTOR generates tasks with additional distracting paths and noise, which can also be done by our algorithm (although it was not done for this study). Moreover, for H2, which has no conditionals, the resulting codes were somewhat similar, and so were the generated puzzles. When excluding H2 from the analysis, the difference between SYNTASK (mean = 2.72) and TUTOR (mean = 2.93) was not statistically significant. A detailed distribution of the responses can be found in the supplementary material.

Remarks. SAME's performance in terms of tasks solved is below 1.00, possibly because participants overlooked the solution of Step 1, unaware that they would be receiving the same task in Step 2, and the app did not allow them to go back to Step 1.
This user study provides a proof-of-concept; more elaborate studies are needed to fully reach the motivational goal of teaching K-12 students and to evaluate the long-term impact on students' concept learning. As additional studies, it would be important to understand the sensitivity of the user study results w.r.t. the Likert scale definition; another possibility is to use pairwise comparisons when eliciting user evaluations.

6 Conclusions and Outlook

We developed techniques for a critical aspect of pedagogy in block-based programming: automatically generating new tasks that exercise specific programming concepts while looking visually dissimilar to the input. We demonstrated the effectiveness of our methodology through an extensive empirical evaluation and user study on reference tasks from popular programming platforms. We believe our techniques have the potential to drastically improve the success of pedagogy in block-based visual programming environments by providing tutors and students with a substantial pool of new tasks. Beyond the application domain of programming education, our methodology can be used for generating large-scale datasets consisting of tasks and solution codes with desirable characteristics; this can be potentially useful for training neural program synthesis methods. There are several promising directions for future work, including but not limited to: learning a policy to guide the MCTS procedure (instead of running vanilla MCTS); automatically learning the constraints and cost function from a human-generated pool of problems; and applying our methodology to other programming environments (e.g., Python problems).

Broader Impact

This paper develops new techniques for improving pedagogy in block-based visual programming environments. Such programming environments are increasingly used nowadays to introduce computing concepts to novice programmers, and our work is motivated by the clear societal need of enhancing K-12 computing education. In existing systems, the programming tasks are hand-curated by tutors, and the available set of tasks is typically very limited. This severely limits the utility of existing systems for long-term learning, as students do not have access to practice tasks for mastering the programming concepts. We take a step towards tackling this challenge by developing a methodology to generate new practice tasks for a student that match a desired level of difficulty and exercise specific programming concepts. Our task synthesis algorithm is able to generate 1000s of new similar tasks for reference tasks taken from the Hour of Code: Classic Maze challenge by Code.org and the Intro to Programming with Karel course by CodeHS.com. Our extensive experiments and user study further validate the quality of the generated tasks. Our task synthesis algorithm could be useful in many different ways in practical systems. For instance, tutors can assign new practice tasks as homework or quizzes to students to check their knowledge, students can automatically obtain new similar tasks after they have failed to solve a given task and received assistance, and intelligent tutoring systems could automatically generate a personalized curriculum of problems for a student for long-term learning.

Acknowledgments and Disclosure of Funding

We would like to thank the anonymous reviewers for their helpful comments. Ahana Ghosh was supported by Microsoft Research through its PhD Scholarship Programme.
Umair Z. Ahmed and Abhik Roychoudhury were supported by the National Research Foundation, Singapore and the National University of Singapore through its National Satellite of Excellence in Trustworthy Software Systems (NSOE-TSS) project under the National Cybersecurity R&D (NCR) Grant award no. NRF2018NCRNSOE003-0001.

References

[1] Umair Z. Ahmed, Sumit Gulwani, and Amey Karkare. Automatically generating problems and solutions for natural deduction. In IJCAI, pages 1968-1975, 2013.
[2] Chris Alvin, Sumit Gulwani, Rupak Majumdar, and Supratik Mukhopadhyay. Synthesis of geometry proof problems. In AAAI, pages 245-252, 2014.
[3] Erik Andersen, Sumit Gulwani, and Zoran Popovic. A trace-based framework for analyzing and synthesizing educational progressions. In CHI, pages 773-782, 2013.
[4] Clark W. Barrett and Cesare Tinelli. Satisfiability modulo theories. In Handbook of Model Checking, pages 305-343. Springer, 2018.
[5] Rudy Bunel, Matthew J. Hausknecht, Jacob Devlin, Rishabh Singh, and Pushmeet Kohli. Leveraging grammar and reinforcement learning for neural program synthesis. In ICLR, 2018.
[6] Xinyun Chen, Chang Liu, and Dawn Song. Execution-guided neural program synthesis. In ICLR, 2018.
[7] Leonardo de Moura and Nikolaj Bjørner. Z3: An efficient SMT solver. In TACAS, pages 337-340, 2008.
[8] Jacob Devlin, Rudy Bunel, Rishabh Singh, Matthew J. Hausknecht, and Pushmeet Kohli. Neural program meta-induction. In Advances in Neural Information Processing Systems, pages 2080-2088, 2017.
[9] Aleksandr Efremov, Ahana Ghosh, and Adish Singla. Zero-shot learning of hint policy via reinforcement learning and program synthesis. In EDM, 2020.
[10] Blockly Games. Games for tomorrow's programmers. https://blockly.games/.
[11] Sumit Gulwani. Example-based learning in computer-aided STEM education. Communications of the ACM, 57(8):70-80, 2014.
[12] Bilal Kartal, Nick Sohre, and Stephen J. Guy. Data-driven Sokoban puzzle generation with Monte Carlo tree search. In AIIDE, 2016.
[13] James C. King. Symbolic execution and program testing. Communications of the ACM, 19:385-394, 1976.
[14] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In ECML, pages 282-293, 2006.
[15] Kasper Luckow, Corina S. Păsăreanu, and Willem Visser. Monte Carlo tree search for finding costly paths in programs. In SEFM, pages 123-138, 2018.
[16] John H. Maloney, Kylie Peppler, Yasmin Kafai, Mitchel Resnick, and Natalie Rusk. Programming by choice: Urban youth learning programming with Scratch. In SIGCSE, pages 367-371, 2008.
[17] Samiha Marwan, Nicholas Lytle, Joseph Jay Williams, and Thomas W. Price. The impact of adding textual explanations to next-step hints in a novice programming environment. In ITiCSE, pages 520-526, 2019.
[18] Benjamin Paaßen, Barbara Hammer, Thomas W. Price, Tiffany Barnes, Sebastian Gross, and Niels Pinkwart. The continuous hint factory: Providing hints in vast and sparsely populated edit distance spaces. Journal of Educational Data Mining, 2018.
[19] Chris Piech, Jonathan Huang, Andy Nguyen, Mike Phulsuksombati, Mehran Sahami, and Leonidas J. Guibas. Learning program embeddings to propagate feedback on student code. In ICML, pages 1093-1102, 2015.
[20] Chris Piech, Mehran Sahami, Jonathan Huang, and Leonidas J. Guibas. Autonomously generating hints by inferring problem solving policies. In L@S, pages 195-204, 2015.
[21] CodeHS platform. CodeHS.com: Teaching Coding and Computer Science. https://codehs.com/.
[22] CodeHS platform. Intro to Programming with Karel the Dog. https://codehs.com/info/curriculum/introkarel.
[23] Code.org platform. Hour of Code: Classic Maze Challenge. https://studio.code.org/s/hourofcode.
[24] Code.org platform. Hour of Code Initiative. https://hourofcode.com/.
[25] Oleksandr Polozov, Eleanor O'Rourke, Adam M. Smith, Luke Zettlemoyer, Sumit Gulwani, and Zoran Popovic. Personalized mathematical word problem generation. In IJCAI, 2015.
[26] Thomas W. Price and Tiffany Barnes. Position paper: Block-based programming should offer intelligent support for learners. In B&B, pages 65-68, 2017.
[27] Thomas W. Price, Yihuan Dong, and Dragan Lipovac. iSnap: Towards intelligent tutoring in novice programming environments. In SIGCSE, pages 483-488, 2017.
[28] Thomas W. Price, Rui Zhi, and Tiffany Barnes. Evaluation of a data-driven feedback algorithm for open-ended programming. In EDM, 2017.
[29] Mitchel Resnick, John Maloney, Andrés Monroy-Hernández, Natalie Rusk, Evelyn Eastmond, Karen Brennan, Amon Millner, Eric Rosenbaum, Jay Silver, Brian Silverman, et al. Scratch: Programming for all. Communications of the ACM, 52(11):60-67, 2009.
[30] Reudismam Rolim, Gustavo Soares, Loris D'Antoni, Oleksandr Polozov, Sumit Gulwani, Rohit Gheyi, Ryo Suzuki, and Björn Hartmann. Learning syntactic program transformations from examples. In ICSE, pages 404-415, 2017.
[31] Rishabh Singh, Sumit Gulwani, and Armando Solar-Lezama. Automated feedback generation for introductory programming assignments. In PLDI, pages 15-26, 2013.
[32] Rohit Singh, Sumit Gulwani, and Sriram K. Rajamani. Automatically generating algebra problems. In AAAI, 2012.
[33] Lisa Wang, Angela Sy, Larry Liu, and Chris Piech. Learning to represent student knowledge on programming exercises using deep learning. In EDM, 2017.
[34] David Weintrop and Uri Wilensky. Comparing block-based and text-based programming in high school computer science classrooms. TOCE, 18(1):1-25, 2017.
[35] Mike Wu, Milan Mosse, Noah Goodman, and Chris Piech. Zero shot learning for code education: Rubric sampling with deep learning inference. In AAAI, 2019.
[36] Jooyong Yi, Umair Z. Ahmed, Amey Karkare, Shin Hwei Tan, and Abhik Roychoudhury. A feasibility study of using automated program repair for introductory programming assignments. In ESEC/FSE, 2017.
[37] Rui Zhi, Thomas W. Price, Samiha Marwan, Alexandra Milliken, Tiffany Barnes, and Min Chi. Exploring the impact of worked examples in a novice programming environment. In SIGCSE, pages 98-104, 2019.