# neural_task_synthesis_for_visual_programming__5b16e302.pdf Published in Transactions on Machine Learning Research (01/2024) Neural Task Synthesis for Visual Programming Victor-Alexandru Pădurean vpadurea@mpi-sws.org Max Planck Institute for Software Systems Georgios Tzannetos gtzannet@mpi-sws.org Max Planck Institute for Software Systems Adish Singla adishs@mpi-sws.org Max Planck Institute for Software Systems Reviewed on Open Review: https: // openreview. net/ forum? id= a Yk Yajc JDN Generative neural models hold great promise in enhancing programming education by synthesizing new content. We seek to design neural models that can automatically generate programming tasks for a given specification in the context of visual programming domains. Despite the recent successes of large generative models like GPT-4, our initial results show that these models are ineffective in synthesizing visual programming tasks and struggle with logical and spatial reasoning. We propose a novel neuro-symbolic technique, Neur Task Syn, that can synthesize programming tasks for a specification given in the form of desired programming concepts exercised by its solution code and constraints on the visual task. Neur Task Syn has two components: the first component is trained via imitation learning procedure to generate possible solution codes, and the second component is trained via reinforcement learning procedure to guide an underlying symbolic execution engine that generates visual tasks for these codes. We demonstrate the effectiveness of Neur Task Syn through an extensive empirical evaluation and a qualitative study on reference tasks taken from the Hour of Code: Classic Maze challenge by Code.org and the Intro to Programming with Karel course by Code HS.com. 1 Introduction Recent advances in generative AI have demonstrated impressive performance in a variety of domains, including visual art and music creation (Dong et al., 2018; Briot et al., 2020; Suh et al., 2021; Ramesh et al., 2021; Rombach et al., 2022), medicinal chemistry synthesis (Schneider et al., 2020; Walters & Murcko, 2020; Tong et al., 2021; Gao & Coley, 2020), and AI-enhanced programming (Finnie-Ansley et al., 2022; Leinonen et al., 2023; Chen et al., 2021; Feng et al., 2020; Phung et al., 2023a). These successes are, in part, driven by advanced capabilities of deep generative models, such as Stable Diffusion (Rombach et al., 2022), Chat GPT (Open AI, 2023a), and GPT-4 (Open AI, 2023b). These advancements also hold great promise in enhancing education, for instance, by generating personalized content and new practice tasks for students allowing them to master required concepts (Sarsa et al., 2022; Tate et al., 2023; Baidoo-Anu & Owusu Ansah, 2023; Lim et al., 2023; Phung et al., 2023b). In this paper, we explore the role of generative AI in visual programming domains used for introductory programming. Popular domains, such as Scratch (Resnick et al., 2009), Hour of Code:Maze Challenge by Code.org (Ho CMaze) (Code.org, 2013b;a), and Karel (Pattis et al., 1995), have become an integral part of introductory computer science education and are used by millions of students (Code.org, 2013a; Wu et al., 2019; Price & Barnes, 2017). In existing visual programming platforms, programming tasks are hand-curated by tutors and the available set of tasks is typically very limited, posing a major hurdle for novices in mastering the missing concepts (Zhi et al., 2019; Ahmed et al., 2020). To this end, we seek to design generative models that can automatically synthesize visual programming tasks for a given specification (see Figure 1). Published in Transactions on Machine Learning Research (01/2024) Student is working on task Tref, struggling with obtaining a solution code Cref. AI system extracts the specifications ψin for a simpler task. AI system generates a task Tout based on ψin, and delivers it to student. (a) An interaction between student and AI system. Allowed blocks: {move, turn Left, turn Right, Repeat Until, If Else} Maximum size: 7 Repeat Until(goal){ If(path Ahead){ move } Else{ If(path Right){ turn Right } Else{ turn Left } } } } (b) Puzzle Tref IO (left), code constraints Tref code (middle), and solution Cref (right). Repeat Until(goal){ a blocks If(path Right){ turn Right } Else{ a blocks } a blocks } } a blocks is a body of basic action blocks from the set {move, turn Left, turn Right} Number of blocks 7 (c) Specifications ψin IO (left), ψin sketch (middle), and ψin (right). Allowed blocks: {move, turn Left, turn Right, Repeat Until, If Else} Maximum size: 7 Repeat Until(goal){ If(path Right){ turn Right } Else{ turn Left move }move } } (d) Puzzle Tout IO (left), code constraints Tout code (middle), and Cout (right). Figure 1: (a) Illustration of an AI system helping a student, inspired by (Ghosh et al., 2022). The AI system supports the student when trying to solve Tref. The system first extracts a specification for a simpler task that can be provided to help the student. It then generates a new task based on that specification and delivers it to the student. (b) shows reference task Tref along with solution code Cref. (c) shows specifications ψin extracted by AI system. (d) shows task Tout and code Cout synthesized by AI system. 1.1 Motivation and Overview Our work is motivated by existing literature on AI-driven educational systems that seek to provide feedback to students who are struggling while solving a programming task. For instance, (Zhi et al., 2019) studies feedback provided in the form of worked examples as demonstration and (Ghosh et al., 2022) studies feedback provided in the form of new simpler tasks for adaptive scaffolding. We have illustrated the interaction between such a system and a student in Figure 1. Previous works have focused on studying how to adapt the feedback type according to the student s needs. We complement these works by focusing on generating on-the-fly the desired examples and simpler tasks that the system could provide to the student. In our work, we consider that a desired feedback format for a student is an input specification in the form of puzzle layout, code structure, code size, and selected programming concepts. Since the search space defined by these input specifications is potentially unbounded, designing AI systems that can synthesize correct tasks in real-time for any input specification is critical to cater to a diverse set of students behaviors. As a natural approach, one might be tempted to employ state-of-the-art large language models (LLMs) to generate a visual programming task by providing task synthesis specification as a prompt. In particular, models like GPT-4 are trained on multi-modal data, including text, code, and visual data, and hence it seems a suitable technique for reasoning about visual programming tasks (Open AI, 2023b; Bubeck et al., 2023). However, our results show that these models are ineffective in synthesizing visual programming tasks and struggle with logical and spatial reasoning, as also indicated in recent literature on state-of-the-art Published in Transactions on Machine Learning Research (01/2024) models (Bang et al., 2023; Bubeck et al., 2023; Valmeekam et al., 2022; Huang & Chang, 2022); see Section 5 for detailed discussion. In general, a major challenge in using purely neural generative models for synthesizing visual programming tasks is that the generative process is highly brittle even a small modification in the output task could make it invalid or semantically incorrect w.r.t. the input specification (Ahmed et al., 2020). As an alternative to neural generative models, we could rely on symbolic generative methods driven by search and planning algorithms to generate content that matches a specification. Several works have shown the efficacy of symbolic methods to generate new tasks in various educational domains, e.g., algebra exercises (Singh et al., 2012; Gulwani, 2014), geometric proof problems (Alvin et al., 2014), natural deduction (Ahmed et al., 2013), mathematical word problems (Polozov et al., 2015), Sokoban puzzles (Kartal et al., 2016), and visual programming tasks (Ahmed et al., 2020; Ghosh et al., 2022). In particular, our work is related to (Ahmed et al., 2020; Ghosh et al., 2022) that proposed symbolic methods guided by hand-crafted constraints and Monte Carlo Tree Search to generate high-quality visual programming tasks. However, their symbolic methods still suffer from intractably large spaces of feasible tasks and codes for a given specification, and could take several minutes to generate an output task for an input specification; see Section 4 for comparison to our technique. In general, a major shortcoming of using purely symbolic generative methods in the above-mentioned works is that the generative process is typically time-inefficient and not suitable for applications that require real-time synthesis. Against that backdrop, the main research question is: Can we develop neuro-symbolic techniques that can synthesize high-quality visual programming tasks while being robust and efficient? To this end, we develop Neur Task Syn, a novel neuro-symbolic technique that can synthesize programming tasks for input specifications in the form of desired programming concepts exercised by its solution code and constraints on the visual task. Given a task synthesis specification as input (Figure 1c), Neur Task Syn uses two components trained via reinforcement learning procedure: the first component generates possible solution codes, and the second component involves guiding an underlying symbolic execution engine that generates visual tasks for these codes. Our main results and contributions are summarized below: I. We formulate visual programming puzzle synthesis as a sequential decision-making process and propose Neur Task Syn, a novel neuro-symbolic technique for synthesizing visual programming tasks (Section 3). II. We demonstrate the performance of Neur Task Syn by comparing it to baselines from existing works (Section 4). III. We demonstrate the effectiveness of Neur Task Syn through an extensive evaluation on task specifications from real-world programming platforms (Section 5). IV. We publicly release the implementation and datasets to facilitate future research.1 1.2 Related work Educational task generation. Earlier works on task generation considered simpler domains such as algebra problems where solutions follow well-defined procedures. These works employed classical methods such as template-based (Polozov et al., 2015; Singh et al., 2012) or exhaustive enumeration-based task generation (Ahmed et al., 2020; Alvin et al., 2014). We aim to synthesize programming task given well-structured specifications, similar to previous works on visual task synthesis (Ahmed et al., 2020; Ghosh et al., 2022). While the existing works focused on offline generation, we seek to learn neural models that can generate tasks in real-time, as highlighted in Figure 1. Recent works have also explored the use of LLMs to generate new assignments for Python programming (Sarsa et al., 2022; Phung et al., 2023b) with mixed results. Spatial and logical reasoning for LLMs. Recent studies have focused on assessing the various capabilities of LLMs (Bubeck et al., 2023; Arora et al., 2023; Bang et al., 2023). These works show that such state-ofthe-art models have already achieved impressive generative capabilities for several programming education scenarios, including program repair and hint generation (Sarsa et al., 2022; Leinonen et al., 2023). However, these models still lack crucial capabilities like program execution, symbolic reasoning, and planning that are needed for structured task generation in visual programming domains (Valmeekam et al., 2022; Huang & Chang, 2022; Kaddour et al., 2023; Phung et al., 2023b). Based on (Huang et al., 2023), the LLMs inherent capabilities alone are still insufficient for many general reasoning tasks and require the guidance of an external tool or a human. 1Git Hub repository: https://github.com/machine-teaching-group/tmlr2024_neurtasksyn. Published in Transactions on Machine Learning Research (01/2024) 2 Problem Setup Visual programming tasks. We define a task as a tuple T := (TIO, Tcode), where TIO denotes the visual puzzle and Tcode denotes additional constraints on a solution code of this puzzle. This task space is general enough to encompass popular visual programming domains, including block-based programming domain of Hour of Code:Maze Challenge by Code.org (Ho CMaze) (Code.org, 2013b;a) and text-based programming domain of Karel (Pattis et al., 1995). For instance, the task in Figure 1b is based on Ho CMaze:Maze20 task Code.org (2013a) it is comprised of a visual puzzle where a solution code, when executed, will take the avatar to the goal without crashing into walls. Additional constraints on a solution code specify that it should use only blocks from the set {move, turn Left, turn Right, Repeat Until, If Else} and have a size 7. We give a similar illustrative example for Karel in Section 5. Solution codes of a task. We define the space of all possible codes C in a domain via a domainspecific language (DSL) (Gulwani et al., 2017). For instance, in our evaluation with Ho CMaze and Karel programming domains, we will use their corresponding DSLs as introduced in (Bunel et al., 2018; Ahmed et al., 2020). For a given task T, a code C C is a solution code if the following holds: C successfully solves TIO while respecting Tcode. For example, the codes in Figures 1b and 1d use only blocks from the set {move, turn Left, turn Right, Repeat Until, If Else}, have sizes equal to 7, and when executed, will take the avatar to the goal. Note that If Else is considered a single block. Task synthesis specification. We now introduce a notation to specify desired tasks for synthesis that exercise certain programming concepts in their solution codes and respect certain constraints on the visual puzzle. We define a task synthesis specification as a tuple ψ := (ψIO, ψsketch, ψ ), where ψIO is a partially initialized visual puzzle, ψsketch is a code sketch (i.e., a partial code) capturing the structure that should be followed by the synthesized task s solution codes, and ψ specifies additional constraints on solution codes. For instance, the input task synthesis specification ψin in Figure 1c is extracted from Tref and Cref in Figure 1b here, ψin IO is a 12x12 maze with certain cells initialized to free or wall cells; ψin sketch specifies the structure and some already initialized conditionals, loops and actions that should be used in solution codes; ψin specifies constraints related to what kind of blocks can be used and the size of a solution code. Synthesis objective. Given a task synthesis specification ψin := (ψin IO, ψin sketch, ψin ) as input, we seek to generate a task Tout := (Tout IO , Tout code) as output. Inspired by human-centered task quality criteria for the visual programming domains (Ahmed et al., 2020; Ghosh et al., 2022), we design objectives that capture the quality of desirable tasks. To formally set our synthesis objective and evaluation metrics, below we introduce different binary criteria (taking values 1 or 0) that we want Tout to satisfy w.r.t. ψin: O1:Validity. A task Tout is valid (i.e., value 1) iff: (a) Tout IO respects ψin IO, (b) Tout code respects (ψin sketch, ψin ), and (c) there exists at least one solution code C C for Tout. O2:Concepts. A task Tout conceptually captures specification ψin (i.e., value 1) iff: (a) there exists at least one solution code C C that respects ψin sketch and (b) all solution codes C C are at least of the complexity required by ψin sketch. We define complexity w.r.t. the total number of programming constructs used (e.g., Repeat Until, While, If, If Else) and the level of code nesting. We refer to the level of nesting as code depth in the remaining content of the paper. O3:Trace. A task Tout captures the following synthesis property: for any solution C for Tout that respects (ψin sketch, ψin ), the execution trace of C on Tout executes each loop or conditional at least n times. This property is inspired by real-world tasks that are easy to comprehend, as indicated by task quality criteria used in (Ahmed et al., 2020; Ghosh et al., 2022); we will use n = 2 in Section 5 evaluation. O4:Overall. This objective is 1 only if all the above objectives (O1, O2, O3) are satisfied for a task Tout. O5:Human. This objective captures the quality of a task Tout from an expert s point of view. The assessed criteria are the conceptual correctness w.r.t. the input specification and the visual quality of the task. Based on these criteria, the expert reports an overall binary score. Published in Transactions on Machine Learning Research (01/2024) 3 Our Synthesis Technique Neur Task Syn Code Generator Puzzle Generator C!"# T!"# 𝜓$%&#'( Figure 2: Components of task synthesis. In this section, we present Neur Task Syn, our neurosymbolic technique to synthesize visual programming tasks (Tout) for an input specification (ψin). 3.1 Overview As noted in Section 1, a key challenge in synthesizing tasks is that the mapping from the space of visual tasks to their solution codes is highly discontinuous a small modification in the output task could make it invalid or semantically incorrect w.r.t. the input specification (Ahmed et al., 2020). One way to tackle this challenge is to first reason about a possible solution code and then generate visual puzzles based on execution traces of this code (Gulwani, 2014; Kartal et al., 2016; Ahmed et al., 2020; Tercan et al., 2023). This motivates two components in our synthesis process shown in Figure 2: the first component generates possible solution codes Cout (akin to that of program synthesis (Gulwani et al., 2017)); the second component generates the visual puzzle for these codes via symbolic execution (akin to the idea of test-case generation (King, 1976)). After synthesizing both, we can get all the elements of Tout. Tout IO is based on the generated puzzle. Tout code is extracted from the size and types of blocks of Cout.2 Next, we discuss the components of Neur Task Syn. 3.2 Generating the Solution Code Cout The code generator component takes the elements of the specification, (ψin sketch, ψin ), that enforce constraints on solution codes of the desired task and accordingly generates a possible solution code Cout. Below, we briefly describe a base symbolic engine to generate syntactically valid codes from specifications via random search and then a neural model to guide this base engine; full details of this component are deferred to Appendix D.1. Base symbolic engine. The base engine operates on Abstract Syntax Tree representation of code sketches (i.e., partial codes). The engine generates codes by sampling tokens (i.e., action blocks, conditions, and iterators) from underlying DSL while respecting specification. Even though this ensures that a generated code is syntactically correct and valid w.r.t. specification, it could have semantic irregularities. Such irregularities can lead to confusing tasks (e.g., see Figure 8c, where the If and Else bodies are the same). Neural model. The neural model is trained to guide the sampling process of the base symbolic engine. This neural model is akin to a program synthesizer, but it has a different objective. It aims to synthesize codes that are semantically correct and can lead to high-quality tasks. One could use a variety of architectures, for instance, transformer-based (Le et al., 2022; Fried et al., 2022; Li et al., 2022; Wang et al., 2021) or custom-made encoder-decoder approaches (Balog et al., 2017; Bunel et al., 2018; Yin & Neubig, 2017). In our work, we use an LSTM-based decoder (Hochreiter & Schmidhuber, 1997) because of their extensive use in the existing literature on generating program solutions for an input visual programming task (Devlin et al., 2017; Bunel et al., 2018; Shin et al., 2019; Gupta et al., 2020). In our setting, the input corresponds to the code specification. Similar to (Bunel et al., 2018), we use an imitation (supervised) learning approach to train the LSTM-based neural model, learning to predict in a token-by-token manner. 3.3 Generating the Visual Puzzle Tout IO The puzzle generator component takes the element of the specification, ψin IO, that enforces constraints on the visual puzzle along with generated code Cout and accordingly generates a visual puzzle Tout IO . We first describe a base symbolic engine that performs symbolic execution of Cout to generate a semantically valid puzzle via random search and then describe a neural model to guide this base engine. We then offer details regarding how we formulate the neurally guided symbolic engine as a reinforcement learning (RL) agent. Base symbolic engine. The base engine performs symbolic execution of Cout on ψin IO which has uninitialized elements/unknowns. This symbolic execution emulates an execution trace of Cout and makes decisions about 2It is possible that code Cout generated during intermediate step is not a solution for Tout, e.g., when Cout is semantically incorrect and cannot generate a corresponding puzzle. In this case, we set the size constraints in Tout code as specified by ψin . Published in Transactions on Machine Learning Research (01/2024) Repeat Until(goal){ If(path Right){ turn Right } Else{ turn Left move } move } } Repeat Until(goal){ If(path Right){ turn Right } Else{ turn Left move } move } } Repeat Until(goal){ If(path Right){ turn Right } Else{ turn Left move } move } } Repeat Until(goal){ If(path Right){ turn Right } Else{ turn Left move } move } } Repeat Until(goal){ If(path Right){ turn Right } Else{ turn Left move } move } } Repeat Until(goal){ If(path Right){ turn Right } Else{ turn Left move } move } } Figure 3: Illustration of symbolic execution. In the first step, the execution of Cout starts on the partially initialized ψin IO. In the second step, the agent s location and orientation are set and execution continues to Repeat Until(goal), which is decided to false in this particular example. This decision does not immediately affect the visual grid. In the third step, the execution continues to If(path Right), which is decided to false and the avatar moves accordingly. The effect of these actions can be seen in the next state of the visual grid. In the fourth step, the execution continues to Repeat Until(goal), which is decided to true, giving the goal location at step five. This leads to the sixth step, which is the end of the execution. The remaining unknowns are replaced with walls. unknowns resulting in a concrete instantiation of ψin IO to Tout IO . The outcome of these decisions affects the quality of the generated Tout IO . Figure 3 shows an example of symbolic execution done by the symbolic engine. During this execution process, an emulator will replace unknowns with free cells, walls, the avatar position, and the goal location as indicated by the symbolic engine. The highlighted blocks of code in Figure 3 show the steps where the symbolic execution engine takes decisions, leading to one possible path in the execution of Cout and an instantiation of Tout IO . A code could have a potentially unbounded number of possible execution traces, and randomly taking decisions would typically lead to a lower quality task (King, 1976; Ahmed et al., 2020). Deciding the path of the execution maps to a sequential decision-making process. We aim to guide this decision-making process to achieve execution traces that lead to higher-quality tasks. Neural model. We train the neural model to guide the decision-making process of the base symbolic engine. Our neural architecture uses a CNN-based encoder for partial visual puzzles and combines it with code execution-related features (e.g., coverage, currently executed code block). This choice is based on previous works on program synthesis for visual programming tasks (Bunel et al., 2018; Gupta et al., 2020). Other works have investigated the use of Monte Carlo Tree Search (MCTS) (Kocsis & Szepesvári, 2006) strategy to guide the symbolic execution for generating better puzzles with fewer resources (Kartal et al., 2016; Ahmed et al., 2020). However, these works used MCTS at inference time without any learned policy and could take several minutes to generate an output task for an input specification. Instead, a neural model makes use of learned experience to lead to high-quality puzzles. Neurally guided symbolic engine as an RL agent. To speed up the generation process at inference, we train an RL agent (Sutton & Barto, 2018) whose reward is defined via a scoring function Fscore that captures the quality of the generated visual puzzle for an input specification; this scoring function is similar in spirit to that used for MCTS in (Kartal et al., 2016; Ahmed et al., 2020). The agent s goal is to make decisions about the unknowns encountered when symbolically executing a code with the objective of generating highquality puzzles. More concretely, we consider an episodic Markov Decision Process (MDP) defined as a tuple M = (S, A, P, R, S0), where an episode corresponds to a full symbolic execution of a code and the agent interacts with the environment over discrete time steps t. The elements of the MDP are defined as follows: A state s S is given by current status of visual puzzle and the block being executed (as in Figure 3); A set of state-depended actions As can either be the set of initialization actions (i.e., {pos1, pos2,..., posn}), or a set of binary decisions (i.e, {true, false}); Published in Transactions on Machine Learning Research (01/2024) Repeat Until(goal){ If(path Right){ turn Right } Else{ turn Left move } move } } (a) State at time t Environment RL Agent (1) state: puzzle+code (2) action: fill in unknown (3) reward: 0 or Fscore (b) Interaction at time t Repeat Until(goal){ If(path Right){ turn Right } Else{ turn Left move } move } } (c) State at time t + 1 Figure 4: Illustration showing the interaction between the RL agent and the environment. (a) shows state at time t, composed of code block being executed and status of the visual puzzle. (b) shows the interaction between the environment and the RL agent at time t. The environment comprises the code emulator and the scoring function. The RL agent is the neurally guided symbolic engine. (c) shows state at time t + 1 P : S A S R denotes deterministic transition dynamics, P(s |s, a) = 1 for s = s a, and 0 otherwise. R : S A R denotes a sparse reward. During the episode, the reward values are returned as 0; at the end of an episode, the reward values are computed using Fscore(Tout, Cout). S0 S is the set of initial states. An initial state is composed of ψin IO and Cout. In order to explain the interaction between the neurally guided symbolic engine and the emulator, we use a concrete example in Figure 4. Let us consider that the current state at time t is composed of the block being executed (i.e., If(path Right) interrogation) and the status of the visual puzzle (i.e., the avatar surrounded by unknowns). This is depicted in Figure 4a. At this point, the neural model can guide the symbolic engine, as an interaction between the environment and the agent is expected (Figure 4b). The agent receives the current state of the emulator (i.e., as in Figure 4a) and outputs an action, i.e., the decision on how to fill in the unknown. In our case, the decision is that there is no path to the right. Then the environment computes a reward in accordance with the action taken by using Fscore. After the decision is taken, the emulator continues executing the upcoming blocks (i.e., turn Left, move, move) until reaching the next step where a decision is needed (i.e., Repeat Until(goal)). The state at time t + 1 is represented by the block being executed (i.e., Repeat Until(goal)) and the new status of the visual puzzle, as depicted in Figure 4c. The implementation details of the RL agent are provided in Appendix D.2. 4 Training and Validation on Synthetic Specifications In this section, we train and validate Neur Task Syn on synthetic datasets of task specifications. We employ different variants of Neur Task Syn to quantify the utility of individual components and evaluate w.r.t. scoring and runtime metrics. Importantly, the models obtained here through training on synthetic dataset will be used for evaluation on real-world specifications in Section 5. 4.1 Evaluation Setup Visual programming domains. We consider two popular visual programming domains: Hour of Code:Maze Challenge by Code.org (Ho CMaze) (Code.org, 2013b;a) and Karel (Pattis et al., 1995), as introduced in Sections 1 and 2. Both these programming domains have been studied extensively in the literature on program/task synthesis (Bunel et al., 2018; Shin et al., 2019; Ahmed et al., 2020; Gupta et al., 2020) and computing education (Piech et al., 2015; Efremov et al., 2020; Ghosh et al., 2022). We define a few domain-specific elements for these visual programming domains. First, as introduced in Section 2, we use two domain-specific languages (DSLs) adapted from (Bunel et al., 2018; Ahmed et al., 2020). Second, as Published in Transactions on Machine Learning Research (01/2024) mentioned in Section 3, we will use domain-specific scoring functions FHo CMaze score and FKarel score to capture quality of a visual programming task. In our work, we adapt scoring functions used in (Ahmed et al., 2020). Full details are provided in the supplementary material; in a nutshell, these scoring functions are designed to intuitively capture the synthesis objectives set in Section 2, including properties like code coverage and trace quality. These scoring functions will be used in different ways: (a) during training of Neur Task Syn s puzzle generator as a reward for RL agent and during inference to select an output task from candidates; (b) when evaluating different techniques with a surrogate metric based on these scoring functions; (c) when computing the quality of expert-designed code-task pairs for the runtime experiments. For each domain, we will make use of an offline, time-intensive, method Task Oracle(C): it does one million symbolic executions of a given code C and returns highest-scoring task w.r.t. scoring function Fscore (cf. Footnote 3). Scoring-based metrics. We introduce a binary success metric M(ψin, Tout, Cout) that is used to compare the performance of different techniques and to pick hyperparameters. More concretely, M(ψin, Tout, Cout) is 1 if the following hold: (i) the task Tout is valid w.r.t. ψin; (ii) the generated code Cout is semantically correct in a sense that it can lead to a valid task via Task Oracle, i.e., Fscore(Task Oracle(Cout), Cout) > λ1; (iii) the generated task Tout is good quality in comparison to the oracle-generated task, i.e., Fscore(Tout, Cout) > λ2 Fscore(Task Oracle(Cout), Cout). We use λ1 = 0 and λ2 = 0.9 in our experiments. For each technique, performance is computed as % success rate across the evaluation dataset w.r.t. M; in total, we compute performance across three seeds and report averaged results as mean (stderr). Importantly, we note that this metric only serves as a surrogate metric for evaluation on synthetic dataset; the neural models trained here will be evaluated on real-world task specifications w.r.t. the synthesis objectives in the next section. Runtime metrics. We aim to measure the resources needed for a technique to reach certain quality thresholds. We do this by first computing the score given by Fscore to an expert-designed task-code pair. Then, we fix the code and measure the time and number of symbolic execution rollouts needed to reach a given % of the computed score. We evaluate each technique multiple times and report averaged results. 4.2 Evaluation w.r.t. Scoring-based Metrics Synthetic task specifications. For training and evaluation of techniques, we create a dataset of synthetic task specifications per domain, referred to as D := {ψin}. To create one specification ψin := (ψin sketch, ψin IO, ψin ), the most crucial part is getting a code sketch ψin sketch that respects the DSL and can lead to a valid code generation. We start by sampling a code Cin from the DSL for a given structure, depth, and constructs this sampling process is inspired by methods for synthetic dataset creation (Bunel et al., 2018; Shin et al., 2019; Ahmed et al., 2020). For each sampled code, we check its semantic validity, i.e., this code can lead to a high-quality task using Task Oracle. Afterwards, for a sampled code Cin we create its corresponding ψin sketch by keeping only the programming constructs (loops/conditionals) with a random subset of the booleans/iterators masked out. The rest of the ψin elements are instantiated as follows: ψin IO is 16 16 size without any initialization, ψin enforces the underlying DSL and a randomly initialized size between the number of blocks in Cin and 17. In our evaluation, we split D as follows: 80% for training the neural models (Dtrain), 10% for calibration (Dcal), and a fixed 10% for evaluation (Dtest). Techniques. First we describe Neur Task Syn, our main task synthesis technique from Section 3. For each domain (Ho CMaze and Karel), we train a separate instance of Neur Task Syn using the synthetic dataset introduced above. In Section 3, we described the generation process for a single rollout , i.e., one Cout and one puzzle Tout IO is generated. In practice, we use multiple rollouts to select a final output task Tout. More concretely, at inference time for a given ψin as input, Neur Task Syn generation process is captured by two parameters: number of code rollouts c by the code generator and number of puzzle rollouts p by the puzzle generator for each generated code. We denote these hyperparameters in subscript, e.g., Neur Task Sync:5,p:10 for 5 10 rollouts. Out of these c p candidates, the technique outputs one task Tout along with solution code Cout using its scoring function. Next, we describe different variants of Neur Task Sync,p and baselines: Neur Puzzle Genc:fix,p: This technique is a variant of Neur Task Sync,p to evaluate its puzzle generation component, assuming the code generator has access to code Cin associated with specification ψin in the dataset. At inference, Neur Puzzle Genc:fix,p generation process is captured by hyperparameter p, i.e., number of puzzle rollouts. Out of p candidates, technique outputs one task, analogous to Neur Task Syn. Published in Transactions on Machine Learning Research (01/2024) 10 20 30 40 50 Number of rollouts Performance NEURTSYN:ALL BASETSYN:ALL NEURTSYN:HARD BASETSYN:HARD (a) Base Task Syn, Neur Task Syn 0 10 20 30 40 50 60 70 80 90 100 Number of rollouts Performance NEURPGEN:ALL BASEPGEN:ALL NEURPGEN:HARD BASEPGEN:HARD (b) Base Puzzle Gen, Neur Puzzle Gen Metric Rollouts Runtime Technique:Data Quality Quality 90% 95% 99% 99% Neur PGen:All 4 9 30 1.97s Neur PGen:Hard 15 46 160 11.08s Base PGen:All 91 2932 69105 1977.44s Base PGen:Hard 437 17122 405231 11800.11s MCTS:All 92 1180 11740 14.92s MCTS:Hard 283 6841 10116 88.25s (c) Results for runtime evaluation. Figure 5: (a) Results for Base Task Sync,p and Neur Task Sync,p by increasing code rollouts c from 1 to 5 with fixed puzzle rollouts p = 10. (b) Results for Base Puzzle Genc:fix,p and Neur Puzzle Genc:fix,p by increasing puzzle rollouts p from 1 to 100. (c) Table showing the runtime/rollouts comparison between different techniques, on tasks of different difficulty, when varying the required quality threshold. The quality threshold is expressed as % of the quality score of an expert-designed task. See further details in Section 4. Base Task Sync,p and Base Puzzle Genc:fix,p: These techniques operate similar to Neur Task Sync,p and Neur Puzzle Genc:fix,p, but use only symbolic engine with random search.3 Results. Figures 5a and 5b report results as we vary the number of rollouts for different techniques, evaluated on the full dataset and on a hard segment of the dataset, where ψin sketch uses at least 2 constructs and has a depth of 3. In summary, these results demonstrate the utility of different components of Neur Task Syn and how the synthesis quality improves as we increase the number of rollouts. 4.3 Evaluation w.r.t. Runtime Metrics Code samples for puzzle generation. For evaluating the runtime of our technique, we pick a set of codes designed by experts as solutions for six representative tasks from both the Ho CMaze and Karel domains. We select those tasks that are common among our evaluation in Section 5 (see Figure 6) and the work of (Ahmed et al., 2020). These tasks are derived from Ho C:Maze8, Ho C:Maze9, Ho C:Maze13, Ho C:Maze18 (Code.org, 2013b), and Karel:Our First, Karel:Diagonal (Code HS, 2012b). Techniques. We focus on evaluating the runtime performance of three techniques. Neur Puzzle Genc:fix,p and Base Puzzle Genc:fix,p are equivalent to the techniques of Section 4.2. MCTSc:fix,p is the puzzle generation method based on Monte Carlo tree search (MCTS) (Ahmed et al., 2020), introduced in Section 3.3. More specifically, we do not set a fixed number of puzzle rollouts p, but we measure the values of p needed to reach certain quality thresholds. Results. We present our results in Figure 5c, averaged over all the selected tasks. Additionally, we define an expert code as hard when using at least 2 constructs and having a depth of 3. We separately report the results on Ho C:Maze18, which is considered hard. It can be noticed that the unguided symbolic execution of Base Puzzle Genc:fix leads to large execution times when aiming to synthesize high-quality tasks. MCTSc:fix has faster runtimes than Base Puzzle Genc:fix, but it becomes less scalable when required to come up with high-quality tasks for the hard code. Neur Puzzle Genc:fix maintains low execution times and number of rollouts for any of the required quality thresholds, even when faced with a hard code. 5 Experiments on Real-World Specifications In this section, we evaluate our task synthesis technique Neur Task Syn on real-world specifications. 3Task Oracle(C) introduced above is Base Puzzle Gen with p = 106 rollouts for a fixed code C. Published in Transactions on Machine Learning Research (01/2024) ψin ψin sketch structure (depth, constructs) ψin IO ψin Source ψ0 {Run {Repeat}} (2, 1) 16x16 empty Ho CMaze, blocks 10 Ho C:Maze9 Code.org (2013b) ψ1 {Run {Repeat Until}} (2, 1) 16x16 empty Ho CMaze, blocks 10 Ho C:Maze13 Code.org (2013b) ψ2 {Run {Repeat; Repeat}} (2, 2) 16x16 empty Ho CMaze, blocks 10 Ho C:Maze8 Code.org (2013b) ψ3 {Run {Repeat Until{If Else}}} (3, 2) 16x16 empty Ho CMaze, blocks 10 Ho C:Maze18 Code.org (2013b) ψ4 {Run {Repeat Until{If; If}}} (3, 3) 16x16 empty Ho CMaze, blocks 10 Ho C:Maze20 Code.org (2013b) ψ5 {Run} (1, 0) 16x16 empty Karel, blocks 10 Karel:Our First Code HS (2012b) ψ6 {Run {While}} (2, 1) 16x16 empty Karel, blocks 10 Karel:Diagonal Code HS (2012b) ψ7 {Run {While; While}} (2, 2) 16x16 empty Karel, blocks 10 Karel:Row Back Code HS (2012b) ψ8 {Run {While{If}}} (3, 2) 16x16 empty Karel, blocks 10 Karel:Stairway Code HS (2012b) ψ9 {Run {While{Repeat}}} (3, 2) 16x16 empty Karel, blocks 10 Karel:Clean All Figure 6: Real-world task specifications for Ho CMaze and Karel; ψin sketch is shortened for brevity; ψin indicated as booleans and actions allowed by each domain, with the addition of maximum size. Real-world task specifications. We use a set of 10 task specifications from Ho CMaze and Karel domains, shown in Figure 6. These task specifications are inspired by their source tasks (see Source column) in the following sense: we create a specification ψin for which the corresponding source task is a desired task as would be created by experts. Figure 8 shows illustration of task synthesis for a variant of ψ3 (source as Maze18 Ho CMaze task) where we used 12x12 grid size with certain cells pre-initialized; analogously, Figure 9 shows illustration of task synthesis for a variant of ψ8 (source as Stairway Karel task) where we used 12x12 grid size. Synthesis quality metrics. We evaluate techniques w.r.t. different metrics, each corresponding to a synthesis objective introduced in Section 2. Even though objectives O1 O4 are quantitative, it is challenging to fully automate their evaluation because it requires analyzing properties of different possible solution codes of a generated task. We manually did this evaluation when computing performance for each technique and metric. To capture objective O5, we assess the quality of synthesized tasks using human experts annotations. Each expert independently evaluated all specifications and all techniques. Additionally, we report a binary metric of whether Cout solves Tout to provide insight into the robustness of the synthesis process. Results are reported as a mean over 10 specifications ψin from Figure 6; we evaluate over three seeds and report averaged results as mean (stderr). In total, three experts (i.e., one per seed) with experience in visual programming provided annotations for O5. Techniques. We evaluate Neur Task Sync:10,p:100 with c = 10 and p = 100, i.e., total of c p = 1000 rollouts (see Section 4). Next, we describe additional techniques evaluated: Base Task Sync:10,p:100 operates similarly to Neur Task Sync:10,p:100, but uses only base symbolic engine with random search without any neural guidance (see Section 4). GPT4Task Syn-converse is based on Open AI s GPT-4 (Open AI, 2023b). It uses conversation-style prompts that involved human guidance to correct mistakes. More concretely, it is based on a two-stage process as shown in Figure 2 we first ask GPT-4 to generate a code Cout for ψin and then ask it to generate a puzzle Tout IO that could be solved by Cout. The first stage comprised 5 separate queries to generate a Cout: each query started with an initial prompt and then follow-up prompts to fix any mistakes. The second stage comprised of another 5 separate queries to generate a puzzle Tout IO : each query started with an initial prompt and then follow-up prompts to fix any mistakes. Once we get Cout and Tout IO , we set Tout code as for Neur Task Syn. We defer the prompts to the supplementary material. This variant of GPT4Task Syn was used to synthesize the tasks in Figures 8 and 9. GPT4Task Syn-fewshot uses the same two-stage synthesis process, but employs few-shot examples without any follow-up conversations. Expert refers to expert-designed tasks. In our setup, Expert simply outputs a task Tout based on the source task associated with input specification ψin; moreover it appropriately adjusts Tout IO to match ψin IO layout and sets Tout code as allowed by (ψin sketch, ψin ) and the size of the minimal solution code. Published in Transactions on Machine Learning Research (01/2024) Technique O1:Validity O2:Concepts O3:Trace O4:Overall O5:Human Cout solves Tout Neur Task Sync:10,p:100 1.00 (0.00) 0.83 (0.04) 0.80 (0.00) 0.80 (0.00) 0.77 (0.11) 1.00 (0.00) Base Task Sync:10,p:100 0.97 (0.04) 0.37 (0.08) 0.33 (0.04) 0.33 (0.04) 0.20 (0.07) 0.50 (0.12) GPT4Task Syn-converse 0.97 (0.04) 0.57 (0.11) 0.60 (0.07) 0.43 (0.11) 0.27 (0.08) 0.33 (0.08) GPT4Task Syn-fewshot 0.80 (0.07) 0.37 (0.11) 0.57 (0.11) 0.33 (0.08) 0.13 (0.11) 0.27 (0.04) Expert 1.00 1.00 1.00 1.00 1.00 1.00 Figure 7: Results on real-world task specifications for Ho CMaze and Karel in Figure 6; see Section 5. In this figure, Expert refers to expert-designed tasks. a blocks Repeat Until(goal){ a blocks If(b){ a blocks } Else{ a blocks } a blocks } } a blocks is a body of basic action blocks from the set {move, turn Left, turn Right} b is a boolean condition from {path Ahead, path Left, path Right} Number of blocks should be 7 (a) Task synthesis specification ψin, with ψin IO (left), ψin sketch (middle) and ψin (right). Repeat Until(goal){ If(path Left){ turn Left move } Else{ (b) Tout IO , Cout by GPT4Task Syn turn Left turn Left Repeat Until(goal){ If(path Right){ move } Else{ (c) Tout IO , Cout by Base Task Syn Repeat Until(goal){ If(path Right){ turn Right } Else{ turn Left move } move } } (d) Tout IO , Cout by Neur Task Syn Repeat Until(goal){ If(path Ahead){ move } Else{ turn Left } } } (e) Tout IO , Cout by Expert Figure 8: Illustrative example showcasing task synthesis inspired by the Maze18 Ho CMaze task Code.org (2013b;a). The given specification seeks to synthesize tasks where a solution code has the {Repeat Until{If Else}} structure. (a) Task synthesis specification ψin := (ψin IO, ψin sketch, ψin ) is provided as input: ψin IO is a 12x12 maze with certain cells initialized to free (white) or wall (gray) cells, the rest being marked as unknowns (brown); ψin sketch along with ψin specify constraints on code solutions of a synthesized task. (b d) show tasks Tout and codes Cout used as intermediate step to generate output tasks by three techniques. (e) shows task Tout and code Cout based on Maze18. See Section 5. Results. Figure 7 reports evaluation results for different techniques w.r.t. our task synthesis objectives. Next, we summarize some of our key findings. First, Neur Task Syn has high performance of at least 0.77 across all metrics. The illustrative examples in Figures 8 and 9 showcase the high-quality of tasks synthesized by Neur Task Syn, matching interesting characteristics of real-world tasks from Expert. Second, Base Task Syn, GPT4Task Syn-converse, and GPT4Task Syn-fewshot struggle on objectives O2 and O3. Their low performance can be explained, in part, by failure to generate a valid task/code pair. The illustrative examples in Figures 8 and 9 further highlight the issues of tasks generated by these techniques. For instance, Base Task Syn s Tout in Figure 8c can be solved by a simpler code with lower depth than specified in the input specification; GPT4Task Syn-converse s Tout in Figures 8b and 9b are not solvable by codes that would match the input specification. GPT4Task Syn-fewshot shows that changing the prompting strategy does not help with increasing the performance. In summary, these results demonstrate the effectiveness of Neur Task Syn in synthesizing high-quality visual programming tasks for real-world specifications and some Published in Transactions on Machine Learning Research (01/2024) a blocks While(b){ a blocks If(b){ a blocks } a blocks } a blocks } a blocks is a body of basic action blocks from the set {move, turn Left, turn Right, pick Marker, put Marker} b is a boolean condition from {path Ahead, path Left, path Right, no-path Ahead, marker Present, no-marker Present} Number of blocks should be 10 (a) Task synthesis specification ψin, with ψin IO (left), ψin sketch (middle) and ψin (right). While(path Ahead){ If(marker Present){ pick Marker } move } } (b) Tout IO , Cout by GPT4Task Syn While(no-path Ahead){ If(path Right){ pick Marker } turn Right turn Right move turn Left } pick Marker move } (c) Tout IO , Cout by Base Task Syn While(path Ahead){ pick Marker If(path Left){ put Marker } move } turn Right } (d) Tout IO , Cout by Neur Task Syn While(no-path Ahead){ If(marker Present){ pick Marker } turn Left move turn Right move } } (e) Tout IO , Cout by Expert Figure 9: Illustrative example showcasing task synthesis inspired by the Stairway Karel task Pattis et al. (1995); Code HS (2012b;a). This specification seeks to synthesize tasks where a solution code has the {While{If}} structure. (a) Task synthesis specification ψin := (ψin IO, ψin sketch, ψin ) is provided as input: ψin IO is a pregrid-postgrid pair with unitialized grid of size 12x12 ; ψin sketch along with ψin specify constraints on code solutions of a synthesized task. (b d) show tasks Tout and codes Cout used as intermediate step to generate output tasks by three techniques. (e) shows task Tout and code Cout based on Stairway. See Section 5. of the challenges in synthesizing visual programming tasks by state-of-the-art neural generative models as the synthesis process requires logical, spatial, and programming skills. 6 Concluding Discussions We developed a novel neuro-symbolic technique, Neur Task Syn, that can synthesize visual programming tasks for a given specification. We demonstrated both the efficiency and effectiveness of Neur Task Syn through an extensive evaluation on synthetically generated and reference tasks picked from popular visual programming environments. We believe our proposed technique has the potential to enhance introductory programming education by synthesizing personalized content. Moreover, we showcase the challenges that purely neural generative models and purely symbolic models face. Base Task Syn, a purely symbolic model, can lead to semantic irregularities in codes and low-quality tasks. On the other hand, purely generative techniques based on GPT-4 struggle to generate high-quality puzzles. This result aligns with findings in contemporary studies that GPT-4 may face challenges in code execution, symbolic operations, test-case generation, and task synthesis (Bubeck et al., 2023; Phung et al., 2023b). Published in Transactions on Machine Learning Research (01/2024) There are many interesting directions for future work. First, we have built our LSTM/CNN-based architecture specialized for task synthesis objectives; it would be interesting to fine-tune models like GPT-4 to improve its capabilities for synthesizing visual programming tasks. Moreover, we can analyze the logical and spatial abilities of these fine-tuned models by curating benchmarks in visual programming domains. Second, our methodology focused on visual programming; it would be interesting to develop generative models for synthesizing tasks in other programming domains, such as synthesizing Python problems that match a certain input-output configuration and other specifications regarding the possible solution codes. Third, our evaluation study considered various objectives capturing human-centered aspects along with expert annotations; in the future, it would also be useful to conduct studies with human learners to evaluate the quality in terms of perceived difficulty or interpretability of synthesized tasks. Acknowledgements Funded/Co-funded by the European Union (ERC, TOPS, 101039090). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. Umair Z. Ahmed, Sumit Gulwani, and Amey Karkare. Automatically Generating Problems and Solutions for Natural deduction. In IJCAI, 2013. Umair Z. Ahmed, Maria Christakis, Aleksandr Efremov, Nigel Fernandez, Ahana Ghosh, Abhik Roychoudhury, and Adish Singla. Synthesizing Tasks for Block-based Programming. In Neur IPS, 2020. Chris Alvin, Sumit Gulwani, Rupak Majumdar, and Supratik Mukhopadhyay. Synthesis of Geometry Proof Problems. In AAAI, 2014. Daman Arora, Himanshu Gaurav Singh, and Mausam. Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models. Co RR, abs/2305.15074, 2023. David Baidoo-Anu and Leticia Owusu Ansah. Education in the Era of Generative Artificial Intelligence (AI): Understanding the Potential Benefits of Chat GPT in Promoting Teaching and Learning. Available at SSRN 4337484, 2023. Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. Deep Coder: Learning to Write Programs. In ICLR, 2017. Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. A Multitask, Multilingual, Multimodal Evaluation of Chatgpt on Reasoning, Hallucination, and Interactivity. Co RR, abs/2302-04023, 2023. Jean-Pierre Briot, Gaëtan Hadjeres, and François-David Pachet. Deep Learning Techniques for Music Generation. Springer, 2020. Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Túlio Ribeiro, and Yi Zhang. Sparks of Artificial General Intelligence: Early Experiments with GPT-4. Co RR, abs/2303.12712, 2023. Rudy Bunel, Matthew J. Hausknecht, Jacob Devlin, Rishabh Singh, and Pushmeet Kohli. Leveraging Grammar and Reinforcement Learning for Neural Program Synthesis. In ICLR, 2018. Mark Chen et al. Evaluating Large Language Models Trained on Code. Co RR, abs/2107-03374, 2021. Code HS. Code HS.com: Teaching Coding and Computer Science. https://codehs.com/, 2012a. Published in Transactions on Machine Learning Research (01/2024) Code HS. Intro to Programming with Karel the Dog. https://codehs.com/info/curriculum/introkarel, 2012b. Code.org. Code.org: Learn Computer Science. https://code.org/, 2013a. Code.org. Hour of Code: Classic Maze Challenge. https://studio.code.org/s/hourofcode, 2013b. Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. Robust Fill: Neural Program Learning under Noisy I/O. In ICML, 2017. Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang. Musegan: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment. In AAAI, 2018. Aleksandr Efremov, Ahana Ghosh, and Adish Singla. Zero-shot Learning of Hint Policy via Reinforcement Learning and Program Synthesis. In EDM, 2020. Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. Codebert: A Pre-Trained Model for Programming and Natural Languages. In EMNLP, 2020. James Finnie-Ansley, Paul Denny, Brett A. Becker, Andrew Luxton-Reilly, and James Prather. The Robots Are Coming: Exploring the Implications of Open AI Codex on Introductory Programming. In ACE, 2022. Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. Incoder: A Generative model for Code Infilling and Synthesis. Co RR, abs/2204.05999, 2022. Wenhao Gao and Connor W Coley. The Synthesizability of Molecules Proposed by Generative Models. Journal of Chemical Information and Modeling, 60(12):5714 5723, 2020. Ahana Ghosh, Sebastian Tschiatschek, Sam Devlin, and Adish Singla. Adaptive Scaffolding in Block-Based Programming via Synthesizing New Tasks as Pop Quizzes. In AIED, 2022. Sumit Gulwani. Example-based Learning in Computer-aided STEM Education. Communications of the ACM, 57(8):70 80, 2014. Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, et al. Program Synthesis. Foundations and Trends in Programming Languages, 2017. Kavi Gupta, Peter Ebert Christensen, Xinyun Chen, and Dawn Song. Synthesize, Execute and Debug: Learning to Repair for Neural Program Synthesis. In Neur IPS, 2020. Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9, 1997. Jie Huang and Kevin Chen-Chuan Chang. Towards Reasoning in Large Language Models: A Survey. Co RR, abs/2212-10403, 2022. Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large Language Models Cannot Self-Correct Reasoning Yet. Co RR, abs/2310.01798, 2023. Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert Mc Hardy. Challenges and Applications of Large Language Models. Co RR, abs/2307.10169, 2023. Bilal Kartal, Nick Sohre, and Stephen J. Guy. Data Driven Sokoban Puzzle Generation with Monte Carlo Tree Search. In AIIDE, 2016. James C. King. Symbolic Execution and Program Testing. Communications of ACM, 19(7):385 394, 1976. Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo Planning. In ECML, 2006. Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu-Hong Hoi. Coderl: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning. In Neur IPS, 2022. Published in Transactions on Machine Learning Research (01/2024) Juho Leinonen, Arto Hellas, Sami Sarsa, Brent N. Reeves, Paul Denny, James Prather, and Brett A. Becker. Using Large Language Models to Enhance Programming Error Messages. In SIGCSE, 2023. Yujia Li et al. Competition-Level Code Generation with Alphacode. Co RR, abs/2203.07814, 2022. Weng Marc Lim, Asanka Gunasekara, Jessica Leigh Pallant, Jason Ian Pallant, and Ekaterina Pechenkina. Generative AI and the Future of Education: Ragnarök or Reformation? A Paradoxical Perspective from Management Educators. The International Journal of Management Education, 21(2):100790, 2023. Open AI. Chat GPT. https://openai.com/blog/chatgpt, 2023a. Open AI. GPT-4 Technical Report. Co RR, abs/2303.08774, 2023b. Richard E Pattis, Jim Roberts, and Mark Stehlik. Karel the Robot: A Gentle Introduction to the Art of Programming. John Wiley & Sons, Inc., 1995. Tung Phung, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and Gustavo Soares. Generating high-precision feedback for programming syntax errors using large language models. In EDM, 2023a. Tung Phung, Victor-Alexandru Padurean, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and Gustavo Soares. Generative AI for Programming Education: Benchmarking Chatgpt, Gpt-4, and Human Tutors. In ICER V.2, 2023b. Chris Piech, Mehran Sahami, Jonathan Huang, and Leonidas J. Guibas. Autonomously Generating Hints by Inferring Problem Solving Policies. In L@S, 2015. Oleksandr Polozov, Eleanor O Rourke, Adam M. Smith, Luke Zettlemoyer, Sumit Gulwani, and Zoran Popovic. Personalized Mathematical Word Problem Generation. In IJCAI, 2015. Thomas W. Price and Tiffany Barnes. Position Paper: Block-Based Programming Should Offer Intelligent Support for Learners. IEEE Blocks and Beyond Workshop, 2017. Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. In ICML, 2021. Mitchel Resnick, John H. Maloney, Andrés Monroy-Hernández, Natalie Rusk, Evelyn Eastmond, Karen Brennan, Amon Millner, Eric Rosenbaum, Jay S. Silver, Brian Silverman, and Yasmin B. Kafai. Scratch: Programming for All. Communications of ACM, 52(11):60 67, 2009. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR, 2022. Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. Automatic Generation of Programming Exercises and Code Explanations Using Large Language Models. In ICER, 2022. Petra Schneider, W Patrick Walters, Alleyn T Plowright, Norman Sieroka, Jennifer Listgarten, Robert A Goodnow Jr, Jasmin Fisher, Johanna M Jansen, José S Duca, Thomas S Rush, et al. Rethinking Drug Design in the Artificial Intelligence Era. Nature Reviews Drug Discovery, 19(5):353 364, 2020. Richard Shin, Neel Kant, Kavi Gupta, Chris Bender, Brandon Trabucco, Rishabh Singh, and Dawn Song. Synthetic Datasets for Neural Program Synthesis. In ICLR, 2019. Rohit Singh, Sumit Gulwani, and Sriram K. Rajamani. Automatically Generating Algebra Problems. In AAAI, 2012. Minhyang Suh, Emily Youngblom, Michael Terry, and Carrie J. Cai. AI as Social Glue: Uncovering the Roles of Deep Generative AI during Social Music Composition. In CHI, 2021. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT press, 2018. Published in Transactions on Machine Learning Research (01/2024) Tamara Tate, Shayan Doroudi, Daniel Ritchie, and Ying Xu. Educational Research and AI-Generated Writing: Confronting the Coming Tsunami. 2023. Alperen Tercan, Ahana Ghosh, Hasan Ferit Eniser, Maria Christakis, and Adish Singla. Synthesizing a Progression of Subtasks for Block-Based Visual Programming Tasks. Co RR, abs/2305.17518, 2023. Xiaochu Tong, Xiaohong Liu, Xiaoqin Tan, Xutong Li, Jiaxin Jiang, Zhaoping Xiong, Tingyang Xu, Hualiang Jiang, Nan Qiao, and Mingyue Zheng. Generative Models for De Novo Drug Design. Journal of Medicinal Chemistry, 64(19):14011 14027, 2021. Karthik Valmeekam, Alberto Olmo Hernandez, Sarath Sreedharan, and Subbarao Kambhampati. Large Language Models Still Can t Plan (A Benchmark for LLMs on Planning and Reasoning about Change). Co RR, abs/2206-10498, 2022. W. Walters and Mark Murcko. Assessing the Impact of Generative AI on Medicinal Chemistry. Nature Biotechnology, 38(2):143 145, 2020. Yue Wang, Weishi Wang, Shafiq R. Joty, and Steven C. H. Hoi. Code T5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In EMNLP, 2021. Mike Wu, Milan Mosse, Noah D. Goodman, and Chris Piech. Zero Shot Learning for Code Education: Rubric Sampling with Deep Learning Inference. In AAAI, 2019. Pengcheng Yin and Graham Neubig. A Syntactic Neural Model for General-Purpose Code Generation. In ACL, 2017. Rui Zhi, Thomas W. Price, Samiha Marwan, Alexandra Milliken, Tiffany Barnes, and Min Chi. Exploring the Impact of Worked Examples in a Novice Programming Environment. In SIGCSE, 2019. Published in Transactions on Machine Learning Research (01/2024) A Table of Contents In this section, we provide a brief description of the content provided in the appendices of the paper. Appendix B provides a discussion of the broader impact of our work and compute resources used. Appendix C presents the details about the generation of the illustrative examples from Figures 8 and 9, and shows their relationship with metrics O1-O4 described in Section 2. Appendix D gives more insights into the architecture described in Section 3. Appendix E provides additional details about results, the scoring functions, the synthetic dataset creation process, and the training process in Section 4. Appendix F provides the source task/code pairs used for creating the real-world task specifications in Section 5. It also provides more insights into the interaction with GPT-4. B Discussion Broader impact. This paper develops new techniques which have the potential of being used for improving pedagogy in visual programming environments. On the existing platforms, content is hand-curated by tutors, offering limited resources for students to practice on. We aim to tackle this challenge by synthesizing novel practice tasks that match a desired level of difficulty with regard to exercised content for a student. We believe our proposed technique has the potential to drastically enhance introductory programming education by synthesizing personalized content for students. Compute resources. All the experiments were conducted on a cluster of machines equipped with Intel Xeon Gold 6142 CPUs running at a frequency of 2.60GHz. Published in Transactions on Machine Learning Research (01/2024) C Illustrative Examples: Details In this section, we discuss the details regarding the generation and scoring for each of the techniques output in Figure 8 for Ho CMaze and Figure 9 for Karel. C.1 Example for Ho CMaze in Figure 8 We present Tout, along with Cout for each of the techniques with ψin as input in Figure 11; this figure expands on Figure 8 with additional details. We give additional explanations regarding how each of the techniques output respects or not metrics O1-O3 (see Sections 2 and 5) in Figure 10. Generation/adjustment for GPT4Task Syn in Figure 8. When querying GPT-4 for this example, we set ψin IO as an empty 8x8 grid. We then expand the generated grid to a 12x12 grid and manually integrate the pattern seen in Figure 11a to match the specification. We discuss the results obtained by GPT4Task Syn-converse. Generation/adjustment for Base Task Syn and Neur Task Syn in Figure 8. The neural model for puzzle generation is trained on 16x16 grids, yet the symbolic engine can support the existence of preinitialized grids. Thus, we mask the upper-left part of the grid (4 rows and 4 columns), obtaining the 12x12 workspace for the technique. On top of that, on the remaining 12x12 grid, we pre-initialize the pattern seen in Figure 11a (i.e., lower-left and upper-right bounded squares). Generation/adjustment for Expert in Figure 8. The output of Expert represents a manual adaptation of the Ho C:Maze18 task to expand it to a 12x12 grid and to integrate the pattern seen in in Figure 8a. Technique O1:Validity O2:Concepts O3:Trace O4:Overall Cout solves Tout GPT4Task Syn 1 0 0 0 0 Base Task Syn 1 0 0 0 1 Neur Task Syn 1 1 1 1 1 Expert 1 1 1 1 1 Figure 10: Scores showing whether the output Tout for ψin of each technique respects the four metrics O1-O4, with the additional Cout solves Tout metric, for this Ho CMaze example. We provide explanations for each 0 entry for the objectives O1-O3 in Figure 10: GPT4Task Syn for Objective O2: The only possible solution code C has depth 4 and uses 3 constructs (nested If Else is needed), i.e., ψin sketch is not respected, hence O2 is 0. GPT4Task Syn for Objective O3: There is no solution C that respects ψin sketch, hence O3 is 0 by definition. GPT4Task Syn for Cout solves Tout: Cout, when executed on Tout IO , makes the avatar crash into a wall, hence Cout is not a solution for Tout Base Task Syn for Objective O2: The If Else block employed by Cout is not required. This implies that there is a solution code C which has a lower complexity, i.e., a smaller depth and less constructs than required by ψin sketch, hence O2 is 0. Base Task Syn for Objective O3: As the employed If Else block is not required, it is possible to design a solution code that uses If Else with a different conditional (e.g., If(path Left)Else) for which the body would never be executed, hence O3 is 0. Published in Transactions on Machine Learning Research (01/2024) a blocks Repeat Until(goal){ a blocks If(b){ a blocks } Else{ a blocks } a blocks } } a blocks is a body of basic action blocks from the set {move, turn Left, turn Right} b is a boolean condition from {path Ahead, path Left, path Right} Number of blocks should be 7 (a) Task synthesis specification ψin Allowed blocks: {move, turn Left, turn Right, Repeat Until, If Else} Maximum size: 7 (b) Tout by GPT4Task Syn Repeat Until(goal){ If(path Left){ turn Left move } Else{ move } } } (c) Cout by GPT4Task Syn Allowed blocks: {move, turn Left, turn Right, Repeat Until, If Else} Maximum size: 7 (d) Tout by Base Task Syn turn Left turn Left Repeat Until(goal){ If(path Right){ move } Else{ (e) Cout by Base Task Syn Allowed blocks: {move, turn Left, turn Right, Repeat Until, If Else} Maximum size: 7 (f) Tout by Neur Task Syn Repeat Until(goal){ If(path Right){ turn Right } Else{ turn Left move } move } } (g) Cout by Neur Task Syn Allowed blocks: {move, turn Left, turn Right, Repeat Until, If Else} Maximum size: 5 (h) Tout by Expert Repeat Until(goal){ If(path Ahead){ move } Else{ turn Left } } } (i) Cout by Expert Figure 11: Illustration containing the tuple Tout for each technique, along with Cout, for this Ho CMaze example. Published in Transactions on Machine Learning Research (01/2024) C.2 Example for Karel in Figure 9 We present Tout, along with Cout for each of the techniques with ψin as input in Figure 13; this figure expands on Figure 9 with additional details. We give additional explanations regarding how each of the techniques output respects or not metrics O1-O3 (see Sections 2 and 5) in Figure 12. Generation/adjustment for GPT4Task Syn in Figure 9. For this example, we set ψin IO as an empty 12x12 grid and query GPT-4. We discuss the results obtained by GPT4Task Syn-converse. Generation/adjustment for Base Task Syn and Neur Task Syn in Figure 9. The neural model for puzzle generation is trained on 16x16 grids, yet the symbolic engine can support the existence of pre-initialized grids. Thus, we mask the upper-left part of the grid (4 rows and 4 columns), obtaining the 12x12 workspace for the technique. Generation/adjustment for Expert in Figure 9. The output of Expert for this example is based on the Karel:Stairway task. Technique O1:Validity O2:Concepts O3:Trace O4:Overall Cout solves Tout GPT4Task Syn 0 0 0 0 0 Base Task Syn 1 0 1 0 1 Neur Task Syn 1 1 1 1 1 Expert 1 1 1 1 1 Figure 12: Scores showing whether the output Tout for ψin of each technique respects the six metrics O1-O4, with the additional Cout solves Tout metric, for this Karel example. We provide explanations for each 0 entry for the objectives O1-O3 in Figure 12: GPT4Task Syn for Objective O1: Tout IO cannot be solved with any code respecting Tout code, hence O1 is 0. GPT4Task Syn for Objective O2: Tout IO cannot be solved with any code respecting Tout code, hence O2 is 0. GPT4Task Syn for Objective O3: Tout IO cannot be solved with any code respecting Tout code, hence O3 is 0. GPT4Task Syn for Cout solves Tout: The generated code Cout does not solve Tout, i.e., the pregrid is not transformed into the postgrid after code execution. Base Task Syn for Objective O2: The employed If block is not required. This implies that there is a solution code C which has a lower complexity, i.e., a smaller depth and less constructs than required by ψin sketch, hence O2 is 0. Published in Transactions on Machine Learning Research (01/2024) a blocks While(b){ a blocks If(b){ a blocks } a blocks } a blocks } a blocks is a body of basic action blocks from the set {move, turn Left, turn Right, pick Marker, put Marker} b is a boolean condition from {path Ahead, path Left, path Right, no-path Ahead, marker Present, no-marker Present} Number of blocks should be 10 (a) Task synthesis specification ψin Allowed blocks: {move, turn Left, turn Right, pick Marker, put Marker, While, If} Maximum size: 10 (b) Tout by GPT4Task Syn While(path Ahead){ If(marker Present){ pick Marker } move } } (c) Cout by GPT4Task Syn Allowed blocks: {move, turn Left, turn Right, pick Marker, put Marker, While, If} Maximum size: 10 (d) Tout by Base Task Syn While(no-path Ahead){ If(path Right){ pick Marker } turn Right turn Right move turn Left } pick Marker move } (e) Cout by Base Task Syn Allowed blocks: {move, turn Left, turn Right, pick Marker, put Marker, While, If} Maximum size: 7 (f) Tout by Neur Task Syn While(path Ahead){ pick Marker If(path Left){ put Marker } move } turn Right } (g) Cout by Neur Task Syn Allowed blocks: {move, turn Left, turn Right, pick Marker, put Marker, While, If} Maximum size: 8 (h) Tout by Expert While(no-path Ahead){ If(marker Present){ pick Marker } turn Left move turn Right move } } (i) Cout by Expert Figure 13: Illustration containing the tuple Tout for each technique, along with Cout, for this Karel example. Published in Transactions on Machine Learning Research (01/2024) D Our Synthesis Technique Neur Task Syn: Details Next, we give additional details regarding each module of our architecture. We present the interaction between the neural models and the underlying symbolic engines, the neural architecture, and the training procedures. D.1 Generating the Solution Code Cout Code generator visualization. We describe the interaction between the neural model and the underlying symbolic engine for the code generator. For better understanding, we use one concrete example, illustrated in Figure 14. We consider the AST at time t as presented in Figure 14a, where the previously taken decision was the addition of the turn Left token. The symbolic engine continues its depth-first traversal of the AST and now needs to take the next decision for the subsequent a blocks . This is achieved by interrogating the neural component; interaction demonstrated in Figure 14b. We introduce the notion of a budget, which represents the number of available blocks that can be added to the AST so that the size specified by ψin is respected; in our example, the remaining budget is 2. It is passed as input for the neural model at time t. The neural model, based on the budget and its internal state, which keeps track of the previously taken decisions, outputs a logit for each decision, i.e., a set of logits Ldict. The symbolic engine accepts Ldict and masks them according to the rules in the DSL, thus obtaining Lmasked dict . In our example, the only values in Lmasked dict that are not masked are those of basic action blocks (i.e., move, turn Left, turn Right) and the token that represents the end of the Else body. After mapping the logits to a probability distribution, the symbolic engine proceeds to sample a decision from it. In our example, move is sampled. The decision is passed to the neural model to update its internal state. The symbolic engine then updates the AST with the taken decision (i.e., move), thus obtaining the updated version of the AST for the next step at time t + 1, illustrated in Figure 14c. We generalize this process to every decision that needs to be taken while traversing the AST. Repeat Until(goal){ If(path Right){ turn Right } Else{ turn Left a blocks } a blocks } } (a) Code status at time t Symbolic Engine Neural Model (2) logits Ldict (3) sample(Lmasked dict ) (b) Interaction at time t Repeat Until(goal){ If(path Right){ turn Right } Else{ turn Left move a blocks } a blocks } } (c) Code status at time t + 1 Figure 14: Visualization of the interaction process between the neural model and the symbolic engine in the code generator component of Neur Task Syn. (a) shows the AST at time t, where the first a blocks needs to be decided. (b) shows the interaction between the symbolic engine and the neural model at time t, where the symbolic engine first passes the budget (available blocks) to the neural model, the neural model computes the logits for all the tokens in the dictionary Ldict and passes them back to the symbolic engine, which finally masks them obtaining Lmasked dict , applies softmax to obtain a probability and samples the next action, sending it to the neural model. (c) shows the AST at time t + 1, where the sampled move was integrated. Imitation learning procedure. We will now give details about the learning procedure we used for the neural model. Given the fact that dataset D := {ψin} is accompanied by example codes, i.e., Cin for each ψin, we employ an imitation (supervised) learning approach, similar to (Devlin et al., 2017; Bunel et al., 2018). Thus, for each decision, we compute the cross-entropy with respect to the target decision. We force the agent to take the target decision afterward so the generated code does not digress from our example code. Published in Transactions on Machine Learning Research (01/2024) Neural architecture. Here, we present in detail the architecture of the neural model we employ for code generation. Similar to (Bunel et al., 2018), we employ an LSTM-based (Hochreiter & Schmidhuber, 1997) recurrent neural network. We first convert code tokens to indexes based on a dictionary, then we pass them through an embedding layer. We do the same with the numeric representation of the budget (introduced previously). We concatenate both embeddings and pass them through a two-layer LSTM. Last, we convert the output of the LSTM to logits for each entry in the dictionary using a linear layer. The architecture can be observed in Figure 15. Input Code token categorical (0-58) Budget ordinal (0-16) Embedding Size = 256 Size = 16 LSTM 1 Hidden dim = 256 LSTM 2 Hidden dim = 256 Linear Hidden dim Dict size = 256 59 Figure 15: Architecture of the neural model used by the code generator. Published in Transactions on Machine Learning Research (01/2024) D.2 Generating the Visual Puzzle Tout IO Reinforcement learning procedure. We describe further details necessary for training our reinforcement learning (RL) agent. To learn a policy, we use policy gradient methods. These methods generally learn by using gradient ascent, thus updating the parameters θ of the parameterized policy πθ(a|s) to increase the expected reward of the policy in the MDP. Naturally, a neural network can be used to learn the policy, where θ represents the network s weights. The network would take action a and state s as input, outputting a logit Hθ(a|s). Given the logits, we map them to a probabilistic distribution using softmax: πθ(a|s) = exp (Hθ(a|s)) P a As exp (Hθ(a |s)). We use an actor-critic policy gradient method for agent training (Sutton & Barto, 2018). We denote with ˆv(s, w) the value for state s predicted by the critic with parameters w. As we operate on batches, the parameters of both the actor and the critic remain unchanged until a buffer is filled with a fixed number of episodes. Thus, for an initial state s0 (i.e., an empty or pre-initialized task and a code, with the emulator reset), we execute the existing policy πθ until the buffer is filled, generating several sequences of experience as tuples (st, at, rt)t=0..T , where T represents a variable episode length. Thus, the losses for an episode are computed as a sum over the timesteps t [0, T] as follows, for the actor (Equation 1) and for the critic (Equation 2, employing the smooth L1 loss, denoted as L1smooth): τ=t rτ ˆv(st, w) θ log(πθ(at|st)) (1) t=0 L1smooth τ=t rτ, ˆv(st, w) Finally, θ and w are updated by using the computed losses over the entire batch, multiplied with a learning rate. Neural architecture. We describe the architecture of the CNN-based neural model used by the puzzle generator. We employ a similar architecture for both the Ho CMaze and Karel domains, as presented in Figure 16. Only the input size for the grid (i.e., D 16 16, where D = 12 for Ho CMaze and D = 14 for Karel) and code features (i.e., F, where F = 9 for Ho CMaze and F = 12 for Karel) differ. We process the grid by 3 CNN blocks (i.e., one block composed of Conv2D, Re LU, and Max Pooling2D layers), after which we apply 5 fully connected (linear) layers, thus obtaining the grid embedding. To the grid embedding, we concatenate the code features, which are represented in a binary manner (e.g., increase in coverage, current decision type). We then pass the concatenated tensor through an additional fully-connected layer, and its output is then passed to both the action head and the value head (i.e., necessary for the Actor-Critic algorithm). Published in Transactions on Machine Learning Research (01/2024) Input D 16 16 Grid Code features, Size = F CNN Block 1 Conv2D, kernel size = 3, padding 1, 64 64 Re LU Max Pool2D, kernel size = 2, padding 0 CNN Block 2 Conv2D, kernel size = 3, padding 1, 64 64 Re LU Max Pool2D, kernel size = 2, padding 0 CNN Block 3 Conv2D, kernel size = 3, padding 1, 64 64 Re LU Max Pool2D, kernel size = 2, padding 0 Linear 1 CNN output size (256) 1024 Linear 2 1024 512 Linear 3 512 256 Linear 4 256 128 Linear 5 128 32 Linear 6 Concatenated features size (32 + F) 8 Linear 7 (Action and Value) 8 action space 8 1 Figure 16: Architecture of the neural model used by the puzzle generator. D denotes the depth of the input grid and F denotes the size of the code features tensor, both different for each of the Ho CMaze and Karel domains. Published in Transactions on Machine Learning Research (01/2024) E Experimental Evaluation with Synthetic Task Specifications: Details In this section, we detail the instantiations of the scoring functions, give more insight into the synthetic dataset creation process, and show the details of the training processes for both the code generator and the task generator. E.1 Scoring Function Fscore Next, we describe the two instantiations for Fscore as used in the two domains Ho CMaze and Karel. We adopt a scoring function similar to that of (Ahmed et al., 2020), where Fscore is used for guiding a Monte Carlo Tree Search, as an evaluation function that describes the desired properties of their system s output. We note that our method can work with any other instantiation of the scoring function Fscore. The instantiations we use for Fscore for each of the Ho CMaze and Karel domains are defined in Equations 3 and 4 and are comprised of different components: (i) Fcov(Tout IO , Cout) [0, 1] computes the coverage ratio, i.e., ratio of executed blocks to total number of blocks; (ii) Fsol(Tout IO , Cout) {0, 1} evaluates to 1 if Cout correctly solves Tout IO , i.e., no crashing, reaching the goal/converting the pre-grid to the post-grid; (iii) Fnocross(Tout IO , Cout) [0, 1] computes the ratio of cells visited exactly once with regard to the total number of visited cells; (iv) Fnocut(Tout IO , Cout) {0, 1} evaluates to 0 if there is a shortcut sequence comprised of basic actions; (v) Fnotred(Tout IO , Cout) {0, 1} evaluates to 0 if there are redundant action sequences in Cout, e.g., sequences like turn Left, turn Right, or if the codes obtained by eliminating one action, loop or conditional from Cout solves Tout IO ; (vi) Fqual(Tout IO , Cout) [0, 1] evaluates the visual quality of Tout IO as per Equation 5; (vii) Fcutqual(Tout IO , Cout) [0, 1] evaluates visual quality of the shortest path made only of basic actions, similar to Fqual. We set α1 = α2 = 1 2 and α3 = α4 = α5 = 1 FHo CMaze score (Tout, Cout) =1 h Fcov(Tout IO , Cout) = 1, Fsol(Tout IO , Cout) = 1, Fnocross(Tout IO , Cout) = 1, Fnocut(Tout IO , Cout) = 1, Fnotred(Tout IO , Cout) = 1 i h α1Fcov(Tout IO , Cout) + α2Fqual(Tout IO , Cout) i (3) FKarel score (Tout, Cout) =1 h Fcov(Tout IO , Cout) = 1, Fsol(Tout IO , Cout) = 1, Fnocross(Tout IO , Cout) = 1, Fnocut(Tout IO , Cout) = 1, Fnotred(Tout IO , Cout) = 1 i h α3Fcov(Tout IO , Cout) + α4Fqual(Tout IO , Cout) + α5Fcutqual(Tout IO , Cout) i (4) We use the same measure of visual quality for both domains, keeping into account the number of moves, turns, segments, long-segments, and turn-segments, as explained next. More specifically, segments and long-segments correspond to consecutive sequences of moves containing more than 3 and 5 actions, respectively; turn-segments correspond to consecutive sequences of turn Left or turn Right containing more than 3 actions. The formula for Fqual is given in Equation 5 below; we also clip the values for each counter # w.r.t. its corresponding normalization factor (not depicted here for brevity). Fqual(Tout, Cout) =3 2n + #turns n + #segments n/2 + #long-segments 1 4 1 #turn-segments E.2 Synthetic Task Specifications Domain-specific elements. As introduced in Sections 2 and 4, we use two DSLs shown in Figures 17a and 17b, adapted from the DSLs in (Bunel et al., 2018; Ahmed et al., 2020). Published in Transactions on Machine Learning Research (01/2024) code C := def Run () do y rule y := s | g | s; g rule s := a | s; s | If (b) do s | If (b) do s Else s | Repeat (x) do s rule g := Repeat Until (goal) do s action a := move | turn Left | turn Right bool b := path Ahead | path Left | path Right iter x := 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 (a) DSL for Ho CMaze domain code C := def Run () do s rule s := a | s; s | If (b) do s | If (b) do s Else s | While (b) do s | Repeat (x) do s action a := move | turn Left | turn Right | put Marker | pick Marker bool b := path Ahead | path Left | path Right | no-path Ahead | marker Present | no-marker Present iter x := 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 (b) DSL for Karel domain Domain All Easy Hard (depth, constructs) (depth, constructs) (1, 0) (2, 1) (2, 2) (3, 2) (3, 3) Ho CMaze 1, 016 183 69 47 136 581 Karel 1, 027 300 155 277 295 0 (c) Dataset of synthetic task specifications Figure 17: (a) Synthetic datasets used for training/evaluation in Section 4). (b) DSLs for two domains. Algorithm 1: Specification Dataset Collection Procedure Input: list S of tuples (code structure s, required size l); maximum candidate set size m; D ; /* Dataset initialized to empty set */ foreach (s, l) S do C ; /* Candidate set initialized to empty set */ while size(C) < m do code Generate Code(s); task Task Oracle(code); score Fscore(task, code); if score > 0 then add (code, task, score) to C; sort C according to the score, in decreasing order; counter 0; while counter < l and C = do (code, task, score) Pop(C); accept Inspect(task, code) ; /* Inspection step */ if accept then ψ Extract Specs(code); add ψ to D; counter counter+1; Output: Dataset D; Dataset. We follow Algorithm 1 to create dataset D. For each code structure, we generate a set of candidate codes and obtain an oracle task for these codes. We filter them out if a low-quality task is obtained, supplementing this filtering with an additional inspection step. This inspection step is necessary because semantic irregularities (e.g., If Else with the same If and Else bodies) can get past the previous filtering step. As the compute and implementation efforts are larger for an automatic system that would detect such irregularities, which are easy to spot, we opt for a direct inspection step. Figure 17c provides a summary of datasets D for each domain. Published in Transactions on Machine Learning Research (01/2024) E.3 Training Process Training the code generator. We employ a standard approach, using an imitation (supervised) form of learning, with a cross-entropy loss for an LSTM-based architecture (see Appendix D) (Devlin et al., 2017; Bunel et al., 2018). We augment Dtrain by adding all the possible combinations of construct instantiations for a given code. The training plots and the hyperparameters used can be seen in Figure 18. We report the validation performance smoothed via an exponential decay function, and the batch loss averaged over one epoch. 0 10 20 30 40 50 60 Epoch Ho CMaze:Validation Performance Ho CMaze:Training Loss (a) Ho CMaze 0 20 40 60 80 100 Epoch Karel:Validation Performance Karel:Training Loss Hyper Parameter Ho CMaze Karel Max. Epochs 60 100 Batch Size 32 32 Learning Rate 5 10 4 5 10 4 Dict. Size 59 59 Max. Blocks 17 17 (c) Hyperparameters Figure 18: Illustration of training details for the code generator. (a) and (b) show the training curves with mean epoch loss and validation performance, based on metric M, for both the Ho CMaze and Karel domains. (c) shows the hyperparameters employed for the code generator training. Training the puzzle generator. We use an RL procedure, using the instantiations of Fscore as rewards. We augment the RL training set with additional codes produced by the previously trained code generator. To encourage higher quality tasks, we use a form of curriculum as follows: after a certain epoch, we give a reward larger than 0 only if the ratio between the scores of the output task and the Task Oracle s task is larger than a factor bλ2; we gradually increase bλ2 from 0.8 to 0.9. For Karel, we also employ a temperature parameter during training, encouraging exploration during inference. The training plots and the hyperparameters used can be seen in Figures 19. We report the validation performance (i.e., same metric employed for Neur Puzzle Gen) smoothed via an exponential decay function, and the batch reward averaged over one epoch. 0 50 100 150 200 Epoch Ho CMaze:Validation Performance Ho CMaze:Training Reward (a) Ho CMaze 0 50 100 150 Epoch Karel:Validation Performance Karel:Training Reward Hyper Parameter Ho CMaze Karel Max. Epochs 500 500 Batch Size 32 32 Learning Rate 10 4 10 4 Temperature 1 0.5 Curriculum True True (c) Hyperparameters Figure 19: Illustration of training details for the puzzle generator. (a) and (b) show the training curves with mean epoch reward and validation performance, based on metric M for both the Ho CMaze and Karel domains. A form of curriculum learning was employed, which explains the lack of general monotonicity for the reward. (c) shows the hyperparameters employed for the puzzle generator training. Further implementation details. We limit the number of possible initial locations for a grid to one representative per quadrant. In total, we consider 5 quadrants (i.e., top-left, bottom-left, center, top-right, bottom-right). We do this to limit the action space to a more tractable amount for a variable grid size. With 5 quadrants and 4 possible orientations, this leads to 5 4 = 20 possible initial location/orientation pairs, offering already enough variability. Published in Transactions on Machine Learning Research (01/2024) Detailed results on synthetic task specifications. Figure 20 reports evaluation results for different techniques for a fixed number of code/puzzle rollouts across two domains and segments, complementary to Figure 5. Technique Ho CMaze Karel All Easy Hard All Easy Hard Base Task Sync:5,p:10 13.6 (1.3) 46.2 (4.1) 2.0 (1.3) 35.7 (1.0) 49.3 (0.9) 10.9 (2.7) Neur Task Sync:5,p:10 81.4 (3.7) 100.0 (0.0) 74.3 (5.1) 92.6 (1.4) 100.0 (0.0) 79.3 (3.9) Base Puzzle Genc:fix,p:10 55.6 (1.8) 91.7 (2.4) 41.9 (2.3) 71.8 (3.8) 86.6 (3.8) 45.0 (3.9) Neur Puzzle Genc:fix,p:10 78.4 (2.5) 100.0 (0.0) 70.3 (3.4) 79.8 (0.6) 92.0 (1.3) 57.7 (1.8) Figure 20: Results on synthetic task specifications for Ho CMaze and Karel; see Figure 17c and Section 4. Published in Transactions on Machine Learning Research (01/2024) F Experimental Evaluation with Real-World Task Specifications: Details F.1 Real-World Task Specifications In Figures 21 and 22 below, we list source tasks T and codes C for 10 task specifications mentioned in Figure 6. Allowed blocks: {move, turn Left, turn Right, Repeat} Maximum size: 5 (a) Source T for ψ0 move move turn Right } } (b) Source C for ψ0 Allowed blocks: {move, turn Left, turn Right, Repeat Until} Maximum size: 6 (c) Source T for ψ1 Repeat Until(goal){ turn Right move turn Left move } } (d) Source C for ψ1 Allowed blocks: {move, turn Left, turn Right, Repeat} Maximum size: 6 (e) Source T for ψ2 move } turn Left Repeat(5){ (f) Source C for ψ2 Allowed blocks: {move, turn Left, turn Right, Repeat Until, If Else} Maximum size: 5 (g) Source T for ψ3 Repeat Until(goal){ If(path Ahead){ move } Else{ turn Left } } } (h) Source C for ψ3 Allowed blocks: {move, turn Left, turn Right, Repeat Until, If} Maximum size: 7 (i) Source T for ψ4 Repeat Until(goal){ move If(path Left){ turn Left } If(path Right){ turn Right } } } (j) Source C for ψ4 Figure 21: Overview of Ho CMaze sources. Published in Transactions on Machine Learning Research (01/2024) Allowed blocks: {move, turn Left, turn Right, pick Marker, put Marker} Maximum size: 6 (a) Source T for ψ5 move move put Marker move move } (b) Source C for ψ5 Allowed blocks: {move, turn Left, turn Right, pick Marker, put Marker, While} Maximum size: 8 (c) Source T for ψ6 put Marker While(path Ahead){ move turn Left move turn Right put Marker } } (d) Source C for ψ6 Allowed blocks: {move, turn Left, turn Right, pick Marker, put Marker, While} Maximum size: 10 (e) Source T for ψ7 While(path Ahead){ move pick Marker } turn Left turn Left While(path Ahead){ move } turn Left turn Left } (f) Source C for ψ7 Allowed blocks: {move, turn Left, turn Right, pick Marker, put Marker, While, If} Maximum size: 8 (g) Source T for ψ8 While(no-path Ahead){ If(marker Present){ pick Marker } turn Left move turn Right move } } (h) Source C by ψ8 Allowed blocks: {move, turn Left, turn Right, pick Marker, put Marker, While, Repeat} Maximum size: 5 (i) Source T for ψ9 While(path Ahead){ move Repeat(4){ pick Marker } } } (j) Source C by ψ9 Figure 22: Overview of Karel sources. Published in Transactions on Machine Learning Research (01/2024) F.2 GPT4Task Syn We describe next the details of our interaction with GPT-4 for generating the visual puzzles Tout IO and the intermediate code Cout via the GPT4Task Syn techniques. Our interaction is conducted through the platform (Open AI, 2023a). We try several strategies and prompts to make GPT-4 work for synthesizing visual programming tasks, as it tends to struggle with logical and spatial reasoning. Thus, we opt for a two-stage task synthesis process which works the best. We first ask GPT-4 to generate a code Cout for ψin, by using 5 separate queries. For the GPT4Task Syn-converse technique, we start with an initial prompt and then use follow-up prompts to fix any mistakes, as GPT-4 occasionally ignores part of the specifications. The initial and follow-up prompts used for generating Cout are presented in Figures 23a and 24a. We select the best code generated during the 5 separate queries based on our expertise. The second stage comprises of additional 5 separate queries for generating Tout IO for the selected code Cout. Again, we start with an initial prompt and then use follow-up prompts to fix any issues. The follow-up prompts are necessary because GPT-4 tends to struggle with spatial orientation and with the relationship between Cout and Tout IO . The initial and follow-up prompts used for generating Tout IO are presented in Figures 23b and 24b. Similar to the code selection process, we select the best visual puzzle generated during the 5 separate queries based on our expertise. Once we get Cout and Tout IO , we set Tout code as for Base Task Syn and Neur Task Syn. For the GPT4Task Syn-fewshot, technique, we employ a few-shot approach for both stages, by first giving 3 synthesis examples (i.e., for code and task synthesis, respectively). We do not use follow-up prompts here. The few-shot prompts are presented in Figures 25 and 26. Published in Transactions on Machine Learning Research (01/2024) Code: Initial prompt I am working in the block-based visual programming domain of Hour of Code: Maze Challenge from code.org. In this domain, the following types of coding blocks are available: - Basic action blocks: move forward, turn left, turn right. - Boolean conditions: path ahead, path left, path right. - Loops: repeat Until(goal), repeat(int). - Conditionals: if(boolean), if(boolean)else. In this domain, a task is represented as an 8x8 visual grid that contains WALL cells, FREE cells, AVATAR (with specific location and direction), and GOAL. We represent a task s 8x8 visual grid with the following symbols. # represents a WALL cell. + represents a FREE cell. * represents GOAL. E represents AVATAR s location facing East direction. W represents AVATAR s location facing West direction. N represents AVATAR s location facing North direction. S represents AVATAR s location facing South direction. Below, I am giving you a program structure. Can you generate a code that respects this program structure? You should not change the structure. This means that you shouldn t add or remove any loops (e.g., repeat Until(goal), repeat(int)) and conditionals (e.g., if(boolean), if(boolean)else). The program needs to be valid, meaning that bodies of constructs cannot remain empty. To complete this given structure, you can use basic action blocks, boolean conditions, and iteration numbers that are available in the Hour of Code: Maze Challenge programming. Code: Follow-up prompt in case of constructs changed Your code does not follow the program structure I have given. You shouldn t add or remove any loops (e.g., repeat Until(goal), repeat(int)) and conditionals (e.g., if(boolean), if(boolean)else). Can you try to generate a new code for the same structure? Code: Follow-up prompt for any other issues Your code could be improved! You can think of producing a better code by reasoning about the AVATAR s actions when the code is executed. Can you try to generate a new code respecting the program structure I have given? (a) Prompts used for obtaining Cout Published in Transactions on Machine Learning Research (01/2024) Task: Initial prompt I am working in the block-based visual programming domain of Hour of Code: Maze Challenge from code.org. In this domain, the following types of coding blocks are available: - Basic action blocks: move forward, turn left, turn right. - Boolean conditions: path ahead, path left, path right. - Loops: repeat Until(goal), repeat(int). - Conditionals: if(boolean), if(boolean)else. In this domain, a task is represented as an 8x8 visual grid that contains WALL cells, FREE cells, AVATAR (with specific location and direction), and GOAL. We represent a task s 8x8 visual grid with the following symbols. # represents a WALL cell. + represents a FREE cell. * represents GOAL. E represents AVATAR s location facing East direction. W represents AVATAR s location facing West direction. N represents AVATAR s location facing North direction. S represents AVATAR s location facing South direction. Below I am giving you a solution code. Can you generate a task with 8x8 visual grid that would be solved by this code? The visual grid must contain AVATAR (with specific location and direction) along with GOAL, and can have WALL cells and FREE cells. Number your grid with row numbers (1 to 8) and column numbers (1 to 8). Also, you should tell me the position of AVATAR and GOAL in your generated task so we are sure about the numbering. You can verify the correctness of your generated task by executing the solution code on your task. A solution code for a task takes AVATAR to GOAL when executed. Note that AVATAR can only move on FREE cells and will crash if it tries to go to a WALL cell. If your generated task is not correct, you should try again to generate a correct task. Task: Follow-up prompt for any issues Your code does not solve the generated grid. Be careful with the AVATAR as it should reach the goal after the code execution. Keep the code fixed. Can you try to generate a new visual grid and explain your reasoning? Recall that your code, when executed, should take the AVATAR from its initial location to the GOAL. (b) Prompts used for obtaining the main part of Tout IO Figure 23: Prompts used in the implementation of GPT4Task Syn-converse technique for Ho CMaze domain. Published in Transactions on Machine Learning Research (01/2024) Code: Initial prompt I am working in the block-based visual programming domain of Karel programming. In this domain, the following types of coding blocks are available: - Basic action blocks: move forward, turn left, turn right, pick marker, put marker. - Boolean conditions: path ahead, path left, path right, marker present, no path ahead, no marker present. - Loops: while(boolean), repeat(int). - Conditionals: if(boolean), if(boolean)else. In this domain, a task is represented as a pair of 10x10 visual pregrid and 10x10 visual postgrid. This pregrid and postgrid contain WALL cells, FREE cells, AVATAR (with specific location and direction), and markers. We represent a task s 10x10 visual pregrid and postgrid with the following symbols. # represents a WALL cell. + represents a FREE cell. m represents a cell with marker. E represents AVATAR s location on a cell without marker, facing East direction. W represents AVATAR s location on a cell without marker, facing West direction. N represents AVATAR s location on a cell without marker, facing North direction. S represents AVATAR s location on a cell without marker, facing South direction. Em represents AVATAR s location on a cell with marker, facing East direction. Wm represents AVATAR s location on a cell with marker, facing West direction. Nm represents AVATAR s location on a cell with marker, facing North direction. Sm represents AVATAR s location on a cell with marker, facing South direction. Below, I am giving you a program structure. Can you generate a code that respects this program structure? - Structure - You should not change the structure. This means that you shouldn t add or remove any loops (e.g., while(boolean), repeat(int)) and conditionals (e.g., if(boolean), if(boolean)else). The program needs to be valid, meaning that bodies of constructs cannot remain empty. To complete this given structure, you can use basic action blocks, boolean conditions, and iteration numbers that are available in Karel programming. Code: Follow-up prompt in case of constructs changed Your code does not follow the programming structure I have given. You shouldn t add or remove any loops (e.g., while(boolean), repeat(int)) and conditionals (e.g., if(boolean), if(boolean)else). Can you try to generate a new code for the same structure? Code: Follow-up prompt for any other issues Your code could be improved! You can think of producing a better code by reasoning about the Karel AVATAR when the code is executed. Can you try to generate a new code? (a) Prompts used for obtaining Cout Published in Transactions on Machine Learning Research (01/2024) Task: Initial prompt I am working in the block-based visual programming domain of Karel programming. In this domain, the following types of coding blocks are available: - Basic action blocks: move forward, turn left, turn right, pick marker, put marker. - Boolean conditions: path ahead, path left, path right, marker present, no path ahead, no marker present. - Loops: while(boolean), repeat(int). - Conditionals: if(boolean), if(boolean)else. In this domain, a task is represented as a pair of 10x10 visual pregrid and 10x10 visual postgrid. This pregrid and postgrid contain WALL cells, FREE cells, AVATAR (with specific location and direction), and markers. We represent a task s 10x10 visual pregrid and postgrid with the following symbols. # represents a WALL cell. + represents a FREE cell. m represents a cell with marker. E represents AVATAR s location on a cell without marker, facing East direction. W represents AVATAR s location on a cell without marker, facing West direction. N represents AVATAR s location on a cell without marker, facing North direction. S represents AVATAR s location on a cell without marker, facing South direction. Em represents AVATAR s location on a cell with marker, facing East direction. Wm represents AVATAR s location on a cell with marker, facing West direction. Nm represents AVATAR s location on a cell with marker, facing North direction. Sm represents AVATAR s location on a cell with marker, facing South direction. Below I am giving you a solution code. Can you generate a task with a pair of 10x10 visual pregrid and 10x10 visual postgrid that would be solved by this code? Both the visual pregrid and visual postgrid must contain AVATAR (with specific location and direction), and can have WALL cells, FREE cells, and markers. Number your grids with row numbers (1 to 10) and column numbers (1 to 10). Also, you should tell me the position of AVATAR in your generated pregrid and postgrid so we are sure about the numbering. You can verify the correctness of your generated task by executing the solution code on your task. A solution code for a task transforms the pregrid into the postgrid when executed. Note that AVATAR can only move on FREE cells and will crash if it tries to go to a WALL cell. If your generated task is not correct, you should try again to generate a correct task. Task: Follow-up prompt for any issues Your code does not solve the generated pregrid and postgrid. Be careful with the AVATAR in the postgrid as it should show the effect of the code execution. Keep the code fixed. Can you try to generate a new visual pregrid and postgrid and explain your reasoning? Recall that your code, when executed, should transform the pregrid into the postgrid. Be careful with the AVATAR in the postgrid as it should show the effect of the code execution. (b) Prompts used for obtaining the main part of Tout IO Figure 24: Prompts used in the implementation of GPT4Task Syn-converse technique for Karel domain. Published in Transactions on Machine Learning Research (01/2024) Code: Few-shot prompt I am working in the block-based visual programming domain of Hour of Code: Maze Challenge from code.org. In this domain, the following types of coding blocks are available: - Basic action blocks: move forward, turn left, turn right. - Boolean conditions: path ahead, path left, path right. - Loops: repeat Until(goal), repeat(int). - Conditionals: if(boolean), if(boolean)else. In this domain, a task is represented as an 8x8 visual grid that contains WALL cells, FREE cells, AVATAR (with specific location and direction), and GOAL. We represent a task s 8x8 visual grid with the following symbols. # represents a WALL cell. + represents a FREE cell. * represents GOAL. E represents AVATAR s location facing East direction. W represents AVATAR s location facing West direction. N represents AVATAR s location facing North direction. S represents AVATAR s location facing South direction. Below, I will give you a program structure. Can you generate a code that respects this program structure? You should not change the structure. This means that you shouldn t add or remove any loops (e.g., repeat Until(goal), repeat(int)) and conditionals (e.g., if(boolean), if(boolean)else). The program needs to be valid, meaning that bodies of constructs cannot remain empty. To complete this given structure, you can use basic action blocks, boolean conditions, and iteration numbers that are available in the Hour of Code: Maze Challenge programming. I am giving you some examples comprising a program structure and a code that respects the structure. Provide the code for the last structure. Example i: Structure Example i: Code Example n: Structure Example n: Code (a) Prompts used for obtaining Cout Published in Transactions on Machine Learning Research (01/2024) Task: Few-shot prompt I am working in the block-based visual programming domain of Hour of Code: Maze Challenge from code.org. In this domain, the following types of coding blocks are available. - Basic action blocks: move forward, turn left, turn right. - Boolean conditions: path ahead, path left, path right. - Loops: repeat Until(goal), repeat(int). - Conditionals: if(boolean), if(boolean)else. In this domain, a task is represented as an 8x8 visual grid that contains WALL cells, FREE cells, AVATAR (with specific location and direction), and GOAL. We represent a task s 8x8 visual grid with the following symbols. # represents a WALL cell. + represents a FREE cell. * represents GOAL. E represents AVATAR s location facing East direction. W represents AVATAR s location facing West direction. N represents AVATAR s location facing North direction. S represents AVATAR s location facing South direction. Below I will give you a solution code. Can you generate a task with 8x8 visual grid that would be solved by this code? The visual grid must contain AVATAR (with specific location and direction) along with GOAL, and can have WALL cells and FREE cells. Number your grid with row numbers (1 to 8) and column numbers (1 to 8). Also, you should tell me the position of AVATAR and GOAL in your generated task so we are sure about the numbering. You can verify the correctness of your generated task by executing the solution code on your task. A solution code for a task takes AVATAR to GOAL when executed. Note that AVATAR can only move on FREE cells and will crash if it tries to go to a WALL cell. If your generated task is not correct, you should try again to generate a correct task. I am giving you some examples comprising a solution code and task that is solved by this code. Provide the task for the last solution code. Example i: Solution Example i: Task Example n: Solution Example n: Task (b) Prompts used for obtaining the main part of Tout IO Figure 25: Prompts used in the implementation of GPT4Task Syn-fewshot technique for Ho CMaze domain. Published in Transactions on Machine Learning Research (01/2024) Code: Few-shot prompt I am working in the block-based visual programming domain of Karel programming. In this domain, the following types of coding blocks are available: - Basic action blocks: move forward, turn left, turn right, pick marker, put marker. - Boolean conditions: path ahead, path left, path right, marker present, no path ahead, no marker present. - Loops: while(boolean), repeat(int). - Conditionals: if(boolean), if(boolean)else. In this domain, a task is represented as a pair of 10x10 visual pregrid and 10x10 visual postgrid. This pregrid and postgrid contain WALL cells, FREE cells, AVATAR (with specific location and direction), and markers. We represent a task s 10x10 visual pregrid and postgrid with the following symbols. # represents a WALL cell. + represents a FREE cell. m represents a cell with marker. E represents AVATAR s location on a cell without marker, facing East direction. W represents AVATAR s location on a cell without marker, facing West direction. N represents AVATAR s location on a cell without marker, facing North direction. S represents AVATAR s location on a cell without marker, facing South direction. Em represents AVATAR s location on a cell with marker, facing East direction. Wm represents AVATAR s location on a cell with marker, facing West direction. Nm represents AVATAR s location on a cell with marker, facing North direction. Sm represents AVATAR s location on a cell with marker, facing South direction. Below, I will give you a program structure. Can you generate a code that respects this program structure? You should not change the structure. This means that you shouldn t add or remove any loops (e.g., while(boolean), repeat(int)) and conditionals (e.g., if(boolean), if(boolean)else). The program needs to be valid, meaning that bodies of constructs cannot remain empty. To complete this given structure, you can use basic action blocks, boolean conditions, and iteration numbers that are available in Karel programming. I am giving you some examples comprising a program structure and a code that respects the structure. Provide the code for the last structure. Example i: Structure Example i: Code Example n: Structure Example n: Code (a) Prompts used for obtaining Cout Published in Transactions on Machine Learning Research (01/2024) Task: Few-shot prompt I am working in the block-based visual programming domain of Karel programming. In this domain, the following types of coding blocks are available. - Basic action blocks: move forward, turn left, turn right, pick marker, put marker. - Boolean conditions: path ahead, path left, path right, marker present, no path ahead, no marker present. - Loops: while(boolean), repeat(int). - Conditionals: if(boolean), if(boolean)else. In this domain, a task is represented as a pair of 10x10 visual pregrid and 10x10 visual postgrid. This pregrid and postgrid contain WALL cells, FREE cells, AVATAR (with specific location and direction), and markers. We represent a task s 10x10 visual pregrid and postgrid with the following symbols. # represents a WALL cell. + represents a FREE cell. m represents a cell with marker. E represents AVATAR s location on a cell without marker, facing East direction. W represents AVATAR s location on a cell without marker, facing West direction. N represents AVATAR s location on a cell without marker, facing North direction. S represents AVATAR s location on a cell without marker, facing South direction. Em represents AVATAR s location on a cell with marker, facing East direction. Wm represents AVATAR s location on a cell with marker, facing West direction. Nm represents AVATAR s location on a cell with marker, facing North direction. Sm represents AVATAR s location on a cell with marker, facing South direction. Below I will give you a solution code. Can you generate a task with a pair of 10x10 visual pregrid and 10x10 visual postgrid that would be solved by this code? Both the visual pregrid and visual postgrid must contain AVATAR (with specific location and direction), and can have WALL cells, FREE cells, and markers. Number your grids with row numbers (1 to 10) and column numbers (1 to 10). Also, you should tell me the position of AVATAR in your generated pregrid and postgrid so we are sure about the numbering. You can verify the correctness of your generated task by executing the solution code on your task. A solution code for a task transforms the pregrid into the postgrid when executed. Note that AVATAR can only move on FREE cells and will crash if it tries to go to a WALL cell. If your generated task is not correct, you should try again to generate a correct task. I am giving you some examples comprising a solution code and task that is solved by this code. Provide the task for the last solution code. Example i: Solution Example i: Task Pregrid [TASK_PREGRID_i] Example i: Task Postgrid [TASK_POSTGRID_i] Example n: Solution Example n: Task Pregrid (b) Prompts used for obtaining the main part of Tout IO Figure 26: Prompts used in the implementation of GPT4Task Syn-fewshot technique for Karel domain.