Sketch Agent: Generating Structured Diagrams from Hand-Drawn Sketches

Cheng Tan1,2, Qi Chen3,4, Jingxuan Wei3,4, Gaowei Wu3,4, Zhangyang Gao1,2, Siyuan Li1,2, Bihui Yu3,4, Ruifeng Guo3,4, Stan Z. Li1
1Westlake University  2Zhejiang University  3University of Chinese Academy of Sciences  4Shenyang Institute of Computing Technology, Chinese Academy of Sciences
tancheng@westlake.edu.cn, weijingxuan20@mails.ucas.edu.cn

Abstract

Hand-drawn sketches are a natural and efficient medium for capturing and conveying ideas. Despite significant advancements in controllable natural image generation, translating freehand sketches into structured, machine-readable diagrams remains a labor-intensive and predominantly manual task. The primary challenge stems from the inherent ambiguity of sketches, which lack the structural constraints and semantic precision required for automated diagram generation. To address this challenge, we introduce Sketch Agent, a multi-agent system designed to automate the transformation of hand-drawn sketches into structured diagrams. Sketch Agent integrates sketch recognition, symbolic reasoning, and iterative validation to produce semantically coherent and structurally accurate diagrams, significantly reducing the need for manual effort. To evaluate the effectiveness of our approach, we propose the Sketch2Diagram Benchmark, a comprehensive dataset and evaluation framework encompassing eight diverse diagram categories, such as flowcharts, directed graphs, and model architectures. The dataset comprises over 6,000 high-quality examples with token-level annotations, standardized preprocessing, and rigorous quality control. By streamlining the diagram generation process, Sketch Agent holds great promise for applications in design, education, and engineering, while offering a significant step toward bridging the gap between intuitive sketching and machine-readable diagram generation.
1 Introduction

Hand-drawn sketches are a natural and powerful medium for rapidly conveying ideas, serving as a universal language in creative, technical, and educational workflows [Zhao and Lai, 2022; Zhao et al., 2024; Tan et al., 2024]. From rough brainstorming sessions to preliminary engineering designs, sketches offer an intuitive way to externalize concepts. However, translating these informal and ambiguous drawings into structured, machine-readable diagrams remains an open challenge. Unlike natural image generation tasks [Cao et al., 2024; Huang et al., 2024; Li et al., 2019], which have seen remarkable progress in recent years through techniques such as controllable natural image generation, the sketch-to-diagram task demands more than visual fidelity: it requires understanding and formalizing the underlying structural and semantic relationships inherent to diagrams.

We introduce a new task, sketch-to-diagram generation, which involves converting a hand-drawn sketch into a structured, machine-readable diagram. As shown in Figure 1, this task differs fundamentally from controllable natural image generation, as it focuses not on generating aesthetically pleasing visuals but on synthesizing a precise, semantically meaningful diagram that adheres to specific structural rules. This transformation requires solving several core challenges: (1) handling the inherent ambiguity and variability in freehand sketches, (2) preserving the spatial and structural relationships between diagram components, and (3) producing an output that is both syntactically valid and semantically aligned with the user's intent. These challenges make sketch-to-diagram generation a highly specialized and underexplored problem, distinct from existing work.
To address the lack of standardized resources for sketch-to-diagram research, we introduce the Sketch2Diagram Benchmark, a comprehensive dataset and evaluation framework designed to support the development and assessment of models for this task. The dataset spans eight diverse diagram categories, including flowcharts, directed graphs, and model architectures, and consists of over 6,000 high-quality examples. Each example includes a hand-drawn sketch paired with its corresponding structured diagram representation. The dataset is meticulously curated, featuring token-level annotations, standardized preprocessing, and rigorous quality control, ensuring its reliability for both training and evaluation purposes. Building on this benchmark dataset, we propose Sketch Agent, an end-to-end system for automating the transformation of hand-drawn sketches into structured diagrams. The system begins by converting an input sketch into a symbolic code representation, which abstracts its structural and spatial properties into a machine-readable format, bridging the gap between informal freehand drawings and precise computational diagrams. From there, Sketch Agent performs iterative refinement to improve the accuracy, coherence, and validity of the code representation, ensuring the final diagram adheres to the user's intent while satisfying all structural constraints.

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25)

Figure 1: Sketch Agent automates the transformation of hand-drawn sketches into structured diagrams.
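To make the per-example structure concrete, one way to model a benchmark record is sketched below. This is a hypothetical schema for illustration only: the field names and values are assumptions, not the Sketch2Diagram Benchmark's actual format.

```python
from dataclasses import dataclass, field

# Hypothetical shape of one sketch/diagram pair; all field names here
# are illustrative assumptions, not the benchmark's actual schema.
@dataclass
class SketchExample:
    sketch_path: str   # path to the hand-drawn sketch image
    category: str      # one of the eight diagram categories
    code: str          # structured diagram code (e.g. TikZ source)
    tokens: list = field(default_factory=list)  # token-level annotations

ex = SketchExample(
    sketch_path="sketches/flowchart_0001.png",
    category="flowchart",
    code="\\begin{tikzpicture}...\\end{tikzpicture}",
    tokens=["\\begin{tikzpicture}", "...", "\\end{tikzpicture}"],
)
```

A record like this keeps the sketch, its category label, the target code, and the token-level annotation in one unit, which is convenient for both training and evaluation loaders.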
Our main contributions are as follows:
- We formally define the task of converting hand-drawn sketches into structured diagrams, distinguishing it from related tasks such as controllable image generation.
- We introduce a benchmark dataset of hand-drawn sketches and their corresponding structured diagram representations, offering a standardized resource for training and evaluation.
- We propose Sketch Agent, a modular system that automates the sketch-to-diagram transformation process, integrating sketch recognition, symbolic reasoning, iterative refinement, and verification into a unified pipeline.

2 Related Work

2.1 Controllable Image Generation

Controllable image generation aims to synthesize images that adhere to specific constraints [Cao et al., 2024; Huang et al., 2024]. Existing methods fall into three main approaches: GAN-based, diffusion-based, and multimodal fusion techniques. Early GAN-based methods, such as ControlGAN [Li et al., 2019], introduced fine-grained text-conditioned image synthesis but suffered from instability and mode collapse. Diffusion models have since become the dominant paradigm, offering more stable and higher-quality generation. Diffusion Self-Guidance [Epstein et al., 2023] and MultiDiffusion [Bar-Tal et al., 2023] enable explicit control over object positioning and spatial structure, while Control-GPT [Zhang et al., 2023b] leverages GPT-4-generated sketches for improved spatial consistency. Other approaches integrate LLMs or additional modalities for enhanced control, such as MoMA [Song et al., 2025], which fuses textual and visual embeddings, and MM-Diff [Wei et al., 2024b], which refines personalization through CLIP-based representations. Furthermore, PALP [Arar et al., 2024] enhances alignment with complex textual prompts by optimizing cross-modal score matching.
Despite these advancements, existing approaches primarily focus on photorealistic image synthesis, making them insufficient for structured and logic-constrained generation tasks such as diagrams [Cao et al., 2024; Huang et al., 2024]. While diffusion-based models offer control over spatial attributes [Epstein et al., 2023; Bar-Tal et al., 2023], they lack the explicit structural reasoning capabilities required for diagram generation.

2.2 Controllable Code Generation

Controllable code generation aims to produce structured and executable code while adhering to specific constraints [Shin and Nam, 2021; Wei et al., 2025]. Language-model-based approaches leverage pre-trained models to improve code synthesis: Magicoder [Wei et al., 2024a] enhances multi-language code generation through OSS-INSTRUCT, while VeriGen [Thakur et al., 2024] tailors language models for Verilog synthesis by curating specialized training datasets. Structure-aware methods refine code generation by integrating abstract syntax trees and data-flow graphs: StructCoder [Tipirneni et al., 2024] introduces a structure-aware self-attention mechanism, and CoTexT [Phan et al., 2021] applies multi-task learning to enhance text-to-code understanding. Planning-based techniques decompose complex tasks into stepwise solutions, as in Self-Planning Code Generation [Jiang et al., 2024], while reinforcement-learning-based approaches such as CodeRL [Le et al., 2022] optimize model adaptation through reward-based fine-tuning. Execution-enhanced methods ensure the correctness of generated code by leveraging runtime validation: MBR-EXEC [Shi et al., 2022] employs execution-based minimum Bayes risk decoding, whereas CODET [Chen et al., 2022] generates test cases to filter invalid code. In addition, real-world integration studies such as in-IDE code generation [Xu et al., 2022] evaluate practical utility.
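The execution-enhanced idea, in the spirit of CODET and MBR-EXEC, can be illustrated with a minimal candidate-filtering loop. Everything below is a toy stand-in: the candidate programs, the test cases, and the assumed entry-point name `f` are invented for demonstration and do not come from the cited systems.

```python
def passes(candidate_src, tests):
    """Execute a candidate program and check it against test cases."""
    env = {}
    try:
        exec(candidate_src, env)            # define the candidate function
        for inp, expected in tests:
            if env["f"](inp) != expected:   # assumed entry point `f`
                return False
        return True
    except Exception:
        return False                        # crashing candidates are invalid

candidates = [
    "def f(x): return x * 2",   # correct for the tests below
    "def f(x): return x + 2",   # wrong on input 3
]
tests = [(1, 2), (3, 6)]

# Keep only candidates whose execution satisfies every test case.
valid = [c for c in candidates if passes(c, tests)]
```

The design choice is simple: rather than trusting the model's first sample, execution results act as an external filter, which is what gives these methods their robustness to syntactically plausible but semantically wrong code.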
While these methods advance code generation in terms of syntax, semantics, and execution fidelity, they remain constrained to text-based inputs and cannot synthesize code from sketch-based conceptualizations. Furthermore, existing controllable code generation approaches do not inherently support structured diagram generation, limiting their applicability in domains that require logical and hierarchical visual representations. Prior work [Ghosh et al., 2018; Almazroi et al., 2021] has focused on extracting structured representations from textual descriptions, but these methods do not generalize to sketch-driven workflows.

3 Method

The system consists of three modules: the Sketch-to-Code Agent, the Editing Code Agent, and the Check Agent, each responsible for specific tasks. Given a sketch S and a user-specified instruction set Q, Sketch Agent generates an initial code representation, refines it based on additional instructions, and verifies the final output before rendering the structured diagram. The overall workflow is illustrated in Figure 2.

Figure 2: The Sketch Agent pipeline, consisting of three main modules: Sketch-to-Code Agent, Editing Code Agent, and Check Agent.

3.1 Sketch-to-Code Agent

The Sketch-to-Code Agent maps a hand-drawn sketch S and an instruction set Q to an initial code representation C_k, capturing the structural semantics of the sketch. This process is formulated as:

C_k = F_k(S, Q), (1)

where F_k represents the transformation function. The output C_k is modeled as a sequence of tokens, where each token corresponds to a diagram component or an attribute. To ensure the generated code aligns with the expected structure, we define the objective as minimizing the negative log-likelihood of the sequence:

\mathcal{L}_k = -\mathbb{E}_{C_k \sim P(C \mid S, Q)} \sum_{t=1}^{T} \log P\big(C_k^{(t)} \mid C_k^{(<t)}, S, Q\big). (2)