# Computer-Aided Design as Language

Yaroslav Ganin¹, Sergey Bartunov¹, Yujia Li¹, Ethan Keller², Stefano Saliceti¹
¹DeepMind · ²Onshape

Correspondence to: ganin@deepmind.com. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).

**Abstract.** Computer-Aided Design (CAD) applications are used in manufacturing to model everything from coffee mugs to sports cars. These programs are complex and require years of training and experience to master. A component of CAD models that is particularly difficult to create is the highly structured 2D sketch that lies at the heart of every 3D construction. In this work, we propose a machine learning model capable of automatically generating such sketches. Through this, we pave the way for developing intelligent tools that would help engineers create better designs with less effort. The core of our method is a combination of a general-purpose language modeling technique and an off-the-shelf data serialization protocol. Additionally, we explore several extensions allowing us to gain finer control over the generation process. We show that our approach has enough flexibility to accommodate the complexity of the domain and performs well for both unconditional synthesis and image-to-sketch translation.

## 1 Introduction

Figure 1: The anatomy of a CAD sketch. Sketches are the main building block of every 3D construction. A sketch consists of entities (e.g., lines and arcs) and constraints (e.g., tangent and mirror). The dotted curve shows what happens if we drop some of the constraints: the design idea is lost.

Computer-Aided Design (CAD) is used in the production of most manufactured objects: from cars to robots to stents to power plants. CAD has replaced pencil drawings with precise computer sketches, enabling unparalleled precision, flexibility, and speed. Despite these improvements, the CAD engineer must still develop, relate, and annotate all the minutiae of their designs with the same attention to detail as their drafting-table forebears. CAD productivity might be improved by the careful application of machine learning to automate predictable design tasks and free the engineer to focus on the bigger picture. The flexibility and power of deep learning is uniquely suited to the complexity of design.

Sketches are at the heart of mechanical CAD. They are the skeleton from which three-dimensional forms are made. A sketch consists of various geometric entities (e.g., lines, arcs, splines, and circles) related by specific constraints such as tangency, perpendicularity, and symmetry. Figure 1 illustrates how entities and constraints work in tandem to create well-defined shapes. Geometric entities lie on a single plane and together form enclosed regions used by subsequent construction operations such as lofts and extrusions to generate complex 3D geometry.

Figure 2: Interpreter-guided generation of a sketch. At each point in time, a Transformer [33] outputs a raw value which is fed into an interpreter that decides which field of a Protocol Buffers message this value corresponds to. Once the field is populated, the interpreter communicates its decision back to the Transformer and transitions to the next state.
Well-chosen sketch constraints are essential to properly convey design intent [2] and facilitate the sketch's resilience to successive parameter modifications, which is often understood as a measure of the quality of a design document [8]. The dotted curve in Figure 1 shows what happens when some of the constraints are dropped: the design idea is lost.

The complexities of sketch construction are analogous to those of natural language modeling. Selecting the next constraint or entity in a sketch is like the generation of the next word in a sentence. In both contexts, the selection must function grammatically (form a consistent constraint system in the case of the sketch) and work towards some cohesive meaning (preserve design intent). Luckily, machine learning has proved highly successful in generating natural language, especially the Transformer [33] trained on vast amounts of real-world data [26, 6]. It is therefore a promising choice for adapting to the task of sketch generation.

This work is our take at this adaptation. We make the following contributions: (1) We devise a method for describing structured objects using Protocol Buffers [32] and demonstrate its flexibility on the domain of natural CAD sketches. (2) We propose several techniques for capturing distributions of objects represented as serialized Protocol Buffers. Our approach draws inspiration from recent advances in language modeling while focusing on eliminating data redundancy. (3) We collect a dataset of over 4.7M carefully preprocessed parametric CAD sketches and use this dataset to validate the proposed generative models. To our knowledge, the experiments presented in this work significantly surpass the scale of those reported in the literature, both in terms of the amount of training data and the model capacity.

## 2 Related work

**Datasets and generative models for CAD.** Until recently, there were very few parametric CAD datasets large and varied enough to serve as training data for machine learning. This situation started to change with the release of the ABC dataset [16], containing a collection of 3D shapes from the Onshape public repository [22]. Unfortunately, the main focus of [16] revolves around meshes and, as a result, the dataset is difficult to use for sketch modeling. Several works concurrent with ours deal with the symbolic representation of CAD constructions. Seff et al. [31] center their attention on contributing a better dataset of 2D sketches but also provide a proof-of-concept model predicting a selected subset of object attributes. Willis et al. [37] use the dataset from [31] to train a modification of [21] which demonstrates a boost in generation quality but is designed to work only for sketch entities. The latter limitation is addressed in our present work and in a subsequent paper by Para et al. [24]. While [24] employ ideas very similar to ours, they do not support certain features of the CAD data, and their proposed model is a more direct adaptation of [21] and therefore cannot handle arbitrary orderings of entities and constraints. In Section 5, we show that object ordering has a substantial impact on performance. Fusion 360 Gallery [36] attacks CAD data from a different angle. Here, the task is to recover a sequence of extrusion operations that gives rise to a particular target 3D shape. Despite dealing with 3D, this setting is deliberately limited: sketches are assumed to be given, and the proposed model only decides on which sketch to extrude and to what extent.
[38] considers a more general scenario where extrusion profiles are not provided and need to be synthesized from scratch. Although both of these works make initial steps towards full parametric CAD generation, they rely on significant simplifying assumptions, and therefore it is unclear how well they will scale to more realistic scenarios. Our approach, on the other hand, is designed to be flexible and domain-agnostic, and is limited only by data availability.

**Vector image generation and inference.** Synthesizing CAD sketches bears a lot of similarities with predicting vector graphics. In this field, several recent works (Carlier et al. [7] and Reddy et al. [30]) use different vector object representations to define generative models of vector images. Egiazarian et al. [13] take a more traditional computer vision approach and propose a multi-stage pipeline for vectorizing technical drawings. All of these methods use highly domain-dependent architectures and, therefore, it would be a non-trivial task to adapt them for the generation of complex sketch objects. The CAD community has also been concerned with the similar task of image-to-CAD conversion [20, 34, 10, 11], largely focusing on heuristic object recognition, while our work relies more on learning the recognition from data.

**Transformers for sequence modeling.** In our work, we employ Transformers [33] as a computational backbone for the proposed approach. Due to its scalability and excellent performance [29, 9, 6, 27], this architecture has become the dominant approach in many sequence modeling applications. Our method can be seen as a generalization of PolyGen [21], a Transformer-based generative model for 3D meshes. Similarly to [21], we use Pointer Networks [35] to relate items in the synthesized sequence. Unlike PolyGen, however, our framework can handle non-homogeneous structures of arbitrary complexity. Moreover, we simplify the architecture to use a single neural network to generate the entire object of interest. All these improvements make our approach a good fit for modeling CAD sketches and potentially other components of CAD constructions.

Formally, a CAD sketch is defined by two collections of objects: entities and constraints. Each object is generally represented as a set of attribute-value pairs, where a value can be either primitive (e.g., integer or floating-point) or complex (e.g., an array or another object). Sketches that we use in this work originate from the Onshape platform [22], which provides them in JSON format [25]. As the first step in our processing pipeline, we convert JSON messages into Protocol Buffers (PB) [32]. In order to keep the pipeline as domain-agnostic and as widely applicable as possible, we aim to avoid any significant changes to the data and largely retain the original structure of objects. The benefit of converting into PB is twofold: the resulting data occupies less space because unnecessary information is removed; and, unlike JSON, PB provides a convenient way to define precise specifications for structures of arbitrary complexity.

Listing 1a shows how we represent the line entity and the mirror constraint (see Appendix A for an extensive list of supported objects). The line specification is straightforward: we first need to decide whether our entity should be treated as a construction geometry² and then provide pairs of coordinates for the beginning and end of the segment.
The MirrorConstraint is used to force an arbitrary number of pairs of geometries (i.e., mirrored_pairs) to be symmetrical with respect to some axis (i.e., mirror). Constraints rely on the Pointer data type to specify the entities they act upon. In practice, a pointer is simply an index into the table of all the eligible pointees (i.e., entities and their parts).

² Construction geometries are rendered in dashed style in the figures.

```proto
message LineEntity {
  bool is_construction = 1;
  message Vector {  // 2D coordinate.
    double x = 1;
    double y = 2;
  }
  Vector start = 2;  // Start point.
  Vector end = 3;    // End point.
}

message MirrorConstraint {
  Pointer mirror = 1;  // Axis of symmetry.
  message Pair {  // Mirrored objects.
    Pointer first = 1;
    Pointer second = 2;
  }
  repeated Pair mirrored_pairs = 2;
}
```

(a) Entities and constraints have similar structures. Pointers refer to entities that constraints are applied to.

```proto
message Entity {
  oneof kind {
    LineEntity line = 1;
    // And other entity types.
  }
}

message Constraint {
  // Defined similarly to Entity.
}

message Object {
  oneof kind {
    Entity entity = 1;
    Constraint constraint = 2;
  }
}

message Sketch {
  repeated Object objects = 1;
}
```

(b) A full sketch is defined as a sequence of objects, each of which can be either an entity or a constraint.

Listing 1: Examples of object specifications. We represent objects using Protocol Buffers, which allow us to easily write specifications for structured objects of varying complexity.

Our ultimate goal is to build a machine learning model of sketch objects. To that end, we process the data even further and represent these objects as sequences of tokens. This allows us to pose sketch generation as language modeling (LM) and take advantage of the recent progress in this area [26, 6]. To achieve this, we first pack collections of entities and constraints into one Protocol Buffer message (see Listing 1b), assuming some ordering of objects. We discuss different orderings in Section 5.

There are a few ways of obtaining a sequence of tokens from a sketch message. Arguably the most intuitive one is to format messages as text. For a line entity connecting (0.0, 0.1) and (−0.5, 0.2) this will result in:

```
{ is_construction: true, start { x: 0.0, y: 0.1 }, end { x: -0.5, y: 0.2 } }
```

Since this format contains both the structure and the content of the data, the resulting sequences end up being prohibitively long. Additionally, the model would have to generate valid syntax, which would take up some portion of the model's capacity. To overcome these challenges, we work with two flavours of serialized PB messages. The first one is a sequence of bytes obtained by calling the SerializeToString() method of a message. Such sequences are much shorter since the structure is handled by an external parser automatically generated from the data specification. The parser's task is to interpret the incoming stream of unstructured bytes and populate the fields of PB messages. However, like the text format, not every sequence of bytes results in a valid PB message.
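To make the byte-level flavour concrete, the snippet below sketches this round trip using the standard Python Protocol Buffers API. The module name `sketch_pb2` is hypothetical — it is what `protoc --python_out` would generate from the specification in Listing 1 — but `SerializeToString()` and `ParseFromString()` are the standard methods of generated message classes.

```python
# A minimal sketch of the byte-serialization path, assuming the specification
# in Listing 1 has been compiled with protoc into a (hypothetical) module
# `sketch_pb2`.
import sketch_pb2

sketch = sketch_pb2.Sketch()

# Append a line entity connecting (0.0, 0.1) and (-0.5, 0.2); assigning a
# field inside the `entity.line` submessage also selects the `kind` oneofs.
line = sketch.objects.add().entity.line
line.is_construction = True
line.start.x, line.start.y = 0.0, 0.1
line.end.x, line.end.y = -0.5, 0.2

# The compact byte flavour: the wire format carries only field tags and
# values, so these sequences are far shorter than the text format.
data = sketch.SerializeToString()
print(len(data))

# Round trip: the parser generated from the .proto specification rebuilds
# the structured message from the raw bytes.
restored = sketch_pb2.Sketch()
restored.ParseFromString(data)
assert restored.objects[0].entity.line.end.x == -0.5
```

All structural information lives in the generated parser rather than in the byte stream itself, which is exactly why the serialized sequences are short — and also why an arbitrary byte string is not guaranteed to parse.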
Going one step further, we can utilize the structure of the sketch format more directly and build a custom interpreter that takes as input a sequence of tokens, each representing a valid choice at one of the decision steps [4] in the sketch creation process. We designed this interpreter in such a way that all sequences of tokens in this format lead to valid PB messages. More specifically, we represent a message as a sequence of triplets $(d_i, c_i, f_i)$, where $i$ is the index of the token.

The majority of tokens describe basic fields of the sketch objects, with each token representing exactly one field. The first two positions in each triplet are allocated for a discrete value and a continuous value, respectively. Since each field in a message is either discrete or continuous, only one of the two positions is active at a time (the other one is set to a default zero value). The third component is a boolean flag signifying the end of a repeated field³, which contains a list of elements of the same type. An example sequence for a sketch containing a line and a point placed at one of its ends is shown in Table 1 (Triplet column).

³ For additional details on how we handle more complex constructions of the PB language, see Appendix A.

| # | Triplet | Field |
|---|---------|-------|
| 1. | (**0**, 0.0, False) | objects.kind |
| 2. | (**0**, 0.0, False) | entity.kind |
| 3. | (**1**, 0.0, False) | line.is_constr |
| 4. | (0, **0.0**, False) | line.start.x |
| 5. | (0, **0.1**, False) | line.start.y |
| 6. | (0, **−0.5**, False) | line.end.x |
| 7. | (0, **0.2**, False) | line.end.y |
| 8. | (**0**, 0.0, False) | objects.kind |
| 9. | (**1**, 0.0, False) | entity.kind |
| 10. | (**0**, 0.0, False) | point.is_constr |
| 11. | (0, **0.0**, False) | point.x |
| 12. | (0, **0.1**, False) | point.y |
| 13. | (0, 0.0, **True**) | objects.kind |

Table 1: A triplet representation of a simple sketch. The sketch contains a line and a point. Within each triplet, the active component (the value that is actually used) is highlighted in bold. The Field column shows which field of the object the triplet is associated with.

Given a sequence of such triplets, it is possible to infer which exact field each token corresponds to. Indeed, the very first token $(d_1, c_1, f_1)$ is always associated with objects.kind since it is the first choice that needs to be made to create a Sketch message (see Listing 1b). The second field depends on the concrete value of $d_1$: if $d_1 = 0$ then the first object is an entity, which means that the second token corresponds to entity.kind. The rest of the sequence is associated with fields in a similar fashion. Field identifiers, along with their locations within an object, form the context of the tokens. We use this contextual information as an additional input for our machine learning models since it makes it easier to interpret the meaning of the triplet values and to be aware of the overall structure of the data.

In order to estimate the distribution $p_{\text{data}}$ of 2D sketches in a dataset $\mathcal{D}$, we decompose the joint distribution over the sequence of tokens [19] $\mathbf{t} = (t_1, \ldots, t_N)$ in an autoregressive fashion, representing each conditional with a neural network parameterized by $\theta$, and pose the estimation of $p_{\text{data}}$ as maximization of the log-likelihood of $\mathcal{D}$, i.e.,

$$\max_{\theta} \; \sum_{\mathbf{t} \in \mathcal{D}} \log p(\mathbf{t}; \theta), \qquad p(\mathbf{t}; \theta) = \prod_{i=1}^{N} p(t_i \mid \mathbf{t}_{<i}; \theta).$$
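To make the field-assignment procedure concrete, here is a deliberately simplified, illustrative interpreter covering only the two entity types of Table 1. All names are hypothetical; the real interpreter is derived from the full Protocol Buffers specification and also handles constraints and more complex nested structures.

```python
# A toy interpreter for the triplet format. For simplicity it assumes every
# object is an entity (d == 0 at objects.kind) and knows only the two entity
# types from Table 1.

# Flattened field schedule per entity kind (value of d at entity.kind).
ENTITY_FIELDS = {
    0: ["line.is_constr", "line.start.x", "line.start.y",
        "line.end.x", "line.end.y"],
    1: ["point.is_constr", "point.x", "point.y"],
}

def assign_fields(triplets):
    """Maps each (d, c, f) triplet to the PB field it populates."""
    fields, i = [], 0
    while i < len(triplets):
        d, c, f = triplets[i]
        fields.append("objects.kind")
        if f:                      # End of the repeated `objects` field.
            break
        i += 1
        d, c, f = triplets[i]      # This token selects the entity type.
        fields.append("entity.kind")
        for name in ENTITY_FIELDS[d]:
            i += 1
            fields.append(name)
        i += 1
    return fields

# Running the interpreter on the triplets of Table 1 reproduces its
# Field column (13 fields in total).
table1 = [(0, 0.0, False), (0, 0.0, False), (1, 0.0, False),
          (0, 0.0, False), (0, 0.1, False), (0, -0.5, False),
          (0, 0.2, False), (0, 0.0, False), (1, 0.0, False),
          (0, 0.0, False), (0, 0.0, False), (0, 0.1, False),
          (0, 0.0, True)]
assert len(assign_fields(table1)) == 13
```

This is the mechanism behind Figure 2: because the interpreter deterministically knows which field comes next, it can feed that context back to the Transformer, and any token sequence decodes to a valid message.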
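As a minimal numerical illustration of this factorization (not the paper's actual model — in the paper each conditional is produced by the Transformer of Figure 2), the log-likelihood of a single token sequence decomposes as follows:

```python
import numpy as np

def sequence_log_likelihood(token_ids, conditional_probs):
    """Computes log p(t) = sum_i log p(t_i | t_<i) for one sequence.

    `conditional_probs[i]` stands in for the model's distribution over the
    vocabulary at step i, already conditioned on the prefix t_<i; here it is
    just an array of probabilities.
    """
    return sum(np.log(conditional_probs[i][t])
               for i, t in enumerate(token_ids))

# Training maximizes the sum of these log-likelihoods over the dataset D
# with respect to the network parameters theta.
```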