# CoSE: Compositional Stroke Embeddings

Emre Aksan (ETH Zurich) eaksan@inf.ethz.ch · Thomas Deselaers (Apple Switzerland) deselaers@gmail.com · Andrea Tagliasacchi (Google Research) atagliasacchi@google.com · Otmar Hilliges (ETH Zurich) otmar.hilliges@inf.ethz.ch

*Work done while at Google; unrelated to affiliation with Apple.*

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

## Abstract

We present a generative model for complex free-form structures such as stroke-based drawings. While previous approaches rely on sequence-based models for drawings of basic objects or handwritten text, we propose a model that treats drawings as a collection of strokes that can be composed into complex structures such as diagrams (e.g., flow-charts). At the core of the approach lies a novel autoencoder that projects variable-length strokes into a latent space of fixed dimension. This representation space allows a relational model, operating in latent space, to better capture the relationship between strokes and to predict subsequent strokes. We demonstrate qualitatively and quantitatively that our proposed approach is able to model the appearance of individual strokes, as well as the compositional structure of larger diagram drawings. Our approach is suitable for interactive use cases such as auto-completing diagrams. We make code and models publicly available at https://eth-ait.github.io/cose.

## 1 Introduction

Figure 1: Teaser. We model complex drawings as collections of strokes. Given only sparse strokes as input (black), the model predicts the most likely next strokes and their starting positions (heatmap); each color corresponds to one prediction step. This functional decomposition allows for generative modelling of varied and complex structures such as flow-charts (top) or freehand sketches (bottom). The drawings are model outputs.

Sketches and drawings have been at the heart of human civilization for millennia. While free-form sketching is a powerful and flexible tool for humans, it is a surprisingly hard task for machines, especially if interpreted in the generative sense. Consider Figure 1: when given only a sparse set of strokes (in black), what is the most likely continuation of a sketch or diagram (colored strokes are predicted)? The answer to this question is highly context sensitive and requires reasoning at the local (i.e., stroke) and global (i.e., diagram or sketch) levels.

Existing work has focused on the recognition [1-3] and generation of handwritten text [4, 5], or on the modelling of entire drawings [6-9] from the Quick, Draw! dataset [10]. However, the more recent DiDi dataset introduced by Gervais et al. [11], consisting of much more realistic and challenging complex structures such as diagrams and flow-charts, has been shown to be challenging for existing methods [12], due to the combinatorially many ways individual strokes can be combined into a complex drawing (see Fig. 7).

In this paper we propose a novel compositional generative model, called CoSE, for complex stroke-based data such as drawings, diagrams, and sketches. While existing work considers the entire drawing as a single temporal sequence [3, 6, 7], our key insight is to factor the local appearance of a stroke from the global structure of the drawing. To this end we treat each stroke as an ordered sequence of 2D positions $s = \{(x_t, y_t)\}_{t=0}^{T}$, where $(x_t, y_t)$ represents the 2D location on screen.
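To make this representation concrete, here is a minimal sketch in Python/NumPy (names and shapes are our own illustration, not the paper's released code) of strokes as variable-length arrays of 2D points:

```python
import numpy as np

# A stroke s is an ordered sequence of 2D pen positions (x_t, y_t),
# stored here as a (T+1, 2) float array; strokes vary in length.
stroke_a = np.array([[0.00, 0.00], [0.12, 0.20], [0.30, 0.21]])  # 3 points
stroke_b = np.array([[0.50, 0.50], [0.52, 0.90]])                # 2 points

# A drawing is a collection of such variable-length strokes; a list is
# used for storage, but (as argued next) no meaning is attached to its order.
drawing = [stroke_a, stroke_b]
print([s.shape for s in drawing])  # [(3, 2), (2, 2)]
```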
Importantly, we treat the entire drawing $x$ as an unordered collection of strokes $x = \{s_k\}_{k=1}^{K}$. Since the stroke ordering does not impact the semantic meaning of the diagram, this modelling decision has profound implications: the model does not need to account for the $(K-1)!$ potential orderings of the previous strokes when predicting the $k$-th stroke, leading to a much more efficient utilization of modelling capacity. To achieve this we propose a generative model that first projects variable-length strokes into a latent space of fixed dimension via an encoder. A relational model then predicts the embeddings of future strokes, which are in turn rendered by a decoder. The whole network is trained end-to-end, and we experimentally show that the architecture can model complex diagrams and flow-charts from the DiDi dataset, free-form sketches from QuickDraw, and handwritten text from the IAM-OnDB dataset [13]. We demonstrate the predictive capabilities via a proof-of-concept interactive demo (video in supplementary) in which the model suggests diagram completions based on initial user input. We show that our model outperforms existing models quantitatively and qualitatively, and we analyze the learned latent space to provide insights into how predictions are formed.

## 2 Related Work

The interpretation of stroke data has been pursued since before deep learning, often on small datasets of a few hundred samples targeting a particular application. Costagliola et al. [14] presented a parsing-based approach using a grammar of shapes and symbols, in which shapes and symbols are independently recognized and the results are combined by a non-deterministic grammar parser. Bresler et al. [15, 16] investigated flowchart and diagram recognition using a multi-stage approach with multiple independent segmentation and recognition steps. For handwriting recognition, neural networks have been used successfully since Yaeger et al. [1], and LSTMs have proven particularly effective [4]. Recently, [17] applied graph attention networks to 1,300 diagrams from [14-16] for text/non-text classification using a hand-engineered stroke feature vector. Yang et al. [8] apply graph convolutional networks for semantic segmentation at the stroke level to extensions of the QuickDraw data [18, 19]. For an in-depth treatment of drawing recognition, we refer the reader to the recent survey by Xu et al. [9].

Particularly relevant to our work are approaches that apply generative models to stroke data. Ha et al. [6] and Ribeiro et al. [7] build LSTM/VAE-based and Transformer-based models, respectively, to generate samples from the QuickDraw dataset [20]. These approaches model the entire drawing as a single sequence of points; the different categories of drawings are modelled holistically, without taking their internal structure into account. Graves proposed an auto-regressive handwriting generation model with LSTMs [21]; handwriting explicitly follows a sequence structure, making a full-sequence representation of the ink data a reasonable choice there. In [5], an auto-regressive latent variable model is used to control the content and style aspects of handwriting, allowing for style transfer and synthesis applications. In summary, existing work models the whole drawing either as an image or as a complete sequence of points. In contrast, we model stroke-based drawings as order-invariant 2D compositional structures, and as a consequence our model scales to more complex settings.
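As a toy illustration of this order invariance (our own sketch, not part of the paper's method or code): pooling over a set of per-stroke embeddings yields a drawing representation that is unchanged under any permutation of the strokes.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_stroke(stroke):
    # Stand-in for a learned stroke encoder: any fixed map from a
    # variable-length stroke to a fixed-dimensional vector will do here.
    return np.array([stroke.mean(), stroke.std(), float(len(stroke))])

def embed_drawing(strokes):
    # Mean-pooling over the *set* of stroke embeddings is invariant to
    # the order in which the strokes are listed.
    return np.mean([embed_stroke(s) for s in strokes], axis=0)

strokes = [rng.normal(size=(t, 2)) for t in (5, 3, 8)]
shuffled = [strokes[2], strokes[0], strokes[1]]
assert np.allclose(embed_drawing(strokes), embed_drawing(shuffled))
```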
Although from diverse domains, the following works are also relevant to ours in that they explicitly consider the compositional nature of their problems. Ellis et al. [22] analyze drawing images by recognizing primitives and combining them through program synthesis. This approach models the structure between components, but unlike ours it is applied to dense image data rather than directly to sparse strokes. Lee et al. [23] apply neural networks to understand complex structures based on embeddings of basic building blocks: they learn an embedding space for mathematical equations and use a higher-level model to predict valid transformations of equations. Wang et al. [24] follow an iterative approach to synthesize indoor scenes, where the model picks an object from a database and decides where to place it. LayoutGAN [25] learns to generate realistic layouts from 2D wireframes and semantic labels for documents and abstract scenes.

Figure 2: Architecture overview. (left) The input drawing as a collection of strokes $\{s_k\}$; (middle) our embedding architecture, consisting of a shared encoder $E_\theta$, a shared decoder $D_\theta$, and a relational model $R_\theta$; (right) the input drawing with the next stroke $s_4$ and its starting position $\pi_4$ predicted by $R_\theta$ and decoded by $D_\theta$. Note that the relational model $R_\theta$ is permutation-invariant.

## 3 Method

We are interested in modelling a drawing $x$ as a collection of strokes $\{s_k\}_{k=1}^{K}$, abbreviated as $\{s_k\}$ in the following, which requires capturing the semantics of a sketch and learning the relationships between its strokes. We propose a generative model, dubbed CoSE, that first projects variable-length strokes into a fixed-dimensional latent space, and then models their relationships in this latent space to predict future strokes. This approach is illustrated in Figure 2.

More formally, given an initial set of strokes (e.g., $\{s_1, s_2, s_3\}$), we wish to predict the next stroke (e.g., $s_4$). We decompose the joint distribution of the strokes of a drawing $x$ as a product of conditional distributions over the set of existing strokes:

$$p(x) = \prod_{k=1}^{K} p(s_k, \pi_k \mid s_{<k}, \pi_{<k}),$$

where $\pi_k$ denotes the starting position of stroke $s_k$.
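A minimal sketch of the resulting prediction loop follows (our own illustration; `encode`, `relate`, and `decode` are hypothetical stand-ins for the learned modules $E_\theta$, $R_\theta$, and $D_\theta$, and the toy callables below are not the paper's models):

```python
import numpy as np

def predict_next_strokes(strokes, positions, encode, relate, decode, n_steps=3):
    """Autoregressively extend a drawing, mirroring the factorization above.

    strokes   -- list of (T_k, 2) arrays, the given strokes s_1..s_K
    positions -- list of 2-vectors, their starting positions pi_1..pi_K
    """
    strokes, positions = list(strokes), list(positions)
    for _ in range(n_steps):
        # E_theta: project each variable-length stroke to a fixed-size embedding.
        embeddings = [encode(s) for s in strokes]
        # R_theta: from the set of (embedding, position) pairs, predict the
        # embedding and the starting position of the next stroke.
        next_emb, next_pos = relate(embeddings, positions)
        # D_theta: render the embedding back into a 2D point sequence,
        # placed at the predicted starting position.
        strokes.append(decode(next_emb) + next_pos)
        positions.append(next_pos)
    return strokes, positions

# Toy stand-ins so the sketch runs end-to-end (the real modules are learned):
toy_encode = lambda s: s.mean(axis=0)                         # stroke -> (2,)
toy_relate = lambda embs, pos: (np.mean(embs, axis=0), np.mean(pos, axis=0))
toy_decode = lambda z: np.linspace([0.0, 0.0], z, num=5)      # (2,) -> (5, 2)

given = [np.random.rand(4, 2), np.random.rand(6, 2)]
starts = [s[0] for s in given]
all_strokes, all_starts = predict_next_strokes(given, starts,
                                               toy_encode, toy_relate, toy_decode)
print(len(all_strokes))  # 5: two given strokes plus three predicted ones
```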