# Parametric Visual Program Induction with Function Modularization

Xuguang Duan, Xin Wang, Ziwei Zhang, Wenwu Zhu

Department of Computer Science and Technology, Tsinghua University, Beijing, China. Correspondence to: Xin Wang, Wenwu Zhu.

**Abstract.** Generating programs to describe visual observations has gained much research attention recently. However, most existing approaches are based on non-parametric primitive functions, making them unable to handle complex visual scenes involving many attributes and details. In this paper, we propose the concept of parametric visual program induction. Learning to generate parametric programs for visual scenes is challenging due to the huge number of function variants and the complex function correlations. To address these challenges, we propose the method of function modularization, which is capable of dealing with numerous function variants and complex correlations. Specifically, we model each parametric function as a multi-head, self-contained neural module to cover different function variants. Moreover, to eliminate the complex correlations between functions, we propose the hierarchical heterogeneous Monte-Carlo tree search (H2MCTS) algorithm, which provides high-quality uncorrelated supervision during training and serves as an efficient search technique during testing. We demonstrate the superiority of the proposed method on three visual program induction datasets involving parametric primitive functions. Experimental results show that our proposed model significantly outperforms state-of-the-art baseline methods in terms of generating accurate programs.

## 1. Introduction

Studying how to generate computer-executable programs is one of the core interests of the AI community (Waldinger & Lee, 1969; Manna & Waldinger, 1975), and has drawn
Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

[Figure 1:
(a) non-parametric functions — no parameters, a single variant each:
    while (noMarkersPresent) { putMarker() move() turnLeft() }
(b) parametric functions — many parameters, more than 10^4 variants:
    line(lx=1, ty=8, rx=9, by=8, arrow=LEFT)
    line(lx=9, ty=1, rx=9, by=8, arrow=LEFT)
    for (i = [1, 2, 3]) { lx=2i+1, ty=6-i, rx=2i+2, by=8 }
]
Figure 1. (a): An example of the visual program induction task that only generates non-parametric programs, within which each function has only one variant and can be modeled as a symbolic token. (b): An example of the parametric visual program induction task studied in this paper, where parametric primitive functions with many more variants are needed to describe the complex visual scene. However, it is hard to tackle so many function variants.

lots of recent interest in the visual domain thanks to deep learning (Ellis et al., 2020). By leveraging powerful deep models, these works can successfully describe the logic behind visual games (Sun et al., 2018), learn spatial patterns hidden in images (Young et al., 2019), or conduct neural-symbolic reasoning (Yi et al., 2018). Despite their enormous success, most of the existing approaches are based on non-parametric primitive functions, failing to meet the requirements of the increasing complexity of visual observations, as well as the increasing elaboration of programs. In this paper, to the best of our knowledge, we are the first to propose the concept of Parametric Visual Program Induction, i.e., generating programs with parametric primitive functions for complex visual observations. By leveraging parametric primitive functions, we can generate much more detailed programs that describe both the hidden logic and the visual details. However, the challenges in solving parametric program induction are twofold. First, the action space for a single function can be huge.
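As a rough illustration of this action-space explosion, enumerating the parameter grid of a single `line(...)` call from Figure 1(b) already exceeds 10^4 combinations. The coordinate ranges and arrow options below are illustrative assumptions, not taken from the paper's datasets:

```python
from itertools import product

# Hypothetical parameter domains for the line() primitive in Figure 1(b):
# four integer coordinates on a 9x9 canvas plus an assumed arrow-style flag.
coords = range(1, 10)               # lx, ty, rx, by each in 1..9
arrows = ["LEFT", "RIGHT", "NONE"]  # assumed arrow options

variants = sum(1 for _ in product(coords, coords, coords, coords, arrows))
print(variants)  # 9**4 * 3 = 19683, already more than 10^4
```

Even under these conservative assumptions, a single parametric primitive contributes tens of thousands of "tokens" if each variant were treated as its own symbol, which is why a token-level treatment breaks down.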
Compared with basic non-parametric primitive functions, parametric primitive functions usually have several heterogeneous parameters, resulting in a huge number of function variants. For example, in Figure 1(a), a basic visual program induction task may contain simple primitive functions such as move() and turnLeft(), while in Figure 1(b), a parametric function studied in this work tends to have more than 10^4 variants due to different parameter combinations. Second, the function space for the whole program is also very large. Given that parametric functions may contain multiple parameters, and that these parameters and functions are correlated, it becomes very challenging to model the long-range function transitions within a program. This problem is also known as program aliasing (Bunel et al., 2018) in the non-parametric scenario, and it becomes more severe for parametric functions. These two challenges make non-parametric visual program induction methods hard to extend to the parametric domain. To address these challenges, we propose the concept and method of Function Modularization, which can model numerous and complex parametric functions. In particular, we treat each function along with its parameters as a self-contained module and train the module to predict the correct parameters given the visual context, which solves the challenge of the huge action space. Furthermore, based on the modularized functions, we propose a Hierarchical Heterogeneous Monte-Carlo Tree Search (H2MCTS) algorithm that can traverse all the program aliases, thus providing uncorrelated supervision during training and serving as a powerful search method during inference. To verify the superiority of the concept of function modularization and the efficiency of the H2MCTS algorithm, we conduct extensive experiments on a small hand-crafted dataset and two well-known datasets (Ellis et al., 2018; Dong et al., 2019).
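The modularization idea can be sketched in miniature. Assuming a toy interface (the class, the random linear scorers, and all names below are illustrative stand-ins for the paper's actual neural architecture): each primitive owns a self-contained module with one prediction head per parameter, so parameter choices are made inside the module given the visual context, rather than emitted as extra tokens of a flat sequence.

```python
import random

class FunctionModule:
    """Toy self-contained module for one parametric primitive.

    One 'head' per parameter scores that parameter's candidate values
    given an encoded visual context. In the real model the heads would
    be learned neural networks; here they are fixed random linear scorers.
    """

    def __init__(self, name, param_domains, ctx_dim=8, seed=0):
        rng = random.Random(seed)
        self.name = name
        self.param_domains = param_domains  # {param_name: candidate values}
        # one weight vector per (parameter, candidate value) pair
        self.heads = {
            p: {v: [rng.uniform(-1, 1) for _ in range(ctx_dim)] for v in dom}
            for p, dom in param_domains.items()
        }

    def predict(self, context):
        """Pick the highest-scoring value for each parameter independently."""
        chosen = {}
        for p, head in self.heads.items():
            scores = {v: sum(w * c for w, c in zip(wv, context))
                      for v, wv in head.items()}
            chosen[p] = max(scores, key=scores.get)
        return self.name, chosen

line_module = FunctionModule(
    "line",
    {"lx": range(1, 10), "ty": range(1, 10),
     "rx": range(1, 10), "by": range(1, 10),
     "arrow": ["LEFT", "RIGHT", "NONE"]},
)
fn, params = line_module.predict([0.1] * 8)
print(fn, params)
```

The point of the sketch is the decoupling: the module's output space is one value per parameter head rather than one symbol per full variant, so the 10^4-way choice never appears as a single classification problem.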
Experimental results show that a modularized function is easier to learn and achieves higher accuracy than vanilla baselines. Also, the proposed H2MCTS algorithm is able to efficiently search over different function combinations and significantly reduce the inference time. In summary, we make the following contributions:

- To the best of our knowledge, we are the first to investigate the problem of parametric visual program induction, by proposing the concept and method of Function Modularization, which decouples the learning of function parameters and function transitions, resulting in accurate and efficient learning of parametric programs.
- We propose the H2MCTS algorithm to assist the learning of modularized functions. Our proposed algorithm can provide uncorrelated data to train modularized functions and serve as an efficient search method during inference.
- We conduct extensive experiments to demonstrate that our proposed model can significantly outperform state-of-the-art baselines on all three datasets.

## 2. Related Work

Learning to generate programs has a long history in AI (Waldinger & Lee, 1969; Manna & Waldinger, 1975; 1980). Traditionally, the process of generating programs is based on search-based induction; one of the most famous works is the Excel Flash Fill system (Gulwani, 2011). These methods rely on syntax-based pruning (Feser et al., 2015) or use satisfiability-modulo-theories-based solvers (Lezama, 2008; Feser et al., 2015). With the development of deep learning, this area has gained new attention, with methods learning to generate programs directly from data (Parisotto et al., 2017; Devlin et al., 2017; Ling et al., 2017; Chollet, 2019), including previously unsolvable visual-domain tasks (Bunel et al., 2018; Sun et al., 2018; Shin et al., 2019). Besides, combining search and learning is also appealing, as it leverages the advantages of both sides (Balog et al., 2016; Irving et al., 2016; Ellis et al., 2020).
Balog et al. (2016) and Irving et al. (2016) propose to use neural networks to predict the probability of the next word, leading to a guided-search schema; Ellis et al. (2020) propose the EC2 algorithm to iteratively learn and search over a domain-specific language. Despite the success of these methods, most existing approaches work with non-parametric or few-parameter primitive functions and solve the task by treating programs as sequences of tokens whose transition dynamics are learned, which cannot effectively handle parametric programs. Besides, Nye et al. (2019) also tried to solve the problem of generating complex programs, focusing on generating longer programs with complex control flows by proposing a series of control-flow sketches and learning to fill the sketch holes. As visual scenes become prevalent, researchers have started to work with much more complex visual scenes such as LaTeX drawings and computer-aided design objects (Eslami et al., 2016; Ellis et al., 2018; Young et al., 2019; Tian et al., 2019; Zhou et al., 2021). Most of these tasks are based on parametric functions, which makes the traditional view of treating the program as a sequence of tokens collapse due to the large number of variants of parametric functions. Ellis et al. (2018) use an STN (Jaderberg et al., 2015) to model multiple parameters, Tian et al. (2019) align all the function parameters such that they can be modeled with the same neural network, while Zhou et al. (2021) use a grammar-encoded LSTM model. Though they obtain remarkable results, these methods are not easy to generalize. Compared with existing methods, we follow the combination of learning and searching while, at the same time, tackling parametric primitive functions. We propose to model each parametric function along with its parameters as a module, and we propose the H2MCTS algorithm that benefits both training and inference.
## 3. Problem Formulation

### 3.1. Notations and Problem Formulation

Following Piantadosi (2011), we define a program as a logical collection of primitive functions. Specifically, given a set of primitive functions $\mathcal{F}$, a program $P = (f_1^{\Theta_1}, f_2^{\Theta_2}, \dots, f_T^{\Theta_T})$, where $f_i^{\Theta_i} \in \mathcal{F}$ is a primitive function $f$ with parameters $\Theta_i = (\Theta_{i,0}, \Theta_{i,1}, \dots, \Theta_{i,n_f})$, $n_f$ is the number of parameters for $f$, and $T$ is a program-dependent parameter that indicates the length of the program $P$. Besides, in the main text of this paper, we focus on the parametric functions and simplify our program syntax to a context-free grammar (CFG) (Zhou et al., 2021), i.e., programs without loops and other control commands; we show in the experiments (Section 6.3) and Appendix B that our method can be easily extended to context-based scenarios. The task of parametric visual program induction is defined as: given an input-output observation pair $(O_I, O_O)$, find a parametric program $P$ that transforms the input to the output:

$$P(O_I) \rightarrow O_O. \tag{1}$$

Moreover, based on the CFG, Eq. (1) can be rewritten as

$$f_T^{\Theta_T} \circ f_{T-1}^{\Theta_{T-1}} \circ \cdots \circ f_1^{\Theta_1}(O_I) \rightarrow O_O, \tag{2}$$

where $f_i^{\Theta_i} \circ f_j^{\Theta_j}$ is the composition of two functions, i.e., $f_i^{\Theta_i} \circ f_j^{\Theta_j}(O_{\mathrm{in}}) \doteq f_i^{\Theta_i}\big(f_j^{\Theta_j}(O_{\mathrm{in}})\big)$.

### 3.2. The Existing Methods

To generate the desired program $P$ in Eq. (1), most of the existing works adopt the method of tokenization, i.e., transforming $(f_1^{\Theta_1}, f_2^{\Theta_2}, \dots, f_T^{\Theta_T})$ into $(t_1, t_2, \dots, t_N)$, where $t_i$ is a token and $N$ is the number of tokens. The probability of program $P$ is calculated by assuming the Markov property:

$$\Pr[P \mid O_I, O_O] = \prod_{i=1}^{N} \Pr[t_i \mid t_{i-1}, O_I, O_O]$$
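Under this formulation, executing a program is just a left-to-right composition of parametric calls, as in Eq. (2). A minimal executable sketch (the grid observation and the `draw_line` semantics below are illustrative assumptions, not the paper's actual primitives):

```python
def draw_line(grid, lx, ty, rx, by):
    """Toy primitive: mark cells on an axis-aligned line segment."""
    out = [row[:] for row in grid]
    if ty == by:                      # horizontal segment
        for x in range(lx, rx + 1):
            out[ty][x] = 1
    else:                             # vertical segment (lx == rx assumed)
        for y in range(ty, by + 1):
            out[y][lx] = 1
    return out

def run_program(program, observation):
    """Apply f_T ∘ ... ∘ f_1 to the input observation, as in Eq. (2)."""
    for fn, params in program:        # f_1 applied first, f_T last
        observation = fn(observation, **params)
    return observation

# A two-step program on a 10x10 grid, echoing the calls in Figure 1(b).
O_I = [[0] * 10 for _ in range(10)]
P = [(draw_line, {"lx": 1, "ty": 8, "rx": 9, "by": 8}),
     (draw_line, {"lx": 9, "ty": 1, "rx": 9, "by": 8})]
O_O = run_program(P, O_I)
print(sum(map(sum, O_O)))  # number of marked cells
```

The induction task is the inverse of `run_program`: given only $(O_I, O_O)$, recover both the sequence of primitives and the parameters of each call.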