# Parametric Visual Program Induction with Function Modularization

Xuguang Duan, Xin Wang, Ziwei Zhang, Wenwu Zhu

Department of Computer Science and Technology, Tsinghua University, Beijing, China. Correspondence to: Xin Wang, Wenwu Zhu.

**Abstract.** Generating programs to describe visual observations has gained much research attention recently. However, most existing approaches are based on non-parametric primitive functions, making them unable to handle complex visual scenes involving many attributes and details. In this paper, we propose the concept of parametric visual program induction. Learning to generate parametric programs for visual scenes is challenging due to the huge number of function variants and the complex function correlations. To address these challenges, we propose the method of function modularization, which is capable of dealing with numerous function variants and complex correlations. Specifically, we model each parametric function as a multi-head, self-contained neural module to cover different function variants. Moreover, to eliminate the complex correlations between functions, we propose the hierarchical heterogeneous Monte-Carlo tree search (H2MCTS) algorithm, which provides high-quality uncorrelated supervision during training and serves as an efficient search technique during testing. We demonstrate the superiority of the proposed method on three visual program induction datasets involving parametric primitive functions. Experimental results show that our proposed model significantly outperforms state-of-the-art baseline methods in terms of generating accurate programs.

## 1. Introduction

Studying how to generate computer-executable programs is one of the core interests of the AI community (Waldinger & Lee, 1969; Manna & Waldinger, 1975), and has drawn
Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

[Figure 1:
(a) non-parametric functions — no parameters, a single variant each:
    while (noMarkersPresent) { putMarker() move() turnLeft() }
(b) parametric functions — many parameters, more than 10^4 variants:
    line(lx=1, ty=8, rx=9, by=8, arrow=LEFT)
    line(lx=9, ty=1, rx=9, by=8, arrow=LEFT)
    for (i = [1, 2, 3]) { lx=2i+1, ty=6-i, rx=2i+2, by=8 }
]
Figure 1. (a): An example of the visual program induction task that only generates non-parametric programs, within which each function has only one variant and can be modeled as a symbolic token. (b): An example of the parametric visual program induction task studied in this paper, where parametric primitive functions with many more variants are needed to describe the complex visual scene. However, it is hard to tackle so many function variants.

lots of recent interest in the visual domain thanks to deep learning (Ellis et al., 2020). By leveraging powerful deep models, these works can successfully describe the logic behind visual games (Sun et al., 2018), learn spatial patterns hidden in images (Young et al., 2019), or conduct neural-symbolic reasoning (Yi et al., 2018). Despite their enormous success, most of the existing approaches are based on non-parametric primitive functions, failing to meet the requirements of the increasing complexity of visual observations, as well as the increasing elaboration of programs. In this paper, to the best of our knowledge, we are the first to propose the concept of Parametric Visual Program Induction, i.e., generating programs with parametric primitive functions for complex visual observations. By leveraging parametric primitive functions, we can generate much more detailed programs that describe both the hidden logic and the visual details. However, the challenges in solving parametric program induction are twofold. First, the action space for a single function can be huge.
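As a rough illustration of this action-space explosion, enumerating the parameter grid of a single `line(...)` call from Figure 1(b) already exceeds 10^4 combinations. The coordinate ranges and arrow options below are illustrative assumptions, not taken from the paper's datasets:

```python
from itertools import product

# Hypothetical parameter domains for the line() primitive in Figure 1(b):
# four integer coordinates on a 9x9 canvas plus an assumed arrow-style flag.
coords = range(1, 10)               # lx, ty, rx, by each in 1..9
arrows = ["LEFT", "RIGHT", "NONE"]  # assumed arrow options

variants = sum(1 for _ in product(coords, coords, coords, coords, arrows))
print(variants)  # 9**4 * 3 = 19683, already more than 10^4
```

Even under these conservative assumptions, a single parametric primitive contributes tens of thousands of "tokens" if each variant were treated as its own symbol, which is why a token-level treatment breaks down.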
Compared with basic non-parametric primitive functions, parametric primitive functions usually have several heterogeneous parameters, resulting in a huge number of function variants. For example, in Figure 1(a), a basic visual program induction task may contain simple primitive functions such as move() and turnLeft(), while in Figure 1(b), a parametric function studied in this work tends to have more than 10^4 variants due to different parameter combinations. Second, the function space for the whole program is also very large. Given that parametric functions may contain multiple parameters, and that these parameters and functions are correlated, it becomes very challenging to model the long-range function transitions within a program. This problem is also known as program aliasing (Bunel et al., 2018) in the non-parametric scenario, and it becomes more severe for parametric functions. These two challenges make non-parametric visual program induction methods hard to extend to the parametric domain. To address these challenges, we propose the concept and method of Function Modularization, which can model numerous and complex parametric functions. In particular, we treat each function along with its parameters as a self-contained module and train the module to predict the correct parameters given the visual context, which solves the challenge of the huge action space. Furthermore, based on the modularized functions, we propose a Hierarchical Heterogeneous Monte-Carlo Tree Search (H2MCTS) algorithm that can traverse all the program aliases, thus providing uncorrelated supervision during training and serving as a powerful search method during inference. To verify the superiority of the concept of function modularization and the efficiency of the H2MCTS algorithm, we conduct extensive experiments on a small hand-crafted dataset and two well-known datasets (Ellis et al., 2018; Dong et al., 2019).
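The modularization idea can be sketched in miniature. Assuming a toy interface (the class, the random linear scorers, and all names below are illustrative stand-ins for the paper's actual neural architecture): each primitive owns a self-contained module with one prediction head per parameter, so parameter choices are made inside the module given the visual context, rather than emitted as extra tokens of a flat sequence.

```python
import random

class FunctionModule:
    """Toy self-contained module for one parametric primitive.

    One 'head' per parameter scores that parameter's candidate values
    given an encoded visual context. In the real model the heads would
    be learned neural networks; here they are fixed random linear scorers.
    """

    def __init__(self, name, param_domains, ctx_dim=8, seed=0):
        rng = random.Random(seed)
        self.name = name
        self.param_domains = param_domains  # {param_name: candidate values}
        # one weight vector per (parameter, candidate value) pair
        self.heads = {
            p: {v: [rng.uniform(-1, 1) for _ in range(ctx_dim)] for v in dom}
            for p, dom in param_domains.items()
        }

    def predict(self, context):
        """Pick the highest-scoring value for each parameter independently."""
        chosen = {}
        for p, head in self.heads.items():
            scores = {v: sum(w * c for w, c in zip(wv, context))
                      for v, wv in head.items()}
            chosen[p] = max(scores, key=scores.get)
        return self.name, chosen

line_module = FunctionModule(
    "line",
    {"lx": range(1, 10), "ty": range(1, 10),
     "rx": range(1, 10), "by": range(1, 10),
     "arrow": ["LEFT", "RIGHT", "NONE"]},
)
fn, params = line_module.predict([0.1] * 8)
print(fn, params)
```

The point of the sketch is the decoupling: the module's output space is one value per parameter head rather than one symbol per full variant, so the 10^4-way choice never appears as a single classification problem.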
Experimental results show that a modularized function is easier to learn and achieves higher accuracy than vanilla baselines. Also, the proposed H2MCTS algorithm is able to efficiently search over different function combinations and significantly reduce the inference time. In summary, we make the following contributions:

- To the best of our knowledge, we are the first to investigate the problem of parametric visual program induction, by proposing the concept and method of Function Modularization, which decouples the learning of function parameters and function transitions, resulting in accurate and efficient learning of parametric programs.
- We propose the H2MCTS algorithm to assist the learning of modularized functions. Our proposed algorithm can provide uncorrelated data to train modularized functions and serve as an efficient search method during inference.
- We conduct extensive experiments to demonstrate that our proposed model can significantly outperform state-of-the-art baselines on all three datasets.

## 2. Related Work

Learning to generate programs has a long history in AI (Waldinger & Lee, 1969; Manna & Waldinger, 1975; 1980). Traditionally, the process of generating programs is based on search-based induction; one of the most famous works is the Excel Flash Fill system (Gulwani, 2011). These methods rely on syntax-based pruning (Feser et al., 2015) or use satisfiability-modulo-theories-based solvers (Lezama, 2008; Feser et al., 2015). With the development of deep learning, this area has gained new attention, with methods learning to generate programs directly from data (Parisotto et al., 2017; Devlin et al., 2017; Ling et al., 2017; Chollet, 2019), including previously unsolvable visual-domain tasks (Bunel et al., 2018; Sun et al., 2018; Shin et al., 2019). Besides, combining search and learning is also appealing, as it leverages the advantages of both sides (Balog et al., 2016; Irving et al., 2016; Ellis et al., 2020).
Balog et al. (2016) and Irving et al. (2016) propose to use neural networks to predict the probability of the next word, leading to a guided-search schema; Ellis et al. (2020) propose the EC2 algorithm to iteratively learn and search over a domain-specific language. Despite the success of these methods, most existing approaches work with non-parametric or few-parameter primitive functions and solve the task by treating programs as sequences of tokens whose transition dynamics are learned, which cannot effectively handle parametric programs. Besides, Nye et al. (2019) also tried to solve the problem of generating complex programs, focusing on generating longer programs with complex control flows by proposing a series of control-flow sketches and learning to fill the sketch holes. As visual scenes become prevalent, researchers have started to work with much more complex visual scenes such as LaTeX drawings and computer-aided design objects (Eslami et al., 2016; Ellis et al., 2018; Young et al., 2019; Tian et al., 2019; Zhou et al., 2021). Most of these tasks are based on parametric functions, which makes the traditional view of treating the program as a sequence of tokens collapse due to the large number of variants of parametric functions. Ellis et al. (2018) use an STN (Jaderberg et al., 2015) to model multiple parameters, Tian et al. (2019) align all the function parameters such that they can be modeled with the same neural network, while Zhou et al. (2021) use a grammar-encoded LSTM model. Though they obtain remarkable results, these methods are not easy to generalize. Compared with existing methods, we follow the combination of learning and searching while, at the same time, tackling parametric primitive functions. We propose to model each parametric function along with its parameters as a module, and we propose the H2MCTS algorithm that benefits both training and inference.
## 3. Problem Formulation

### 3.1. Notations and Problem Formulation

Following Piantadosi (2011), we define a program as a logical collection of primitive functions. Specifically, given a set of primitive functions $\mathcal{F}$, a program $P = (f_1^{\Theta_1}, f_2^{\Theta_2}, \dots, f_T^{\Theta_T})$, where $f_i^{\Theta_i} \in \mathcal{F}$ is a primitive function $f$ with parameters $\Theta_i = (\Theta_{i,0}, \Theta_{i,1}, \dots, \Theta_{i,n_f})$, $n_f$ is the number of parameters for $f$, and $T$ is a program-dependent parameter that indicates the length of the program $P$. Besides, in the main text of this paper, we focus on the parametric functions and simplify our program syntax to a context-free grammar (CFG) (Zhou et al., 2021), i.e., programs without loops and other control commands; we show in the experiments (Section 6.3) and Appendix B that our method can be easily extended to context-based scenarios. The task of parametric visual program induction is defined as: given an input-output observation pair $(O_I, O_O)$, find a parametric program $P$ that transforms the input to the output:

$$P(O_I) \rightarrow O_O. \tag{1}$$

Moreover, based on the CFG, Eq. (1) can be rewritten as

$$f_T^{\Theta_T} \circ f_{T-1}^{\Theta_{T-1}} \circ \cdots \circ f_1^{\Theta_1}(O_I) \rightarrow O_O, \tag{2}$$

where $f_i^{\Theta_i} \circ f_j^{\Theta_j}$ is the composition of two functions, i.e., $f_i^{\Theta_i} \circ f_j^{\Theta_j}(O_{\mathrm{in}}) \doteq f_i^{\Theta_i}\big(f_j^{\Theta_j}(O_{\mathrm{in}})\big)$.

### 3.2. The Existing Methods

To generate the desired program $P$ in Eq. (1), most of the existing works adopt the method of tokenization, i.e., transforming $(f_1^{\Theta_1}, f_2^{\Theta_2}, \dots, f_T^{\Theta_T})$ into $(t_1, t_2, \dots, t_N)$, where $t_i$ is a token and $N$ is the number of tokens. The probability of program $P$ is calculated by assuming the Markov property:

$$\Pr[P \mid O_I, O_O] = \prod_{i=1}^{N} \Pr[t_i \mid t_{i-1}, O_I, O_O]$$
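Under this formulation, executing a program is just a left-to-right composition of parametric calls, as in Eq. (2). A minimal executable sketch (the grid observation and the `draw_line` semantics below are illustrative assumptions, not the paper's actual primitives):

```python
def draw_line(grid, lx, ty, rx, by):
    """Toy primitive: mark cells on an axis-aligned line segment."""
    out = [row[:] for row in grid]
    if ty == by:                      # horizontal segment
        for x in range(lx, rx + 1):
            out[ty][x] = 1
    else:                             # vertical segment (lx == rx assumed)
        for y in range(ty, by + 1):
            out[y][lx] = 1
    return out

def run_program(program, observation):
    """Apply f_T ∘ ... ∘ f_1 to the input observation, as in Eq. (2)."""
    for fn, params in program:        # f_1 applied first, f_T last
        observation = fn(observation, **params)
    return observation

# A two-step program on a 10x10 grid, echoing the calls in Figure 1(b).
O_I = [[0] * 10 for _ in range(10)]
P = [(draw_line, {"lx": 1, "ty": 8, "rx": 9, "by": 8}),
     (draw_line, {"lx": 9, "ty": 1, "rx": 9, "by": 8})]
O_O = run_program(P, O_I)
print(sum(map(sum, O_O)))  # number of marked cells
```

The induction task is the inverse of `run_program`: given only $(O_I, O_O)$, recover both the sequence of primitives and the parameters of each call.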