# BM-NAS: Bilevel Multimodal Neural Architecture Search

Yihang Yin¹, Siyu Huang², Xiang Zhang³
¹Nanyang Technological University, ²Harvard University, ³The Pennsylvania State University
yyin009@e.ntu.edu.sg, huang@seas.harvard.edu, xzz89@psu.edu

Deep neural networks (DNNs) have shown superior performance on various multimodal learning problems. However, it often requires huge effort to adapt DNNs to individual multimodal tasks by manually engineering unimodal features and designing multimodal feature fusion strategies. This paper proposes the Bilevel Multimodal Neural Architecture Search (BM-NAS) framework, which makes the architecture of multimodal fusion models fully searchable via a bilevel searching scheme. At the upper level, BM-NAS selects the inter/intra-modal feature pairs from the pretrained unimodal backbones. At the lower level, BM-NAS learns the fusion strategy for each feature pair as a combination of predefined primitive operations. The primitive operations are elaborately designed so that they can be flexibly combined to form various effective feature fusion modules such as multi-head attention (Transformer) and Attention on Attention (AoA). Experimental results on three multimodal tasks demonstrate the effectiveness and efficiency of the proposed BM-NAS framework. BM-NAS achieves competitive performance with much less search time and fewer model parameters in comparison with existing generalized multimodal NAS methods. Our code is available at https://github.com/Somedaywilldo/BM-NAS.

## Introduction

Deep neural networks (DNNs) have achieved great success on various unimodal tasks (e.g., image categorization (Krizhevsky, Sutskever, and Hinton 2012; He et al. 2016), language modeling (Vaswani et al. 2017; Devlin et al. 2018), and speech recognition (Amodei et al. 2016)) as well as on multimodal tasks (e.g., action recognition (Simonyan and Zisserman 2014; Vielzeuf et al. 2018), image/video captioning (You et al. 2016; Jin et al. 2019, 2020; Yan et al. 2021), visual question answering (Lu et al. 2016; Anderson et al. 2018), and cross-modal generation (Reed et al. 2016; Zhou et al. 2019)). Despite the superior performance achieved by DNNs on these tasks, it usually requires huge effort to adapt DNNs to a specific task. Especially as the number of modalities increases, it is exhausting to manually design the backbone architectures and the feature fusion strategies. This raises an urgent need for the automatic design of multimodal DNNs with minimal human intervention.

Figure 1: An overview of our BM-NAS framework for multimodal learning. A Cell is a searched feature fusion unit that accepts two inputs from modality features or other Cells. In a bilevel fashion, we search the connections between Cells and the inner structures of Cells simultaneously.

Neural architecture search (NAS) (Zoph and Le 2017; Liu et al. 2018a) is a promising data-driven solution to this concern, as it searches for the optimal neural network architecture from a predefined space. By applying NAS to multimodal learning, MMnas (Yu et al. 2020) searches the architecture of a Transformer model for visual-text alignment, and MMIF (Peng et al.
2020) searches the optimal CNN structure to extract multi-modality image features for tomography. These methods lack generalization ability since they are designed for models on specific modalities. MFAS (Pérez-Rúa et al. 2019) is a more generalized framework which searches the feature fusion strategy based on the unimodal features. However, MFAS only allows fusion of inter-modal features, and the fusion operations are not searchable. This results in a limited space of feature fusion strategies when dealing with the various modalities in different multimodal tasks.

In this paper, we propose a generalized framework, named Bilevel Multimodal Neural Architecture Search (BM-NAS), to adaptively learn the architectures of DNNs for a variety of multimodal tasks. BM-NAS adopts a bilevel searching scheme in which it learns the unimodal feature selection strategy at the upper level and the multimodal feature fusion strategy at the lower level. As shown in the left part of Fig. 1, the upper level of BM-NAS consists of a series of feature fusion units, i.e., Cells. The Cells are organized to combine and transform the unimodal features into the task output through a searchable directed acyclic graph (DAG). The right part of Fig. 1 illustrates the lower level of BM-NAS, which learns the inner structures of the Cells. A Cell is comprised of several predefined primitive operations. We carefully select the primitive operations such that different combinations of them can form a large variety of fusion modules. As shown in Fig. 2, our search space incorporates benchmark attention mechanisms such as multi-head attention (Transformer) (Vaswani et al. 2017) and Attention on Attention (AoA) (Huang et al. 2019). The bilevel scheme of BM-NAS is learned end-to-end using the differentiable NAS framework (Liu, Simonyan, and Yang 2019).

Figure 2: The search space of a Cell in BM-NAS accommodates many existing multimodal fusion strategies, including (b) Concat & FC (MFAS), (c) Attention on Attention (AoANet), (d) Multi-Head Attention (Transformer), and (e) a Cell searched by BM-NAS (ours). (d) is a two-head version of multi-head attention (Vaswani et al. 2017), and more heads can be flexibly added by changing the number of inner steps. (e) is the Cell found by BM-NAS on the NTU RGB-D dataset (Shahroudy et al. 2016), and it outperforms these existing fusion strategies (see Table 7).
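To make this composition concrete, below is a minimal PyTorch sketch of how primitives of the kind labeled in Fig. 2 (scaled dot-product attention, a 'Linear' projection, a GLU gate) can be stacked within a Cell to recover a two-head attention step or an AoA-style step. The class names, the (N, C, L) feature layout, and the wiring are illustrative assumptions for this excerpt, not the authors' implementation; the excerpt itself only refers to a predefined primitive operation pool.

```python
# Minimal sketch (not the authors' code): composing Fig. 2-style primitives
# -- scaled dot-product attention, a 'Linear' projection, and a GLU gate --
# into a two-head attention step and an AoA-style step. Features are assumed
# to use the (N, C, L) layout described in the Methodology section.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledDotAttn(nn.Module):
    """Primitive: scaled dot-product attention of x (queries) over y (keys/values)."""
    def forward(self, x, y):
        q, k, v = x.transpose(1, 2), y.transpose(1, 2), y.transpose(1, 2)  # (N, L, C)
        scores = q @ k.transpose(1, 2) / (q.size(-1) ** 0.5)               # (N, L, L)
        return (scores.softmax(dim=-1) @ v).transpose(1, 2)                # (N, C, L)

class TwoHeadAttn(nn.Module):
    """Fig. 2(d): two attention inner steps whose outputs are concatenated."""
    def __init__(self):
        super().__init__()
        self.head1, self.head2 = ScaledDotAttn(), ScaledDotAttn()

    def forward(self, x, y):
        return torch.cat([self.head1(x, y), self.head2(x, y)], dim=1)      # (N, 2C, L)

class AoAStep(nn.Module):
    """Fig. 2(c): attention followed by a Linear + GLU gate (Attention on Attention)."""
    def __init__(self, c):
        super().__init__()
        self.attn = ScaledDotAttn()
        self.linear = nn.Conv1d(2 * c, 2 * c, kernel_size=1)   # the 'Linear' primitive

    def forward(self, x, y):
        attended = self.attn(x, y)                  # (N, C, L)
        h = torch.cat([attended, x], dim=1)         # concatenate attention result with the query
        return F.glu(self.linear(h), dim=1)         # gated output, back to (N, C, L)
```

Changing the number of inner steps, as noted in the Fig. 2 caption, is what turns the two-head sketch into a general multi-head module.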
We conduct extensive experiments on three multimodal tasks to evaluate the proposed BM-NAS framework. BM-NAS shows superior performance in comparison with state-of-the-art multimodal methods. Compared with the existing generalized multimodal NAS frameworks, BM-NAS achieves competitive performance with much less search time and fewer model parameters. To the best of our knowledge, BM-NAS is the first multimodal NAS framework that supports the search of both the unimodal feature selection strategies and the multimodal fusion strategies.

The main contributions of this paper are three-fold.

1. Towards a more generalized and flexible design of DNNs for multimodal learning, we propose a new paradigm that employs NAS to search both the unimodal feature selection strategy and the multimodal fusion strategy.
2. We present a novel BM-NAS framework to address the proposed paradigm. BM-NAS makes the architecture of multimodal fusion models fully searchable via a bilevel searching scheme.
3. We conduct extensive experiments on three multimodal learning tasks to evaluate the proposed BM-NAS framework. Empirical evidence indicates that both the unimodal feature selection strategy and the multimodal fusion method are significant to the performance of multimodal DNNs.

## Related Work

### Neural Architecture Search

Neural architecture search (NAS) aims at automatically finding the optimal neural network architectures for specific learning tasks. NAS can be viewed as a bilevel optimization problem, optimizing the weights and the architecture of a DNN at the same time. Since the network architecture is discrete, traditional NAS methods usually rely on black-box optimization algorithms, resulting in an extremely large computing cost. For example, searching architectures using reinforcement learning (Zoph and Le 2016) or evolution (Real et al. 2019) would require thousands of GPU-days to find a state-of-the-art architecture on the ImageNet dataset (Deng et al. 2009) due to low sampling efficiency. As a result, many methods have been proposed to speed up NAS. From an engineering perspective, ENAS (Pham et al. 2018) improves the sampling efficiency by weight sharing. From the perspective of optimization algorithms, PNAS (Liu et al. 2018b) employs sequential model-based optimization (SMBO) (Hutter, Hoos, and Leyton-Brown 2011), using a surrogate model to predict the performance of an architecture. Monte Carlo tree search (MCTS) (Negrinho and Gordon 2017) and Bayesian optimization (BO) (Kandasamy et al. 2018) have also been explored to enhance the sampling efficiency.

Recently, a remarkable efficiency improvement in NAS has been achieved by differentiable architecture search (DARTS) (Liu, Simonyan, and Yang 2019). DARTS introduces a continuous relaxation of the network architecture, making it possible to search an architecture via gradient-based optimization. However, DARTS only supports the search of unary operations. For specific multimodal tasks, we expect the NAS framework to support the search of multi-input operations in order to obtain the optimal fusion strategy. In this work, we devise a novel NAS framework named BM-NAS for multimodal learning. BM-NAS follows the optimization scheme of DARTS; however, it introduces a novel bilevel searching scheme that searches the unimodal feature selection strategy and the multimodal fusion strategy simultaneously, enabling an effective search scheme for multimodal fusion.
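Since BM-NAS follows the optimization scheme of DARTS, the sketch below illustrates the alternating gradient updates typical of differentiable NAS: the architecture parameters α are updated on a validation split while the ordinary network weights w are updated on the training split. The `weights()`/`alphas()` accessors, the hyperparameters, the single-tensor inputs, and the first-order approximation are illustrative assumptions, not details taken from the paper.

```python
# Sketch of DARTS-style alternating (first-order) optimization used by
# differentiable NAS. `hypernet.weights()` and `hypernet.alphas()` are assumed
# accessors for the network weights w and the architecture parameters alpha;
# multimodal inputs are simplified to one tensor per batch.
import torch

def search(hypernet, train_loader, val_loader, epochs=50, device="cuda"):
    w_opt = torch.optim.SGD(hypernet.weights(), lr=0.025, momentum=0.9, weight_decay=3e-4)
    a_opt = torch.optim.Adam(hypernet.alphas(), lr=3e-4, weight_decay=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _ in range(epochs):
        for (x_tr, y_tr), (x_val, y_val) in zip(train_loader, val_loader):
            x_tr, y_tr = x_tr.to(device), y_tr.to(device)
            x_val, y_val = x_val.to(device), y_val.to(device)

            # Update architecture parameters alpha on the validation split.
            a_opt.zero_grad()
            loss_fn(hypernet(x_val), y_val).backward()
            a_opt.step()

            # Update network weights w on the training split.
            w_opt.zero_grad()
            loss_fn(hypernet(x_tr), y_tr).backward()
            w_opt.step()
    return hypernet
```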
### Multimodal Fusion

The multimodal fusion techniques for DNNs can be generally classified into two categories: early fusion and late fusion. Early fusion combines low-level features, while late fusion combines prediction-level features. To combine these features, a series of reduction operations such as weighted average (Natarajan et al. 2012) and bilinear product (Teney et al. 2018) have been proposed in previous works. As each unimodal DNN backbone could have tens of layers or more, manually sorting out the best intermediate features for multimodal fusion can be exhausting. Therefore, some works propose to enable fusion at multiple intermediate layers. For instance, CentralNet (Vielzeuf et al. 2018) and MMTM (Joze et al. 2020) join the latent representations at each layer and pass them as auxiliary information to deeper layers. Such methods achieve superior performances on several multimodal tasks including multimodal action recognition (Shahroudy et al. 2016) and gesture recognition (Zhang et al. 2018). However, they largely increase the parameters of multimodal fusion models.

In recent years, there has been increased interest in introducing attention mechanisms such as the Transformer (Vaswani et al. 2017) to multimodal learning. The multimodal-BERT family (Chen et al. 2019; Li et al. 2019; Lu et al. 2019; Tan and Bansal 2019) is a typical approach for inter-modal fusion. Moreover, DFAF (Gao et al. 2019) shows that intra-modal fusion can also be helpful. DFAF proposes a dynamic attention flow module to mix inter-modal and intra-modal features together through multi-head attention (Vaswani et al. 2017). Additional efforts have been made to enhance the multimodal fusion efficacy of attention mechanisms. For instance, AoANet (Huang et al. 2019) proposes the attention on attention (AoA) module, showing that adding an attention operation on top of another one can achieve better performance on the image captioning task.

Recently, NAS approaches have been making exciting progress for DNNs, and there is huge potential in introducing NAS to multimodal learning. One representative work is MFAS (Pérez-Rúa et al. 2019), which employs the SMBO algorithm (Hutter, Hoos, and Leyton-Brown 2011) to search multimodal fusion strategies given the unimodal backbones. But as SMBO is a black-box optimization algorithm, every update step requires a number of DNNs to be trained, leading to the inefficiency of MFAS. Besides, MFAS only uses concatenation and fully connected (FC) layers for unimodal feature fusion, and the stack of FC layers imposes a heavy computational burden. Further works such as MMIF (Peng et al. 2020) and 3D-CDC (Yu et al. 2021) adopt the efficient DARTS algorithm (Liu, Simonyan, and Yang 2019) for architecture search, but they only support the search of unary operations on graph edges and use summation on every intermediate node for reduction. MMnas (Yu et al. 2020) allows searching attention operations, but the topological structure of the network is fixed during architecture search.

Different from these related works, our proposed BM-NAS supports searching both the unimodal feature selection strategy and the fusion strategy of multimodal DNNs. BM-NAS introduces a bilevel searching scheme. The upper level of BM-NAS supports both intra-modal and inter-modal feature selection. The lower level of BM-NAS searches the fusion operations within every intermediate step. Each step can flexibly form summation, concatenation, multi-head attention (Vaswani et al. 2017), attention on attention (Huang et al. 2019), or other unexplored fusion mechanisms. BM-NAS is a generalized and efficient NAS framework for multimodal learning. In experiments, we show that BM-NAS can be applied to various multimodal tasks regardless of the modalities or backbone models.

## Methodology

In this work, we propose a generalized NAS framework, named Bilevel Multimodal NAS (BM-NAS), to search the architectures of multimodal fusion DNNs. More specifically, BM-NAS searches a Cell-by-Cell architecture in a bilevel fashion. The upper level architecture is a directed acyclic graph (DAG) of the input features and Cells. The lower level architecture is a DAG of inner step nodes within a Cell. Each inner step node is a bivariate operation drawn from a predefined pool. The bilevel searching scheme ensures that BM-NAS can be easily adapted to various multimodal learning tasks regardless of the types of modalities. In the following, we discuss the unimodal feature extraction, the upper and lower levels of BM-NAS, along with the architecture search algorithm and evaluation.
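As a concrete picture of the two levels, the sketch below shows one way a discretized (post-search) BM-NAS architecture could be recorded: a list of Cells, each naming the two predecessors it selects at the upper level, and each holding the inner steps and bivariate operations chosen at the lower level. All field names, operation names, and the example genotype are hypothetical, not the paper's actual format.

```python
# Illustrative sketch: recording a discretized bilevel architecture as data.
# The structure mirrors the description above (upper-level DAG of features and
# Cells; lower-level DAG of inner steps); all names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class InnerStep:
    inputs: tuple            # two inputs: Cell inputs ('x', 'y') or earlier steps ('step_1', ...)
    op: str                  # bivariate primitive, e.g. 'scaled_dot_attn', 'concat_fc', 'sum'

@dataclass
class Cell:
    predecessors: tuple      # upper level: two selected nodes from the ordered sequence S
    steps: list = field(default_factory=list)   # lower level: inner step DAG

# Example genotype: one Cell fusing an intermediate feature of modality A with
# one of modality B, via an attention step followed by a gated refinement step.
genotype = [
    Cell(predecessors=("A_3", "B_2"),
         steps=[InnerStep(("x", "y"), "scaled_dot_attn"),
                InnerStep(("step_1", "x"), "glu_gate")]),
]
```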
### Unimodal Feature Extraction

Following previous multimodal fusion works such as CentralNet (Vielzeuf et al. 2018), MFAS (Pérez-Rúa et al. 2019), and MMTM (Joze et al. 2020), we also employ pretrained unimodal backbone models as the feature extractors. We use the outputs of their intermediate layers as raw features (or intermediate blocks if the model has a block-by-block structure like ResNeXt (Xie et al. 2017)). Since the raw features vary in shape, we reshape them by applying pooling, interpolation, and fully connected layers on the spatial, temporal, and channel dimensions, successively. By doing so, we reshape all the raw features to the shape of (N, C, L), such that we can easily perform fusion operations between features of different modalities. Here N is the batch size, C is the embedding dimension or the number of channels, and L is the sequence length.

### Upper Level: Cells for Feature Selection

The upper level of BM-NAS searches the unimodal feature selection strategy, and it consists of a group of Cells. Formally, suppose we have two modalities A and B, and two pretrained unimodal models, one for each modality. Let {A^(i)} and {B^(i)} denote the modality features extracted by the backbone models. We formulate the upper level nodes as an ordered sequence S,

$$S = \big[A^{(1)}, \ldots, A^{(N_A)}, B^{(1)}, \ldots, B^{(N_B)}, \mathrm{Cell}^{(1)}, \ldots, \mathrm{Cell}^{(N)}\big].$$

Under the setting of S, both inter-modal fusion and intra-modal fusion are considered in BM-NAS.

Figure 3: An example of a multimodal fusion network found by BM-NAS through the bilevel searching scheme; searched edges are denoted in blue and fixed edges in black. Left: the upper level of BM-NAS. The input features are extracted by pretrained unimodal models. Each Cell accepts two inputs from its predecessors, i.e., any unimodal feature or previous Cell. Right: the lower level of BM-NAS. Within a Cell, each Step denotes a primitive operation selected from a predefined operation pool. The topologies of Cells and Steps are both searchable. The numbers of Cells and Steps are hyper-parameters, such that BM-NAS can be adapted to a variety of multimodal tasks with different scales.

**Feature Selection.** By adopting the continuous relaxation of the differentiable architecture search scheme (Liu, Simonyan, and Yang 2019), all predecessors of $\mathrm{Cell}^{(i)}$ are connected to $\mathrm{Cell}^{(i)}$ through weighted edges at the searching stage. This directed complete graph between Cells is called the hypernet. For two upper level nodes $s^{(i)}, s^{(j)} \in S$, let $\alpha^{(i,j)}$ denote the edge weight between $s^{(i)}$ and $s^{(j)}$. Each edge is a unary operation $g$ selected from a function set $\mathcal{G}$ including (1) $\mathrm{Identity}(x) = x$, i.e., selecting an edge, and (2) $\mathrm{Zero}(x) = 0$, i.e., discarding an edge. Then, the mixed edge operation $\bar{g}^{(i,j)}$ on edge $(i, j)$ is

$$\bar{g}^{(i,j)}(s) = \sum_{g \in \mathcal{G}} \frac{\exp\big(\alpha^{(i,j)}_{g}\big)}{\sum_{g' \in \mathcal{G}} \exp\big(\alpha^{(i,j)}_{g'}\big)}\, g(s). \qquad (1)$$

A Cell $s^{(j)}$ receives inputs from all of its predecessors through the mixed edge operations, i.e.,

$$\bar{g}^{(i,j)}\big(s^{(i)}\big), \quad i < j. \qquad (2)$$

In the evaluation stage, the network architecture is discretized: an input pair $(s^{(i)}, s^{(j)})$ will be selected for $s^{(k)}$ if $(i, j) = \arg\max_{i}$
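To make Eq. (1) concrete, here is a minimal PyTorch sketch of a mixed edge with the two-function set G = {Identity, Zero}; the softmax over the edge weights α then acts as a learned scalar gate on each predecessor. The class and variable names are illustrative rather than taken from the released code, and the discretization comment only paraphrases the (truncated) selection rule above.

```python
# Minimal sketch of the mixed edge operation in Eq. (1) for G = {Identity, Zero}.
# With this two-function set, the softmax-weighted sum of Identity(x) = x and
# Zero(x) = 0 reduces to a learned scalar gate on each candidate edge.
import torch
import torch.nn as nn

class MixedEdge(nn.Module):
    def __init__(self):
        super().__init__()
        # alpha[0] weights Identity, alpha[1] weights Zero
        self.alpha = nn.Parameter(1e-3 * torch.randn(2))

    def forward(self, s):
        w = torch.softmax(self.alpha, dim=0)          # softmax over G, as in Eq. (1)
        return w[0] * s + w[1] * torch.zeros_like(s)

# During search, a Cell sees every predecessor through such a gated edge;
# at evaluation, the architecture is discretized by keeping only the
# highest-weighted input pair (per the arg max selection rule above).
edges = nn.ModuleList(MixedEdge() for _ in range(5))   # one edge per predecessor
inputs = [torch.randn(8, 128, 16) for _ in range(5)]   # five (N, C, L) = (8, 128, 16) features
gated = [edge(s) for edge, s in zip(edges, inputs)]
```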