Recurrently Controlled Recurrent Networks

Yi Tay1, Luu Anh Tuan2, and Siu Cheung Hui3
1,3 Nanyang Technological University
2 Institute for Infocomm Research
ytay017@e.ntu.edu.sg, at.luu@i2r.a-star.edu.sg, asschui@ntu.edu.sg

Abstract

Recurrent neural networks (RNNs) such as long short-term memory and gated recurrent units are pivotal building blocks across a broad spectrum of sequence modeling problems. This paper proposes a recurrently controlled recurrent network (RCRN) for expressive and powerful sequence encoding. More concretely, the key idea behind our approach is to learn the recurrent gating functions using recurrent networks. Our architecture is split into two components, a controller cell and a listener cell, whereby the recurrent controller actively influences the compositionality of the listener cell. We conduct extensive experiments on a myriad of tasks in the NLP domain such as sentiment analysis (SST, IMDb, Amazon reviews, etc.), question classification (TREC), entailment classification (SNLI, SciTail), answer selection (WikiQA, TrecQA) and reading comprehension (NarrativeQA). Across all 26 datasets, our results demonstrate that RCRN not only consistently outperforms BiLSTMs but also stacked BiLSTMs, suggesting that our controller architecture might be a suitable replacement for the widely adopted stacked architecture.

1 Introduction

Recurrent neural networks (RNNs) live at the heart of many sequence modeling problems. In particular, the incorporation of gated additive recurrent connections is extremely powerful, leading to the pervasive adoption of models such as Gated Recurrent Units (GRU) [Cho et al., 2014] or Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] across many NLP applications [Bahdanau et al., 2014; Xiong et al., 2016; Rocktäschel et al., 2015; McCann et al., 2017]. In these models, the key idea is that the gating functions control information flow and compositionality over time, deciding how much information to read/write across time steps. This not only serves as a protection against vanishing/exploding gradients but also makes it easier to model long-range dependencies.

There are two common ways to increase the representation capability of RNNs. Firstly, the number of hidden dimensions could be increased. Secondly, recurrent layers could be stacked on top of each other in a hierarchical fashion [El Hihi and Bengio, 1996], with each layer's input being the output of the previous, enabling hierarchical features to be captured. Notably, the wide adoption of stacked architectures across many applications [Graves et al., 2013; Sutskever et al., 2014; Wang et al., 2017; Nie and Bansal, 2017] signifies the need for designing complex and expressive encoders. Unfortunately, these strategies may face limitations. For example, the former runs the risk of overfitting and/or hitting a performance ceiling. The latter, on the other hand, faces the inherent difficulties of going deep, such as vanishing gradients or difficulty in feature propagation across deep RNN layers [Zhang et al., 2016b].

This paper proposes Recurrently Controlled Recurrent Networks (RCRN), a new recurrent architecture and a general-purpose neural building block for sequence modeling. RCRN is characterized by its use of two key components: a recurrent controller cell and a listener cell.
The controller cell controls the information flow and compositionality of the listener RNN. The key motivation behind RCRN is to provide expressive and powerful sequence encoding. However, unlike stacked architectures, all RNN layers operate jointly on the same hierarchical level, effectively avoiding the need to go deeper. Therefore, RCRN provides a new alternative way of utilizing multiple RNN layers in conjunction by allowing one RNN to control another RNN. As such, our key aim in this work is to show that our proposed controller-listener architecture is a viable replacement for the widely adopted stacked recurrent architecture.

To demonstrate the effectiveness of our proposed RCRN model, we conduct extensive experiments on a plethora of diverse NLP tasks where sequence encoders such as LSTMs/GRUs are highly essential. These tasks include sentiment analysis (SST, IMDb, Amazon Reviews), question classification (TREC), entailment classification (SNLI, SciTail), answer selection (WikiQA, TrecQA) and reading comprehension (NarrativeQA). Experimental results show that RCRN outperforms BiLSTMs and multi-layered/stacked BiLSTMs on all 26 datasets, suggesting that RCRNs are viable replacements for the widely adopted stacked recurrent architectures. Additionally, RCRN achieves close to state-of-the-art performance on several datasets.

2 Related Work

RNN variants such as LSTMs and GRUs are ubiquitous and indispensable building blocks in many NLP applications such as question answering [Seo et al., 2016; Wang et al., 2017], machine translation [Bahdanau et al., 2014], entailment classification [Chen et al., 2017] and sentiment analysis [Longpre et al., 2016; Huang et al., 2017]. In recent years, many RNN variants have been proposed, ranging from multi-scale models [Koutnik et al., 2014; Chung et al., 2016; Chang et al., 2017] to tree-structured encoders [Tai et al., 2015; Choi et al., 2017]. Models targeted at improving the internals of the RNN cell have also been proposed [Xingjian et al., 2015; Danihelka et al., 2016]. Given the importance of sequence encoding in NLP, the design of effective RNN units for this purpose remains an active area of research.

Stacking RNN layers is the most common way to improve representation power. This has been used in many highly performant models ranging from speech recognition [Graves et al., 2013] to machine reading [Wang et al., 2017]. The BCN model [McCann et al., 2017] similarly uses multiple BiLSTM layers within its architecture. Models that use shortcut/residual connections in conjunction with stacked RNN layers are also notable [Zhang et al., 2016b; Longpre et al., 2016; Nie and Bansal, 2017; Ding et al., 2018].

Notably, a recent emerging trend is to model sequences without recurrence. This is primarily motivated by the fact that recurrence is an inherent prohibitor of parallelism. To this end, many works have explored the possibility of using attention as a replacement for recurrence. In particular, self-attention [Vaswani et al., 2017] has been a popular choice. This has sparked many innovations, including general-purpose encoders such as DiSAN [Shen et al., 2017] and Block Bi-DiSAN [Shen et al., 2018]. The key idea in these works is to use multi-headed self-attention and positional encodings to model temporal information. While attention-only models may come close in performance, some domains may still require complex and expressive recurrent encoders.
Moreover, we note that in [Shen et al., 2017, 2018], the scores on multiple benchmarks (e.g., SST, TREC, SNLI, MultiNLI) do not outperform (or even approach) the state-of-the-art, most of which are models that still heavily rely on bidirectional LSTMs [Zhou et al., 2016; Choi et al., 2017; McCann et al., 2017; Nie and Bansal, 2017]. While self-attentive RNN-less encoders have recently been popular, our work moves in an orthogonal and possibly complementary direction, advocating a stronger RNN unit for sequence encoding instead. Nevertheless, it is also worth noting that our RCRN model outperforms DiSAN in all our experiments.

Another line of work is also concerned with eliminating recurrence. Simple Recurrent Units (SRU) [Lei and Zhang, 2017] are recently proposed networks that remove the sequential dependencies in RNNs. SRUs can be considered a special case of Quasi-RNNs [Bradbury et al., 2016], which perform incremental pooling using pre-learned convolutional gates. A recent work, Multi-range Reasoning Units (MRU) [Tay et al., 2018b], follows the same paradigm, trading convolutional gates for features learned via expressive multi-granular reasoning. Zhang et al. [2018] proposed sentence-state LSTMs (S-LSTM) that exchange incremental reading for a single global state.

Our work proposes a new way of enhancing the representation capability of RNNs without going deep. For the first time, we propose a controller-listener architecture that uses one recurrent unit to control another recurrent unit. Our proposed RCRN consistently outperforms stacked BiLSTMs and achieves state-of-the-art results on several datasets. We outperform the above-mentioned competitors such as DiSAN, SRUs, stacked BiLSTMs and sentence-state LSTMs.

3 Recurrently Controlled Recurrent Networks (RCRN)

This section formally introduces the RCRN architecture. Our model is split into two main components: a controller cell and a listener cell. Figure 1 illustrates the model architecture.

Figure 1: High-level overview of our proposed RCRN architecture (controller cell and listener cell).

3.1 Controller Cell

The goal of the controller cell is to learn gating functions in order to influence the target cell. In order to control the target cell, the controller cell constructs a forget gate and an output gate which are then used to influence the information flow of the listener cell. For each gate (output and forget), we use a separate RNN cell. As such, the controller cell comprises two cell states and an additional set of parameters. The equations of the controller cell are defined as follows:

$$i^1_t = \sigma_s(W^1_i x_t + U^1_i h^1_{t-1} + b^1_i) \quad \text{and} \quad i^2_t = \sigma_s(W^2_i x_t + U^2_i h^2_{t-1} + b^2_i) \qquad (1)$$
$$f^1_t = \sigma_s(W^1_f x_t + U^1_f h^1_{t-1} + b^1_f) \quad \text{and} \quad f^2_t = \sigma_s(W^2_f x_t + U^2_f h^2_{t-1} + b^2_f) \qquad (2)$$
$$o^1_t = \sigma_s(W^1_o x_t + U^1_o h^1_{t-1} + b^1_o) \quad \text{and} \quad o^2_t = \sigma_s(W^2_o x_t + U^2_o h^2_{t-1} + b^2_o) \qquad (3)$$
$$c^1_t = f^1_t \odot c^1_{t-1} + i^1_t \odot \sigma(W^1_c x_t + U^1_c h^1_{t-1} + b^1_c) \qquad (4)$$
$$c^2_t = f^2_t \odot c^2_{t-1} + i^2_t \odot \sigma(W^2_c x_t + U^2_c h^2_{t-1} + b^2_c) \qquad (5)$$
$$h^1_t = o^1_t \odot \sigma(c^1_t) \quad \text{and} \quad h^2_t = o^2_t \odot \sigma(c^2_t) \qquad (6)$$

where $x_t$ is the input to the model at time step $t$. $W^k_*, U^k_*, b^k_*$ are the parameters of the model where $k \in \{1, 2\}$ and $* \in \{i, f, o, c\}$. $\sigma_s$ is the sigmoid function and $\sigma$ is the tanh nonlinearity. $\odot$ is the Hadamard product. The controller RNN has two cell states denoted as $c^1$ and $c^2$ respectively. $h^1_t, h^2_t$ are the outputs of the unidirectional controller cell at time step $t$.
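To make the controller concrete, the following is a minimal sketch of a single unidirectional controller step, assuming a PyTorch implementation; the class and variable names are illustrative and not taken from the authors' released code. The two gate streams are simply two LSTM cells run in parallel over the same input.

```python
# Minimal sketch of one unidirectional controller step (Equations 1-6), assuming PyTorch.
# Names are illustrative; this is not the authors' released implementation.
import torch
import torch.nn as nn

class ControllerCell(nn.Module):
    """Two parallel LSTM-style cells; their hidden outputs h1_t and h2_t later act as
    the forget and output gates of the listener cell."""
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # One LSTMCell per stream (k = 1, 2); each holds its own W_*, U_*, b_* parameters.
        self.cell1 = nn.LSTMCell(input_dim, hidden_dim)
        self.cell2 = nn.LSTMCell(input_dim, hidden_dim)

    def forward(self, x_t, state1, state2):
        # Each call realizes Equations (1)-(6) for its own stream.
        h1_t, c1_t = self.cell1(x_t, state1)
        h2_t, c2_t = self.cell2(x_t, state2)
        return (h1_t, c1_t), (h2_t, c2_t)
```

Running two independent LSTM cells over the same input is all the controller does; what distinguishes RCRN is how these outputs are consumed by the listener, described next.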
Next, we consider a bidirectional adaptation of the controller cell. Let Equations (1)-(6) be represented by the function $CT(\cdot)$; the bidirectional adaptation is then:

$$\overrightarrow{h^1_t}, \overrightarrow{h^2_t} = \overrightarrow{CT}(h^1_{t-1}, h^2_{t-1}, x_t), \quad t = 1, \ldots, \ell \qquad (7)$$
$$\overleftarrow{h^1_t}, \overleftarrow{h^2_t} = \overleftarrow{CT}(h^1_{t+1}, h^2_{t+1}, x_t), \quad t = \ell, \ldots, 1 \qquad (8)$$
$$h^1_t = [\overrightarrow{h^1_t}; \overleftarrow{h^1_t}] \quad \text{and} \quad h^2_t = [\overrightarrow{h^2_t}; \overleftarrow{h^2_t}] \qquad (9)$$

The outputs of the bidirectional controller cell are $h^1_t, h^2_t$ for time step $t$. These hidden outputs act as gates for the listener cell.

3.2 Listener Cell

The listener cell is another recurrent cell. The final output of the RCRN is generated by the listener cell, which is influenced by the controller cell. First, the listener cell uses a base recurrent model to process the sequence input. The equations of this base recurrent model are defined as follows:

$$i^3_t = \sigma_s(W^3_i x_t + U^3_i h^3_{t-1} + b^3_i) \qquad (10)$$
$$f^3_t = \sigma_s(W^3_f x_t + U^3_f h^3_{t-1} + b^3_f) \qquad (11)$$
$$o^3_t = \sigma_s(W^3_o x_t + U^3_o h^3_{t-1} + b^3_o) \qquad (12)$$
$$c^3_t = f^3_t \odot c^3_{t-1} + i^3_t \odot \sigma(W^3_c x_t + U^3_c h^3_{t-1} + b^3_c) \qquad (13)$$
$$h^3_t = o^3_t \odot \sigma(c^3_t) \qquad (14)$$

Similarly, a bidirectional adaptation is used, obtaining $h^3_t = [\overrightarrow{h^3_t}; \overleftarrow{h^3_t}]$. Next, using $h^1_t, h^2_t$ (the outputs of the controller cell), we define another recurrent operation as follows:

$$c^4_t = \sigma_s(h^1_t) \odot c^4_{t-1} + (1 - \sigma_s(h^1_t)) \odot h^3_t \qquad (15)$$
$$h^4_t = \sigma_s(h^2_t) \odot c^4_t \qquad (16)$$

where $c^j_t, h^j_t$ with $j \in \{3, 4\}$ are the cell and hidden states at time step $t$. $W^3_*, U^3_*, b^3_*$ are the parameters of the listener cell where $* \in \{i, f, o, c\}$. Note that $h^1_t$ and $h^2_t$ are the outputs of the controller cell. In this formulation, $\sigma_s(h^1_t)$ acts as the forget gate for the listener cell. Likewise, $\sigma_s(h^2_t)$ acts as the output gate for the listener.

3.3 Overall RCRN Architecture, Variants and Implementation

Intuitively, the overall architecture of the RCRN model can be explained as follows. Firstly, the controller cell can be thought of as two BiRNN models whose hidden states are used as the forget and output gates for another recurrent model, i.e., the listener. The listener uses a single BiRNN model for sequence encoding and then allows this representation to be altered by listening to the controller. An alternative interpretation of our model architecture is that it is essentially a recurrent-over-recurrent model. Clearly, the formulation above uses BiLSTMs as the atomic building block for RCRN. Hence, we note that it is also possible to have a simplified variant [1] of RCRN that uses GRUs as the atomic block, which we found to perform slightly better on certain datasets.

CUDA-level Optimization. For efficiency purposes, we use the cuDNN optimized version of the base recurrent unit (LSTMs/GRUs). Additionally, note that the final recurrent cell (Equation (15)) can be subject to CUDA-level optimization [2] following Simple Recurrent Units (SRU) [Lei and Zhang, 2017]. The key idea is that this operation can be performed along the dimension axis, enabling greater parallelization on the GPU. For the sake of brevity, we refer interested readers to [Lei and Zhang, 2017]. Note that this form of CUDA-level optimization was also performed in the Quasi-RNN model [Bradbury et al., 2016], which effectively subsumes the SRU model.

On Parameter Cost and Memory Efficiency. Note that a single RCRN model is equivalent in parameter cost to a stacked BiLSTM of 3 layers. This is clear when we consider how two controller BiRNNs are used to control a single listener BiRNN. As such, for our experiments, when considering only the encoder and keeping all other components constant, 3L-BiLSTM has equal parameters to RCRN, while RCRN and 3L-BiLSTM are approximately three times larger than BiLSTM.
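Putting the pieces together, the following is a minimal end-to-end sketch of a unidirectional, batch-first RCRN layer, again assuming PyTorch; it is an illustrative reconstruction of Equations (7)-(16), not the authors' TensorFlow/CUDA implementation. The bidirectional version used in the paper concatenates forward and backward outputs of each stream (Equation 9) before applying the same element-wise gating.

```python
# Minimal end-to-end sketch of a unidirectional, batch-first RCRN layer, assuming PyTorch.
# Illustrative reconstruction of Equations (7)-(16); not the authors' TensorFlow/CUDA code.
import torch
import torch.nn as nn

class RCRN(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # Controller: two RNN streams whose outputs h1, h2 become the listener's gates.
        self.controller_forget = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.controller_output = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        # Listener: a third RNN that encodes the sequence (Equations 10-14).
        self.listener = nn.LSTM(input_dim, hidden_dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, input_dim)
        h1, _ = self.controller_forget(x)   # controls the listener's forget path
        h2, _ = self.controller_output(x)   # controls the listener's output path
        h3, _ = self.listener(x)            # base listener encoding

        f_gate = torch.sigmoid(h1)          # sigma_s(h1_t) in Equation (15)
        o_gate = torch.sigmoid(h2)          # sigma_s(h2_t) in Equation (16)

        # Element-wise recurrence of Equations (15)-(16). Only this loop is sequential
        # and it involves no matrix multiplications, which is what makes the SRU-style
        # CUDA-level optimization applicable. Initial cell state c4_0 = 0 is an assumption.
        c4 = torch.zeros_like(h3[:, 0])
        outputs = []
        for t in range(x.size(1)):
            c4 = f_gate[:, t] * c4 + (1.0 - f_gate[:, t]) * h3[:, t]
            outputs.append(o_gate[:, t] * c4)              # h4_t
        return torch.stack(outputs, dim=1)                 # (batch, time, hidden_dim)

# Toy usage: encode a batch of 2 sequences of length 5 with 16-dimensional inputs.
encoder = RCRN(input_dim=16, hidden_dim=32)
h4 = encoder(torch.randn(2, 5, 16))        # -> shape (2, 5, 32)
```

Note that all three underlying RNNs can run in parallel over the sequence (e.g., with cuDNN kernels); only the lightweight element-wise loop at the end is inherently sequential.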
4 Experiments

This section discusses the overall empirical evaluation of our proposed RCRN model.

[1] We omit technical descriptions due to the lack of space.
[2] We adapt the CUDA kernel as a custom TensorFlow op in our experiments. While the authors of SRU release their CUDA op at https://github.com/taolei87/sru, we use a third-party open-source TensorFlow version which can be found at https://github.com/JonathanRaiman/tensorflow_qrnn.git.

4.1 Tasks and Datasets

In order to verify the effectiveness of our proposed RCRN architecture, we conduct extensive experiments across several tasks [3] in the NLP domain.

Sentiment Analysis. Sentiment analysis is a text classification problem in which the goal is to determine the polarity of a given sentence/document. We conduct experiments at both the sentence and document level. More concretely, we use 16 Amazon review datasets from [Liu et al., 2017], the well-established Stanford Sentiment Treebank (SST-5/SST-2) [Socher et al., 2013] and the IMDb sentiment dataset [Maas et al., 2011]. All tasks are binary classification tasks with the exception of SST-5. The metric is the accuracy score.

Question Classification. The goal of this task is to classify questions into fine-grained categories such as number or location. We use the TREC question classification dataset [Voorhees et al., 1999]. The metric is the accuracy score.

Entailment Classification. This is a well-established and popular task in the field of natural language understanding and inference. Given two sentences s1 and s2, the goal is to determine if s2 entails or contradicts s1. We use two popular benchmark datasets, i.e., the Stanford Natural Language Inference (SNLI) corpus [Bowman et al., 2015] and the SciTail (Science Entailment) dataset [Khot et al., 2018]. This is a pairwise classification problem in which the metric is also the accuracy score.

Answer Selection. This is a standard problem in information retrieval and learning-to-rank. Given a question, the task at hand is to rank candidate answers. We use the popular WikiQA [Yang et al., 2015] and TrecQA [Wang et al., 2007] datasets. For TrecQA, we use the cleaned setting as denoted by Rao et al. [2016]. The evaluation metrics are the Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) ranking metrics.

Reading Comprehension. This task involves reading documents and answering questions about these documents. We use the recent NarrativeQA [Kočiský et al., 2017] dataset, which involves reasoning and answering questions over story summaries. We follow the original paper and report scores on BLEU-1, BLEU-4, Meteor and Rouge-L.

4.2 Task-Specific Model Architectures and Implementation Details

In this section, we describe the task-specific model architectures for each task.

Classification Model. This architecture is used for all text classification tasks (sentiment analysis and question classification datasets). We use 300D GloVe [Pennington et al., 2014] vectors with 600D CoVe [McCann et al., 2017] vectors as pretrained embedding vectors. An optional character-level word representation is also added (constructed with a standard BiGRU model). The output of the embedding layer is passed into the RCRN model directly without using any projection layer. Word embeddings are not updated during training. Given the hidden output states of the 200-dimensional RCRN cell, we take the concatenation of the max, mean and min pooling of all hidden states to form the final feature vector, as sketched below.
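As a small illustration of this pooled representation, the snippet below (PyTorch, with illustrative names) concatenates max, mean and min pooling over the time axis of the RCRN hidden states:

```python
# Sketch of the pooled feature vector used by the classification model: the concatenation
# of max, mean and min pooling over the RCRN hidden states. Assumes PyTorch tensors.
import torch

def multi_pool(hidden_states: torch.Tensor) -> torch.Tensor:
    """hidden_states: (batch, time, dim) -> pooled features of shape (batch, 3 * dim)."""
    max_pool = hidden_states.max(dim=1).values
    mean_pool = hidden_states.mean(dim=1)
    min_pool = hidden_states.min(dim=1).values
    return torch.cat([max_pool, mean_pool, min_pool], dim=-1)

# e.g. features = multi_pool(h4), where h4 are the RCRN hidden outputs.
```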
This feature vector is passed into a single dense layer of 200 dimensions with ReLU activations. The output of this layer is then passed into a softmax layer for classification. This model optimizes the cross entropy loss. We train this model using Adam [Kingma and Ba, 2014] and the learning rate is tuned amongst {0.001, 0.0003, 0.0004}.

Entailment Model. This architecture is used for entailment tasks. This is a pairwise classification model with two input sequences. Similar to the singleton classification model, we utilize the identical input encoder (GloVe, CoVe and character RNN) but include an additional part-of-speech (POS tag) embedding. We pass the input representation into a two-layer highway network [Srivastava et al., 2015] of 300 hidden dimensions before passing it into the RCRN encoder. The feature representation of s1 and s2 is the concatenation of the max and mean pooling of the RCRN hidden outputs. To compare s1 and s2, we pass [s1; s2; s1 - s2; s1 ⊙ s2] into a two-layer highway network. This output is then passed into a softmax layer for classification. We train this model using Adam and the learning rate is tuned amongst {0.001, 0.0003, 0.0004}. We mainly focus on the encoder-only setting which does not allow cross sentence attention. This is a commonly tested setting on the SNLI dataset.

[3] While we agree that other tasks such as language modeling or NMT would be interesting to investigate, we could not muster enough GPU resources to conduct any extra experiments. We leave this for future work.

Ranking Model. This architecture is used for the ranking tasks (i.e., answer selection). We use the model architecture from Attentive Pooling BiLSTMs (AP-BiLSTM) [dos Santos et al., 2016] as our base and swap the RNN encoder with our RCRN encoder. The dimensionality is set to 200. The similarity scoring function is the cosine similarity and the objective function is the pairwise hinge loss with a margin of 0.1. We use negative sampling of n = 6 to train our model. We train our model using Adadelta [Zeiler, 2012] with a learning rate of 0.2.

Reading Comprehension Model. We use R-NET [Wang et al., 2017] as the base model. Since R-NET uses three bidirectional GRU layers as the encoder, we replace this stacked BiGRU layer with RCRN. For fairness, we use the GRU variant of RCRN instead. The dimensionality of the encoder is set to 75. We train both models using Adam with a learning rate of 0.001.

For all datasets, we include additional ablative baselines, swapping the RCRN with (1) a standard BiLSTM model and (2) a stacked BiLSTM of 3 layers (3L-BiLSTM). This is to fairly observe the impact of different encoder models based on the same overall model framework.

4.3 Overall Results

This section discusses the overall results of our experiments.

| Dataset | BiLSTM | 2L-BiLSTM | SLSTM | BiLSTM | 3L-BiLSTM | RCRN |
|---|---|---|---|---|---|---|
| Camera | 87.1 | 88.1 | 90.0 | 87.3 | 89.7 | 90.5 |
| Video | 84.7 | 85.2 | 86.8 | 87.5 | 87.8 | 88.5 |
| Health | 85.5 | 85.9 | 86.5 | 85.5 | 89.0 | 90.5 |
| Music | 78.7 | 80.5 | 82.0 | 83.5 | 85.7 | 86.0 |
| Kitchen | 82.2 | 83.8 | 84.5 | 81.7 | 84.5 | 86.0 |
| DVD | 83.7 | 84.8 | 85.5 | 84.0 | 86.0 | 86.8 |
| Toys | 85.7 | 85.8 | 85.3 | 87.5 | 90.5 | 90.8 |
| Baby | 84.5 | 85.5 | 86.3 | 85.0 | 88.5 | 89.0 |
| Books | 82.1 | 82.8 | 83.4 | 86.0 | 87.2 | 88.0 |
| IMDB | 86.0 | 86.6 | 87.2 | 86.5 | 88.0 | 89.8 |
| MR | 75.7 | 76.0 | 76.2 | 77.7 | 77.7 | 79.0 |
| Apparel | 86.1 | 86.4 | 85.8 | 88.0 | 89.2 | 90.5 |
| Magazines | 92.6 | 92.9 | 93.8 | 93.7 | 92.5 | 94.8 |
| Electronics | 82.5 | 82.3 | 83.3 | 83.5 | 87.0 | 89.0 |
| Sports | 84.0 | 84.8 | 85.8 | 85.5 | 86.5 | 88.0 |
| Software | 86.7 | 87.0 | 87.8 | 88.5 | 90.3 | 90.8 |
| Macro Avg | 84.3 | 84.9 | 85.6 | 85.7 | 87.5 | 88.6 |

Table 1: Results on the Amazon Reviews dataset. The last three columns (BiLSTM, 3L-BiLSTM, RCRN) are models implemented by us.
| Model/Reference | Acc |
|---|---|
| MVN [Guo et al., 2017] | 51.5 |
| DiSAN [Shen et al., 2017] | 51.7 |
| DMN [Kumar et al., 2016] | 52.1 |
| LSTM-CNN [Zhou et al., 2016] | 52.4 |
| NTI [Yu and Munkhdalai, 2017] | 53.1 |
| BCN [McCann et al., 2017] | 53.7 |
| BCN + ELMo [Peters et al., 2018] | 54.7 |
| BiLSTM | 51.3 |
| 3L-BiLSTM | 52.6 |
| RCRN | 54.3 |

Table 2: Results on Sentiment Analysis on SST-5.

| Model/Reference | Acc |
|---|---|
| P-LSTM [Wieting et al., 2015] | 89.2 |
| CT-LSTM [Looks et al., 2017] | 89.4 |
| TE-LSTM [Huang et al., 2017] | 89.6 |
| NSE [Munkhdalai and Yu, 2016] | 89.7 |
| BCN [McCann et al., 2017] | 90.3 |
| BMLSTM [Radford et al., 2017] | 91.8 |
| BiLSTM | 89.7 |
| 3L-BiLSTM | 90.0 |
| RCRN | 90.6 |

Table 3: Results on Sentiment Analysis on SST-2.

| Model/Reference | Acc |
|---|---|
| Res. BiLSTM [Longpre et al., 2016] | 90.1 |
| 4L-QRNN [Bradbury et al., 2016] | 91.4 |
| BCN [McCann et al., 2017] | 91.8 |
| oh-LSTM [Johnson and Zhang, 2016] | 91.9 |
| TRNN [Dieng et al., 2016] | 93.8 |
| Virtual [Miyato et al., 2016] | 94.1 |
| BiLSTM | 90.9 |
| 3L-BiLSTM | 91.8 |
| RCRN | 92.8 |

Table 4: Results on IMDb binary sentiment classification.

| Model/Reference | Acc |
|---|---|
| CNN-MC [Kim, 2014] | 92.2 |
| SRU [Lei and Zhang, 2017] | 93.9 |
| DSCNN [Zhang et al., 2016a] | 95.4 |
| DC-BiLSTM [Ding et al., 2018] | 95.6 |
| BCN [McCann et al., 2017] | 95.8 |
| LSTM-CNN [Zhou et al., 2016] | 96.1 |
| BiLSTM | 95.8 |
| 3L-BiLSTM | 95.4 |
| RCRN | 96.2 |

Table 5: Results on TREC question classification.

| Model/Reference | Acc |
|---|---|
| Multi-head [Vaswani et al., 2017] | 84.2 |
| Att. Bi-SRU [Lei and Zhang, 2017] | 84.8 |
| DiSAN [Shen et al., 2017] | 85.6 |
| Shortcut [Nie and Bansal, 2017] | 85.7 |
| Gumbel LSTM [Choi et al., 2017] | 86.0 |
| Dynamic Meta-Emb [Kiela et al., 2018] | 86.7 |
| BiLSTM | 85.5 |
| 3L-BiLSTM | 85.1 |
| RCRN | 85.8 |

Table 6: Results on SNLI dataset.

| Model/Reference | Acc |
|---|---|
| ESIM [Chen et al., 2017] | 70.6 |
| DecompAtt [Parikh et al., 2016] | 72.3 |
| DGEM [Khot et al., 2018] | 77.3 |
| CAFE [Tay et al., 2017] | 83.3 |
| CSRAN [Tay et al., 2018a] | 86.7 |
| OpenAI GPT [Radford et al., 2018] | 88.3 |
| BiLSTM | 80.1 |
| 3L-BiLSTM | 79.6 |
| RCRN | 81.1 |

Table 7: Results on SciTail dataset.

| Model | WikiQA MAP | WikiQA MRR | TrecQA MAP | TrecQA MRR |
|---|---|---|---|---|
| BiLSTM | 68.5 | 69.8 | 72.4 | 82.5 |
| 3L-BiLSTM | 69.3 | 71.3 | 73.0 | 83.6 |
| RCRN | 71.1 | 72.3 | 75.4 | 85.5 |
| AP-BiLSTM | 63.9 | 69.9 | 75.1 | 80.0 |
| AP-3L-BiLSTM | 69.8 | 71.3 | 73.3 | 83.4 |
| AP-RCRN | 72.4 | 73.7 | 77.9 | 88.2 |

Table 8: Results on Answer Retrieval (WikiQA and TrecQA).

| Model | BLEU-1 | BLEU-4 | Meteor | Rouge-L |
|---|---|---|---|---|
| Seq2Seq | 16.1 | 1.40 | 4.2 | 13.3 |
| ASR | 23.5 | 5.90 | 8.0 | 23.3 |
| BiDAF | 33.7 | 15.5 | 15.4 | 36.3 |
| R-NET | 34.9 | 20.3 | 18.0 | 36.7 |
| RCRN | 38.1 | 21.8 | 18.1 | 38.3 |

Table 9: Results on Reading Comprehension (NarrativeQA).

Sentiment Analysis. On the 16 review datasets (Table 1) from [Liu et al., 2017; Zhang et al., 2018], our proposed RCRN architecture achieves the highest score on all 16 datasets, outperforming the existing state-of-the-art model, sentence-state LSTMs (SLSTM) [Zhang et al., 2018]. The macro average performance gain over BiLSTMs (+4%) and stacked BiLSTMs (2L-BiLSTM) (+3.4%) is also notable. On the same architecture, our RCRN outperforms the ablative baselines BiLSTM by +2.9% and 3L-BiLSTM by +1.1% on average across the 16 datasets.

Results on SST-5 (Table 2) and SST-2 (Table 3) are also promising. More concretely, our RCRN architecture achieves state-of-the-art results on SST-5 and SST-2. RCRN also outperforms many strong baselines such as DiSAN [Shen et al., 2017], a self-attentive model, and the Biattentive Classification Network (BCN) [McCann et al., 2017], which also uses CoVe vectors. On SST-2, strong baselines such as Neural Semantic Encoders [Munkhdalai and Yu, 2016] and, similarly, the BCN model are also outperformed by our RCRN model. Finally, on the IMDb sentiment classification dataset (Table 4), RCRN achieved 92.8% accuracy.
Our proposed RCRN outperforms Residual BiLSTMs [Longpre et al., 2016], 4-layered Quasi-Recurrent Neural Networks (QRNN) [Bradbury et al., 2016] and the BCN model, which can be considered very competitive baselines. RCRN also outperforms the ablative baselines BiLSTM (+1.9%) and 3L-BiLSTM (+1%).

Question Classification. Our results on the TREC question classification dataset (Table 5) are also promising. RCRN achieved a state-of-the-art score of 96.2% on this dataset. A notable baseline is the Densely Connected BiLSTM [Ding et al., 2018], a deep residual stacked BiLSTM model, which RCRN outperforms (+0.6%). Our model also outperforms BCN (+0.4%) and SRU (+2.3%). Our ablative BiLSTM baselines achieve reasonably high scores, possibly due to the CoVe embeddings. However, RCRN further increases the performance score.

Entailment Classification. Results on entailment classification are also encouraging. On SNLI (Table 6), RCRN achieves 85.8% accuracy, which is competitive with the Gumbel LSTM. However, RCRN outperforms a wide range of baselines, including self-attention based models such as multi-head attention [Vaswani et al., 2017] and DiSAN [Shen et al., 2017]. There is also a performance gain of +1% over Bi-SRU even though our model does not use attention at all. RCRN also outperforms shortcut stacked encoders, which use a series of BiLSTMs connected by shortcut layers. Post review, as per reviewer request, we experimented with adding cross sentence attention, in particular adding the attention of Parikh et al. [2016] on top of 3L-BiLSTM and RCRN. We found that they performed comparably (both at 87.0). We did not have the resources to experiment further, even though intuitively incorporating different/newer variants of attention [Kim et al., 2018; Tay et al., 2018a; Chen et al., 2017] and/or ELMo [Peters et al., 2018] would likely raise the score further. However, we hypothesize that cross sentence attention forces less reliance on the encoder, which is why stacked BiLSTMs and RCRNs perform similarly in that setting.

The results on SciTail similarly show that RCRN is more effective than BiLSTM (+1%). Moreover, RCRN outperforms several baselines in [Khot et al., 2018], including models that use cross sentence attention such as DecompAtt [Parikh et al., 2016] and ESIM [Chen et al., 2017]. However, it still falls short of recent state-of-the-art models such as OpenAI's Generative Pre-trained Transformer [Radford et al., 2018].

Answer Selection. Results on the answer selection task (Table 8) show that RCRN leads to considerable improvements on both the WikiQA and TrecQA datasets. We investigate two settings. In the first, we reimplement AP-BiLSTM and swap the BiLSTM for RCRN encoders. In the second, we completely remove all attention layers from both models to test the ability of the standalone encoder. Without attention, RCRN gives an improvement of +2% on both datasets. With attentive pooling, RCRN maintains a +2% improvement in terms of MAP score. However, the gains on MRR are greater (+4-7%). Notably, the AP-RCRN model outperforms the official results reported in [dos Santos et al., 2016]. Overall, we observe that RCRN is much stronger than BiLSTMs and 3L-BiLSTMs on this task.

Reading Comprehension. Results (Table 9) show that enhancing R-NET with RCRN leads to considerable improvements of 1%-2% on all four metrics. Note that our model only uses a single-layered RCRN while R-NET uses 3-layered BiGRUs. This empirical evidence might suggest that RCRN is a better way to utilize multiple recurrent layers.
Overall Results. Across all 26 datasets, RCRN outperforms not only standard BiLSTMs but also 3L-BiLSTMs, which have approximately equal parameterization. 3L-BiLSTMs were overall better than BiLSTMs but lost out on a minority of datasets. RCRN outperforms a wide range of competitive baselines such as DiSAN, Bi-SRUs, BCN and LSTM-CNN. We achieve (close to) state-of-the-art performance on SST, TREC question classification and the 16 Amazon review datasets.

4.4 Runtime Analysis

This section benchmarks model performance with respect to model efficiency. To do so, we benchmark RCRN along with BiLSTMs and 3-layered BiLSTMs (with and without cuDNN optimization) on different sequence lengths (i.e., 16, 32, 64, 128, 256), using the IMDb sentiment task. We use the same standard hardware (a single Nvidia GTX 1070 card) and an identical overarching model architecture. The dimensionality of the model is set to 200 with a fixed batch size of 32. Finally, we also benchmark a CUDA optimized adaptation of RCRN which has been described earlier (Section 3.3).

Table 10 reports the training/inference times of all benchmarked models. The fastest model is naturally the 1-layer BiLSTM (cuDNN). Intuitively, the speed of RCRN should be roughly equivalent to using 3 BiLSTMs. Surprisingly, we found that the CUDA optimized RCRN performs consistently slightly faster than the 3-layer BiLSTM (cuDNN). At the very least, RCRN provides comparable efficiency to using stacked BiLSTMs, and empirically we show that there is nothing to lose in this aspect. However, we note that CUDA-level optimizations have to be performed. Finally, the non-cuDNN optimized BiLSTM and stacked BiLSTMs are also provided for reference.

| Model | Train 16 | Train 32 | Train 64 | Train 128 | Train 256 | Infer 16 | Infer 32 | Infer 64 | Infer 128 | Infer 256 |
|---|---|---|---|---|---|---|---|---|---|---|
| 3-layer BiLSTM | 29 | 50 | 113 | 244 | 503 | 12 | 20 | 38 | 72 | 150 |
| BiLSTM | 18 | 30 | 63 | 131 | 272 | 9 | 15 | 28 | 52 | 104 |
| 1-layer BiLSTM (cuDNN) | 5 | 6 | 9 | 14 | 26 | 2 | 3 | 4 | 6 | 10 |
| 3-layer BiLSTM (cuDNN) | 10 | 14 | 23 | 42 | 80 | 4 | 5 | 9 | 16 | 32 |
| RCRN (cuDNN) | 19 | 29 | 53 | 101 | 219 | 8 | 12 | 23 | 41 | 78 |
| RCRN (cuDNN + CUDA optimized) | 10 | 13 | 21 | 40 | 78 | 4 | 5 | 8 | 15 | 29 |

Table 10: Training and inference times (seconds/epoch) on the IMDb binary sentiment classification task with varying sequence lengths (16, 32, 64, 128, 256).

5 Conclusion and Future Directions

We proposed Recurrently Controlled Recurrent Networks (RCRN), a new recurrent architecture and encoder for a myriad of NLP tasks. RCRN operates in a novel controller-listener architecture which uses RNNs to learn the gating functions of another RNN. We apply RCRN to a potpourri of NLP tasks and achieve promising/highly competitive results on all tasks and 26 benchmark datasets. Overall findings suggest that our controller-listener architecture is more effective than stacking RNN layers. Moreover, RCRN remains equally (or slightly more) efficient compared to stacked RNNs of approximately equal parameterization. There are several potentially interesting directions for further investigating RCRNs: firstly, investigating RCRNs controlling other RCRNs, and secondly, investigating RCRNs in other domains where recurrent models are also prevalent for sequence modeling. The source code of our model can be found at https://github.com/vanzytay/NIPS2018_RCRN.

6 Acknowledgements

We thank the anonymous reviewers and area chair from NeurIPS 2018 for their constructive and high quality feedback.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 632-642, 2015.

James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-recurrent neural networks. CoRR, abs/1611.01576, 2016.

Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark A Hasegawa-Johnson, and Thomas S Huang. Dilated recurrent neural networks. In Advances in Neural Information Processing Systems, pages 76-86, 2017.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1657-1668, 2017.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

Jihun Choi, Kang Min Yoo, and Sang-goo Lee. Unsupervised learning of task-specific tree structures with tree-LSTMs. arXiv preprint arXiv:1707.02786, 2017.

Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.

Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. Associative long short-term memory. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1986-1994, 2016.

Adji B Dieng, Chong Wang, Jianfeng Gao, and John Paisley. TopicRNN: A recurrent neural network with long-range semantic dependency. arXiv preprint arXiv:1611.01702, 2016.

Zixiang Ding, Rui Xia, Jianfei Yu, Xiang Li, and Jian Yang. Densely connected bidirectional LSTM with applications to sentence classification. arXiv preprint arXiv:1802.00889, 2018.

Cícero Nogueira dos Santos, Ming Tan, Bing Xiang, and Bowen Zhou. Attentive pooling networks. CoRR, abs/1602.03609, 2016.

Salah El Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies. In Advances in Neural Information Processing Systems, pages 493-499, 1996.

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645-6649. IEEE, 2013.

Hongyu Guo, Colin Cherry, and Jiang Su. End-to-end multi-view networks for text classification. arXiv preprint arXiv:1704.05907, 2017.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Minlie Huang, Qiao Qian, and Xiaoyan Zhu. Encoding syntactic knowledge in neural networks for sentiment classification. ACM Transactions on Information Systems (TOIS), 35(3):26, 2017.

Rie Johnson and Tong Zhang. Supervised and semi-supervised text categorization using LSTM for region embeddings. arXiv preprint arXiv:1602.02373, 2016.

Tushar Khot, Ashish Sabharwal, and Peter Clark. SciTail: A textual entailment dataset from science question answering. In AAAI, 2018.

Douwe Kiela, Changhan Wang, and Kyunghyun Cho. Context-attentive embeddings for improved sentence representations. arXiv preprint arXiv:1804.07983, 2018.
Seonhoon Kim, Jin-Hyuk Hong, Inho Kang, and Nojun Kwak. Semantic sentence matching with densely-connected recurrent and co-attentive information. arXiv preprint arXiv:1805.11360, 2018.

Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The NarrativeQA reading comprehension challenge. arXiv preprint arXiv:1712.07040, 2017.

Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. A clockwork RNN. arXiv preprint arXiv:1402.3511, 2014.

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pages 1378-1387, 2016.

Tao Lei and Yu Zhang. Training RNNs as fast as CNNs. arXiv preprint arXiv:1709.02755, 2017.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Adversarial multi-task learning for text classification. arXiv preprint arXiv:1704.05742, 2017.

Shayne Longpre, Sabeek Pradhan, Caiming Xiong, and Richard Socher. A way out of the odyssey: Analyzing and combining recent insights for LSTMs. arXiv preprint arXiv:1611.05104, 2016.

Moshe Looks, Marcello Herreshoff, DeLesley Hutchins, and Peter Norvig. Deep learning with dynamic computation graphs. arXiv preprint arXiv:1702.02181, 2017.

Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 142-150. Association for Computational Linguistics, 2011.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6297-6308, 2017.

Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725, 2016.

Tsendsuren Munkhdalai and Hong Yu. Neural semantic encoders. CoRR, abs/1607.04315, 2016.

Yixin Nie and Mohit Bansal. Shortcut-stacked sentence encoders for multi-domain inference. In Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP, RepEval@EMNLP 2017, Copenhagen, Denmark, September 8, 2017, pages 41-45, 2017.

Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2249-2255, 2016.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532-1543, 2014.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444, 2017.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.

Jinfeng Rao, Hua He, and Jimmy J. Lin. Noise-contrastive estimation for answer selection with deep neural networks. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016, pages 1913-1916, 2016.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664, 2015.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603, 2016.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. DiSAN: Directional self-attention network for RNN/CNN-free language understanding. arXiv preprint arXiv:1709.04696, 2017.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. Bi-directional block self-attention for fast and memory-efficient sequence modeling. arXiv preprint arXiv:1804.00857, 2018.

Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. Citeseer, 2013.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112, 2014.

Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.

Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. A compare-propagate architecture with alignment factorization for natural language inference. arXiv preprint arXiv:1801.00102, 2017.

Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. Co-stack residual affinity networks with multi-level attention refinement for matching text sequences. arXiv preprint arXiv:1810.02938, 2018a.

Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. Multi-range reasoning for machine comprehension. arXiv preprint arXiv:1803.09074, 2018b.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010, 2017.

Ellen M Voorhees et al. The TREC-8 question answering track report. In TREC, volume 99, pages 77-82, 1999.

Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. What is the Jeopardy model? A quasi-synchronous grammar for QA. In EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, pages 22-32, 2007.

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 189-198, 2017.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. Towards universal paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198, 2015.
SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802-810, 2015.

Caiming Xiong, Victor Zhong, and Richard Socher. Dynamic coattention networks for question answering. CoRR, abs/1611.01604, 2016.

Yi Yang, Wen-tau Yih, and Christopher Meek. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 2013-2018, 2015.

Hong Yu and Tsendsuren Munkhdalai. Neural tree indexers for text understanding. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers, pages 11-21, 2017.

Matthew D Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Rui Zhang, Honglak Lee, and Dragomir Radev. Dependency sensitive convolutional neural networks for modeling sentences and documents. arXiv preprint arXiv:1611.02361, 2016a.

Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yao, Sanjeev Khudanpur, and James Glass. Highway long short-term memory RNNs for distant speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 5755-5759. IEEE, 2016b.

Yue Zhang, Qi Liu, and Linfeng Song. Sentence-state LSTM for text representation. arXiv preprint arXiv:1805.02474, 2018.

Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. arXiv preprint arXiv:1611.06639, 2016.