The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

# Semantics-Aware BERT for Language Understanding

Zhuosheng Zhang,1,2,3 Yuwei Wu,1,2,3,4,* Hai Zhao,1,2,3 Zuchao Li,1,2,3 Shuailiang Zhang,1,2,3 Xi Zhou,5 Xiang Zhou5

1Department of Computer Science and Engineering, Shanghai Jiao Tong University; 2Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, China; 3MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China; 4College of Zhiyuan, Shanghai Jiao Tong University, China; 5CloudWalk Technology, Shanghai, China

{zhangzs, will8821}@sjtu.edu.cn, zhaohai@cs.sjtu.edu.cn

* These authors contribute equally. Corresponding author. This paper was partially supported by the National Key Research and Development Program of China (No. 2017YFB0304100) and Key Projects of the National Natural Science Foundation of China (U1836222 and 61733011).

## Abstract

The latest work on language representations carefully integrates contextualized features into language model training, which has enabled a series of successes, especially in various machine reading comprehension and natural language inference tasks. However, the existing language representation models, including ELMo, GPT and BERT, only exploit plain context-sensitive features such as character or word embeddings. They rarely consider incorporating structured semantic information, which can provide rich semantics for language representation. To promote natural language understanding, we propose to incorporate explicit contextual semantics from pre-trained semantic role labeling, and introduce an improved language representation model, Semantics-aware BERT (SemBERT), which is capable of explicitly absorbing contextual semantics over a BERT backbone. SemBERT keeps the convenient usability of its BERT precursor in a light fine-tuning way without substantial task-specific modifications. Compared with BERT, semantics-aware BERT is as simple in concept but more powerful. It obtains new state-of-the-art or substantially improved results on ten reading comprehension and language inference tasks.

## 1 Introduction

Recently, deep contextual language models (LMs) have been shown effective for learning universal language representations, achieving state-of-the-art results in a series of flagship natural language understanding (NLU) tasks. Some prominent examples are Embeddings from Language Models (ELMo) (Peters et al. 2018), the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al. 2018), Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al. 2018) and Generalized Autoregressive Pretraining (XLNet) (Yang et al. 2019). Providing fine-grained contextual embeddings, these pre-trained models can be either easily applied to downstream models as the encoder or used directly for fine-tuning.

Despite the success of those well pre-trained language models, we argue that current techniques which only focus on language modeling restrict the power of the pre-trained representations. The major limitation of existing language models lies in only taking plain contextual features for both the representation and the training objective, rarely considering explicit contextual semantic clues.
Even though well pre-trained language models can implicitly represent contextual semantics more or less (Clark et al. 2019), they can be further enhanced by incorporating external knowledge. To this end, there is a recent trend of incorporating extra knowledge into pre-trained language models (Zhang et al. 2020b).

A number of studies have found that deep learning models might not really understand natural language texts (Mudrakarta et al. 2018) and are vulnerable to adversarial attacks (Jia and Liang 2017). According to these observations, deep learning models pay great attention to non-significant words and ignore important ones. For the attractive question answering challenge (Rajpurkar et al. 2016), we observe that a number of answers produced by previous models are semantically incomplete (as shown in Section 6.3), which suggests that current NLU models suffer from insufficient contextual semantic representation and learning.

Actually, NLU tasks share a similar purpose with sentence-level contextual semantic analysis. Briefly, semantic role labeling (SRL) over a sentence is to discover who did what to whom, when and why with respect to the central meaning of the sentence, which naturally matches the task target of NLU. For example, in question answering tasks, questions are usually formed with who, what, how, when and why, which can be conveniently formalized into the predicate-argument relationship in terms of contextual semantics.

In human language, a sentence usually involves various predicate-argument structures, while neural models encode the sentence into an embedding representation with little consideration of the modeling of multiple semantic structures. Thus we are motivated to enrich the sentence contextual semantics with multiple predicate-specific argument sequences by presenting SemBERT: Semantics-aware BERT, which is a fine-tuned BERT with explicit contextual semantic clues. The proposed SemBERT learns the representation in a fine-grained manner and combines the strength of BERT on plain context representation with explicit semantics for deeper meaning representation.

Our model consists of three components: 1) an off-the-shelf semantic role labeler to annotate the input sentences with a variety of semantic role labels; 2) a sequence encoder where a pre-trained language model is used to build representations for the input raw texts and the semantic role labels are mapped to embeddings in parallel; 3) a semantic integration component to integrate the text representation with the contextual explicit semantic embedding to obtain the joint representation for downstream tasks.

The proposed SemBERT can be directly applied to typical NLU tasks. Our model is evaluated on 11 benchmark datasets involving natural language inference, question answering, semantic similarity and text classification. SemBERT obtains new state-of-the-art results on SNLI and also obtains significant gains on the GLUE benchmark and SQuAD 2.0. Ablation studies and analysis verify that the introduced explicit semantics is essential to the further performance improvement and that SemBERT essentially and effectively works as a unified semantics-enriched language representation model (the code is publicly available at https://github.com/cooelf/SemBERT).

## 2 Background and Related Work

### 2.1 Language Modeling for NLU

Natural language understanding tasks require a comprehensive understanding of natural languages and the ability to do further inference and reasoning.
A common trend among NLU studies is that models are becoming more and more sophisticated with stacked attention mechanisms or large-scale corpora (Zhang et al. 2018; 2020a; Zhou, Zhang, and Zhao 2019), resulting in an explosive growth of computational cost. Notably, well pre-trained contextual language models such as ELMo (Peters et al. 2018), GPT (Radford et al. 2018) and BERT (Devlin et al. 2018) have been shown to be powerful in boosting NLU tasks to new high performance.

Distributed representations have been widely used as a standard part of NLP models due to their ability to capture the local co-occurrence of words from large-scale unlabeled text (Mikolov et al. 2013). However, these approaches for learning word vectors only involve a single, context-independent representation for each word, with little consideration of contextual encoding at the sentence level. Thus the recently introduced contextual language models, including ELMo, GPT, BERT and XLNet, fill this gap by strengthening contextual sentence modeling for better representation, among which BERT uses a different pre-training objective, the masked language model, which allows capturing both sides of context, left and right. Besides, BERT also introduces a next sentence prediction task that jointly pre-trains text-pair representations. The latest evaluation shows that BERT is powerful and convenient for downstream NLU tasks.

The major technical improvement of these newly proposed language models over traditional embeddings is that they focus on extracting context-sensitive features from language models. When integrating these contextual word embeddings with existing task-specific architectures, ELMo helps boost several major NLP benchmarks (Peters et al. 2018), including question answering on SQuAD, sentiment analysis (Socher et al. 2013), and named entity recognition (Sang and De Meulder 2003), while BERT is especially effective on language understanding tasks on GLUE, MultiNLI and SQuAD (Devlin et al. 2018). In this work, we follow this line of extracting context-sensitive features and take pre-trained BERT as our backbone encoder for jointly learning explicit contextual semantics.

### 2.2 Explicit Contextual Semantics

Although distributed representations, including the latest advanced pre-trained contextual language models, have already been strengthened by semantics to some extent from a linguistic sense (Clark et al. 2019), we argue that such implicit semantics may not be enough to support a powerful contextual representation for NLU, according to our observation of the semantically incomplete answer spans generated by BERT on SQuAD, which motivates us to directly introduce explicit semantics.

There are a few formal semantic frames, including FrameNet (Baker, Fillmore, and Lowe 1998) and PropBank (Palmer, Gildea, and Kingsbury 2005), of which the latter is more popularly implemented in computational linguistics. Formal semantics generally presents the semantic relationship as a predicate-argument structure. For example, given the following sentence with target verb (predicate) sold, all the arguments are labeled as follows,

[ARG0 Charlie] [V sold] [ARG1 a book] [ARG2 to Sherry] [AM-TMP last week].

where ARG0 represents the seller (agent), ARG1 represents the thing sold (theme), ARG2 represents the buyer (recipient), AM-TMP is an adjunct indicating the timing of the action, and V represents the predicate.
To parse the predicate-argument structure, we have an NLP task, semantic role labeling (SRL) (Zhao, Chen, and Kit 2009; Zhao, Zhang, and Kit 2013). Recently, end-to-end neural SRL systems have been introduced (He et al. 2017; Li et al. 2019). These studies tackle argument identification and argument classification in one shot. He et al. (2017) presented a deep highway BiLSTM architecture with constrained decoding, which is simple and effective, enabling us to select it as our basic semantic role labeler. Inspired by recent advances, we can easily integrate SRL into NLU.

## 3 Semantics-aware BERT

Figure 1 overviews our semantics-aware BERT framework. We omit the rather extensive formulations of BERT and recommend readers to get the details from (Devlin et al. 2018). SemBERT is designed to be capable of handling multiple sequence inputs. In SemBERT, the words in the input sequence are passed to a semantic role labeler to fetch multiple predicate-derived structures of explicit semantics, and the corresponding embeddings are aggregated after a linear layer to form the final semantic embedding. In parallel, the input sequence is segmented into subwords (if any) by the BERT word-piece tokenizer, and the subword representation is then transformed back to the word level via a convolutional layer to obtain the contextual word representations. At last, the word representations and the semantic embedding are concatenated to form the joint representation for downstream tasks.

Figure 1: Semantics-aware BERT. * denotes the pre-trained labeler, which will not be fine-tuned in our framework. In the figure, the text {reconstructing dormitories will not be approved by cavanaugh} is tokenized by BERT into the subword-level sequence {rec, ##ons, ##tructing, dorm, ##itor, ##ies, will, not, be, approved, by, ca, ##vana, ##ugh}, while the labeler yields two kinds of word-level semantic structures: [ARG1: reconstructing dormitories] [ARGM-MOD: will] [ARGM-NEG: not] be [V: approved] [ARG0: by cavanaugh], and [V: reconstructing] [ARG1: dormitories] will not be approved by cavanaugh.

### 3.1 Semantic Role Labeling

During data pre-processing, each sentence is annotated into several semantic sequences using our pre-trained semantic labeler. We take the PropBank style (Palmer, Gildea, and Kingsbury 2005) of semantic roles to annotate every token of the input sequence with semantic labels. Given a specific sentence, there would be various predicate-argument structures. As shown in Figure 1, for the text [reconstructing dormitories will not be approved by cavanaugh], there are two semantic structures in the view of the predicates in the sentence.

To disclose the multidimensional semantics, we group the semantic labels and integrate them with text embeddings in the next encoding component. The input data flow is depicted in Figure 2.
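As a concrete illustration of this pre-processing step, the snippet below sketches how PropBank-style, per-predicate BIO label sequences could be obtained for a sentence. The paper builds on the labeler of He et al. (2017); AllenNLP's SRL predictor is used here only as an illustrative stand-in, and the model path is a placeholder.

```python
# Hedged sketch: annotate a sentence with one BIO tag sequence per predicate.
# AllenNLP's SRL predictor is a stand-in for the paper's own labeler.
from allennlp.predictors.predictor import Predictor

srl_predictor = Predictor.from_path("path/to/pretrained-srl-model.tar.gz")  # placeholder path

def annotate(sentence: str):
    """Return the word sequence and one BIO tag sequence per predicate."""
    output = srl_predictor.predict(sentence=sentence)
    words = output["words"]
    # Each entry in "verbs" is one predicate-specific structure over the words,
    # e.g. B-ARG1 I-ARG1 B-ARGM-MOD B-ARGM-NEG O B-V B-ARG0 I-ARG0
    tag_sequences = [frame["tags"] for frame in output["verbs"]]
    return words, tag_sequences

words, label_sequences = annotate(
    "reconstructing dormitories will not be approved by cavanaugh")
```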
### 3.2 Encoding

The raw text sequences and semantic role label sequences are first represented as embedding vectors to feed a pre-trained BERT. The input sentence $X = \{x_1, \ldots, x_n\}$ is a sequence of words of length $n$, which is first tokenized into word pieces (subword tokens). The Transformer encoder then captures the contextual information for each token via self-attention and produces a sequence of contextual embeddings.

For the $m$ label sequences, each related to a predicate, we have $T = \{t_1, \ldots, t_m\}$, where $t_i$ contains $n$ labels denoted as $\{label_1^i, label_2^i, \ldots, label_n^i\}$. Since our labels are at the word level, the length is equal to the original sentence length $n$ of $X$. We regard the semantic signals as embeddings and use a lookup table to map these labels to vectors $\{v_1^i, v_2^i, \ldots, v_n^i\}$, and feed a BiGRU layer to obtain the label representations for the $m$ label sequences in latent space, $e(t_i) = \mathrm{BiGRU}(v_1^i, v_2^i, \ldots, v_n^i)$, where $0 < i \leq m$. Letting $L_i$ denote the label sequences for token $x_i$, we have $e(L_i) = \{e(t_1), \ldots, e(t_m)\}$. We concatenate the $m$ sequences of label representations and feed them to a fully connected layer to obtain the refined joint representation $e^t$ in dimension $d$:

$e'(L_i) = W_2 [e(t_1), e(t_2), \ldots, e(t_m)] + b_2, \quad e^t = \{e'(L_1), \ldots, e'(L_n)\}, \quad (1)$

where $W_2$ and $b_2$ are trainable parameters.

Figure 2: The input representation flow.

### 3.3 Integration

This integration module fuses the lexical text embedding and the label representations. As the original pre-trained BERT is based on a sequence of subwords while our introduced semantic labels are on words, we need to align these differently sized sequences. Thus we group the subwords for each word and use a convolutional neural network (CNN) with max pooling to obtain the representation at the word level. We select a CNN because of its fast speed, and our preliminary experiments show that it also gives better results than RNNs on our concerned tasks, where we think the local features captured by the CNN are beneficial for subword-derived LM modeling.

We take one word as an example. Suppose that word $x_i$ is made up of a sequence of subwords $[s_1, s_2, \ldots, s_l]$, where $l$ is the number of subwords for word $x_i$. Denoting the representation of subword $s_j$ from BERT as $e(s_j)$, we first utilize a Conv1D layer, $e'_i = W_1 [e(s_i), e(s_{i+1}), \ldots, e(s_{i+k-1})] + b_1$, where $W_1$ and $b_1$ are trainable parameters and $k$ is the kernel size. We then apply ReLU and max pooling to the output embedding sequence for $x_i$:

$e^*_i = \mathrm{ReLU}(e'_i), \quad e(x_i) = \mathrm{MaxPooling}(e^*_1, \ldots, e^*_{l-k+1}). \quad (2)$

Therefore, the whole representation for the word sequence $X$ is $e^w = \{e(x_1), \ldots, e(x_n)\} \in \mathbb{R}^{n \times d_w}$, where $d_w$ denotes the dimension of the word embedding. The aligned context and distilled semantic embeddings are then merged by a fusion function $h = e^w \oplus e^t$, where $\oplus$ represents the concatenation operation (we also tried summation, multiplication and attention mechanisms, but our experiments show that concatenation is the best).
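To make Eq. (1) and Eq. (2) concrete, here is a minimal PyTorch sketch of the label encoding and the subword-to-word alignment. It assumes the paper's stated defaults (104 labels, SRL embedding dimension 10, m = 3); the module names and the remaining dimensions (e.g., hidden size 768, kernel size 3) are illustrative, not the authors' released code.

```python
import torch
import torch.nn as nn

class SemanticEmbedding(nn.Module):
    """Sketch of Eq. (1): embed m predicate-specific label sequences,
    encode each with a BiGRU, then fuse them with a linear layer (W2, b2)."""
    def __init__(self, num_labels=104, label_dim=10, m=3, out_dim=10):
        super().__init__()
        self.lookup = nn.Embedding(num_labels, label_dim)
        self.bigru = nn.GRU(label_dim, label_dim // 2,
                            bidirectional=True, batch_first=True)
        self.fuse = nn.Linear(m * label_dim, out_dim)        # W2, b2

    def forward(self, label_ids):                             # [batch, m, n]
        batch, m, n = label_ids.shape
        v = self.lookup(label_ids).view(batch * m, n, -1)     # label vectors v^i
        e, _ = self.bigru(v)                                   # e(t_i) per sequence
        e = e.view(batch, m, n, -1).permute(0, 2, 1, 3).reshape(batch, n, -1)
        return self.fuse(e)                                    # e^t: [batch, n, out_dim]

class SubwordToWord(nn.Module):
    """Sketch of Eq. (2): Conv1D + ReLU + max pooling over the subwords of one word."""
    def __init__(self, hidden=768, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(hidden, hidden, kernel_size)     # W1, b1

    def forward(self, subword_repr):                           # [l, hidden], assumes l >= kernel_size
        out = torch.relu(self.conv(subword_repr.t().unsqueeze(0)))  # [1, hidden, l-k+1]
        return out.max(dim=-1).values.squeeze(0)               # e(x_i): [hidden]

# Integration: concatenate the word-level BERT representation e^w with the semantic e^t,
# e.g. h = torch.cat([e_w, e_t], dim=-1), before the task-specific output layer.
```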
## 4 Model Implementation

Now, we introduce the specific implementation parts of our SemBERT. SemBERT can serve as a front-end encoder for a wide range of tasks and can also become an end-to-end model with only a linear layer for prediction. For simplicity, we only show the straightforward SemBERT that directly gives the predictions after fine-tuning (we only use a single model for each task, without joint training or parameter sharing).

### 4.1 Semantic Role Labeler

To obtain the semantic labels, we use a pre-trained SRL module to predict all predicates and corresponding arguments in one shot. We implement the semantic role labeler from Peters et al. (2018), achieving an F1 of 84.6% (this result nearly reaches the state of the art in (He et al. 2018)) on the English OntoNotes v5.0 benchmark dataset (Pradhan et al. 2013) for the CoNLL-2012 shared task. At test time, we perform Viterbi decoding to enforce valid spans using BIO constraints. In our implementation, there are 104 labels in total. We use O for non-argument words and the Verb label for predicates.

### 4.2 Task-specific Fine-tuning

In Section 3, we have described how to obtain the semantics-aware BERT representations. Here, we show how to adapt SemBERT to classification, regression and span-based MRC tasks. We transform the fused contextual semantic and LM representation $h$ to a lower dimension and obtain the prediction distributions. Note that this part is basically the same as the implementation in BERT without any modification, to avoid extra influence and to focus on the intrinsic performance of SemBERT. We outline it here for the completeness of the implementation.

For classification and regression tasks, $h$ is directly passed to a fully connected layer to get the class logits or score, respectively. The training objectives are cross entropy for classification tasks and mean squared error loss for regression tasks.

For span-based reading comprehension, $h$ is passed to a fully connected layer to get the start logits $s$ and end logits $e$ of all tokens. The score of a candidate span from position $i$ to position $j$ is defined as $s_i + e_j$, and the maximum-scoring span where $j \geq i$ is used as the prediction (all the candidate scores are normalized by softmax). For prediction, we compare the score of the pooled first-token span, $s_{null} = s_0 + e_0$, to the score of the best non-null span, $\hat{s}_{i,j} = \max_{j \geq i}(s_i + e_j)$. We predict a non-null answer when $\hat{s}_{i,j} > s_{null} + \tau$, where the threshold $\tau$ is selected on the dev set to maximize F1.
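The following is a minimal sketch of this prediction rule for one example, assuming the start/end logit vectors have already been produced by the output layer. It illustrates the thresholding described above and is not the authors' exact inference code.

```python
import torch

def predict_span(start_logits, end_logits, tau):
    """Hedged sketch of span prediction with a no-answer threshold.
    start_logits, end_logits: [seq_len] tensors for one example; index 0 is the pooled
    [CLS] token. In practice, question and padding positions are also masked out."""
    s_null = start_logits[0] + end_logits[0]                     # null (no-answer) span score
    pair = start_logits.unsqueeze(1) + end_logits.unsqueeze(0)   # pair[i, j] = s_i + e_j
    valid = torch.triu(torch.ones_like(pair)).bool()             # keep spans with j >= i
    valid[0, :] = False                                          # exclude spans touching
    valid[:, 0] = False                                          # the null position itself
    pair = pair.masked_fill(~valid, float("-inf"))
    best = pair.max()
    if best > s_null + tau:                                      # answerable
        i, j = divmod(int(pair.argmax()), pair.size(1))
        return i, j                                              # predicted (start, end)
    return None                                                  # abstain: unanswerable
```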
## 5 Experiments

Our implementation is based on the PyTorch implementation of BERT (https://github.com/huggingface/pytorch-pretrained-BERT). We use the pre-trained weights of BERT and follow the same fine-tuning procedure as BERT without any modification, and all the layers are tuned with a moderate increase in model size, as the extra SRL embedding volume is less than 15% of the original encoder size. We set the initial learning rate in {8e-6, 1e-5, 2e-5, 3e-5} with a warm-up rate of 0.1 and L2 weight decay of 0.01. The batch size is selected in {16, 24, 32}. The maximum number of epochs is set in [2, 5] depending on the task. Texts are tokenized into wordpieces, with a maximum length of 384 for SQuAD and 200 for the other tasks. The dimension of the SRL embedding is set to 10. The default maximum number of predicate-argument structures m is set to 3. Hyper-parameters were selected using the dev set.

### 5.2 Tasks and Datasets

Our evaluation is performed on ten NLU benchmark datasets involving natural language inference, machine reading comprehension, semantic similarity and text classification. Some of these tasks are available from the recently released GLUE benchmark (Wang et al. 2018), which is a collection of nine NLU tasks. We also extend our experiments to two widely used tasks, SNLI (Bowman et al. 2015) and SQuAD 2.0 (Rajpurkar, Jia, and Liang 2018), to show the superiority of our model.

**Reading Comprehension** As a widely used MRC benchmark dataset, SQuAD 2.0 (Rajpurkar, Jia, and Liang 2018) combines the 100,000 questions in SQuAD 1.1 (Rajpurkar et al. 2016) with over 50,000 new, unanswerable questions that are written adversarially by crowdworkers to look similar to answerable ones. For SQuAD 2.0, systems must not only answer questions when possible, but also abstain from answering when no answer is supported by the paragraph.

**Natural Language Inference** Natural language inference involves reading a pair of sentences and judging the relationship between their meanings, such as entailment, neutral and contradiction. We evaluate on four diverse datasets, including Stanford Natural Language Inference (SNLI) (Bowman et al. 2015), Multi-Genre Natural Language Inference (MNLI) (Nangia et al. 2017), Question Natural Language Inference (QNLI) (Rajpurkar et al. 2016) and Recognizing Textual Entailment (RTE) (Bentivogli et al. 2009).

**Semantic Similarity** Semantic similarity tasks aim to predict whether two sentences are semantically equivalent or not. The challenge lies in recognizing rephrasings of concepts, understanding negation, and handling syntactic ambiguity. Three datasets are used, including the Microsoft Paraphrase Corpus (MRPC) (Dolan and Brockett 2005), the Quora Question Pairs (QQP) dataset (Chen et al. 2018) and the Semantic Textual Similarity benchmark (STS-B) (Cer et al. 2017).

**Classification** The Corpus of Linguistic Acceptability (CoLA) (Warstadt, Singh, and Bowman 2018) is used to predict whether an English sentence is linguistically acceptable or not. The Stanford Sentiment Treebank (SST-2) (Socher et al. 2013) provides a dataset for sentiment classification that needs to determine whether the sentiment of a sentence extracted from movie reviews is positive or negative.
### 5.3 Results

Table 1 shows results on the GLUE benchmark datasets, showing that SemBERT gives substantial gains over BERT and outperforms all the previous state-of-the-art models in the literature (we find that the MNLI model can be effectively transferred to the RTE and MRPC datasets, thus the models for RTE and MRPC are fine-tuned based on our MNLI model). Since SemBERT takes BERT as the backbone with the same evaluation procedure, the gain is entirely owing to the newly introduced explicit contextual semantics. Though recent dominant models take advantage of multi-tasking, knowledge distillation, transfer learning or ensembles, our single model is lightweight and competitive, and even yields better results with a simpler design and fewer parameters. A model parameter comparison is shown in Table 4. We observe that without multi-task learning like MT-DNN (since MT-DNN is a multi-task learning framework with shared parameters on 9 task-specific layers, we count the 340M shared parameters nine times for a fair comparison), our model still achieves remarkable results. In particular, we observe substantial improvements on small datasets such as RTE, MRPC and CoLA, which demonstrates that involving explicit semantics helps the model work better with small training data, which is important for most NLP tasks where large-scale annotated data is unavailable.

| Method | CoLA (mc) | SST-2 (acc) | MNLI m/mm (acc) | QNLI (acc) | RTE (acc) | MRPC (F1) | QQP (F1) | STS-B (pc) | Score |
|---|---|---|---|---|---|---|---|---|---|
| *Leaderboard (September, 2019)* | | | | | | | | | |
| ALBERT | 69.1 | 97.1 | 91.3/91.0 | 99.2 | 89.2 | 93.4 | 74.2 | 92.5 | 89.4 |
| RoBERTa | 67.8 | 96.7 | 90.8/90.2 | 98.9 | 88.2 | 92.1 | 90.2 | 92.2 | 88.5 |
| XLNet | 67.8 | 96.8 | 90.2/89.8 | 98.6 | 86.3 | 93.0 | 90.3 | 91.6 | 88.4 |
| *In literature (April, 2019)* | | | | | | | | | |
| BiLSTM+ELMo+Attn | 36.0 | 90.4 | 76.4/76.1 | 79.9 | 56.8 | 84.9 | 64.8 | 75.1 | 70.5 |
| GPT | 45.4 | 91.3 | 82.1/81.4 | 88.1 | 56.0 | 82.3 | 70.3 | 82.0 | 72.8 |
| GPT on STILTs | 47.2 | 93.1 | 80.8/80.6 | 87.2 | 69.1 | 87.7 | 70.1 | 85.3 | 76.9 |
| MT-DNN | 61.5 | 95.6 | 86.7/86.0 | - | 75.5 | 90.0 | 72.4 | 88.3 | 82.2 |
| BERT_BASE | 52.1 | 93.5 | 84.6/83.4 | - | 66.4 | 88.9 | 71.2 | 87.1 | 78.3 |
| BERT_LARGE | 60.5 | 94.9 | 86.7/85.9 | 92.7 | 70.1 | 89.3 | 72.1 | 87.6 | 80.5 |
| *Our implementation* | | | | | | | | | |
| SemBERT_BASE | 57.8 | 93.5 | 84.4/84.0 | 90.9 | 69.3 | 88.2 | 71.8 | 87.3 | 80.9 |
| SemBERT_LARGE | 62.3 | 94.6 | 87.6/86.3 | 94.6 | 84.5 | 91.2 | 72.8 | 87.8 | 82.9 |

Table 1: Results on the GLUE benchmark. The block *In literature* shows the comparable results from (Liu et al. 2019; Radford et al. 2018) at the time of submitting SemBERT to GLUE (April, 2019).

| Model | Params (M) | Shared (M) | Rate |
|---|---|---|---|
| MT-DNN | 3,060 | 340 | 9.1 |
| BERT on STILTs | 335 | - | 1.0 |
| BERT | 335 | - | 1.0 |
| SemBERT | 340 | - | 1.0 |

Table 4: Parameter comparison of LARGE models. The numbers are from the GLUE leaderboard (https://gluebenchmark.com/leaderboard).

For SQuAD 2.0, most of the top results on the leaderboard do not have public model descriptions available, and it is allowed to use any public data for system training. We therefore further adopt synthetic self-training (https://nlp.stanford.edu/seminar/details/jdevlin.pdf) for data augmentation, denoted as SemBERT*_LARGE. Table 2 shows the results for reading comprehension on the SQuAD 2.0 test set (there is a restriction on submission frequency for online SQuAD 2.0 evaluation, so we do not submit our base models). SemBERT boosts the strong BERT baseline substantially on both EM and F1. It also outperforms all the published works and achieves comparable performance with a few unpublished models from the leaderboard.

| Model | EM | F1 |
|---|---|---|
| #1 BERT + DAE + AoA† | 85.9 | 88.6 |
| #2 SG-Net† | 85.2 | 87.9 |
| #3 BERT + NGM + SST† | 85.2 | 87.7 |
| U-Net (Sun et al. 2018) | 69.2 | 72.6 |
| RMR + ELMo + Verifier (Hu et al. 2018) | 71.7 | 74.2 |
| *Our implementation* | | |
| BERT_LARGE | 80.5 | 83.6 |
| SemBERT_LARGE | 82.4 | 85.2 |
| SemBERT*_LARGE | 84.8 | 87.9 |

Table 2: Exact Match (EM) and F1 scores on the SQuAD 2.0 test set for single models. † denotes the top 3 single submissions from the leaderboard at the time of submitting SemBERT (11 April, 2019).

Table 3 shows that SemBERT also achieves a new state-of-the-art on the SNLI benchmark and even outperforms all the ensemble models (https://nlp.stanford.edu/projects/snli/) by a large margin (as ensemble models are commonly composed of multiple heterogeneous models and resources, we exclude them from our table to save space).

| Model | Dev | Test |
|---|---|---|
| *In literature* | | |
| DRCN (Kim et al. 2018) | - | 90.1 |
| SJRC (Zhang et al. 2019) | - | 91.3 |
| MT-DNN (Liu et al. 2019)† | 92.2 | 91.6 |
| *Our implementation* | | |
| BERT_BASE | 90.8 | 90.7 |
| BERT_LARGE | 91.3 | 91.1 |
| SemBERT_BASE | 91.2 | 91.0 |
| SemBERT_LARGE | 92.3 | 91.6 |

Table 3: Accuracy on the SNLI dataset. The previous state-of-the-art result is marked by †. Both our SemBERT and BERT are single models, fine-tuned based on the pre-trained models.

## 6 Analysis

### 6.1 Ablation Study

To evaluate the contributions of key factors in our method, we perform an ablation study on the SNLI and SQuAD 2.0 dev sets as shown in Table 6. Since SemBERT absorbs contextual semantics in a deep processing way, we wonder whether a simple and straightforward way of integrating such semantic information may still work; thus we concatenate the SRL embedding with the BERT subword embeddings for a direct comparison, where the semantic role labels are copied to the number of subwords for each original word, without CNN and pooling for word-level alignment.
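A minimal sketch of this ablation baseline's label alignment is shown below; the helper name is hypothetical and only illustrates the "copy each word-level label to its subwords" step described above.

```python
def broadcast_labels_to_subwords(word_labels, subword_counts):
    """Sketch of the BERT+SRL ablation baseline: copy each word-level SRL label
    to all of that word's subwords, so label embeddings can be concatenated with
    BERT subword embeddings directly (no CNN/pooling word-level alignment)."""
    subword_labels = []
    for label, count in zip(word_labels, subword_counts):
        subword_labels.extend([label] * count)
    return subword_labels

# e.g. "reconstructing" -> 3 word pieces, "dormitories" -> 3, "will" -> 1
tags = broadcast_labels_to_subwords(["B-ARG1", "I-ARG1", "B-ARGM-MOD"], [3, 3, 1])
# -> ['B-ARG1', 'B-ARG1', 'B-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1', 'B-ARGM-MOD']
```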
From the results, we observe that the simple concatenation yields an improvement, verifying that integrating contextual semantics is quite useful for language understanding. However, SemBERT still outperforms the simple BERT+SRL model just as the latter outperforms the original BERT by a large performance margin, which shows that SemBERT works more effectively by integrating both the plain contextual representation and contextual semantics at the same time.

| Model | SNLI (Dev) | SQuAD 2.0 (EM) | SQuAD 2.0 (F1) |
|---|---|---|---|
| BERT_LARGE | 91.3 | 79.6 | 82.4 |
| BERT_LARGE+SRL | 91.5 | 80.3 | 83.1 |
| SemBERT_LARGE | 92.3 | 80.9 | 83.6 |

Table 6: Analysis on the SNLI and SQuAD 2.0 dev sets.

### 6.2 The Influence of the Number m

We investigate the influence of the maximum number of predicate-argument structures m by setting it from 1 to 5. Table 7 shows the result. We observe that a modest value of m is better.

| Number m | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Accuracy | 91.49 | 91.36 | 91.57 | 91.29 | 91.42 |

Table 7: The influence of the maximum number of predicate-argument structures m.

### 6.3 Model Prediction

To have an intuitive observation of the predictions of SemBERT, we show a list of prediction examples on SQuAD 2.0 from the baseline BERT and SemBERT (henceforth, we use the SemBERT* model from Table 2 as the strong and challenging baseline for ablation) in Table 5. The comparison indicates that our model extracts more semantically accurate answers, yielding more exact-match answers, while those from the baseline BERT model are often semantically incomplete. This shows that utilizing explicit semantics has the potential to guide the model to produce meaningful predictions. Intuitively, the advance can be attributed to better awareness of semantic role spans, which guides the model to explicitly learn patterns like who did what to whom.

| Question | Baseline | SemBERT |
|---|---|---|
| What is a very seldom used unit of mass in the metric system? | The ki | metric slug |
| What is the lone MLS team that belongs to southern California? | Galaxy | LA Galaxy |
| How many people does the Greater Los Angeles Area have? | 17.5 million | over 17.5 million |

Table 5: The comparison of answers from the baseline and our model. In these examples, answers from SemBERT are the same as the ground truth.

Through the comparison, we observe that SemBERT might benefit from better span segmentation through span-based SRL labeling. We conduct a case study on our best model on SQuAD 2.0 by transforming SRL into segmentation tags that indicate which tokens are inside or outside the segmented spans. The result is 83.69 (EM) / 87.02 (F1), which shows that the segmentation indeed works but is only marginally beneficial compared with our complete architecture. It is worth noting that we are motivated to use the SRL signals to help the model capture the span relationships inside a sentence, which yields both semantic label hints and segmentation benefits across semantic role spans to some extent. The segmentation could also be regarded as an awareness of semantics, even with better semantic span segmentation. Intuitively, this indicates that our model evolves from BERT's subword-level representation to an intermediate word-level and a final semantic representation.

### 6.4 Influence of the Accuracy of SRL

Our model relies on a semantic role labeler whose accuracy would influence the overall model performance. To investigate this influence, we degrade our labeler by randomly turning a specific proportion [0, 20%, 40%] of labels into random erroneous ones to simulate cascading errors. The F1 scores on SQuAD are respectively [87.93, 87.31, 87.24].
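As an illustration of this degradation protocol, the sketch below corrupts a given fraction of SRL labels at random before they are embedded; the label inventory and helper name are hypothetical stand-ins rather than the authors' code.

```python
import random

def corrupt_labels(label_sequences, noise_rate, label_vocab):
    """Randomly replace a fraction `noise_rate` of SRL labels with wrong ones,
    simulating cascading errors from an imperfect labeler (0.0, 0.2 or 0.4 above)."""
    corrupted = []
    for tags in label_sequences:                      # one BIO tag sequence per predicate
        new_tags = []
        for tag in tags:
            if random.random() < noise_rate:
                # pick any label other than the correct one
                new_tags.append(random.choice([l for l in label_vocab if l != tag]))
            else:
                new_tags.append(tag)
        corrupted.append(new_tags)
    return corrupted

# Example with a tiny, purely illustrative label vocabulary:
vocab = ["O", "B-V", "B-ARG0", "I-ARG0", "B-ARG1", "I-ARG1", "B-ARGM-NEG", "B-ARGM-MOD"]
noisy = corrupt_labels(
    [["B-ARG1", "I-ARG1", "B-ARGM-MOD", "B-ARGM-NEG", "O", "B-V", "B-ARG0", "I-ARG0"]],
    noise_rate=0.2, label_vocab=vocab)
```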
This robustness can be attributed to the concatenation operation over the BERT hidden states and the SRL representation, in which the lower-dimensional SRL representation (even when noisy) does not strongly affect the former. This result indicates that the LM can not only benefit from a high-accuracy labeler but can also stay robust against noisy labels.

Besides the wide range of tasks verified in this work, SemBERT could also be easily adapted to other languages. As SRL is a fundamental NLP task, it is convenient to train a labeler for major languages, as CoNLL-2009 provides seven SRL treebanks. For languages without available treebanks, unsupervised SRL methods can be effectively applied. Regarding the out-of-domain issue, the datasets we work on (GLUE and SQuAD) cover quite diverse domains, and experiments show that our method still works.

## 7 Conclusion

This paper proposes a novel semantics-aware BERT network architecture for fine-grained language representation. Experiments on a wide range of NLU tasks, including natural language inference, question answering, machine reading comprehension, semantic similarity and text classification, show its superiority over the strong baseline BERT. Our model surpasses all the published works on all of the concerned NLU tasks. This work discloses the effectiveness of semantics-aware BERT for natural language understanding, demonstrating that explicit contextual semantics can be effectively integrated with state-of-the-art pre-trained language representations for even better performance. Recently, most works have focused on heuristically stacking complex mechanisms for performance improvement; instead, we hope to shed some light on fusing accurate semantic signals for deeper comprehension and inference through a simple but effective method.

## References

Baker, C. F.; Fillmore, C. J.; and Lowe, J. B. 1998. The Berkeley FrameNet project. In COLING.

Bentivogli, L.; Clark, P.; Dagan, I.; and Giampiccolo, D. 2009. The fifth PASCAL recognizing textual entailment challenge. In ACL-PASCAL.

Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. In EMNLP.

Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; and Specia, L. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.

Chen, Z.; Zhang, H.; Zhang, X.; and Zhao, L. 2018. Quora question pairs.

Clark, K.; Khandelwal, U.; Levy, O.; and Manning, C. D. 2019. What does BERT look at? An analysis of BERT's attention. arXiv preprint arXiv:1906.04341.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dolan, W. B., and Brockett, C. 2005. Automatically constructing a corpus of sentential paraphrases. In IWP2005.

He, L.; Lee, K.; Lewis, M.; and Zettlemoyer, L. 2017. Deep semantic role labeling: What works and what's next. In ACL.

He, L.; Lee, K.; Levy, O.; and Zettlemoyer, L. 2018. Jointly predicting predicates and arguments in neural semantic role labeling. In ACL.
Hu, M.; Peng, Y.; Huang, Z.; Yang, N.; Zhou, M.; et al. 2018. Read + Verify: Machine reading comprehension with unanswerable questions. arXiv preprint arXiv:1808.05759.

Jia, R., and Liang, P. 2017. Adversarial examples for evaluating reading comprehension systems. In EMNLP.

Kim, S.; Hong, J.-H.; Kang, I.; and Kwak, N. 2018. Semantic sentence matching with densely-connected recurrent and co-attentive information. arXiv preprint arXiv:1805.11360.

Li, Z.; He, S.; Zhao, H.; Zhang, Y.; Zhang, Z.; Zhou, X.; and Zhou, X. 2019. Dependency or span, end-to-end uniform semantic role labeling. In AAAI. arXiv preprint arXiv:1901.05280.

Liu, X.; He, P.; Chen, W.; and Gao, J. 2019. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Mudrakarta, P. K.; Taly, A.; Sundararajan, M.; and Dhamdhere, K. 2018. Did the model understand the question? In ACL.

Nangia, N.; Williams, A.; Lazaridou, A.; and Bowman, S. R. 2017. The RepEval 2017 shared task: Multi-genre natural language inference with sentence representations. In RepEval.

Palmer, M.; Gildea, D.; and Kingsbury, P. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics 31(1):71-106.

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In NAACL-HLT.

Pradhan, S.; Moschitti, A.; Xue, N.; Ng, H. T.; Björkelund, A.; Uryupina, O.; Zhang, Y.; and Zhong, Z. 2013. Towards robust linguistic analysis using OntoNotes. In CoNLL.

Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training. Technical report.

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.

Rajpurkar, P.; Jia, R.; and Liang, P. 2018. Know what you don't know: Unanswerable questions for SQuAD. In ACL.

Sang, E. F., and De Meulder, F. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.

Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.

Sun, F.; Li, L.; Qiu, X.; and Liu, Y. 2018. U-Net: Machine reading comprehension with unanswerable questions. arXiv preprint arXiv:1810.06638.

Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 2018 EMNLP Workshop BlackboxNLP.

Warstadt, A.; Singh, A.; and Bowman, S. R. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; and Le, Q. V. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Zhang, Z.; Li, J.; Zhu, P.; and Zhao, H. 2018. Modeling multi-turn conversation with deep utterance aggregation. In COLING. arXiv preprint arXiv:1806.09102.

Zhang, Z.; Wu, Y.; Li, Z.; and Zhao, H. 2019. Explicit contextual semantics for text comprehension. In PACLIC 33. arXiv preprint arXiv:1809.02794.

Zhang, S.; Zhao, H.; Wu, Y.; Zhang, Z.; Zhou, X.; and Zhou, X. 2020a. Dual co-matching network for multi-choice reading comprehension. In AAAI. arXiv preprint arXiv:1901.09381.
Zhang, Z.; Wu, Y.; Zhou, J.; Duan, S.; and Zhao, H. 2020b. SG-Net: Syntax-guided machine reading comprehension. In AAAI. arXiv preprint arXiv:1908.05147.

Zhao, H.; Chen, W.; and Kit, C. 2009. Semantic dependency parsing of NomBank and PropBank: An efficient integrated approach via a large-scale feature selection. In EMNLP.

Zhao, H.; Zhang, X.; and Kit, C. 2013. Integrative semantic dependency parsing via efficient large-scale feature selection. Journal of Artificial Intelligence Research 46:203-233.

Zhou, J.; Zhang, Z.; and Zhao, H. 2019. LIMIT-BERT: Linguistic informed multi-task BERT. arXiv preprint arXiv:1910.14296.