# Text Matching as Image Recognition

Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng

CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China

{pangliang,wanshengxian}@software.ict.ac.cn, {lanyanyan,guojiafeng,junxu,cxq}@ict.ac.cn

## Abstract

Matching two texts is a fundamental problem in many natural language processing tasks. An effective way is to extract meaningful matching patterns from words, phrases, and sentences to produce the matching score. Inspired by the success of convolutional neural networks in image recognition, where neurons can capture many complicated patterns based on extracted elementary visual patterns such as oriented edges and corners, we propose to model text matching as a problem of image recognition. First, a matching matrix whose entries represent the similarities between words is constructed and viewed as an image. Then a convolutional neural network is utilized to capture rich matching patterns in a layer-by-layer way. We show that, by resembling the compositional hierarchies of patterns in image recognition, our model can successfully identify salient signals such as n-gram and n-term matchings. Experimental results demonstrate its superiority over the baselines.

## Introduction

Matching two texts is central to many natural language applications, such as machine translation (Brown et al. 1993), question answering (Xue, Jeon, and Croft 2008), paraphrase identification (Socher et al. 2011) and document retrieval (Li and Xu 2014). Given two texts T1 = (w1, w2, ..., wm) and T2 = (v1, v2, ..., vn), the degree of matching is typically measured as a score produced by a scoring function on the representation of each text:

match(T1, T2) = F(Φ(T1), Φ(T2)),  (1)

where wi and vj denote the i-th and j-th words in T1 and T2, respectively.
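The representation-then-scoring framework of Eq. (1) can be sketched in a few lines of NumPy. The mean-of-embeddings Φ and cosine F below are illustrative stand-ins chosen for this sketch, not the model proposed in this paper:

```python
import numpy as np

def phi(text, emb):
    """A toy Phi: represent a tokenized text as the mean of its word embeddings."""
    return np.mean([emb[w] for w in text], axis=0)

def match(t1, t2, emb):
    """A toy F: cosine similarity between the two text representations."""
    a, b = phi(t1, emb), phi(t2, emb)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Representation-based matchers of this form compress each text into a single vector before scoring, which is exactly why fine-grained word-by-word interactions can be lost.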
Φ is a function that maps each text to a vector, and F is the scoring function modeling the interactions between them. A successful matching algorithm needs to capture the rich interaction structures in the matching process. Taking the task of paraphrase identification as an example, consider the following two texts:

T1: Down the ages noodles and dumplings were famous Chinese food.

T2: Down the ages dumplings and noodles were popular in China.

We can see that the interaction structures are of different levels, from words and phrases to sentences. Firstly, there are many word-level matching signals, including identical word matching between "down" in T1 and "down" in T2, and similar word matching between "famous" in T1 and "popular" in T2. These signals compose phrase-level matching signals, including n-gram matching between "down the ages" in T1 and "down the ages" in T2, unordered n-term matching between "noodles and dumplings" in T1 and "dumplings and noodles" in T2, and semantic n-term matching between "were famous Chinese food" in T1 and "were popular in China" in T2. They further form sentence-level matching signals, which are critical for determining the matching degree of T1 and T2. How to automatically find and utilize these hierarchical interaction patterns remains a challenging problem.

In image recognition, it has been widely observed that the convolutional neural network (CNN) (LeCun et al. 1998; Simard, Steinkraus, and Platt 2003) can successfully abstract visual patterns from raw pixels with layer-by-layer composition (Girshick et al. 2014). Inspired by this observation, we propose to view text matching as image recognition and use a CNN to solve the above problem. Specifically, we first construct a word-level similarity matrix, namely the matching matrix, to capture the basic word-level matching signals.

Copyright © 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
The matching matrix can be viewed as: 1) a binary image, if we define the similarity to be 0-1, indicating whether the two corresponding words are identical; or 2) a gray image, if we define the similarity to be real-valued, which can be achieved by calculating the cosine or inner product of the word embeddings. Then we apply a convolutional neural network to this matrix. Meaningful matching patterns such as n-gram and n-term can therefore be fully captured within this architecture. We can see that our model treats text matching as a multi-level abstraction of interaction patterns between words, phrases and sentences, with a layer-by-layer architecture, so we name it MatchPyramid.

The experiments on the task of paraphrase identification show that MatchPyramid (with the 0-1 matching matrix) outperforms the baselines by solely leveraging interactions between texts. For other tasks such as paper citation matching, where semantics are more important, MatchPyramid (with the real-valued matching matrix) performs best by considering both interactions and semantic representations.

Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16)

The contributions of this paper include: 1) a novel view of text matching as image recognition; 2) the proposal of a new deep architecture based on the matching matrix, which can capture rich matching patterns at different levels, from words and phrases to whole sentences; 3) experimental analysis on different tasks to demonstrate the superior power of the proposed architecture against competitor matching algorithms.

## Motivation

It has been widely recognized that making a good matching decision requires taking into account the rich interaction structures in the text matching process, starting from the interactions between words, to various matching patterns in the phrases and the whole sentences.
Taking the aforementioned two sentences as an example, the interaction structures are of different levels, as illustrated in Figure 1.

Figure 1: An example of interaction structures in paraphrase identification.

Word-Level Matching Signals refer to matchings between words in the two texts, including not only identical word matchings, such as down ↔ down, the ↔ the, ages ↔ ages, noodles ↔ noodles, and ↔ and, dumplings ↔ dumplings and were ↔ were, but also similar word matchings, such as famous ↔ popular and chinese ↔ china.

Phrase-Level Matching Signals refer to matchings between phrases, including n-gram and n-term. An n-gram matching occurs with n exactly matched successive words, e.g. (down the ages) ↔ (down the ages). An n-term matching allows for order or semantic alternatives, e.g. (noodles and dumplings) ↔ (dumplings and noodles), and (were famous chinese food) ↔ (were popular in china).

Sentence-Level Matching Signals refer to matchings between sentences, which are composed of multiple lower-level matching signals, e.g. the three successive phrase-level matchings mentioned above. When we consider matchings between paragraphs that contain multiple sentences, the whole paragraph is viewed as a long sentence and the same composition strategy generates paragraph-level matching signals.

To sum up, the interaction structures form compositional hierarchies, in which higher-level signals are obtained by composing lower-level ones. This is similar to image recognition. In an image, raw pixels provide the basic units, and each patch may contain elementary visual features such as oriented edges and corners. Local combinations of edges form motifs, motifs assemble into parts, and parts form objects. We give an example to show the relationships between text matching and image recognition (Jia et al. 2014), as illustrated in Figure 2.
In the area of image recognition, the CNN has been recognized as one of the most successful ways to capture different levels of patterns in an image (Zeiler and Fergus 2014). This inspires us to transform text matching into image recognition and employ a CNN to solve it. However, the representations of text and image are so different that performing such a transformation remains a challenging problem.

## MatchPyramid

In this section we introduce a new deep architecture for text matching, namely MatchPyramid. The main idea comes from modeling text matching as image recognition, by taking the matching matrix as an image, as illustrated in Figure 3.

### Matching Matrix: Bridging the Gap between Text Matching and Image Recognition

As discussed before, one challenge in modeling text matching as image recognition lies in the different representations of text and image: the former are two 1D (one-dimensional) word sequences, while the latter is typically a 2D pixel grid. To address this issue, we represent the input of text matching as a matching matrix M, with each element Mij standing for the basic interaction, i.e. the similarity between words wi and vj (see Eq. 2). Here, for convenience, wi and vj denote the i-th and j-th words in the two texts respectively, and ⊗ stands for a general operator to obtain the similarity:

Mij = wi ⊗ vj.  (2)

In this way, we can view the matching matrix M as an image, where each entry (i.e. the similarity between two words) stands for the corresponding pixel value. We can adopt different kinds of ⊗ to model the interactions between two words, leading to different kinds of raw images. In this paper, we give three examples as follows.

Indicator Function produces either 1 or 0 to indicate whether two words are identical:

Mij = I{wi = vj} = 1 if wi = vj, and 0 otherwise.  (3)

One limitation of the indicator function is that it cannot capture the semantic matching between similar words.
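As a concrete illustration, the 0-1 matching matrix of Eqs. (2) and (3) for the two example sentences can be built in a few lines. This NumPy sketch is our own, not the paper's code:

```python
import numpy as np

def matching_matrix_indicator(t1, t2):
    """Eq. (3): M[i, j] = 1 if the i-th word of t1 equals the j-th word
    of t2, else 0 -- a binary 'image' over the two word sequences."""
    return np.array([[1.0 if wi == vj else 0.0 for vj in t2] for wi in t1])

t1 = "down the ages noodles and dumplings were famous chinese food".split()
t2 = "down the ages dumplings and noodles were popular in china".split()
M = matching_matrix_indicator(t1, t2)   # a 10 x 10 binary matrix
```

Identical n-grams such as "down the ages" show up as diagonal runs of ones in M, while the swapped "noodles"/"dumplings" pair produces ones off the diagonal: exactly the phrase-level signals described in the Motivation section.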
To tackle this problem, we define ⊗ based on word embeddings, which makes the matrix more flexible in capturing semantic interactions. Given the embedding of each word, αi = Φ(wi) and βj = Φ(vj), which can be obtained by the recent Word2Vec (Mikolov et al. 2013) technique, we introduce two further operators: cosine and dot product.

Cosine views the angle between word vectors as the similarity, and acts as a soft indicator function:

Mij = (αi · βj) / (‖αi‖ ‖βj‖),  (4)

where ‖·‖ stands for the norm of a vector; the ℓ2 norm is used in this paper.

Dot Product further considers the norm of the word vectors, as compared to cosine:

Mij = αi · βj.  (5)

Figure 2: Relationships between text matching and image recognition.

Figure 3: An overview of MatchPyramid on text matching.

Based on these three different operators, the matching matrices of the given example are shown in Figure 4. Obviously, Fig. 4(a) corresponds to a binary image, while Fig. 4(b) and 4(c) correspond to gray images.

Figure 4: Three different matching matrices ((a) Indicator, (b) Cosine, (c) Dot Product), where solid-circle elements are all valued 0.

### Hierarchical Convolution: A Way to Capture Rich Matching Patterns

The body of MatchPyramid is a typical convolutional neural network, which can extract different levels of matching patterns. For the first layer of the CNN, the k-th kernel w(1,k) scans over the whole matching matrix z(0) = M to generate a feature map z(1,k):

z(1,k)_{i,j} = σ( Σ_{s=0}^{rk−1} Σ_{t=0}^{rk−1} w(1,k)_{s,t} · z(0)_{i+s,j+t} + b(1,k) ),  (6)

where rk denotes the size of the k-th kernel. In this paper we use square kernels, and ReLU (Dahl, Sainath, and Hinton 2013) is adopted as the activation function σ. A dynamic pooling strategy (Socher et al. 2011) is then used to deal with text length variability. By applying dynamic pooling, we obtain fixed-size feature maps:

z(2,k)_{i,j} = max_{0≤s<dk} max_{0≤t<d′k} z(1,k)_{i·dk+s, j·d′k+t},
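The first convolutional layer of Eq. (6) and the dynamic pooling step can be sketched in plain NumPy as below. The loop-based implementation, the kernel values, and the pooled output size are illustrative assumptions for this sketch, not the paper's actual configuration (a real implementation would use a deep learning framework):

```python
import numpy as np

def conv_layer(M, kernels, biases):
    """Eq. (6): each square kernel scans the matching matrix and produces
    one feature map; ReLU is used as the activation sigma."""
    n_k, r, _ = kernels.shape            # n_k kernels of size r x r
    n, m = M.shape
    maps = np.empty((n_k, n - r + 1, m - r + 1))
    for k in range(n_k):
        for i in range(n - r + 1):
            for j in range(m - r + 1):
                s = np.sum(kernels[k] * M[i:i + r, j:j + r]) + biases[k]
                maps[k, i, j] = max(0.0, s)   # ReLU
    return maps

def dynamic_pool(fmap, out_h, out_w):
    """Dynamic max-pooling: split the feature map into an out_h x out_w grid
    of roughly equal cells (assumes fmap is at least out_h x out_w) and take
    the max in each cell, giving a fixed-size output for any input length."""
    h, w = fmap.shape
    rows = np.array_split(np.arange(h), out_h)
    cols = np.array_split(np.arange(w), out_w)
    return np.array([[fmap[np.ix_(rs, cs)].max() for cs in cols] for rs in rows])
```

Because the pooling grid adapts to the size of the feature map, two texts of different lengths still yield a fixed-size representation, which is what lets the later fully connected layers operate on variable-length inputs.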