Procedural Text Understanding via Scene-Wise Evolution

Jialong Tang 1,3, Hongyu Lin 1, Meng Liao 4,*, Yaojie Lu 1,3, Xianpei Han 1,2, Le Sun 1,2,*, Weijian Xie 4, Jin Xu 4
1 Chinese Information Processing Laboratory, 2 State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
3 University of Chinese Academy of Sciences, Beijing, China
4 Data Quality Team, WeChat, Tencent Inc., China
{jialong2019,hongyu,yaojie2017,xianpei,sunle}@iscas.ac.cn
{maricoliao,vikoxie,jinxxu}@tencent.com
* Corresponding authors.

Abstract

Procedural text understanding requires machines to reason about entity states within dynamic narratives. Current approaches are commonly entity-wise: they track each entity separately and predict the different states of each entity independently. Such an entity-wise paradigm does not consider the interactions between entities and their states. In this paper, we propose a new scene-wise paradigm for procedural text understanding, which jointly tracks the states of all entities in a scene-by-scene manner. Based on this paradigm, we propose the Scene Graph Reasoner (SGR), which introduces a series of dynamically evolving scene graphs to jointly formulate the evolution of entities, states, and their associations throughout the narrative. In this way, the deep interactions between all entities and states can be jointly captured and simultaneously derived from the scene graphs. Experiments show that SGR not only achieves new state-of-the-art performance but also significantly accelerates reasoning.

Introduction

Understanding how events will affect the world is the essence of intelligence (Henaff et al. 2017). Procedural text understanding, which aims to track the state changes (e.g., create, move, destroy) and locations (a span in the text) of entities throughout a whole procedure, is a representative task for estimating machine intelligence on this ability (Mishra et al. 2018). For example, in Figure 1 (a), given a narrative describing the procedure of photosynthesis, as well as a pre-specified entity "water", a procedural text understanding model is asked to predict the corresponding {state, location} sequences: {Move, root}, {Move, leaf}. Compared with conventional factoid-style reading comprehension tasks (Seo et al. 2017; Clark and Gardner 2018), procedural text understanding is more challenging because it requires modeling and reasoning about a dynamically changing world (Mishra et al. 2018; Bosselut et al. 2018).

Most approaches resolve the procedural text understanding task in an entity-wise paradigm, where each entity is tracked separately, and the state changes and locations of each entity are predicted independently.

[Figure 1: Comparison between the traditional entity-wise paradigm and the proposed scene-wise paradigm for procedural text understanding: (a) the entity-wise paradigm tracks each entity separately and predicts the state changes and locations of each entity independently; (b) the scene-wise paradigm jointly tracks the state changes and locations of all entities scene-by-scene.]
Along this line, as Figure 1 (a) shows, current procedural text understanding models mainly resort to hierarchical neural network architectures, which first encode each document-entity pair with a token-level encoder and then track the state changes and locations with two separate sentence-level trackers (Mishra et al. 2018; Du et al. 2019a,b; Tang, Feng, and Zhao 2020; Gupta and Durrett 2019b). More recently, the main research hotspot in this direction has been obtaining more effective document-entity representations by introducing graph-based architectures (Das et al. 2019; Zhong et al. 2020; Huang et al. 2021), pre-trained language models (Gupta and Durrett 2019a; Amini et al. 2020; Zhang et al. 2020) or external knowledge bases (Ribeiro et al. 2019; Tandon et al. 2018).

Unfortunately, the traditional entity-wise paradigm ignores the interactions between different entities in the same narrative, as well as the associations between the state changes and locations of a single entity. Specifically, the multiple entities mentioned in the same narrative are highly correlated with each other: for example, if we know that water and minerals will be combined into a mixture, we can confirm that they must be in the same location (leaf). Furthermore, the states and locations of an entity are highly associated: for example, if we know the location of water changes from root to leaf, we can easily predict that the state of water is Move. Besides, the states/locations at the current step depend on those at previous steps: for example, if we know that root and leaf are parts of a plant, we will tend to predict the location leaf after the location root. However, the current entity-wise paradigm is unable to exploit the above interactions and associations. In addition, reasoning over procedures entity-by-entity is time-intensive and inefficient. As a result, existing models are still far from decent procedural text understanding in terms of both accuracy and efficiency.

To this end, this paper proposes scene-wise procedural text understanding, a new paradigm that jointly tracks the state changes and locations of all entities scene-by-scene. Instead of following the entity-wise paradigm, we formulate the world described in the procedural text at different timesteps as a sequence of dynamically evolving scenes[1]. Figure 1 (b) illustrates the whole process of scene-wise procedural text understanding. Specifically, each scene contains concepts (e.g., entities, locations or elements from external knowledge) and their relations at the current timestep. As the narrative develops, the concepts and relations in the scene evolve dynamically scene-by-scene. In this way, the state changes and locations of all entities are jointly exploited and can be simultaneously derived from the scenes.

[1] In the procedural text understanding task, the division of timesteps is consistent with the division of sentences.

Based on this paradigm, we propose the Scene Graph Reasoner (SGR), a specific implementation of scene-wise procedural text understanding. SGR uses a graph structure to model each scene: each node in the graph represents a concept, and each edge represents a relation between two concepts. Scene evolution is then modeled as graph evolution throughout the whole procedure.
Specifically, SGR consists of four basic components: 1) a graph structure encoder, which summarizes critical information from the current scene graph; 2) a context encoder, which captures the new events occurring in the sentence describing the next narrative timestep; 3) a graph structure predictor, which predicts the evolution of the scene graph after the new events occur; and 4) a state reasoner, which distills the state changes and locations by comparing adjacent scene graphs. By jointly exploiting all concepts and their relations in the scene graphs, SGR can better capture their interactions and associations throughout the whole procedure, and therefore is able to track the state changes and locations of all entities simultaneously in a graph evolution process.

Generally, the main contributions of this paper are:
- We propose a new scene-wise paradigm for procedural text understanding, which jointly tracks the state changes and locations of all entities scene-by-scene.
- We design a specific implementation, SGR, for scene-wise procedural text understanding, which can fully consider the interactions among multiple entities, as well as the associations between state changes and locations.
- We conduct experiments on ProPara (Mishra et al. 2018) and Recipes (Bosselut et al. 2018), two representative procedural text understanding benchmarks. Experiments show that SGR in the scene-wise paradigm achieves new state-of-the-art procedural text understanding performance with significantly accelerated reasoning.

Background

Task Definition

In this paper, we focus on ProPara (Mishra et al. 2018), which includes a variety of natural procedures, and whose task is to answer questions about the state changes and locations of entities. Specifically, given:
- a paragraph $P$ consisting of $T$ sentences $\{S_1, S_2, ..., S_T\}$;
- a set of pre-specified entities $E = \{e_1, e_2, ..., e_N\}$ that need to be tracked;
the procedural text understanding model is required to reason about the described world and output:
- state change sequences $Y^s = \{Y^s_{e_1}, Y^s_{e_2}, ..., Y^s_{e_N}\}$ for all pre-specified entities $E$, where $Y^s_{e_i} = \{y^s_{e_i,1}, y^s_{e_i,2}, ..., y^s_{e_i,T}\}$ and $y^s_{e_i,t} \in$ {Other (O), Exist (E), Move (M), Create (C), Destroy (D)}[2];
- location sequences $Y^l = \{Y^l_{e_1}, Y^l_{e_2}, ..., Y^l_{e_N}\}$ for all pre-specified entities $E$, where $Y^l_{e_i} = \{y^l_{e_i,1}, y^l_{e_i,2}, ..., y^l_{e_i,T}\}$ and $y^l_{e_i,t}$ is a text span in the paragraph. A special "?" token indicates that the location is unknown.

Entity Recognition and Location Candidate Generation

In procedural text understanding, identifying entities is necessary because they are the participants of the narrative. We first use spaCy to tokenize the paragraph and all entities; all text is cleaned and lower-cased, and a simple string-matching algorithm is then used to recognize entities. Unlike entities, location information is not given initially in this task. Since it is impractical to consider arbitrary text spans as possible locations, we follow previous works (Gupta and Durrett 2019b; Zhang et al. 2020) to generate candidates, transforming the original text-span extraction into candidate classification for location tracking. Specifically, we first extract POS tags with flair (Akbik et al. 2019), and then generate location candidates with POS-based rules[3].

[2] Other (O) is further divided into O_A and O_B, which denote the "none" state before and after existence, respectively.
[3] See details at https://github.com/ytyz1307zzh/NCET-ProPara.
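For concreteness, here is a minimal sketch of location-candidate generation. It substitutes spaCy's noun-chunk heuristic for the exact flair-based POS rules referenced in footnote [3], so the function `location_candidates` and its behavior are illustrative assumptions rather than the pipeline used in the paper:

```python
import spacy

# A small English pipeline; any spaCy model with a tagger/parser works here.
nlp = spacy.load("en_core_web_sm")

def location_candidates(paragraph: str) -> list:
    """Generate location candidates from a paragraph.

    A simplified stand-in for the POS-based rules cited above: we take
    lower-cased noun chunks (minus leading determiners) as candidate
    location spans, plus the special '?' token for unknown locations.
    """
    doc = nlp(paragraph.lower())
    candidates = {"?"}
    for chunk in doc.noun_chunks:
        tokens = [tok.text for tok in chunk if tok.pos_ != "DET"]
        if tokens:
            candidates.add(" ".join(tokens))
    return sorted(candidates)

print(location_candidates("The roots absorb water and minerals from the soil."))
# e.g. ['?', 'minerals', 'roots', 'soil', 'water']
```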
For the train and dev sets, if the gold location is not included in the candidates, we manually add it to the candidate set. This mainly serves to expand the number of trainable instances for location prediction. For the test set, we do not use this method, because we obviously cannot access the gold locations at test time.

[Figure 2: An overview of the proposed SGR in the scene-wise paradigm, which is composed of four parts: (a) graph structure encoder; (b) context encoder; (c) graph structure predictor; and (d) state reasoner.]

Scene Graph Reasoner

In this section, we describe how to train an effective procedural text understanding model in the scene-wise paradigm and track the state changes and locations of all entities simultaneously. As illustrated in Figure 2, we propose the Scene Graph Reasoner (SGR), a specific implementation of scene-wise procedural text understanding. SGR constructs scene graphs for each training instance. To evolve the scene graphs, SGR first summarizes critical information from the current scene graph with a graph structure encoder, captures the new events occurring in the sentence describing the next narrative timestep with a context encoder, and then predicts the evolution of the scene graph after the new events with a graph structure predictor. During testing, SGR utilizes a state reasoner to simultaneously distill the state changes and locations of all entities by comparing adjacent scene graphs.

Scene Graph Construction for Training

For each training instance, we transform the original gold state change and location annotations $\{Y^s, Y^l\}$ into a sequence of scene graphs $Y^g = \{y^g_1, y^g_2, ..., y^g_t, ..., y^g_T\}$ to fit the proposed scene-wise paradigm. Each node in a scene graph represents a concept (an entity, a location, or an element from external knowledge), and each edge represents a predefined relation between two concepts. As the narrative develops, the nodes and edges in the scene graphs are dynamically created or deleted. Enlightened by Skardinga, Gabrys, and Musial (2021), we utilize a complete graph with two matrices to record the dynamics of nodes and edges, so each scene graph can be represented as $y^g_t = \{\hat{G}, \text{Mask}_t, \text{Rel}_t\}$: $\hat{G}$ is the complete graph, $\text{Mask}_t$ records node masking, and $\text{Rel}_t$ records edge indication at time $t$. Consequently, the model training objective is reformulated as predicting these scene graphs.

Specifically, SGR first uses the recognized entities and the generated location candidates as nodes[4], and uses three SRL-based relations (entity-entity, location-location and entity-location) as edges to construct the complete graph. Then SGR enhances the complete graph with external commonsense knowledge from ConceptNet (Speer, Chin, and Havasi 2017), because it provides abundant concept relations and helps the model understand the composition of the world[5]. Finally, SGR generates the node mask $\text{Mask}_t$ and the edge indication $\text{Rel}_t$ for each scene graph $y^g_t$, where $\text{Mask}_t \in \mathbb{R}^M$ masks the entities that are not yet created or are already destroyed, $\text{Rel}_t \in \mathbb{R}^{M \times M \times R}$ indicates the different relations whose arguments are not masked, $M$ is the number of concepts, and $R$ is the number of relations. In this way, the state changes and locations are jointly modeled in the scene graphs: state changes such as Exist, Create and Destroy are recorded by $\text{Mask}_t$, and the location of each entity is recorded by $\text{Rel}_t$.

[4] It is worth noticing that we do not treat events/actions as a kind of node as Huang et al. (2021) does, because events are more suitable for evolving scene graphs.
[5] We retrieve and add the corresponding entities and relations (e.g., HasA, PartOf, and so on) into the complete graphs using the same heuristic rules as Zhang et al. (2020).
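To make the representation $y^g_t = \{\hat{G}, \text{Mask}_t, \text{Rel}_t\}$ concrete, the following is a minimal sketch of a per-timestep scene-graph state; the class name, field layout and numpy encoding are our own illustration, not the authors' code:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneGraph:
    """One scene y^g_t = (G_hat, Mask_t, Rel_t) over a fixed concept set.

    The complete graph G_hat (the concept vocabulary and its possible
    edges) is shared across timesteps; only the node mask and the edge
    indication evolve as the narrative develops."""
    concepts: list      # M concept names: entities, location candidates, ConceptNet nodes
    mask: np.ndarray    # (M,): 1 if the concept is masked (not yet created / already destroyed)
    rel: np.ndarray     # (M, M, R): relation indication between unmasked concepts

    def neighbours(self, i: int) -> list:
        """Nodes linked to i by any relation whose arguments are unmasked."""
        if self.mask[i]:
            return []
        linked = self.rel[i].any(axis=-1)  # (M,): any relation type present
        return [j for j in range(len(self.concepts))
                if linked[j] and not self.mask[j]]

# The initial scene at t = 0 uses all-zero mask and relations,
# i.e. "we know nothing about the world yet" (cf. Algorithm 1, Line 4).
M, R = 8, 3
g0 = SceneGraph(concepts=[f"c{i}" for i in range(M)],
                mask=np.zeros(M, dtype=np.int64),
                rel=np.zeros((M, M, R), dtype=np.int64))
```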
Graph Structure Encoder for Summarizing Scenes

At timestep $t$, SGR adopts a graph attention network (GAT) (Velickovic et al. 2018) to summarize critical information from the current scene graph $y^g_t$, owing to its strong representation capacity. In this way, the entities and their states/locations are jointly modeled through the rich types of relations among different concepts.

Specifically, the input to the graph attention network is a set of node features $H = \{\hat{h}^{s,t}_1, \hat{h}^{s,t}_2, ..., \hat{h}^{s,t}_M\}$ at timestep $t$, where $M$ is the number of concepts[6]. SGR then performs self-attention over the nodes: a shared masked attention mechanism computes attention coefficients

$$e^t_{ij} = a(W_1 \hat{h}^{s,t}_i, W_1 \hat{h}^{s,t}_j, W_2 \text{Rel}^t_{ij}) \quad (1)$$

that indicate the importance of node $j$ to node $i$, where $\text{Rel}^t_{ij}$ is the relation embedding and $W_1, W_2$ are learnable parameters. To make the coefficients easily comparable across different nodes, we normalize them across all choices of $j$ using the softmax function:

$$\alpha^t_{ij} = \mathrm{softmax}_j(e^t_{ij}) = \frac{\exp(e^t_{ij})}{\sum_{k \in \mathcal{N}^t_i} \exp(e^t_{ik})} \quad (2)$$

where $\mathcal{N}^t_i$ is the neighborhood of node $i$ in the scene graph at timestep $t$, determined by the complete graph $\hat{G}$, the node mask $\text{Mask}_t$ and the edge indication $\text{Rel}_t$. Once obtained, the normalized attention coefficients are used to compute a linear combination of the corresponding features, which serves as the final feature of each node at timestep $t$:

$$h^{s,t}_i = \sigma\Big(\sum_{j \in \mathcal{N}^t_i} \alpha^t_{ij} \hat{h}^{s,t}_j\Big) \quad (3)$$

Finally, we take the hidden state corresponding to the special [Global] node as the graph structure representation:

$$h^{s,t}_{[\text{Global}]} = \mathrm{GAT}(y^g_t) = \mathrm{GAT}(\{\hat{G}, \text{Mask}_t, \text{Rel}_t\}) \quad (4)$$

where $h^{s,t}_{[\text{Global}]}$ summarizes critical information from the current scene graph before new events occur. The scene graph structure and the enhanced external knowledge can thus be fully exploited by the graph structure encoder.

[6] At timestep 0, node features are initialized by the context encoder.
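The following PyTorch sketch instantiates the masked, relation-aware attention of Eqs. (1)-(3) with a single head; the additive scorer $a(\cdot)$ and the layer dimensions are assumptions of ours, since the paper leaves them unspecified:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAwareGATLayer(nn.Module):
    """A minimal single-head sketch of the masked, relation-aware
    attention in Eqs. (1)-(3). The additive scorer a(.) and all layer
    sizes are illustrative assumptions, not the paper's exact design."""

    def __init__(self, dim: int, rel_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, dim, bias=False)      # W1: node transform
        self.w2 = nn.Linear(rel_dim, dim, bias=False)  # W2: relation transform
        self.scorer = nn.Linear(3 * dim, 1)            # additive a(.) over [h_i; h_j; rel_ij]

    def forward(self, h: torch.Tensor, rel_emb: torch.Tensor, adj: torch.Tensor):
        # h: (M, dim) node features; rel_emb: (M, M, rel_dim) relation embeddings;
        # adj: (M, M) 0/1 neighbourhoods N_i^t derived from (G_hat, Mask_t, Rel_t).
        # We assume adj contains self-loops so every row has >= 1 neighbour.
        M = h.size(0)
        hi = self.w1(h).unsqueeze(1).expand(M, M, -1)  # W1 h_i, broadcast over j
        hj = self.w1(h).unsqueeze(0).expand(M, M, -1)  # W1 h_j, broadcast over i
        e = self.scorer(torch.cat([hi, hj, self.w2(rel_emb)], dim=-1)).squeeze(-1)  # Eq. (1)
        e = e.masked_fill(adj == 0, float("-inf"))     # keep only j in N_i^t
        alpha = F.softmax(e, dim=-1)                   # Eq. (2)
        return torch.sigmoid(alpha @ h)                # Eq. (3), with sigma = sigmoid
```

Eq. (4) then simply reads off the output row of the special [Global] node as the graph structure representation.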
Context Encoder for Capturing New Events

Existing procedural text understanding models also employ a context encoder to obtain document-entity representations (Mishra et al. 2018; Du et al. 2019a,b; Tang, Feng, and Zhao 2020; Gupta and Durrett 2019b). However, we leverage the context encoder differently: we utilize it to capture the new events occurring in the sentence at the next timestep $t+1$. In this paper, we use BERT (Devlin et al. 2019) to handle the nuances of procedural texts. As suggested by Gupta and Durrett (2019a), we restructure the input to guide the transformer model (Vaswani et al. 2017) to focus on the particular entities mentioned in the sentence. Specifically, taking $S_3$ in Figure 2 as an example, we restructure the input as {[CLS] water [SEP] minerals [SEP] This combination of water and minerals flows from the stem into the leaf . [SEP]}, where [CLS] and [SEP] are special tokens. In this way, the transformer can always observe the entities it should primarily attend to when building representations.

For each token in the input, its representation is constructed by concatenating the corresponding token and position embeddings. The context representation is then fed into the BERT architecture (Devlin et al. 2019) and updated by multi-layer Transformer blocks (Vaswani et al. 2017). Finally, we take the hidden state corresponding to the special [CLS] token in the last layer as the context representation:

$$h^{c,t+1}_{[\text{CLS}]} = \mathrm{BERT}(\mathrm{restructure}(S_{t+1})) \quad (5)$$

where $h^{c,t+1}_{[\text{CLS}]}$ captures the new events occurring in the sentence $S_{t+1}$ at timestep $t+1$. The self-attention mechanism of BERT supports the interactions among the multiple relationships between the entities, locations and events mentioned in $S_{t+1}$, and it can take advantage of the knowledge learned via pre-training.

Graph Structure Predictor for Evolving Scenes

Based on the structure representation $h^{s,t}_{[\text{Global}]}$ and the context representation $h^{c,t+1}_{[\text{CLS}]}$, we can predict the new scene graph structure at timestep $t+1$. Specifically, we generate the new scene graph $y^g_{t+1}$ by predicting the node mask $\text{Mask}_{t+1}$ and the edge indication $\text{Rel}_{t+1}$ under the guidance of the aggregate representation:

$$\widehat{\text{Mask}}_{t+1} = f_1(h^{s,t}_{[\text{Global}]}, h^{c,t+1}_{[\text{CLS}]}), \quad \widehat{\text{Rel}}_{t+1} = f_2(h^{s,t}_{[\text{Global}]}, h^{c,t+1}_{[\text{CLS}]}) \quad (6)$$

where $\widehat{\text{Mask}}_{t+1} \in \mathbb{R}^M$, $\widehat{\text{Rel}}_{t+1} \in \mathbb{R}^{M \times M \times R}$, and $f_1, f_2$ are two nonlinear output layers. A sequence of scene graphs can then be generated autoregressively:

$$y^g_{t+1} = \mathrm{SGR}(y^g_t, S_{t+1}) \quad (7)$$

Model Training

Given a training corpus with constructed scene graphs (see Section "Scene Graph Construction for Training"), SGR can be learned with supervision by maximum log-likelihood estimation (MLE):

$$\mathcal{L} = -\sum_{t=1}^{T} \Big( \sum_{i=1}^{M} \text{Mask}^t_i \log \widehat{\text{Mask}}^t_i + \sum_{i=1}^{M} \sum_{j=1}^{M} \sum_{k=1}^{R} \text{Rel}^t_{ijk} \log \widehat{\text{Rel}}^t_{ijk} \Big)$$

where $T$ is the number of timesteps, $M$ is the number of concepts (including entities, location candidates and elements from external knowledge), and $R$ is the number of relations.
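As a rough sketch of Eq. (6) and the MLE objective above, one can realize $f_1, f_2$ as two small MLP heads, score the node mask with a binary cross-entropy, and score the edge indication with an $R$-way cross-entropy; all shapes and head designs below are our assumptions, since the paper only calls $f_1, f_2$ "two nonlinear output layers":

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphStructurePredictor(nn.Module):
    """A sketch of Eq. (6): predict Mask_{t+1} and Rel_{t+1} from the
    graph summary and the context summary. The MLP heads below are an
    illustrative guess at the paper's 'two nonlinear output layers'."""

    def __init__(self, dim: int, M: int, R: int):
        super().__init__()
        self.M, self.R = M, R
        self.f1 = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, M))
        self.f2 = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, M * M * R))

    def forward(self, h_graph: torch.Tensor, h_ctx: torch.Tensor):
        z = torch.cat([h_graph, h_ctx], dim=-1)               # aggregate representation
        mask_logits = self.f1(z)                              # (M,)    -> Mask_hat_{t+1}
        rel_logits = self.f2(z).view(self.M, self.M, self.R)  # (M,M,R) -> Rel_hat_{t+1}
        return mask_logits, rel_logits

def step_loss(mask_logits, rel_logits, gold_mask, gold_rel):
    """One timestep of the MLE objective: binary cross-entropy over node
    masks plus cross-entropy over edge indications. gold_rel holds one
    relation class index per concept pair, a simplification of the
    one-hot Rel_t tensor used in the paper."""
    l_mask = F.binary_cross_entropy_with_logits(mask_logits, gold_mask)
    l_rel = F.cross_entropy(rel_logits.view(-1, rel_logits.size(-1)), gold_rel.view(-1))
    return l_mask + l_rel
```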
State Reasoner for Tracking All Entities

During testing, on the basis of the trained SGR model, we can construct scene graphs for new narratives and then simultaneously track the state changes and locations of all entities scene-by-scene. Specifically, we can infer the state change and location sequences by comparing adjacent scene graphs. For example, if water is not masked in scene $y^g_{t-1}$ but masked in scene $y^g_t$, we can deterministically infer that the state of water is Destroy (D) at timestep $t$; if water has a Locate-In relation with root, the current location of water must be root. It is worth noticing that these transformations can be processed for all entities in parallel.

To facilitate the description of the state reasoner, we summarize this process in Algorithm 1.

Algorithm 1: State Reasoner
Input: SGR, the trained procedural text understanding model; ConceptNet, the external commonsense knowledge base; P = {S_1, S_2, ..., S_T}, the procedural text; E = {e_1, e_2, ..., e_N}, the pre-specified entities; Constraints, used for post-processing.
1:  G <- Construct(P, E)                              // the complete graph
2:  G_hat <- Enhance(G, ConceptNet)                   // the enhanced complete graph
3:  Y^g <- {}
4:  y^g_0 <- (G_hat, Mask_0, Rel_0) = (G_hat, 0, 0)
5:  Y^g.append(y^g_0)
6:  for S_{t+1} in P do
7:      h^{s,t}_[Global] <- GraphStructureEncoder(y^g_t)
8:      h^{c,t+1}_[CLS] <- ContextEncoder(S_{t+1})
9:      y^g_{t+1} = (G_hat, Mask_hat_{t+1}, Rel_hat_{t+1}) <- GraphStructurePredictor(h^{s,t}_[Global], h^{c,t+1}_[CLS])
10:     Y^g.append(y^g_{t+1})
11: (Y^s, Y^l) <- Transform(G_hat, Y^g, Constraints)
Return: Y^s, Y^l

First, we preprocess the raw input $\{P, E\}$ to construct the complete graph, and then utilize the external commonsense knowledge of ConceptNet (Speer, Chin, and Havasi 2017) to obtain the enriched complete graph (Lines 1-2). Second, we initialize the scene graph sequence $Y^g$ as empty (Line 3). During testing we cannot access the gold state change and location annotations, so we initialize $\text{Mask}_0$ and $\text{Rel}_0$ as zeros, meaning that we know nothing about the world at timestep 0 (Line 4). After these preparations, we use the trained SGR to evolve the scene graphs from $y^g_0$ to $y^g_T$ through graph structure encoding, context encoding and graph structure prediction (Lines 6-10). Finally, we transform the scene graphs $Y^g$ into the state change sequences $Y^s$ and the location sequences $Y^l$ for all pre-specified entities (Line 11). The constraints used in previous works can be easily injected into this final transformation, e.g., correcting invalid actions according to the whole action sequence (Tang, Feng, and Zhao 2020). In this way, the state change and location sequences of all entities are tracked simultaneously, and the efficiency of reasoning is significantly improved.
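A sketch of the deterministic read-out described above: comparing an entity's mask bit and Locate-In target in adjacent scene graphs yields its state label. The rule set is our reading of the examples in this section (and it collapses O_A/O_B into a single O), not the authors' exact post-processing:

```python
def infer_state(masked_before: bool, masked_after: bool,
                loc_before: str, loc_after: str) -> str:
    """Read off one entity's state change by comparing scenes y^g_{t-1} and y^g_t.

    An entity is 'masked' when it is not yet created or already destroyed;
    locations come from the entity's Locate-In edge in each scene graph.
    (O_A/O_B are collapsed into a single 'O' here for brevity.)
    """
    if masked_before and not masked_after:
        return "C"   # Create: the node appears in the scene
    if not masked_before and masked_after:
        return "D"   # Destroy: the node leaves the scene
    if masked_before and masked_after:
        return "O"   # Other: outside the entity's existence span
    if loc_before != loc_after:
        return "M"   # Move: the Locate-In edge changed target
    return "E"       # Exist: present, location unchanged

# e.g. water unmasked in both scenes, location edge root -> leaf:
assert infer_state(False, False, "root", "leaf") == "M"
```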
Statistics         ProPara                 Recipes
                   train    dev    test    train    dev    test
#Instance/Para     391      43     54      693      86     87
#Sentence          2,620    288    372     6,098    765    783
#Entity            1,504    175    236     5,932    756    737
Avg. #sent/para    6.7      6.7    6.9     8.8      8.9    9.0
Avg. #enti/para    3.8      4.1    4.4     8.6      8.8    8.5

Table 1: Statistics of the ProPara and Recipes datasets. We regard one paragraph with all its pre-specified entities as an instance; thus, the number of instances equals the number of paragraphs.

Experiments

Experimental Settings

Dataset. We conduct our main experiments on ProPara (Mishra et al. 2018) and auxiliary experiments on Recipes (Bosselut et al. 2018). For ProPara, we follow the official train/dev/test split (Mishra et al. 2018). For Recipes, following previous works (Zhang et al. 2020; Huang et al. 2021), we only use the human-labeled data in our experiments and re-split it into 80%/10%/10% train/dev/test sets. More statistics of the two datasets are shown in Table 1[7].

Implementation Details. For the graph structure encoder, we apply a one-layer graph attention network (GAT) (Velickovic et al. 2018). For the context encoder, we use BERT-base as implemented in Hugging Face's Transformers library (Wolf et al. 2020). Hyper-parameters are manually tuned according to accuracy on the dev set: the batch size is 16, the hidden size is 128, and the learning rate is 5e-5. The final model is trained on an Nvidia TITAN RTX GPU with the Adam optimizer (Kingma and Ba 2015), and is selected by the highest prediction accuracy on the dev set.

Evaluation Metrics. For ProPara, following previous works, we perform the document-level (Tandon et al. 2018) and sentence-level (Mishra et al. 2018) tasks in our main experiments. Specifically, the document-level task requires models to answer four document-level questions:
- Q1: What are the inputs to the procedure?
- Q2: What are the outputs of the procedure?
- Q3: What conversions occur, when and where?
- Q4: What movements occur, when and where?
The evaluator computes precision, recall and F1 score for each question, and the overall F1 score is the macro-average over the four questions[8]. The sentence-level task requires models to answer ten fine-grained sentence-level questions, which can be summarized into three categories:
- Cat-1: Is the entity created (destroyed, moved)?
- Cat-2: When is the entity created (destroyed, moved)?
- Cat-3: Where is the entity created (destroyed, moved from/to)?
Evaluation metrics are the macro-average and micro-average accuracy over the three sets of questions; more details can be found in the official evaluation script[9]. The answers to both document- and sentence-level questions can be deterministically computed from the state change and location sequences. For Recipes, we follow Zhang et al. (2020) and Huang et al. (2021) to predict the location changes of the ingredients during the procedure.

[7] The number of instances differs from previous entity-wise works because they regard one entity-paragraph pair as an instance, resulting in 1.9k instances.
[8] https://github.com/allenai/aristo-leaderboard/tree/master/ProPara
[9] https://github.com/allenai/ProPara/tree/master/ProPara/evaluation
For each movement, the model should predict the new location of the entity, plus the timestep at which the movement occurs. We use precision, recall and F1 scores to evaluate models.

Baselines. For ProPara, we compare SGR with the following baselines, most of which are on the official leaderboard[10]:
- Models with a context encoder rely on BiLSTM/BERT to obtain document-entity representations and track states/locations separately.
- Models with a structure encoder leverage static graphs to obtain more effective document representations. Different from our work, they lack dynamic representations of the procedures.
For Recipes, we compare SGR with the state-of-the-art models NCET (Tang, Feng, and Zhao 2020), KOALA (Zhang et al. 2020) and REAL (Huang et al. 2021).

[10] https://leaderboard.allenai.org/ProPara/submissions/public

Models                                   Document-level task     Sentence-level task
                                         Prec.   Rec.    F1      Cat-1   Cat-2   Cat-3   Macro-Avg   Micro-Avg
Entity-wise models (context encoder)
EntNet (Henaff et al. 2017)              54.7    30.7    39.4    51.6    18.8    7.8     26.1        26.0
QRN (Seo et al. 2017)                    60.9    31.1    41.4    52.4    15.5    10.9    26.3        26.5
ProLocal (Mishra et al. 2018)            81.7    36.8    50.7    62.7    30.5    10.4    34.5        34.0
ProGlobal (Mishra et al. 2018)           48.8    61.7    51.9    63.0    36.4    35.9    45.1        45.4
AQA (Ribeiro et al. 2019)                62.0    45.1    52.3    61.6    40.1    18.6    39.4        40.1
ProStruct (Tandon et al. 2018)           74.3    43.0    54.5    -       -       -       -           -
XPAD (Du et al. 2019a)                   70.5    45.3    55.2    -       -       -       -           -
LACE (Du et al. 2019b)                   75.3    45.4    56.6    -       -       -       -           -
NCET (Gupta and Durrett 2019b)           67.1    58.5    62.5    73.7    47.1    41.0    53.9        54.0
ET_BERT (Gupta and Durrett 2019a)        -       -       -       73.6    52.6    -       -           -
IEN (Tang, Feng, and Zhao 2020)          69.8    56.3    62.3    71.8    47.6    40.5    53.3        53.0
DYNAPRO (Amini et al. 2020)              75.2    58.0    65.5    72.4    49.3    44.5    55.4        55.5
KOALA (Zhang et al. 2020)                77.7    64.4    70.4    78.5    53.3    41.3    57.7        57.5
Entity-wise models (structure encoder)
KG-MRC (Das et al. 2019)                 69.3    49.3    57.6    62.9    40.0    38.2    47.0        46.6
ProGraph (Zhong et al. 2020)             67.3    55.8    61.0    67.8    44.6    41.8    51.4        51.5
TSLM (Faghihi and Kordjamshidi 2021)     68.4    68.9    68.6    78.8    56.8    40.9    58.8        58.3
REAL (Huang et al. 2021)                 81.9    61.9    70.5    78.4    53.7    42.4    58.2        57.9
Scene-wise models
SGR (our method)                         84.9    62.9    72.2    79.9    55.1    43.5    59.5        59.2
  w/o Graph Structure Encoder            72.4    51.1    59.9    69.9    42.7    39.9    50.8        51.0
  w/o Context Encoder                    76.1    55.4    64.1    74.9    47.9    40.0    54.3        54.2
  w/o ConceptNet                         82.7    63.2    71.6    78.3    56.0    42.5    58.9        58.6
  w/o Pre-trained BERT                   81.8    59.2    68.7    76.2    53.3    41.4    57.0        56.7

Table 2: Experimental results on the ProPara document- and sentence-level tasks. Markers in the original table indicate models that consider the interactions between multiple entities, use an external knowledge base, or are equipped with a pre-trained model, respectively.

Models                       Precision   Recall   F1
NCET (re-implementation)     56.5        46.4     50.9
IEN (re-implementation)      58.5        47.0     52.2
KOALA (Zhang et al. 2020)    60.1        52.6     56.1
REAL (Huang et al. 2021)     55.2        52.9     54.1
SGR (our method)             69.3        50.5     58.4

Table 3: Experimental results on the re-split Recipes dataset.

Overall Results

Tables 2, 3 and 4 show the overall results. We can see that:

1. The proposed SGR in the scene-wise paradigm achieves state-of-the-art performance. SGR significantly outperforms the previous state-of-the-art model REAL, achieving 72.2 F1 on the ProPara document-level task and 59.5/59.2 Macro-Avg/Micro-Avg scores on the ProPara sentence-level task. On Recipes, SGR also outperforms the corresponding baselines, achieving 58.4 F1. We believe this is because the entities and their states/locations are jointly modeled in the scenes, so the association between the two tracking targets and the interactions among multiple entities are fully exploited.

2. Reasoning over the states and locations of all entities scene-by-scene significantly improves inference efficiency. We compare the inference time of SGR with NCET and IEN on an Nvidia TITAN RTX GPU in Table 4. NCET is a traditional model in the entity-wise paradigm, while IEN considers the interactions between multiple entities via an entity-location attention mechanism[11]. However, both NCET and IEN rely on separate trackers with CRFs to predict state changes and locations; they are therefore time-intensive and take almost 3-4 times as long as SGR to track each entity. These results verify that scene-by-scene reasoning is efficient and promotes the application of procedural text understanding models in real-world settings.

Models                       Total    Avg./para   Avg./enti
NCET (re-implementation)     51.31    0.95        0.22
IEN (re-implementation)      42.50    0.79        0.18
SGR (our method)             17.99    0.33        0.08

Table 4: Inference time (seconds) on ProPara.

[11] Other state-of-the-art models need more reasoning time than NCET and IEN due to their elaborately designed architectures.

3. The graph structure encoder and the context encoder are indispensable and complementary. Compared with the full SGR model, the two variants SGR w/o Graph Structure Encoder and SGR w/o Context Encoder show performance declines of different degrees, indicating that both the current scene modeled by the graph structure encoder and the new events captured by the context encoder are necessary. Surprisingly, we find that SGR w/o Context Encoder still performs quite well. A possible explanation is that this variant can be regarded as a graph-structured language model that predicts snapshots autoregressively and builds label consistency within the same topic (Du et al. 2019b).
Detailed Analysis

External knowledge can be easily injected into SGR. To investigate the effectiveness of different kinds of external knowledge, we design the following experiments.

Effects of the External Knowledge from the Pre-trained Language Model. From Table 2, we can see that the models marked as using pre-trained models outperform the other baselines. When compared with the full model, the performance of SGR w/o Pre-trained BERT falls between SGR and SGR w/o Context Encoder. This means that not only the context encoder itself but also the external knowledge learned via pre-training is helpful for procedural text understanding.

Effects of the External Knowledge from the Knowledge Base. First of all, from Table 2, we can see that the models marked as using external knowledge bases outperform the other baselines, and the decay of SGR w/o ConceptNet relative to the full SGR is also appreciable. These results verify the effectiveness of external knowledge from knowledge bases. Furthermore, we investigate the usage of ConceptNet in Table 5. We can see that: 1) compared with SGR(none), SGR and SGR(test) lead to improvements; the reason is that ConceptNet can constrain the prediction space of the graph structure predictor and help the model understand the composition of the world. 2) SGR(train) performs even worse than SGR(none), because the inconsistency between training and testing introduces too much noise rather than knowledge; in other words, the exposure bias of the autoregressive behavior hurts the performance of the model (Zhang et al. 2019).

             ConceptNet usage     Document-level task
             Train     Test       Precision   Recall   F1
SGR          yes       yes        84.9        62.9     72.2
SGR(test)    no        yes        83.7        62.9     71.8
SGR(train)   yes       no         81.5        60.4     69.4
SGR(none)    no        no         82.7        63.2     71.6

Table 5: Effect of the usage of ConceptNet on ProPara (the train/test columns are inferred from the variant names). All improvements of SGR are statistically significant at p < 0.01.

Related Work

Procedural text understanding is important and challenging. Many datasets have been proposed, such as bAbI (Henaff et al. 2017), RECIPES (Kiddon et al. 2015) and ProPara (Mishra et al. 2018). Blessed with these valuable benchmarks, abundant procedural text understanding models have emerged, built on either the question-answering framework (Henaff et al. 2017; Seo et al. 2017; Das et al. 2019) or the hierarchical neural network framework (Mishra et al. 2018; Tandon et al. 2018; Du et al. 2019b; Gupta and Durrett 2019b,a; Zhang et al. 2020; Zhong et al. 2020). Some of them utilize graph encoders to obtain more effective document-entity representations (Huang et al. 2021), and almost all of them follow the entity-wise paradigm. Different from them, this paper proposes a new scene-wise paradigm that jointly tracks the state changes and locations of all entities scene-by-scene.

Dynamic Graph Neural Networks (DGNNs) are used in a wide range of fields, including social network analysis, recommender systems and epidemiology (Yin et al. 2019; Skardinga, Gabrys, and Musial 2021). DGNNs add a new dimension, time, to network modeling and prediction. This new dimension radically influences network properties, enables a more powerful representation of networks, and increases the predictive capabilities of such methods (Aggarwal and Subbian 2014; Li et al. 2018). In this paper, we utilize a graph structure to model each scene, and propose an evaluation algorithm adapted to the proposed scene-wise paradigm.

Conclusions

In this paper, we propose a new scene-wise paradigm for procedural text understanding, and design the Scene Graph Reasoner (SGR) to jointly model the associations between state changes and locations, as well as the interactions among multiple entities. In this way, the state changes and locations of all entities are jointly exploited and can be simultaneously derived from the scene graphs.
Experiments show that SGR achieves new state-of-the-art procedural text understanding performance while significantly accelerating reasoning. For future work, we plan to pretrain a graph-structured language model to build label consistency within the same topic (Du et al. 2019b), and to design new training and reasoning methods that overcome the exposure bias of the autoregressive behavior (Zhang et al. 2019).

Acknowledgments

We sincerely thank the reviewers for their insightful comments and valuable suggestions. This research work is supported by the National Key R&D Program of China (2020AAA0105200), the National Natural Science Foundation of China under Grants no. U1936207, 62122077, 62106251 and 61772505, the Beijing Academy of Artificial Intelligence (BAAI), and in part by the Youth Innovation Promotion Association CAS (2018141).

References

Aggarwal, C.; and Subbian, K. 2014. Evolutionary Network Analysis: A Survey. ACM Computing Surveys (CSUR).

Akbik, A.; Bergmann, T.; Blythe, D.; Rasul, K.; Schweter, S.; and Vollgraf, R. 2019. FLAIR: An Easy-to-use Framework for State-of-the-art NLP. In Proceedings of NAACL 2019.

Amini, A.; Bosselut, A.; Mishra, B. D.; Choi, Y.; and Hajishirzi, H. 2020. Procedural Reading Comprehension with Attribute-Aware Context Flow. In Proceedings of AKBC 2020.

Bosselut, A.; Levy, O.; Holtzman, A.; Ennis, C.; Fox, D.; and Choi, Y. 2018. Simulating Action Dynamics with Neural Process Networks. In Proceedings of ICLR 2018.

Clark, C.; and Gardner, M. 2018. Simple and Effective Multi-Paragraph Reading Comprehension. In Proceedings of ACL 2018.

Das, R.; Munkhdalai, T.; Yuan, X.; Trischler, A.; and McCallum, A. 2019. Building Dynamic Knowledge Graphs from Text using Machine Reading Comprehension. In Proceedings of ICLR 2019.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL 2019.

Du, X.; Mishra, B. D.; Tandon, N.; Bosselut, A.; Yih, W.-t.; and Clark, P. 2019a. Everything Happens for a Reason: Discovering the Purpose of Actions in Procedural Text. In Proceedings of EMNLP 2019.

Du, X.; Mishra, B. D.; Tandon, N.; Bosselut, A.; Yih, W.-t.; Clark, P.; and Cardie, C. 2019b. Be Consistent! Improving Procedural Text Comprehension using Label Consistency. In Proceedings of NAACL 2019.

Faghihi, H. R.; and Kordjamshidi, P. 2021. Time-Stamped Language Model: Teaching Language Models to Understand the Flow of Events. In Proceedings of NAACL 2021.

Gupta, A.; and Durrett, G. 2019a. Effective Use of Transformer Networks for Entity Tracking. In Proceedings of EMNLP 2019.

Gupta, A.; and Durrett, G. 2019b. Tracking Discrete and Continuous Entity State for Process Understanding. In Proceedings of the NAACL 2019 Workshop.

Henaff, M.; Weston, J.; Szlam, A.; Bordes, A.; and LeCun, Y. 2017. Tracking the World State with Recurrent Entity Networks. In Proceedings of ICLR 2017.

Huang, H.; Geng, X.; Pei, J.; Long, G.; and Jiang, D. 2021. Reasoning over Entity-Action-Location Graph for Procedural Text Understanding. In Proceedings of ACL 2021.

Kiddon, C.; Ponnuraj, G. T.; Zettlemoyer, L.; and Choi, Y. 2015. Mise en Place: Unsupervised Interpretation of Instructional Recipes. In Proceedings of EMNLP 2015.

Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of ICLR 2015.

Li, T.; Wang, B.; Jiang, Y.; Zhang, Y.; and Yan, Y. 2018. Restricted Boltzmann Machine-based Approaches for Link Prediction in Dynamic Networks. IEEE Access.
Mishra, B. D.; Huang, L.; Tandon, N.; Yih, W.-t.; and Clark, P. 2018. Tracking State Changes in Procedural Text: A Challenge Dataset and Models for Process Paragraph Comprehension. In Proceedings of NAACL 2018.

Ribeiro, D.; Hinrichs, T.; Crouse, M.; Forbus, K.; Chang, M.; and Witbrock, M. 2019. Predicting State Changes in Procedural Text using Analogical Question Answering. In Proceedings of ACACS 2019.

Seo, M. J.; Kembhavi, A.; Farhadi, A.; and Hajishirzi, H. 2017. Bidirectional Attention Flow for Machine Comprehension. In Proceedings of ICLR 2017.

Skardinga, J.; Gabrys, B.; and Musial, K. 2021. Foundations and Modelling of Dynamic Networks using Dynamic Graph Neural Networks: A Survey. IEEE Access.

Speer, R.; Chin, J.; and Havasi, C. 2017. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In Proceedings of AAAI 2017.

Tandon, N.; Mishra, B. D.; Grus, J.; Yih, W.-t.; Bosselut, A.; and Clark, P. 2018. Reasoning about Actions and State Changes by Injecting Commonsense Knowledge. In Proceedings of EMNLP 2018.

Tang, J.; Feng, Y.; and Zhao, D. 2020. Understanding Procedural Text using Interactive Entity Networks. In Proceedings of EMNLP 2020.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention Is All You Need. In Proceedings of NIPS 2017.

Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2018. Graph Attention Networks. In Proceedings of ICLR 2018.

Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; Davison, J.; Shleifer, S.; von Platen, P.; Ma, C.; Jernite, Y.; Plu, J.; Xu, C.; Scao, T. L.; Gugger, S.; Drame, M.; Lhoest, Q.; and Rush, A. M. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of EMNLP 2020: System Demonstrations.

Yin, Y.; Song, L.; Su, J.; Zeng, J.; Zhou, C.; and Luo, J. 2019. Graph-based Neural Sentence Ordering. In Proceedings of IJCAI 2019.

Zhang, W.; Feng, Y.; Meng, F.; You, D.; and Liu, Q. 2019. Bridging the Gap between Training and Inference for Neural Machine Translation. In Proceedings of ACL 2019.

Zhang, Z.; Geng, X.; Qin, T.; Wu, Y.; and Jiang, D. 2020. Knowledge-aware Procedural Text Understanding with Multi-stage Training. In Proceedings of WWW 2020.

Zhong, W.; Tang, D.; Duan, N.; Zhou, M.; Wang, J.; and Yin, J. 2020. A Heterogeneous Graph with Factual, Temporal and Logical Knowledge for Question Answering over Dynamic Contexts. arXiv preprint arXiv:2004.12057.