# Temporal and Object Quantification Networks

Jiayuan Mao1, Zhezheng Luo1, Chuang Gan2, Joshua B. Tenenbaum1, Jiajun Wu3, Leslie Pack Kaelbling1 and Tomer D. Ullman4
1Massachusetts Institute of Technology 2MIT-IBM Watson AI Lab 3Stanford University 4Harvard University

We present Temporal and Object Quantification Networks (TOQ-Nets), a new class of neuro-symbolic networks with a structural bias that enables them to learn to recognize complex relational-temporal events. This is done by including reasoning layers that implement finite-domain quantification over objects and time. The structure allows them to generalize directly to input instances with varying numbers of objects in temporal sequences of varying lengths. We evaluate TOQ-Nets on input domains that require recognizing event-types in terms of complex temporal relational patterns. We demonstrate that TOQ-Nets can generalize from small amounts of data to scenarios containing more objects than were present during training and to temporal warpings of input sequences.

1 Introduction

Every day, people interpret events and actions in terms of concepts, defined over temporally evolving relations among agents and objects [Zacks et al., 2007; Stränger and Hommel, 1996]. For example, in a soccer game, people can easily recognize when one player has control of the ball, when a player passes the ball to another player, or when a player is offside. Although it requires reasoning about intricate relationships among sets of objects over time, this cognitive act is effortless, intuitive, and fast. It also generalizes directly over different numbers and arrangements of players, and over detailed timings and trajectories. In contrast, most computational representations of sequential concepts are based on fixed windows of space and time, and lack the ability to perform relational generalization.
In this paper, we develop generalizable representations for learning complex activities in time sequences from realistic data. As illustrated in Fig. 1, we can describe complex events with a first-order linear temporal logic [FO-LTL; Pnueli, 1977] formula, which allows us to flexibly decompose an input sequence into stages that satisfy different criteria over time. Object quantifiers (∃ and ∀) are used to specify conditions on sets of objects that define each stage. Such representations immediately support generalization to situations with a varying number of objects, and to sequences with different time warpings.

*: equal contribution; order determined by a coin toss. Email: {jiayuanm,ezzluo}@mit.edu.

More concretely, the variety of complicated spatio-temporal trajectories that "high pass" can refer to in a soccer game can be described in these terms: in a high pass from player A to teammate B, A is close to the ball (distance(A, ball) < θ1) and moving (velocity(A) > θ2) until the ball moves over the ground (zposition(ball) > θ3), which is in turn until teammate B gets control of the ball (teammate(A, B) ∧ distance(B, ball) < θ1). Beyond modeling human actions in physical environments, these structures can be applied to events in any time sequence of relational states, e.g., characterizing particular offensive or defensive maneuvers in board games such as chess or in actual conflicts, or detecting a process of money laundering amidst financial transaction records.

In this paper, we propose a neuro-symbolic approach to learning to recognize temporal relational patterns, called Temporal and Object Quantification Networks (TOQ-Nets), in which we design structured neural networks with an explicit bias that represents finite-domain quantification over both entities and time. A TOQ-Net is a multi-layer neural network whose inputs are the properties of agents and objects and their relationships, which may change over time.
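To make the "high pass" description above concrete, here is a minimal sketch (not the paper's code) that evaluates the nested until-structure over a toy symbolic trajectory. The predicate names and thresholds θ1, θ2, θ3 follow the text; the trajectory values and the `until` helper are invented for illustration.

```python
def until(p, q, t=0):
    """Strong until at time t: q eventually holds, and p holds at every step before it."""
    for tp in range(t, len(q)):
        if q[tp]:
            return True
        if not p[tp]:
            return False
    return False

theta1, theta2, theta3 = 1.0, 0.5, 1.5  # illustrative thresholds

# Per-time-step truth values of each predicate for a 6-step toy trajectory.
close_A   = [1, 1, 1, 0, 0, 0]  # distance(A, ball) < theta1
moving_A  = [1, 1, 1, 0, 0, 0]  # velocity(A) > theta2
ball_high = [0, 0, 0, 1, 1, 0]  # zposition(ball) > theta3
control_B = [0, 0, 0, 0, 0, 1]  # teammate(A, B) and distance(B, ball) < theta1

# (close_A and moving_A) U (ball_high U control_B)
stage1 = [c and m for c, m in zip(close_A, moving_A)]
inner  = [until(ball_high, control_B, t) for t in range(len(control_B))]
is_high_pass = until(stage1, inner)  # True for this toy trajectory
```

A TOQ-Net learns both the thresholded predicates and the temporal structure rather than having them hand-coded as above.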
Each layer in the TOQ-Net performs either object or temporal quantification. The key idea of TOQ-Nets is to use tensors to represent relations between objects, and to use tensor pooling operations over different dimensions to realize temporal and object quantifiers (e.g., G and ∃). For example, the colorful matrix in Fig. 1(a) can be understood as representing a unary color property of a set of 8 entities over a sequence of 8 time steps. Representing a sequence of relations among objects over time would require a 3-dimensional tensor. Crucially, the design of TOQ-Nets allows the same network weights to be applied to domains with different numbers of objects and time sequences of different lengths. By stacking object and temporal quantification operations, TOQ-Nets can easily learn to represent higher-level sequential concepts based on the relations between entities over time, starting from low-level sensory input and supervised with only high-level class labels.

There are traditional symbolic learning or logic synthesis methods that construct first-order or linear temporal logic expressions from accurate symbolic data [Neider and Gavran, 2018; Camacho et al., 2018; Chou et al., 2020]. TOQ-Nets take a different approach and can learn from noisy data by backpropagating gradients, which allows them to start with a general perceptual processing layer that is directly fed into logical layers for further processing.

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21)

Figure 1: (a) An input sequence composed of relational states: each column represents the state of an entity that changes over time. A logic formula describes a complex concept or feature that is true of this temporal sequence using object and temporal quantification. The sequence is segmented into three stages: throughout the first stage, some predicate holds for at least one entity, until the second stage, in which each entity always satisfies one of two predicates, until the third stage, in which a predicate eventually becomes true for at least one of the entities. (b) Such events can be described using first-order linear temporal logic expressions.

We evaluate TOQ-Nets on two perceptually and conceptually different benchmarks: trajectory-based sport event detection and human activity recognition, demonstrating several important contributions. First, TOQ-Nets outperform both convolutional and recurrent baselines for modeling temporal-relational concepts across benchmarks. Second, by exploiting temporal-relational features learned through supervised learning, TOQ-Nets achieve strong few-shot generalization to novel actions. Finally, TOQ-Nets exhibit strong generalization to scenarios with more entities than were present during training, and are robust to time-warped input trajectories. These results illustrate the power of combining symbolic representational structures with learned continuous-parameter representations to achieve robust, generalizable interpretation of complex relational-temporal events.

2 Temporal and Object Quantification Networks

The input to a TOQ-Net is a tensor representation of the properties of all entities at each moment in time. For example, in a soccer game, the input encodes the position of each player and the ball, as well as their team membership, at each step of an extended time interval. The output is a label of the category of the sequence, such as the type of soccer play it contains. The first layer of a TOQ-Net (Fig. 2 (i)) extracts temporal features for each entity with an input feature extractor that focuses on entity features within a fixed and local time window. These features may be computed, e.g., by a convolutional neural network or a bank of parametric feature templates.
The output of this step is a collection of nullary, unary, and binary relational features over time for all entities. Throughout the paper we will assume that all output tensors of this layer are binary-valued, but the approach extends directly to real-valued functions. This input feature extractor is task-specific and is not the focus of this paper. Second, these temporal-relational features go through several relational reasoning layers (RRLs), detailed in Section 2.2, each of which performs linear transformations, sigmoid activation, and object quantification operations. The linear and sigmoid functions allow the network to realize learned Boolean logical functions, and the object quantification operators can realize quantifiers. Additional RRLs enable deeper nesting of quantified expressions, as illustrated in Fig. 2. All operations in these layers are performed for all time steps in parallel. Next, the RRLs perform a final quantification, computing for each time step a set of nullary features that are passed to the temporal reasoning layers (TRLs), as detailed in Section 2.3. Each TRL performs linear transformations, sigmoid activation, and temporal quantification, allowing the model to realize a subset of linear temporal logic [Pnueli, 1977]. As with RRLs, adding more TRLs enables the network to realize logical forms with more deeply nested temporal quantifiers. In the last layer, all object and time information is projected into a set of features of the initial time step, which summarize the temporal-relational properties of the entire trajectory (e.g., "the kicker eventually scores"), and fed into a final softmax unit to obtain classification probabilities for the sequence. It is important to understand the representational power of this model. The input transformation layer learns basic predicates and relations that will be useful for defining more complex concepts, but no specific predicates or relations are built into the network in advance.
The relational reasoning layers build quantified expressions over these basic properties and relations, and might construct expressions that could be interpretable as "the player is close to the ball." Finally, the temporal reasoning layer applies temporal operations to these complex expressions, such as "the player is close to the ball until the ball moves with high speed." Critically, none of the symbolic properties or predicates are hand-defined: they are all constructed by the initial layer in order to enable the network to express the concept it is being trained on.

Figure 2: A TOQ-Net contains three modules: (i) an input feature extractor, (ii) relational reasoning layers (K layers, Section 2.2), and (iii) temporal reasoning layers (L layers, Section 2.3). To illustrate the model's representational power, we show that logical forms of increasing complexity can be realized by stacking multiple layers.

TOQ-Nets are not fully first-order: all quantifiers operate only over the finite domain of the input instance, and can be seen as shorthand for finite disjunctions or conjunctions over objects or time points. In addition, the depth of the logical forms a TOQ-Net can learn is determined by the fixed depth of the network.
However, our goal is not to fully replicate temporal logic, but to bring ideas of object and temporal quantification into neural networks, and to use them as structural inductive biases to build models that generalize better from small amounts of data to situations with varying numbers of objects and time courses.

2.1 Temporal-Relational Feature Representation

TOQ-Nets use tensors as internal representations between layers; they represent, all at once, the values of all predicates and relations grounded on all objects at all time points. The operations in a TOQ-Net are vectorized, operating in parallel on all objects and times, sometimes expanding the dimensionality via outer products, and then re-projecting into smaller dimensions via max-pooling. This processing style is analogous to representing an entire graph using an adjacency matrix and using matrix operations to compute properties of the nodes or of the entire graph. In TOQ-Nets, the input to the network, as well as the feature output of intermediate layers, is represented as a tuple of three tensors. Specifically, at each time t, we use a vector of dimension D0 to represent aspects of the state that are global and do not depend on any specific object. We use a matrix of shape N × D1 to represent the unary properties of each entity at time t, where N is the number of entities and D1 is the hidden dimension size. Similarly, we use a tensor of shape N × N × D2 to represent the relations between each pair of entities at time step t. As a concrete example, illustrated in Fig. 2, the number of entities N is the total number of players plus one (the ball). For each entity x and each time step, the inputs are its 3D position, type (ball or player), and team membership. The TOQ-Net outputs the action performed by the target player. Since there are only entity features, the input trajectory is encoded with a unary tensor of shape T × N × D1, where T is the length of the trajectory.
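For concreteness, the tensor-tuple representation can be sketched as follows (a minimal numpy sketch; the dimensions D0, D1, D2 and the placeholder values are hypothetical, not taken from the paper):

```python
import numpy as np

T, N = 25, 23          # e.g., 25 time steps; 22 players plus the ball
D0, D1, D2 = 4, 8, 6   # hypothetical nullary/unary/binary feature sizes

nullary = np.zeros((T, D0))        # global state features at each time t
unary   = np.zeros((T, N, D1))     # per-entity properties over time
binary  = np.zeros((T, N, N, D2))  # per-pair relations over time

# In the soccer example there are only entity-level inputs, so the input
# trajectory reduces to a single unary tensor of shape T x N x D1
# (3D position, a ball/player indicator, and team membership).
soccer_input = np.zeros((T, N, 5))
```

Because all layer operations reduce or broadcast along the N and T axes rather than hard-coding their sizes, the same weights apply to any number of entities and any trajectory length.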
That is, there are no nullary or binary inputs in this case.

2.2 Relational Reasoning Layers

Our relational reasoning layers (RRLs) follow prior work on Neural Logic Machines [Dong et al., 2019], illustrated in Fig. 3 (i). Consider a specific time step t. At each layer l, the input to a neural logic layer is a 3-tuple (P_{l-1}, Q_{l-1}, R_{l-1}), which corresponds to nullary, unary, and binary features respectively. Their shapes are D0, N × D1, and N × N × D2. The output is another 3-tuple (P_l, Q_l, R_l), given by

P_l = NN_P(Concat[P_{l-1}; max(Q_{l-1}, dim = 0)]),
Q_l = NN_Q(Concat[Q_{l-1}; max(R_{l-1}, dim = 0); max(R_{l-1}, dim = 1); expand(P_{l-1}, dim = 1)]),
R_l = NN_R(Concat[R_{l-1}; expand(Q_{l-1}, dim = 0); expand(Q_{l-1}, dim = 1)]),

where NN_P, NN_Q, and NN_R are single fully-connected layers with sigmoid activations. For unary and binary features, NN_Q and NN_R are applied along the feature dimension. That is, we apply the same linear transformation to the unary features of all entities, and a different linear transformation to the binary features of all entity pairs. Concat[·; ·] is the concatenation operation, applied to the last dimension of the tensors (the feature dimension). max, also called the reduced max operation, takes the maximum value along the given axis of a tensor. The expand operation, also called broadcast, duplicates the input tensor N times and stacks the copies together along the given axis. RRLs are applied identically to the input features at every time step t. That is, we use the same neural network weights in an RRL for all time steps in parallel. RRLs are motivated by relational logic rules in a finite and fully grounded universe. The max reduction operations implement a differentiable version of an existential quantifier over the finite universe of individuals, given that the truth values of the propositions are represented as values in (0.0, 1.0).
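A minimal numpy sketch of one RRL step at a single time step (the shapes follow the equations above, but the weight matrices are randomly initialized stand-ins for NN_P, NN_Q, NN_R; this is not the authors' implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rrl(P, Q, R, WP, WQ, WR):
    """One relational reasoning layer at a single time step.

    P: (D0,) nullary, Q: (N, D1) unary, R: (N, N, D2) binary features.
    WP/WQ/WR are weight matrices standing in for NN_P/NN_Q/NN_R.
    """
    N, D1 = Q.shape
    # P_l: concatenate P with an existential (max) reduction over objects.
    P_in = np.concatenate([P, Q.max(axis=0)])
    # Q_l: concatenate Q with max over each object axis of R, plus P broadcast.
    Q_in = np.concatenate(
        [Q, R.max(axis=0), R.max(axis=1), np.tile(P, (N, 1))], axis=-1)
    # R_l: concatenate R with Q broadcast along each of the two object axes.
    Qx = np.broadcast_to(Q[:, None, :], (N, N, D1))
    Qy = np.broadcast_to(Q[None, :, :], (N, N, D1))
    R_in = np.concatenate([R, Qx, Qy], axis=-1)
    return sigmoid(P_in @ WP), sigmoid(Q_in @ WQ), sigmoid(R_in @ WR)
```

With binary-valued inputs, the max reductions act as existential quantifiers over the N entities; stacking two such layers can, for instance, realize ∃x. q1(x, t) ∧ q2(x, t).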
Because preceding and subsequent RRLs can negate propositions as needed, we omit explicit implementation of finite-domain universal quantification, although it could be added by including analogous min reductions. Thus, as illustrated in Fig. 3 (i), given input features q1(x, t) and q2(x, t), we can realize the formula ∃x. q1(x, t) ∧ q2(x, t) by stacking two such layers.

Figure 3: Illustration of (i) relational reasoning layers and (ii) temporal reasoning layers. We provide two illustrative running traces. (i) The first relational reasoning layer takes unary predicates q1 and q2 as input, and its output Q1 is able to represent q1 ∧ q2. The max(Q1, dim = 0) in layer 2 can represent ∃x. q1(x, t) ∧ q2(x, t). (ii) Assume PK encodes the occurrence of events e1 and e2 at each time step. The first temporal reasoning layer can realize "always e2" with a temporal pooling from time step 3 to time step T. In the second temporal reasoning layer, the temporal pooling summarizes that e1 holds true from time step 1 to 2. Thus, the NN should be able to realize "e1 until (always e2)".

Throughout the paper we have been using only nullary, unary, and binary features, but the proposed framework itself can be easily extended to higher-order relational features. From a graph network [Bruna et al., 2014; Kipf and Welling, 2017; Battaglia et al., 2018] point of view, one can treat these features as the node and edge embeddings of a fully-connected graph and the relational reasoning layers as specialized graph neural network layers for realizing object quantifiers.
2.3 Temporal Reasoning Layers

Temporal reasoning layers (TRLs) perform quantification operations similar to relational reasoning layers, but along the temporal rather than the object dimension. The first TRL takes as input the summarized event representation produced by the K-th relational reasoning layer, P^{(t)}_K for all time steps t, as a matrix of shape T × D. Each TRL is computed as

P^{(t)}_{K+l} = max_{t' > t} NN_l(Concat[P^{(t')}_{K+l-1}; max_{t ≤ t'' ≤ t'} P^{(t'')}_{K+l-1}]).    (1)

Its input is composed of two parts: the events that happen at t', i.e., P^{(t')}_{K+l-1}, and the events that happen between t and t', summarized with temporal quantification operations, i.e., max_{t ≤ t'' ≤ t'} P^{(t'')}_{K+l-1}. The outer max_{t' > t} enumerates over all t' > t, and tests whether the condition specified by NN_l holds for at least one such t'. A special case is the first temporal reasoning layer. It takes P_K as its input, which is the output of the last relational reasoning layer. Thus, the first temporal reasoning layer implements P^{(t)}_{K+1} = NN_1(max_{t ≤ t' ≤ T} P^{(t')}_K), since only a single quantification over future time steps is involved. In addition, for all temporal layers, we add residual connections by concatenating their inputs with the outputs.

Next, let's consider a running example illustrating how a TOQ-Net can recognize the event that: event p holds true until event q becomes always true. In LTL, this can be written as p U (Gq). Using plain first-order logic (FOL), we can describe it as: ∃t. [(∀t'. (0 ≤ t' < t) ⟹ p(t')) ∧ (∀t'. (t' ≥ t) ⟹ q(t'))]. For simplicity, we consider a tensor representation for two events p(t) and q(t), where p(t) = 1 if it happens at time step t and p(t) = 0 otherwise. Given the input sequence of length 4 in Table 1 (p(t) and q(t)), the first layer is capable of computing the following four properties for each time step t: Gp (always p), which is true if p holds true for all future time steps starting from t; Fp (eventually p), which is true if p is true for at least one time step starting from t; and similarly, Gq (always q) and Fq (eventually q).
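The running example can be sketched as follows (a minimal numpy sketch, not the paper's implementation: suffix pooling over the time axis stands in for the learned temporal quantification, and the explicit enumeration stands in for the second layer's NN; the example sequences are assumed):

```python
import numpy as np

def eventually(p):
    """(Fp)(t) = max over t' >= t of p(t'): a suffix max pooling."""
    return np.maximum.accumulate(p[::-1])[::-1]

def always(p):
    """(Gp)(t) = min over t' >= t of p(t'); with max pooling alone this
    is obtained as the negation of eventually(not p)."""
    return 1 - eventually(1 - p)

def until_always(p, q):
    """p U (Gq): some t' >= t has (Gq)(t'), with p holding throughout [t, t')."""
    Gq = always(q)
    T = len(p)
    out = np.zeros(T, dtype=int)
    for t in range(T):
        for tp in range(t, T):
            if Gq[tp] and (tp == t or p[t:tp].min() == 1):
                out[t] = 1
                break
    return out

p = np.array([1, 1, 0, 0])
q = np.array([0, 0, 1, 1])
# eventually(p) -> [1, 1, 0, 0]; always(q) -> [0, 0, 1, 1]
# until_always(p, q)[0] == 1: p holds at steps 0-1, then q holds forever.
```

In the TOQ-Net itself, these poolings are wired into Equation 1, so the network can learn which of F, G, and until-style compositions to apply rather than having them fixed as above.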
Overall, together with residual connections, the first layer can recognize six useful events: p(t), q(t) (from the residual connection), and Gp, Fp, Gq, and Fq (by temporal quantification, i.e., pooling operations along the temporal dimension). The second layer can realize the computation depicted in Algorithm 1. For every time step t, it enumerates all t' > t and computes the output based on 1. the events at t' (represented as P^{(t')}_{K+l-1} in Equation 1, concretely (Gq)(t') in the Algorithm 1 example), and 2. the state of another event between t and t' (represented as max_{t ≤ t'' ≤ t'}