# Continuous-Time Attention for Sequential Learning

Jen-Tzung Chien, Yi-Hsiang Chen
Department of Electrical and Computer Engineering, National Chiao Tung University, Hsinchu, Taiwan
{jtchien, ethernet420.eed08g}@nctu.edu.tw

## Abstract

The attention mechanism is crucial for sequential learning and has enabled a wide range of applications. It is trained to spotlight the regions of interest in the hidden states of sequence data. Most attention methods compute the attention score by relating a query to a sequence represented as a discrete-time state trajectory. Such discrete-time attention cannot directly attend a continuous-time trajectory, which is represented by a neural differential equation (NDE) combined with a recurrent neural network. This paper presents a new continuous-time attention method for sequential learning that is tightly integrated with the NDE to construct an attentive continuous-time state machine. The continuous-time attention is performed at all times over the hidden states for different kinds of irregularly sampled signals. Information missing from the sequence due to sampling loss, especially for long sequences, can be seamlessly compensated and attended in representation learning. Experiments on irregular sequence samples from human activities, dialogue sentences and medical features show the merits of the proposed continuous-time attention for activity recognition, sentiment classification and mortality prediction, respectively.

## Introduction

Learning the representation of sequence data in the spatial or temporal domain is crucial (Chien 2019; Tseng et al. 2017). Popular examples of sequence data include natural sentences, video streams and medical signals. Most of them are time-series data, although sequences of spatial samples in images are also recognized as sequence data. An essential solution to sequential learning is the recurrent neural network (RNN), where the hidden states of previous samples are continuously updated by a recurrent machine and seamlessly applied to predict the next sample. One critical issue in sequential learning is to characterize the dynamics of the sampling resolution in sequence data. In general, RNNs are feasible for learning representations of regularly sampled time series, but they are an awkward fit to irregularly sampled time series. However, irregular time-series data are very common in the real world. For example, in the medical domain we usually predict the health of a patient by using several biomedical signals acquired by different sensors or diagnosis facilities, where the sampling resolutions of the different signals vary (Lipton, Kale, and Wetzel 2016; Che et al. 2018; El Naqa et al. 2018; Chuang et al. 2020). Other practical applications involving irregularly sampled data (Elman 1990) include financial marketing (Bauer, Schölkopf, and Peters 2016), weather forecasting (Shi et al. 2015) and traffic engineering (Wang et al. 2017), to name a few. A key issue in these applications is the missing-data problem, which may be caused by facility cost or machine anomaly. A naive way to deal with this issue is to adopt the time difference between sequence samples as a new input feature for RNN training.
Alternatively, an attractive approach is to construct a continuous-time machine based on the neural differential equation (NDE) (Chen et al. 2018), where a continuous-time hidden state space is constructed to learn representations over sequence data of unlimited length. The discrete-time state transitions in an RNN are generalized to continuous-time state dynamics by combining the NDE with the RNN (Rubanova, Chen, and Duvenaud 2019). The NDE is then built as a strong model with rich sequence information for prediction. It can be seen as an RNN for extremely long sequences, and it can tackle the weakness of RNNs, whose performance deteriorates as the length of the input sequence increases (Bahdanau, Cho, and Bengio 2015). Nevertheless, the performance of sequential learning is bounded because the relevance or importance of individual samples to the target task is neglected.

This paper presents a continuous-time attention scheme to strengthen the learning of irregularly sampled data. In general, the attention mechanism is powerful for capturing the relevance between a query and a document or sequence, and it has been successfully developed for a wide range of applications based on the standard RNN (Chien and Lin 2018; Chien and Wang 2019; Chien and Lin 2020). The document is formed as a matrix where each row is a feature vector extracted from the sequence data at a different time point. The attention score is then computed from a query vector and a document matrix. This paper presents a novel continuous-time attention method for sequential learning where the attention is performed in a continuous-time state space based on the NDE. Accordingly, the document is represented by a continuous function rather than a matrix. This function represents the features at any desired time moment via the continuous-time hidden states. The concept of calculating the continuous-time attention is depicted in Figure 1.

Figure 1: Illustration of continuous-time attention. The difficulty is to calculate the dot product of two continuous functions and integrate it to obtain the context vector $c(\bar{t})$, where $\bar{t}$ is the time index of the query vector $q(\bar{t})$ and $\{t_n\}_{n=1}^N$ denote the time points of the sequence data $\{x_{t_n}\}$.

The difficulty in this calculation is that no previous attention method is suitable for finding attention weights when the document is a continuous function. This paper tackles the dilemma by representing both the attention score and the context vector as continuous-time functions. The attention score function carries out the attention $\alpha(t)$ over the whole state trajectory, while the context vector function reflects the weighted sum of the whole continuous-time hidden states $z(t)$. The merits of the proposed continuous-time attention are illustrated by experiments on action recognition, emotion recognition and medical data analysis. We report the results of the so-called attentive neural differential equation (denoted Att-NDE) under different experimental settings in comparison with the RNN and NDE.

## Related Works

The sequential learning methods with a continuous-time state machine and an attention mechanism are first introduced.

### Continuous-Time State Machine

The neural differential equation (Chen et al. 2018; Zhang et al. 2021) was proposed to build a continuous-time state machine for learning the representation of sequence data $\{x_{t_n}\}$ whose time points $\{t_n\}_{n=1}^N$ are irregularly sampled.
The NDE is implemented to learn the dynamics of the transformation between input samples and output targets so as to characterize the state transition $z(t)$ at continuous time $t$ based on an ordinary differential equation (ODE). The problem is solved by handling an ODE with an initial value,

$$\frac{dz(t)}{dt} = f(z(t), t; \theta), \qquad z(t_i) = z(t_0) + \int_{t_0}^{t_i} f(z(t), t; \theta)\, dt \tag{1}$$

where $f$ is a function represented by a neural network with parameters $\theta$, $z(t_0)$ denotes the initial state, and $z(t_i)$ denotes the hidden state at a desired time point $t_i$. An ODE solver is introduced to deal with the integration, as illustrated in Figure 2. The ODE solver, which is a tool for solving ODE initial-value problems, is continuously applied to carry out the continuous-time state $z(t)$ at any continuous time $t$ in the range between $t_1$ and $t_N$. A neural network is introduced to represent the derivative $f$ of the unknown differential equation. Figure 3 illustrates the dynamics from $z(t)$ to $z(t + \Delta t)$ in a continuous-time state machine using the neural ODE, simply denoted by NDE.

Figure 2: ODE solver for the continuous-time state machine, with the hidden state, the current time and the neural network $f$ as inputs.

Figure 3: Dynamics of the hidden state $z(t)$ in the neural ODE, where the ODE function is represented by a neural network.

The ODE solver is adopted to resolve the latent dynamic system in time-series data (Rubanova, Chen, and Duvenaud 2019). This state machine calculates the continuous-time hidden states $z(t)$ between two discrete observations $x_{t_{n-1}}$ and $x_{t_n}$ by applying an ODE solver,

$$\hat{z}(t_n) = \mathrm{ODESolver}\big(f, z(t_{n-1}), t_{n-1}, t_n, \theta\big) \tag{2}$$

which is a function of the neural network $f$ with parameters $\theta$, the start state $z(t_{n-1})$, the start time $t_{n-1}$ and the end time $t_n$. The NDE then updates the hidden state by using an RNN cell,

$$z(t_n) = \mathrm{RNNCell}\big(\hat{z}(t_n), x_{t_n}\big). \tag{3}$$

Here, the time index $t_n$ in parentheses in $z(t_n)$ denotes continuous time, while the subscript in $x_{t_n}$ denotes discrete time.

### Discrete-Time Attention Mechanism

The traditional attention mechanism was developed to elevate the performance of sequential learning based on the recurrent neural network (Bahdanau, Cho, and Bengio 2015; Luong, Pham, and Manning 2015; Su et al. 2018; Cao et al. 2018), where the discrete-time hidden states $\{z_j\}$ (or $\{z_{t_n}\}$) of the time-series observations $\{x_j\}$ are represented by a recurrent machine. Discrete-time attention is then implemented by calculating the context vector $c_i$ corresponding to a query vector $q_i$ at each discrete time $i$ using the attention weights $\alpha_{i,j}$, which are yielded by a softmax function,

$$c_i = \sum_{j=1}^{N} \alpha_{i,j} z_j, \quad \text{where} \quad \alpha_{i,j} = \frac{\exp(\mathrm{score}(q_i, z_j))}{\sum_{k=1}^{N} \exp(\mathrm{score}(q_i, z_k))}. \tag{4}$$

The inner product can be used to compute the matching score between the query $q_i$ and the state $z_j$, i.e. $\mathrm{score}(q_i, z_j) = q_i^{\top} z_j$, at discrete times $i$ and $j$, respectively, where $1 \leq j \leq N$. Once the context vector $c_i$ is calculated, the attended feature is usually obtained by the addition $q_i + c_i$. However, the discrete-time attention scheme is infeasible to merge with the continuous-time state machine.
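To make the recursion in Eqs. (2)-(3) concrete, the following minimal sketch implements an ODE-RNN-style update loop: the state is evolved by an ODE solver between observation times and refreshed by an RNN cell at each observation. It assumes PyTorch and the torchdiffeq package for the solver; the layer sizes, class names (ODEFunc, ODERNN) and toy data are illustrative rather than the paper's actual configuration.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint


class ODEFunc(nn.Module):
    """Neural network f(z(t), t; theta) that parameterizes dz/dt."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden_dim + 1, 64), nn.Tanh(),
                                 nn.Linear(64, hidden_dim))

    def forward(self, t, z):
        # Time-variant dynamics: append t so that f depends on both z(t) and t.
        t_feat = t * torch.ones(z.shape[0], 1)
        return self.net(torch.cat([z, t_feat], dim=-1))


class ODERNN(nn.Module):
    """Continuous-time state machine: ODE solve between observations (Eq. 2),
    RNN-cell update at each observation time (Eq. 3)."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.func = ODEFunc(hidden_dim)
        self.cell = nn.GRUCell(input_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, xs, ts):
        # xs: (N, input_dim) observations; ts: (N,) irregular time stamps.
        z = torch.zeros(1, self.hidden_dim)
        for n in range(len(ts)):
            if n > 0:
                # Eq. (2): evolve the state from t_{n-1} to t_n with the solver.
                span = torch.stack([ts[n - 1], ts[n]])
                z = odeint(self.func, z, span)[-1]
            # Eq. (3): fold in the new observation with an RNN cell.
            z = self.cell(xs[n].unsqueeze(0), z)
        return z


# Usage: 20 irregularly spaced observations of 12-dimensional features.
ts = torch.sort(torch.rand(20)).values
xs = torch.randn(20, 12)
z_final = ODERNN(input_dim=12, hidden_dim=15)(xs, ts)
```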
## Continuous-Time Attention

This study presents a new sequential learning strategy where a continuous-time attention mechanism is seamlessly employed in the continuous-time state machine. Figure 4 illustrates the conceptual difference between discrete-time attention and continuous-time attention. The curves at the bottom reflect the state trajectories $z_{t_n}$ and $z(t_n)$ of the observation samples $x_{t_n}$ at the time points $\{t_n\}_{n=1}^N$ in the discrete-time and continuous-time state machines based on the RNN and NDE, respectively. The NDE predicts the continuous-time hidden states between two observations $\{x_{t_{n-1}}, x_{t_n}\}$, which are more informative than those of the RNN, where only the hidden states at the specific time points $\{t_{n-1}, t_n\}$ are represented. Therefore, discrete-time attention is defined over a finite number of hidden states $z_j$, while continuous-time attention is performed over the continuous state function $z(t)$. In what follows, we generalize discrete-time attention to continuous-time attention. Finding the context vector $c(\bar{t})$ corresponding to a query vector $q(\bar{t})$ using a summation is extended to using an integral.

Figure 4: Comparison between discrete-time attention (left) and continuous-time attention (right). The discrete-time attention score is calculated by summing the dot products between the query and the document at time points $\{t_1, t_2, t_3\}$, whereas the continuous-time attention score is computed by integrating and interpolating via an ordinary differential equation over continuous time $t$.

### Continuous-Time Generalization

In (Ramachandran et al. 2019; Cordonnier, Loukas, and Jaggi 2020), self-attention was interpreted as a kind of convolution in a convolutional neural network. In contrast, this work calculates the attention weights based on the convolution operation, which is well defined in both discrete-time and continuous-time signal processing. First, in discrete-time processing, the context vector $c[\bar{n}]$ is calculated by the convolution with the attention weight $\alpha_{\bar{n}}[n]$,

$$c[\bar{n}] = \sum_{n=t_1}^{t_N} \alpha_{\bar{n}}[n]\, z[n], \quad \text{where} \quad \alpha_{\bar{n}}[n] = \exp\big(q[\bar{n}] * z[n]\big), \tag{5}$$

where $\bar{n}$ is the time index of the query vector. Note that the context vector is denoted as a time series $c[\bar{n}]$ with the same time index as the query vector $q[\bar{n}]$, whose query signal is defined as $q[n'] \triangleq q_{\bar{n}}\, \delta[n']$, where $\delta[\cdot]$ denotes the delta function. The score function for finding the attention weight is then obtained by

$$q[\bar{n}] * z[n] = \sum_{n'=t_1}^{t_N} \big(q_{\bar{n}}\, \delta[n']\big)^{\top} z[n + n'] = q_{\bar{n}}^{\top} z[n]. \tag{6}$$

The next step is to generalize Eq. (5) to calculate the context vector using the continuous-time convolution

$$c(\bar{t}) = \int_{t_1}^{t_N} \alpha_{\bar{t}}(t)\, z(t)\, dt \tag{7}$$

where the summation is replaced by the integral over the continuous-time state trajectory $z(t)$, and $\bar{t}$ indicates the time index of the query vector. The attention weight is then generalized to

$$\alpha_{\bar{t}}(t) = \exp\left(\int_{t_1}^{t_N} \big(q_{\bar{t}}\, \delta(t')\big)^{\top} z(t + t')\, dt'\right). \tag{8}$$

Similar to Eq. (6), the score function is written as

$$\int_{t_1}^{t_N} \big(q_{\bar{t}}\, \delta(t')\big)^{\top} z(t + t')\, dt' = q_{\bar{t}}^{\top} z(t). \tag{9}$$
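Before turning to the ODE-based implementation, Eqs. (7)-(9) can be sanity-checked by plain numerical quadrature on a dense time grid. The sketch below is a hypothetical illustration rather than the paper's code: it mocks a smooth trajectory $z(t)$, scores it against a query as in Eq. (9), and approximates the integral of Eq. (7) with the trapezoidal rule; the division by the integrated score anticipates the normalization introduced in the next subsection.

```python
import torch

torch.manual_seed(0)
hidden_dim = 4

# Dense grid standing in for continuous time on [t_1, t_N] = [0, 1].
t_grid = torch.linspace(0.0, 1.0, steps=500)

# Mock continuous-time hidden trajectory z(t); in Att-NDE this would come
# from the ODE solver of the previous section.
freqs = torch.arange(1, hidden_dim + 1, dtype=torch.float32)
z_traj = torch.sin(t_grid.unsqueeze(1) * freqs)          # (500, hidden_dim)

q = torch.randn(hidden_dim)                              # query vector

# Eq. (9): score(t) = q^T z(t); Eq. (8): unnormalized weight alpha(t).
alpha = torch.exp(z_traj @ q)                            # (500,)

# Eq. (7): c = integral of alpha(t) z(t) dt, via an explicit trapezoidal rule.
dt = t_grid[1:] - t_grid[:-1]                            # (499,)
integrand = alpha.unsqueeze(1) * z_traj                  # (500, hidden_dim)
c_unnorm = (0.5 * (integrand[1:] + integrand[:-1]) * dt.unsqueeze(1)).sum(0)

# Normalizer: integral of alpha(t) dt; the implementation below realizes both
# integrals as ODE states C(t) and A(t) and divides them.
a_total = (0.5 * (alpha[1:] + alpha[:-1]) * dt).sum()
context = c_unnorm / a_total
```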
### Model Implementation

In the implementation, the integral operation in continuous-time attention can be handled through the ODE solver. Considering the ODE property, the solution is implemented by modeling the dynamics of the hidden state, the context vector and the attention weight using neural networks. The ODE solver is introduced to find the solution to multiple dynamics at the same time. To do so, we first rewrite Eq. (7) to meet the format of the ODE solver, which requires a time variable $t$. We first compute the context vector $c(\bar{t})$. An additional context vector function $C(t)$ is defined with the value at time point $t_N$ such that $C(t_N) = c(\bar{t})$, where $t$ is a time variable while $\bar{t}$ is the fixed time point of the query. This continuous-time function is calculated by $C(t) = \int_{t_1}^{t} \alpha_{\bar{t}}(\tau)\, z(\tau)\, d\tau$, which reduces to Eq. (7) when the time variable $t$ equals $t_N$. The meaning of the time variable $t$ is the continuous time that the query vector attends to. Next, the context vector function is expressed to fit the setting of the ODE solver by using the start time $t_1 = 0$ and following the Leibniz integral rule,

$$\frac{dC(t)}{dt} = \alpha_{\bar{t}}(t)\, z(t) + \int_{t_1}^{t} \frac{\partial}{\partial t}\big[\alpha_{\bar{t}}(\tau)\, z(\tau)\big]\, d\tau = \alpha_{\bar{t}}(t)\, z(t). \tag{10}$$

Note that the attention weight $\alpha_{\bar{t}}(t)$ defined in Eq. (8) is calculated without the normalization performed by the softmax function in Eq. (4) for discrete-time attention. However, the context vector should be normalized by the summation of all attention scores over time. We therefore define another continuous function $A(t)$ as the attention score function, which represents the attention score accumulated up to the current time $t$. To apply the ODE solver, the attention score function and its dynamics are expressed by

$$A(t) = \int_{t_1}^{t} \alpha_{\bar{t}}(\tau)\, d\tau, \qquad \frac{dA(t)}{dt} = \alpha_{\bar{t}}(t). \tag{11}$$

Then the normalization of the context vector is performed via the division $C(t)/A(t)$. Finally, the solutions to the differential equations of the hidden state, the context vector and the attention score are simultaneously yielded by

$$\underbrace{\begin{bmatrix} z(t_n) \\ C(t_n) \\ A(t_n) \end{bmatrix}}_{\text{solutions}} = \begin{bmatrix} z(t_{n-1}) \\ C(t_{n-1}) \\ A(t_{n-1}) \end{bmatrix} + \int_{t_{n-1}}^{t_n} \underbrace{\begin{bmatrix} f(z(t), t; \theta) \\ \alpha_{\bar{t}}(t)\, z(t) \\ \alpha_{\bar{t}}(t) \end{bmatrix}}_{\text{dynamics}} dt \tag{12}$$

where the initial conditions are given by $z(t_1) = x_{t_1}$, $C(t_1) = 0$ and $A(t_1) = 0$. Figure 5 illustrates how the ODE solver is applied to implement continuous-time attention. The dynamic functions of $z(t)$, $C(t)$ and $A(t)$ are used to represent the hidden state trajectory, the context vector function and the attention score function, respectively, which are different continuous-time functions drawn with different markers and colors. The ODE solver is seen as a black box whose inputs consist of the neural network $f$, the initial values $z(t_1)$, $C(t_1)$, $A(t_1)$ and the query $q(\bar{t})$. This solver is run from the start time $t_1$ to the end time $t_N$, and the current time $t$ is spotlighted. Notably, RNNs are continuously applied to update $z(t)$ once a new sample $x_{t_n}$ is observed at time $t_n$.

Figure 5: ODE solver for continuous-time attention, with the hidden state, the context vector function, the attention score function, the current time and the neural network as inputs.

Figure 6: Dynamics of $z(t)$, $C(t)$ and $A(t)$ in the attentive neural differential equation, where the ODE function is represented by a neural network.

Figure 6 shows the computation of the derivatives or dynamics of the various continuous-time functions inside the proposed attentive neural differential equation (Att-NDE). The ODE functions $f(z(t), t; \theta)$, $\alpha_{\bar{t}}(t)\, z(t)$ and $\alpha_{\bar{t}}(t)$ are calculated to solve the continuous-time functions $z(t)$, $C(t)$ and $A(t)$, respectively. The dot product of the query and the hidden state is used to update $A(t)$, while the element-wise multiplication of the current hidden state and the derivative of the attention score function is used to update $C(t)$. After finding the three derivatives $dz(t)/dt$, $dC(t)/dt$ and $dA(t)/dt$, the values at the next time point $t + \Delta t$ are obtained by adding the current values with the first-order derivatives. Algorithm 1 shows the overall procedure of Att-NDE, where the continuous-time attention is performed by Algorithm 2. Basically, the continuous-time attention attends the hidden states by computing the normalized context vector between the observations at time steps $t_{n-1}$ and $t_n$. The augmented dynamic function $f_{\mathrm{aug}}$ collects the gradients of $z(t)$, $C(t)$ and $A(t)$ as $g_z$, $g_c$ and $g_a$, respectively, which are incorporated into the ODE solver to find the corresponding continuous-time functions between $t_{n-1}$ and $t_n$. The ODE solver is a kind of integrator that finds the integrated dynamics at any desired time instant $t$. An RNN cell (Dupont, Doucet, and Teh 2019; Chien and Chen 2021; Chien and Ku 2015; Kuo and Chien 2018) is then used to update the hidden state from $\hat{z}(t_n)$ to $z(t_n)$ when the query point $x_{t_n}$ (or $q$) is observed. The attended feature $z(t_n) + C(t_n)/A(t_n)$ is then computed and used to find the classification output $y_{t_N}$ via the classifier layer $\mathrm{OutputNN}(\cdot)$. The classification loss is finally calculated and optimized to train Att-NDE.

Algorithm 1: Attentive neural differential equation
    Input: the parameters θ, the data points and time stamps {(x_{t_n}, t_n)}_{n=1}^N, the query x_{t̄}
    for each sample x_{t_n} at time t_n do
        {ẑ(t_n), C(t_n), A(t_n)} = CTA(θ, z(t_{n-1}), C(t_{n-1}), A(t_{n-1}), t_{n-1}, t_n, x_{t̄})
        z(t_n) = RNNCell(ẑ(t_n), x_{t_n})
    end
    y_{t_N} = OutputNN(z(t_N) + C(t_N)/A(t_N))
    return y_{t_N}

Algorithm 2: Continuous-time attention (CTA)
    Input: the parameters θ, the initial values z(t_{n-1}), C(t_{n-1}), A(t_{n-1}), the start time t_{n-1}, the end time t_n, the query q
    function f_aug(z(t), C(t), A(t)):
        g_z = f(z(t), t, θ)
        g_a = q^T z(t)
        g_c = g_a z(t)
        return {g_z, g_c, g_a}
    end function
    {ẑ(t_n), C(t_n), A(t_n)} = ODESolver(f_aug, z(t_{n-1}), C(t_{n-1}), A(t_{n-1}), t_{n-1}, t_n, θ)
    return ẑ(t_n), C(t_n), A(t_n)
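As a rough illustration of Eq. (12) and Algorithm 2, the sketch below stacks $z(t)$, $C(t)$ and $A(t)$ into a single augmented state and integrates them jointly with one solver call. It again assumes torchdiffeq; the class and variable names are illustrative, and the attention weight here follows Eq. (8), i.e. $\alpha_{\bar{t}}(t) = \exp(q^{\top} z(t))$.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint


class AttentiveDynamics(nn.Module):
    """Joint dynamics of the augmented state [z(t), C(t), A(t)] in Eq. (12)."""
    def __init__(self, hidden_dim, query):
        super().__init__()
        self.d = hidden_dim
        self.q = query                                # fixed query vector
        self.f = nn.Sequential(nn.Linear(hidden_dim + 1, 64), nn.Tanh(),
                               nn.Linear(64, hidden_dim))

    def forward(self, t, state):
        z, C, A = torch.split(state, [self.d, self.d, 1])
        g_z = self.f(torch.cat([z, t.reshape(1)]))    # dz/dt = f(z(t), t; theta)
        alpha = torch.exp(self.q @ z)                 # Eq. (8), unnormalized weight
        g_c = alpha * z                               # dC/dt = alpha(t) z(t)
        g_a = alpha.reshape(1)                        # dA/dt = alpha(t)
        return torch.cat([g_z, g_c, g_a])


d = 15
dyn = AttentiveDynamics(hidden_dim=d, query=torch.randn(d))

# One solver call integrates z, C and A jointly from t_{n-1} to t_n,
# starting from z(t_{n-1}) and the accumulated C(t_{n-1}), A(t_{n-1}).
state0 = torch.cat([torch.randn(d), torch.zeros(d), torch.zeros(1)])
t_span = torch.tensor([0.0, 0.3])
z_n, C_n, A_n = torch.split(odeint(dyn, state0, t_span)[-1], [d, d, 1])

# The attended feature combines the state with the normalized context C/A.
attended = z_n + C_n / A_n
```

Looping this solve over consecutive observation pairs $(t_{n-1}, t_n)$ and applying the RNN-cell update of Eq. (3) after each call recovers the overall procedure of Algorithm 1.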
### Extension to Self Attention

Self attention has been popular in sequential learning tasks (Vaswani et al. 2017). This paper presents a new self-attention scheme based on the continuous-time state machine. Attention is performed by treating each data sample of a sequence as the query and working with the other samples of the same sequence as keys and values. The same sample is transformed into query, key and value using individual parameters. A general context vector $c_i$ based on discrete-time attention is extended from Eq. (4) and is calculated by the dot product (or matching score) between the query $W_q z_i = q_i$ and the key $W_k z_j = k_j$, a softmax, and then a multiplication with the value $W_v z_j = v_j$,

$$c_i = \sum_{j=1}^{N} \frac{\exp(q_i^{\top} k_j)}{\sum_{k} \exp(q_i^{\top} k_k)}\, v_j. \tag{13}$$

The continuous-time context vector function and the corresponding dynamics using self attention are modified as

$$C(t) = \int_{t_1}^{t} \exp\big(q(\bar{t})^{\top} k(\tau)\big)\, v(\tau)\, d\tau, \qquad \frac{dC(t)}{dt} = \exp\big(q(\bar{t})^{\top} k(t)\big)\, v(t). \tag{14}$$

The attention score function and its dynamics are extended as

$$A(t) = \int_{t_1}^{t} \exp\big(q(\bar{t})^{\top} k(\tau)\big)\, d\tau, \qquad \frac{dA(t)}{dt} = \exp\big(q(\bar{t})^{\top} k(t)\big). \tag{15}$$

The continuous-time functions in the numerator and denominator of Eq. (13) are thus calculated. Notably, self attention employs individual transformations to obtain the query $q_i$, the key $k_j$ and the value $v_j$, whereas standard attention uses the input sample $x_i$ as the query and shares the state variable $z_j$ as both key and value. Discrete-time attention is correspondingly extended to continuous-time attention based on Eqs. (14)-(15).
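Under the same illustrative setup as the previous sketch, the dynamics module below swaps the shared state for learned key and value projections as in Eqs. (14)-(15); the query $q(\bar{t})$ is assumed to be precomputed from the query sample, and all names and sizes are hypothetical.

```python
import torch
import torch.nn as nn


class SelfAttentiveDynamics(nn.Module):
    """Dynamics of [z(t), C(t), A(t)] with learned key/value projections."""
    def __init__(self, hidden_dim, query):
        super().__init__()
        self.d = hidden_dim
        self.q = query                              # precomputed query q(t_bar)
        self.f = nn.Sequential(nn.Linear(hidden_dim + 1, 64), nn.Tanh(),
                               nn.Linear(64, hidden_dim))
        self.W_k = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W_v = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, t, state):
        z, C, A = torch.split(state, [self.d, self.d, 1])
        k, v = self.W_k(z), self.W_v(z)             # key k(t) and value v(t)
        score = torch.exp(self.q @ k)               # exp(q(t_bar)^T k(t))
        g_z = self.f(torch.cat([z, t.reshape(1)]))  # hidden-state dynamics
        g_c = score * v                             # Eq. (14): dC/dt
        g_a = score.reshape(1)                      # Eq. (15): dA/dt
        return torch.cat([g_z, g_c, g_a])
```

This module can be dropped into the same odeint loop as before; the only change relative to the standard continuous-time attention is that the score and the integrand use $k(t)$ and $v(t)$ instead of $z(t)$ itself.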
## Experiments

A set of experiments is conducted to evaluate the performance of continuous-time attention in sequential learning.

### Evaluation on Action Recognition

The human activity dataset (Kaluza et al. 2010) was used as an action recognition task. It contains irregular time series from five individuals with 3D positions of tags attached to their belt, chest and ankles. There were 12 observation features and eleven different actions including walking, sitting, lying, etc. For consistent comparison, we used the same preprocessing as (Rubanova, Chen, and Duvenaud 2019) and combined similar activities such as "lying" and "lying down", and "sitting" and "sitting down". Different methods adopted the same hyperparameter setting as (Rubanova, Chen, and Duvenaud 2019). The number of training epochs was 200. The learning rate was initialized at 0.01 and decayed after each iteration by multiplying by 0.999. Adamax (Kingma and Ba 2014) was used. The hidden state size was 15. The relative and absolute tolerances of the solver were 1e-3 and 1e-4, respectively. A six-layer fully-connected network was configured as the ODE function. A one-layer GRU was used as the RNN cell. The classifier was built by a three-layer fully-connected network.

It is important to investigate how the continuous-time attention in Att-NDE works. Figure 7 shows the attention score function $A$ calculated by the ODE solver in four conditions, where different settings of dropping samples from the original sequence are considered. For the setting of dropping three important points near the query, Att-NDE can still pay attention to the missing region, and the high-attention region is even extended; Att-NDE compensates for the missing region via continuous-time attention through the dynamic function. When dropping unimportant time points far from the query location, Att-NDE simply ignores that region and obtains almost the same attention scores as with the full sequence. The last setting drops a wide range of samples. Interestingly, Att-NDE even attends to the unimportant region, partially because the missing region is too large to ignore; attention is needed in this situation.

Figure 7: Illustration of how the continuous-time attention values (shown by the darkness of red) are affected in four cases. Markers denote the data samples at the corresponding time points, and the red rectangle at the end of time is the query. The score functions are shown for settings containing the full sequence (top left), a sequence dropped by a slice of important time points (top right) and of unimportant time points (bottom left), and a sequence dropped by a wide range of samples (bottom right). The human activity dataset is used.

Next, the continuous-time attention is evaluated by comparing the predictions using NDE and Att-NDE. Table 1 shows that Att-NDE is robust and obtains comparable results even when a wide range of samples is missing, while NDE could not preserve its predictions. The same phenomenon occurs with random dropping. This is because the attention mechanism in Att-NDE can capture the history information to learn a reasonable state trajectory. Table 2 compares the accuracy and parameter size of different methods. The work in (Rubanova, Chen, and Duvenaud 2019) was trained by using the time-invariant dynamics $dz(t)/dt = f(z(t), \theta)$, whereas Att-NDE carries out the time-variant dynamics in Eq. (1). The results of our implementation and of that work are both reported. Here Δt indicates an implementation that treats the time difference as a new augmented feature. Basically, the discrete-time state machine with attention (RNN-Δt + att relative to RNN-Δt) was degraded. Next, the time-variant and time-invariant dynamics with NDE (w/ and w/o time) are compared; the time-variant dynamics work well. In addition, we combine NDE with the discrete-time attention mechanism to examine two further implementations, NDE + att (w/ and w/o time). Interestingly, adding discrete-time attention does not help and even degrades the performance of the NDE (w/ time) setting. This is because discrete-time attention could not properly characterize the temporal information in irregular time series. Att-NDE achieves the highest accuracy,
even higher than the latent ODE (Rubanova, Chen, and Duvenaud 2019), which was the best among the previous methods. The number of parameters is comparable across the different NDE variants.

Table 1: Comparison of predictions using different models and settings. From top to bottom are the predictions for the full sequence, for dropping a wide range of data, and for dropping 10 time points randomly. 0: walking, 1: falling, 2: lying, 3: sitting, 4: standing up, 5: on all fours, 6: sitting on the ground.

Full sequence:
- NDE: 0 0 0 3 3 3 3 3 3 3 3 3 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 6 6 6 6 6 6 6 6 6 6 6 6 6 6
- Att-NDE: 1 1 1 1 1 1 1 1 1 1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
- Labels: 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6

Dropping a wide range of data:
- NDE: 0 0 0 3 3 3 3 3 3 3 3 3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2
- Att-NDE: 1 1 1 1 1 1 1 1 1 1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
- Labels: 3 3 1 1 1 1 1 1 1 1 1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6

Dropping 10 time points randomly:
- NDE: 0 0 0 0 0 0 0 0 0 0 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 6 6 6 6 6 6 6 6
- Att-NDE: 1 1 1 1 1 1 1 1 1 1 1 1 1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 1
- Labels: 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6

Table 2: Comparison of accuracy and number of parameters on action recognition using different methods. * denotes the result reported in (Rubanova, Chen, and Duvenaud 2019).

| Models | Accuracy (%) | Param |
|---|---|---|
| RNN-Δt | 78.7 (79.7*) | |
| RNN-Δt + att | 77.6 | |
| NDE (w/o time) | 83.3 (82.9*) | 1.13M |
| NDE (w/ time) | 84.3 | 1.14M |
| NDE + att (w/o time) | 83.5 | 1.13M |
| NDE + att (w/ time) | 83.3 | 1.14M |
| Latent ODE | 84.6 (84.6*) | 1.70M |
| Att-NDE | 86.8 | 1.13M |

### Evaluation on Emotion Recognition

The Multimodal EmotionLines Dataset (MELD) (Poria et al. 2019) contains dialogue instances collected from the Friends TV series. Conventionally, the dialogue sequences are treated as regular time series; however, each utterance has a different length. Representing irregular time series in a spoken dialogue (Chien and Lieow 2019; Chien and Hsu 2020) is desirable for dialogue modeling. MELD contains not only text but also audio and visual modalities. There were 1433 dialogues and 13708 utterances. Each utterance in a dialogue was labeled with one of seven emotions including anger, joy, neutral, etc. For preprocessing, the GloVe embedding was employed to embed all tokens of an utterance, and the average of those embeddings was computed to represent the utterance. This study ignored the audio and visual clips and stamped the start time of each utterance as its time index. The utterances in a dialogue could then be seen as an irregular time series. The number of epochs was 300 and a learning rate of 0.001 was used. The hidden state size was 50. The ODE function was represented by a five-layer fully-connected network. Other settings were the same as those in the first task. Att-NDE with self-attention was evaluated.

Figure 8 shows an example of the attention score function $A(t)$. In this example, "Really?!" serves as the query and Att-NDE is used to predict the emotion of this utterance. The utterances marked in red were said by the interviewee and those in green by the interviewer. Att-NDE pays most of its attention to three utterances, "So let's talk about ...", "But there'll be perhaps ..." and "All right then, we'll have ...", all of which were said by the interviewer.
This is reasonable, since the emotion of the interviewee is affected by the interviewer. Att-NDE also attends to "Good to know.", which might not be expected to attract attention. This is because this utterance answers two other utterances, "So let's talk about ..." and "But there'll be perhaps ...". Namely, Att-NDE predicts how the interviewee will reply to those two utterances, which are related to the query.

Figure 8: An example of a dialogue between two persons from MELD, with "Really?!" as the query utterance; both the label and the prediction are surprise.

Another example is provided in Figure 9, where five persons are chatting. In this situation, person B couldn't distinguish two girls and made a mistake. Att-NDE pays the highest attention to the utterances "I mean, was it Gina?" and "Which one is Gina?". This example shows that the other persons still couldn't distinguish these two girls. Another two utterances, "How can you not know which one?" and "I mean that's unbelievable.", are also attended by our model; both blame person B. Although the label is surprise, the prediction anger is also acceptable.

Figure 9: An example of a dialogue among five persons from MELD, where the query utterance is labeled as surprise.

From the results in the two tasks, it is obvious that the behavior of the sequence data is substantially reflected by the attention scores. For human activity, which is an irregular time series, the attention score function is smoother and closer to a continuous function, while for MELD, which is seen as a regular time series in the literature, it is more like a discrete function. Stamping the start time as the time index may be too naive. Table 3 shows the weighted average of precision and recall (i.e. the F1 score) using different methods. Att-NDE with self attention achieves the best performance in sentiment classification.

Table 3: F1-score on MELD.

| Models | F1-score | Attention type |
|---|---|---|
| RNN-Δt | 0.539 | - |
| NDE | 0.551 | - |
| Att-NDE | 0.560 | att |
| Att-NDE | 0.565 | self-att |

### Evaluation on Mortality Prediction

PhysioNet (Silva et al. 2012) was collected from the intensive care unit (ICU) and contains the first forty-eight hours of patients' physiological signals such as respiration rate, heart rate (HR), etc. There are four time-invariant features: age, gender, height and ICU type. Figure 10 shows the scenario of irregular samples.

Figure 10: Different features from irregular samples in PhysioNet. Bold bars indicate the observation time points.

The task is to predict the in-hospital mortality rate. The hyperparameter setting was similar to the previous tasks.
The number of epochs was 40 and the hidden state size was 20. The ODE function was built by a five-layer fully-connected network. Because positive samples account for only 13.75% of the data, the area under the curve (AUC) is used to evaluate model performance, as shown in Table 4. Att-NDE performs better than the RNN and NDE in terms of AUC.

Table 4: AUC on PhysioNet.

| Models | AUC |
|---|---|
| RNN-Δt | 0.783 |
| NDE | 0.826 |
| Att-NDE | 0.833 |

## Conclusions

This paper presented continuous-time attention for sequential learning over irregular sequence data. The attention scheme was derived by merging with the neural differential equation to build a continuous-time state machine. The resulting Att-NDE represents the mapping from observations to targets, where the continuous-time functions of the attention score and the context vector are computed. The experimental results showed that adding continuous-time attention improves the robustness to missing time samples. The properties of continuous-time attention were also investigated. In future work, the limitations of Att-NDE will be handled. In the self-attention setting, we basically feed all query vectors to the ODE solver and solve them individually, which is memory inefficient. In addition, we will extend our method to other NDE variants such as the latent ODE (Chen et al. 2018; Rubanova, Chen, and Duvenaud 2019) or the neural stochastic differential equation (Liu et al. 2019), where the stochastic property is preserved. The proposed attention is also feasible for other types of time series (Yildiz, Heinonen, and Lahdesmaki 2019; Jia and Benson 2019). The time information of each word rather than each utterance will be used for emotion recognition.

## Acknowledgments

This work is supported by the Ministry of Science and Technology, Taiwan, under MOST 109-2634-F-009-024.

## References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of International Conference on Learning Representations.

Bauer, S.; Schölkopf, B.; and Peters, J. 2016. The arrow of time in multivariate time series. In Proc. of International Conference on Machine Learning, 2043–2051.

Cao, W.; Wang, D.; Li, J.; Zhou, H.; Li, L.; and Li, Y. 2018. BRITS: Bidirectional recurrent imputation for time series. In Advances in Neural Information Processing Systems, 6775–6785.

Che, Z.; Purushotham, S.; Cho, K.; Sontag, D. A.; and Liu, Y. 2018. Recurrent neural networks for multivariate time series with missing values. Scientific Reports 8.

Chen, R. T. Q.; Rubanova, Y.; Bettencourt, J.; and Duvenaud, D. K. 2018. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, 6571–6583.

Chien, J.-T. 2019. Deep Bayesian natural language processing. In Proc. of Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, 25–30.

Chien, J.-T.; and Chen, Y.-H. 2021. Continuous-time self-attention in neural differential equation. In Proc. of International Conference on Acoustics, Speech, and Signal Processing.

Chien, J.-T.; and Hsu, P.-C. 2020. Stochastic curiosity exploration for dialogue systems. In Proc. of Annual Conference of International Speech Communication Association, 3885–3889.

Chien, J.-T.; and Ku, Y.-C. 2015. Bayesian recurrent neural network for language modeling. IEEE Transactions on Neural Networks and Learning Systems 27(2): 361–374.

Chien, J.-T.; and Lieow, W. X. 2019. Meta learning for hyperparameter optimization in dialogue system. In Proc. of Annual Conference of International Speech Communication Association, 839–843.

Chien, J.-T.; and Lin, T.-A. 2018. Supportive attention in end-to-end memory networks. In Proc.
of IEEE International Workshop on Machine Learning for Signal Processing, 1–6.

Chien, J.-T.; and Lin, T.-A. 2020. Supportive and self attentions for image caption. In Proc. of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 1713–1718.

Chien, J.-T.; and Wang, C.-W. 2019. Self attention in variational sequential learning for summarization. In Proc. of Annual Conference of International Speech Communication Association, 1318–1322.

Chuang, Y.-H.; Huang, C.-L.; Chang, W.-W.; and Chien, J.-T. 2020. Automatic classification of myocardial infarction using spline representation of single-lead derived vectorcardiography. Sensors 20(24): 7246.

Cordonnier, J.-B.; Loukas, A.; and Jaggi, M. 2020. On the relationship between self-attention and convolutional layers. In Proc. of International Conference on Learning Representations.

Dupont, E.; Doucet, A.; and Teh, Y. W. 2019. Augmented neural ODEs. In Advances in Neural Information Processing Systems, 3140–3150.

El Naqa, I.; Pandey, G.; Aerts, H.; Chien, J.-T.; Andreassen, C. N.; Niemierko, A.; and Ten Haken, R. K. 2018. Radiation therapy outcomes models in the era of radiomics and radiogenomics: Uncertainties and validation. International Journal of Radiation Oncology, Biology, Physics 102(4): 1070–1073.

Elman, J. L. 1990. Finding structure in time. Cognitive Science 14(2): 179–211.

Jia, J.; and Benson, A. R. 2019. Neural jump stochastic differential equations. In Advances in Neural Information Processing Systems, 9847–9858.

Kaluza, B.; Mirchevska, V.; Dovgan, E.; Lustrek, M.; and Gams, M. 2010. An agent-based approach to care in independent living. In Proc. of International Joint Conference on Ambient Intelligence, 177–186.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kuo, C.-Y.; and Chien, J.-T. 2018. Markov recurrent neural networks. In Proc. of IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 1–6.

Lipton, Z. C.; Kale, D.; and Wetzel, R. 2016. Directly modeling missing data in sequences with RNNs: Improved classification of clinical time series. In Proc. of Machine Learning for Healthcare Conference, 253–270.

Liu, X.; Xiao, T.; Si, S.; Cao, Q.; Kumar, S. P. S.; and Hsieh, C.-J. 2019. Neural SDE: Stabilizing neural ODE networks with stochastic noise. arXiv preprint arXiv:1906.02355.

Luong, T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. In Proc. of Conference on Empirical Methods in Natural Language Processing, 1412–1421.

Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; and Mihalcea, R. 2019. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In Proc. of Annual Meeting of the Association for Computational Linguistics.

Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; and Shlens, J. 2019. Stand-alone self-attention in vision models. In Advances in Neural Information Processing Systems, 68–80.

Rubanova, Y.; Chen, R. T. Q.; and Duvenaud, D. K. 2019. Latent ordinary differential equations for irregularly-sampled time series. In Advances in Neural Information Processing Systems, 5320–5330.

Shi, X.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.; and Woo, W. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, 802–810.

Silva, I.; Moody, G. B.; Scott, D. J.; Celi, L.; and Mark, R. G. 2012. Predicting in-hospital mortality of ICU patients: The PhysioNet/Computing in Cardiology Challenge 2012. Computing in Cardiology, 245–248.
Su, J.; Wu, S.; Xiong, D.; Lu, Y.; Han, X.; and Zhang, B. 2018. Variational recurrent neural machine translation. In Proc. of AAAI Conference on Artificial Intelligence, 5488–5495.

Tseng, H.-H.; Luo, Y.; Cui, S.; Chien, J.-T.; Ten Haken, R. K.; and El Naqa, I. 2017. Deep reinforcement learning for automated radiation adaptation in lung cancer. Medical Physics 44(12): 6690–6705.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.

Wang, D.; Cao, W.; Li, J.; and Ye, J. 2017. DeepSD: Supply-demand prediction for online car-hailing services using deep neural networks. In Proc. of International Conference on Data Engineering, 243–254.

Yildiz, C.; Heinonen, M.; and Lahdesmaki, H. 2019. ODE2VAE: Deep generative second order ODEs with Bayesian neural networks. In Advances in Neural Information Processing Systems, 13412–13421.

Zhang, J.; Zhang, P.; Kong, B.; Wei, J.; and Jiang, X. 2021. Continuous self-attention models with neural ODE networks. In Proc. of AAAI Conference on Artificial Intelligence.