FlowHMM: Flow-based continuous hidden Markov models

Paweł Lorek
Mathematical Institute, University of Wrocław; Tooploox
pawel.lorek@math.uni.wroc.pl

Rafał Nowak
Institute of Computer Science, University of Wrocław; Tooploox
rafal.nowak@cs.uni.wroc.pl

Tomasz Trzciński
Warsaw University of Technology; Jagiellonian University of Cracow; Tooploox; IDEAS NCBR
tomasz.trzcinski@pw.edu.pl

Maciej Zięba
Department of Artificial Intelligence, Wrocław University of Science and Technology; Tooploox
maciej.zieba@pwr.edu.pl

Abstract

Continuous hidden Markov models (HMMs) assume that observations are generated from a mixture of Gaussian densities, limiting their ability to model more complex distributions. In this work, we address this shortcoming and propose novel continuous HMM models, dubbed FlowHMMs, that enable learning general continuous observation densities without constraining them to follow a Gaussian distribution or their mixtures. To that end, we leverage deep flow-based architectures that model complex, non-Gaussian functions and propose two variants of training a FlowHMM model. The first one, based on a gradient-based technique, can be applied directly to continuous multidimensional data, yet its application to larger data sequences remains computationally expensive. Therefore, we also present a second approach to training our FlowHMM that relies on the co-occurrence matrix of discretized observations and considers the joint distribution of pairs of co-observed values, hence rendering the training time independent of the training sequence length. As a result, we obtain a model that can be flexibly adapted to the characteristics and dimensionality of the data. We perform a variety of experiments in which we compare both training strategies with a baseline of Gaussian mixture models. We show that, in terms of the quality of the recovered probability distribution, the accuracy of hidden-state prediction, and the likelihood of unseen data, our approach outperforms the standard Gaussian methods.

1 Introduction

Hidden Markov models (HMMs) are a standard tool in the modeling and analysis of time series data. Although structurally simple, they have been successfully applied in a wide variety of applications, ranging from finance [1] and speech recognition [2] to computational biology [3] and climate modeling [4]. HMMs are capable of solving complex problems, but their adoption is limited for several reasons. First, the training process of HMMs typically relies on the Baum-Welch algorithm [5], a particular case of the expectation-maximization (EM) method, which offers a relatively slow convergence to a local maximum. To reduce this burden, several discrete HMM approaches introduce the so-called co-occurrence matrix, which aggregates information about the probability of jointly observed values in the chain, and estimate it together with the parameters of the whole model [6, 7, 8, 9, 10]. Although the co-occurrence matrix is computed only once and then used to significantly reduce convergence time during optimization, its application is strictly limited to discrete HMMs and cannot be easily generalized to their continuous variants.

Figure 1: The concept of FlowHMM for $L = 3$ states and transition matrix $A$. Each emission distribution, characterized by a density $f_{\beta_l}(\cdot)$, is modeled using a separate flow component. Thanks to this, they can adjust to complex, non-Gaussian distributions.
Secondly, continuous HMM models are restricted to follow standard parametrized distributions when modeling the observations. Most of them use either Gaussians or other parametric families of distributions [11, 12], while others rely on mixtures of Gaussians or apply semiparametric distribution estimation [13, 14]. As a result, the existing HMMs have limited ability to model complex observations that do not follow the distributions mentioned above. This, in turn, hinders their application in real-life use cases, e.g., in human action recognition [15].

In this work, we address the shortcomings of the existing HMMs and introduce FlowHMM, a novel continuous hidden Markov model that learns a general continuous distribution of observations by exploiting the properties of flow-based models [16]. The core idea of our approach is to model the distributions of the observations with normalizing flows, instead of Gaussians or their mixtures. Flows map a simple Gaussian prior to more complex distributions using a parametrized neural network. This formulation enables flow-based models to surpass parametric models in their flexibility to handle data samples of unknown distributions. Moreover, flow-based models naturally extend to a multimodal setting, which effectively renders obsolete the tedious process of tuning the number of hidden states. Practically, training the model with flow-based components requires gradient-based optimization. To achieve that, we propose two variants of training the FlowHMM model. The first approach is based on the maximum likelihood (ML) technique and can be applied directly to continuous multidimensional data. In the second method, we discretize the continuous values during training and leverage the co-occurrence matrix. As a result, we provide an end-to-end training procedure that jointly optimizes the flow-based component parameters and the co-occurrence matrix using standard gradient-based techniques. Since the model is trained in a discretized form, the optimization process is simple and more efficient than the competing HMM training procedures, while during inference we can still use a continuous version of our FlowHMM model.

To summarize, the main contribution of our paper is a novel continuous HMM method dubbed FlowHMM, with two alternative training scenarios, capable of modeling complex multimodal distributions of observations without constraining them to follow Gaussians. Not only does it outperform the competing approaches, but it also increases the efficiency of the required optimization procedure.

2 Background

2.1 Hidden Markov models

Let $\{X_k\}_{k \geq 0}$ be an ergodic, time-homogeneous Markov chain over hidden states $S = \{s_1, \ldots, s_L\}$ with a stationary distribution $\pi = (\pi_1, \ldots, \pi_L)$ and a transition matrix $A$, i.e., for any $k$ we have $P(X_{k+1} = s_j \mid X_k = s_i) = A(i, j)$. Let $\{Y_k\}_{k \geq 0}$ be a sequence of random variables taking values in $V$ (the observation space), which can be continuous or discrete. Let us fix some time horizon $T$. The random variables $Y_{0:T} = (Y_0, \ldots, Y_T)$ are independent conditionally on the state sequence $X_{0:T} = (X_0, \ldots, X_T)$, i.e.:

$$p(Y_{0:T} = y_{0:T} \mid X_{0:T} = x_{0:T}) = \prod_{k=0}^{T} p(y_k \mid x_k), \tag{1}$$

where $p(y \mid x) \equiv P(Y = y \mid X = x)$ represents the so-called emission probabilities for discrete observations. We will shortly write $p(y_{0:T})$ for $p(Y_{0:T} = y_{0:T})$ and similarly, e.g., $p(y_{0:T} \mid x_{0:T})$ for $p(Y_{0:T} = y_{0:T} \mid X_{0:T} = x_{0:T})$, to simplify the notation. In the continuous case, $p(y \mid x)$ represents the conditional emission density function for a given state $x$.
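To make the notation concrete, the following minimal sketch samples from such a model; it directly mirrors the conditional-independence structure of Eq. (1). All parameter values are illustrative placeholders (Gaussian emissions are used only for simplicity), not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative HMM with L = 2 hidden states (all values below are made up).
pi = np.array([0.6, 0.4])                      # stationary / initial distribution
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])                     # transition matrix A(i, j)
emit = [lambda: rng.normal(-2.0, 0.5),         # stand-in for p(y | x = s_1)
        lambda: rng.normal(3.0, 1.0)]          # stand-in for p(y | x = s_2)

def sample_hmm(T):
    """Draw (x_0..x_T, y_0..y_T); the Y_k are conditionally independent given X_k, cf. Eq. (1)."""
    x = np.empty(T + 1, dtype=int)
    y = np.empty(T + 1)
    x[0] = rng.choice(2, p=pi)
    y[0] = emit[x[0]]()
    for k in range(1, T + 1):
        x[k] = rng.choice(2, p=A[x[k - 1]])    # P(X_k = s_j | X_{k-1} = s_i) = A(i, j)
        y[k] = emit[x[k]]()
    return x, y

x, y = sample_hmm(10)
```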
For a Gaussian mixture HMM (which we simply call a Gaussian HMM) we assume that $p(y \mid x) = \sum_{k=1}^{K} \alpha_{x,k}\, \mathcal{N}(y; \mu_{x,k}, \sigma^2_{x,k})$, where $\alpha_{x,\cdot}$ is a distribution on $\{1, \ldots, K\}$. The values of the parameters $\{\alpha_{x,k}, \mu_{x,k}, \sigma^2_{x,k}\}_{k=1}^{K}$ are determined by the conditioning value $x$. Note that we assume that each mixture has the same number of components $K$. For $K = 1$ we have a classic Gaussian HMM.

An HMM model can be parametrized by $\theta = \{\pi, A, \beta\}$, where $\beta = \{\beta_1, \ldots, \beta_L\}$ and $\beta_l$ stands for the parameters of $p(y \mid l)$ for a given state $l$. In the discrete case, $\beta_l$ represents the parameters of a categorical distribution, while for a Gaussian HMM we have $\beta_l = \{\alpha_{l,k}, \mu_{l,k}, \sigma^2_{l,k}\}_{k=1}^{K}$. The probability of observing the sequence of observations $y_{0:T}$ can be expressed as:

$$p(y_{0:T}; \theta) = \sum_{x_{0:T} \in S^{T+1}} p(y_{0:T} \mid x_{0:T}; \beta)\, p(x_{0:T}; A, \pi), \tag{2}$$

where $p(y_{0:T} \mid x_{0:T}; \beta)$ is given by Eq. (1), and $p(x_{0:T}; A, \pi)$ can be expressed as:

$$p(x_{0:T}; A, \pi) = P(x_0) \prod_{k=1}^{T} P(x_k \mid x_{k-1}) = \pi_{x_0} \prod_{k=1}^{T} A(x_{k-1}, x_k). \tag{3}$$

2.2 Normalizing Flows

Normalizing flows [16] are generative models that can be efficiently trained via direct likelihood estimation thanks to the application of the change-of-variables formula. Practically, they utilize a sequence of parametric, invertible transformations, $y = h_n \circ \cdots \circ h_1(z)$. The goal of the transformation is to map $z$ from the known normal distribution $\mathcal{N}(z; 0, I)$ to the more complex distribution described by a density function $f(y)$ from the observation domain. The log-probability of $y$ can be expressed as:

$$\log f(y) = \log \mathcal{N}(z; 0, I) - \sum_{i=1}^{n} \log \left| \det \frac{\partial h_i}{\partial z_{i-1}} \right|, \tag{4}$$

where $z_{i-1}$ denotes the input of the transformation $h_i$.

One of the main challenges in designing normalizing flows is the selection of a proper form of the transformation functions $h_i$. The sequence of discrete transformations can be replaced by a continuous equivalent through the application of Continuous Normalizing Flows (CNFs) [17], where the aim is to solve a differential equation of the form $\frac{dz}{dt} = g_\beta(z(t), t)$, where $g_\beta(z(t), t)$ represents the dynamics function described by parameters $\beta$. Our goal is to find the solution of the equation at $t_1$, $y := z(t_1)$, given the initial state $z := z(t_0)$ with a known prior. The transformation function $h_\beta$ and its inverse are defined as:

$$y = h_\beta(z) = z + \int_{t_0}^{t_1} g_\beta(z(t), t)\, dt, \qquad h_\beta^{-1}(y) = y - \int_{t_0}^{t_1} g_\beta(z(t), t)\, dt. \tag{5}$$

The log-probability of $y$ can be computed as (where $h_\beta^{-1}(y) = z$):

$$\log f_\beta(y) = \log \mathcal{N}(h_\beta^{-1}(y); 0, I) - \int_{t_0}^{t_1} \operatorname{Tr}\left( \frac{\partial g_\beta(z(t), t)}{\partial z(t)} \right) dt. \tag{6}$$

The choice of CNFs is motivated by the fact that we focus on modeling distributions of one- or low-dimensional data. CNFs were successfully applied in models such as NGGP [18], PointFlow [19], or StyleFlow [20], where the dimensionality and characteristics of the data are similar. This is also supported by the empirical results provided by the authors of FFJORD ([17], Table 2), where the proposed approach performs better than discrete flows like RealNVP [21] or Glow [22] for low-dimensional data in terms of normalized log-likelihood. Such models were used in [23]. Moreover, flows that use coupling layers (RealNVP, Glow) and autoregressive flows (MAF [24]) are not meaningful for 1D data. When operating on 1D data, we do not care about simplifying the estimation of the Jacobian, and any invertible differentiable transformation can be applied. At the same time, we need a complex, well-parameterized transformation, which is delivered by the dynamics function of a CNF.
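To make Eqs. (5)-(6) concrete, below is a minimal 1D sketch (not the paper's implementation) that evaluates a CNF log-density with an off-the-shelf ODE solver. The dynamics function `g` and its parameters are illustrative stand-ins for the small neural network used in FlowHMM; in 1D the trace term reduces to the scalar derivative of `g` with respect to `z`.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import norm

# Hypothetical dynamics function g_beta(z, t); in FlowHMM this role is played by a
# trainable network with parameters beta.
def g(z, t, a=1.5, b=0.5):
    return np.tanh(a * z + b * t)

def dg_dz(z, t, eps=1e-5):
    # In 1D the "trace of the Jacobian" is just dg/dz (approximated numerically here).
    return (g(z + eps, t) - g(z - eps, t)) / (2.0 * eps)

def log_density(y, t0=0.0, t1=1.0):
    """log f_beta(y) following Eq. (6): run the flow backwards from y to z(t0)
    while accumulating the integral of dg/dz along the trajectory."""
    def ode(t, state):
        z, _ = state
        return [g(z, t), dg_dz(z, t)]
    sol = solve_ivp(ode, (t1, t0), [y, 0.0])       # integrate from t1 back to t0
    z0, acc = sol.y[:, -1]                         # acc = -int_{t0}^{t1} dg/dz dt
    return norm.logpdf(z0) + acc

print(log_density(0.3))
```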
3 FlowHMM

In this section we introduce FlowHMMs: HMM variants that use continuous normalizing flows, capable of modeling the observations using complex, non-Gaussian distributions. The idea behind this approach is to model each of the conditional densities $p(y \mid x = s_l) = f_{\beta_l}(y)$, for each of the considered states $s_l \in S$, using a separate CNF module. In practice, $p(y \mid x = s_l)$ can be calculated using the formula given by Eq. (6) with the parameters $\beta_l$ dedicated to the state $s_l$. The idea of our approach is illustrated in Fig. 1.

In FlowHMM we assume that $\{X_k\}$ is stationary, i.e., the stationary distribution $\pi$ is also its initial distribution. In such a case, instead of using $A$, the model can be equivalently represented by the joint state probabilities:

$$S(i, j) = P(X_k = s_i, X_{k+1} = s_j) = P(X_{k+1} = s_j \mid X_k = s_i)\, P(X_k = s_i) = A(i, j)\, \pi_i. \tag{7}$$

Note that $\pi_i$ can be computed as $\sum_j S(i, j)$, which can be written as $\pi_i = \mathbf{1}_i S \mathbf{1}^T$, where $\mathbf{1}$ denotes a vector consisting of ones and $\mathbf{1}_i$ has a one at position $i$ and zeros otherwise. With such a formulation, the FlowHMM model can be parametrized by $\theta = \{S, \beta_1, \ldots, \beta_L\}$.

We propose two variants of FlowHMM models, which differ mainly in the training process: the gradient-based model FML, which is trained with a maximum likelihood approach directly on continuous data, and the co-occurrence-matrix-based model FQ, which utilizes the co-occurrence matrix and is trained in an end-to-end setting using discretized sequences. The first of the proposed methods does not require a discretization step, but it is costly and inefficient for longer sequences. FQ eliminates that problem and makes the training time independent of the length of the training sequence (only estimating the empirical co-occurrence matrix depends on $T$, and that cost is marginal). In the upcoming sections we introduce both models in more detail.

3.1 Training the FlowHMM FML model

Given a sequence of observations $y_{0:T} = (y_0, \ldots, y_T)$, the model is trained by optimizing the log-likelihood $\log p(y_{0:T}; \theta)$ (see Eq. (2)), where we aim at finding the optimal values of $S^*$ (and thus $A^*$ and $\pi^*$) and the flow parameters $\beta_l^*$, such that $\theta^* = \{S^*, \beta_1^*, \ldots, \beta_L^*\} = \arg\max_\theta \log p(y_{0:T}; \theta)$. In order to satisfy the constraints on $S$ (i.e., that $\mathbf{1} S \mathbf{1}^T = 1$ and each entry is non-negative, which we denote by $S \succeq 0$), we parametrize the matrix $S$ by a real-valued matrix $\tilde{S}$, also of size $L \times L$, using the following softmax function:

$$S(s_1, s_2) = \frac{\exp(\tilde{S}(s_1, s_2))}{\sum_{s_i, s_j} \exp(\tilde{S}(s_i, s_j))}. \tag{8}$$

Thanks to this representation, we train $\tilde{S}$ and the $\beta_l$'s iteratively with a gradient-based approach, by maximizing the incomplete log-likelihood $\log p(y_{0:T}; \theta)$, calculated directly using the forward algorithm.

3.2 Training the FlowHMM FQ model

For the FQ model we introduce the co-occurrence matrix that represents the joint distribution of two consecutive observations: $Q(y_1, y_2) = p(Y_{k+1} = y_2, Y_k = y_1)$. The matrix represents a categorical distribution; its $(i, j)$-th entry is the probability of observing the pair of values $(v_i, v_j)$ at two fixed consecutive steps $k$ and $k+1$. Note that it is independent of $k$, since throughout the paper we assume that the underlying Markov chain on hidden states is stationary, and thus the bivariate distribution of $(X_k, X_{k+1})$ is independent of $k$. The matrix can be rewritten as:

$$Q(y_1, y_2) = \sum_{s_i, s_j \in S} p(y_1 \mid s_i)\, S(i, j)\, p(y_2 \mid s_j). \tag{9}$$

Assuming a discrete set of observations $V = \{v_1, \ldots, v_M\}$, it can be further expressed as $Q = P^T S P$, where $P$ collects the probabilities of all possible observations at each hidden state in a matrix $P(s_i, v_j) = p(v_j \mid s_i)$. In this case there are $M^2$ possible observation pairs $(y_1, y_2) \in V \times V$. Note that the matrices $Q$, $S$, $P$ are of sizes $M \times M$, $L \times L$ and $L \times M$, respectively.
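A minimal numpy sketch of this parametrization is given below. Random placeholder rows stand in for the flow-based rows of $P$, and the sizes are illustrative; the point is only to show Eq. (8), the factorization $Q = P^T S P$ of Eq. (9), and how $\pi$ and $A$ are recovered from $S$.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M = 3, 30                                  # hidden states, grid points (illustrative)

S_tilde = rng.normal(size=(L, L))             # unconstrained parameters, cf. Eq. (8)
S = np.exp(S_tilde) / np.exp(S_tilde).sum()   # softmax over all entries -> entries sum to 1

P = rng.random(size=(L, M))                   # placeholder for flow densities on the grid
P /= P.sum(axis=1, keepdims=True)             # each row is a distribution over the grid

Q = P.T @ S @ P                               # model co-occurrence matrix, Eq. (9)
assert np.isclose(Q.sum(), 1.0)

pi = S.sum(axis=1)                            # pi_i = sum_j S(i, j)
A = S / pi[:, None]                           # A(i, j) = S(i, j) / pi_i
```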
Moreover, we have $\sum_{v_i, v_j \in V} Q(v_i, v_j) = \sum_{s_i, s_j \in S} S(s_i, s_j) = 1$ and $\sum_{v \in V} P(s_i, v) = 1$ for all $s_i \in S$. In matrix form, these can be written as $\mathbf{1} Q \mathbf{1}^T = \mathbf{1} S \mathbf{1}^T = 1$ and $P \mathbf{1}^T = \mathbf{1}^T$ (note that $\mathbf{1}$ must be of the appropriate size). Given observations $y_{0:T}$, the matrix $Q$ can be empirically estimated by:

$$\hat{Q}(v_i, v_j) = \frac{1}{T} \sum_{k=0}^{T-1} \mathbb{I}(y_k = v_i)\, \mathbb{I}(y_{k+1} = v_j), \tag{10}$$

for all pairs $(v_i, v_j) \in V \times V$. The problem of training the HMM is to find parameters (matrices $S$ and $P$) such that the co-occurrence matrix $Q$ is close (in some sense) to the empirical co-occurrence matrix $\hat{Q}$. We can formulate the problem as (for some distance $\operatorname{dist}$):

$$\min_{P \in \mathbb{R}^{L \times M},\, S \in \mathbb{R}^{L \times L}} \operatorname{dist}(\hat{Q}, P^T S P), \quad \text{subject to } \mathbf{1} S \mathbf{1}^T = 1,\ P \mathbf{1}^T = \mathbf{1}^T,\ P \succeq 0,\ S \succeq 0. \tag{11}$$

This problem formulation has a couple of advantages compared to standard likelihood-based optimization. First, the empirical co-occurrence matrix $\hat{Q}$ is computed once and its size is independent of the sequence length $T$. Second, the given objective can be easily optimized using gradient-based techniques. On the other hand, the constraints on the matrices $P$ and $S$ must be satisfied. Moreover, the set $V$ is discrete, while we aim to design the model for continuous observations. In order to satisfy the constraints for the matrix $S$ we use the representation given by Eq. (8). In order to represent the matrix $P$ we use flow-based emission probabilities.

Flow-based emission probabilities. With each hidden state $l$ we associate a flow model described by a density function $f_{\beta_l}$, which can be calculated from Eq. (6), where $\beta_l$ is a set of trainable parameters of the flow. We construct $P$ based on these models (in such a case $P \equiv P_\beta$). The $i$-th row of the matrix is the density $f_{\beta_i}(\cdot)$ evaluated at $v_1, \ldots, v_M$ and normalized as follows:

$$P_\beta(s_i, v_j) = \frac{f_{\beta_i}(v_j)}{\sum_{k=1}^{M} f_{\beta_i}(v_k)}. \tag{12}$$

Similarly, the optimization problem (11) becomes:

$$\min_{\beta,\, \tilde{S} \in \mathbb{R}^{L \times L}} \operatorname{dist}(\hat{Q}, P_\beta^T S P_\beta). \tag{13}$$

As a distance function, we propose to use the Kullback-Leibler divergence:

$$\min_{\beta,\, \tilde{S} \in \mathbb{R}^{L \times L}} \mathcal{L}, \qquad \mathcal{L} = \sum_{i, j} Q_\beta(i, j) \log \frac{Q_\beta(i, j)}{\hat{Q}(i, j)}, \tag{14}$$

where $Q_\beta = P_\beta^T S P_\beta$. We apply this divergence instead of the $L_2$ distance used, e.g., in [7], because it is a more natural measure for comparing two distributions. We also observed during the empirical evaluation that using this divergence gives better stability and convergence of the training process.

The final training procedure of the FlowHMM FQ model is as follows. Let $y^{train}_{0:T}$ be the training set and let $y^{test}_{0:T}$ be the test set. For a fixed $M$ we construct the grid $\Gamma = (\gamma_1, \ldots, \gamma_M)$ using one of the approaches described in A.1. Next, we create the discretized training data $y^{train, \Gamma}_{0:T}$, where $y^{train, \Gamma}_i = \arg\min_{\gamma \in \Gamma} \|\gamma - y^{train}_i\|_2$. The grid values then play the role of the set $V$, i.e., $V = \Gamma$. Next, we calculate the empirical co-occurrence matrix $\hat{Q}$:

$$\hat{Q}(v_i, v_j) = \frac{1}{T} \sum_{k=0}^{T-1} \mathbb{I}(y^{train, \Gamma}_k = v_i)\, \mathbb{I}(y^{train, \Gamma}_{k+1} = v_j), \tag{15}$$

and the matrices $S$ and $P_\beta$ (using Eqs. (8) and (12)), which are further used to calculate the loss given by Eq. (14). In practice, we add Gaussian perturbations to the grid values while calculating the matrix $P$:

$$P_\beta(s_i, v_j) = \frac{f_{\beta_i}(v_j + \epsilon)}{\sum_{k=1}^{M} f_{\beta_i}(v_k + \epsilon)}, \tag{16}$$

where $\epsilon$ is a random sample from $\mathcal{N}(0, \sigma^2_{noise})$, and $\sigma^2_{noise}$ is a hyperparameter of the method. This is one of the standard tricks used when training normalizing flows; it imitates the situation where we have access to an infinite number of training examples. These perturbations prevent overfitting of the flow-based components caused by repeatedly observing the same grid values during training. Next, all of the parameters of FlowHMM are updated with a gradient-based approach. The procedure is repeated until convergence.
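The two ingredients specific to FQ training, the empirical co-occurrence matrix of Eq. (15) and the divergence of Eq. (14), can be sketched as follows. This is a hedged numpy illustration with a random placeholder sequence; in the actual training the model matrix is $Q_\beta = P_\beta^T S P_\beta$ with the perturbed grid evaluations of Eq. (16), and the gradients are propagated to $\beta$ and $\tilde{S}$ by an autodiff framework.

```python
import numpy as np

def empirical_cooccurrence(y_disc, M):
    """Eq. (15): hat Q(v_i, v_j) = (1/T) * #{k : y_k = v_i, y_{k+1} = v_j},
    where y_disc holds grid-point indices of the discretized training sequence."""
    Q_hat = np.zeros((M, M))
    np.add.at(Q_hat, (y_disc[:-1], y_disc[1:]), 1.0)
    return Q_hat / len(y_disc[:-1])

def kl_loss(Q_model, Q_hat, eps=1e-12):
    """Eq. (14): KL-style divergence between the model and empirical co-occurrence matrices."""
    return np.sum(Q_model * (np.log(Q_model + eps) - np.log(Q_hat + eps)))

# Illustrative usage with a random discretized sequence and a grid of size M = 30.
rng = np.random.default_rng(0)
M, T = 30, 10_000
y_disc = rng.integers(0, M, size=T + 1)       # indices of the nearest grid points
Q_hat = empirical_cooccurrence(y_disc, M)
```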
The training procedure is presented in Algorithm 1.

Algorithm 1 Training using the FQ technique
Require: $\hat{Q}$: empirical co-occurrence matrix from Eq. (10).
Parameters: $\beta = \{\beta_1, \ldots, \beta_L\}$: flow parameters; $\tilde{S}$: parameters representing the un-normalized co-occurrence matrix.
Hyperparameters: $L$: number of hidden states; $\alpha$: step size; $\sigma^2_{noise}$: noise variance.
1: function TRAIN($\hat{Q}$, $L$, $\alpha$)
2:   Initialize $\beta$ and $\tilde{S}$.
3:   while not convergent do
4:     Calculate $S$ from Eq. (8) and $P_\beta$ from Eq. (16).
5:     Calculate the loss function $\mathcal{L}$ from Eq. (14) using $\hat{Q}$.
6:     $\tilde{S} \leftarrow \tilde{S} - \alpha \nabla_{\tilde{S}} \mathcal{L}$
7:     for each $l \in \{1, \ldots, L\}$ do
8:       $\beta_l \leftarrow \beta_l - \alpha \nabla_{\beta_l} \mathcal{L}$
9:     end for
10:   end while
11:   return $\beta_1, \ldots, \beta_L, \tilde{S}$
12: end function

3.3 Inference with FlowHMM

As postulated before, our model is designed for continuous problems: both training techniques return an estimate of $S$ and the parameters that represent the emission distributions for each of the states. Thus, during the inference stage, we simply calculate $A$ and $\pi$ from $S$ and apply the forward procedure to the test data $y^{test}_{0:T}$ to calculate $p(y^{test}_{0:T}; \theta)$. We can also determine the hidden state values using the Viterbi procedure. Concluding, FlowHMM can be used in the same applications as standard continuous HMM models, but with no restrictions on the emission distributions.

FlowHMM for multidimensional observations. Our approach can be easily extended to multidimensional observations, i.e., to the case where $y^{train}_{0:T}$ are from $\mathbb{R}^d$. In such a case the FML model applies without changes, so let us focus on the description of FQ. On the one hand, the extension is straightforward: we construct a grid $\Gamma = (\gamma_1, \ldots, \gamma_M)$ of $d$-dimensional points, create the discretized training set $y^{train, \Gamma}_{0:T}$ and the empirical co-occurrence matrix $\hat{Q}$ from Eq. (15), and afterwards compute $P_\beta$ from Eq. (16) and the gradients $\nabla_{\beta_l} \mathcal{L}$ of $\mathcal{L}$ given in Eq. (14). In other words, we proceed with Algorithm 1. Note that CNFs are well suited for multivariate observations. On the other hand, there is a challenging aspect of training the FQ model for multidimensional data: an effective discretization technique. We elaborate on this in A.1, where we propose a new method for constructing the grid.

We suggest using the FQ model for lower-dimensional data, especially with long observation sequences (recall that the matrix $\hat{Q}$ is computed only once); in such a case FML will usually be very slow, as each epoch loops through the whole observation sequence. On the other hand, FML is better suited for high-dimensional data and short observation sequences. For longer observation sequences, in order to shorten the execution time, we apply a small trick: at each epoch we sample a subsequence of fixed length, e.g., $10^3$. There is a trade-off between execution time and performance, see Table 7 in A.7.
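To close this section, here is a minimal sketch of the inference step described in Section 3.3: $\pi$ and $A$ are recovered from the learned $S$, and the forward algorithm (with scaling) evaluates the test log-likelihood. Unit-variance Gaussian stand-ins replace the trained flow log-densities $f_{\beta_l}$ so the snippet is self-contained; none of the numbers come from the paper.

```python
import numpy as np

def log_likelihood(y, pi, A, log_emission):
    """Scaled forward algorithm: returns log p(y_0:T).
    log_emission(y_k) must return an array of length L with the per-state log-densities."""
    alpha = pi * np.exp(log_emission(y[0]))
    ll = np.log(alpha.sum())
    alpha /= alpha.sum()
    for yk in y[1:]:
        alpha = (alpha @ A) * np.exp(log_emission(yk))
        ll += np.log(alpha.sum())
        alpha /= alpha.sum()
    return ll

# Illustrative usage: a learned S (placeholder values), then A and pi recovered from it.
S = np.array([[0.30, 0.05, 0.02],
              [0.03, 0.25, 0.05],
              [0.04, 0.06, 0.20]])
pi = S.sum(axis=1)
A = S / pi[:, None]
# Unit-variance Gaussian log-densities as stand-ins for f_{beta_l}; -0.919 ~ -0.5*log(2*pi).
log_emission = lambda yk: np.array([-0.5 * (yk - m) ** 2 - 0.919 for m in (-2.0, 0.0, 3.0)])
y_test = np.array([-1.8, -2.1, 0.3, 2.9])
print(log_likelihood(y_test, pi, A, log_emission))
```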
4 Related work

One of the most popular ways to train HMM models is the Baum-Welch algorithm (the name for EM applied to HMMs). However, one of its (several) drawbacks is that it is prohibitively slow for long sequences. The complexity of the forward-backward algorithm (which must be run in each epoch) is $O(N^2 T)$, where $N$ is the number of hidden states and $T$ is the length of the observation sequence. In order to eliminate this issue, the authors of [6] propose to use pairwise co-occurrence probabilities (higher-order statistics are also discussed there) and prove that it is possible to recover the structure of an HMM based only on co-occurrences. An interesting extension (to both pairs and triplets) was considered in [25]. The authors use non-consecutive tuples there, which outperform consecutive tuples in some cases. Our approach is, in a sense, close to the technique used in [7], with, however, significant differences: the authors use alternating least squares methods for optimization (in our case, all the matrices are real-valued and softmax is applied whenever needed), and for the continuous case they assume that the observation densities are a mixture of a predefined number of kernels. The "softmax trick" is also used in [9] (their model learns so-called dense representations of hidden states and observation probabilities, but it is designed only for discrete HMMs).

As already mentioned, continuous HMMs typically assume that observations were sampled from a Gaussian distribution or a mixture of such distributions. For a non-Gaussian distribution, one usually assumes some parametric family of distributions. In [11] the authors propose a decoupling method for learning parametric HMMs that are stationary. Instead of estimating the parameters of hidden states and observations jointly, they learn the parameters of the observation densities using a parametric mixture learner, and then the hidden states by solving a convex quadratic programming problem. In [13] a method is proposed for the case where at least one state-dependent distribution is modeled with a nonparametric technique (e.g., maximum likelihood estimation under shape constraints); a broader review of such methods is presented in [26]. In [14] the authors use a Gaussian copula to model the dependence structure. A semiparametric data transformation is also proposed to ensure one may indeed use such copulas (the final observation distribution is a finite mixture of the copula models). In [27] the authors present an efficient learning method for parametric continuous HMMs sampled at finitely many irregular time instants (they incorporate some ideas from the theory of continuous-time Markov chains). In [28], the authors consider an extension of the Kalman filter (which can be thought of as a version of an HMM with both continuous observations and continuous hidden states), where the pseudo-observations (as the authors call them) are transformed through a normalizing flow producing the actual observations.

The model most similar to our approach was introduced by Ghosh et al. [29] and further developed in [23], where the authors employ normalizing flows as observation densities and apply them to classification problems. However, they utilize flows with coupling layers, which cannot be directly applied to one-dimensional data. We use CNFs, which can be easily used for any type of data and are characterized by better quality for low-dimensional examples compared to coupling-based flows. Moreover, their approach relies on EM training, which makes it impractical for longer sequences of data. We propose two variants of training: a) the FQ model, which relies on co-occurrences of discretized data and can successfully be applied to very long sequences, and b) the FML model, which applies gradient-based optimization not only to the parameters of the flows (as in [29] and [23]), but also to the transition matrix on hidden states (which, because the underlying chain is stationary, is uniquely determined by the matrix representing the joint state probabilities and the parametrization given in Eqs. (8) and (9)); consequently, in each iteration only one forward pass is needed, while the model of Ghosh et al. requires two forward passes (one with frozen weights (no gradient) and one gradient-based pass to update the weights).
For one-dimensional examples we show the advantages of the proposed FQ empirically (compared to the FML model). Finally, the authors of [29] apply their model only to sequence classification, as a black box. In contrast, we investigate the properties of specific examples in order to assess the capability of adjusting to the true distributions and of discovering the hidden states of data unobserved during training.

5 Experimental results

Figure 2: Distributions learned by FML and FQ for $T = 10^4$ observations for Example 1 (top row) and Example 2a (bottom row). Original distributions shaded.

Evaluation. Here we consider our FlowHMM models FML and FQ with the training procedures described in Sections 3.1 and 3.2, respectively. We compare our results with classic HMM models utilizing Gaussian mixture observation densities with $K = 1, 10, 20$ components; we denote them by G(K), and for simplicity we write G $\equiv$ G(1). We used the hmmlearn library¹ for the computations of all G(K) models and provide the source code for the FlowHMM models². We evaluate the quality of the proposed methods by computing the normalized log-likelihood (norm LL) of unseen observations, the total variation distance ($d_{tv}$) between the learned and true emission probabilities, and the accuracy of predicted hidden states (accuracy). We define all the metrics and motivate their choice in A.2; the experimental settings are described in detail in A.3.

¹ https://github.com/hmmlearn
² https://github.com/tooploox/flowhmm

norm LL:

| Example | T | G | G(10) | G(20) | FML | FQ |
|---|---|---|---|---|---|---|
| 1 | 10³ | -2.21 | -2.21 | -2.21 | -2.46 | -3.06 |
| 1 | 10⁴ | -2.19 | -2.19 | -2.19 | -2.28 | -2.18 |
| 1 | 10⁵ | -2.19 | -2.19 | -2.19 | -2.28 | -2.16 |
| 2a | 10³ | 0.17 | 0.17 | 0.17 | 0.33 | 0.11 |
| 2a | 10⁴ | 0.18 | 0.17 | 0.17 | 0.33 | 0.34 |
| 2a | 10⁵ | 0.18 | 0.17 | 0.17 | 0.34 | 0.33 |
| 2b | 10³ | 0.09 | 0.17 | 0.17 | 0.32 | 0.09 |
| 2b | 10⁴ | 0.09 | 0.18 | 0.18 | 0.32 | 0.30 |
| 2b | 10⁵ | 0.09 | 0.18 | 0.18 | 0.32 | 0.18 |

$d_{tv}$:

| Example | T | G | G(10) | G(20) | FML | FQ |
|---|---|---|---|---|---|---|
| 1 | 10³ | 0.09 | 0.09 | 0.09 | 0.13 | 0.39 |
| 1 | 10⁴ | 0.07 | 0.07 | 0.07 | 0.11 | 0.05 |
| 1 | 10⁵ | 0.07 | 0.07 | 0.07 | 0.11 | 0.04 |
| 2a | 10³ | 0.28 | 0.31 | 0.31 | 0.38 | 0.43 |
| 2a | 10⁴ | 0.29 | 0.32 | 0.32 | 0.30 | 0.15 |
| 2a | 10⁵ | 0.29 | 0.32 | 0.32 | 0.28 | 0.15 |

Table 1: norm LL (top) and $d_{tv}$ (bottom) metrics for synthetic Examples 1 and 2a-b.

Figure 3: Results for Example 2b (three original distributions shaded) for $T = 10^4$ observations; panels: FQ, FML, G, G(10).

1D synthetic sequences. We consider HMMs with $L = 3$ hidden states and experiment with the transition matrices $A_1$ (used in [7, Eq. (11)]) and $A_2$, where

$$A_1 = \begin{pmatrix} 0.0 & 0.9 & 0.1 \\ 0.0 & 0.0 & 1.0 \\ 1.0 & 0.0 & 0.0 \end{pmatrix}, \qquad A_2 = \begin{pmatrix} 0.4 & 0.2 & 0.4 \\ 0.25 & 0.5 & 0.25 \\ 0.4 & 0.2 & 0.4 \end{pmatrix}. \tag{17}$$

We consider two examples. Example 1: transition matrix $A_1$ and emission probabilities: two Gaussian and one Uniform. Example 2a: transition matrix $A_2$ and emission probabilities: one Gaussian, one Uniform, and one Beta distribution. In addition, we consider Example 2b, where we aim at fitting the models assuming only two hidden states. The examples are described in detail in Appendix A.4. For each example we sample train and test observations of the same length $T \in \{10^3, 10^4, 10^5\}$. We train the FML and FQ models using 500 (Example 1) and 1000 (Examples 2a and 2b) epochs. For FQ we used a grid of size $M = 30$. For each case (i.e., a fixed example and $T$) we perform 10 simulations and report the means of norm LL and total variation distances in Table 1.

For observations of length $T \geq 10^4$, flow models clearly outperform Gaussian ones in all the examples. Note that in these examples we had a grid of size $M = 30$, thus we estimated the matrix $\hat{Q}$ with 900 entries (the number of observation pairs). It is intuitively clear that $T = 10^3$ observations are then not enough for FQ. In this case FML outperforms FQ in terms of norm LL; the models are comparable for larger $T$.
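For completeness, the sketch below shows one simple way the reported metrics could be computed. These are simplified stand-ins (our assumptions), since the exact definitions used in the paper are given in A.2.

```python
import numpy as np
from scipy.stats import norm, uniform

def norm_ll(total_log_likelihood, T):
    """Per-observation log-likelihood of an unseen sequence (assumed meaning of norm LL);
    e.g. hmmlearn's model.score(y_test) divided by the test length."""
    return total_log_likelihood / T

def tv_distance(f_true, f_learned, grid):
    """Grid approximation of 0.5 * int |f_true - f_learned|, a stand-in for d_tv."""
    dx = np.diff(grid).mean()
    return 0.5 * np.sum(np.abs(f_true(grid) - f_learned(grid))) * dx

def state_accuracy(x_true, x_pred):
    """Fraction of correctly predicted hidden states (after matching state labels)."""
    return np.mean(np.asarray(x_true) == np.asarray(x_pred))

# Example: TV distance between a true Uniform(0, 1) emission and a learned Gaussian.
grid = np.linspace(-3, 3, 601)
print(tv_distance(uniform(0, 1).pdf, norm(0.5, 0.3).pdf, grid))
```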
For $T \geq 10^4$ the FQ model "recovers" the original distributions much more accurately; the small values of $d_{tv}$ from Table 1 are confirmed in Fig. 2, where the distributions learned by our flow models trained with $T = 10^4$ observations are depicted (from single simulations). In A.4, in Fig. 5, the distributions for Examples 1 and 2a-b trained with $T = 10^3$ observations are depicted. In A.7 we investigate the impact of the length $T$ on the quality of the model. We encourage the reader to compare Fig. 2 for FQ and $T = 10^4$ for Example 1 with [7, Fig. 4]. In Fig. 3 we depict the distributions learned by models with only $L = 2$ hidden states for $T = 10^4$ observations (recall that the observations were sampled from a model with 3 different distributions). Since flow models are capable of fitting multimodal distributions, they outperform Gaussian ones. Moreover, the co-occurrence-matrix-based model FQ outperforms the maximum-likelihood-based model FML.

Recall that, by construction, our models FML and FQ assume that the underlying Markov chain is stationary. We want to point out that in our examples we used at most $L = 3$ states. In such a case, even if the assumption of stationarity does not hold, it does not have a large impact on the results. This is because for such chains the rate of convergence to stationarity is exponential (in the number of steps). To be more specific, e.g., for the chain with transition matrix $A_2$ given in (17), after 10 steps the total variation distance between the stationary distribution (which is $\pi(1) = 5/14$, $\pi(2) = 4/14$, $\pi(3) = 5/14$) and the distribution of $X_{10}$ is at most (for any initial distribution) $4.22 \times 10^{-6}$.

| Dataset | L | G | G(10) | G(20) | FML | FQ |
|---|---|---|---|---|---|---|
| S&P 500 | 2 | -10.476 | -10.157 | -10.686 | -6.850 | -5.479 |
| S&P 500 | 3 | -10.800 | -10.194 | -10.806 | -6.837 | -4.966 |
| S&P 500 | 4 | -12.462 | -9.7641 | -11.130 | -6.908 | -5.043 |
| Dow Jones | 2 | -12.317 | -14.154 | -13.138 | -8.730 | -7.120 |
| Dow Jones | 3 | -12.645 | -13.747 | -15.007 | -8.897 | -7.426 |
| Dow Jones | 4 | -13.956 | -13.313 | -14.669 | -8.902 | -7.514 |
| Air pressure | 2 | -0.704 | 0.169 | 0.598 | -0.984 | 0.823 |
| Air pressure | 3 | -0.743 | 0.646 | 0.603 | 0.540 | 0.861 |
| Air pressure | 4 | -0.758 | 0.710 | 0.692 | 0.661 | 0.783 |

Table 2: norm LL for Examples 3 and 4.

Real datasets. Example 3. The S&P 500 and Dow Jones are popular measures of market performance. We retrieved data [30] for the period 9/1977-8/2017, which consists of 2082 points. We trained the models on $T = 1000$ observations (and computed norm LL on the remaining data). Example 4. We used the mean air pressure from the dataset [31], collected between 02/15/18 and 02/28/19. There were 54 531 observations; the models were trained on the first $T = 40\,000$ and tested on the remaining observations. In both Example 3 and Example 4 we applied a standard difference transform (to remove the trend), i.e., we considered the time series $y'_k = y_k - y_{k-1}$. We trained the models with $L \in \{2, 3, 4\}$ hidden states; the models FML and FQ were trained for 500 epochs, and for the latter we used a uniform grid of size $M = 30$. While training the FML model, we randomly sampled subsequences of length $10^3$ for each epoch. The values of norm LL are reported in Table 2. As can be seen, in all cases FQ yields the best results, and for Example 3 the FML model is second best. Note also that although the number of observations for Example 3 is not large (we used only $T = 1000$ observations, while for $M = 30$ there are 900 observation pairs), FQ outperforms FML. Fig. 6 in A.4 shows that the learned observation densities for $L \in \{2, 3, 4\}$ are non-Gaussian.
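As a side note, the convergence-to-stationarity bound quoted earlier in this section (for the chain with transition matrix $A_2$) can be verified numerically with a few lines; the sketch below computes the stationary distribution of $A_2$ and the total variation distance of each 10-step distribution from it.

```python
import numpy as np

A2 = np.array([[0.40, 0.20, 0.40],
               [0.25, 0.50, 0.25],
               [0.40, 0.20, 0.40]])

pi = np.array([5, 4, 5]) / 14.0                 # stationary distribution of A2
assert np.allclose(pi @ A2, pi)

A2_10 = np.linalg.matrix_power(A2, 10)          # 10-step transition probabilities
tv = 0.5 * np.abs(A2_10 - pi).sum(axis=1)       # TV distance for each initial state
print(tv.max())                                  # on the order of 1e-6, as stated above
```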
We want to remark on one thing: with financial data one usually works with log-returns $\log(y_k / y_{k-1})$, which tend to be "more Gaussian". This is also the case for the examples we considered: FML, FQ and the Gaussian models returned very similar norm LL values (which is no surprise in the case of Gaussian data, which the other models easily fit). We therefore chose and reported the differences $y_k - y_{k-1}$ to compare the models on non-Gaussian data.

2D synthetic sequences. We consider HMM models with $L = 3$ states and two examples: Example 5, with emission probabilities given by two "Moons" and one Uniform, and Example 6, with emission probabilities given by one bivariate Gaussian, one Uniform and one related to geometric Brownian motion. The details are provided in A.5. In both examples we consider the transition matrix $A_1$ (variant (a)) and $A_2$ (variant (b)), defined in (17).

norm LL:

| Example | T | G | G(10) | G(20) | Ghosh | FML | FQ |
|---|---|---|---|---|---|---|---|
| 5a | 10³ | -1.6572 | -0.6521 | -0.6637 | -0.7643 | -0.6003 | -2.7090 |
| 5a | 10⁴ | -1.6083 | -0.6176 | -0.6081 | -0.7052 | -0.4822 | -0.6711 |
| 5a | 10⁵ | -1.6117 | -0.6001 | -0.5988 | -0.6740 | -0.4898 | -3.4500 |
| 5b | 10³ | -1.6560 | -1.4459 | -1.4486 | -0.7430 | -1.3309 | -2.6789 |
| 5b | 10⁴ | -1.6574 | -1.4392 | -1.4356 | -0.7036 | -1.2286 | -1.4185 |
| 5b | 10⁵ | -1.6510 | -1.4160 | -1.4178 | -0.6728 | -1.2538 | -3.0717 |
| 6a | 10³ | 0.8652 | 0.9443 | 0.9472 | 0.1600 | 1.0531 | -2.1823 |
| 6a | 10⁴ | 0.8755 | 1.0009 | 1.0014 | 0.1673 | 1.0807 | 0.9189 |
| 6a | 10⁵ | 0.8843 | 0.9978 | 0.9992 | 0.1713 | 1.0688 | -1.8349 |
| 6b | 10³ | 0.2270 | 0.3128 | 0.2915 | 0.1487 | 0.3636 | -2.1891 |
| 6b | 10⁴ | 0.2378 | 0.3161 | 0.3173 | 0.1665 | 0.3551 | 0.2120 |
| 6b | 10⁵ | 0.2186 | 0.3076 | 0.3103 | 0.1730 | 0.3496 | -0.3917 |

accuracy:

| Example | T | G | G(10) | G(20) | FML | FQ |
|---|---|---|---|---|---|---|
| 5a | 10³ | 0.6010 | 1.0000 | 1.0000 | 1.0000 | 0.5600 |
| 5a | 10⁴ | 0.6325 | 0.9995 | 0.9998 | 0.9998 | 0.9956 |
| 5a | 10⁵ | 0.6274 | 1.0000 | 1.0000 | 0.9998 | 0.6097 |
| 5b | 10³ | 0.5450 | 0.6560 | 0.6480 | 0.7390 | 0.5360 |
| 5b | 10⁴ | 0.5680 | 0.6677 | 0.6723 | 0.7088 | 0.7414 |
| 5b | 10⁵ | 0.5726 | 0.6775 | 0.6727 | 0.7448 | 0.5615 |
| 6a | 10³ | 0.9960 | 0.9960 | 0.9960 | 0.9940 | 0.6700 |
| 6a | 10⁴ | 0.9971 | 0.9987 | 0.9987 | 0.9989 | 0.9934 |
| 6a | 10⁵ | 0.9964 | 0.9994 | 0.9994 | 0.9987 | 0.9219 |
| 6b | 10³ | 0.8270 | 0.6580 | 0.6590 | 0.8550 | 0.6490 |
| 6b | 10⁴ | 0.8173 | 0.6533 | 0.6558 | 0.5419 | 0.8428 |
| 6b | 10⁵ | 0.8077 | 0.6503 | 0.6443 | 0.8796 | 0.7376 |

Table 3: norm LL (top) and accuracy (bottom) for Examples 5 and 6.

Figure 4: Trained distributions for Example 5a for $T = 10^4$: the original observations are denoted as red dots, the grid $\Gamma$ is presented as gray crosses; samples from the trained distributions $f_{\beta_l}$, $l = 1, 2, 3$, are depicted as black, green and blue dots (colors correspond to the state $l$).

We compare the Gaussian baseline models, our FML and FQ models, as well as the model of Ghosh et al. [23], for which we used the authors' official implementation³. For the latter we computed only norm LL. In Table 3 the values of norm LL (for all models) and accuracy (for all models except Ghosh et al. [23]) are reported, whereas in Fig. 4 the trained distributions for Example 5a for $T = 10^4$ are depicted. Similar plots for Example 6a are presented in A.5 in Fig. 7. We used non-uniform grids of size $M = 30^2$ (a Cartesian product of two one-dimensional grids of size 30, one on each coordinate); for other strategies of choosing a grid see A.1.

One can observe that for all examples except Example 5b the FML model gives the best norm LL; Ghosh et al. outperforms our models only in Example 5b. For the 2D examples we can spot a drop of quality in terms of norm LL for the FQ model. We performed additional simulations for this model for $T = 10^3$ observations and grids of sizes $35^2$ and $40^2$; the results are provided in Table 4.

| Ex. | M = 35² | M = 40² |
|---|---|---|
| 5a | -2.548 | -2.499 |
| 5b | -2.520 | -2.470 |
| 6a | -2.042 | -1.997 |
| 6b | -2.042 | -1.990 |

Table 4: Values of norm LL for the FQ model for Examples 5a-b and 6a-b for $T = 10^3$ observations.

As we can see, they are still worse than for the FML model (which is no surprise: we have $30^4$, $35^4$ and $40^4$ possible observation pairs and only $10^3$ observations), but we see a tendency: increasing the grid size increases norm LL in all cases.
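For reference, a minimal sketch of a Cartesian-product grid and nearest-neighbor discretization for 2D data is given below. It uses a uniform grid for simplicity, whereas the experiments above use the non-uniform grids described in A.1; the data and sizes are placeholders.

```python
import numpy as np

def uniform_grid_2d(y, m_per_dim=30):
    """Cartesian-product grid: m_per_dim uniformly spaced points per coordinate,
    spanning the range of the data (illustrative; A.1 discusses non-uniform strategies)."""
    axes = [np.linspace(y[:, d].min(), y[:, d].max(), m_per_dim) for d in range(y.shape[1])]
    xx, yy = np.meshgrid(*axes)
    return np.stack([xx.ravel(), yy.ravel()], axis=1)        # shape (m_per_dim**2, 2)

def discretize(y, grid):
    """Map each observation to the index of its nearest grid point (Euclidean norm)."""
    d2 = ((y[:, None, :] - grid[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
y_train = rng.normal(size=(10_000, 2))                       # placeholder 2D observations
grid = uniform_grid_2d(y_train, m_per_dim=30)                # M = 30**2 grid points
y_disc = discretize(y_train, grid)                           # indices used to build Q_hat
```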
One can observe that in all the considered cases and variants, the FML model gives the best norm LL values. When it comes to the other metric, i.e., the accuracy, all the classic Gaussian mixture models, as well as FML, yield almost the same accuracy if the transition matrix $A_1$ was used (variant (a)). However, for the transition matrix $A_2$ (variant (b)), the models FML and FQ outperform the classic models by a significant margin (note that $A_2$ yields a "more complex" model compared to $A_1$).

| L | G | G(10) | G(20) | FML |
|---|---|---|---|---|
| 2 | -1097.49 | -1096.81 | -1096.65 | -571.08 |
| 3 | -1097.45 | -1096.73 | -1096.51 | -571.97 |
| 4 | -1097.39 | -1096.65 | -1096.09 | -572.61 |

Table 5: Values of norm LL for the 6-dimensional data.

High-dimensional (6D) real dataset. Example 7: from the dataset [31] (used in Example 4) we chose 6 features related to humidity (mean, stdev) and air pressure (min, max, mean, stdev). Similarly to Example 4, we considered differences of the observations. Again, the length of the training set was set to 40k, and the remaining 14k observations constituted the test set. We used the FML model; in each epoch we randomly sampled a subsequence of length $10^3$. In all cases ($L = 2, 3, 4$) the model significantly outperforms the Gaussian models, as can be seen in Table 5.

6 Conclusions and limitations

In this work, we proposed a continuous hidden Markov model that leverages deep flow-based network architectures to model complex, non-Gaussian distributions. Although our approach outperforms several baselines and competing approaches, it relies on the co-occurrence matrix, which can only be computed under the assumption that the Markov chain on the set of hidden states is stationary, a limitation not present in the EM-based approaches. Addressing this constraint, e.g., by re-computing a temporally stationary co-occurrence matrix during training, is part of our future work, as is an in-depth validation of our model on multivariate time series or distributions, where our flexible approach can potentially offer more benefits over the existing works. While we do not identify any straightforward negative societal implications of our work, we acknowledge that its application to large data corpora can lead to significant energy consumption, despite our main contributions being focused on increased efficiency.

Acknowledgements

This work was supported by the Foundation for Polish Science (grant no. POIR.04.04.00-00-14DE/18-00) carried out within the Team-Net program co-financed by the European Union under the European Regional Development Fund, as well as the National Centre of Science (Poland) Grant No. 2020/39/B/ST6/01511. The work conducted by Maciej Zięba was supported by the National Centre of Science (Poland) Grant No. 2021/43/B/ST6/02853. For the purpose of Open Access, the author has applied a CC-BY public copyright licence to any Author Accepted Manuscript (AAM) version arising from this submission.

³ https://github.com/anubhabghosh/genhmm

References

[1] Rogemar S. Mamon and Robert James Elliott. Hidden Markov models in finance, volume 4. Springer, 2007.

[2] Mark Gales and Steve Young. The application of hidden Markov models in speech recognition. Now Publishers Inc, 2008.

[3] Mathukumalli Vidyasagar. Hidden Markov Processes: Theory and Applications to Biology. Princeton University Press, 2014.

[4] Pierre Ailliot, Craig Thompson, and Peter Thomson. Space-time modelling of precipitation by using a hidden Markov model and censored Gaussian distributions. Journal of the Royal Statistical Society: Series C (Applied Statistics), 58(3):405–426, 2009.
[5] Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164–171, 1970.

[6] Kejun Huang, Xiao Fu, and Nicholas Sidiropoulos. Learning hidden Markov models from pairwise co-occurrences with application to topic modeling. In International Conference on Machine Learning, pages 2068–2077. PMLR, 2018.

[7] Balaji Lakshminarayanan and Raviv Raich. Non-negative matrix factorization for parameter estimation in hidden Markov models. In 2010 IEEE International Workshop on Machine Learning for Signal Processing, pages 89–94, 2010.

[8] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.

[9] Joachim Sicking, Maximilian Pintz, Maram Akila, and Tim Wirtz. DenseHMM: Learning hidden Markov models by learning dense representations. CoRR, abs/2012.09783, 2020.

[10] Robert Mattila, Cristian R. Rojas, Vikram Krishnamurthy, and Bo Wahlberg. Identification of hidden Markov models using spectral learning with likelihood maximization. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 5859–5864. IEEE, 2017.

[11] Aryeh Kontorovich, Boaz Nadler, and Roi Weiss. On learning parametric-output HMMs. In Proceedings of the 30th International Conference on Machine Learning, ICML'13, pages III-702–III-710. JMLR.org, 2013.

[12] Oscar Darwin and Stefan Kiefer. Equivalence of hidden Markov models with continuous observations. arXiv preprint arXiv:2009.12978, 2020.

[13] Jörn Dannemann. Semiparametric hidden Markov models. Journal of Computational and Graphical Statistics, 21(3):677–692, 2012.

[14] Hongyang Yu. A novel semiparametric hidden Markov model for process failure mode identification. IEEE Transactions on Automation Science and Engineering, 15(2):506–518, 2018.

[15] Lu Xia, Chia-Chih Chen, and J. K. Aggarwal. View invariant human action recognition using histograms of 3D joints. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 20–27, 2012.

[16] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In ICML. PMLR, 2015.

[17] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint, 2018.

[18] Marcin Sendera, Jacek Tabor, Aleksandra Nowak, Andrzej Bedychaj, Massimiliano Patacchiola, Tomasz Trzcinski, Przemysław Spurek, and Maciej Zięba. Non-Gaussian Gaussian processes for few-shot regression. In Advances in Neural Information Processing Systems, 2021.

[19] Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. PointFlow: 3D point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4541–4550, 2019.

[20] Rameen Abdal, Peihao Zhu, Niloy J. Mitra, and Peter Wonka. StyleFlow: Attribute-conditioned exploration of StyleGAN-generated images using conditional continuous normalizing flows. ACM Transactions on Graphics (ToG), 40(3):1–21, 2021.

[21] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

[22] Durk P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions.
Advances in Neural Information Processing Systems, 31, 2018.

[23] Anubhab Ghosh, Antoine Honoré, Dong Liu, Gustav Eje Henter, and Saikat Chatterjee. Normalizing flow based hidden Markov models for classification of speech phones with explainability. arXiv preprint arXiv:2107.00730, 2021.

[24] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. Advances in Neural Information Processing Systems, 30, 2017.

[25] Robert Mattila, Cristian Rojas, Eric Moulines, Vikram Krishnamurthy, and Bo Wahlberg. Fast and consistent learning of hidden Markov models by incorporating non-consecutive correlations. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 6785–6796. PMLR, 13–18 Jul 2020.

[26] Jörn Dannemann, Hajo Holzmann, and Anna Leister. Semiparametric hidden Markov models: identifiability and estimation. Wiley Interdisciplinary Reviews: Computational Statistics, 6(6):418–425, 2014.

[27] Yu-Ying Liu, Shuang Li, Fuxin Li, Le Song, and James M. Rehg. Efficient learning of continuous-time hidden Markov models for disease progression. Advances in Neural Information Processing Systems, 28:3599, 2015.

[28] Emmanuel de Bézenac, Syama Sundar Rangapuram, Konstantinos Benidis, Michael Bohlke-Schneider, Richard Kurle, Lorenzo Stella, Hilaf Hasson, Patrick Gallinari, and Tim Januschowski. Normalizing Kalman filters for multivariate time series analysis. In NeurIPS, 2020.

[29] Anubhab Ghosh, Antoine Honoré, Dong Liu, Gustav Eje Henter, and Saikat Chatterjee. Robust classification using hidden Markov models and mixtures of normalizing flows. In 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, 2020.

[30] data.world. Dow Jones and S&P 500 datasets, accessed 2022. https://data.world/chasewillden/stock-market-from-a-high-level/workspace/.

[31] AmeriGEOSS. Wind measurements in Papua New Guinea, 2019. Data retrieved from https://data.amerigeoss.org/dataset/857a9652-1a8f-4098-9f51-433a81583387/resource/a8db6cec-8faf-4c8e-aee2-4fb45d1e6f14.