Graphical Models in Heavy-Tailed Markets

José Vinícius de M. Cardoso, Jiaxi Ying, Daniel P. Palomar
Department of Electronic and Computer Engineering
Department of Industrial Engineering and Decision Analytics
The Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong SAR China
{jvdmc, jx.ying}@connect.ust.hk, palomar@ust.hk

Heavy-tailed statistical distributions have long been considered more realistic models for the data generating process in financial markets than their Gaussian counterpart. Nonetheless, the mathematical nuisances involved in estimating graphs in heavy-tailed settings, including nonconvexities, pose a significant challenge to the practical design of algorithms for graph learning. In this work, we present graph learning estimators based on the Markov random field framework that assume a Student-t data generating process. We design scalable numerical algorithms, via the alternating direction method of multipliers, to learn both connected and k-component graphs, along with their theoretical convergence guarantees. The proposed methods outperform state-of-the-art benchmarks in an extensive series of practical experiments with publicly available data from the S&P500 index, foreign exchanges, and cryptocurrencies.

1 Introduction

Graph learning frameworks are often designed based on the assumption that the observed graph signals are Gaussian distributed [1–9]. While such an assumption for graphical models has found great success in many practical areas, including brain network analysis [10], psychological networks [11], and single-cell sequencing [12], it inherently neglects scenarios where there may exist outliers or where the underlying data is naturally heavy-tailed. As a consequence, those methods often lack robustness and may not succeed in capturing a meaningful representation of the underlying graph [13].

Data from financial instruments are well-known examples of such scenarios, where heavy-tailedness and skewness are present [14–19]. In addition, there has been a growing interest in methods for estimating graphical models in financial markets, which demands the development of scalable and robust learning algorithms [20]. Perhaps one of the most prominent applications, clustering financial time series via graph techniques, has been an active research topic [20–24]. Nonetheless, current techniques rely on the assumption that the underlying graph has a tree structure, which does bring advantages due to its hierarchical clustering properties, but has also been shown to be unstable [25–27] and not suitable when the data is not Gaussian distributed [28].

Motivated by challenging practical applications in finance, such as clustering of financial instruments and network estimation, we investigate the problem of learning graph matrices whose structure follows that of a Laplacian matrix of an undirected weighted graph, for which the data generating process is assumed to be Student-t distributed. In particular, the main contributions of this paper are as follows:

- We propose a novel formulation for learning undirected weighted graphs under the assumption that the data generating process is Student-t distributed. We solve the underlying
learning problem via a carefully designed numerical algorithm based on the alternating direction method of multipliers (ADMM), along with the establishment of its theoretical convergence guarantees. We note that the proposed algorithm can be easily extended to account for additional linear constraints on the graph weights.

- We extend the proposed framework to account for heavy tails and k-component graphs simultaneously, which enables a novel method for clustering financial time series.

- We present extensive practical results, with real-world data from the US stock market, foreign exchanges, and cryptocurrencies, that showcase clear advantages of including heavy-tail assumptions in graph learning frameworks when compared to state-of-the-art, Gaussian-based methods.

Notation: Matrices (vectors) are denoted by bold, italic, capital (lowercase) roman letters such as $X$, $x$. Vectors are assumed to be column vectors. The $(i, j)$ element of a matrix $X \in \mathbb{R}^{n \times p}$ is denoted as $X_{ij}$. The $i$-th element of a vector $x$ is denoted as $x_i$. The $i$-th row of $X$ is denoted as $x_i \in \mathbb{R}^{p \times 1}$. Given a symmetric matrix $A$, $\lambda_i(A)$ and $\lambda_{\max}(A)$ denote the $i$-th smallest and the maximum eigenvalue of $A$, respectively. The Moore-Penrose inverse of $A$ is denoted as $A^{\dagger}$. The Frobenius norm of a matrix $A$ is denoted as $\|A\|_F = \sqrt{\operatorname{tr}(A^{\top}A)}$. The operator $\operatorname{Diag} : \mathbb{R}^{p} \to \mathbb{R}^{p \times p}$ creates a diagonal matrix with the elements of an input vector along its diagonal. The operator $\operatorname{diag} : \mathbb{R}^{p \times p} \to \mathbb{R}^{p}$ extracts the diagonal of a square matrix. For $x \in \mathbb{R}^{p}$, $\|x\|_{\infty} = \max_i |x_i|$. $(x)_{+}$ denotes the projection onto the nonnegative orthant, i.e., $(x)_{+} = \max(0, x)$.

2 Background & Related Works

An undirected, weighted graph is denoted as a triple $\mathcal{G} = (\mathcal{V}, \mathcal{E}, W)$, where $\mathcal{V} = \{1, 2, \ldots, p\}$ is the node set and $\mathcal{E} \subseteq \{\{u, v\} : u, v \in \mathcal{V}, u \neq v\}$ is the edge set, that is, a subset of the set of all possible unordered pairs of nodes such that $\{u, v\} \in \mathcal{E}$ iff there exists a link between nodes $u$ and $v$. $W \in \mathbb{R}^{p \times p}_{+}$ is the symmetric weighted adjacency matrix that satisfies $W_{ii} = 0$, $W_{ij} > 0$ iff $\{i, j\} \in \mathcal{E}$, and $W_{ij} = 0$ otherwise. The combinatorial, unnormalized graph Laplacian matrix $L$ is defined as $L \triangleq D - W$, where $D \triangleq \operatorname{Diag}(W\mathbf{1})$ is the degree matrix.

A $p$-dimensional, real-valued, Gaussian random variable $x$, with mean vector $\mathbb{E}[x] \triangleq \mu$ and rank-deficient precision matrix $L$, is said to form a Laplacian-constrained Gaussian Markov random field (LGMRF) [9, 29–31] of rank $p - k$, $k \geq 1$, with respect to a graph $\mathcal{G}$, when its probability density function is given as

$$p(x) \propto \sqrt{{\det}^{*}(L)}\, \exp\left(-\tfrac{1}{2}(x - \mu)^{\top} L\, (x - \mu)\right), \tag{1}$$

where ${\det}^{*}(L)$ is the pseudo-determinant of $L$, i.e., the product of its positive eigenvalues [32].

Assume we are given $n$ observations of $x$, i.e., $X = [x_1, x_2, \ldots, x_n]^{\top}$, $X \in \mathbb{R}^{n \times p}$, $x_i \in \mathbb{R}^{p \times 1}$. The goal of graph learning algorithms is to learn a Laplacian matrix, or equivalently an adjacency matrix, given only the data matrix $X$, i.e., often without any knowledge of $\mathcal{E}$. To that end, the penalized maximum likelihood estimator (MLE) of the Laplacian-constrained precision matrix of $x$, on the basis of the observed data $X$, is

$$\underset{L \succeq 0}{\operatorname{minimize}} \;\; \operatorname{tr}(LS) - \log {\det}^{*}(L) + h(L), \quad \text{subject to} \;\; L\mathbf{1} = \mathbf{0}, \;\; L_{ij} = L_{ji} \leq 0 \;\; (i \neq j), \tag{2}$$

where $S$ is a similarity matrix, e.g., the sample covariance (or correlation) matrix $S \propto X^{\top}X$, and $h$ is a regularization function to promote certain properties on $L$, such as sparsity or low-rankness. Even though Problem (2) is convex, provided we assume a convex choice for $h$, it is not adequate to be solved by disciplined convex programming languages, such as cvxpy [33], particularly due to scalability issues related to the computation of the term $\log {\det}^{*}(L)$ [6, 7].
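To make the objective in (2) concrete, the short numpy sketch below (our own illustration, not code from the paper; the function name and toy data are placeholders) evaluates $\operatorname{tr}(LS) - \log{\det}^{*}(L)$ for a given Laplacian, computing the pseudo-determinant from the positive eigenvalues of $L$.

```python
import numpy as np

def gaussian_mle_objective(L, S, tol=1e-9):
    """Evaluate tr(L S) - log det*(L), where det*(L) is the pseudo-determinant
    of the Laplacian L, i.e., the product of its positive eigenvalues."""
    eigvals = np.linalg.eigvalsh(L)          # eigenvalues of the symmetric L
    positive = eigvals[eigvals > tol]        # drop the (near-)zero eigenvalues
    log_pdet = np.sum(np.log(positive))      # log of the pseudo-determinant
    return np.trace(L @ S) - log_pdet

# toy usage: a path graph on 3 nodes and a sample covariance from random data
W = np.array([[0., 1., 0.],
              [1., 0., 2.],
              [0., 2., 0.]])
L = np.diag(W.sum(axis=1)) - W               # combinatorial Laplacian L = D - W
X = np.random.default_rng(0).standard_normal((100, 3))
S = X.T @ X / X.shape[0]                     # sample covariance matrix
print(gaussian_mle_objective(L, S))
```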
Indeed, recently, considerable efforts have been directed towards the design of scalable, iterative algorithms based on block coordinate descent [34], majorization-minimization (MM) [35, 36], and ADMM [37] to solve Problem (2) in an efficient fashion, e.g., [6] and [7]. Estimators based on Gaussian assumptions have been proposed for connected graphs [2, 4–7]. Some of their properties, such as sparsity, are still being investigated [9, 31]. The authors in [29] and [38] proposed optimization programs for learning the class of k-component graphs, as such a class is an appealing model for clustering tasks due to the spectral properties of the Laplacian matrix. However, a major shortcoming in their formulations is the lack of constraints on the degrees of the nodes, which allows for trivial solutions, i.e., graphs with isolated nodes. Elliptical losses along with linear structural constraints that retain the positive-definiteness of the estimated covariance matrix have been proposed in the literature [39]. In this work, however, we address the case of Laplacian constraints, which lead to positive-semidefinite precision matrices, and nonconvex k-component structural constraints.

3 Proposed Formulations & Algorithms

In this section, we propose optimization formulations and an iterative algorithm to learn a Laplacian matrix under heavy-tailed assumptions. With that goal, we express the Laplacian matrix via its linear operator, i.e., $L = \mathcal{L}w$ [29], where $w \in \mathbb{R}^{p(p-1)/2}$ is the vectorized form of the upper triangular part of the adjacency matrix, also known as the vector of graph weights. In addition, we use the fact that, for connected graphs, ${\det}^{*}(\mathcal{L}w) = \det\left(\mathcal{L}w + J\right)$, where $J \triangleq \frac{1}{p}\mathbf{1}\mathbf{1}^{\top}$.

In order to address the inherent heavy-tailed nature of financial market data [40], we consider the Student-t distribution under the improper Markov random field assumption [41] with Laplacian structural constraints, that is, we assume the data generating process to be modeled as a multivariate zero-mean Student-t distribution, whose probability density function can be written as

$$p(x) \propto \sqrt{{\det}^{*}(\Theta)} \left(1 + \frac{x^{\top}\Theta x}{\nu}\right)^{-\frac{p+\nu}{2}}, \quad \nu > 2, \tag{3}$$

where $\Theta$ is a positive-semidefinite inverse scatter matrix modeled as a combinatorial graph Laplacian matrix and $\nu$ is the number of degrees of freedom, which measures the rate of decay of the tails. This results in a robustified version of the MLE for connected graph learning, i.e.,

$$\begin{array}{cl}
\underset{w \geq 0,\; \Theta \succeq 0}{\operatorname{minimize}} & \dfrac{p+\nu}{n} \displaystyle\sum_{i=1}^{n} \log\left(1 + \dfrac{x_i^{\top} \mathcal{L}w\, x_i}{\nu}\right) - \log\det\left(\Theta + J\right), \\[1ex]
\text{subject to} & \Theta = \mathcal{L}w, \quad \mathfrak{d}w = d,
\end{array} \tag{4}$$

where $\mathfrak{d} : \mathbb{R}^{p(p-1)/2} \to \mathbb{R}^{p}$ is the degree operator defined as $\mathfrak{d}w \triangleq \operatorname{diag}(\mathcal{L}w)$. The constraint $\mathfrak{d}w = d$ enables the learning of additional graph structures, such as regular graphs, and it is crucial for k-component graphs, as discussed in Section 3.2.

From a theoretical perspective, the Student-t model naturally yields sparse graphs. Comparing the objective function in Problem (4) to that of Problem (2), we note that the Student-t objective contains a $\log(\cdot)$ term in place of a linear term of the graph weights. The usage of a log function to promote sparsity is closely related to the iteratively reweighted $\ell_1$-norm as an approximation for the $\ell_0$-norm problem [42].

Problem (4) is, in general, nonconvex due to the summation of log terms, and hence it is challenging to tackle directly. We therefore design an iterative algorithm based on the ADMM framework.
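To fix ideas, the sketch below (a minimal numpy illustration of our own, not the authors' implementation; function names and the upper-triangular ordering of $w$ are assumed conventions) shows one way to evaluate the Laplacian operator $\mathcal{L}w$, the degree operator $\mathfrak{d}w$, and the heavy-tailed data-fitting term of Problem (4).

```python
import numpy as np

def L_op(w, p):
    """Laplacian operator: map the p(p-1)/2 upper-triangular weights w to Lw = Diag(W 1) - W."""
    W = np.zeros((p, p))
    W[np.triu_indices(p, k=1)] = w
    W = W + W.T
    return np.diag(W.sum(axis=1)) - W

def d_op(w, p):
    """Degree operator: dw = diag(Lw), i.e., the vector of node degrees."""
    return np.diag(L_op(w, p))

def student_t_term(w, X, nu):
    """Heavy-tailed data-fitting term of Problem (4):
    (p + nu)/n * sum_i log(1 + x_i^T (Lw) x_i / nu)."""
    n, p = X.shape
    Lw = L_op(w, p)
    quad = np.einsum("ij,jk,ik->i", X, Lw, X)    # x_i^T Lw x_i for every sample
    return (p + nu) / n * np.sum(np.log1p(quad / nu))

# toy usage on random data with a random nonnegative weight vector
rng = np.random.default_rng(1)
p, n, nu = 5, 200, 4.0
X = rng.standard_normal((n, p))
w = rng.uniform(size=p * (p - 1) // 2)
print(student_t_term(w, X, nu), d_op(w, p))
```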
3.1 ADMM Solution

The partial augmented Lagrangian function of Problem (4) is given as

$$\begin{aligned}
L_{\rho}(\Theta, w, Y, y) = {} & \frac{p+\nu}{n} \sum_{i=1}^{n} \log\left(1 + \frac{x_i^{\top}\mathcal{L}w\, x_i}{\nu}\right) - \log\det(\Theta + J) + \langle y, \mathfrak{d}w - d\rangle + \frac{\rho}{2}\left\|\mathfrak{d}w - d\right\|_2^2 \\
& + \langle Y, \Theta - \mathcal{L}w\rangle + \frac{\rho}{2}\left\|\Theta - \mathcal{L}w\right\|_F^2,
\end{aligned} \tag{5}$$

where $Y$ and $y$ are the dual variables associated with the equality constraints $\Theta = \mathcal{L}w$ and $\mathfrak{d}w = d$, respectively. Note that we deal with the constraints $w \geq 0$ and $\Theta \succeq 0$ directly, hence there are no dual variables associated with them.

Given $w^{l}$ and $Y^{l}$, the subproblem for $\Theta$ can be written as

$$\Theta^{l+1} = \underset{\Theta \succeq 0}{\arg\min} \;\; -\log\det(\Theta + J) + \langle \Theta, Y^{l}\rangle + \frac{\rho}{2}\left\|\Theta - \mathcal{L}w^{l}\right\|_F^2, \tag{6}$$

whose closed-form solution is given by Lemma 1.

Lemma 1 The global minimizer of problem (6) is [43, 44]

$$\Theta^{l+1} = \frac{1}{2\rho}\, U\left(\Gamma + \sqrt{\Gamma^{2} + 4\rho I}\right) U^{\top} - J, \tag{7}$$

where $U \Gamma U^{\top}$ is the eigenvalue decomposition of $\rho\left(\mathcal{L}w^{l} + J\right) - Y^{l}$.

Given $\Theta^{l+1}$, $Y^{l}$, and $y^{l}$, the subproblem for $w$ can be formulated as

$$\underset{w \geq 0}{\operatorname{minimize}} \;\; \frac{\rho}{2}\, w^{\top}\left(\mathfrak{d}^{*}\mathfrak{d} + \mathcal{L}^{*}\mathcal{L}\right) w - \left\langle w,\, \mathcal{L}^{*}\left(Y^{l} + \rho\,\Theta^{l+1}\right) - \mathfrak{d}^{*}\left(y^{l} - \rho\, d\right)\right\rangle + \frac{p+\nu}{n}\sum_{i=1}^{n} \log\left(1 + \frac{x_i^{\top}\mathcal{L}w\, x_i}{\nu}\right), \tag{8}$$

where $\mathfrak{d}^{*}$ and $\mathcal{L}^{*}$ are the adjoint operators of the degree and Laplacian operators, respectively. In general, subproblem (8) is nonconvex due to the concave nature of the logarithm function. Hence, we resort to the MM method [36] to find a stationary point of subproblem (8). We proceed by constructing a global upper bound of the objective function of (8) at a point $w^{j} \in \mathbb{R}^{p(p-1)/2}_{+}$ as

$$g(w, w^{j}) = f(w^{j}) + \left\langle w - w^{j},\, \nabla_{w} f(w^{j})\right\rangle + \frac{\mu}{2}\left\|w - w^{j}\right\|_2^2, \tag{9}$$

where $f$ is the objective function in the minimization in (8), whose gradient is given as $\nabla_{w} f(w^{j}) = a^{j} + b^{j}$, with

$$a^{j} = \mathcal{L}^{*}\left(S^{j} - Y^{l} - \rho\left(\Theta^{l+1} - \mathcal{L}w^{j}\right)\right), \tag{10}$$

$$b^{j} = \mathfrak{d}^{*}\left(y^{l} - \rho\left(d - \mathfrak{d}w^{j}\right)\right), \tag{11}$$

where $S^{j} \triangleq \frac{1}{n}\sum_{i=1}^{n} \frac{(p+\nu)\, x_i x_i^{\top}}{\left\langle w^{j}, \mathcal{L}^{*}\left(x_i x_i^{\top}\right)\right\rangle + \nu}$ is a weighted sample covariance matrix, and $\mu = \rho\, \lambda_{\max}\left(\mathfrak{d}^{*}\mathfrak{d} + \mathcal{L}^{*}\mathcal{L}\right)$, where the maximum eigenvalue of $\mathfrak{d}^{*}\mathfrak{d} + \mathcal{L}^{*}\mathcal{L}$ is given by Lemma 2, whose proof is presented in the Supplementary Material.

Lemma 2 The maximum eigenvalue of the matrix $\mathfrak{d}^{*}\mathfrak{d} + \mathcal{L}^{*}\mathcal{L}$ is given as

$$\lambda_{\max}\left(\mathfrak{d}^{*}\mathfrak{d} + \mathcal{L}^{*}\mathcal{L}\right) = 2(2p - 1). \tag{12}$$

The vector of graph weights $w$ can then be updated by minimizing the function $g$ constructed in (9), which is tantamount to solving the following nonnegativity-constrained, strictly convex quadratic problem:

$$w^{j+1} = \underset{w \geq 0}{\arg\min} \;\; \rho(2p-1)\left\|w - w^{j}\right\|_2^2 + \left\langle w,\, a^{j} + b^{j}\right\rangle, \tag{13}$$

whose unique solution can be readily obtained via its KKT optimality conditions and is given as

$$w^{j+1} = \left(w^{j} - \frac{a^{j} + b^{j}}{2\rho(2p-1)}\right)_{+}, \tag{14}$$

that is, a projected gradient descent step with learning rate $\left(2\rho(2p-1)\right)^{-1}$. Thus, we iterate (14) in order to obtain a stationary point, $w^{l+1}$, of Problem (8). In practice, we observe that a few iterations are sufficient to obtain $w^{l+1}$. The dual variables $Y$ and $y$ are updated via gradient ascent steps. Algorithm 1 summarizes the implementation to find a stationary point of Problem (4). We present Theorem 3, proved in the Supplementary Material, which establishes the convergence of Algorithm 1.

Theorem 3 Algorithm 1 converges subsequentially for any sufficiently large $\rho$, that is, the sequence $\left(\Theta^{l}, w^{l}, Y^{l}, y^{l}\right)$ generated by Algorithm 1 has at least one limit point, and each limit point is a stationary point of Problem (4).

Algorithm 1: Student-t Graph Learning
Data: data matrix $X \in \mathbb{R}^{n \times p}$, initial estimate of the graph weights $w^{0}$, desired degree vector $d$, penalty parameter $\rho > 0$, degrees of freedom $\nu > 2$, convergence tolerance $\epsilon > 0$
Result: graph Laplacian estimation $\mathcal{L}w$
  initialize $Y^{0} = 0$, $y^{0} = 0$
  while $\|r^{l}\|_F > \epsilon$ or $\|s^{l}\|_2 > \epsilon$ do
    update $\Theta^{l+1}$ via (7)
    update $w^{l+1}$ by iterating (14)
    update $Y^{l+1} = Y^{l} + \rho\left(\Theta^{l+1} - \mathcal{L}w^{l+1}\right)$
    update $y^{l+1} = y^{l} + \rho\left(\mathfrak{d}w^{l+1} - d\right)$
    compute the residuals $r^{l+1} = \Theta^{l+1} - \mathcal{L}w^{l+1}$ and $s^{l+1} = \mathfrak{d}w^{l+1} - d$
  end while
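The sketch below (a simplified numpy illustration of our own, not the authors' reference implementation; all function names are ours) assembles the main steps of Algorithm 1: the $\Theta$-update of Lemma 1 in (7), a few projected-gradient (MM) steps (14) for $w$, and the dual ascent updates. For brevity it runs a fixed number of outer iterations instead of monitoring the residuals $r$ and $s$, and the operators are formed from dense arrays, which is only sensible for small $p$.

```python
import numpy as np

def L_op(w, p):
    """Laplacian operator: weights w (length p(p-1)/2) -> Laplacian matrix Lw."""
    W = np.zeros((p, p))
    W[np.triu_indices(p, k=1)] = w
    W = W + W.T
    return np.diag(W.sum(axis=1)) - W

def L_adj(M):
    """Adjoint Laplacian operator: matrix M -> vector, so that <Lw, M> = <w, L*(M)>."""
    p = M.shape[0]
    i, j = np.triu_indices(p, k=1)
    return M[i, i] + M[j, j] - M[i, j] - M[j, i]

def d_adj(y):
    """Adjoint degree operator: vector y -> vector, so that <dw, y> = <w, d*(y)>."""
    i, j = np.triu_indices(y.shape[0], k=1)
    return y[i] + y[j]

def student_t_graph(X, w0, d, nu=4.0, rho=1.0, n_iter=200, n_inner=5):
    """Sketch of Algorithm 1: ADMM for the Student-t connected-graph problem (4)."""
    n, p = X.shape
    w, Y, y = w0.copy(), np.zeros((p, p)), np.zeros(p)
    J = np.ones((p, p)) / p
    step = 1.0 / (2.0 * rho * (2 * p - 1))                 # learning rate in (14)
    for _ in range(n_iter):
        # Theta-update, Lemma 1: eigendecompose rho*(Lw + J) - Y
        gamma, U = np.linalg.eigh(rho * (L_op(w, p) + J) - Y)
        Theta = U @ np.diag((gamma + np.sqrt(gamma**2 + 4 * rho)) / (2 * rho)) @ U.T - J
        # w-update: a few projected-gradient (MM) steps, eq. (14)
        for _ in range(n_inner):
            Lw = L_op(w, p)
            quad = np.einsum("ij,jk,ik->i", X, Lw, X)      # x_i^T Lw x_i
            Sj = (X.T * ((p + nu) / (quad + nu))) @ X / n  # weighted covariance S^j
            a = L_adj(Sj - Y - rho * (Theta - Lw))         # eq. (10)
            b = d_adj(y - rho * (d - np.diag(Lw)))         # eq. (11)
            w = np.maximum(w - step * (a + b), 0.0)        # eq. (14)
        # dual ascent steps
        Y += rho * (Theta - L_op(w, p))
        y += rho * (np.diag(L_op(w, p)) - d)
    return L_op(w, p)

# toy usage on synthetic heavy-tailed data
rng = np.random.default_rng(0)
p = 6
X = rng.standard_t(df=4.0, size=(300, p))
w0 = np.full(p * (p - 1) // 2, 2.0 / p)
L_est = student_t_graph(X, w0, d=np.ones(p))
print(np.round(L_est, 2))
```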
3.2 An extension to k-component graphs

The graph learning formulation proposed in (4) is applicable to learning connected graphs. Learning graphs with k components, k assumed to be known, poses a considerably higher challenge, as the dimension of the nullspace of the Laplacian matrix $\mathcal{L}w$ is equal to the number of components of the graph [45]. One way to achieve this requirement is by imposing the constraint $\operatorname{rank}(\mathcal{L}w) = p - k$, which is nonconvex and nondifferentiable, in the maximum likelihood problem generated by (3). We instead relax this rank constraint by noting that, via Fan's theorem [46], we have

$$\sum_{i=1}^{k} \lambda_i\left(\mathcal{L}w\right) = \underset{V \in \mathbb{R}^{p \times k},\; V^{\top}V = I}{\operatorname{minimize}} \;\; \operatorname{tr}\left(V^{\top} \mathcal{L}w\, V\right). \tag{15}$$

Thus, by using the right-hand side of (15) as a regularization term, we are able to formulate the following optimization problem to learn a Student-t k-component graph:

$$\begin{array}{cl}
\underset{w \geq 0,\; \Theta \succeq 0,\; V}{\operatorname{minimize}} & \dfrac{p+\nu}{n} \displaystyle\sum_{i=1}^{n} \log\left(1 + \dfrac{x_i^{\top}\mathcal{L}w\, x_i}{\nu}\right) - \log{\det}^{*}(\Theta) + \eta\, \operatorname{tr}\left(\mathcal{L}w\, V V^{\top}\right), \\[1ex]
\text{subject to} & \Theta = \mathcal{L}w, \quad \operatorname{rank}(\Theta) = p - k, \quad \mathfrak{d}w = d, \quad V^{\top}V = I, \; V \in \mathbb{R}^{p \times k}.
\end{array} \tag{16}$$

The partial augmented Lagrangian function of Problem (16) can be expressed as

$$\begin{aligned}
L_{\rho}(\Theta, w, V, Y, y) = {} & \frac{p+\nu}{n} \sum_{i=1}^{n} \log\left(1 + \frac{x_i^{\top}\mathcal{L}w\, x_i}{\nu}\right) - \log{\det}^{*}(\Theta) + \eta\, \operatorname{tr}\left(\mathcal{L}w\, V V^{\top}\right) + \langle y, \mathfrak{d}w - d\rangle \\
& + \frac{\rho}{2}\left\|\mathfrak{d}w - d\right\|_2^2 + \langle Y, \Theta - \mathcal{L}w\rangle + \frac{\rho}{2}\left\|\Theta - \mathcal{L}w\right\|_F^2.
\end{aligned} \tag{17}$$

Given $w^{l}$ and $Y^{l}$, the subproblem for $\Theta$ can be written as

$$\Theta^{l+1} = \underset{\operatorname{rank}(\Theta) = p - k}{\arg\min} \;\; -\log{\det}^{*}(\Theta) + \langle \Theta, Y^{l}\rangle + \frac{\rho}{2}\left\|\Theta - \mathcal{L}w^{l}\right\|_F^2, \tag{18}$$

which is nearly the same as (6). Its solution is obtained as

$$\Theta^{l+1} = \frac{1}{2\rho}\, U\left(\Gamma + \sqrt{\Gamma^{2} + 4\rho I}\right) U^{\top}, \tag{19}$$

except that now $U \Gamma U^{\top}$ is the eigenvalue decomposition of $\rho\,\mathcal{L}w^{l} - Y^{l}$, with $\Gamma$ having the largest $p - k$ eigenvalues along its diagonal and $U \in \mathbb{R}^{p \times (p-k)}$ containing the corresponding eigenvectors.

The subproblem to obtain $w^{l+1}$ is virtually the same as (8), except for the additional linear term $\eta\, \operatorname{tr}\left(\mathcal{L}w\, V^{l} V^{l\top}\right)$. Hence, its update is also a projected gradient descent step, akin to (14), where

$$a^{j} = \mathcal{L}^{*}\left(S^{j} + \eta\, V^{l} V^{l\top} - Y^{l} - \rho\left(\Theta^{l+1} - \mathcal{L}w^{j}\right)\right). \tag{20}$$

Given $w^{l+1}$, we have the following subproblem for $V$:

$$\underset{V \in \mathbb{R}^{p \times k},\; V^{\top}V = I}{\operatorname{minimize}} \;\; \operatorname{tr}\left(V^{\top}\mathcal{L}w^{l+1} V\right), \tag{21}$$

whose closed-form solution is given by the $k$ eigenvectors associated with the $k$ smallest eigenvalues of $\mathcal{L}w^{l+1}$ [47, 48]. Algorithm 2 summarizes the implementation to find a stationary point of Problem (16), and its convergence is established through Theorem 4, whose proof is presented in the Supplementary Material.

Theorem 4 Algorithm 2 converges subsequentially for any sufficiently large $\rho$, that is, the sequence $\left(\Theta^{l}, w^{l}, V^{l}, Y^{l}, y^{l}\right)$ generated by Algorithm 2 has at least one limit point, and each limit point is a stationary point of Problem (16).

Algorithm 2: k-component Student-t graph learning
Data: data matrix $X \in \mathbb{R}^{n \times p}$, initial estimate of the graph weights $w^{0}$, number of graph components $k$, desired degree vector $d$, degrees of freedom $\nu$, rank hyperparameter $\eta > 0$, penalty parameter $\rho > 0$, tolerance $\epsilon > 0$
Result: Laplacian estimation $\mathcal{L}w$
  initialize $Y^{0} = 0$, $y^{0} = 0$
  while $\|r^{l}\|_F > \epsilon$ or $\|s^{l}\|_2 > \epsilon$ do
    update $\Theta^{l+1}$ via (19)
    update $w^{l+1}$ as in (14) with $a^{j}$ given in (20)
    update $V^{l+1}$ as in (21)
    update $Y^{l+1} = Y^{l} + \rho\left(\Theta^{l+1} - \mathcal{L}w^{l+1}\right)$
    update $y^{l+1} = y^{l} + \rho\left(\mathfrak{d}w^{l+1} - d\right)$
    compute the residuals $r^{l+1} = \Theta^{l+1} - \mathcal{L}w^{l+1}$ and $s^{l+1} = \mathfrak{d}w^{l+1} - d$
  end while

4 Experiments

To evaluate the performance of the proposed graph learning algorithms, we perform experiments using historical daily price time series, available from Yahoo! Finance, of financial instruments in three scenarios: (i) stocks belonging to the S&P500 index, (ii) foreign exchange markets, and (iii) cryptocurrencies. We start by constructing the log-returns data matrix, i.e., a matrix $X \in \mathbb{R}^{n \times p}$, where $n$ is the number of log-return observations and $p$ is the number of instruments, as

$$X_{i,j} = \log P_{i,j} - \log P_{i-1,j}, \tag{22}$$

where $P_{i,j}$ is the closing price of the $j$-th instrument on the $i$-th day.
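As a small practical note, the log-returns matrix in (22) can be built in a couple of lines. The sketch below is our own illustration; it assumes the closing prices are held in a pandas DataFrame indexed by date with one column per instrument, and the tickers and values are placeholders, not the paper's data.

```python
import numpy as np
import pandas as pd

def log_returns(prices: pd.DataFrame) -> pd.DataFrame:
    """Build the log-returns matrix X of (22): X[i, j] = log P[i, j] - log P[i-1, j].
    `prices` holds daily closing prices, rows indexed by date, one column per instrument."""
    return np.log(prices).diff().dropna()

# toy usage with made-up prices
prices = pd.DataFrame(
    {"AAA": [10.0, 10.2, 10.1, 10.4], "BBB": [55.0, 54.5, 55.3, 55.0]},
    index=pd.date_range("2020-01-01", periods=4, freq="B"),
)
X = log_returns(prices)
print(X)
```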
Benchmarks: We compare our proposed algorithms with state-of-the-art, Gaussian-distribution-based methods for connected graphs, namely GLE [7] and NGL [9], which use $\ell_1$-norm and minimax concave penalty regularizations, respectively; and with CLR [38] and SGL [29], which consider k-component graphs. For a fair comparison among algorithms, we set the degree vector $d$ equal to $\mathbf{1}$ for the proposed algorithms, i.e., we do not consider any prior information on the degrees of the nodes. In our ADMM algorithms, we set the penalty parameter to $\rho = 1$, and the hyperparameter $\eta$ in (16) is adaptively increased until the rank constraint is satisfied. For GLE and NGL, we use grid search on the sparsity hyperparameter such that the resulting graph yields the highest modularity value. The graph weights in Algorithms 1 and 2 are initialized using the same procedure as in [8].

Our goal with the experiments that follow is to verify whether the heavy-tail assumption provides an improved version of the learned graph, which is evaluated based on the modularity of the estimated graph (modularity measures the strength of separability of a graph into groups [49]) and on the graph visualization. In addition, for the task of clustering stocks, we analyze whether the learned graphs agree with the industry standard of sector classification set by the Global Industry Classification Standard (GICS) [50, 51].

4.1 Communities in S&P500 Stocks

In this experiment, we consider S&P500 stocks belonging to three sectors, namely Communication Services (red), Utilities (blue), and Real Estate (green), totalling p = 82 stocks, during the time horizon from Jan. 3rd, 2014 to Dec. 29th, 2017, resulting in n = 1006 observations. In order to obtain descriptive insights on this dataset, we measure its degree of heavy-tailedness and its annualized volatility. The former is obtained by fitting the degrees of freedom of a Student-t distribution to the matrix of log-returns, whereby we obtain ν ≈ 5.5 and σ ≈ 21%. This scenario can be considered as having a moderate amount of heavy-tailedness.

Figure 1 depicts the learned connected graphs for the aforementioned time period. It can be readily noticed that the graph learned with the Student-t distribution (Figure 1c) is sparser than those learned with the Gaussian assumption (Figures 1a and 1b), which results from the fact that the Gaussian distribution is more sensitive to outliers. Moreover, the Student-t graph presents a higher degree of interpretability, as measured by its modularity value. In addition, a larger number of inter-sector connections, as indicated by gray-colored edges, which are often spurious from a practical perspective, are present in the graphs learned by NGL and GLE. Sparsity regularization provides a means to remove edges between nodes in the presence of data with outliers, possibly increasing the modularity of the resulting graph. However, it brings the additional task of tuning hyperparameters, which is often repetitive and impractical for real-time applications. A cleaner graph, without the need for postprocessing or additional regularization, is obtained directly by using the Student-t assumption.

Figure 1: Learned graphs of S&P500 stocks. (a) GLE, modularity = 0.31. (b) NGL, modularity = 0.49. (c) Algorithm 1, modularity = 0.54.

Figure 2 illustrates the learned 3-component graphs during the period from Jan. 3rd, 2014 to Dec. 29th, 2017.
We can notice that SGL (Figure 2a) and CLR (Figure 2b) are unable to separate the stocks in a way that agrees with their sector information as given by GICS. (We note that stocks more often than not have impacts on multiple sectors, the evident case being technology companies, whose influence affects the prices of stocks not only in their own sector but across multiple sectors; therefore, GICS or other sector classification systems cannot be considered as absolute ground-truth labels.) In addition, the high number of spurious connections (gray-colored edges) is uncharacteristic of the actual expected behavior in stock markets. Figure 2c displays the graph learned by the proposed Algorithm 2, which presents not only a higher modularity value, but also a sparser, more plausible representation of an actual network of stocks with three sectors.

Figure 2: Learned 3-component graphs of S&P500 stocks. (a) SGL, modularity = 0.29. (b) CLR, modularity = 0.33. (c) Algorithm 2, modularity = 0.56.

4.1.1 Impact of COVID-19 on the US Stock Market

We collect price data of p = 85 stocks belonging to three sectors, namely Communication Services (red), Utilities (blue), and Real Estate (green), from Apr. 22nd, 2019 to Dec. 30th, 2020, resulting in n = 429 observations. The degrees of freedom and annualized volatility during this period, which includes the financial crisis caused by the COVID-19 pandemic, were ν ≈ 2.89 and σ ≈ 41% (the annualized volatility is computed as $\sigma = \sqrt{252}\,\frac{1}{p}\sum_{i=1}^{p}\sigma_i$, where $\sigma_i$ is the daily sample standard deviation of the $i$-th stock), indicating a strongly heavy-tailed scenario.

In Figure 3, we observe that the modularity values of the Gaussian-based graphs are severely reduced as compared to the results presented in Figure 1, while the opposite occurs with the Student-t graph. That may be explained by the increase in variability and outliers in the data during the financial crisis caused by the COVID-19 pandemic, which is better modelled by a Student-t distribution. In addition, the learned graph in Figure 3c clearly displays expected behaviors, such as the strong correlation between pairs of companies including {Zoom Video Communications Inc., Netflix Inc.} and {Twitter Inc., Facebook Inc.}, which are not as evident in Figures 3a and 3b.

Figure 3: Learned graphs of S&P500 stocks during the COVID-19 pandemic. (a) GLE, modularity = 0.23. (b) NGL, modularity = 0.46. (c) Algorithm 1, modularity = 0.56.

4.2 Communities in Foreign Exchange Markets

We query foreign exchange data for the 34 most traded currencies over the period from Jan. 2nd, 2019 to Dec. 31st, 2020, totalling n = 522 observations. The data matrix is composed of the log-returns of the currency prices with respect to the United States Dollar. Similarly to the previous experiment, we compute the degrees of freedom of a Student-t distribution fitted to the log-returns data matrix, whereby we obtain ν ≈ 4.6, which represents a scenario with a considerable amount of heavy-tailedness. Unlike in the experiment involving S&P500 stocks, there is no classification standard for currencies, hence we rely on a community detection algorithm [52] in order to create classes within the learned graph. In particular, the algorithm in [52] takes as input the learned Laplacian matrix of the graph and outputs a membership assignment that maximizes the modularity of the graph. Figure 4 displays the learned graphs.
As can be observed, the Student-t graph (Figure 4c) is sparser, more interpretable, and has a higher modularity value than the Gaussian-based graphs (Figures 4a and 4b). In addition, the expected correlations between currencies of locations geographically close to each other, e.g., {Hong Kong SAR, China}, {Taiwan, South Korea}, and {Poland, Czech Republic}, are significantly more evident in the Student-t graph.

Figure 4: Learned connected graphs of currencies. (a) GLE, modularity = 0.34. (b) NGL, modularity = 0.46. (c) Algorithm 1, modularity = 0.58.

Figure 5 depicts the learned 9-component graphs of currencies during the time window from Jan. 2nd, 2019 to Dec. 31st, 2020. It can be observed that the graph learned by the proposed Algorithm 2 (Figure 5c) presents a finer structure and a higher modularity value than those learned by SGL (Figure 5a) and CLR (Figure 5b). In addition, the learned graph in Figure 5c presents more reasonable clusters, such as {New Zealand, Australia} and {Poland, Czech Republic, Hungary}, which are not separated in the Gaussian-based graphs. More critically, we can observe that SGL and CLR allow the existence of isolated nodes in the learned graphs. In our proposed algorithm, we avoid such solutions by imposing linear constraints on the degrees of the graph.

Figure 5: Learned 9-component graphs of currencies. (a) SGL, modularity = 0.62. (b) CLR, modularity = 0.79. (c) Algorithm 2, modularity = 0.84.

4.3 Communities in Cryptocurrencies

We query daily prices of the p = 41 most traded cryptocurrencies during the period from Aug. 1st, 2017 to Dec. 1st, 2020, which amounts to n = 1218 observations. The degrees-of-freedom parameter during this time frame was measured as ν ≈ 3, which corresponds to a strongly heavy-tailed scenario.

Figure 6 shows the graphs learned by GLE, NGL, and our proposed Algorithm 1. While the graphs in Figures 6a and 6b present a small modularity value and contain a large number of edges, which impairs interpretability, the proposed graph in Figure 6c reveals a refined representation of the interactions between pairs of cryptocurrencies, which is possibly more aligned with the actual market scenario. As an example, the link between Bitcoin (BTC) and Litecoin (LTC), a Bitcoin spinoff established in 2011, is substantially more evident in Figure 6c.

Figure 6: Learned connected graphs of cryptocurrencies. (a) GLE, modularity = 0.19. (b) NGL, modularity = 0.40. (c) Algorithm 1, modularity = 0.52.

Figure 7 shows the learned 7-component graphs of cryptocurrencies during the aforementioned time window. As in the previous experiments with foreign exchange data, SGL shows isolated nodes in the learned graph (Figure 7a) and CLR contains a large number of spurious connections in the main cluster (Figure 7b), whereas the graph learned via Algorithm 2 (Figure 7c) has the largest modularity value. Interestingly, while all three methods agree in clustering {Dogecoin (DOGE), Verge (XVG), Siacoin (SC), DigiByte (DGB)}, which may be related to their similar initial release dates, only our proposed algorithm clusters together the coins that mainly focus on privacy and anonymity features, i.e., {Monero (XMR), Zcash (ZEC), DASH}.

Figure 7: Learned 7-component graphs of cryptocurrencies. (a) SGL, modularity = 0.36. (b) CLR, modularity = 0.66. (c) Algorithm 2, modularity = 0.79.
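The community assignments used for the currency and cryptocurrency experiments, and the modularity values reported throughout this section, can be obtained with off-the-shelf tools. The sketch below is our own illustration; the paper does not specify an implementation, so the use of networkx here is an assumption. It recovers the weighted adjacency matrix from a learned Laplacian as $W = \operatorname{Diag}(\operatorname{diag}(\mathcal{L}w)) - \mathcal{L}w$, runs the greedy algorithm of [52], and evaluates the weighted modularity [49].

```python
import numpy as np
import networkx as nx
from networkx.algorithms import community

def communities_and_modularity(L):
    """Given a learned Laplacian matrix L, recover the weighted adjacency matrix
    W = Diag(diag(L)) - L, detect communities with the greedy algorithm of
    Clauset-Newman-Moore [52], and return them with the weighted modularity [49]."""
    W = np.diag(np.diag(L)) - L                      # nonnegative off-diagonal weights
    G = nx.from_numpy_array(W)                       # edge weights stored as 'weight'
    parts = community.greedy_modularity_communities(G, weight="weight")
    Q = community.modularity(G, parts, weight="weight")
    return [sorted(c) for c in parts], Q

# toy usage: Laplacian of two triangles joined by one weak edge
W = np.zeros((6, 6))
W[0, 1] = W[1, 2] = W[0, 2] = 1.0                    # first group
W[3, 4] = W[4, 5] = W[3, 5] = 1.0                    # second group
W[2, 3] = 0.1                                        # weak inter-group link
W = W + W.T
L = np.diag(W.sum(axis=1)) - W
print(communities_and_modularity(L))
```

Since greedy_modularity_communities implements the agglomerative scheme of [52], the membership assignment it returns is the one that greedily maximizes modularity, matching the procedure described above for the currency graphs.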
4.4 Effect of Algorithm Initialization

Initialization is a critical phase in any iterative algorithm, especially when dealing with nonconvex optimization problems involving nonconvex constraints, as is the case for Algorithm 2. To verify the stability of Algorithm 2, we initialize it with a fully connected graph with equal weights, i.e., $w^{0} \propto \mathbf{1}$. Figure 8 illustrates that there is no significant difference between the graphs estimated with the uniform initial point and the ones reported in the previous sections. This result provides evidence that Algorithm 2 is robust against inaccurate initializations.

Figure 8: Graphs learned by Algorithm 2 with a uniform initial point. (a) modularity = 0.56. (b) modularity = 0.80. (c) modularity = 0.78.

5 Conclusions

Heavy tails are prevalent in financial market time series, yet they have been little explored in the context of Laplacian graphical models. In this paper, we have proposed optimization programs to learn graphical models with Laplacian constraints assuming that the data generating process is Student-t distributed. The formulations follow a maximum likelihood approach for a Markov random field, for which we designed ADMM algorithms that converge to a stationary point of the resulting nonconvex problems. The proposed algorithms showed significant gains, measured via the modularity values of the estimated graphs, when compared to state-of-the-art counterparts in real-world scenarios involving data from the US stock market, foreign exchange markets, and cryptocurrencies.

Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments. This work was supported by the Hong Kong GRF 16207820 research grant.

References

[1] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432–441, 2008.
[2] B. M. Lake and J. B. Tenenbaum. Discovering structure by learning sparse graphs. In Proceedings of the 33rd Annual Cognitive Science Conference, 2010.
[3] M. Slawski and M. Hein. Estimation of positive definite M-matrices and structure learning for attractive Gaussian Markov random fields. Linear Algebra and its Applications, 473:145–179, 2015.
[4] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst. Learning Laplacian matrix in smooth graph signal representations. IEEE Transactions on Signal Processing, 64(23):6160–6173, 2016.
[5] V. Kalofolias. How to learn a graph from smooth signals. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51, pages 920–929, 2016.
[6] H. E. Egilmez, E. Pavez, and A. Ortega. Graph learning from data under Laplacian and structural constraints. IEEE Journal of Selected Topics in Signal Processing, 11(6):825–841, 2017.
[7] L. Zhao, Y. Wang, S. Kumar, and D. P. Palomar. Optimization algorithms for graph Laplacian estimation via ADMM and MM. IEEE Transactions on Signal Processing, 67(16):4231–4244, 2019.
[8] S. Kumar, J. Ying, J. V. de M. Cardoso, and D. P. Palomar. A unified framework for structured graph learning via spectral constraints. Journal of Machine Learning Research, 21:1–60, 2020.
[9] J. Ying, J. V. de M. Cardoso, and D. P. Palomar. Nonconvex sparse graph learning under Laplacian-structured graphical model. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[10] Stephen M. Smith, Karla L. Miller, Gholamreza Salimi-Khorshidi, Matthew Webster, Christian F. Beckmann, Thomas E. Nichols, Joseph D. Ramsey, and Mark W. Woolrich. Network modelling methods for fMRI.
NeuroImage, 54(2):875–891, 2011.
[11] S. Epskamp, D. Borsboom, and E. I. Fried. Estimating psychological networks and their accuracy: A tutorial paper. Behavior Research Methods, 50:195–212, 2018.
[12] O. Stegle, S. A. Teichmann, and J. C. Marioni. Computational and analytical challenges in single-cell transcriptomics. Nature Reviews Genetics, 16:133–145, 2015.
[13] Michael Finegold and Mathias Drton. Robust graphical modeling with t-distributions. In Conference on Uncertainty in Artificial Intelligence, 2009.
[14] C. Gourieroux and A. Monfort. Time Series and Dynamic Models. Themes in Modern Econometrics. Cambridge University Press, 1997.
[15] R. Cont. Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance, 1:223–236, 2001.
[16] R. S. Tsay. Analysis of Financial Time Series. Wiley, 3rd edition, 2010.
[17] A. C. Harvey. Dynamic Models for Volatility and Heavy Tails: With Applications to Financial and Economic Time Series. Cambridge University Press, 2013.
[18] Y. Feng and D. Palomar. A signal processing perspective on financial engineering. Foundations and Trends in Signal Processing, 9:1–231, 2015.
[19] Nassim Dehouche. Scale matters: The daily, weekly and monthly volatility and predictability of Bitcoin, gold, and the S&P 500, 2021.
[20] G. Marti, F. Nielsen, M. Bińkowski, and P. Donnat. A review of two decades of correlations, hierarchies, networks and clustering in financial markets. arXiv:1703.00485, 2017.
[21] R. N. Mantegna. Hierarchical structure in financial markets. The European Physical Journal B, 11(1):193–197, 1999.
[22] C. Dose and S. Cincotti. Clustering of financial time series with application to index and enhanced index tracking portfolio. Physica A: Statistical Mechanics and its Applications, 355(1):145–151, 2005.
[23] G. Marti, S. Andler, F. Nielsen, and P. Donnat. Clustering financial time series: How long is enough? In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016.
[24] G. Marti, F. Nielsen, P. Donnat, and S. Andler. On clustering financial time series: a need for distances between dependent random variables. Computational Information Geometry, pages 149–174, 2017.
[25] G. Carlsson and F. Mémoli. Characterization, stability and convergence of hierarchical clustering methods. Journal of Machine Learning Research, 11:1425–1470, 2010.
[26] V. Lemieux, P. S. Rahmdel, R. Walker, B. L. W. Wong, and M. Flood. Clustering techniques and their effect on portfolio formation and risk analysis. In Proceedings of the International Workshop on Data Science for Macro-Modeling, pages 1–6, 2014.
[27] G. Marti, P. Very, P. Donnat, and F. Nielsen. A proposal of a methodological framework with experimental guidelines to investigate clustering stability on financial time series. In 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pages 32–37, 2015.
[28] P. Donnat, G. Marti, and P. Very. Toward a generic representation of random variables for machine learning. Pattern Recognition Letters, 70:24–31, 2016.
[29] S. Kumar, J. Ying, J. V. de M. Cardoso, and D. P. Palomar. Structured graph learning via Laplacian spectral constraints. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[30] J. Ying, J. V. de M. Cardoso, and D. P. Palomar. Does the ℓ1-norm learn a sparse graph under Laplacian constrained graphical models? arXiv e-prints, arXiv:2006.14925, June 2020.
[31] J. Ying, J. V. M. Cardoso, and D. P. Palomar.
Minimax estimation of Laplacian constrained precision matrices. In 24th International Conference on Artificial Intelligence and Statistics (AISTATS), 2021.
[32] O. Knill. Cauchy-Binet for pseudo-determinants. Linear Algebra and its Applications, 459:522–547, 2014.
[33] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
[34] S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151:3–34, 2015.
[35] David Hunter and Kenneth Lange. A tutorial on MM algorithms. The American Statistician, 58(1):30–37, 2004.
[36] Y. Sun, P. Babu, and D. P. Palomar. Majorization-minimization algorithms in signal processing, communications, and machine learning. IEEE Transactions on Signal Processing, 65(3):794–816, 2017.
[37] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[38] F. Nie, X. Wang, M. I. Jordan, and H. Huang. The constrained Laplacian rank algorithm for graph-based clustering. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI-16, pages 1969–1976, 2016.
[39] Y. Wald, N. Noy, G. Elidan, and A. Wiesel. Globally optimal learning for structured elliptical losses. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[40] S. I. Resnick. Heavy-Tail Phenomena: Probabilistic and Statistical Modeling. Springer-Verlag New York, 2007.
[41] H. Rue and L. Held. Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall/CRC, 2005.
[42] Emmanuel J. Candès, Michael B. Wakin, and Stephen P. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, 2008.
[43] D. M. Witten and R. Tibshirani. Covariance-regularized regression and classification for high dimensional problems. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 71(3):615–636, 2009.
[44] P. Danaher, P. Wang, and D. M. Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society, Series B, 76(2):373–397, 2014.
[45] F. R. K. Chung. Spectral Graph Theory, volume 92. CBMS Regional Conference Series in Mathematics, 1997.
[46] K. Fan. On a theorem of Weyl concerning eigenvalues of linear transformations I. Proceedings of the National Academy of Sciences, 35(11):652–655, 1949.
[47] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
[48] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, 2007.
[49] M. E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences of the United States of America, 103, 2006.
[50] Standard & Poor's. Global Industry Classification Standard (GICS). Technical report, 2006.
[51] Morgan Stanley Capital International and S&P Dow Jones. Revisions to the Global Industry Classification Standard (GICS) structure, 2018.
[52] A. Clauset, M. E. J. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, 70, Dec 2004.