# eTREE: Learning Tree-structured Embeddings

Faisal M. Almutairi (1,2), Yunlong Wang (2), Dong Wang (2), Emily Zhao (2), and Nicholas D. Sidiropoulos (3)
(1) Department of ECE, University of Minnesota, MN, USA
(2) Advanced Analytics, IQVIA Inc., Plymouth Meeting, PA, USA
(3) Department of ECE, University of Virginia, VA, USA
almut012@umn.edu, {yunlong.wang, dong.wang, emily.zhao}@iqvia.com, nikos@virginia.edu

Matrix factorization (MF) plays an important role in a wide range of machine learning and data mining models. MF is commonly used to obtain item embeddings and feature representations due to its ability to capture correlations and higher-order statistical dependencies across dimensions. In many applications, the categories of items exhibit a hierarchical tree structure. For instance, human diseases can be divided into coarse categories, e.g., bacterial and viral. These categories can be further divided into finer categories, e.g., viral infections can be respiratory, gastrointestinal, and exanthematous viral diseases. In e-commerce, products, movies, books, etc., are grouped into hierarchical categories, e.g., clothing items are divided by gender, then by type (formal, casual, etc.). While the tree structure and the categories of the different items may be known in some applications, they have to be learned together with the embeddings in many others. In this work, we propose eTREE, a model that incorporates the (usually ignored) tree structure to enhance the quality of the embeddings. We leverage the special uniqueness properties of nonnegative MF (NMF) to prove identifiability of eTREE. The proposed model not only exploits the tree structure prior, but also learns the hierarchical clustering in an unsupervised, data-driven fashion. We derive an efficient algorithmic solution and a scalable implementation of eTREE that exploits parallel computing, computation caching, and warm-start strategies. We showcase the effectiveness of eTREE on real data from various application domains: healthcare, recommender systems, and education. We also demonstrate the meaningfulness of the tree obtained from eTREE by means of domain expert interpretation.

## Introduction

Matrix factorization (MF) plays an important role in a wide range of machine learning models, for various applications such as dimensionality reduction and embedding. A popular task is matrix completion, where the goal is to infer the unknown/missing matrix entries from the observed ones. A common approach is to employ MF to find a reduced-dimension representation (embedding) of each element corresponding to the matrix dimensions (e.g., users and items). These embeddings capture the essential information due to the ability of MF to capture correlations and higher-order statistical dependencies across dimensions. The entry corresponding to the $i$th user and $j$th item can be inferred by the inner product of their embeddings. Matrix completion finds a wide range of applications including collaborative filtering in recommender systems (Koren 2008), disease and treatment prediction and patient subtyping (Wang et al. 2019) in healthcare analytics, student performance prediction and course recommendation in learning analytics (Almutairi, Sidiropoulos, and Karypis 2017), and image processing (Liu et al. 2015).

Figure 1: An example of hierarchical movie categories.
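For illustration, the following toy sketch (with made-up embeddings, not results from the paper) shows how a missing entry is estimated as the inner product of the corresponding user and item embeddings:

```python
import numpy as np

# Toy user and item embeddings (R = 3); in practice these come from factorizing X.
A = np.array([[0.9, 0.1, 0.3],    # embedding of user 0
              [0.2, 0.8, 0.5]])   # embedding of user 1
B = np.array([[1.0, 0.0, 0.4],    # embedding of item 0
              [0.1, 0.7, 0.9]])   # embedding of item 1

# The completed matrix: entry (i, j) is the inner product of user i's and item j's embeddings.
X_hat = A @ B.T
print(X_hat[0, 1])  # prediction for user 0, item 1
```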
Incorporating side contextual information or priors, e.g., sparsity (Hoyer 2004), smoothness, and latent clustering (Yang, Fu, and Sidiropoulos 2016), is well motivated in matrix factorization and completion of sparse data. This is because a major challenge stems from the fact that we aim to find latent representations from very few samples. In this paper, we present a principled approach that incorporates the unknown implicit tree structure as a prior. In many applications, categories of items display a hierarchical tree structure. In higher education, for instance, courses form multiple trees via their prerequisite hierarchy. Movie genres, e.g., comedy, action, and fantasy, comprise different fine subcategories, as illustrated in the example in Fig. 1. Another example appears in Electronic Health Records (EHR) in healthcare analytics, where medical services (diagnoses, procedures, and prescriptions) can be clustered into subcategories, and these subcategories can also be grouped into coarse categories (examples are provided in the experimental results in Fig. 3). Individuals, e.g., users, students, and patients, also exhibit hierarchical clusters, where the common traits between people increase as we move down from the root nodes to the leaf nodes of a tree (Maleszka, Mianowska, and Nguyen 2013; Wang, Pan, and Xu 2014). In many applications, the categorical hierarchy is either unknown or requires manual labeling of massive amounts of data.

Incorporating tree structures in machine learning models has recently been considered, mostly in recommender systems (Nikolakopoulos, Kouneli, and Garofalakis 2015; Li et al. 2019; Zhong, Fan, and Yang 2012), and also in other applications such as image processing (Fan and Cheng 2018), clustering and classification (Trigeorgis et al. 2016), and natural language processing (NLP) (Shen et al. 2018). For example, the recommender system model in (Yang et al. 2016) penalizes MF with the distance between users who share common traits based on hierarchically organized features. In another MF model (Sun et al. 2017), the item embeddings are assumed to form a tree, where each leaf node represents a single item and the parent nodes contain subsets of items (categories). The embeddings of parent nodes and leaf nodes are learned jointly. The final item feature vector is modeled as a weighted sum of its embedding and those of the categories it belongs to. Regularizing MF with a pre-defined tree prior has also been explored in response prediction for online advertising (Menon et al. 2011). In this example, the tree groups the set of ads according to their campaigns, and the campaigns are further grouped based on the advertisers running them. All the aforementioned methods assume that the tree structure is known a priori, or learned separately via side information. Recently, (Wang et al. 2018, 2015) proposed to capture the unknown implicit tree structure via a model based on nonnegative matrix factorization (NMF). In a three-layer tree, the embedding of a leaf node (item/user) is assumed to be a linear combination of all the parent nodes (subcategories) in the intermediate layer, and each subcategory is a linear combination of all the categories in the root nodes. The weights that determine the memberships of a child node to the parent nodes are nonnegative and learned by the model. This results in a fully connected tree, thus a clear tree clustering cannot be obtained.
Moreover, (Wang et al. 2018, 2015) impose the implicit tree as a hard constraint, which can be restrictive if the data do not exactly follow the imposed prior.

In this paper, we propose eTREE (learning Tree-structured Embeddings), a framework that integrates the unknown implicit tree structure into a low-rank nonnegative factorization model to improve the quality of embeddings. eTREE does not require any extra information and jointly learns: i) the embeddings of all the tree nodes (items, subcategories, and main categories), and ii) the tree clustering in an unsupervised fashion. Unlike (Wang et al. 2015, 2018), the obtained tree provides clear hierarchical clusters, as each node belongs to exactly one parent node, e.g., an item belongs to one subcategory, and a subcategory belongs to one main category. The formulation of eTREE handles partially observed data matrices, which appear often in real-world applications. We derive an efficient algorithm to compute eTREE with a scalable implementation that leverages parallel computing, computation caching, and warm-start strategies. Our contributions can be summarized as follows:
1. Formulation: eTREE provides an intuitive formulation that: i) exploits the tree structure, and ii) learns the hierarchical clustering in an unsupervised data-driven fashion.
2. Identifiability: We leverage the special uniqueness properties of NMF to prove identifiability of eTREE.
3. Effectiveness: eTREE significantly improves the quality of the embeddings in terms of matrix completion error on data from recommender systems, healthcare, and education.
4. Interpretability: We demonstrate the meaningfulness of the tree clusters learned by eTREE using real data interpreted by domain experts.

## Background

In this section, we provide the background needed before presenting eTREE. We review NMF and related recent identifiability results, followed by a brief background on the alternating direction method of multipliers (ADMM).

Notation: $x$, $\mathbf{x}$, and $\mathbf{X}$ denote scalars, vectors, and matrices, respectively; $\mathbf{X}(i,:)$ ($\mathbf{X}(:,j)$) refers to the $i$th row ($j$th column) of $\mathbf{X}$; $\mathbf{X}(\mathcal{J},:)$ denotes the rows of $\mathbf{X}$ indexed by the set $\mathcal{J}$. The product $\odot$ is the Hadamard (element-wise) product.

### Non-negative Matrix Factorization

Assume we have a healthcare data matrix $\mathbf{X}\in\mathbb{R}^{N\times M}$ indexed by (patient, medical service), where $\mathbf{X}(i,j)$ denotes the number of times the $i$th patient has received the $j$th service. In other applications, $\mathbf{X}$ may contain the ratings given by users to items, or the grades received by students in their courses. In some parts of this paper, we refer to patients, users, students, etc. as individuals, and to medical services, products, etc. as items. NMF models aim to decompose the data matrix into low-rank latent factor matrices as $\mathbf{X} = \mathbf{A}\mathbf{B}^T$, where $\mathbf{A}\in\mathbb{R}^{N\times R}$ and $\mathbf{B}\in\mathbb{R}^{M\times R}$ only have nonnegative values, and $R \leq \min(N, M)$ is the matrix rank. NMF has gained considerable attention as it tends to produce interpretable representations. For instance, it has been shown that the columns of $\mathbf{A}$ produce clear parts of human faces (e.g., nose, ears, and eyes) when NMF is applied to a matrix $\mathbf{X}$ whose columns are vectorized face images (Lee and Seung 1999). In practice, NMF is often formulated as a bilinear optimization problem:

$$\min_{\mathbf{A}\geq 0,\ \mathbf{B}\geq 0}\ F(\mathbf{A},\mathbf{B}) := \frac{1}{2}\left\|\mathbf{W}\odot(\mathbf{X}-\mathbf{A}\mathbf{B}^T)\right\|_F^2 \qquad (1)$$

where $\mathbf{W}\in\{0,1\}^{N\times M}$ has ones at the indices of the observed entries in $\mathbf{X}$, and zeros otherwise. Each row of $\mathbf{A}$ corresponds to the embedding/latent representation of the corresponding individual, whereas the rows of $\mathbf{B}$ are the embeddings of the items.
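For concreteness, the following short sketch evaluates the masked NMF cost in (1); the variable names mirror the notation above, and the data are randomly generated for illustration only (this is not the authors' implementation):

```python
import numpy as np

def nmf_objective(X, W, A, B):
    """Masked NMF cost in (1): 0.5 * ||W ⊙ (X - A Bᵀ)||_F^2,
    where W is 1 on observed entries and 0 elsewhere."""
    residual = W * (X - A @ B.T)
    return 0.5 * np.sum(residual ** 2)

# Toy example: 4 individuals, 5 items, rank R = 2.
rng = np.random.default_rng(0)
X = rng.random((4, 5))
W = (rng.random((4, 5)) < 0.7).astype(float)   # ~70% of entries observed
A = np.abs(rng.standard_normal((4, 2)))        # nonnegative embeddings of individuals
B = np.abs(rng.standard_normal((5, 2)))        # nonnegative embeddings of items
print(nmf_objective(X, W, A, B))
```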
Identifiability of NMF: The interpretability of NMF is intimately related to its uniqueness properties: the latent factors are identifiable under some conditions (up to trivial ambiguities, e.g., scaling/counter-scaling or permutation). To facilitate our discussion of the uniqueness of eTREE, we present the following definitions and established identifiability results.

Definition 1 (Identifiability). The NMF of $\mathbf{X} = \mathbf{A}\mathbf{B}^T$ is said to be (essentially) unique if $\mathbf{X} = \tilde{\mathbf{A}}\tilde{\mathbf{B}}^T$ implies $\tilde{\mathbf{A}} = \mathbf{A}\boldsymbol{\Pi}\mathbf{D}$ and $\tilde{\mathbf{B}} = \mathbf{B}\boldsymbol{\Pi}\mathbf{D}^{-1}$, where $\boldsymbol{\Pi}$ is a permutation matrix and $\mathbf{D}$ is a positive diagonal matrix.

Definition 2 (Sufficiently Scattered). A nonnegative matrix $\mathbf{B}\in\mathbb{R}^{M\times R}$ is said to be sufficiently scattered if: 1) $\mathcal{C}\subseteq\operatorname{cone}\{\mathbf{B}^T\}$, and 2) $\operatorname{cone}\{\mathbf{B}^T\}^*\cap\operatorname{bd}\{\mathcal{C}^*\} = \{\lambda\mathbf{e}_k\,|\,\lambda\geq 0,\ k=1,\ldots,R\}$, where $\mathcal{C} = \{\mathbf{x}\,|\,\mathbf{x}^T\mathbf{1}\geq\sqrt{R-1}\,\|\mathbf{x}\|_2\}$, $\mathcal{C}^* = \{\mathbf{x}\,|\,\mathbf{x}^T\mathbf{1}\geq\|\mathbf{x}\|_2\}$, $\operatorname{cone}\{\mathbf{B}^T\} = \{\mathbf{x}\,|\,\mathbf{x} = \mathbf{B}^T\boldsymbol{\theta},\ \boldsymbol{\theta}\geq 0\}$ and $\operatorname{cone}\{\mathbf{B}^T\}^* = \{\mathbf{y}\,|\,\mathbf{x}^T\mathbf{y}\geq 0,\ \forall\,\mathbf{x}\in\operatorname{cone}\{\mathbf{B}^T\}\}$ are the conic hull of $\mathbf{B}^T$ and its dual cone, respectively, and $\operatorname{bd}$ denotes the boundary of a set.

The works in (Fu et al. 2015; Lin et al. 2015) prove that the so-called volume minimization (VolMin) criterion can identify the factor matrices if $\mathbf{A}$ is full-column rank, and the rows of $\mathbf{B}$ are sufficiently scattered (Definition 2) and sum to one (row stochastic). Recently, Fu et al. shifted the row-stochastic condition from the rows of $\mathbf{B}$ to the columns of $\mathbf{B}$.

Theorem 1 (NMF Identifiability) (Fu, Huang, and Sidiropoulos 2018). $\mathbf{A}$ and $\mathbf{B}$ are essentially unique under the criterion of minimizing $\det(\mathbf{A}^T\mathbf{A})$ w.r.t. $\mathbf{A}\in\mathbb{R}^{N\times R}$ and $\mathbf{B}\in\mathbb{R}^{M\times R}$, subject to $\mathbf{X} = \mathbf{A}\mathbf{B}^T$, $\mathbf{B}^T\mathbf{1} = \mathbf{1}$, and $\mathbf{B}\geq 0$, if $\mathbf{B}$ is sufficiently scattered and $\operatorname{rank}(\mathbf{X}) = \operatorname{rank}(\mathbf{A}) = R$.

Theorem 1 provides an intriguing generalization of NMF, as it pertains to a more general factorization. Note that $\mathbf{A}$ is not restricted to be nonnegative. Also note that the column-sum-to-one constraint on $\mathbf{B}$ is without loss of generality, as one can always assume the columns of $\mathbf{B}$ are scaled by a diagonal matrix $\mathbf{D}$, and compensate for this scaling in the columns of $\mathbf{A}$, i.e., $\mathbf{X} = (\mathbf{A}\mathbf{D}^{-1})(\mathbf{B}\mathbf{D})^T$.

### Alternating Direction Method of Multipliers

ADMM is a primal-dual algorithm that solves convex optimization problems of the form

$$\min_{\mathbf{x},\mathbf{z}}\ f(\mathbf{x}) + g(\mathbf{z}) \quad \text{s.t.}\quad \mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z} = \mathbf{c} \qquad (2)$$

by iterating the following updates:

$$\begin{aligned}
\mathbf{x} &\leftarrow \arg\min_{\mathbf{x}}\ f(\mathbf{x}) + \frac{\rho}{2}\|\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z} - \mathbf{c} + \mathbf{u}\|_2^2 \\
\mathbf{z} &\leftarrow \arg\min_{\mathbf{z}}\ g(\mathbf{z}) + \frac{\rho}{2}\|\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z} - \mathbf{c} + \mathbf{u}\|_2^2 \\
\mathbf{u} &\leftarrow \mathbf{u} + (\mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z} - \mathbf{c})
\end{aligned} \qquad (3)$$

where $\mathbf{u}$ is a scaled version of the dual variable, and $\rho > 0$ is a Lagrangian parameter. A comprehensive review of ADMM can be found in (Boyd, Parikh, and Chu 2011).
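To make the ADMM template in (2)-(3) concrete, the sketch below instantiates the scaled-form updates for a small nonnegative least-squares problem (our illustrative choice, not a step of eTREE itself): $f$ is the quadratic fit, $g$ is the indicator of the nonnegative orthant, and the constraint is $\mathbf{x} - \mathbf{z} = \mathbf{0}$.

```python
import numpy as np

def admm_nnls(M, y, rho=1.0, n_iter=100):
    """Scaled-form ADMM for min 0.5*||M x - y||^2 s.t. x >= 0,
    written as f(x) + g(z) with the constraint x - z = 0 as in (2)."""
    n = M.shape[1]
    z = np.zeros(n)
    u = np.zeros(n)                      # scaled dual variable
    G = M.T @ M + rho * np.eye(n)        # fixed across iterations, could be cached/factorized
    for _ in range(n_iter):
        x = np.linalg.solve(G, M.T @ y + rho * (z - u))   # x-update: regularized least squares
        z = np.maximum(x + u, 0.0)                        # z-update: projection onto the nonnegative orthant
        u = u + x - z                                     # dual update
    return z

M = np.array([[1.0, 2.0], [3.0, 1.0], [0.5, 0.5]])
y = np.array([1.0, 2.0, 0.3])
print(admm_nnls(M, y))
```

Note that the Gram matrix `G` does not change across iterations; this is the same kind of reuse that the scalable implementation of eTREE exploits later via computation caching.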
## Proposed Framework: eTREE

In this section, we present our proposed framework. We start with the mathematical formulation of eTREE, then discuss the theoretical uniqueness of the proposed model. Next, we work out some design considerations of eTREE, and then derive the algorithmic solution.

### eTREE: Formulation

In many applications, categories of items exhibit a hierarchical tree structure, as we showed in the introduction. For ease of notation, let us denote the embedding matrix of items resulting from NMF in (1) as $\mathbf{B}_1\in\mathbb{R}^{M_1\times R}$, where $M_1$ is the number of items. Assume that the embeddings of the $M_1$ items (rows of $\mathbf{B}_1$) are the leaf nodes at the very bottom layer of a tree. A subset of items that belong to the same category is grouped together via one parent node, where the parent node is the embedding of the corresponding category. Assuming that the embeddings are fully inherited (replicated verbatim) from one's parent category, we can further decompose $\mathbf{B}_1$ into

$$\mathbf{B}_1 = \mathbf{S}_1\mathbf{B}_2 \qquad (4)$$

where each row of $\mathbf{B}_2\in\mathbb{R}^{M_2\times R}$ is the embedding of one category, $M_2$ is the number of categories with $M_2 \leq M_1$, and $\mathbf{S}_1\in\{0,1\}^{M_1\times M_2}$ with $\|\mathbf{S}_1(i,:)\|_0 = 1,\ \forall i\in[M_1]$, i.e., the values in $\mathbf{S}_1$ are binary with only one 1 per row, to ensure that each item belongs to exactly one category. Note that $M_2$ is the number of parent nodes (categories) in the second-from-bottom layer. The $M_2$ categories can be grouped into coarser categories, i.e., we decompose $\mathbf{B}_2$ into $\mathbf{B}_2 = \mathbf{S}_2\mathbf{B}_3$, where the rows of $\mathbf{B}_3\in\mathbb{R}^{M_3\times R}$ represent the embeddings of the coarse categories, and $\mathbf{S}_2\in\{0,1\}^{M_2\times M_3}$ maps the $M_2$ fine-level categories to the $M_3$ coarse-level categories in the same fashion as $\mathbf{S}_1$. Up to here, we have constructed a three-layer tree, and we can use the same concept to create a $Q$-layer tree. Fig. 2 illustrates the mapping between $\mathbf{B}_1$ and $\mathbf{B}_2$ in matrix notation (left), and shows a 3-layer tree (right).

Figure 2: Illustration of the tree prior in eTREE.

Generalizing to $Q$ layers, we obtain

$$\mathbf{B}_1 = \mathbf{S}_1\mathbf{S}_2\cdots\mathbf{S}_{Q-1}\mathbf{B}_Q \qquad (5)$$

Substituting the embedding matrix of items in (1) with the right-hand term in (5) may seem natural; however, there is a solution ambiguity in the cases where $Q > 2$. To see this, the route $\mathbf{B}_1(1,:)\to\mathbf{B}_2(2,:)\to\mathbf{B}_3(1,:)$ in Fig. 2 (right) would give the same cost value as $\mathbf{B}_1(1,:)\to\mathbf{B}_2(1,:)\to\mathbf{B}_3(1,:)$ (the dotted gray arrow). Moreover, imposing the tree structure as a hard constraint can be too intrusive when the data do not exactly follow the assumed prior. Thus, we propose to: i) incorporate the tree prior as a soft constraint, and ii) explicitly solve for the embeddings of the intermediate layers to resolve the (immaterial for our purposes) solution ambiguity. This yields the following formulation:

$$\begin{aligned}
\min_{\mathcal{U}}\quad & F(\mathbf{A},\mathbf{B}_1) + \frac{\mu}{2}\sum_{q=1}^{Q-1}\|\mathbf{B}_q - \mathbf{S}_q\mathbf{B}_{q+1}\|_F^2 \\
\text{s.t.}\quad & \mathbf{S}_q\in\{0,1\}^{M_q\times M_{q+1}},\quad q\in[Q-1] \\
& \|\mathbf{S}_q(i,:)\|_0 = 1,\quad \forall i\in[M_q],\ q\in[Q-1] \\
& \mathbf{A}\geq 0,\quad \mathbf{B}_1\geq 0
\end{aligned} \qquad (6)$$

where $\mathcal{U} := \{\mathbf{A}, \{\mathbf{B}_q\}_{q=1}^{Q}, \{\mathbf{S}_q\}_{q=1}^{Q-1}\}$ is the set of all variables, and $F$ is the NMF cost function defined in (1) with $\mathbf{B}_1$ as the embedding matrix of items. The second term minimizes the difference between the embedding of each child node and that of its parent node in the tree structure. In other words, it minimizes the difference between the embeddings of each item and its category, or between each fine category and its coarse category. $\mu\geq 0$ is a regularization parameter that balances the data fidelity and the tree prior.

There is an intriguing connection between the proposed tree regularizer and the k-means formulation. The variables $\{\mathbf{S}_q\}_{q=1}^{Q-1}$ are equivalent to the assignment variables in k-means for clustering the rows of $\{\mathbf{B}_q\}_{q=1}^{Q-1}$, respectively, and $\mathbf{B}_Q$ is equivalent to the centroid variable in k-means for clustering the rows of $\mathbf{B}_{Q-1}$. On the other hand, each variable in $\{\mathbf{B}_q\}_{q=2}^{Q-1}$ is involved in two terms: 1) $\|\mathbf{B}_{q-1} - \mathbf{S}_{q-1}\mathbf{B}_q\|_F^2$, where its rows are centroids, and 2) $\|\mathbf{B}_q - \mathbf{S}_q\mathbf{B}_{q+1}\|_F^2$, where its rows are the points to be clustered. This can be thought of as a regularized k-means. Interestingly, the joint NMF and latent k-means model in (Yang, Fu, and Sidiropoulos 2016) is a special case of eTREE with $Q = 2$.
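To make the tree prior in (6) concrete, the following sketch evaluates the regularizer $\sum_{q}\|\mathbf{B}_q - \mathbf{S}_q\mathbf{B}_{q+1}\|_F^2$ on a hypothetical 3-layer tree with single-parent assignments (toy data, for illustration only):

```python
import numpy as np

def tree_penalty(B_list, S_list):
    """Sum of ||B_q - S_q B_{q+1}||_F^2 over the tree layers (second term in (6))."""
    return sum(np.sum((B - S @ B_next) ** 2)
               for B, S, B_next in zip(B_list[:-1], S_list, B_list[1:]))

R = 2
# 3-layer toy tree: 6 items -> 3 subcategories -> 2 main categories.
B3 = np.array([[1.0, 0.0],
               [0.0, 1.0]])                       # main-category embeddings
S2 = np.array([[1, 0], [1, 0], [0, 1]])           # each subcategory has exactly one parent
B2 = S2 @ B3 + 0.05 * np.random.default_rng(1).standard_normal((3, R))
S1 = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0],
               [0, 1, 0], [0, 0, 1], [0, 0, 1]])  # each item has exactly one subcategory
B1 = S1 @ B2 + 0.05 * np.random.default_rng(2).standard_normal((6, R))

print(tree_penalty([B1, B2, B3], [S1, S2]))       # small when children stay close to their parents
```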
Note that if the tree structure is known a priori, it can be seamlessly incorporated using our formulation. A direct way is to fix $\{\mathbf{S}_q\}_{q=1}^{Q-1}$ to the known tree and solve (6) w.r.t. the rest of the variables. This way we learn the embeddings of the categories and penalize the distance between items and their corresponding categories. When the tree is partially known, one can fix the known parts and learn the unknowns. Another way to integrate a known tree is to penalize the difference between the embeddings of items that share similar paths to the root nodes. The latter method does not require us to learn $\{\mathbf{B}_q\}_{q=2}^{Q}$. In this work, we focus on the more challenging scenario where the tree structure is unknown and must be learned from the data.

eTREE has the following advantages: i) it incorporates the tree structure to improve the quality of the embeddings; ii) unlike most methods in the literature, it assumes the tree is unknown and learns it through the solution of $\{\mathbf{S}_q\}_{q=1}^{Q-1}$ in an unsupervised fashion; and iii) it provides the embeddings of the parent nodes (categories) in addition to the item embeddings. The tree clustering can be useful in broader applications such as classification and data labeling. The embeddings of categories provide extra information for some applications, e.g., web personalization and category-based recommendation (He, Li, and Liao 2017).

### eTREE: Theoretical Identifiability

Identifiability in machine learning problems that require parameter estimation is essential for guaranteeing sensible results, especially in applications where model identifiability is entangled with interpretability, such as topic modeling (Arora et al. 2013), image processing (Lee and Seung 1999), and social network clustering (Mao, Sarkar, and Chakrabarti 2017). Nevertheless, the majority of MF-based methods used in practice do not have known identifiability guarantees. In the following theorem, we establish the identifiability of eTREE for the case where $\mathbf{X}$ is fully observed.

Theorem 2. Assume that a data matrix follows $\mathbf{X} = \mathbf{A}\mathbf{B}_1^T$, where $\mathbf{A}\in\mathbb{R}^{N\times R}$ and $\mathbf{B}_1\in\mathbb{R}^{M_1\times R}$ are the ground-truth factors, and assume that $\mathbf{B}_1 = \mathbf{S}_1\mathbf{S}_2\cdots\mathbf{S}_{Q-1}\mathbf{B}_Q$, where $\mathbf{S}_q\in\{0,1\}^{M_q\times M_{q+1}}$ and $\|\mathbf{S}_q(i,:)\|_0 = 1,\ \forall i\in[M_q],\ q\in[Q-1]$. Let $\mathbf{S} = \mathbf{S}_1\mathbf{S}_2\cdots\mathbf{S}_{Q-1}$; then $\mathbf{B}_1 = \mathbf{S}\mathbf{B}_Q$. Also, assume that $\operatorname{rank}(\mathbf{X}) = \operatorname{rank}(\mathbf{A}) = R$ and, without loss of generality, $M_Q \geq R$. If $\mathbf{A}$ and $\mathbf{S}$ are full-column rank, and the rows of $\mathbf{B}_Q$ are sufficiently scattered, then the rows of $\mathbf{B}_1$ are sufficiently scattered, and $\mathbf{A}$, $\mathbf{B}_1$, $\mathbf{B}_Q$, and $\mathbf{S}$ are essentially unique.

Proof Sketch: The factors $\mathbf{A}$ and $\mathbf{B}_Q$ in $\mathbf{M} = \mathbf{A}\mathbf{B}_Q^T$ are essentially unique by Theorem 1, since $\mathbf{A}$ is full-column rank and the rows of $\mathbf{B}_Q$ are sufficiently scattered (Definition 2). If $\mathbf{S}$ is full-column rank, then all the rows of $\mathbf{B}_Q$ appear in $\mathbf{B}_1 = \mathbf{S}\mathbf{B}_Q$. Thus, the rows of $\mathbf{B}_1$ are sufficiently scattered iff the rows of $\mathbf{B}_Q$ are sufficiently scattered. Now, $\mathbf{A}$ and $\mathbf{B}_1$ in $\mathbf{X} = \mathbf{A}\mathbf{B}_1^T$ are essentially unique by Theorem 1, since $\mathbf{A}$ is full-column rank and the rows of $\mathbf{B}_1$ are sufficiently scattered. Next, the factor $\mathbf{S}$ in $\mathbf{B}_1 = \mathbf{S}\mathbf{B}_Q$ is also essentially unique because the rows of $\mathbf{B}_1$ are rows of $\mathbf{B}_Q$, and every row of $\mathbf{B}_Q$ appears in $\mathbf{B}_1$; hence $\mathbf{S}$ can be determined from the correspondence (identifiability of $\mathbf{S}$ also follows as a very special instance of Theorem 1).

In plain words, in addition to the NMF identifiability conditions ($\mathbf{A}$ being full rank, and the rows of $\mathbf{B}_Q$ being sufficiently scattered), we only require $\mathbf{S}$ to be full-column rank, which means that every root node (main category) must have at least one leaf node; this is a natural condition in a tree. Interestingly, the rows of $\mathbf{B}_Q$ are likely to be sufficiently scattered, as they are the embeddings of the coarsest categories and are encouraged to be distant (think, e.g., of a 3-layer tree, where each row is the centroid of multiple subcategories, and each subcategory is the centroid of a set of items). We point out that there is an inherent column permutation ambiguity in $\{\mathbf{S}_q\}_{q=1}^{Q-1}$ in $\mathbf{S} = \mathbf{S}_1\mathbf{S}_2\cdots\mathbf{S}_{Q-1}$; however, this is immaterial in our context.
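The full-column-rank condition on $\mathbf{S}$ is easy to check in practice. The sketch below (with hypothetical assignment matrices) composes $\mathbf{S} = \mathbf{S}_1\mathbf{S}_2$ for a 3-layer tree and verifies that, for single-parent trees, the condition amounts to every root category having at least one leaf:

```python
import numpy as np

# Hypothetical per-layer assignment matrices for a 3-layer tree (Q = 3):
# 6 items -> 3 subcategories -> 2 main categories, one parent per child.
S1 = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0],
               [0, 1, 0], [0, 0, 1], [0, 0, 1]])
S2 = np.array([[1, 0], [1, 0], [0, 1]])

S = S1 @ S2                                    # S = S1 S2 ... S_{Q-1}: maps items to root categories
full_col_rank = np.linalg.matrix_rank(S) == S.shape[1]
# Equivalent, more interpretable check for single-parent trees: every root has at least one leaf.
every_root_used = np.all(S.sum(axis=0) >= 1)
print(full_col_rank, every_root_used)          # both True for this tree
```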
### eTREE: Model Engineering

In this section, we discuss some caveats that need to be addressed in the formulation (6) before moving to the algorithmic derivation. The first point is the scaling between the low-rank factors $\mathbf{A}$ and $\mathbf{B}_1$. The tree structure regularizer implicitly favors $\mathbf{B}_1$ having a small norm. On the other hand, the first term is not affected by the scaling of $\mathbf{B}_1$, as long as this scaling is compensated for in $\mathbf{A}$. This motivates introducing norm regularization on $\mathbf{A}$, i.e., $\|\mathbf{A}\|_F$. The second consideration concerns the tree structure term. It has been shown that the cosine similarity metric is superior to the Euclidean distance in clustering (Strehl, Ghosh, and Mooney 2000) and latent clustering (Yang, Fu, and Sidiropoulos 2016) in many applications. We also observed that constraining the rows of $\{\mathbf{B}_q\}_{q=1}^{Q-1}$ to lie on the unit $\ell_2$-norm ball, which is equivalent to using cosine similarity in clustering, gives better performance. Taking these points into account, we obtain the following formulation:

$$\begin{aligned}
\min_{\mathcal{Y}}\quad & F_d(\mathbf{A},\mathbf{B}_1,\mathbf{D}) + \frac{\mu}{2}\sum_{q=1}^{Q-1}\|\mathbf{B}_q - \mathbf{S}_q\mathbf{B}_{q+1}\|_F^2 + \frac{\lambda}{2}\|\mathbf{A}\|_F^2 \\
\text{s.t.}\quad & \|\mathbf{B}_q(i,:)\|_2 = 1,\quad \forall i\in[M_q],\ q\in[Q-1] \\
& \mathbf{S}_q\in\{0,1\}^{M_q\times M_{q+1}},\quad q\in[Q-1] \\
& \|\mathbf{S}_q(i,:)\|_0 = 1,\quad \forall i\in[M_q],\ q\in[Q-1] \\
& \mathbf{D} = \operatorname{Diag}(d_1,\ldots,d_{M_1}) \\
& \mathbf{A}\geq 0,\quad \mathbf{B}_1\geq 0
\end{aligned} \qquad (7)$$

where $\mathcal{Y} := \{\mathbf{A}, \mathbf{D}, \{\mathbf{B}_q\}_{q=1}^{Q}, \{\mathbf{S}_q\}_{q=1}^{Q-1}\}$ is the set of all variables, $\lambda\geq 0$, $F_d := \frac{1}{2}\|\mathbf{W}\odot(\mathbf{X} - \mathbf{A}\mathbf{B}_1^T\mathbf{D})\|_F^2$, and $\mathbf{D}$ is a diagonal matrix introduced to allow us to fix the rows of $\mathbf{B}_1$ onto the unit $\ell_2$-norm ball without loss of generality of the factorization model.

### eTREE: Algorithm

The optimization problem in (7) is NP-hard (it contains both NMF and k-means as special cases, and both are known to be NP-hard). We therefore present a carefully designed alternating optimization (AO) algorithm. The proposed algorithm leverages ADMM (reviewed in the background section above) and utilizes parallel computing, computation caching, and warm start to provide a scalable implementation. The high-level algorithmic strategy is to employ AO to update $\mathbf{A}$, $\mathbf{D}$, $\{\mathbf{B}_q\}_{q=1}^{Q}$, and $\{\mathbf{S}_q\}_{q=1}^{Q-1}$ one at a time, while fixing the others. The resulting sub-problems w.r.t. a single variable can be solved optimally. We propose a variable-splitting strategy by introducing slack variables $\{\mathbf{Z}_q\in\mathbb{R}^{M_q\times R}\}_{q=1}^{Q-1}$ to handle the unit $\ell_2$-norm ball constraints on $\{\mathbf{B}_q\in\mathbb{R}^{M_q\times R}\}_{q=1}^{Q-1}$ in (7). Specifically, we consider the following optimization surrogate:

$$\begin{aligned}
\min_{\mathcal{H}}\quad & F_d(\mathbf{A},\mathbf{B}_1,\mathbf{D}) + \frac{\mu}{2}\sum_{q=1}^{Q-1}\|\mathbf{B}_q - \mathbf{S}_q\mathbf{B}_{q+1}\|_F^2 + \frac{\eta}{2}\sum_{q=1}^{Q-1}\|\mathbf{B}_q - \mathbf{Z}_q\|_F^2 + \frac{\lambda}{2}\|\mathbf{A}\|_F^2 \\
\text{s.t.}\quad & \|\mathbf{Z}_q(i,:)\|_2 = 1,\quad \forall i\in[M_q],\ q\in[Q-1] \\
& \mathbf{S}_q\in\{0,1\}^{M_q\times M_{q+1}},\quad q\in[Q-1] \\
& \|\mathbf{S}_q(i,:)\|_0 = 1,\quad \forall i\in[M_q],\ q\in[Q-1] \\
& \mathbf{D} = \operatorname{Diag}(d_1,\ldots,d_{M_1}) \\
& \mathbf{A}\geq 0,\quad \mathbf{B}_1\geq 0
\end{aligned} \qquad (8)$$

where $\mathcal{H} := \{\mathbf{A}, \mathbf{D}, \{\mathbf{B}_q\}_{q=1}^{Q}, \{\mathbf{Z}_q\}_{q=1}^{Q-1}, \{\mathbf{S}_q\}_{q=1}^{Q-1}\}$ is the set of all the variables, and $\eta\geq 0$. Note that when $\eta = +\infty$, (8) is equivalent to (7). In practice, we choose a large $\eta$ to enforce $\mathbf{B}_q \approx \mathbf{Z}_q$ (we set $\eta = 1000$ in all experiments).

We handle problem (8) as follows. First, we update $\mathbf{A}$ by solving the following nonnegative least-squares problem

$$\min_{\mathbf{A}\geq 0}\ \frac{1}{2}\left\|\mathbf{W}\odot(\mathbf{X} - \mathbf{A}\mathbf{B}_1^T\mathbf{D})\right\|_F^2 + \frac{\lambda}{2}\|\mathbf{A}\|_F^2 \qquad (9)$$

using ADMM. Due to space limitations, we use the update of $\mathbf{A}$ as a working example for the two updates that use ADMM ($\mathbf{A}$ and $\mathbf{B}_1$). Problem (9) can be reformulated by introducing an auxiliary variable $\tilde{\mathbf{A}}$:

$$\min\ \frac{1}{2}\left\|\mathbf{W}^T\odot(\mathbf{X}^T - \tilde{\mathbf{B}}\tilde{\mathbf{A}})\right\|_F^2 + \frac{\lambda}{2}\|\tilde{\mathbf{A}}\|_F^2 + r(\mathbf{A}) \quad \text{s.t.}\quad \mathbf{A} = \tilde{\mathbf{A}}^T \qquad (10)$$

where $\tilde{\mathbf{B}} = \mathbf{D}\mathbf{B}_1$, and $r(\cdot)$ is the indicator function of the nonnegative orthant.
Next, we derive the ADMM updates:

$$\tilde{\mathbf{A}}(:,i) \leftarrow \left(\tilde{\mathbf{B}}(\mathcal{J}_i,:)^T\tilde{\mathbf{B}}(\mathcal{J}_i,:) + \bar{c}\,\mathbf{I}_R\right)^{-1}\left(\tilde{\mathbf{B}}(\mathcal{J}_i,:)^T\mathbf{X}(i,\mathcal{J}_i)^T + \rho\left(\mathbf{A}(i,:) + \mathbf{U}(i,:)\right)^T\right) \qquad (11a)$$
$$\mathbf{A}(i,:) \leftarrow \left[\tilde{\mathbf{A}}(:,i)^T - \mathbf{U}(i,:)\right]_+ \qquad (11b)$$
$$\mathbf{U}(i,:) \leftarrow \mathbf{U}(i,:) + \mathbf{A}(i,:) - \tilde{\mathbf{A}}(:,i)^T \qquad (11c)$$

where $\bar{c} := \lambda + \rho$, $[\cdot]_+$ is the projection onto the nonnegative orthant obtained by zeroing the negative entries, and $\mathcal{J}_i$ is the set of items that have observations for the $i$th individual. We use the adaptive $\rho = \|\tilde{\mathbf{B}}\|_F^2/(NR)$, which is a scaled version of the $\rho$ suggested in (Huang, Sidiropoulos, and Liavas 2016). The ADMM steps in (11) are performed until a termination criterion is met. We adopt the criterion in (Boyd, Parikh, and Chu 2011; Huang, Sidiropoulos, and Liavas 2016), namely the primal and dual residuals

$$p_i = \|\mathbf{A}(i,:) - \tilde{\mathbf{A}}(:,i)^T\|_F^2/\|\mathbf{A}(i,:)\|_F^2; \qquad d_i = \|\mathbf{A}(i,:) - \mathbf{A}_0(i,:)\|_F^2/\|\mathbf{U}(i,:)\|_F^2 \qquad (12)$$

where $\mathbf{A}_0$ is $\mathbf{A}$ from the previous iteration. We iterate the ADMM updates until $p$ and $d$ are smaller than a predefined threshold, or until we reach the maximum number of iterations $K$ (we set $K = 5$ in our experiments).

Scalability Considerations: There are some important observations regarding the implementation of the ADMM updates in (11). First, we do not compute the matrix inversion in (11a) explicitly. Instead, the Cholesky decomposition of the Gram matrix $\mathbf{G}_i := \tilde{\mathbf{B}}(\mathcal{J}_i,:)^T\tilde{\mathbf{B}}(\mathcal{J}_i,:) + \bar{c}\,\mathbf{I}_R$ is computed, i.e., $\mathbf{G}_i = \mathbf{L}_i\mathbf{L}_i^T$, where $\mathbf{L}_i$ is a lower triangular matrix. Then, at each ADMM iteration, we only need to perform a forward and a backward substitution to get the solution for $\tilde{\mathbf{A}}(:,i)$. Thus, the step in (11a) is replaced with:

$$\mathbf{G}_i \leftarrow \tilde{\mathbf{B}}(\mathcal{J}_i,:)^T\tilde{\mathbf{B}}(\mathcal{J}_i,:) + \bar{c}\,\mathbf{I}_R; \qquad \mathbf{L}_i \leftarrow \operatorname{Cholesky}(\mathbf{G}_i) \qquad (13a)$$
$$\mathbf{F}_i \leftarrow \tilde{\mathbf{B}}(\mathcal{J}_i,:)^T\mathbf{X}(i,\mathcal{J}_i)^T \qquad (13b)$$
$$\tilde{\mathbf{A}}(:,i) \leftarrow \mathbf{L}_i^{-T}\mathbf{L}_i^{-1}\left(\mathbf{F}_i + \rho\left(\mathbf{A}(i,:) + \mathbf{U}(i,:)\right)^T\right) \qquad (13c)$$

Computing the Cholesky decomposition requires $O(R^3)$ flops, and the forward and backward substitution steps cost $O(NR^2)$. The matrix multiplications in $\tilde{\mathbf{B}}(\mathcal{J}_i,:)^T\tilde{\mathbf{B}}(\mathcal{J}_i,:)$ and in computing $\mathbf{F}_i$ in (13b) take $O(|\mathcal{J}_i|R^2)$ and $O(|\mathcal{J}_i|R)$, respectively, where $|\mathcal{J}_i|$ is the cardinality of the set $\mathcal{J}_i$. An important implication is that $\mathbf{L}_i$ and $\mathbf{F}_i$ do not change throughout the ADMM iterations, and thus can be cached to save computation. The overall complexity of updating $\mathbf{A}$ is $O(NR^2)$. Moreover, the ADMM updates enjoy row separability, allowing parallel computation. In the case where $\mathbf{X}$ is fully observed, $\mathbf{G} := \tilde{\mathbf{B}}^T\tilde{\mathbf{B}} + \bar{c}\,\mathbf{I}_R$ and $\mathbf{F} := \tilde{\mathbf{B}}^T\mathbf{X}^T$ are shared not only across the ADMM iterations, but also among the $N$ parallel sub-problems corresponding to the rows of $\mathbf{A}$. Finally, the outer AO routine naturally provides a good initial point (warm start) for the inner ADMM iterations (for both $\mathbf{A}$ and its dual variable $\mathbf{U}$), resulting in faster convergence.

Next, we update $\mathbf{B}_1$ using ADMM in the same fashion as $\mathbf{A}$. The updates of $\mathbf{Z}_1$ and $\mathbf{D}$ admit the closed-form solutions

$$\mathbf{Z}_1(j,:) = \mathbf{B}_1(j,:)/\|\mathbf{B}_1(j,:)\|_2; \qquad d_j = \mathbf{h}_j^T\mathbf{X}(\mathcal{I}_j,j)/(\mathbf{h}_j^T\mathbf{h}_j) \qquad (14)$$

where $\mathbf{h}_j = \left(\mathbf{B}_1(j,:)\mathbf{A}(\mathcal{I}_j,:)^T\right)^T$, and $\mathcal{I}_j$ is the set of individuals that have observations for the $j$th item. In the next step, we perform a few inner iterations that alternate between updating the tree structure triplets $\{(\mathbf{S}_{q-1}, \mathbf{B}_q, \mathbf{Z}_q)\}_{q=2}^{Q-1}$ and $\mathbf{B}_Q$ in a cyclic fashion (we call this the tree loop); in the experiments we set the maximum number of tree iterations to $T = 5$.
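The following sketch illustrates the cached-Cholesky row update in (13) combined with the projection and dual steps in (11b)-(11c); it follows our reading of the equations above (it is not the released implementation), runs a fixed number of inner iterations, and omits the termination criterion (12) for brevity:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def update_A_row(x_i, B_tilde_Ji, a_i, u_i, lam, rho, n_admm=5):
    """One row of A via ADMM: the Gram factor L_i and F_i are computed
    once (13a)-(13b) and reused in every inner iteration (13c)."""
    R = B_tilde_Ji.shape[1]
    G_i = B_tilde_Ji.T @ B_tilde_Ji + (lam + rho) * np.eye(R)   # (13a)
    L_i = cho_factor(G_i, lower=True)                           # Cholesky factor, cached
    F_i = B_tilde_Ji.T @ x_i                                    # (13b), cached
    for _ in range(n_admm):
        a_tilde = cho_solve(L_i, F_i + rho * (a_i + u_i))       # (13c): two triangular solves
        a_i = np.maximum(a_tilde - u_i, 0.0)                    # (11b): project onto nonnegative orthant
        u_i = u_i + a_i - a_tilde                               # (11c): dual update
    return a_i, u_i

# Toy call: one individual with 4 observed items and R = 2.
rng = np.random.default_rng(0)
B_tilde_Ji = np.abs(rng.standard_normal((4, 2)))   # rows of B_tilde for the observed items J_i
x_i = np.abs(rng.standard_normal(4))               # observed entries X(i, J_i)
a_i, u_i = update_A_row(x_i, B_tilde_Ji, np.zeros(2), np.zeros(2), lam=0.1, rho=1.0)
print(a_i)
```

Because each row of $\mathbf{A}$ is updated independently, calls like the one above can run in parallel across the $N$ individuals, which is the row separability the text refers to.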
The updates w.r.t. $\{\mathbf{B}_q\}_{q=2}^{Q-1}$ are unconstrained least-squares problems. These problems are column separable with a common mixing matrix. Thus, the complexity can be reduced by computing one Cholesky decomposition. Then, at each iteration, the update of each column only requires a forward and a backward substitution, as follows:

$$\mathbf{H} \leftarrow \mu\,\mathbf{S}_{q-1}^T\mathbf{S}_{q-1} + \bar{v}\,\mathbf{I}_{M_q}; \qquad \mathbf{L} \leftarrow \operatorname{Cholesky}(\mathbf{H}) \qquad (15a)$$
$$\mathbf{B}_q(:,j) \leftarrow \mathbf{L}^{-T}\mathbf{L}^{-1}\left(\mu\,\mathbf{S}_{q-1}^T\mathbf{B}_{q-1}(:,j) + \mu\,\mathbf{S}_q\mathbf{B}_{q+1}(:,j) + \eta\,\mathbf{Z}_q(:,j)\right) \qquad (15b)$$

where $\bar{v} := \mu + \eta$. The updates of $\mathbf{B}_Q$ and the matrices $\{\mathbf{S}_q\}_{q=1}^{Q-1}$ are similar to solving for the centroids and the assignment variables in the k-means algorithm, respectively. Let $\mathcal{T}_{m_Q} = \{i\,|\,\mathbf{S}_{Q-1}(i,m_Q) = 1\}$; then each row of $\mathbf{B}_Q$ is

$$\mathbf{B}_Q(m_Q,:) \leftarrow \sum_{i\in\mathcal{T}_{m_Q}}\mathbf{B}_{Q-1}(i,:)\,/\,|\mathcal{T}_{m_Q}| \qquad (16)$$

and the $i$th row of the assignment matrices is updated using

$$\mathbf{S}_{q-1}(i,k) \leftarrow \begin{cases} 1, & k = \arg\min_{m_q}\|\mathbf{B}_{q-1}(i,:) - \mathbf{B}_q(m_q,:)\|_2 \\ 0, & \text{otherwise} \end{cases} \qquad (17)$$

The overall algorithm is summarized in Algorithm 1.

Algorithm 1: Algorithmic Solution to eTREE
- Initialize: $\mathbf{A}$, $\mathbf{B}_1$ by NMF; $\{\mathbf{B}_q\}_{q=2}^{Q}$ and $\{\mathbf{S}_q\}_{q=1}^{Q-1}$ randomly; $\mathbf{D}\leftarrow\mathbf{I}$; $\mathbf{U}\leftarrow\mathbf{0}$
- Repeat until convergence:
  - Compute $\mathbf{L}_i$ and $\mathbf{F}_i$, $\forall i$, using (13a) and (13b)
  - ADMM loop: while $p_i, d_i$ in (12) $> \epsilon$ and $k < K$, update $\tilde{\mathbf{A}}(:,i)$, $\mathbf{A}(i,:)$, and $\mathbf{U}(i,:)$, $\forall i$, using (13c), (11b), and (11c), respectively
  - Update $\mathbf{B}_1$ using an ADMM loop (similar to $\mathbf{A}$)
  - Update $\mathbf{D}$ and $\mathbf{Z}_1$ using (14)
  - Tree loop: while $t < T$:
    - For $q = 2,\ldots,Q-1$: compute $\mathbf{L}$ using (15a); update $\mathbf{B}_q(:,j)$, $\forall j$, using (15b); update $\mathbf{S}_{q-1}(i,:)$, $\forall i$, using (17); set $\mathbf{Z}_q(i,:) = \mathbf{B}_q(i,:)/\|\mathbf{B}_q(i,:)\|_2$, $\forall i$
    - Update $\mathbf{B}_Q(i,:)$, $\forall i$, using (16)

One nice property of the proposed algorithm is that all the updates are row separable and can be computed in a distributed fashion (with the exception of $\{\mathbf{B}_q\}_{q=2}^{Q-1}$, which are column separable).

## Experiments

In this section, we evaluate the proposed framework on real data from various application domains: healthcare analytics, movie recommendation, and education. (Code is available at: https://github.com/FaisalAlmutairi/eTREE.) This section aims to answer the following questions:
Q1. Accuracy: Does eTREE improve the quality of embeddings for the downstream tasks?
Q2. Interpretability: How meaningful is the tree structure learned by eTREE from an application domain knowledge viewpoint?

Datasets: We evaluate eTREE and the competing baselines on the following real datasets: (i) Med-HF: These data are provided by IQVIA Inc. and include the counts of medical services performed on patients with heart failure (HF) conditions, including patients with preserved ejection fraction (pEF) and reduced ejection fraction (rEF). We include the 5,000 patients with the most records in our experiments. The total number of medical services is 411. The majority of the counts fall in the range of small numbers, with a small percentage of larger numbers, resulting in a long tail in the histogram of the data. To circumvent this, we apply a logarithmic transform on X + 1 (we add the 1 to stay slightly above zero, as we have nonnegativity constraints). The resulting range is log(2) to 7.79, and the sparsity of the data matrix is 78.01%. (ii) Med-MCI: These data are also provided by IQVIA Inc. and are similar to Med-HF, but they include patients with mild cognitive impairment (MCI) conditions. Similarly, we include 5,000 patients, and the total number of medical services is 412. We also apply a logarithmic transform to the data. The final range of the data is log(2) to 6.98, with 77.76% sparsity. (iii) MovieLens: MovieLens (Harper and Konstan 2015) is a movie rating dataset and a popular baseline in the recommender systems literature.
It contains $10^5$ ratings. The data only include users with at least 20 ratings. We also filter out movies with fewer than 20 ratings. The total number of users is 943 and the total number of movies is 1,152. The rating range is 1 to 5, with 0.5 increments. The sparsity of this dataset is 90.98%. (iv) College Grades: These data contain the grades of students from the College of Science and Engineering at the University of Minnesota, spanning from Fall 2002 to Spring 2013. The total number of students is 5,703, and the number of courses is 837. The grades take 11 discrete values (0, and 1 to 4 with increments of 0.33), and the sparsity of the data matrix is 96.28%.

Baselines: We compare against plain NMF and the following state-of-the-art methods from the literature: (i) NMF: nonnegative matrix factorization (1) regularized with $(\|\mathbf{A}\|_F^2 + \|\mathbf{B}\|_F^2)$ and implemented using ADMM (Huang, Sidiropoulos, and Liavas 2016). (ii) BMF: matrix factorization with rank-1 factors specified to capture item and individual biases (Paterek 2007; Koren 2008), implemented using stochastic gradient descent (SGD). This is a well-known approach in recommender systems and is considered a state-of-the-art method in student grade prediction (Almutairi, Sidiropoulos, and Karypis 2017). (iii) AdaError: a collaborative filtering model based on matrix factorization with a learning rate that adaptively adjusts based on the prediction error (Li et al. 2018). AdaError is reported to achieve state-of-the-art results on MovieLens (Rendle, Zhang, and Koren 2019). (iv) HSR: a hierarchical structure recommender system model that captures the tree structure in items (users) via factorizing the item (user) embedding matrix into a product of matrices, i.e., $\mathbf{X} = \mathbf{A}_1\mathbf{A}_2\cdots\mathbf{A}_P(\mathbf{B}_1\mathbf{B}_2\cdots\mathbf{B}_Q)^T$ (Wang et al. 2018, 2015). $\mathbf{B}_Q\in\mathbb{R}^{M_Q\times R}$ can be interpreted as the embedding of the coarsest item categories, whereas the matrices $\{\mathbf{B}_q\in\mathbb{R}^{M_q\times M_{q+1}}\}_{q=1}^{Q-1}$ indicate the affiliation of the $M_q$ subcategories (or items) with the $M_{q+1}$ coarser categories. Note that an item can belong to all the subcategories with different scales, since no constraints are imposed on the $\mathbf{B}_q$ matrices (except for nonnegativity). The same analysis also applies to the user embeddings. We are unaware of other algorithms that incorporate the tree structure while learning the embeddings simultaneously. We used the Matlab code sample provided by the authors for a 3-layer tree and generalized it to handle $Q$ layers. (v) NMF+KM: a simple two-stage procedure where we first apply NMF, then obtain the embeddings of the root nodes $\mathbf{B}_Q$ and the product of the assignment matrices $\mathbf{S} = \mathbf{S}_1\cdots\mathbf{S}_{Q-1}$ via the k-means centroids and assignment variables, respectively. We include NMF+KM to demonstrate the advantage of learning the embeddings and the tree structure simultaneously.

Table 1: Matrix completion errors (RMSE / MAE). eTREE significantly improves the prediction accuracy.

| Data | BMF | AdaError | HSR | NMF | eTREE | NMF+KM |
|---|---|---|---|---|---|---|
| Med-HF | 0.9875 / 0.7721 | 0.9147 / 0.6858 | 0.9287 / 0.7094 | *0.9031* / **0.6788** | **0.8873** / *0.6808* | 1.0797 / 0.8094 |
| Med-MCI | 0.8034 / 0.6232 | 0.7468 / 0.5680 | 0.7807 / 0.5990 | *0.7445* / *0.5612* | **0.7317** / **0.5611** | 0.8781 / 0.6578 |
| MovieLens | 0.9300 / 0.7312 | *0.9123* / *0.7165* | 0.9216 / 0.7226 | 0.9286 / 0.7286 | **0.9106** / **0.7136** | 1.0182 / 0.8250 |
| College Grades | 0.5765 / 0.4254 | 0.5777 / *0.4206* | 0.5844 / 0.4339 | *0.5755* / 0.4229 | **0.5601** / **0.4126** | 0.5991 / 0.4476 |

Q1. Accuracy of Embeddings: The quality of embeddings can be evaluated by testing their performance on a particular task, e.g., classification or regression.
Here we take a more generic approach and evaluate the embedding quality on matrix completion. The philosophy is: if the embeddings predict missing data with high accuracy, then they must be good representations of items and individuals. We split each dataset into 5 equal folds. After training the models on 4 folds (80% of the data), we test the trained models on the held-out fold. The hyper-parameters of all methods are chosen via cross validation (on 10% of the training data). Due to random initialization, the results can differ across runs. Thus, after choosing the hyper-parameters, we run the training and testing on each fold 20 times and report the average error over the total of 100 experiments. Table 1 shows the root mean square error (RMSE) and mean absolute error (MAE) of all methods on the different datasets. We highlight the smallest error in bold and italicize the second smallest. eTREE significantly improves over the best baseline on all datasets. Note that MovieLens and the grade dataset are challenging, and an improvement in the second decimal digit is considered significant in the literature. By comparing NMF and eTREE, we can conclude that the tree prior enhances the accuracy. Moreover, we can see the clear advantage of simultaneously learning the embeddings and the tree clusters when we compare eTREE with NMF+KM. We point out that the HSR baseline works better on MovieLens than on the medical datasets. This is likely because a movie usually belongs to a mix of genres, which suits the complete-tree assumption in HSR. Nevertheless, the proposed tree formulation in eTREE provides better accuracy.

Q2. Interpretability of Learned Trees: We ran a 3-layer eTREE on Med-MCI with the following parameters: $R = 9$, $\lambda = 1$, $\mu = 50$ (more emphasis on the tree term), $M_2 = 27$ (number of subcategories), and $M_3 = 9$ (number of main categories). A sample of the 412 medical services is shown in Fig. 3(a), where dx stands for a diagnosis, rx for a prescription, and px for a procedure. Note that eTREE assigns each medical service to one subcategory, and each subcategory to a main category. Services with the same color in Fig. 3(a) belong to the same subcategory, whereas services with similar colors (e.g., light and dark blue) belong to the same main category but to different subcategories. These unsupervised tree clusters were then shown to medical professionals. The domain experts were able to find coherence in the groups, and they labeled both the main categories and their subcategories. The tables in Fig. 3(b) show the names of the main categories and their subcategories as labeled by the medical professionals; we show the top 6 coherent categories. Similar interpretability was observed on the Med-HF data, but is not shown due to space limitations.

Figure 3: eTREE provides meaningful clusters. (a) Sample of the 412 medical services in Med-MCI. (b) A tree learned by eTREE and labeled by domain experts.

## Conclusion

In this paper, we proposed eTREE, a framework that incorporates the tree structure while learning the embeddings of a data matrix. eTREE not only exploits the tree structure, but also learns the hierarchical clustering in an unsupervised fashion. We leveraged the special properties of NMF to prove the uniqueness of the proposed model. We employed ADMM, parallel computing, and computation caching to derive a lightweight algorithm with a scalable implementation for eTREE. We showed the effectiveness and interpretability of eTREE on real data.
## Acknowledgements

The work of N. D. Sidiropoulos was partially supported by ARO W911NF19-1-0407.

## References

Almutairi, F. M.; Sidiropoulos, N. D.; and Karypis, G. 2017. Context-aware recommendation-based learning analytics using tensor and coupled matrix factorization. IEEE Journal of Selected Topics in Signal Processing 11(5): 729–741.

Arora, S.; Ge, R.; Halpern, Y.; Mimno, D.; Moitra, A.; Sontag, D.; Wu, Y.; and Zhu, M. 2013. A practical algorithm for topic modeling with provable guarantees. In International Conference on Machine Learning, 280–288.

Boyd, S.; Parikh, N.; and Chu, E. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Now Publishers Inc.

Fan, J.; and Cheng, J. 2018. Matrix completion by deep matrix factorization. Neural Networks 98: 34–41.

Fu, X.; Huang, K.; and Sidiropoulos, N. D. 2018. On identifiability of nonnegative matrix factorization. IEEE Signal Processing Letters 25(3): 328–332.

Fu, X.; Ma, W.-K.; Huang, K.; and Sidiropoulos, N. D. 2015. Blind separation of quasi-stationary sources: Exploiting convex geometry in covariance domain. IEEE Transactions on Signal Processing 63(9): 2306–2320.

Harper, F. M.; and Konstan, J. A. 2015. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5(4): 1–19.

He, J.; Li, X.; and Liao, L. 2017. Category-aware next point-of-interest recommendation via listwise Bayesian personalized ranking. In IJCAI, volume 17, 1837–1843.

Hoyer, P. O. 2004. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research 5(Nov): 1457–1469.

Huang, K.; Sidiropoulos, N. D.; and Liavas, A. P. 2016. A flexible and efficient algorithmic framework for constrained matrix and tensor factorization. IEEE Transactions on Signal Processing 64(19): 5052–5065.

Koren, Y. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 426–434.

Lee, D. D.; and Seung, H. S. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401(6755): 788–791.

Li, D.; Chen, C.; Lv, Q.; Gu, H.; Lu, T.; Shang, L.; Gu, N.; and Chu, S. M. 2018. AdaError: An adaptive learning rate method for matrix approximation-based collaborative filtering. In Proceedings of the 2018 World Wide Web Conference, 741–751.

Li, H.; Liu, Y.; Qian, Y.; Mamoulis, N.; Tu, W.; and Cheung, D. W. 2019. HHMF: hidden hierarchical matrix factorization for recommender systems. Data Mining and Knowledge Discovery 33(6): 1548–1582.

Lin, C.-H.; Ma, W.-K.; Li, W.-C.; Chi, C.-Y.; and Ambikapathi, A. 2015. Identifiability of the simplex volume minimization criterion for blind hyperspectral unmixing: The no-pure-pixel case. IEEE Transactions on Geoscience and Remote Sensing 53(10): 5530–5546.

Liu, M.; Luo, Y.; Tao, D.; Xu, C.; and Wen, Y. 2015. Low-rank multi-view learning in matrix completion for multi-label image classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2778–2784.

Maleszka, M.; Mianowska, B.; and Nguyen, N. T. 2013. A method for collaborative recommendation using knowledge integration tools and hierarchical structure of user profiles. Knowledge-Based Systems 47: 1–13.

Mao, X.; Sarkar, P.; and Chakrabarti, D. 2017. On mixed memberships and symmetric nonnegative matrix factorizations. In International Conference on Machine Learning, 2324–2333.
Menon, A. K.; Chitrapura, K.-P.; Garg, S.; Agarwal, D.; and Kota, N. 2011. Response prediction using collaborative filtering with hierarchies and side-information. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 141–149.

Nikolakopoulos, A. N.; Kouneli, M. A.; and Garofalakis, J. D. 2015. Hierarchical Itemspace Rank: Exploiting hierarchy to alleviate sparsity in ranking-based recommendation. Neurocomputing 163: 126–136.

Paterek, A. 2007. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop, volume 2007, 5–8.

Rendle, S.; Zhang, L.; and Koren, Y. 2019. On the difficulty of evaluating baselines: A study on recommender systems. arXiv preprint arXiv:1905.01395.

Shen, Y.; Tan, S.; Sordoni, A.; and Courville, A. 2018. Ordered neurons: Integrating tree structures into recurrent neural networks. In International Conference on Learning Representations.

Strehl, A.; Ghosh, J.; and Mooney, R. 2000. Impact of similarity measures on web-page clustering. In Workshop on Artificial Intelligence for Web Search (AAAI 2000), volume 58, 64.

Sun, Z.; Yang, J.; Zhang, J.; and Bozzon, A. 2017. Exploiting both vertical and horizontal dimensions of feature hierarchy for effective recommendation. In Thirty-First AAAI Conference on Artificial Intelligence.

Trigeorgis, G.; Bousmalis, K.; Zafeiriou, S.; and Schuller, B. W. 2016. A deep matrix factorization method for learning attribute representations. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(3): 417–429.

Wang, S.; Tang, J.; Wang, Y.; and Liu, H. 2015. Exploring implicit hierarchical structures for recommender systems. In IJCAI, 1813–1819.

Wang, S.; Tang, J.; Wang, Y.; and Liu, H. 2018. Exploring hierarchical structures for recommender systems. IEEE Transactions on Knowledge and Data Engineering 30(6): 1022–1035.

Wang, X.; Pan, W.; and Xu, C. 2014. HGMF: Hierarchical group matrix factorization for collaborative recommendation. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, 769–778.

Wang, Y.; Wu, T.; Wang, Y.; and Wang, G. 2019. Enhancing model interpretability and accuracy for disease progression prediction via phenotype-based patient similarity learning. In Pacific Symposium on Biocomputing. World Scientific.

Yang, B.; Fu, X.; and Sidiropoulos, N. D. 2016. Learning from hidden traits: Joint factor analysis and latent clustering. IEEE Transactions on Signal Processing 65(1): 256–269.

Yang, J.; Sun, Z.; Bozzon, A.; and Zhang, J. 2016. Learning hierarchical feature influence for recommendation by recursive regularization. In Proceedings of the 10th ACM Conference on Recommender Systems, 51–58.

Zhong, E.; Fan, W.; and Yang, Q. 2012. Contextual collaborative filtering via hierarchical matrix factorization. In Proceedings of the 2012 SIAM International Conference on Data Mining, 744–755. SIAM.