# constraintbased_sequential_pattern_mining_with_decision_diagrams__02a18edc.pdf The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19) Constraint-Based Sequential Pattern Mining with Decision Diagrams Amin Hosseininasab,1 Willem-Jan van Hoeve,1 Andre A. Cire2 1Tepper School of Business, Carnegie Mellon University, USA 2Dept. of Management, University of Toronto Scarborough, Canada aminh@andrew.cmu.edu, vanhoeve@andrew.cmu.edu, andre.cire@rotman.utoronto.ca Constraint-based sequential pattern mining aims at identifying frequent patterns on a sequential database of items while observing constraints defined over the item attributes. We introduce novel techniques for constraint-based sequential pattern mining that rely on a multi-valued decision diagram (MDD) representation of the database. Specifically, our representation can accommodate multiple item attributes and various constraint types, including a number of non-monotone constraints. To evaluate the applicability of our approach, we develop an MDD-based prefix-projection algorithm and compare its performance against a typical generate-and-check variant, as well as a state-of-the-art constraint-based sequential pattern mining algorithm. Results show that our approach is competitive with or superior to these other methods in terms of scalability and efficiency. Introduction Sequential Pattern Mining (SPM) is a fundamental data mining task with a large array of applications in marketing, health care, finance, and bioinformatics, to name a few. Frequent patterns are used, e.g., to extract knowledge from data within decision support tools, to develop novel association rules, and to design more effective recommender systems. We refer the reader to (Fournier-Viger et al. 2017) for a recent and thorough review of SPM and its applications. In practice, mining the entire set of frequent patterns in a database is not of interest, as the resulting number of items is typically large and may provide no significant insight to the user. It is hence desirable to restrict the mining algorithm search to smaller subsets of patterns that satisfy problem-specific constraints. For example, in online retail click-stream analysis, we may seek frequent browsing patterns from sessions where users spend at least a minimum amount of time on certain items that have specific price ranges. Such constraints limit the output of SPM and are much more effective in knowledge discovery, as compared to an arbitrary large set of frequent click-streams. A na ıve approach to impose constraints in SPM is to first collect all unconstrained frequent patterns, and then to apply This work was supported by ONR grant N00014-18-1-2129. Copyright c 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. a post-processing step to retain patterns that satisfy the desired constraints. This approach, however, may be expensive in terms of memory requirements and computational time, in particular when the resulting subset of constrained patterns is small in comparison to the full unconstrained set. Constraint-based sequential pattern mining (CSPM) aims at providing more efficient methods by embedding constraint reasoning within existing mining algorithms (Pei, Han, and Wang 2007; Negrevergne and Guns 2015). Nonetheless, while certain constraint types are relatively easy to incorporate in a mining algorithm, others of practical use are still challenging to handle in a general and effective way. This is particularly the case of non-monotone constraints representing, e.g., sums and averages of attributes. Contributions. In this paper, we propose a novel representation of sequential database using a multi-valued decision diagram (MDD), a graphical model that compactly encodes the sequence of items and their attributes by leveraging symmetry. The MDD representation can be augmented with constraint-specific information, so that constraint satisfaction is either guaranteed or enforced during the mining algorithm. Finally, as a proof of concept, we implement a general prefix-projection algorithm equipped with an MDD to enforce several constraint types, including complex constraints such as average ( avg ) and median ( md ). To the best of our knowledge, this paper is the first to consider the sum, avg, and md constraints with arbitrary item-attribute association within the pattern mining algorithm. Lastly, we provide an experimental comparison on real-world benchmark databases, and show that our approach is competitive with or superior to a state-of-the-art CSPM algorithm. Related work Research in CSPM has primarily focused on exploiting special properties of constraints, such as monotonicity or antimonotonicity, to guarantee the feasibility of pattern extensions in the mining algorithm (Garofalakis, Rastogi, and Shim 1999; Zaki 2000; Lin and Lee 2005; Bonchi and Lucchese 2005; Chen and Hu 2006; Pei, Han, and Wang 2007; Nijssen and Zimmermann 2014; Mallick, Garg, and Grover 2014; Aoga, Guns, and Schaus 2017). Constraint types that do not possess such properties remain a challenge for CSPM algorithms, although some of these have been successfully incorporated in more general item-set mining on databases Table 1: Example SD, with attributes of time and price. SID Sequence: {(item, time, price)} 1 (B, 1, 5), (B, 3, 3) 2 (B, 3, 3), (A, 8, 1), (B, 9, 3) 3 (C, 2, 1), (C, 5, 2), (A, 8, 3) where events have no specific order (Soulet and Cr emilleux 2005; Bistarelli and Bonchi 2007; Bonchi and Lucchese 2007; Le Bras, Lenca, and Lallich 2009; Leung et al. 2012), as well as in CSPM when items and attributes are interchangeable (Pei, Han, and Wang 2007). Recently, constraint programming (CP) has emerged as a successful tool for CSPM (Negrevergne and Guns 2015; Kemmar et al. 2016; Kemmar et al. 2017; Aoga, Guns, and Schaus 2017; Guns et al. 2017). CP search techniques, albeit general, can potentially be more efficient when compared to specialized CSPM algorithms. Nonetheless, they still rely on constraint-specific properties to effectively prune undesired patterns.For example, (Aoga, Guns, and Schaus 2017) show how to effectively implement a number of prefix anti-monotone constraints into CP, but indicate that postprocessing is still required to handle monotone constraints such as the minimum span. Graphical representations of a database have been shown to be effective in item-set mining (Han et al. 2004; Pyun, Yun, and Ryu 2014; Borah and Nath 2018) and SPM (Masseglia, Poncelet, and Teisseire 2009). Previous works have also applied binary decision diagrams as a database modeling tool (Loekito and Bailey 2006; Loekito and Bailey 2007; Loekito, Bailey, and Pei 2010; Cambazard, Hadzic, and O Sullivan 2010), which are effective when the sequences of the database are similar, but typically do not scale otherwise. We show that our MDD representation retains its size regardless of the similarity between sequences, and provides a more concise representation in the context of SPM. Problem definition We next formally describe the SPM problem and then discuss the handling of constraints. The SPM database and mining algorithm The SPM database consists of a set of events, which are modeled by a set of literals I denoted by items. Items i I are associated with a set of attributes A = { A1, ..., A|A| } ; for example, attributes can be price, quality, or time. A sequence database SD is defined as a collection of N item sequences {S1, S2, . . . , SN}, where all sequences are ordered with respect to the same attribute A A; e.g., occurrence in time. Table 1 illustrates an example SD with N := 3, |I| := 3, and M := max n {1,...,N} {|Sn|} = 3, where items i I are associated with time and price attributes. The SPM task asks for the set of frequent patterns within SD. A pattern P = i1, i2, . . . , i|P | is a subsequence of some S SD. Let S[j] denote the jth position (i.e., item) of sequence S. A subsequence relation P S holds if and only if there exists an embedding e : e1 e2 ... e|P | such that S[ej] = ij, ij P. For example, P = A, B is a subsequence of S = A, B, C, B with two possible embeddings (1, 2) or (1, 4). We define a super-sequence relation S P analogously, with replaced by . A pattern is frequent if it is a subsequence of at least θ number of sequences in SD, where θ is a given frequency threshold. The two best-known mining algorithms for SPM are the Apriori algorithm introduced by (Agrawal, Srikant, and others 1994), and the prefix-projection algorithm introduced by (Han et al. 2001). Both are iterative procedures and operate by extending frequent patterns one item at a time. In Apriori, candidate patterns are generated by expanding a pattern with all available items, and thereafter checking the frequency of generated candidates. As candidates may or may not be frequent, the Apriori algorithm suffers from the exponential explosion of the number of generated candidates and redundancy. The prefix-projection algorithm, in turn, operates by projecting each sequence S SD onto a smallest subsequence S = i1, i2, . . . , ij , denoted by prefix, and searching for frequent items in this reduced database. Any sequence that is obtained by extending a frequent prefix is guaranteed to be frequent in the original database. Prefixprojection is more efficient than the Apriori algorithm as it rules out infrequent patterns more effectively, but it requires the full database to be in memory (Han et al. 2001). Constraint satisfaction in CSPM A constraint Ctype( ) is a Boolean function imposed on either the patterns or their attributes. A pattern P satisfies a constraint if and only if Ctype(P) = true. The objective of CSPM is to find all frequent patterns that satisfy a set of user-defined constraints. In particular, the challenge of CSPM is to impose constraints during the mining algorithm, rather than post-processing mined patterns for constraint satisfaction. The standard framework for CSPM is to classify constraints as being monotone or anti-monotone, as such constraint are easy to handle within the mining algorithm (Pei, Han, and Wang 2007).1 A constraint is monotone if its violation by a sequence S implies that all subsequences S S also violate the constraint. A constraint is anti-monotone if its violation by a sequence S implies violation by all super-sequences ˆS S. Table 2 lists common constraint types with their characterization. The concepts of monotonicity, anti-monotonicity, and violation are analogously extended to prefixes. Constraints that are neither monotone nor anti-monotone are called non-monotone and are the most challenging to enforce during mining. While dedicated approaches have been developed for certain non-monotone constraints (Pei, Han, and Wang 2007), they are otherwise handled by postprocessing (Aoga, Guns, and Schaus 2017). Our goal is to develop a generic platform to handle non-monotone constraints effectively. 1A third classification is succinctness, which allows immediate pattern generation using a formula rather than an algorithm. Table 2: Characterization of constraints as monotone (M), anti-monotone (AM), or non-monotone (NM) for SPM. Name Constraint := definition M AM NM Maximal Cmxl(P) := P SD : P P Sup-Patt Cspt(P) := P SD : P P Length Clen(P) c := |P| c Clen(P) c Reg Expr Creg(P) := P[i] I I Gap Cgap(A) c := αj αj 1 c, αj A, 2 j |P| Cgap(A) c Span Cspn(A) c := max {A} min {A} c Cspn(A) c Max/Min Cmax(A) c, Cmin(A) c Cmax(A) c, Cmin(A) c Stats Csum(A), Cavg(A), Cvar(A), Cmed(A) Not anti-monotone, but prefix anti-monotone. An MDD representation for SD MDDs are widely applied as an efficient data structure in verification problems (Wegener 2000) and were more recently introduced as a tool for discrete optimization and constraint programming (Bergman et al. 2016). Here, we use an MDD to fully encode the sequences from SD; we refer to such data structure as an MDD database. We show how constraint satisfaction is achieved by storing constraint-specific information at the MDD nodes, thereby removing the need to impose constraint-specific rules in a mining algorithm. MDD construction for the SPM problem An MDD M = (U, A) is a layered directed acyclic graph, where U is the set of nodes, and A is the set of arcs. Set U is partitioned into layers (l0, l1, ..., lm+1), such that layers li : 1 i m correspond to position (item) i of a sequence S SD. Layers l0, and lm+1 consist of single nodes, namely the root node r l0, and the terminal node t lm+1. The root and terminal node are used to model the start and end of all sequences, respectively. Figure 1.a shows the MDD database model for the SD of Table 1. Layers lj, 1 j m, contain one node per item i I : S SD, S[j] = i, and model the possible items at position j of all sequences S SD. For example, layer 1 of the MDD database in Figure 1.a has two nodes corresponding to items B, C, and no node associated to item A. To distinguish which nodes are associated to which sequences S SD, we define labels du for nodes u U, and store the associated sequence index SID in du. The first label of node B at layer 1 of Figure 1.a, indicates that sequences 1 and 2 contain item B at their first position. In addition, we store the attribute labels associated with the item, one per SID at each node. For example, in Figure 1.a we store the time and price attributes. An arc a = (u, v) A, is directed from a node u lj to a node v lj : j > j, and represents the next possible item after node u, for all sequences in SD. Similar to nodes u U, labels da are defined for arcs a A and store their associated sequences. A sequence S is thus represented by a path from r to t, following the nodes and arcs associated to SID. As we will search the MDD for patterns during the 1,2 <1,3> <5,3> C ID