# delegationrelegation_for_boolean_matrix_factorization__8f64b300.pdf

Delegation-Relegation for Boolean Matrix Factorization

Florent Avellaneda1, 2, Roger Villemaire1

1 Universit e du Qu ebec a Montr eal (UQAM), Montr eal, Canada 2 Centre de Recherche de l Institut Universitaire de G eriatrie de Montr eal, Montr eal, Canada avellaneda.florent@uqam.ca, villemaire.roger@uqam.ca

The Boolean Matrix Factorization (BMF) problem aims to represent a n m Boolean matrix as the Boolean product of two matrices of small rank k, where the product is computed using Boolean algebra operations. However, finding a BMF of minimum rank is known to be NP-hard, posing challenges for heuristic algorithms and exact approaches in terms of rank found and computation time, particularly as matrix size or the number of entries equal to 1 grows. In this paper, we present a new approach to simplifying the matrix to be factorized by reducing the number of 1-entries, which allows to directly recover a Boolean factorization of the original matrix from its simplified version. We introduce two types of simplification: one that performs numerous simplifications without preserving the original rank and another that performs fewer simplifications but guarantees that an optimal BMF on the simplified matrix yields an optimal BMF on the original matrix. Furthermore, our experiments show that our approach outperforms existing exact BMF algorithms.

1 Introduction

The problem of Boolean Matrix Factorization (BMF) is to represent a n m matrix X as the Boolean product of an n k and k m matrices A B. The Boolean semiring operator, denoted by , represents a matrix multiplication in which the product is computed using the logical AND operator and the addition is computed using the logical OR operator. By searching for matrices A and B of small rank k, the BMF allows us to represent the initial matrix X in a concise way. BMF is widely used in various fields such as data mining (Wicker, Pfahringer, and Kramer 2012), role mining (Ene et al. 2008), bioinformatics (Liang, Zhu, and Lu 2020), logic synthesis (Ma, Hashemi, and Reda 2022; Hashemi, Tann, and Reda 2018), and network analysis (Kocayusufoglu, Hoang, and Singh 2018), and has been the subject of numerous research studies (Miettinen and Neumann 2020). In the context of BMF, two types of problems arise. The first problem is, given a matrix X, finding a BMF of minimal rank. The second problem is, given a matrix X and a

Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

rank k, finding a BMF that is as close as possible to the matrix X. In this paper, we only consider exact solutions in which no error is allowed. Exact solutions are a key concern in application areas such as security role mining (Ene et al. 2008) where a spurious authorization would jeopardize a security policy. Moreover, it is well-known that the set basis and biclique covering problems are equivalent to BMF (Miettinen and Neumann 2020). Therefore, this opens up additional domains that would benefit from exact BMF such as network nodes service that provide functions bundling (Markopoulou and E. Anagnostou 1998), encryption key management (Shu, Lee, and Yannakakis 2006), and frameproof codes (Hajiabolhassan and Moazami 2012). In this setting, our objective is to devise methods to find exact BMF with minimal or as close to minimal rank as possible. Since the problem of finding the BMF of minimal rank k is known to be NP-hard (Miettinen et al. 2008), most existing algorithms in the literature aim to minimize the rank k without guaranteeing optimality (Asso (Miettinen et al. 2008), Gre Con D (Belohlavek and Vychodil 2010), Hyper (Xiang et al. 2011), MEBF (Wan et al. 2020), Nassau (Karaev, Miettinen, and Vreeken 2015), Pa NDa+ (Lucchese, Orlando, and Perego 2010), Proximus (Koyut urk and Grama 2003), Tiling (Geerts, Goethals, and Mielik ainen 2004)). However, recent work based on constraint solvers proposed methods for finding optimal BMF (Kovacs, Gunluk, and Hauser 2021; Avellaneda and Villemaire 2022). While these approaches can find optimal factorizations for small matrices only, they can still find low-rank factorization for larger matrices by using relaxations and approximations. Therefore, these constraint solver-based methods are more computationally demanding than traditional approaches, but they generally produce smaller rank BMF. A natural question that arises is: is it possible to simplify the matrices before factorization in order to make the constraint solving algorithms run more efficiently? The idea of simplifying the matrices to be factorized before performing the factorization is not new and has been introduced in a particular family of BMF algorithms based on formal concept analysis (Ganter and Wille 2012). A formal concept corresponds to a pair of lines and columns (I, J) such that i I, j J, Xi,j = 1. Formal concepts are partially ordered and give rise to a lattice called the concept lattice. The BMF problem then coincides with the selection

The Thirty-Eighth AAAI Conference on Artiﬁcial Intelligence (AAAI-24)

of a set of formal concepts called concept factors. Although this representation does not allow missing values in the matrix to be factorized, it is used by an algorithm called Gre Ess (Belohlavek and Trnecka 2015) to locate socalled essential entries . The authors furthermore show that a BMF covering all the essential entries can be transformed into a BMF of equivalent rank covering all the entries of the initial matrix. Although the rank of this factorization is not optimal for the initial matrix, the gain related to the simplification of the matrix allows good execution time and factorization of low rank. This approach has been pushed to the extreme in Iteress (Belohlavek, Outrata, and Trnecka 2019) by applying this simplification process iteratively until a fixed point is reached. We revisit the concept of matrix simplification used in Gre Ess and Iteress, but from the perspective of row/column inclusion. In contrast to previous work, which only apply to complete 0/1 matrices, we propose a simplification method that incorporates missing values (don t care) and allows for more extensive simplification than Iteress. Furthermore, we introduce a second line/column inclusion notion that additionally guarantees that an optimal BMF on the simplified matrix yields an optimal BMF on the initial matrix. The BMF problem can be seen as the problem of covering the matrix s 1-entries by the so-called blocks. The rationale for our approach is that reducing the number of 1s in a matrix will simplify their covering and should generally lead to BMF speed-ups. In particular, with constraint-based BMF algorithms, where the presence of 1s in the matrix to be factorized generate constraints that are harder to satisfy than missing values or 0s, we show that our simplification approach can significantly reduce the computational time. The paper is structured as follows. First, to formalize the approach, we present definitions related to BMF and introduce the concept of matrix inclusion (Section 2). Then, in Section 3, we define a simplification operator, which we call delegation , and show how to obtain a BMF on the original matrix from a BMF on a simplified matrix using the relegation operator. Algorithms based on the delegation and relegation operator are then proposed in Section 4. Finally, we report the results of several experiments in Section 5.

2 Notation We denote by Xm n a Boolean matrix X {0, 1, }n m with m rows, and n columns. We use to represent missing data and if X {0, 1}n m we say that the matrix X is complete. We also use the notation Xi,j to represent the entry in the i-th row and the j-th column and the notation Xi,: and X:,j, to represent the i-th row and the j-th column of X, respectively. We now formally define the Boolean product of two matrices.

Definition 2.1. The Boolean product of two complete matrices Am k and Bk n is a matrix (A B)m n defined by:

ℓ=1 (Ai,ℓ Bℓ,j)

This definition is similar to the classical matrix product,

where the product is replaced by the logical AND and the addition is replaced by the logical OR . Note that the Boolean product of two rank-k matrices can also be seen as the union of k rank-1 matrices called blocks:

ℓ=1 (A:,ℓ Bℓ,:)

where the W operator applies entrywise. Therefore, the BMF problem corresponds to finding a set of blocks that cover all the 1s of the matrix and do not cover any 0s. The idea developed in this paper is that blocks that cover certain 1s can be extended to cover some other specific 1s. Therefore, the 1s that can be covered by extending these blocks can be ignored and removed from the matrix. To define these specific 1s, we introduce the concept of matrix inclusion. Definition 2.2. A matrix X m n is existentially included in a matrix Xm n (denoted X X) if there is no i, j such that X i,j = 1 and Xi,j = 0. Intuitively, this definition means that it is possible to fill in the missing values, represented by , in both matrices X and X such that the inequality X i,j Xi,j holds true for all indices i, j. Definition 2.3. A matrix X m n is universally included in a matrix Xm n (denoted X X) if for all i, j, if Xi,j = 0, then X i,j = 0. Intuitively, this definition means that regardless of how we fill in the missing values in X , we can always find a way to fill in the missing values in X such that X i,j Xi,j holds true for all indices i, j. Definition 2.4. A matrix X m n is consistent with a matrix Xm n (denoted X X) if X X and X X. Intuitively, this definition means that it is possible to fill in the missing values in both X and X such that for all i, j, we have X i,j = Xi,j. Definition 2.5. Two complete matrices Am k and Bk n are a BMF of the matrix Xm n if (A B) X.

3 Transformation Into Sparse Matrices In this section, we show how to simplify a matrix to be factorized using the delegation operator and demonstrate how to obtain a BMF on the original matrix from a BMF on a simplified matrix using the relegation operator. Definition 3.1. We denote by Xv w the delegation of the line w to the line v in the matrix X.

( 0 if i = v and Xw,j = 0, if i = w and Xv,j = 1, Xi,j otherwise.

Definition 3.2. We denote by Av w the relegation of the line w from the line v in the matrix A.

Av w i,j = 1 if i = w and Av,j = 1, Ai,j otherwise.

The Thirty-Eighth AAAI Conference on Artiﬁcial Intelligence (AAAI-24)

We now show that if a line v is existentially included in a line w, then we can obtain a BMF of the original matrix X from a BMF A B on the simplified matrix Xv w (Theorem 3.5). The intuition, illustrated in Figure 1, is as follows. A rankk BMF can be seen as the union of k rank-1 BMF, the blocks. Since each 1 present in the matrix to be factorized must be covered by at least one block, we can see that if a line v is included in a line w, then a block covering a 1 on v can also be extended to cover the line w. Thus, all columns where there is a 1 on the line v and the line w can be removed from the line w. However, to ensure that the line v is indeed included in the line w, we must replace every from the line v with zeros if the column j is such that Xw,j = 0.

Figure 1: Delegation Xv w in the case where a line v is included in a line w.

We now prove this result more formally, using two intermediate propositions.

Proposition 3.3. Let v, w be such that Xv,: Xw,: and Y be complete. If Y Xv w then Yv w X

Proof. Let Y Xv w. If Yv w i,j , Xi,j take different values in {0, 1} we either have that Yv w i,j = Yi,j or Xi,j = Xv w i,j .

If Yv w i,j = Yi,j by definition 3.2 we must have i = w, Yv w w,j = 1 and Yv,j = 1. Since Yv w w,j = Xw,j, it follows that Xw,j = 0 and by definition 3.1 Xv w v,j = 0, which contradicts Y Xv w. If Xi,j = Xv w i,j by definition 3.1 we either have that i = v with Xv w v,j = 0 and Xw,j = 0 or i = w, Xv w w,j = and Xv,j = 1. In the first case, it follows from Xw,j = 0 that Xv,j = 0 since this entry is in {0, 1} by hypothesis and Xv,: Xw,:. Now Yv w v,j = 1 since it differs from Xv,j. Furthermore, Yv,j is also 1 since in definition 3.2 line v is left unchanged. But now Xv w v,j = 0 contradicts Y Xv w.

In the final case, i = w, Xv w w,j = and Xv,j = 1, we have from Xv,: Xw,: that Xw,j = 1 and hence Yv w w,j = 0 since these two last values are different and in {0, 1}. Definition 3.1 yields, from Xw,j = 1, that Xv w v,j = Xv,j = 1. Now since Yv w w,j = 0 it follows from definition 3.2 that Yv,j = 1 and is hence 0 but then Yv,j = Xv w v,j contradicts Y Xv w.

Proposition 3.4. Let A and B be two matrices. Then:

(A B)v w = Av w B

Proof. By Definition 3.2, the relegation operator only modifies line w, it hence follows that for i = w, (A B)v w i,j = (A B)i,j and Av w i,ℓ = Ai,ℓ, hence (Av w B)i,j = (A B)i,j. Let us now consider the entries on lines i = w. By Definition 3.2 we have that (A B)v w w,j = 1 exactly when either (A B)w,j = 1 or (A B)v,j = 1. Similarly, (A)v w w,ℓ= 1 holds exactly when either Aw,ℓ= 1 or Av,ℓ= 1. Furthermore, from Definition 2.1 we have that (Av w B)w,j = 1 exactly when Av w w,ℓ= 1 and Bℓ,j = 1 for some ℓ. Combining both, (Av w B)w,j = 1 exactly when either Aw,ℓ= 1 and Bℓ,j = 1 or Av,ℓ= 1 and Bℓ,j = 1 holds, for some ℓ. This therefore holds exactly when either (A B)w,j = 1 or (A B)v,j = 1 and (A B)v w w,j = (Av w B)w,j as required.

Theorem 3.5. Let v, w be such that Xv,: Xw,:. If (A B) Xv w then (Av w B) X.

Proof. By Proposition 3.3, if (A B) Xv w then (A B)v w X. Then, by Proposition 3.4, (Av w B) X.

Now, let us show that if the line v is universally included in the line w, then finding an optimal BMF on the original matrix X is equivalent to finding an optimal BMF on the simplified matrix Xv w (Theorem 3.7). The intuition here is that if the line v is universally included in the line w, no of the line v will have to be changed to 0 during the delegation operation (see Figure 1). Thus, no assumptions about will be made, only reductions of 1s that we know can be covered by relegation. We prove this theorem using an intermediate lemma.

Lemma 3.6. Let v, w be such that Xv,: Xw,:. There exists a k-rank BMF on X if and only if there exists a krank BMF on Xv w.

Proof. If (A B) Xv w then, since Xv,: Xw,: implies Xv,: Xw,:, by Theorem 3.5, we know that (Av w B) X. Conversely, if (A B) X, let us show that (A B) Xv w. By Definition 2.3 Xv,: Xw,: means that for all j, Xv,j = 0 when Xw,j = 0. Thus, by applying Xv w, the only changes will be the replacement of some 1s by (Definition 3.1). Thereby, (A B) Xv w.

Theorem 3.7. Let v, w be such that Xv,: Xw,:. (Av w B) is an optimal BMF for X if and only if (A B) is an optimal BMF for Xv w.

Proof. Direct application of Lemma 3.6.

We now introduce the analogous notions for columns, and extend the previous properties to the case of column inclusion.

Definition 3.8. We denote Xv w the delegation of the column w to the column v in the matrix X.

The Thirty-Eighth AAAI Conference on Artiﬁcial Intelligence (AAAI-24)

( 0 if j = v and Xi,w = 0, if j = w and Xi,v = 1, Xi,j otherwise.

Definition 3.9. We denote Bv w the relegation of the column w from the column v in the matrix B.

Bv w i,j = 1 if j = w and Bi,v = 1, Bi,j otherwise.

Theorem 3.10. Let v, w be such that X:,v X:,w. If (A B) Xv w then (A Bv w) X.

Proof. Similar to Theorem 3.5.

Theorem 3.11. Let v, w such that X:,v X:,w. (A Bv w) is an optimal BMF for X if and only if (A B) is an optimal BMF for Xv w.

Proof. Similar to Theorem 3.7.

4 Algorithm In this section, we leverage the theorems from the previous section to build a BMF algorithm. This algorithm (Algorithm 2) consists of simplifying the initial matrix using the delegation operation, applying a BMF algorithm on the simplified matrix, and then transforming this factorization using the relegation operation to obtain a factorization of the initial matrix. In the rest of this paper, we say that a matrix X is universally simplified if we apply this Algorithm 2 to the matrix X with the operator , and existentially simplified if we apply this Algorithm 2 to the matrix X with the operator . To avoid adding unnecessary delegation, the algorithm only applies delegation when at least one 1 entry is replaced by an . Algorithm 1 therefore first performs the line delegation each time it detects that a line is included in a second one and replaces a 1 by a in a least one column, then performs the same operation for the column delegation. This process is performed as long as there are lines or columns that can be delegated. The inclusion detection can be performed using either universal or existential inclusion, depending on whether the goal is to preserve the rank of the original matrix or perform more simplifications. Example 4.1. Consider the matrix of Figure 2. Line 1 is universally included in lines 4 and 5 and a delegation of these two lines to line 1 can be performed, thereby removing six 1s, and replacing them with the emptyset value as shown in the left matrix of Figure 3. There is no more further universal simplifications between the lines of this resulting matrix. For existential simplification, line 1 is also existentially included in lines 4 and 5 but now we furthermore have that line 2 is existentially included in lines 3 and 5. Performing these existential simplifications yields the further existential inclusion of line 3 in line 4 and of line 4 in line 3. Performing the first of these two last existential simplifications leads to the right matrix in Figure 3. There is then no more existential simplifications to apply. Note that the simplified matrix we obtain will depend on the order in which the delegations are made. After performing all the line delegations, the algorithm delegates columns,

leading to the matrices in Figure 4. From this example, we see that the existentially simplified matrix has fewer 1entries (3) than the universally simplified matrix (7), which should yield a simpler BMF problem. However, as seen before, the minimum rank on the universally simplified matrix is guaranteed to be the same as the minimum rank of the initial matrix, while this is not necessarily the case for the existential simplification.

j1 j2 j3 j4 j5 j6

i1 0 0 1 1 0 1

i3 1 1 0 1 1

i4 0 0 1 1 1 1

i5 1 1 1 1 0 1

Figure 2: Example of a Boolean matrix to factorize.

0 0 1 1 0 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 0

0 0 1 1 0 1 1 1 0 0 0 1 0 0 0 1 0 0 0

Figure 3: Matrix from Figure 2 after applying Xv w for every Xv,: Xw,: at left and for every Xv,: Xw,: at right.

0 0 1 1 0 1 0 1 0 1 0 0 1 1 0

0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0

Figure 4: Left matrix from Figure 3 after applying Xv w for every X:,v X:,w at left and right matrix from Figure 3 for every X:,v X:,w at right.

Theorem 4.2. Algorithm 2 is correct, i.e., it correctly produces a BMF of Matrix X and this for both values OP { , }.

Proof. First, Algorithm 1 applies delegation (both on lines and columns) in a greedy fashion by stacking the applied delegations in LIFO Delegation. The resulting matrix and delegation LIFO are then affected in X and Delegation in line 2 of Algorithm 2. Algorithm 2 then proceeds to relegate in the inverse order of the delegations, which by an iterative use of Theorems 3.5 and 3.10 yields a BMF of the original matrix X.

Theorem 4.3. Algorithm 2 with OP = returns an optimal BMF whenever the BMF routine returns optimal BMF.

The Thirty-Eighth AAAI Conference on Artiﬁcial Intelligence (AAAI-24)

Algorithm 1: Simpli OP

1: Input: Matrix X, OP { , } 2: Delegation LIFO() 3: repeat 4: no Change true 5: while v, w, j such that Xv,: OP Xw,: and Xv,j = Xw,j = 1 do 6: X Xv w

7: Delegation.push(v w) 8: no Change false 9: end while 10: while v, w, i such that X:,v OP X:,w and Xi,v = Xi,w = 1 do 11: X Xv w

12: Delegation.push(v w) 13: no Change false 14: end while 15: until no Change is true 16: return X, Delegation

Algorithm 2: BMF through simplified matrix

1: Input: Matrix X, OP { , } 2: X , Delegation Simpli OP (X) 3: A, B BMF(X ) 4: while Delegation is not empty do 5: if (v w) Delegation.pop() then 6: A Av w

7: end if 8: if (v w) Delegation.pop() then 9: B Bv w

10: end if 11: end while 12: return A, B

Proof. The analysis proceeds as in the proof of Theorem 4.2 expect that Lemma 3.6 guarantees that the rank of the matrix after a universal simplification remains identical to the rank of the initial matrix. Therefore, an optimal BMF algorithm will find a factorization of this rank for the simplified matrix, and since the relegation operation does not increase the rank of the factorization, it will find an optimal BMF for the initial matrix.

5 Experimentation Our algorithms were implemented in C++1, and we performed the experiments on an Intel Gold 6148 Skylake processor using a single thread and 32 Go of RAM. We conducted an evaluation of our methods, Simpli and Simpli , focusing on two key aspects: the degree of simplification they achieve and their effect on the time savings when performing factorizations on the simplified matrices using existing constraint-based BMF solvers. We first evaluated the performance of our methods on well-established datasets from the literature and then on synthetic datasets.

1https://github.com/Florent Avellaneda/Delegation BMF

5.1 Classic Medium Datasets

We considered 31 classic datasets from the literature, namely: Audiology , Autism , Balance Scale , Brest Cancer , Car Evaluation , Chess , Contraceptive Method Choice , Dermatology , Firewall , Solar Flare , Heart Disease , Hepatitis , Iris , Lymphography , Mushroom , Nursery , Website Phishing , Soybean , Student Performance , Thoracic Surgery , Tic Tac-Toe Endgame , Primary Tumor , Voting Records , Wine , Zoo from UCI (Kelly, Longjohn, and Nottingham 2023), Americas-small , Apj , Customer from (Ene et al. 2008), DNA from (Myllykangas et al. 2006) and Paleo from (ˇZliobait e et al. 2023). To evaluate the degree of simplification provided by our method, we compared the number of 1s present in the initial matrix with the number of 1s remaining after applying our algorithms Simpli and Simpli . We then compared these results with the only matrix simplification algorithm from the literature, Iteress. As shown in Table 1, our method Simpli consistently produces simplified matrices with fewer 1s compared to the simplified matrices obtained by Iteress. In regards to the second algorithm, even though Simpli performs only simplifications that do not increase the rank of the matrix, it still generally produces matrices with fewer 1s compared to the simplified matrices obtained by Iteress. To evaluate the impact of these simplifications, we compared the performance of two constraint-based BMF solvers: CG (Kovacs, Gunluk, and Hauser 2021) and Opti Block (Avellaneda and Villemaire 2022). CG is an algorithm based on a Mixed-Integer Programming (MIP) solver that uses a column generation approach, while Opti Block uses a Max SAT solver. We set a time limit of 3 hours and recorded the execution time and rank found for CG in Table 1 and for Opti Block in Table 2. We present results for both solvers using the original matrix and using the matrix simplified by Iteress, Simpli , and Simpli 0. The approach Simpli 0 consists of replacing empty values with zeros in the simplified matrix obtained by Simpli . While this additional simplification could theoretically lead to a degradation in matrix rank, practical observations indicate that such rank degradation is rare. Furthermore, this step often leads to a reduction in the time required to factorize the simplified matrix. In Table 1, an asterisk indicates instances where optimality is proven, meaning that CG reports an optimal factorization on a matrix that retains the same rank as the initial matrix. While CG manages to report the optimal BMF for 5 out of 31 datasets using the original matrices, using the Simpli simplification increases this number to 11 out of 31 datasets. Moreover, the use of Simpli never degrades the rank of the obtained factorizations and improves it for 6 of the datasets. While finding an optimal BMF when CG is applied to the matrix simplified by Simpli 0 is not guaranted, we observe that the rank of the BMF obtained is the best for 29 out of 31 datasets. Moreover, the computation times are frequently and significantly reduced, often by several orders of magnitude. Note that for some datasets the value is present for Iteress, corresponding to the fact that Iteress is

The Thirty-Eighth AAAI Conference on Artiﬁcial Intelligence (AAAI-24)

DATASET CHARACTERISTICS # ONES AFTER SIMPLIFY BMF WITH CG : TIME (RANK)

SIZE # ONES ITERESS SIMPLI SIMPLI ORIGINAL ITERESS SIMPLI SIMPLI 0 ADVERT. 3279 1557 45139 5941 3942 705 3h (1556) 3h (1596) 3h (1556) 3h (704) AMERIC. 3477 1587 105205 49980 396 187 3h (1521) 3h (1251) 3h (299) 2m (187) APJ 2044 1164 6841 2946 503 453 3h (1148) 3h (1133) 3h (497) 20m (453) AUDIO 200 310 2333 225 200 200 3h (200) 2m (200) 3m (200*) 2m (200) AUTISM 704 155 6832 5702 1776 3h (154) 3h (154) 50m (154) BALANCE 625 23 3125 3125 3125 3125 20m (23*) 20m (23) 20m (23*) 20m (23) BREAST 699 90 6516 4153 4149 3h (90) 3h (90) 3h (90) CAR 1728 25 12096 11071 11071 11071 1h (25*) 30m (25) 3h (25*) 30m (25) CHESS 3196 39 25582 368 780 38 3h (38) 1s (38) 20m (38*) 1s (38) CMC 1473 71 11009 9992 10179 9573 3h (71) 3h (71) 3h (71) 3h (71) CUSTOM. 10961 277 45427 2771 1442 276 3h (277) 6m (276) 3h (277) 3m (276) DERMAT. 366 194 12482 6823 6164 3h (194) 1h (194) 3h (187) DNA 4590 392 26527 1556 539 367 1m (392) 3h (368) 3h (384) 5m (367) FIREWA. 365 709 31951 2744 88 65 3h (64) 9m (65) 1h (64*) 1s (65) FLARE 1066 43 9283 2928 1950 428 2m (43) 3h (42) 1h (42*) 3h (42) HEART 270 382 3036 325 1459 270 3h (270) 9m (270) 3h (270) 9m (270) HEPATI. 155 337 1377 1027 154 3h (154) 3h (154) 10s (154) IRIS 150 126 750 502 515 486 10m (121*) 8m (121) 10m (121*) 9m (121) LYMPH 148 54 1823 1288 1543 1283 3h (52) 20m (53) 3h (52*) 40m (53) MUSH. 8124 54 124768 1654 51 3h (54) 2h (51) 1s (51) NURSERY 12960 31 110160 61196 59480 53001 3h (30) 3h (30) 3h (30) 3h (30) PALEO 501 139 3537 284 1853 139 3h (139) 1s (139) 3h (139) 1s (139) PHISHI. 1353 26 11507 8430 4856 4466 3h (26) 3h (26) 3h (26) 6m (26) SOYBEAN 307 100 6315 3354 2909 1h (99) 20m (90) 3h (88) STUDENT 395 176 9254 8488 8517 8470 3h (176) 3h (176) 3h (176) 3h (176) THORAC. 470 340 3376 2373 2439 2310 3h (304) 3h (304) 3h (304) 3h (304) TICTAC. 958 28 8954 8954 8954 8954 3h (28) 3h (28) 3h (28) 3h (28) TUMOR 339 44 2136 1099 107 40m (44*) 20m (44*) 2h (44) VOTES 435 17 2961 381 17 90m (17*) 3h (17*) 1s (17) WINE 178 1279 2492 816 190 178 3h (178) 20s (178) 4m (178*) 20s (178) ZOO 101 28 640 85 108 25 3h (25) 1s (25) 3h (25*) 1s (25)

Table 1: Characteristics of the datasets used, comparison of the number of 1s after simplification across different algorithms and time required to find a BMF with CG. Bolded values indicate the best performance, and underlined values indicate the second best performance. The asterisk signifies that the tool was able to prove the optimality of the rank.

not able to factorize incomplete matrices. In Table 2, we notice that no asterisk is present since Opti Block is not capable of proving the optimality of the found factorizations. Similar to the previous result with CG, we observe that the use of Simpli and Simpli 0 improves the rank of the BMF found by Opti Block. If we compare the rank of the BMF and, in the case of equality, the resolution times, Opti Block applied to matrices simplified with Simpli 0 gave the best results in 25 out of the 31 datasets.

5.2 Synthetic Datasets

In this second experiment, we reused the protocol used by Gre Ess (Belohlavek and Trnecka 2015), the ancestor of Iteress. In their protocol, the authors created synthetic matrices of size 1000 500 with four different levels of density (0.1, 0.2, 0.3, and 0.4) and three different levels of rank (20, 30, and 40). Based on the same protocol, we compared our two algorithms Simpli and Simpli to Iteress. Note that we have not included Gre Ess because Iteress, which uses the

same method iteratively, performs more simplifications than Gre Ess. As shown in Table 3, the results indicate that Simpli consistently finds the best simplification with lowest number of 1-entries, Simpli generally finds the second-best simplification, and Iteress generally finds a simplification with more 1-entries compared to the other two algorithms. Note that in our experiment, Simpli almost always finds minimal simplifications, i.e., containing only k 1s for matrices of rank k. Indeed, since a k-rank BMF can be viewed as the union of k 1-rank BMF, a matrix of rank k containing k 1s cannot be further simplified and its factorization is trivial. To quantify how these simplifications, in terms of the number of 1s removed from the original matrix, affect the performance of constraint-based BMF algorithms, we used the Undercover BMF tool (Avellaneda and Villemaire 2022) with the --optimal option to find optimal BMF. Table 3 shows the time used by this tool to find and guarantee an optimal BMF from the original and the universally

The Thirty-Eighth AAAI Conference on Artiﬁcial Intelligence (AAAI-24)

DATA BMF WITH OPTIBLOCK : TIME (RANK)

ORIGINAL SIMPLI ITERESS SIMPLI 0 ADV. 3h (794) 3h (749) 3h (711) 3h (703) AME. 3h (216) 3h (188) 1h (189) 1h (187) APJ 3h (474) 3h (453) 2h (453) 2h (453) AUD. 30m (200) 1m (200) 1m (200) 1m (200) AUT. 3m (154) 3m (154) 1m (154) BAL. 1s (23) 1s (23) 1s (23) 1s (23) BRE. 10s (90) 1m (90) 30s (90) CAR 3s (25) 3s (25) 3s (25) 3s (25) CHE. 1m (38) 20s (38) 10s (38) 10s (38) CMC 40s (71) 40s (71) 20s (71) 20s (71) CUS. 3h (282) 2h (277) 1h (276) 1h (276) DER. 17m (189) 17m (184) 6m (192) DNA 3h (497) 3h (373) 1h (368) 1h (367) FIRE. 2m (64) 1m (64) 30s (65) 30s (65) FLA. 14s (42) 14s (42) 4s (42) 4s (42) HEA. 90m (270) 90m (270) 2m (270) 2m (270) HEP. 9m (154) 7m (154) 20s (154) IRIS 10s (121) 10s (122) 10s (122) 20s (122) LYM. 5s (54) 6s (54) 3s (55) 3s (54) MUS. 20m (51) 3m (51) 1m (51) NUR. 40s (30) 40s (30) 90s (30) 40s (30) PAL. 4m (139) 2m (139) 30s (139) 30s (139) PHI. 10s (26) 5s (26) 5s (26) 5s (26) SOY. 2m (84) 1m (85) 1m (88) STU. 6m (176) 6m (176) 5m (177) 4m (176) THO. 10m (304) 20m (306) 20m (306) 20m (305) TIC. 3s (28) 3s (28) 3s (28) 3s (28) TUM. 4s (44) 3s (44) 1s (44) VOT. 2s (17) 1s (17) 1s (17) WIN. 6m (178) 2m (178) 2m (178) 2m (178) ZOO 1s (25) 1s (25) 1s (25) 1s (25)

Table 2: Characteristics of the datasets used, comparison of the number of 1s after simplification across different algorithms and time required to find a BMF with Opti Block. Bolded values indicate the best performance, and underlined values indicate the second best performance.

simplified datasets. We did not use the existentially simplified dataset since the simplifications made in this case no longer guarantee finding a minimal rank factorization. We observed that the simplification obtained by the universal simplification improved the computation time of this constraint solver by several orders of magnitude. Note that when CG replaces optimal Undercover BMF, the results remain similar: for universally simplified matrices, CG identifies optimal BMF within seconds for k = 20, within minutes for k = 30, and within few dozen minutes for k = 40. However, after 3 hours, no solution is found for k = 40 with a density of 0.4. And when the matrix is not simplified, no proof of optimality is obtained after 3 hours of computation.

This experiment showed that for matrices of low rank, our approach allowed us to find optimal BMF, even if the size of the matrix to be factorized is large.

DENS NUMBER OF REMAINING 1S BMF

SIMPLI SIMPLI ITERESS OPT OPT

K=20 0.1 20 20 3218 4s 5m 0.2 20 20 1561 4s 14m 0.3 20 20 796 3s 20m 0.4 20 20 343 2s 30m

K=30 0.1 30 30 1525 1m 2h 0.2 30 30 550 4m 1h 0.3 30 188 584 8m 1h 0.4 36 2856 1324 4m 2h

K=40 0.1 40 72 1011 11m 0.2 40 259 669 54m 0.3 40 512 513 3h 0.4 57 9029 5292

Table 3: Number of ones present in a synthetic matrix after simplification and time to find an optimal BMF from the original and simplified matrices. Bolded values indicate the best performance, and underlined values indicate the second best performance.

6 Conclusion This paper has addressed the problem of BMF by exploring the potential of matrix simplification before factorization to reduce the time required by existing constraint-based BMF algorithms. We introduced two methods to simplify Boolean matrices before factorization by replacing specific 1-entries in the matrix with missing values. The central idea is that these simplifications allow for an easy translation from a BMF on the simplified matrix to a BMF on the original matrix. The first method, existential simplification (Simpli ), can perform numerous simplifications, but does not ensure that the rank of the simplified matrix matches the rank of the original matrix. The second method, universal simplification (Simpli ), performs fewer simplifications, but guarantees that an optimal BMF on the simplified matrix results in an optimal BMF on the original matrix. The benefits of our approach were quantified through experiments. First, our methods effectively reduced the number of 1s in the matrices to be factorized, regardless of whether they come from real or synthetic datasets. Second, a significant reduction in computational time was observed for factorizations on simplified matrices compared to the original matrices. Third, the simplification of matrices not only improved the computation time, but also allowed heuristics to discover lower-rank factorizations. This result can be attributed to the reduction of the search space due to simplification. Finally, our method demonstrated the ability to identify optimal rank factorizations for ranks up to several dozen, even with large matrices.

The Thirty-Eighth AAAI Conference on Artiﬁcial Intelligence (AAAI-24)

Acknowledgements We gratefully acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [funding reference number RGPIN-2023-04468 and RGPIN-2023-04557].

References Avellaneda, F.; and Villemaire, R. 2022. Undercover Boolean Matrix Factorization with Max SAT. Proceedings of the AAAI Conference on Artificial Intelligence, 36(4): 3672 3681. Belohlavek, R.; Outrata, J.; and Trnecka, M. 2019. Factorizing Boolean matrices using formal concepts and iterative usage of essential entries. Information Sciences, 489: 37 49. Belohlavek, R.; and Trnecka, M. 2015. From-below approximations in Boolean matrix factorization: Geometry and new algorithm. Journal of Computer and System Sciences, 81(8): 1678 1697. Belohlavek, R.; and Vychodil, V. 2010. Discovery of optimal factors in binary data via a novel method of matrix decomposition. Journal of Computer and System Sciences, 76(1): 3 20. Ene, A.; Horne, W.; Milosavljevic, N.; Rao, P.; Schreiber, R.; and Tarjan, R. E. 2008. Fast exact and heuristic methods for role minimization problems. In Proceedings of the 13th ACM symposium on Access control models and technologies, 1 10. Ganter, B.; and Wille, R. 2012. Formal concept analysis: mathematical foundations. Springer Science & Business Media. Geerts, F.; Goethals, B.; and Mielik ainen, T. 2004. Tiling databases. In International conference on discovery science, 278 289. Springer. Hajiabolhassan, H.; and Moazami, F. 2012. Secure frameproof codes through biclique covers. Discrete Mathematics & Theoretical Computer Science, Vol. 14 no. 2. Hashemi, S.; Tann, H.; and Reda, S. 2018. BLASYS: Approximate logic synthesis using Boolean matrix factorization. In 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), 1 6. IEEE. Karaev, S.; Miettinen, P.; and Vreeken, J. 2015. Getting to know the unknown unknowns: Destructive-noise resistant boolean matrix factorization. In Proceedings of the 2015 SIAM International Conference on Data Mining, 325 333. SIAM. Kelly, M.; Longjohn, R.; and Nottingham, K. 2023. The UCI Machine Learning Repository. Kocayusufoglu, F.; Hoang, M. X.; and Singh, A. K. 2018. Summarizing network processes with network-constrained Boolean matrix factorization. In 2018 IEEE International Conference on Data Mining (ICDM), 237 246. IEEE. Kovacs, R. A.; Gunluk, O.; and Hauser, R. A. 2021. Binary matrix factorisation via column generation. Proceedings of the AAAI Conference on Artificial Intelligence, 35(5): 3823 3831.

Koyut urk, M.; and Grama, A. 2003. PROXIMUS: a framework for analyzing very high dimensional discrete-attributed datasets. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 147 156. Liang, L.; Zhu, K.; and Lu, S. 2020. BEM: mining coregulation patterns in transcriptomics via boolean matrix factorization. Bioinformatics, 36(13): 4030 4037. Lucchese, C.; Orlando, S.; and Perego, R. 2010. Mining top-k patterns from binary datasets in presence of noise. In Proceedings of the 2010 SIAM International Conference on Data Mining, 165 176. SIAM. Ma, J.; Hashemi, S.; and Reda, S. 2022. Approximate Logic Synthesis Using Boolean Matrix Factorization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 41(1): 15 28. Markopoulou, A. P.; and E. Anagnostou, M. 1998. Optimal grouping of components in a distributed system. Computer Communications, 21(16): 1452 1461. Miettinen, P.; Mielik ainen, T.; Gionis, A.; Das, G.; and Mannila, H. 2008. The discrete basis problem. IEEE transactions on knowledge and data engineering, 20(10): 1348 1362. Miettinen, P.; and Neumann, S. 2020. Recent Developments in Boolean Matrix Factorization. In Bessiere, C., ed., Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, 4922 4928. ijcai.org. Myllykangas, S.; Himberg, J.; B ohling, T.; Nagy, B.; Hollm en, J.; and Knuutila, S. 2006. DNA copy number amplification profiling of human neoplasms. Oncogene, 25(55): 7324 7332. Shu, G.; Lee, D.; and Yannakakis, M. 2006. A note on broadcast encryption key management with applications to large scale emergency alert systems. In IEEE International Parallel & Distributed Processing Symposium, 8 pp. . Wan, C.; Chang, W.; Zhao, T.; Li, M.; Cao, S.; and Zhang, C. 2020. Fast and efficient boolean matrix factorization by geometric segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04): 6086 6093. Wicker, J.; Pfahringer, B.; and Kramer, S. 2012. Multi-label classification using boolean matrix decomposition. In Proceedings of the 27th annual ACM symposium on applied computing, 179 186. Xiang, Y.; Jin, R.; Fuhry, D.; and Dragan, F. F. 2011. Summarizing transactional databases with overlapped hyperrectangles. Data Mining and Knowledge Discovery, 23(2): 215 251. ˇZliobait e, I.; Fortelius, M.; Bernor, R. L.; Van den Hoek Ostende, L. W.; Janis, C. M.; Lintulaakso, K.; S ail a, L. K.; Werdelin, L.; Casanovas-Vilar, I.; Croft, D. A.; et al. 2023. The NOW database of fossil mammals. In Evolution of Cenozoic Land Mammal Faunas and Ecosystems: 25 Years of the NOW Database of Fossil Mammals, 33 42. Springer.

The Thirty-Eighth AAAI Conference on Artiﬁcial Intelligence (AAAI-24)