# Undercover Boolean Matrix Factorization with MaxSAT

Florent Avellaneda,1,2 Roger Villemaire1
1 Université du Québec à Montréal (UQAM), Montréal, Canada
2 Centre de Recherche de l'Institut Universitaire de Gériatrie de Montréal, Montréal, Canada
avellaneda.florent@uqam.ca, villemaire.roger@uqam.ca

The k-undercover Boolean matrix factorization problem aims to approximate an m×n Boolean matrix X as the Boolean product of an m×k matrix A and a k×n matrix B such that X is a cover of A∘B, i.e., no representation error is allowed on the 0-entries of the matrix X. To infer optimal and block-optimal k-undercovers, we propose two exact methods based on MaxSAT encodings. From a theoretical standpoint, we prove that our method for inferring a block-optimal k-undercover is a (1 − 1/e) ≈ 0.632 approximation for the optimal k-undercover problem. From a practical standpoint, experimental results indicate that our block-optimal k-undercover algorithm outperforms the state of the art, even when compared with algorithms for the more general Boolean Matrix Factorization problem, for which only minimizing the reconstruction error is required.

1 Introduction

The Boolean Matrix Factorization (BMF) problem consists, for an m×n Boolean matrix X and an integer k ≤ min(n, m), of finding m×k and k×n Boolean matrices A and B such that A∘B is equal to X for as many entries as possible. The Boolean product ∘ is matrix multiplication over the Boolean semiring, where the product is logical AND and the addition is logical OR. When the rows of the matrix X represent instances and the columns attributes, such a decomposition yields a matrix A that represents the same set of instances using only k derived attributes and a matrix B that defines these new attributes in terms of the original ones.
Thus, since BMF summarizes data in terms of these new k attributes, this factorization has been applied to many areas such as role-based access control (Vaidya, Atluri, and Guo 2007), multi-label classification (Wicker, Pfahringer, and Kramer 2012), network pattern mining (Kocayusufoglu, Hoang, and Singh 2018) and functional interactions in brain networks (Haddad et al. 2018). Although many methods have been developed for the non-Boolean case, these methods fail to work in a Boolean context. For example, the well-known Singular Value Decomposition (SVD) can obtain a greater reconstruction error than a BMF of the same decomposition size (Miettinen et al. 2008). For this reason, specific algorithms dealing with the Boolean context must be developed.

Copyright 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

In recent years, many different approaches have been applied to BMF (Miettinen and Neumann 2020). Representing BMF as a maximum a posteriori inference problem, (Ravanbakhsh, Póczos, and Greiner 2016) presented a message-passing algorithm, MessagePassing, while (Kovacs, Günlük, and Hauser 2021) introduced the CG algorithm, based on mixed integer linear programming. This last paper also presented a greedy algorithm called k-greedy. Still widely used is the greedy algorithm ASSO from (Miettinen et al. 2008). As for the MEBF algorithm of (Wan et al. 2020), it uses matrix permutations to approximate an upper-triangular-like matrix. The quality of a BMF solution is usually evaluated in terms of the reconstruction error, which is the proportion of entries where A∘B differs from X. This considers erroneous 0s and 1s to be of equal importance. However, in applications such as access control, where an erroneous 1 in A∘B would represent a spurious permission, incorrect 1s in A∘B must be avoided.
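The reconstruction error mentioned above can be computed directly from its definition. A minimal sketch in Python (the function name and 0/1-list representation are ours, not from the paper's implementation):

```python
def reconstruction_error(X, A, B):
    """Proportion of entries where the Boolean product A ∘ B differs from X."""
    m, n, k = len(X), len(X[0]), len(B)
    diff = sum(X[i][j] != int(any(A[i][l] and B[l][j] for l in range(k)))
               for i in range(m) for j in range(n))
    return diff / (m * n)
```

For instance, with X the 2×2 identity matrix, A = [[1],[1]] and B = [[1,0]], the product covers column 0 of both rows, so two of the four entries disagree and the error is 0.5.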
In undercover BMF (also known as from-below approximation BMF), one hence searches for a solution A∘B whose 1-entries have this same value in X. Unexpectedly, undercover BMF algorithms have been shown to give good approximations compared to general BMF algorithms (Belohlavek and Trnecka 2015), even though BMF allows more candidate factorizations. As for BMF, different approaches have been applied to undercover BMF. For instance, Formal Concept Analysis (FCA) has been used by (Belohlavek and Trnecka 2015) to introduce the GreEss algorithm. This approach was later improved upon with the IterEss algorithm (Belohlavek, Outrata, and Trnecka 2019). Also, the Tiling algorithm (Geerts, Goethals, and Mielikäinen 2004) greedily searches for rectangular patterns (tiles). Building on the approach used with GreConD, the GreConD+ algorithm (Belohlavek and Trnecka 2018) adds a greedy step to minimize the reconstruction error. However, this last step will usually not preserve the undercovering property. Similarly, PANDA+ (Lucchese, Orlando, and Perego 2013) is not strictly an undercover BMF algorithm: it first finds dense initial components that overcover only a strictly limited number of entries, before greedily extending them without increasing the reconstruction error.

The Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)

The contribution of this paper is a MaxSAT approach to solve the undercover BMF problem. MaxSAT is the optimization version of the well-known Boolean Satisfiability Testing (SAT) problem. We leverage recent progress in MaxSAT solving that occurred in the wake of the impressive developments in SAT solving (Bacchus, Järvisalo, and Martins 2021). We introduce the notion of block-optimal factorization together with efficient algorithms for computing such factorizations, and show that even if a block-optimal factorization is a restricted form of undercover BMF, our algorithms outperform state-of-the-art undercover and even general BMF algorithms.
This paper is structured as follows. First, to formalize the approach, we provide definitions related to Boolean matrix factorization (Section 2). Then, we give a MaxSAT encoding for the k-undercover Boolean matrix factorization problem, propose two optimizations and evaluate the performance on synthetic data (Section 3). Finally, we introduce block-optimal factorizations, propose exact methods for finding such a factorization, and compare the results obtained with the classical approaches in the literature (Section 4).

2 Definitions

We denote by A_{m×n} a Boolean matrix A ∈ {0, 1, null}^{m×n} with m rows and n columns. We use null to represent missing data, and if A ∈ {0, 1}^{m×n} we say that the matrix A is complete. We also use the notation A_{i,j} to represent the entry in the i-th row and the j-th column, and the notations A_{i,:} and A_{:,j} to represent the i-th row and the j-th column of A, respectively. A submatrix A_{I,J} of A is the matrix obtained by selecting a subset I ⊆ [1, m] of A's rows and a subset J ⊆ [1, n] of A's columns. We now formally define the Boolean product of two matrices.

Definition 1. The Boolean product of two complete matrices A_{m×k} and B_{k×n} is the matrix (A∘B)_{m×n} defined by:

(A∘B)_{i,j} = ⋁_{ℓ=1}^{k} (A_{i,ℓ} ∧ B_{ℓ,j})

This definition is similar to the classical matrix product, where the product is replaced by the logical AND and the addition by the logical OR. Note that (A∘B) can also be written as the union of products of rank-1 matrices:

A∘B = ⋃_{ℓ=1}^{k} (A_{:,ℓ} ∘ B_{ℓ,:})

where (A ∪ B)_{i,j} = A_{i,j} ∨ B_{i,j}. The matrix A_{:,ℓ} ∘ B_{ℓ,:} is also called the (ℓ-th) block of the product. The Boolean product A∘B is then the union of its blocks.

Definition 2. A matrix D_{m×n} undercovers a matrix X_{m×n} (denoted D ⊑ X) if D is a complete matrix such that there are no i, j with X_{i,j} = 0 and D_{i,j} = 1. An entry X_{i,j} such that D_{i,j} = 1 is said to be covered by D.

We define (A − B), the subtraction operator between two matrices, by:

(A − B)_{i,j} = null if B_{i,j} = 1, and A_{i,j} otherwise.

We denote by |A|₁ the number of 1s in A.
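The operations of this section can be sketched in plain Python (a minimal sketch using 0/1 lists with None for null; all names are ours, not from the paper's C++ implementation):

```python
from itertools import product

def bool_product(A, B):
    """Boolean product (A ∘ B): OR over ℓ of A[i][ℓ] AND B[ℓ][j]."""
    k, n = len(B), len(B[0])
    return [[int(any(A[i][l] and B[l][j] for l in range(k)))
             for j in range(n)] for i in range(len(A))]

def undercovers(D, X):
    """D ⊑ X: no entry where X is 0 but D is 1 (None entries of X impose nothing)."""
    return all(not (X[i][j] == 0 and D[i][j] == 1)
               for i, j in product(range(len(X)), range(len(X[0]))))

def ones(A):
    """|A|₁: number of 1-entries."""
    return sum(v == 1 for row in A for v in row)
```

For example, with A = [[1,0],[1,1]] and B = [[1,1,0],[0,0,1]], the product is [[1,1,0],[1,1,1]]; it undercovers any X that has 1 (or null) at those five positions.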
Note that when B ⊑ A, the (i,j)-entry of (A − B) is 1 exactly when A_{i,j} = 1 and B_{i,j} = 0, and |A − B|₁ is equal to the number of erroneous entries in the reconstruction B of A.

Definition 3. Two matrices A_{m×k} and B_{k×n} are an optimal k-undercover of the matrix X_{m×n} if:

(A∘B) ⊑ X and ∀A′_{m×k}, B′_{k×n}: (A′∘B′) ⊑ X ⟹ |X − (A′∘B′)|₁ ≥ |X − (A∘B)|₁

3 Optimal Undercover Factorization

In this section, we propose a MaxSAT encoding for optimal k-undercover. We start by giving a simple encoding of the problem in MaxSAT, then we propose two optimizations. The first one is a classical optimization which consists in adding constraints to break the symmetry of the solutions. The second optimization, which we call cardinality generation, is, to the best of our knowledge, a type of optimization not found in the literature. We will show that cardinality generation allows us to gain several orders of magnitude on the computation time needed to solve the MaxSAT formula.

3.1 MaxSAT Encoding

As with SAT instances, a MaxSAT instance is specified by a set of constraints (clauses) that are logical ORs of Boolean variables and their negations. MaxSAT additionally considers some clauses to be soft, and one searches for an assignment of Boolean values to variables that satisfies all non-soft clauses while maximizing the number of satisfied soft clauses. Given a matrix X_{m×n}, the k-undercover problem can be encoded in MaxSAT as the following set of clauses. For every i, j such that X_{i,j} = 0:

⋀_{ℓ=1}^{k} (¬A_{i,ℓ} ∨ ¬B_{ℓ,j})   (1)

These clauses guarantee that (A∘B)_{i,j} = 0. For every i, j such that X_{i,j} = 1:

⋀_{ℓ=1}^{k} (¬T^ℓ_{i,j} ∨ A_{i,ℓ}) ∧ ⋀_{ℓ=1}^{k} (¬T^ℓ_{i,j} ∨ B_{ℓ,j})   (2)

These clauses guarantee that if T^ℓ_{i,j} = 1 then (A∘B)_{i,j} = 1. Finally, for every i, j such that X_{i,j} = 1:

¬S_{i,j} ∨ ⋁_{ℓ=1}^{k} T^ℓ_{i,j}   (3)

where the S_{i,j} are unary soft clauses. These clauses guarantee that if S_{i,j} = 1 then (A∘B)_{i,j} = 1.
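The clause generation is mechanical and can be sketched as follows (DIMACS-style signed-integer literals; the variable numbering and function name are ours, and the sketch builds the clause sets of formulas (1)–(3) without the symmetry-breaking or cardinality optimizations):

```python
def encode_undercover(X, k):
    """Build hard clauses (1)-(2)-(3) and unit soft clauses for the S variables."""
    m, n = len(X), len(X[0])
    var = 0
    def new():
        nonlocal var
        var += 1
        return var
    A = [[new() for _ in range(k)] for _ in range(m)]   # A[i][l] variables
    B = [[new() for _ in range(n)] for _ in range(k)]   # B[l][j] variables
    hard, soft = [], []
    for i in range(m):
        for j in range(n):
            if X[i][j] == 0:      # clauses (1): block covers no 0-entry
                hard += [[-A[i][l], -B[l][j]] for l in range(k)]
            elif X[i][j] == 1:    # null entries generate no clause
                T = [new() for _ in range(k)]
                S = new()
                for l in range(k):   # clauses (2): T^l implies A and B entries
                    hard += [[-T[l], A[i][l]], [-T[l], B[l][j]]]
                hard.append([-S] + T)   # clause (3)
                soft.append([S])        # unary soft clause
    return hard, soft
```

On the 2×2 identity matrix with k = 1, the encoding has 8 hard clauses and 2 soft clauses, and at most one soft clause can be satisfied, matching the fact that a single block can cover only one of the two incompatible 1-entries.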
Note that since clauses (1) guarantee that we infer an undercover, and since (2) and (3) guarantee that X_{i,j} is covered if S_{i,j} = 1, an assignment that maximizes the number of S_{i,j} = 1 corresponds to a maximal undercover.

3.2 Symmetry Breaking

We propose in this section an optimization based on symmetry breaking. Symmetry breaking, a well-known practice of the SAT community (Aloul et al. 2002; Aloul, Sakallah, and Markov 2006; Brown, Finkelstein, and Purdom Jr 1988), consists in adding constraints to quickly remove some assignments as acceptable solutions. The idea is to reduce the number of solutions that are equivalent to each other. In our matrix factorization context, for any solution A∘B and for all i, j ≤ k, we can switch the i-th column and the j-th column of the matrix A and switch the i-th row and the j-th row of the matrix B to obtain a new solution whose Boolean product remains unchanged. To avoid this kind of permutation, we impose a lexicographical order on the rows of the matrix B. This can be performed with the following set of clauses, where Z_{i,j} states that rows i and i+1 of B agree on the first j−1 columns. For every i ∈ [1, k−1] and j ∈ [1, n−1]:

(Z_{i,j} ∧ ¬B_{i,j} ∧ ¬B_{i+1,j}) → Z_{i,j+1}   (4)

(Z_{i,j} ∧ B_{i,j} ∧ B_{i+1,j}) → Z_{i,j+1}   (5)

For every i ∈ [1, k−1] and j ∈ [1, n]:

(Z_{i,j} ∧ ¬B_{i,j}) → ¬B_{i+1,j}   (6)

And finally, for each i ∈ [1, k−1]:

Z_{i,1}   (7)

3.3 Generate Cardinalities

A recent and efficient algorithm for solving the MaxSAT problem is the OLL algorithm (Andres et al. 2012). This algorithm, originally created for ASP solvers, has been adapted to MaxSAT solvers (Morgado, Dodaro, and Marques-Silva 2014) and is used in tools such as MSCG (Morgado, Ignatiev, and Marques-Silva 2014), RC2 (Ignatiev, Morgado, and Marques-Silva 2019) and EvalMaxSAT (Avellaneda 2020). The principle of this algorithm is to search for unsatisfiable cores (sets of soft clauses that cannot all be satisfied at the same time) in order to replace many soft clauses by a few cardinality constraints. The following example illustrates this algorithm.

Example 1.
Let ϕ_soft = {x₁, x₂, x₃} be a set of soft (unary) clauses and ϕ_hard = {(¬x₁ ∨ ¬x₂), (¬x₂ ∨ ¬x₃), (¬x₁ ∨ ¬x₃)} be a set of hard clauses. A call to a SAT solver on the set of clauses ϕ_soft ∪ ϕ_hard will find that the formula is unsatisfiable, and an unsatisfiable core can be {x₁, x₂}. This means that at least one of the two clauses x₁ or x₂ has to be false, and the cost of the MaxSAT formula is at least 1. Thus, we remove x₁ and x₂ from ϕ_soft, add the cardinality constraint (x₁ + x₂ ≥ 1) to ϕ_soft and increment the cost of the formula. On the second iteration, a SAT solver will find that ϕ_soft ∪ ϕ_hard is still unsatisfiable, and the core will be {x₃, (x₁ + x₂ ≥ 1)}. This means that at least one of x₃ and (x₁ + x₂ ≥ 1) has to be false, i.e., x₁ + x₂ + x₃ < 2. Thus, we remove x₃ from ϕ_soft, replace (x₁ + x₂ ≥ 1) by the trivially satisfied (x₁ + x₂ ≥ 0), add a new cardinality constraint (x₃ + (x₁ + x₂ ≥ 1) ≥ 1) and increment the cost of the formula. Now, the formula ϕ_soft ∪ ϕ_hard is satisfiable and we can conclude that the cost of the initial formula is two.

We observe that one call to a SAT solver is necessary for each increment of the cost of the formula. Thus, if the cost of the formula to be solved is high, it implies many calls to a SAT solver, which, in practice, can take a long time. In this section, we propose a method to significantly reduce the number of SAT calls performed. The idea is to use knowledge of the problem modeled by the formula in order to quickly find sets of soft clauses in which only k of the clauses can be satisfied at a time. If at most k clauses can be satisfied in the set of clauses {c₁, c₂, ..., c_f}, this means that we can increase the cost of the formula by f − k, remove the f clauses from the set ϕ_soft and add a cardinality constraint (c₁ + c₂ + ··· + c_f ≥ k) to ϕ_soft. A similar idea has been used by the RC2 solver (Ignatiev, Morgado, and Marques-Silva 2019), but only for k = 1. The approach used consists in calling a SAT solver to find incompatibilities between soft clauses.
When a set of pairwise incompatible clauses is found, RC2 generates an AtMost1 cardinality constraint and removes these clauses from ϕ_soft. Although this approach is efficient, it has a major drawback: calling a SAT solver to find incompatibilities between clauses can be very costly. To overcome this drawback, the authors limit the search time of the SAT solver, accepting that some incompatibilities will not be detected. Knowing the problem modeled by the formulas allows us to propose a more efficient and general method than the one used in RC2. In order to find sets of soft clauses of which only k can be satisfied at the same time, we simply identify sets of pairwise incompatible 1-entries in the matrix to undercover.

Definition 4. Two entries X_{i₁,j₁} and X_{i₂,j₂} of a matrix X_{m×n} are incompatible if X_{i₁,j₁} = 1, X_{i₂,j₂} = 1 and X contains either X_{i₁,j₂} = 0 or X_{i₂,j₁} = 0.

Now, the key property is that a rank-1 factorization can cover at most one of two incompatible 1-entries.

Proposition 1. If X_{i₁,j₁} and X_{i₂,j₂} are two incompatible entries of a matrix X_{m×n}, then for every product of rank-1 matrices (A_{m×1} ∘ B_{1×n}) ⊑ X, we have (A∘B)_{i₁,j₁} = 0 or (A∘B)_{i₂,j₂} = 0.

Proof. If (A∘B)_{i₁,j₁} = 1 and (A∘B)_{i₂,j₂} = 1, then A_{i₁,1} = 1, A_{i₂,1} = 1, B_{1,j₁} = 1 and B_{1,j₂} = 1. This implies that (A∘B)_{i₁,j₂} = 1 and (A∘B)_{i₂,j₁} = 1, and therefore, since A∘B ⊑ X, that X_{i₁,j₁} and X_{i₂,j₂} are not incompatible. □

Furthermore, this implies a bound on how many entries of a set of pairwise incompatible 1-entries can be covered by a rank-k factorization.

Theorem 1. Let X_{m×n} be a matrix. If I is a set of pairwise incompatible 1-entries of X, then at most k of these entries can be covered by a k-undercover.

Proof. By definition, a k-undercover corresponds to two matrices A_{m×k} and B_{k×n} such that A∘B ⊑ X. Also by definition, A∘B = ⋃_{ℓ=1}^{k} D_ℓ where D_ℓ = A_{:,ℓ} ∘ B_{ℓ,:}. Since (A∘B) ⊑ X and D_ℓ ⊑ (A∘B), we know that D_ℓ ⊑ X for each ℓ ∈ [1, k].
By Proposition 1, since all elements of I are pairwise incompatible, each D_ℓ can cover at most one element of I. Thus, at most k elements of I can be covered by a k-undercover. □

Theorem 1 implies that for each set I of pairwise incompatible elements of X_{m×n}, at most k elements can be covered by an undercover of rank k. Thus, without loss of generality, we can remove the soft clauses {S_{i,j} | (i, j) ∈ I} associated with the elements of I and replace them by a cardinality constraint Σ_{(i,j)∈I} S_{i,j} ≥ k. In Algorithm 1, we propose a method that performs these substitutions.

Algorithm 1: Generate Cardinalities
Input: A matrix X_{m×n}, an integer k, a Boolean formula Φ, and a matrix of soft unary clauses S.
Output: A simplified and equivalent Boolean formula.
1: allOne ← {(i, j) | X_{i,j} = 1}
2: while true do
3:   noComp ← {}
4:   for all (i, j) ∈ allOne do
5:     if ∀(i′, j′) ∈ noComp : X_{i,j′} = 0 ∨ X_{i′,j} = 0 then
6:       noComp ← noComp ∪ {(i, j)}
7:     end if
8:   end for
9:   if noComp = ∅ then
10:      return Φ, S
11:  end if
12:  allOne ← allOne \ noComp
13:  if |noComp| > k then
14:    Φ ← Φ ∧ (Σ_{(i,j)∈noComp} S_{i,j} ≥ k)
15:    Remove each S_{i,j} with (i, j) ∈ noComp as a soft clause of Φ
16:  end if
17: end while

3.4 Experimentation

All experiments have been performed on Ubuntu 20.04 with an Intel Core i7-2600K CPU @ 3.40GHz and 12 GB of RAM. In this section, we compare the time required to solve randomly generated problems with all MaxSAT solvers competing in the MaxSAT Evaluation 2021 (Bacchus et al. 2021), i.e., MaxHS (Bacchus 2021), CASHWMaxSAT (Lei et al. 2021), EvalMaxSAT (Avellaneda 2020), UWrMaxSat (Piotrów 2021; Piotrów 2020), Open-WBO-RES (Martins et al. 2021; Martins, Manquinho, and Lynce 2014), Pacose (Paxian and Becker 2021) and Exact (Devriendt 2021). The experimentation proceeds as follows.

Algorithm 2: Optimal Undercover
Input: A matrix X_{m×n} and an integer k.
Output: Two matrices A_{m×k} and B_{k×n} such that (A∘B) is an optimal k-undercover for X.
1: Initialize the formula Φ with constraints (4), (5), (6), (7).
2: Let S_{m×n} be a matrix of soft unary clauses.
3: for all (i, j) such that X_{i,j} = 0 do
4:   Φ ← Φ ∧ ⋀_{ℓ=1}^{k} (¬A_{i,ℓ} ∨ ¬B_{ℓ,j})   {Formula (1)}
5: end for
6: for all (i, j) such that X_{i,j} = 1 do
7:   Φ ← Φ ∧ ⋀_{ℓ=1}^{k} (¬T^ℓ_{i,j} ∨ A_{i,ℓ})   {Formula (2)}
8:   Φ ← Φ ∧ ⋀_{ℓ=1}^{k} (¬T^ℓ_{i,j} ∨ B_{ℓ,j})   {Formula (2)}
9:   Φ ← Φ ∧ (¬S_{i,j} ∨ ⋁_{ℓ=1}^{k} T^ℓ_{i,j})   {Formula (3)}
10: end for
11: Φ ← Generate Cardinalities(X, k, Φ, S)
12: return MaxSAT(Φ, Σ_{i∈[1,m], j∈[1,n]} S_{i,j})

We randomly generated two matrices A_{100×10} and B_{10×100}, assigning the value 1 to each entry with probability p = √(1 − ¹⁰√(1 − 0.1)) ≈ 0.105, thus ensuring that the probability of having a 1 in an entry of A∘B is 10%. We then computed X = (A∘B) and randomly removed 10% of the entries (setting them to null). For k ranging from 1 to 10, we ran Optimal Undercover(X, k) (Algorithm 2) with and without the Generate Cardinalities optimization (Algorithm 1) and compared the time required by the MaxSAT solvers listed above. For the sake of clarity, we plot in Figure 1 the times required to solve the problem for only four MaxSAT solvers. This experimentation showed that the Generate Cardinalities optimization reduces computation time by several orders of magnitude. We do not compare our tool to others in this section because we are not aware of any other tool that can find optimal k-undercovers. Note however that the CG method (Kovacs, Günlük, and Hauser 2021), the only tool allowing to infer exact BMF, requires many hours even with small k (3h30 with k = 2).

4 Block-Optimal Undercover Factorization

To handle large datasets and to be able to deal with larger ranks, we propose a weaker definition of the optimality of an undercover that we call block-optimality.

Definition 5. Two matrices A_{m×k} and B_{k×n} form a block-optimal k-undercover of a matrix X_{m×n} if, for each p ∈ [1, k], the block (A_{:,p} ∘ B_{p,:}) is an optimal 1-undercover of X − ⋃_{ℓ≠p} (A_{:,ℓ} ∘ B_{ℓ,:}).

The key insight is that two matrices A and B form a block-optimal undercover of X if we cannot find a better undercover by modifying a single block.
Figure 1: Average execution time over ten runs of Optimal Undercover (Algorithm 2) on a randomly generated dataset using various MaxSAT solvers (x-axis: k; y-axis: time in seconds; solvers shown: EvalMaxSAT, Pacose, MaxHS, UWrMaxSat). The dashed lines correspond to our encoding without the Generate Cardinalities optimization and the solid lines correspond to our encoding with the optimization.

The advantage of this definition is that one can find a block-optimal undercover by incrementally improving an initial solution one block at a time. Moreover, finding an appropriate block consists in searching for an optimal 1-undercover, a problem that we are able to solve quickly, as we have seen in the previous section. Naturally, it remains that a better solution can potentially be found by modifying several blocks at once, as allowed by the BMF problem. However, we show that in practice the solutions found with our algorithm are of good quality compared to those of general BMF algorithms. Furthermore, although under the P ≠ NP assumption it is not possible to approximate the k-undercover Boolean matrix factorization problem within a finite ratio in polynomial time (Miettinen et al. 2008), we show that a block-optimal k-undercover is a 1/2-approximation.

Theorem 2. A block-optimal k-undercover is a 1/2-approximation for the k-undercover Boolean matrix factorization problem.

Proof. Let X be the matrix to factorize, Â∘B̂ be an optimal k-undercover factorization of X and A∘B be a block-optimal k-undercover factorization of X. Let #APX = |A∘B|₁ be the number of entries covered by the block-optimal k-undercover. Note that since A, B are block-optimal, each block corresponds to an optimal 1-undercover. Let b_min be a block of A∘B that covers the minimum number of entries not covered by another block. The block b_min covers at most #APX/k entries that are not covered by another block. Let b_max be the block of Â∘B̂ that covers the maximum number n_o of entries not covered by A∘B.
Now, n_o cannot be strictly greater than #APX/k, since otherwise b_max would be a better 1-undercover than b_min, which contradicts the fact that b_min is an optimal 1-undercover. It then follows that Â∘B̂ covers at most k · (#APX/k) = #APX additional entries compared to A∘B. Therefore Â∘B̂ covers at most twice as many entries as A∘B, establishing the 1/2-approximation. □

Algorithm 3: OptiBlock
Input: A matrix X_{m×n} and two matrices A_{m×k}, B_{k×n} such that (A∘B) ⊑ X.
Output: Two matrices that are a block-optimal k-undercover of X.
1: repeat
2:   modif ← false
3:   for p = 1 to k do
4:     X′ ← X − ⋃_{ℓ≠p} (A_{:,ℓ} ∘ B_{ℓ,:})
5:     Y, Z ← Optimal Undercover(X′, 1)
6:     if |X′ − (Y ∘ Z)|₁ < |X′ − (A_{:,p} ∘ B_{p,:})|₁ then
7:       A_{:,p} ← Y
8:       B_{p,:} ← Z
9:       modif ← true
10:    end if
11:  end for
12: until modif = false
13: return A, B

4.1 Algorithm to Find a Block-Optimal Undercover

To find a block-optimal k-undercover of a matrix X_{m×n}, our method consists in incrementally improving an initial solution until reaching a valid solution. Let A_{m×k} and B_{k×n} be two matrices such that (A∘B) ⊑ X. Trivial values for A_{m×k} and B_{k×n} are A = {0}_{m×k} and B = {0}_{k×n}. Then, if A and B are not a block-optimal k-undercover of X, there exists a block A_{:,p} ∘ B_{p,:} that is not an optimal 1-undercover of X − ⋃_{ℓ≠p}(A_{:,ℓ} ∘ B_{ℓ,:}). In that case, we can replace A_{:,p} and B_{p,:} to obtain an optimal 1-undercover, and iterate until we reach a block-optimal k-undercover of X. The pseudocode of our method is given in Algorithm 3.

Theorem 3. Algorithm 3 is correct.

Proof. When the algorithm stops, the variable modif is false, i.e., for each p ∈ [1, k], (A_{:,p} ∘ B_{p,:}) is an optimal 1-undercover of X − ⋃_{ℓ≠p}(A_{:,ℓ} ∘ B_{ℓ,:}). Since this property corresponds to Definition 5, we know that when we leave the repeat-until loop, the two matrices A and B are a block-optimal k-undercover of the matrix X. Let us now show that the algorithm terminates after a finite number of iterations.
Note that the number of 1s in X not covered by A∘B is |X − ⋃_{ℓ=1}^{k}(A_{:,ℓ} ∘ B_{ℓ,:})|₁, which can also be rewritten as:

|(X − ⋃_{ℓ≠p} (A_{:,ℓ} ∘ B_{ℓ,:})) − (A_{:,p} ∘ B_{p,:})|₁

By replacing X − ⋃_{ℓ≠p}(A_{:,ℓ} ∘ B_{ℓ,:}) by X′ as in Algorithm 3, this quantity becomes |X′ − (A_{:,p} ∘ B_{p,:})|₁. Thus, when |X′ − (Y ∘ Z)|₁ < |X′ − (A_{:,p} ∘ B_{p,:})|₁, replacing A_{:,p} by Y and B_{p,:} by Z strictly decreases the number of 1s in X not covered by A∘B. Since at least one replacement is necessary to stay in the repeat-until loop, the number of 1s of X not covered by A∘B strictly decreases between iterations, guaranteeing the termination of the algorithm. □

Although any block-optimal k-undercover is a 1/2-approximation for the k-undercover problem, we show that OptiBlock(X_{m×n}, {0}_{m×k}, {0}_{k×n}) is a (1 − 1/e)-approximation.

Theorem 4. OptiBlock(X_{m×n}, {0}_{m×k}, {0}_{k×n}) is a (1 − 1/e) ≈ 0.632 approximation for the k-undercover Boolean matrix factorization problem.

Proof. Since each block covers a subset of the 1-entries of X and the problem is to cover as many entries as possible with k blocks among the admissible blocks, the problem can be seen as an instance of the Maximum Coverage problem. It is established that the greedy cover algorithm, which selects at each iteration a block covering the maximum number of 1-entries not already covered, is a (1 − 1/e) ≈ 0.632 approximation (Hochbaum 1996). When A = {0}_{m×k} and B = {0}_{k×n}, the algorithm OptiBlock(X, A, B) behaves for the first k iterations like the greedy cover algorithm, and after the k-th iteration the number of covered entries can only increase. It follows that OptiBlock(X_{m×n}, {0}_{m×k}, {0}_{k×n}) is a (1 − 1/e) ≈ 0.632 approximation for the k-undercover Boolean matrix factorization problem. □

4.2 Using a Good Initial Undercover

We have seen that Algorithm 3 starts from an initial solution A_{m×k} and B_{k×n} such that A∘B ⊑ X.
Although a trivial solution consists in using A = {0}_{m×k} and B = {0}_{k×n}, we propose in this section an efficient algorithm to start from a better solution. Note that although this algorithm is used to initialize a solution for OptiBlock, it can also be used independently, as we will see in the experimentation section. Our approach involves finding an optimal 1-undercover A_{:,p} and B_{p,:} under the constraint that a particular entry X_{i,j} = 1 has to be covered by A_{:,p} ∘ B_{p,:}. If X_{i,j} = 1 is covered by A_{:,p} ∘ B_{p,:}, we know that for each i′ such that X_{i′,j} = 0 we have A_{i′,p} = 0, and for each j′ such that X_{i,j′} = 0 we have B_{p,j′} = 0. Therefore, it is not necessary to consider the entire matrix X: it is sufficient to keep only the rows in {i′ | X_{i′,j} = 1} and the columns in {j′ | X_{i,j′} = 1}. In practice, we choose as the entry to be covered an X_{i,j} = 1 such that |X_{i,:}|₁ · |X_{:,j}|₁ is maximal. The pseudocode of our method is given in Algorithm 4.

4.3 Experimentation

Our algorithms have been implemented¹ in C++. In this section, we evaluate our methods FastUndercover (Algorithm 4), OptiBlock (Algorithm 3) with A_{m×k} and B_{k×n} initialized to {0}_{m×k}, {0}_{k×n}, and OptiBlock*, which corresponds to first running FastUndercover and then using the solution found as the initial values of A_{m×k} and B_{k×n} for OptiBlock. We have performed experiments on 25 datasets from UCI (Dua and Graff 2017), namely: Audiology, Autism Screening Adult, Balance Scale, Breast Cancer, Car Evaluation, Chess (King-Rook vs. King), Congressional Voting Records, Contraceptive Method Choice, Dermatology, Hepatitis, Iris, Lung Cancer, Lymphography,

¹See https://github.com/FlorentAvellaneda/UndercoverBMF

Algorithm 4: FastUndercover
Input: A matrix X_{m×n} and an integer k.
Output: Two matrices A_{m×k}, B_{k×n} such that (A∘B) ⊑ X.
1: A ← {0}_{m×k}
2: B ← {0}_{k×n}
3: for p = 1 to k do
4:   (i, j) ← argmax_{(i,j)} ((Σ_{j′=1}^{n} X_{i,j′}) · (Σ_{i′=1}^{m} X_{i′,j}))
5:   I ← {i′ | X_{i′,j} = 1}
6:   J ← {j′ | X_{i,j′} = 1}
7:   A_{I,{p}}, B_{{p},J} ← Optimal Undercover(X_{I,J}, 1)
8:   X ← X − (A_{:,p} ∘ B_{p,:})
9: end for
10: return A, B

Mushroom, Nursery, Primary Tumor, Solar Flare, Soybean (Large), Statlog (Heart), Student Performance, Thoracic Surgery, Tic-Tac-Toe Endgame, Website Phishing, Wine and Zoo. All datasets have been binarized with a one-hot encoding when there were more than two categories in a column, and left binary when there are only two categories. The binarized datasets used are also available in the appendix. For each dataset, we compute the factorization for three rank values: kmax/4, 2·kmax/4 and 3·kmax/4, where kmax is a trivial rank for the dataset to factorize (generally the minimum between the number of rows and the number of columns of the dataset).

Figure 2 shows the time required to execute OptiBlock (solid lines) and OptiBlock* (dashed lines), according to the MaxSAT solver used. The execution time of Optimal Undercover (Algorithm 2) is not shown because the sizes of the datasets and the values of k are too large to expect a solution in a reasonable time. Note however that for smaller datasets such as Balance, Iris, Vote or Zoo, the algorithm Optimal Undercover can find the optimal solution for small values of k. For example, Optimal Undercover can find an optimal 7-undercover for the dataset Iris in about 100 minutes, while CG is not able to prove the optimality of the solution after 5 hours.

Figure 2: Number of instances solved within a given time with different MaxSAT solvers (EvalMaxSAT internal, EvalMaxSAT, MaxHS, Pacose, UWrMaxSat, Open-WBO-RES) on many real-world datasets. The solid lines correspond to OptiBlock and the dashed lines correspond to OptiBlock*.
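The block-improvement loop of Algorithm 3 can be sketched in miniature. This is a toy sketch, not the paper's C++ implementation: a brute-force search over row supports stands in for the MaxSAT call Optimal Undercover(X′, 1), which is only feasible for tiny matrices, and all names are ours:

```python
def optimal_rank1(X):
    """Brute-force optimal 1-undercover of X (entries 0, 1 or None for null):
    maximize covered 1s over all row supports a; columns chosen greedily."""
    m, n = len(X), len(X[0])
    best = (-1, None, None)
    for bits in range(2 ** m):
        a = [(bits >> i) & 1 for i in range(m)]
        b, cov = [0] * n, 0
        for j in range(n):
            rows = [i for i in range(m) if a[i]]
            # column j may be selected only if the block covers no 0-entry
            if rows and all(X[i][j] != 0 for i in rows):
                c = sum(X[i][j] == 1 for i in rows)
                if c > 0:
                    b[j], cov = 1, cov + c
        if cov > best[0]:
            best = (cov, a, b)
    return best  # (#covered 1s, column of A, row of B)

def opti_block(X, k):
    """Alternating block improvement in the spirit of Algorithm 3 (OptiBlock),
    starting from k empty blocks."""
    m, n = len(X), len(X[0])
    blocks = [([0] * m, [0] * n) for _ in range(k)]
    changed = True
    while changed:
        changed = False
        for p in range(k):
            # X' = X minus the entries covered by the other blocks (set to null)
            Xp = [[None if any(blocks[q][0][i] and blocks[q][1][j]
                               for q in range(k) if q != p) else X[i][j]
                   for j in range(n)] for i in range(m)]
            cov, a, b = optimal_rank1(Xp)
            old = sum(Xp[i][j] == 1 for i in range(m) for j in range(n)
                      if blocks[p][0][i] and blocks[p][1][j])
            if cov > old:        # strictly better block: replace it
                blocks[p], changed = (a, b), True
    return blocks
```

On a small matrix made of two overlapping rectangles of 1s, two rounds of block improvement suffice to cover every 1-entry without ever covering a 0-entry.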
We can see that running FastUndercover first to get an initial solution, and then using this initial solution as a starting point for OptiBlock, allows us to gain a few orders of magnitude in execution time. Since the solvers EvalMaxSAT and Pacose seem slightly more efficient on this problem, we have chosen one of them (EvalMaxSAT) and integrated it into our tool, in order to insert the constraints generated by Generate Cardinalities (Algorithm 1) directly in the form of cardinality constraints instead of as sets of clauses. This saves even more computation time, and we will only report the results of this solver for the rest of this section. In Figure 2, this integrated version appears as "EvalMaxSAT internal". We compare our method to two types of tools: tools that aim to solve the same problem as us, namely to determine a k-undercover, and tools for the general Boolean matrix factorization problem. We considered the following k-undercover tools:

- IterEss (Belohlavek, Outrata, and Trnecka 2019), which is an extension of GreEss (Belohlavek and Trnecka 2015).
- Tile (Geerts, Goethals, and Mielikäinen 2004), one of the first effective methods to solve this problem.
- GreConD+ (Belohlavek and Trnecka 2018), which is an extension of the popular GreConD method (Belohlavek and Vychodil 2010) and can be configured to compute a k-undercover (rBMF package).

And the following BMF tools:

- MP (Ravanbakhsh, Póczos, and Greiner 2016), a message-passing technique to approximate a BMF.
- k-greedy (Kovacs, Günlük, and Hauser 2021), a heuristic used before calling the CG method.
- CG (Kovacs, Günlük, and Hauser 2021), an integer programming method using CPLEX to find solutions. CG is an exact method that improves a solution until finding an optimal one; a time budget must be given to obtain a solution in a reasonable time. In our evaluation, we set the time budget to twice the time used by OptiBlock*.
- Asso (Miettinen et al.
2008), which is probably one of the most popular tools for BMF (rBMF package).
- MEBF (Wan et al. 2020), which uses matrix permutations in order to approximate an upper-triangular-like matrix.

We have run each of these tools on the 25 datasets with the three k values discussed above. A detailed table of the results obtained can be found in the appendix. Table 1 summarizes these results through two values: the number of times each tool found a better solution than the other tools (# TOP) and the total time used by each tool to execute all instances. First of all, we observe that the time used varies a lot from one method to another. We distinguish three groups of methods:

- Fast methods, using less than 5 min: FastUndercover, IterEss and MEBF.
- Medium-fast methods, using between 5 and 100 min: OptiBlock*, k-greedy and Asso.
- Slow methods, using more than 100 min: OptiBlock, Tile, GreConD+, MP and CG.

| Method | # TOP | Time (min) |
|---|---|---|
| FastUndercover | 8 | 3.5 |
| OptiBlock | 29 | 544 |
| OptiBlock* | 45 | 98 |
| IterEss | 6 | 0.4 |
| Tile | 7 | 220 |
| GreConD+ | 9 | 107 |
| MP | 14 | 372 |
| k-greedy | 5 | 59 |
| CG | 19 | 469 |
| Asso | 0 | 96 |
| MEBF | 0 | 4.0 |

Table 1: Benchmark on 25 real-world datasets with three values of k (75 cases).

In the fast group, our algorithm FastUndercover finds better quality factorizations; however, IterEss is significantly faster. In the medium-fast group, our OptiBlock* method obtains notably higher quality factorizations, even when compared to the slower methods. Moreover, although OptiBlock* is constrained to find an undercover, it still finds better quality factorizations than methods without this constraint, such as MP or CG.

5 Conclusion

We have presented two exact methods to infer optimal and block-optimal k-undercover matrix factorizations. Although these two problems are known to be NP-hard, we proposed efficient methods to solve them. Our first contribution is an efficient MaxSAT formulation for finding optimal k-undercover matrix factorizations.
The main idea is to use knowledge of the problem to find unsatisfiable cores, in order to replace many soft clauses by a few cardinality constraints. We have shown that this method gains several orders of magnitude in computation time. Our second contribution addresses the scalability issue. We propose a weaker definition of optimality, called block-optimality, and show that every block-optimal solution is a 1/2-approximation of an optimal solution. To find such factorizations efficiently, we propose the OptiBlock algorithm, which initializes a solution with our FastUndercover heuristic and then improves this solution until it is block-optimal. The experimental results show that OptiBlock produces much better reconstruction errors than the other tools in the literature, within reasonable computation times.

Dataset           k   FastUndercover    OptiBlock*        IterEss           Tile              GreConD+
                      Errors   Time     Errors   Time     Errors   Time     Errors   Time     Errors   Time
                      (number) (sec.)   (number) (sec.)   (number) (sec.)   (number) (sec.)   (number) (sec.)
Car               6     8006    0.2       7676   14.2       8006    0.1       8006    0.0       8006    0.3
                 12     4838    0.3       4608    6.5       4838    0.1       4838    0.0       4838    0.6
                 18     2246    0.4       2060    3.0       2246    0.1       2246    0.0       2246    0.9
Chess            10    11335    1.6      10454   14.9      11332    0.1      10435    0.2      10435    1.9
                 20     4220    1.8       3023    9.0       3052    0.1       3809    0.3       3809    4.0
                 30     1346    1.9        453   10.4        453    0.1       1308    0.3       1353    6.5
Cmc              18     3194    0.5       2994    7.1       3114    0.1       3234    0.1       3234    4.3
                 36     1314    0.6       1104    8.3       1196    0.1       1200    0.1       1200    8.9
                 54      391    0.6        279   12.6        346    0.1        347    0.1        347   14.1
Flare            11     2027    0.3       1953    1.8       1669    0.1       1699    0.1       1684    1.3
                 22      284    0.4        241    2.3        429    0.1        354    0.2        346    2.2
                 33       33    0.4         15    8.0         95    0.1         81    0.2         82    3.6
Heart            68      760    0.4        728   17.4        799    0.6        795    0.0        795  107.6
                136      419    0.5        401   47.1        462    0.6        441    0.1        444  206.9
                204      222    0.6        208  186.6        248    0.6        231    0.1        234  306.2
Iris             32      288    0.1        282    1.2        284    0.0        287    0.0        282    3.2
                 64      122    0.1        117    2.9        119    0.0        121    0.0        117    6.8
                 96       37    0.1         32   10.1         37    0.0         36    0.0         34   10.9
Lymph            14      766    0.1        754    0.5        761    0.0        755    0.1        753    0.4
                 28      275    0.1        229    1.1        283    0.0        292    0.2        288    0.8
                 42       73    0.1         33    1.2         83    0.0        103    0.2         94    1.2
Mushroom         28    39560   48.5      31226   1483      33852    2.4      33208    119      33844  176.5
                 56    12264   53.9       7596    460       9732    2.5      10176    143       9364  337.4
                 84     2194   50.7        436    587        708    2.6       1554    141       1280  471.7
Nursery           8    73440    5.6      68004   1723      68004    0.7      68004    0.6      69120    4.2
                 16    35970    9.0      35970   78.0      36216    0.7      36216    0.8      36684    9.3
                 24    10698   10.0      10698   53.2      11028    0.7      11028    0.9      12708   15.4
Phishing          7     6029    0.3       6029    1.6       6029    0.1       6026    0.1       6029    0.4
                 14     3043    0.4       2999    5.0       3043    0.1       2937    0.2       3027    0.8
                 21     1253    0.5        806    3.7       1199    0.1       1172    0.2       1218    1.1
Student          44     3408    0.5       3213   18.3       3413    0.4       3400    6.9       3400   21.6
                 88     1230    0.6        967   19.8       1223    0.4       1234    7.9       1227   42.9
                132      227    0.7        125   33.1        271    0.4        305    7.8        298   69.0
Thoracic Surgery 85      662    0.6        626   36.1        647    0.7        647    0.0        652  174.8
                170      233    0.9        219    100        229    0.7        227    0.0        230  341.3
                255       54    1.0         51    201         52    0.7         64    0.1         58  532.5
Tictactoe         7     6068    0.2       6062    7.0       6062    0.0       6062    0.0       6068    0.2
                 14     3588    0.3       3588    1.4       3588    0.1       3588    0.0       3588    0.5
                 21     1518    0.3       1518    1.4       1518    0.1       1518    0.0       1518    0.8
Wine             45     1766    0.7       1766   12.0       1862    0.6       1766    0.1       1766  475.5
                 90     1181    0.9       1181   27.9       1232    0.6       1181    0.1       1181  895.7
                135      596    1.0        576   97.6        602    0.6        596    0.2        596 1404.7
Zoo               7      192    0.0        173    0.1        176    0.0        171    0.0        178    0.0
                 14       86    0.0         48    0.1         52    0.0         66    0.0         63    0.1
                 21       17    0.0          6    0.1         11    0.0         18    0.0         17    0.2

Table 2: Subset of the results obtained for the undercover Boolean matrix factorization problem. All the benchmark results are available at https://github.com/FlorentAvellaneda/UndercoverBMF.

References

Aloul, F. A.; Ramani, A.; Markov, I. L.; and Sakallah, K. A. 2002. Solving difficult SAT instances in the presence of symmetry. In Proceedings of the 39th Annual Design Automation Conference, 731-736. ACM.
Aloul, F. A.; Sakallah, K. A.; and Markov, I. L. 2006. Efficient symmetry breaking for Boolean satisfiability. IEEE Transactions on Computers, 55(5): 549-558.
Andres, B.; Kaufmann, B.; Matheis, O.; and Schaub, T. 2012. Unsatisfiability-based optimization in clasp. In Technical Communications of the 28th International Conference on Logic Programming (ICLP'12). Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik.
Avellaneda, F. 2020. A short description of the solver EvalMaxSAT. MaxSAT Evaluation 2020, 8.
Bacchus, F. 2021. MaxHS in the 2021 MaxSAT Evaluation. MaxSAT Evaluation 2021, 14.
Bacchus, F.; Berg, J.; Järvisalo, M.; and Martins, R. 2021. MaxSAT Evaluation 2021: Solver and Benchmark Descriptions. SAT, 2021.
Bacchus, F.; Järvisalo, M.; and Martins, R. 2021. Maximum Satisfiability. In Biere, A.; Heule, M. J. H.; van Maaren, H.; and Walsh, T., eds., Handbook of Satisfiability, volume 336 of Frontiers in Artificial Intelligence and Applications, chapter 24, 929-991. IOS Press.
Belohlavek, R.; Outrata, J.; and Trnecka, M. 2019. Factorizing Boolean matrices using formal concepts and iterative usage of essential entries. Information Sciences, 489: 37-49.
Belohlavek, R.; and Trnecka, M. 2015. From-below approximations in Boolean matrix factorization: Geometry and new algorithm. Journal of Computer and System Sciences, 81(8): 1678-1697.
Belohlavek, R.; and Trnecka, M. 2018. A new algorithm for Boolean matrix factorization which admits overcovering. Discrete Applied Mathematics, 249: 36-52.
Belohlavek, R.; and Vychodil, V. 2010. Discovery of optimal factors in binary data via a novel method of matrix decomposition. Journal of Computer and System Sciences, 76(1): 3-20.
Brown, C. A.; Finkelstein, L.; and Purdom Jr, P. W. 1988. Backtrack searching in the presence of symmetry. In International Conference on Applied Algebra, Algebraic Algorithms, and Error-Correcting Codes, 99-110. Springer.
Devriendt, J. 2021. Exact: evaluating a pseudo-Boolean solver on MaxSAT problems. MaxSAT Evaluation 2021, 12-13.
Dua, D.; and Graff, C. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.
Geerts, F.; Goethals, B.; and Mielikäinen, T. 2004. Tiling databases. In International Conference on Discovery Science, 278-289. Springer.
Haddad, A.; Shamsi, F.; Zhu, L.; and Najafizadeh, L. 2018. Identifying Dynamics of Brain Function Via Boolean Matrix Factorization. In 2018 52nd Asilomar Conference on Signals, Systems, and Computers, 661-665. IEEE.
Hochbaum, D. S. 1996. Approximating Covering and Packing Problems: Set Cover, Vertex Cover, Independent Set, and Related Problems, 94-143. PWS Publishing Co.
Ignatiev, A.; Morgado, A.; and Marques-Silva, J. 2019. RC2: An efficient MaxSAT solver. Journal on Satisfiability, Boolean Modeling and Computation, 11(1): 53-64.
Kocayusufoglu, F.; Hoang, M. X.; and Singh, A. K. 2018. Summarizing Network Processes with Network-Constrained Boolean Matrix Factorization. In IEEE International Conference on Data Mining (ICDM), 237-246.
Kovacs, R. A.; Günlük, O.; and Hauser, R. A. 2021. Binary Matrix Factorisation via Column Generation. In AAAI Conference on Artificial Intelligence, AAAI 2021, 3823-3831. AAAI Press.
Lei, Z.; Cai, S.; Wang, D.; Peng, Y.; Geng, F.; Wan, D.; Deng, Y.; and Lu, P. 2021. CASHWMaxSAT: Solver Description. MaxSAT Evaluation 2021, 8-9.
Lucchese, C.; Orlando, S.; and Perego, R. 2013. A Unifying Framework for Mining Approximate Top-k Binary Patterns. IEEE Transactions on Knowledge and Data Engineering, 26(12): 2900-2913.
Martins, R.; Manquinho, V.; and Lynce, I. 2014. Open-WBO: A modular MaxSAT solver. In International Conference on Theory and Applications of Satisfiability Testing, 438-445. Springer.
Martins, R.; Manthey, N.; Terra-Neves, M.; Manquinho, V.; and Lynce, I. 2021. Open-WBO MaxSAT Evaluation 2021. MaxSAT Evaluation 2021, 15-16.
Miettinen, P.; Mielikäinen, T.; Gionis, A.; Das, G.; and Mannila, H. 2008. The discrete basis problem. IEEE Transactions on Knowledge and Data Engineering, 20(10): 1348-1362.
Miettinen, P.; and Neumann, S. 2020. Recent Developments in Boolean Matrix Factorization. In Bessiere, C., ed., Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, 4922-4928. ijcai.org.
Morgado, A.; Dodaro, C.; and Marques-Silva, J. 2014. Core-guided MaxSAT with soft cardinality constraints. In International Conference on Principles and Practice of Constraint Programming, 564-573. Springer.
Morgado, A.; Ignatiev, A.; and Marques-Silva, J. 2014. MSCG: Robust core-guided MaxSAT solving. Journal on Satisfiability, Boolean Modeling and Computation, 9(1): 129-134.
Paxian, T.; and Becker, B. 2021. Pacose: An Iterative SAT-based MaxSAT Solver. MaxSAT Evaluation 2021, 28.
Piotrów, M. 2020. UWrMaxSat: Efficient Solver for MaxSAT and Pseudo-Boolean Problems. In 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), 132-136. Los Alamitos, CA, USA: IEEE Computer Society.
Piotrów, M. 2021. UWrMaxSat in MaxSAT Evaluation 2021. MaxSAT Evaluation 2021, 17-18.
Ravanbakhsh, S.; Póczos, B.; and Greiner, R. 2016. Boolean matrix factorization and noisy completion via message passing. In International Conference on Machine Learning, 945-954. PMLR.
Vaidya, J.; Atluri, V.; and Guo, Q. 2007. The role mining problem: finding a minimal descriptive set of roles. In Proceedings of the 12th ACM Symposium on Access Control Models and Technologies, 175-184.
Wan, C.; Chang, W.; Zhao, T.; Li, M.; Cao, S.; and Zhang, C. 2020. Fast and efficient Boolean matrix factorization by geometric segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, 6086-6093. AAAI Press.
Wicker, J.; Pfahringer, B.; and Kramer, S. 2012. Multi-label classification using Boolean matrix decomposition. In Proceedings of the 27th Annual ACM Symposium on Applied Computing, 179-186.
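To make the quantities reported in the experiments concrete, the following is a minimal Python sketch of the definitions used throughout: the Boolean matrix product (logical AND for products, logical OR for sums), the undercover property (no 0-entry of X may be covered), and the reconstruction-error count. The helper names and the toy matrices are illustrative; this is not the paper's implementation.

```python
def bool_product(A, B):
    """Boolean product A.B: entry (i,j) is OR over t of (A[i][t] AND B[t][j])."""
    m, k, n = len(A), len(B), len(B[0])
    return [[int(any(A[i][t] and B[t][j] for t in range(k)))
             for j in range(n)] for i in range(m)]

def is_undercover(X, A, B):
    """A.B is an undercover of X iff every entry of A.B is <= the entry of X,
    i.e., no 0-entry of X is erroneously covered."""
    P = bool_product(A, B)
    return all(P[i][j] <= X[i][j]
               for i in range(len(X)) for j in range(len(X[0])))

def reconstruction_errors(X, A, B):
    """Number of entries where A.B differs from X; under the undercover
    constraint these are exactly the uncovered 1s of X."""
    P = bool_product(A, B)
    return sum(P[i][j] != X[i][j]
               for i in range(len(X)) for j in range(len(X[0])))

# Toy 3x3 matrix factorized with k = 2 (illustrative data, not from the tables).
X = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
A = [[1, 0],
     [1, 1],
     [0, 1]]
B = [[1, 1, 0],
     [0, 1, 1]]
```

With these A and B, bool_product(A, B) equals X, so the pair is an exact 2-undercover with zero errors; zeroing B[1][2] would still leave an undercover, but with two uncovered 1s counted as errors.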