The Computational Complexity of Concise Hypersphere Classification

Eduard Eiben 1, Robert Ganian 2, Iyad Kanj 3, Sebastian Ordyniak 4, Stefan Szeider 2

1 Royal Holloway, University of London, UK. 2 Algorithms and Complexity Group, TU Wien, Austria. 3 DePaul University, USA. 4 University of Leeds, UK. Correspondence to: Eduard Eiben, Robert Ganian, Iyad Kanj, Sebastian Ordyniak, Stefan Szeider.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Hypersphere classification is a classical and foundational method that can provide easy-to-process explanations for the classification of real-valued as well as binary data. However, obtaining an (ideally concise) explanation via hypersphere classification is much more difficult when dealing with binary data than with real-valued data. In this paper, we perform the first complexity-theoretic study of the hypersphere classification problem for binary data. We use the fine-grained parameterized complexity paradigm to analyze the impact of structural properties that may be present in the input data as well as of potential conciseness constraints. Our results include stronger lower bounds and new fixed-parameter algorithms for hypersphere classification of binary data, which can find an exact and concise explanation when one exists.

1. Introduction

With the rapid advancement of Machine Learning (ML) models used to automate decisions, there has been increasing interest in explainable Artificial Intelligence (XAI), where the ML models can explain their decisions in a way humans understand. This has led to the reexamination of ML models that are implicitly easy to explain and interpret, with a particular focus on the conciseness of explanations (Doshi-Velez & Kim, 2017; Lipton, 2018; Monroe, 2018; Ribeiro et al., 2018; Shih et al., 2018; Barceló et al., 2020; Chalasani et al., 2020; Darwiche & Hirth, 2020; Blanc et al., 2021; Ignatiev et al., 2021; Wäldchen et al., 2021; Izza et al., 2022).

In this article, we consider a simple classification task, one of the cornerstones of machine learning, from the viewpoint of XAI. However, unlike previous works on explainability, which have typically targeted questions such as identifying a suitable interpretable model for (area-specific) classification (Nori et al., 2021; Shih et al., 2021; Wang et al., 2021) or measuring the accuracy cost of explainability (Laber & Murtinho, 2021; Makarychev & Shan, 2021), the goal of this work is to obtain a comprehensive understanding of the computational complexity of performing binary classification via one of the most fundamental interpretable models.

Consider a set M of either real-valued or binary training feature points, each represented as a d-dimensional feature vector over [0, 1] or {0, 1} and labeled as either blue (the set V_B) or red (the set V_R). There is, by now, a broad set of more or less opaque classifiers capable of using such a training set to classify unlabeled data, where the suitability of each method depends on the data domain and context; moreover, some classifiers are tailored to real-valued data, while others are designed for binary (or, more generally, categorical) data. In this paper, we consider one of the two arguably simplest, and hence easiest to explain and visualize, types of classifiers, which can be used in both data settings: a hypersphere.
More formally, the explanations we consider consist of a cluster center c (an element of [0, 1]^d or {0, 1}^d) and a distance ℓ such that each feature vector is at distance at most ℓ from c if and only if it is blue.

The reason for studying the complexity of hypersphere classification does not stem purely from the problem's connection to explainability. Together with classification by a separating hyperplane, hypersphere classification represents one of the most classic explanatory examples of classifiers (see, e.g., Cooper, 1962; Wang et al., 2007; 2005), which have been extensively studied from both the computational geometry and the machine learning perspectives (Astorino et al., 2016; Astorino & Gaudioso, 2009; Cooper, 1962; Wang et al., 2007; 2005; O'Rourke et al., 1986; Agarwal et al., 2006; Hurtado et al., 2003). Moreover, hypersphere classification is of special importance in one-class classification due to the inherent asymmetry of the provided explanations (Kim et al., 2021). While hyperplane separation can be encoded as a linear program and hence is easily polynomial-time solvable for real-valued and binary data, the computational complexity of hypersphere classification is far less obvious and has so far remained surprisingly unexplored. This apparent gap contrasts with the situation for several other computational problems arising in the area of machine learning, which have already been targeted by detailed complexity-theoretic studies (Ordyniak & Szeider, 2013; Ganian et al., 2018; Simonov et al., 2019; Dahiya et al., 2021; Ganian & Korchemna, 2021; Ganian et al., 2022; Grüttemeier & Komusiewicz, 2022) carried out using the classical as well as the parameterized complexity paradigms. In this article, we close this gap by laying bare a detailed map of the problem's computational complexity via the design of novel theoretical algorithms as well as accompanying computational lower bounds.

Contributions. We begin by observing that the hypersphere classification problem is polynomial-time solvable when the input data is real-valued; in particular, this case can be handled via a more sophisticated linear programming encoding than the one used for the classical hyperplane separation problem. However, this approach completely fails when dealing with binary data, warranting a more careful complexity-theoretic study of this case. Our first result shows that hypersphere classification of binary data is not only NP-hard in general, but remains NP-hard even when there are only two red vectors. We also obtain an analogous hardness result for instances with two blue vectors.

The fact that the problem's complexity differs between real-valued and categorical data is already interesting. However, the NP-hardness of the latter case does not preclude the existence of efficient algorithms that can solve the problem under additional natural restrictions. Indeed, one of the central themes in modern complexity-theoretic research is the identification of the exact boundaries of tractability. This is frequently achieved through the lens of the parameterized complexity paradigm (Downey & Fellows, 2013; Cygan et al., 2015), where we associate each problem instance I with an integer parameter k (often capturing a certain structural property of the instance) and ask whether the problem of interest can be solved by a fixed-parameter algorithm, that is, by an algorithm with runtime of the form f(k) · |I|^{O(1)} for some computable function f.
This gives rise to a strong form of computational tractability called fixed-parameter tractability (FPT). In the case of our problem of interest, it is easy to observe that hypersphere classification of binary data is fixed-parameter tractable when parameterized by the data dimension d (since the number of possible centers is upper-bounded by 2^d). Moreover, we show that the problem also admits a fixed-parameter algorithm when parameterized by the total number of feature points, via a combinatorial reduction to a known tractable fragment of Integer Linear Programming. While these are important pieces of the complexity-theoretic landscape of hypersphere classification, these two initial results are somewhat unsatisfying on their own because (1) they rely on highly restrictive parameterizations, and (2) they ignore a central aspect of explainability, namely conciseness or succinctness (Ribeiro et al., 2018; Shih et al., 2018; Blanc et al., 2021; Wäldchen et al., 2021; Chalasani et al., 2020; Izza et al., 2022; Ordyniak et al., 2023).

A natural measure of conciseness in our setting is the number of 1-coordinates in a vector; indeed, any explanation produced by a classifier will likely end up ignored by users if it is incomprehensibly long, relying on too many features. At the same time, depending on the source of the input data, we may often deal with feature vectors that are already concise. Having concise feature vectors does not necessarily guarantee the existence of a concise center (and vice versa, concise centers may exist for non-concise data); however, at least one of these two independent measures of conciseness can be expected (or even required) to be small in a variety of settings, making them natural choices for parameters in our analysis. In the second part of the article, we show that these two conciseness parameters, a bound econ on the conciseness of the sought-after explanation and a bound dcon on the conciseness of all feature vectors in the training data, can be algorithmically exploited to cope with the NP-hardness of the hypersphere classification problem for binary data.

Toward understanding the complexity of hypersphere classification of binary data through the perspective of conciseness constraints, we begin by considering restrictions on the data conciseness dcon. We obtain a tight classification by showing that the problem is polynomial-time tractable when dcon ≤ 3, via a reduction to a tractable fragment of the constraint satisfaction problem, and NP-hard otherwise. Moreover, we obtain fixed-parameter algorithms parameterized by dcon plus the number of red or blue points, circumventing the earlier NP-hardness results. When considering the explanation conciseness, we show that hypersphere classification is XP-tractable when parameterized by econ, and at the same time provide evidence excluding fixed-parameter tractability even when parameterized by econ together with the number of red or blue points. Finally, we obtain a linear-time fixed-parameter algorithm for the problem parameterized by econ + dcon. While this settles the complexity of binary-data hypersphere classification from the perspective of conciseness measures, the obtained lower bounds imply that neither measure of conciseness (i.e., neither econ nor dcon) suffices to achieve fixed-parameter tractability on its own. As our final contribution, we consider whether achieving tractability for the problem is possible by exploiting a suitable structural measure of the input data.
In particular, following recent successes in closely related areas such as clustering and data completion (Ganian et al., 2022; 2018), we consider the incidence treewidth tw of the data representation. Using a non-trivial dynamic programming procedure, we obtain a fixed-parameter algorithm for binary-data hypersphere classification parameterized by either tw + econ or tw + dcon; in other words, each of the two notions of conciseness suffices for tractability of hypersphere classification on data that is well-structured, in the sense of having small incidence treewidth. Moreover, as a byproduct of our algorithm, we also obtain the XP-tractability of binary-data hypersphere classification parameterized by tw alone. A summary of our results is provided in Table 1.

Structure    | (no conciseness bound) | econ                         | dcon            | econ + dcon
-------------|------------------------|------------------------------|-----------------|-------------
(none)       | NP-h (Thm 3)           | XP (Obs 10), W[2]-h (Thm 11) | NP-h ≥4 (Thm 7) | FPT (Thm 13)
|V_R|        | NP-h ≥2 (Thm 3)        | XP (Obs 10), W[2]-h (Thm 11) | FPT (Thm 9)     | FPT (Thm 9)
|V_B|        | NP-h ≥2 (Thm 3)        | XP (Obs 10), W[1]-h (Thm 12) | FPT (Thm 9)     | FPT (Thm 9)
|V_R ∪ V_B|  | FPT (Thm 5)            | FPT (Thm 5)                  | FPT (Thm 5)     | FPT (Thm 5)
d            | FPT (trivial)          | FPT (trivial)                | FPT (trivial)   | FPT (trivial)
tw           | XP (Cor 15)            | FPT (Cor 16)                 | FPT (Cor 17)    | FPT (Cor 16)

Table 1. The complexity landscape of hypersphere classification with respect to combinations of structural and conciseness parameters: V_R and V_B are the sets of red and blue points, respectively; d is the dimension; tw is the incidence treewidth of the data representation; econ is the conciseness of the explanation; and dcon is the data conciseness. NP-h ≥i means that the problem becomes NP-hard already for parameter values of at least i, while W[j]-h means that the problem is hard for the complexity class W[j] and hence is unlikely to be fixed-parameter tractable (Downey & Fellows, 2013).

Related Work. While there is, to the best of our knowledge, no prior work targeting the complexity of hypersphere classification of binary data, there is significant work on the real-valued variant of the problem by the machine learning community (Cooper, 1962; Wang et al., 2007; 2005; Astorino & Gaudioso, 2009; Astorino et al., 2016), which studies the optimization version of the problem in which one seeks the smallest bounding sphere that separates the blue points from the red ones. Our results extend to the optimization version of the problem for binary data, as mentioned in Section 7. Many of the above works consider relaxations of the real-valued optimization problem in which the sought sphere is not of minimum radius (Wang et al., 2007; 2005; Astorino & Gaudioso, 2009; Astorino et al., 2016), allowing for errors or outliers, and reduce the problem to some fragment of quadratic programming. We point out that the problem is also related to that of finding a minimum bounding sphere to distinguish/discriminate a set of objects (i.e., one-class classification) (Tax & Duin, 1999), which is, in turn, inspired by the Support Vector Machine models introduced by Vapnik (1995).

The hypersphere classification of real-valued low-dimensional data has also been studied in the context of point separability within the field of computational geometry. In particular, the separability of two sets of points in R^2 by a circle was studied by O'Rourke, Kosaraju, and Megiddo (1986), who established the linear-time tractability of that case.
They also observed that their result could be lifted to an O(n^d)-time algorithm for the separability of n d-dimensional data points by a hypersphere; however, this is superseded by the algorithm observed in Proposition 1 below, which runs in polynomial time even for unbounded values of d. Several authors also studied related point-separation problems in R^2 and R^3, such as separability of points via polyhedra (Megiddo, 1988), L-shapes (Sheikhi et al., 2015), and a variety of other objects (Agarwal et al., 2006; Alegría et al., 2022).

2. Preliminaries

For ℓ ∈ ℕ, we write [ℓ] for {1, . . . , ℓ}. For convenience, we identify each vector v = (v_1, . . . , v_d) with the point (v_1, . . . , v_d) in d-dimensional space.

Problem Definition and Terminology. For two vectors a, b ∈ {0, 1}^d, we denote by δ(a, b) the Hamming distance between a and b. For a vector v ∈ {0, 1}^d and r ∈ ℕ, we denote by B(v, r) the hypersphere (i.e., ball) centered at v and of radius r, that is, the set of all vectors x ∈ {0, 1}^d satisfying δ(v, x) ≤ r. Similarly, for vectors over [0, 1]^d, we denote by B(v, r) the ball centered at v and of radius r with respect to the Euclidean distance in R^d. The problem under consideration in this paper is defined as follows:

BINARY HYPERSPHERE CLASSIFICATION (BHC)
Input: A set V = V_R ∪ V_B of d-dimensional vectors over the binary domain D = {0, 1}, where V_R ∩ V_B = ∅.
Question: Are there a vector c ∈ D^d and an r ∈ ℕ such that V_B ⊆ B(c, r) and V_R ∩ B(c, r) = ∅?

Throughout the paper, we refer to the vectors in V_R as "red" and to those in V_B as "blue". We also denote by 1(v) the set of all coordinates i such that v[i] = 1, and we write con(v) for the conciseness of v, i.e., con(v) = |1(v)|. Moreover, observe that the Hamming distance δ(v, c) of two vectors v and c can be written as

δ(v, c) = |1(c)| + |1(v)| − 2 · |1(v) ∩ 1(c)|.

Naturally, one may also consider the analogous problem of REAL-VALUED HYPERSPHERE CLASSIFICATION, where the only distinction is that the domain is [0, 1] instead of {0, 1}. As mentioned in the introduction, this problem can be shown to be polynomial-time solvable and hence is not considered further in our complexity-theoretic analysis.

Proposition 1. REAL-VALUED HYPERSPHERE CLASSIFICATION can be solved in polynomial time.
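To make these definitions concrete, the following minimal Python sketch (ours, not part of the paper; all function names are our own) checks a candidate explanation against the BHC definition and verifies the Hamming-distance identity above on all pairs of 3-dimensional binary vectors.

```python
from itertools import product

def hamming(v, c):
    """Hamming distance between two equal-length 0/1 tuples."""
    return sum(a != b for a, b in zip(v, c))

def is_explanation(center, radius, blue, red):
    """The BHC condition: every blue vector lies in B(center, radius)
    and no red vector does."""
    return (all(hamming(b, center) <= radius for b in blue)
            and all(hamming(r, center) > radius for r in red))

# Verify the identity delta(v, c) = |1(c)| + |1(v)| - 2|1(v) ∩ 1(c)|
# on all pairs of 3-dimensional binary vectors.
for v, c in product(product((0, 1), repeat=3), repeat=2):
    ones_v = {i for i, x in enumerate(v) if x}
    ones_c = {i for i, x in enumerate(c) if x}
    assert hamming(v, c) == len(ones_c) + len(ones_v) - 2 * len(ones_v & ones_c)
```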
Parameterized Complexity. In parameterized algorithmics (Flum & Grohe, 2006; Downey & Fellows, 2013; Cygan et al., 2015), the running time of an algorithm is studied with respect to a parameter k ∈ ℕ_0 and the input size n. The basic idea is to find a parameter that describes the structure of the instance such that the combinatorial explosion can be confined to this parameter. In this respect, the most favorable complexity class is FPT (fixed-parameter tractable), which contains all problems that can be decided by an algorithm running in time f(k) · n^{O(1)}, where f is a computable function. Algorithms with this running time are called fixed-parameter algorithms. A less favorable outcome is an XP algorithm, which is an algorithm running in time O(n^{f(k)}); problems admitting such algorithms belong to the class XP. Showing that a parameterized problem is hard for the complexity classes W[1] or W[2] rules out the existence of a fixed-parameter algorithm under well-established complexity-theoretic assumptions. Such hardness results are typically established via a parameterized reduction, which is an analogue of a classical polynomial-time reduction with two notable distinctions: a parameterized reduction may run in time f(k) · n^{O(1)}, and the parameter of the produced instance must be upper-bounded by a function of the parameter of the original instance.

Treewidth. A nice tree-decomposition T of a graph G = (V, E) is a pair (T, χ), where T is a tree (whose vertices are called nodes) rooted at a node t_r, and χ is a function that assigns to each node t a set χ(t) ⊆ V such that the following hold:

- For every uv ∈ E, there is a node t such that u, v ∈ χ(t).
- For every vertex v ∈ V, the set of nodes t satisfying v ∈ χ(t) forms a subtree of T.
- |χ(ℓ)| = 0 for every leaf ℓ of T, and |χ(t_r)| = 0.
- There are only three kinds of non-leaf nodes in T:
  - Introduce node: a node t with exactly one child t' such that χ(t) = χ(t') ∪ {v} for some vertex v ∉ χ(t').
  - Forget node: a node t with exactly one child t' such that χ(t) = χ(t') \ {v} for some vertex v ∈ χ(t').
  - Join node: a node t with two children t_1, t_2 such that χ(t) = χ(t_1) = χ(t_2).

The width of a nice tree-decomposition (T, χ) is the size of a largest set χ(t) minus 1, and the treewidth of the graph G, denoted tw(G), is the minimum width over all nice tree-decompositions of G. We let T_t denote the subtree of T rooted at a node t, and we use χ(T_t) to denote the set ⋃_{t' ∈ V(T_t)} χ(t') and G_t to denote the graph G[χ(T_t)] induced by the vertices in χ(T_t). Efficient fixed-parameter algorithms are known for computing a nice tree-decomposition of near-optimal width:

Proposition 2 (Kloks 1994; Korhonen 2021). There exists an algorithm which, given an n-vertex graph G and an integer k, in time 2^{O(k)} · n either outputs a nice tree-decomposition of G of width at most 2k + 1 and with O(n) nodes, or determines that tw(G) > k.

Constraint Satisfaction Problems. Let D = {0, 1} and let n be an integer. An n-ary relation on D is a subset of D^n. An instance I of a Boolean constraint satisfaction problem (CSP) is a pair (V, C), where V is a finite set of variables and C is a set of constraints. A constraint c ∈ C consists of a scope, denoted by V(c), which is an ordered list of a subset of V, and a relation, denoted by R(c), which is a |V(c)|-ary relation on D; |V(c)| is the arity of c. A solution to a CSP instance I = (V, C) is a mapping τ : V → D such that (τ(v_1), . . . , τ(v_{|V(c)|})) ∈ R(c) for every c ∈ C with V(c) = (v_1, . . . , v_{|V(c)|}). A CSP instance is satisfiable if and only if it has at least one solution.

3. NP-hardness of BHC Restricted to Two Red or Blue Vectors

In this section, we show that BHC remains NP-hard even when one of the two sets (V_B or V_R) has size at most two. In particular, let us denote by 2RED-BHC the restriction of BHC to instances in which the number of red vectors is two (i.e., |V_R| = 2), and by 2BLUE-BHC the restriction of BHC to instances in which the number of blue vectors is two (i.e., |V_B| = 2).

Theorem 3. 2RED-BHC and 2BLUE-BHC are NP-complete.

Proof Sketch. Proving membership in NP is straightforward and is omitted. We begin with the following problem, which is known to be NP-hard (Frances & Litman, 1997):

MINIMUM RADIUS (MR)
Input: A set V of (2n)-dimensional binary vectors, where n ∈ ℕ.
Question: Is there a (2n)-dimensional center vector c such that V ⊆ B(c, n)?
Denote by REST-MR the restriction of MR to instances in which V contains the (2n)-dimensional all-zero vector 0_{2n} and we ask for a center vector c that contains exactly n ones. We first show that REST-MR remains NP-hard via a polynomial-time Turing reduction from MR to REST-MR. At this point, to complete the proof of the theorem, it suffices to exhibit a polynomial-time reduction from REST-MR to 2RED-BHC (an analogous reduction can also be used for 2BLUE-BHC). Given an instance V of REST-MR, we construct an instance V⁺ of 2RED-BHC such that V is a Yes-instance of REST-MR if and only if V⁺ is a Yes-instance of 2RED-BHC. Without loss of generality, we may assume that the (2n)-dimensional all-ones vector 1_{2n} is in V, since V is a Yes-instance of REST-MR if and only if V ∪ {1_{2n}} is. The previous statement holds because 0_{2n} ∈ V ⊆ B(c, n), where c contains exactly n ones, if and only if (V ∪ {1_{2n}}) ⊆ B(c, n). To construct V⁺, we extend each (2n)-dimensional vector v ∈ V by adding two coordinates, which we refer to as coordinates q_{2n+1} and q_{2n+2}, and setting their values to 0 and 1, respectively; let v⁺ denote the extension of v. Let V_b be the resulting set of (extended) vectors from V, and let V_r = {0_{2n+2}, 1_{2n+2}}, where 0_{2n+2} and 1_{2n+2} are the (2n+2)-dimensional all-zero and all-one vectors, respectively. Finally, let V⁺ = V_b ∪ V_r. We can now show that V is a Yes-instance of REST-MR if and only if V⁺ is a Yes-instance of 2RED-BHC.

4. Basic Parameterizations for BHC

We follow up on Theorem 3 by considering the two remaining obvious parameterizations of the problem, namely d and |V|. The former case is trivial, since the number of possible centers is upper-bounded by 2^d.

Observation 4. BHC is FPT parameterized by d.

Next, we give a fixed-parameter algorithm for BHC parameterized by the total number of vectors.

Theorem 5. BHC is FPT parameterized by the number of red vectors plus the number of blue vectors.

Proof Sketch. Let k be the total number of red and blue vectors. For convenience, we consider the matrix representation of the input, in which the vectors are represented as the rows of a matrix M. Observe that, since there are k rows of binary coordinates in M, the total number of distinct column configurations of M is at most 2^k. The idea behind the fixed-parameter algorithm is to encode the problem as an instance of an Integer Linear Program (ILP) with 2^{k+1} variables, two for every column type. One such variable captures the number of 1s the center uses in the columns belonging to that type, while the other captures the number of 0s of the center in these columns. The main reason why this suffices is that the exact positions of these 0s and 1s within a column type are irrelevant when considering the distance between an arbitrary point and a center. The constraints simply ensure that the distance between the center and each blue point is strictly smaller than the distance between the center and each red point. It is well known that such an ILP instance can then be solved in FPT time using the classical result of Lenstra (Lenstra, 1983; Kannan, 1987; Frank & Tardos, 1987). Once the coordinates of the desired (hypersphere) center have been determined, the radius of the hypersphere can be set to the maximum Hamming distance between the center and the blue vectors in M; we sketch this encoding below.
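The following sketch (ours; for illustration it brute-forces the small variable space rather than invoking an actual ILP solver, and it assumes both color classes are non-empty) shows the column-type encoding at the heart of the proof: every center is described, up to irrelevant coordinate permutations, by the number of 1s it places into each column type.

```python
from collections import Counter
from itertools import product

def solve_by_column_types(blue, red):
    """Brute-force stand-in for the ILP of Theorem 5.  The fixed-parameter
    algorithm hands the equivalent linear constraints to a Lenstra-style
    ILP solver instead of enumerating them as we do here."""
    rows = list(blue) + list(red)
    types = Counter(zip(*rows))            # column pattern -> multiplicity n_T
    patterns = list(types)

    def dist(i, ones_per_type):
        # Row i disagrees with the center's 1s where its own bit is 0,
        # and with the center's 0s where its own bit is 1.
        return sum(x if p[i] == 0 else types[p] - x
                   for p, x in zip(patterns, ones_per_type))

    for ones in product(*(range(types[p] + 1) for p in patterns)):
        max_blue = max(dist(i, ones) for i in range(len(blue)))
        min_red = min(dist(i, ones) for i in range(len(blue), len(rows)))
        if max_blue < min_red:             # a valid center and radius exist
            return dict(zip(patterns, ones)), max_blue
    return None

# Two blue points around (1, 1, 0, 0) and one red outlier.
print(solve_by_column_types(blue=[(1, 1, 0, 0), (1, 0, 0, 0)],
                            red=[(0, 0, 1, 1)]))
```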
5. The Complexity of BHC with Conciseness

In this section, we perform a detailed analysis of the complexity of BHC with respect to conciseness, distinguishing between data conciseness and explanation conciseness. Data conciseness is the maximum number of 1s appearing in any red or blue vector of the instance I and is denoted dcon(I); that is, dcon(I) = max_{v ∈ V_R ∪ V_B} con(v). The explanation conciseness, on the other hand, is the maximum number of 1s appearing in the sought-after vector c. To capture this aspect of the problem, we define a new version of BHC that imposes a bound econ on the explanation conciseness of the vector c. Formally, let EC-BHC be defined analogously to BHC, except that we are additionally given an integer econ and the question is whether there exist a vector c ∈ D^d of conciseness at most econ and an r ∈ ℕ such that V_B ⊆ B(c, r) and V_R ∩ B(c, r) = ∅.

5.1. Data Conciseness

In this subsection, we analyze the parameterized complexity of BHC parameterized by the data conciseness dcon(I). We start by showing that instances I satisfying dcon(I) ≤ 3 can be solved in polynomial time.

Theorem 6. The restriction of BHC to instances I satisfying dcon(I) ≤ 3 can be solved in time O(|V| · d).

Proof Sketch. The main idea behind the algorithm is a case distinction based on the minimum distance M_r of any red vector from a solution vector c. Note first that if we fix a solution c for I, then |1(v)| − 2 · |1(v) ∩ 1(c)| can be used instead of the Hamming distance to compare the distances of two vectors from c. Altogether, we obtain four cases for M_r = min_{r ∈ V_R} (|1(r)| − 2 · |1(r) ∩ 1(c)|), namely (1) M_r ≤ −1, (2) M_r = 0, (3) M_r = 1, and (4) M_r ≥ 2. While (1) and (4) are trivial to solve, (2) and (3) are solved via a reduction to a Boolean CSP instance that can be solved in polynomial time because its relational language is closed under a majority operation. To illustrate the ideas for (1), note that if M_r ≤ −1, then it is easy to show that any solution vector must be 1 on every coordinate at which some blue vector has a 1. But this means that the instance has a solution if and only if the vector c that is 1 at all coordinates in ⋃_{v ∈ V_B} 1(v) and 0 otherwise is a solution for I; this is because setting any coordinate outside of ⋃_{v ∈ V_B} 1(v) to 1 can only decrease the relative distance of the vectors in V_R from c. A sketch of this one-candidate test is given below.
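A minimal sketch (ours) of this one-candidate test; it covers only case (1) of the proof and assumes V_B is non-empty:

```python
def hamming(v, c):
    return sum(a != b for a, b in zip(v, c))

def try_case_one(blue, red, d):
    """Case (1) of the proof sketch (M_r <= -1): the only candidate center
    is the vector that is 1 exactly on the union of the blue supports."""
    support = {i for b in blue for i, x in enumerate(b) if x}
    center = tuple(1 if i in support else 0 for i in range(d))
    for r in range(d + 1):            # d + 1 radii suffice for binary data
        if (all(hamming(b, center) <= r for b in blue)
                and all(hamming(v, center) > r for v in red)):
            return center, r
    return None
```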
We now show that BHC is already NP-complete for instances with dcon(I) ≥ 4; in fact, this holds even when restricted to the class of instances where dcon(I) is precisely 4.

Theorem 7. BHC is NP-complete even when restricted to instances I satisfying dcon(I) = 4.

We prove Theorem 7 via a reduction from the CSP problem over a constraint language Γ_4 that is NP-hard by Schaefer's theorem (Schaefer, 1978; Chen, 2009). Γ_4 is the Boolean constraint language containing the following two Boolean 4-ary relations: the red relation R_R containing all tuples having at most 2 ones, and the blue relation R_B containing all tuples having at least 3 ones.

Lemma 8. CSP(Γ_4) is NP-complete.

With Lemma 8 in hand, we establish Theorem 7 by designing a polynomial-time reduction from CSP(Γ_4). We note that, as mentioned already in the proof of Theorem 3, inclusion in NP is trivial.

Proof Sketch for Theorem 7. Let I = (V, C) be the given instance of CSP(Γ_4). We denote by C_r (resp. C_b) the set of all constraints c ∈ C with R(c) = R_R (resp. R(c) = R_B); note that C = C_r ∪ C_b. We construct the instance I' = (V_R, V_B, d) of BHC as follows. First, we introduce one coordinate d_v for every variable v ∈ V. Moreover, for every constraint c ∈ C_r, we introduce the red vector r_c that is 1 on all coordinates that correspond to variables within the scope of c and is 0 otherwise, i.e., r_c is 1 exactly on the coordinates in { d_v | v ∈ V(c) } and 0 at all other coordinates. Similarly, for every constraint c ∈ C_b, we introduce the blue vector b_c that is 1 on all coordinates that correspond to variables within the scope of c and 0 otherwise. Finally, we introduce two gadgets which enforce that every solution c of I' satisfies: (1) there is a red vector r ∈ V_R such that c is 1 on at least two coordinates where r is also 1; and (2) there is a blue vector b ∈ V_B such that c is not 1 on all coordinates where b is 1. Towards enforcing (1), we add two new blue vectors v^b_1 and v^b_2 together with 8 new coordinates d^b_1, . . . , d^b_8 such that v^b_1 is 1 exactly at the coordinates d^b_1, . . . , d^b_4 and v^b_2 is 1 exactly at the coordinates d^b_5, . . . , d^b_8. Moreover, for every i and j with 1 ≤ i < j ≤ 8, we introduce a red vector v^r_{ij} that is 1 exactly at the coordinates d^b_i and d^b_j plus two additional fresh coordinates. Towards enforcing (2), for every i with 1 ≤ i ≤ 4, we add one new blue vector u^b_i together with four fresh coordinates e^i_1, . . . , e^i_4 such that u^b_i is 1 exactly on the coordinates e^i_1, . . . , e^i_4. Finally, we add one new red vector u^r that is 1 exactly at the coordinates e^1_1, e^2_1, e^3_1, and e^4_1. We can now complete the proof by showing the equivalence between the original instance I of CSP(Γ_4) and the constructed instance I' of BHC.

5.2. Data Conciseness Plus |V_R| or |V_B|

Here we show that if, in addition to the data conciseness, one also parameterizes by the minimum of the numbers of red and blue vectors, then BHC becomes fixed-parameter tractable.

Theorem 9. BHC is fixed-parameter tractable parameterized by dcon(I) + min{|V_B|, |V_R|}.

Proof. Let I = (V_R, V_B, d) be the given instance of BHC. It suffices to show that BHC is fixed-parameter tractable parameterized by dcon(I) + |V_B| and also by dcon(I) + |V_R|. To avoid any confusion, we remark that it is well known (and easy to see) that establishing fixed-parameter tractability w.r.t. the sum α + β of two numbers is equivalent to establishing fixed-parameter tractability w.r.t. their product α · β. The main observation behind the algorithm for the parameterization dcon(I) + |V_B| is that the total number of coordinates at which any blue vector can be 1 is at most |V_B| · dcon(I); let B = ⋃_{b ∈ V_B} 1(b) be the set of all these coordinates. Since any solution c can be assumed to be 0 at every coordinate outside of B, we can solve I by branching over all candidate solutions in time O(2^{dcon(I) · |V_B|} · |V| · d). More specifically, for each of the at most 2^{dcon(I) · |V_B|} subsets B' of B, we check in time O(|V| · d) whether the vector c that is 1 exactly at the coordinates in B' is a solution for I. If one of these vectors is a solution, then we output it; otherwise, we can correctly return that I is a No-instance. The algorithm for the parameterization dcon(I) + |V_R| is almost identical, the only difference being the observation that the set R = ⋃_{r ∈ V_R} 1(r) has size at most dcon(I) · |V_R| and that any solution c can be assumed to be 1 at every coordinate outside of R. A sketch of the enumeration for the blue case is given below.
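A minimal sketch (ours, assuming V_B is non-empty) of the enumeration for the parameterization dcon(I) + |V_B|:

```python
from itertools import combinations

def hamming(v, c):
    return sum(a != b for a, b in zip(v, c))

def solve_blue_side(blue, red, d):
    """Enumeration behind Theorem 9: a solution may be assumed to be 0
    outside the union B of the blue supports, so every subset of B is
    tried as the center's support -- at most 2^(dcon(I) * |V_B|)
    candidates in total."""
    B = sorted({i for b in blue for i, x in enumerate(b) if x})
    for k in range(len(B) + 1):
        for chosen in combinations(B, k):
            chosen = set(chosen)
            center = tuple(1 if i in chosen else 0 for i in range(d))
            radius = max(hamming(b, center) for b in blue)
            if all(hamming(r, center) > radius for r in red):
                return center, radius
    return None
```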
5.3. Explanation Conciseness

Recall that EC-BHC is defined analogously to BHC, except that one is additionally given an integer econ and is asked for a solution c of BHC with conciseness at most econ. Note that a simple brute-force algorithm that enumerates all potential solution vectors c with at most econ 1s shows that EC-BHC is in XP parameterized by econ.

Observation 10. EC-BHC can be solved in time O(d^{econ} · |V| · d).

It is therefore natural to ask whether this can be improved to fixed-parameter tractability. The following two theorems show that this is unlikely to be the case, even if we additionally assume |V_B| = 1 or |V_R| = 1.

Theorem 11. EC-BHC is W[2]-hard parameterized by the conciseness econ of the solution, even if |V_R| = 1.

Proof Sketch. We provide a parameterized reduction from the UNIFORM HITTING SET problem which, given a set U of elements, a family F ⊆ 2^U of subsets of U with |F| = ℓ for every F ∈ F, and an integer k, asks whether F has a hitting set H ⊆ U of size at most k, i.e., a set H with H ∩ F ≠ ∅ for every F ∈ F. UNIFORM HITTING SET is W[2]-complete parameterized by k (Downey & Fellows, 2013). Let I = (U, F, k) be an instance of UNIFORM HITTING SET with sets of size ℓ. We construct an equivalent instance I' = (V_R, V_B, d, econ) of EC-BHC as follows. We set econ = k. For every u ∈ U, we introduce the (element) coordinate d_u, and for every i with 1 ≤ i ≤ ℓ, we introduce the (dummy) coordinate d'_i. Moreover, for every F ∈ F, we add the blue vector b_F to V_B, which is 1 on all coordinates d_u with u ∈ F and 0 at all other coordinates. Finally, we introduce the red vector r that is 1 at all dummy coordinates d'_i and 0 at all element coordinates. This completes the construction of I', which can clearly be carried out in polynomial time. We can now show that I is a Yes-instance of UNIFORM HITTING SET if and only if I' is a Yes-instance of EC-BHC.

Theorem 12. EC-BHC is W[1]-hard parameterized by the conciseness econ of the solution, even if |V_B| = 1.

Proof Sketch. We provide a parameterized reduction from the MULTI-COLORED INDEPENDENT SET problem which, given an integer k and an undirected graph G = (V, E) whose vertex set is partitioned into k sets V_1, . . . , V_k such that |V_i| = n and G[V_i] is a clique for every i, asks whether G has an independent set of size at least k; note that such an independent set must contain exactly one vertex from each V_i. MULTI-COLORED INDEPENDENT SET is well known to be W[1]-complete (Downey & Fellows, 2013). Let I = (G, V_1, . . . , V_k, k) be an instance of MULTI-COLORED INDEPENDENT SET with |V_i| = n and V = ⋃_{i=1}^{k} V_i. We construct an equivalent instance I' = (V_R, V_B, d, econ) of EC-BHC as follows. We set econ = k. For every v ∈ V, we introduce the (vertex) coordinate d_v, and for every i with 1 ≤ i ≤ nk − 2k + 1, we introduce the (dummy) coordinate d'_i. Moreover, for every edge e = {u, v} ∈ E(G), we add the red vector r_e to V_R, which is 1 at the coordinates d_u and d_v as well as at every dummy coordinate d'_i with 1 ≤ i ≤ nk − 2k + 1. We also add the red vector r to V_R that is 1 exactly at the dummy coordinates d'_i with 1 ≤ i ≤ nk − 2k + 1. Finally, we introduce the blue vector b that is 1 at all vertex coordinates d_v and 0 at all dummy coordinates. This completes the construction of I', which can clearly be carried out in polynomial time. We can now show that I is a Yes-instance of MULTI-COLORED INDEPENDENT SET if and only if I' is a Yes-instance of EC-BHC.

5.4. Data and Explanation Conciseness

As our final result in this section, we show that EC-BHC is fixed-parameter tractable when parameterized by the data and explanation conciseness combined.

Theorem 13. EC-BHC can be solved in time O(dcon(I)^{econ} · |V| · d) and is therefore fixed-parameter tractable parameterized by econ + dcon.

Proof.
Let I = (V_R, V_B, d, econ) with V = V_R ∪ V_B be the given instance of EC-BHC. The main idea behind the algorithm is as follows. We start by initializing the solution vector c to the all-zero vector. We then check in time O(|V| · d) whether c is already a solution. If so, we are done. Otherwise, there must exist a red vector r ∈ V_R and a blue vector b ∈ V_B such that δ(r, c) ≤ δ(b, c) and therefore

|1(r)| + |1(c)| − 2 · |1(r) ∩ 1(c)| ≤ |1(b)| + |1(c)| − 2 · |1(b) ∩ 1(c)|,

or, in short, |1(r)| − 2 · |1(r) ∩ 1(c)| ≤ |1(b)| − 2 · |1(b) ∩ 1(c)|. It follows that any vector c' with 1(c) ⊆ 1(c') and δ(r, c') > δ(b, c') has to be obtained from c by flipping at least one coordinate in B = 1(b) \ (1(r) ∪ 1(c)) from 0 to 1; note that |B| ≤ dcon(I). We can therefore branch on the coordinates of B, and for every such choice b' ∈ B, we continue with the vector obtained from c by flipping the coordinate b' from 0 to 1. We stop once we have either reached a solution or the number of 1s in the current vector c exceeds the conciseness upper bound econ. In other words, we can solve the problem using a branching algorithm that has at most |1(b)| ≤ dcon(I) choices per branch, uses time O(|V| · d) per search-tree node, and makes at most econ branching decisions along any branch before it stops. Therefore, the running time of the algorithm is O(dcon(I)^{econ} · |V| · d). A self-contained sketch of this procedure is given below.
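The following Python sketch (ours; names are our own and V_B is assumed non-empty) implements the branching procedure described in the proof:

```python
def hamming(v, c):
    return sum(a != b for a, b in zip(v, c))

def ec_bhc_branch(blue, red, d, econ):
    """Branching algorithm from the proof of Theorem 13.  Starting from
    the all-zero center, repeatedly pick a violating red/blue pair (r, b)
    and branch on flipping to 1 one coordinate where b is 1 but r and the
    current center are 0 (the set B from the proof)."""

    def violation(center):
        # With radius = max distance to a blue vector, the center fails
        # exactly if some red vector is at most that far away.
        radius = max(hamming(b, center) for b in blue)
        for r in red:
            if hamming(r, center) <= radius:
                b = max(blue, key=lambda b: hamming(b, center))
                return r, b                 # delta(r, c) <= delta(b, c)
        return None

    def search(center, ones):
        pair = violation(center)
        if pair is None:
            return center, max(hamming(b, center) for b in blue)
        if ones == econ:
            return None                     # conciseness budget exhausted
        r, b = pair
        # At most dcon(I) branching choices per node.
        for i in (i for i in range(d)
                  if b[i] == 1 and r[i] == 0 and center[i] == 0):
            found = search(center[:i] + (1,) + center[i + 1:], ones + 1)
            if found is not None:
                return found
        return None

    return search((0,) * d, 0)
```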
6. A Treewidth-Based Algorithm for BHC

Let the incidence graph G_I of an instance I = (V_R, V_B, d) of BHC be the bipartite graph defined as follows. First of all, V(G_I) = V_R ∪ V_B ∪ [d]. As for the edge set, there is an edge vc ∈ E(G_I) between a vector v ∈ V_R ∪ V_B and a coordinate c ∈ [d] if and only if c ∈ 1(v). We identify the vertices of G_I with the vectors in V_R ∪ V_B and the coordinates in [d]; that is, for a set of vertices X in G_I, we often say "vectors in X" or "coordinates in X" to mean the vectors or coordinates associated with the vertices in X. This section is dedicated to proving the following technical theorem, which implies all the claimed tractability results concerning the treewidth of the incidence graph:

Theorem 14. Given an instance I = (V_R, V_B, d) of BHC and a nice tree-decomposition T = (T, χ) of G_I of width w, there is an algorithm solving I in time (2 · min{econ_min, dcon(I)})^{2w+2} · (|V| + d)^{O(1)}, where econ_min is the minimum conciseness of any center. Moreover, if I is a Yes-instance, then the algorithm outputs a center with conciseness econ_min and minimum radius among all such centers.

Proof Sketch. We begin by enumerating each choice of center conciseness λ = 0, 1, 2, . . . , d and radius r = 0, 1, 2, . . . , d, and aim to construct a solution with exactly this conciseness and radius. The algorithm is a bottom-up dynamic program along the nice tree-decomposition T. We first describe the records that we need to compute for every node t of T. Given the description of the records, we need to show that, for each of the node types (i.e., leaf/introduce/forget/join), we can compute the records from the records of the children. Finally, we also need to show that, given the records for the root node of the tree-decomposition, we can decide whether I is a Yes-instance and, if so, output a center vector c such that |1(c)| = λ, V_B ⊆ B(c, r), and V_R ∩ B(c, r) = ∅.

We begin by describing the record Γ_t for each node t of T. We can think of Γ_t as a map that maps a tuple C = (c_past, c_future, C_bag, V_bag) ∈ ℕ × ℕ × 2^{χ(t)} × ℕ^{|χ(t)|} to either a vector c_t ∈ {0, 1}^d with con(c_t) = λ or to ⊥. The intuition behind the record is that the tuple (c_past, c_future, C_bag, V_bag) is mapped to an arbitrary vector c_t such that:

1. c_past is the number of non-zero coordinates of c_t on already forgotten coordinates, i.e., c_past = |1(c_t) ∩ (χ(T_t) \ χ(t))|;
2. c_future is the number of non-zero coordinates of c_t on coordinates that have not yet been introduced, i.e., c_future = |1(c_t) ∩ ([d] \ χ(T_t))|;
3. C_bag = 1(c_t) ∩ χ(t);
4. V_bag contains, for every vector v ∈ χ(t), the number of ones that v shares with c_t on forgotten coordinates, that is, V_bag(v) = |1(c_t) ∩ 1(v) ∩ (χ(T_t) \ χ(t))|;
5. no forgotten red vector is at distance at most r from c_t, i.e., (V_R ∩ (χ(T_t) \ χ(t))) ∩ B(c_t, r) = ∅; and
6. all forgotten blue vectors are at distance at most r from c_t, i.e., (V_B ∩ (χ(T_t) \ χ(t))) ⊆ B(c_t, r).

We say that a vector c_t satisfying all of the above properties is compatible with (c_past, c_future, C_bag, V_bag) for t. Moreover, (c_past, c_future, C_bag, V_bag) is mapped to ⊥ if and only if no vector in {0, 1}^d is compatible with it. First note that if t is the root node, then χ(t) is empty and χ(T_t) \ χ(t) contains all vectors in the instance. Hence, if any tuple is mapped to a vector at the root, then this vector is a solution by properties 5 and 6 above. We say that a tuple C = (c_past, c_future, C_bag, V_bag) is achievable for Γ_t if the following hold: c_past + c_future + |C_bag| = λ, and V_bag(v) ≤ min{λ, |1(v) ∩ (χ(T_t) \ χ(t))|} for every vector v ∈ χ(t). Note that if C is not achievable for Γ_t, then no vector with conciseness λ can be compatible with C. Hence, the table Γ_t only contains the achievable tuples for Γ_t, and we observe that |Γ_t| ≤ λ^2 · 2^{|χ(t)|} · (min{λ, dcon(I)})^{|χ(t)|}. We can now compute the records in a leaf-to-root fashion at each of the four different types of nodes in T; a sketch of the record structure is given below.
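To make the shape of these records concrete, here is a minimal Python sketch (ours; all names are hypothetical, and only the table index and the achievability filter are shown, not the leaf/introduce/forget/join update rules):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RecordIndex:
    """Index of an entry of the DP table Gamma_t for a node t."""
    c_past: int        # ones of the center on already forgotten coordinates
    c_future: int      # ones on coordinates not yet introduced
    c_bag: frozenset   # bag coordinates on which the center is 1
    v_bag: tuple       # per bag vector: ones shared with the center on
                       # forgotten coordinates

def is_achievable(idx, lam, forgotten_support_sizes):
    """Achievability filter from the proof sketch.  Here
    forgotten_support_sizes[i] is the number of forgotten coordinates on
    which the i-th bag vector is 1, aligned with idx.v_bag."""
    if idx.c_past + idx.c_future + len(idx.c_bag) != lam:
        return False
    return all(cnt <= min(lam, size)
               for cnt, size in zip(idx.v_bag, forgotten_support_sizes))
```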
Combining Theorem 14 and Proposition 2, we obtain the following three corollaries:

Corollary 15. BHC and EC-BHC are in XP parameterized by tw(G_I).

Corollary 16. EC-BHC is fixed-parameter tractable parameterized by tw(G_I) + econ.

Corollary 17. BHC and EC-BHC are fixed-parameter tractable parameterized by tw(G_I) + dcon(I).

7. Concluding Remarks

In this paper, we studied hypersphere classification problems from a parameterized-complexity perspective, with a strong focus on conciseness. We considered conciseness both in terms of the sought-after explanation and in terms of the feature vectors in the training data. Our algorithmic and lower-bound results draw a comprehensive complexity map of hypersphere classification. This map pinpoints the exact complexity of the various combinations of parameters, which can measure either the structural properties of the input data or the conciseness of the data or the explanations. All our lower and upper complexity bounds are essentially tight, with a single exception: while we show that hypersphere classification without conciseness restrictions is XP-tractable when parameterized by treewidth alone, whether the problem is fixed-parameter tractable or W[1]-hard under this parameterization is left open. Finally, we remark that all our results carry over to the case where one aims to find a minimum-radius separating hypersphere (instead of merely deciding whether one exists) that classifies the training data. This problem has also been extensively studied (Cooper, 1962; Wang et al., 2007; 2005; Astorino & Gaudioso, 2009; Astorino et al., 2016).

Acknowledgments

Robert Ganian acknowledges support from the Austrian Science Fund (FWF, project Y1329). Iyad Kanj acknowledges support from DePaul University through URC grant 602061. Stefan Szeider acknowledges support from the Austrian Science Fund (FWF, project P32441) and from the Vienna Science and Technology Fund (WWTF, project ICT19-065). Sebastian Ordyniak acknowledges support from the Engineering and Physical Sciences Research Council (EPSRC, project EP/V00252X/1).

References

Agarwal, P. K., Aronov, B., and Koltun, V. Efficient algorithms for bichromatic separability. ACM Trans. Algorithms, 2(2):209-227, 2006.

Alegría, C., Orden, D., Seara, C., and Urrutia, J. Separating bichromatic point sets in the plane by restricted orientation convex hulls. Journal of Global Optimization, pp. 1-34, 2022.

Astorino, A. and Gaudioso, M. A fixed-center spherical separation algorithm with kernel transformations for classification problems. Comput. Manag. Sci., 6(3):357-372, 2009.

Astorino, A., Fuduli, A., and Gaudioso, M. Nonlinear programming for classification problems in machine learning. In AIP Conference Proceedings, volume 1776, 2016.

Barceló, P., Monet, M., Pérez, J., and Subercaseaux, B. Model interpretability through the lens of computational complexity. In Proceedings of the Thirty-Third Annual Conference on Neural Information Processing Systems (NeurIPS), pp. 15487-15498, 2020.

Blanc, G., Lange, J., and Tan, L. Provably efficient, succinct, and precise explanations. In Proceedings of the Thirty-Fourth Conference on Neural Information Processing Systems (NeurIPS), pp. 6129-6141, 2021.

Chalasani, P., Chen, J., Chowdhury, A. R., Wu, X., and Jha, S. Concise explanations of neural networks using adversarial training. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119, pp. 1383-1391. PMLR, 2020.

Chen, H. A rendezvous of logic, complexity, and algebra. ACM Comput. Surv., 42(1):2:1-2:32, 2009.

Cooper, P. W. The hypersphere in pattern recognition. Inf. Control., 5(4):324-346, 1962.

Cygan, M., Fomin, F. V., Kowalik, L., Lokshtanov, D., Marx, D., Pilipczuk, M., Pilipczuk, M., and Saurabh, S. Parameterized Algorithms. Springer, 2015.

Dahiya, Y., Fomin, F. V., Panolan, F., and Simonov, K. Fixed-parameter and approximation algorithms for PCA with outliers. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139, pp. 2341-2351, 2021.

Darwiche, A. and Hirth, A. On the reasons behind decisions. In ECAI 2020 - 24th European Conference on Artificial Intelligence, volume 325 of Frontiers in Artificial Intelligence and Applications, pp. 712-720, 2020.

Doshi-Velez, F. and Kim, B. A roadmap for a rigorous science of interpretability. CoRR, abs/1702.08608, 2017. URL http://arxiv.org/abs/1702.08608.

Downey, R. G. and Fellows, M. R. Fundamentals of Parameterized Complexity. Texts in Computer Science. Springer, 2013.

Flum, J. and Grohe, M. Parameterized Complexity Theory, volume XIV of Texts in Theoretical Computer Science. An EATCS Series. Springer, Berlin, 2006.

Frances, M. and Litman, A. On covering problems of codes. Theory of Computing Systems, 30:113-119, 1997.

Frank, A. and Tardos, É. An application of simultaneous diophantine approximation in combinatorial optimization. Combinatorica, 7(1):49-65, 1987.

Ganian, R. and Korchemna, V. The complexity of Bayesian network learning: Revisiting the superstructure. In Proceedings of the Thirty-Fourth Conference on Neural Information Processing Systems (NeurIPS), pp. 430-442, 2021.

Ganian, R., Kanj, I. A., Ordyniak, S., and Szeider, S. Parameterized algorithms for the matrix completion problem. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80, pp. 1642-1651, 2018.
Ganian, R., Hamm, T., Korchemna, V., Okrasa, K., and Simonov, K. The complexity of k-means clustering when little is known. In International Conference on Machine Learning (ICML), volume 162, pp. 6960-6987, 2022.

Grüttemeier, N. and Komusiewicz, C. Learning Bayesian networks under sparsity constraints: A parameterized complexity analysis. J. Artif. Intell. Res., 74:1225-1267, 2022.

Hurtado, F., Seara, C., and Sethia, S. Red-blue separability problems in 3D. In Computational Science and Its Applications, pp. 766-775. Springer, Berlin, Heidelberg, 2003.

Ignatiev, A., Marques-Silva, J., Narodytska, N., and Stuckey, P. J. Reasoning-based learning of interpretable ML models. In Zhou, Z. (ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI), pp. 4458-4465, 2021.

Izza, Y., Ignatiev, A., Narodytska, N., Cooper, M. C., and Marques-Silva, J. Provably precise, succinct and efficient explanations for decision trees. CoRR, abs/2205.09569, 2022.

Kannan, R. Minkowski's convex body theorem and integer programming. Mathematics of Operations Research, 12(3):415-440, 1987.

Kim, S., Lee, K., and Jeong, Y. Norm ball classifier for one-class classification. Ann. Oper. Res., 303(1):433-482, 2021.

Kloks, T. Treewidth, Computations and Approximations, volume 842 of Lecture Notes in Computer Science. Springer, 1994.

Korhonen, T. A single-exponential time 2-approximation algorithm for treewidth. In 62nd IEEE Annual Symposium on Foundations of Computer Science (FOCS), pp. 184-192. IEEE, 2021.

Laber, E. S. and Murtinho, L. On the price of explainability for some clustering problems. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139, pp. 5915-5925, 2021.

Lenstra, H. W., Jr. Integer programming with a fixed number of variables. Mathematics of Operations Research, 8(4):538-548, 1983.

Lipton, Z. C. The mythos of model interpretability. Commun. ACM, 61(10):36-43, 2018.

Makarychev, K. and Shan, L. Near-optimal algorithms for explainable k-medians and k-means. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139, pp. 7358-7367, 2021.

Megiddo, N. On the complexity of polyhedral separability. Discrete & Computational Geometry, 3(4):325-337, 1988.

Monroe, D. AI, explain yourself. Commun. ACM, 61(11):11-13, 2018.

Nori, H., Caruana, R., Bu, Z., Shen, J. H., and Kulkarni, J. Accuracy, interpretability, and differential privacy via explainable boosting. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139, pp. 8227-8237, 2021.

Ordyniak, S. and Szeider, S. Parameterized complexity results for exact Bayesian network structure learning. J. Artif. Intell. Res., 46:263-302, 2013.

Ordyniak, S., Paesani, G., and Szeider, S. The parameterized complexity of finding concise local explanations. In Elkind, E. (ed.), The 32nd International Joint Conference on Artificial Intelligence (IJCAI-23), Macao, S.A.R., 2023. International Joint Conferences on Artificial Intelligence Organization. Main Track, to appear.

O'Rourke, J., Kosaraju, S. R., and Megiddo, N. Computing circular separability. Discrete & Computational Geometry, 1:105-114, 1986.

Ribeiro, M. T., Singh, S., and Guestrin, C. Anchors: High-precision model-agnostic explanations. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 1527-1535. AAAI Press, 2018.
Schaefer, T. J. The complexity of satisfiability problems. In Proceedings of the 10th Annual ACM Symposium on Theory of Computing, pp. 216-226. ACM, 1978.

Sheikhi, F., Mohades, A., de Berg, M., and Davoodi, M. Separating bichromatic point sets by L-shapes. Comput. Geom., 48(9):673-687, 2015.

Shih, A., Choi, A., and Darwiche, A. A symbolic approach to explaining Bayesian network classifiers. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), pp. 5103-5111, 2018.

Shih, S., Tien, P., and Karnin, Z. GANMEX: One-vs-one attributions using GAN-based model explainability. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139, pp. 9592-9602, 2021.

Simonov, K., Fomin, F. V., Golovach, P. A., and Panolan, F. Refined complexity of PCA with outliers. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97, pp. 5818-5826, 2019.

Tax, D. M. J. and Duin, R. P. W. Data domain description using support vectors. In The European Symposium on Artificial Neural Networks, 1999.

Vapnik, V. N. The Nature of Statistical Learning Theory. Springer-Verlag, Berlin, Heidelberg, 1995.

Wäldchen, S., MacDonald, J., Hauch, S., and Kutyniok, G. The computational complexity of understanding binary classifier decisions. J. Artif. Intell. Res., 70:351-387, 2021.

Wang, J., Neskovic, P., and Cooper, L. N. Pattern classification via single spheres. In Proceedings of the 8th International Conference on Discovery Science, volume 3735 of Lecture Notes in Computer Science, pp. 241-252. Springer, 2005.

Wang, J., Neskovic, P., and Cooper, L. N. Bayes classification based on minimum bounding spheres. Neurocomputing, 70(4-6):801-808, 2007.

Wang, Z., Zhang, W., Liu, N., and Wang, J. Scalable rule-based representation learning for interpretable classification. In Proceedings of the Thirty-Fourth Conference on Neural Information Processing Systems (NeurIPS), pp. 30479-30491, 2021.