# faster_optimal_univariate_microaggregation__425a57a5.pdf

Published in Transactions on Machine Learning Research (10/2024)

Faster Optimal Univariate Microgaggregation

Felix I. Stamm felix.stamm@cssh.rwth-aachen.de RWTH Aachen University Germany

Michael T. Schaub schaub@cs.rwth-aachen.de RWTH Aachen University Germany

Reviewed on Open Review: https: // openreview. net/ forum? id= s5l EUty Vly

Microaggregation is a method to coarsen a data set, by optimally clustering data points in groups of at least k points, thereby providing a k-anonymity type disclosure guarantee for each point in the data set. Previous algorithms for univariate microaggregation had a O(kn) time complexity. By rephrasing microaggregation as an instance of the concave least weight subsequence problem, in this work we provide improved algorithms that provide an optimal univariate microaggregation on sorted data in O(n) time and space. We further show that our algorithms work not only for sum of squares cost functions, as typically considered, but seamlessly extend to many other cost functions used for univariate microaggregation tasks. In experiments we show that the presented algorithms lead to performance improvements on real hardware.

1 Introduction

Public data, released from companies or public entities, are an important resource for data science research. Such data can be used to provide novel types of services to society and provide an essential mechanism for holding public or private entities accountable. Yet, the benefits of publicly released data have to be carefully traded off against other fundamental interest such as privacy concerns of affected people (Aggarwal & Yu, 2008). A common practical solution to circumvent these problems is thus to reduce the resolution of the collected raw data, e.g., via spatial or temporal aggregation, which leads to a coarse-grained summary of the initial data.

Microaggregation (Defays & Anwar, 1998; Samarati, 2001; Domingo-Ferrer & Mateo-Sanz, 2002; Domingo Ferrer & Torra, 2005) is a method that aims to provide an adaptive optimal aggregation that remains informative, while providing a certain guaranteed level of privacy. Specifically, in microaggregation we aim to always group at least k data items together onto a common representation. For instance, for vectorial data, we may map data points to their centroid values, while ensuring a cluster size of at least k points. The found centroid values of the clusters may thus be released and used as aggregated representation of the original data. Clearly, for the aggregated data to remain useful for further analysis, it should remain closely aligned with the original data according to some metric relevant for the task at hand. Hence, in microaggregation we seek to minimize a distortion metric of the aggregated data, while respecting the minimal group size constraint. This minimal group-size constraint (at least k data-items per group), which ensures a guaranteed minimal level of privacy within each cluster, is a major difference to standard clustering procedures such as k-means, which simply aim to obtain a low-dimensional representation of the observed data in terms of clusters.

Microggregation tasks are usually formulated as discrete optimization problems, where low costs indicate high similarity among the data items in each cluster. Similar to many other clustering problems, microaggregation is in general an NP-hard problem (see Oganian & Domingo-Ferrer (2001) or appendix B.1). A specific form

Published in Transactions on Machine Learning Research (10/2024)

Figure 1: Example application of univariate microaggregation to a toy medical data set. Left we have the original data containing quasi identifying information such as age and sensitive information such as health status. In the anonymized data an attacker interested in the health status of an individual and equipped with knowledge about the age of an individual can no longer infer the health status with certainty. To achieve such an anonymization, potentially identifying attributes such as age are aggregated. The task of finding such an aggregation while minimizing the distortion of the data is called univariate microaggregation. In this work, we are concerned with efficient algorithms to solve univariate microaggregation for various kinds of distortion metrics/cost functions.

of microaggregation that remains efficiently solvable, however, is univariate microaggregation (UM), in which the data points {xi} to be aggregated are (real-valued) scalars. This is the focus of our work here.

Despite its relative simplicity, univariate micro-aggregation is a useful primitive in a number of problems. A concrete example is the anonymization of degree sequences of graphs (Casas-Roma et al., 2017). Further, univariate microaggregation can also be used to provide heuristic solutions to microaggregation tasks in higher dimensions with the help of projections (Domingo-Ferrer & Mateo-Sanz, 2002). In this work we consider the univariate microaggregation problem under a wide range of cost functions, including the typically considered sum-of-squares error (SSE) based on the (squared) ℓ2 norm, the ℓ1 norm, the ℓ norm and roundup cost error types.

Contributions Our main contributions are as follows (see Figure 2 for a visual representation).

We show that for ordered minimizable cost functions UM can be solved in O(n2), improving previous results with exponential runtime. We achieve this by rephrasing UM as a least weight subsequence problem (Wilber, 1988; Wu, 1991).

We show that for ordered minimizable cost functions, which in addition have a splitting is beneficial property, this can be improved to O(kn). This includes many popular cost functios based on the ℓ1 norm (sum of absolute errors), ℓ2 norm (sum of squared errors), ℓ (maximal error) norm, and round up/down cost functions.

Finally, we show that for ordered minimizable cost functions which are concave, algorithms with O(n) time and space are possible. These findings apply to above mentioned examples of ℓp norm-based cost functions.

In our presentation we focus on three key conceptual ideas: ordered minimizable , splitting is beneficial , and concave costs. While these concepts are already implicitly present in the literature, they are typically not made explicit, which can make it difficult to relate the underlying algorithmic ideas to each other. Making these concepts explicit, enables us to unfold a natural problem hierarchy within the associated univariate microaggregation formulations, which leads us to the design of two algorithms that use all three concepts: the simple+ algorithm and the staggered algorithm. These algorithms have a faster runtime compared with previous algorithms for the UM task. Lastly, we provide several practical improvements specific to UM that allow for (i) algorithms that run empirically faster in practice and (ii) substantially decreased impact of floating point errors on the resulting clustering. Overall, our work thus enables to robustly compute

Published in Transactions on Machine Learning Research (10/2024)

Figure 2: Complexity of UM on sorted data for different classes of cost functions. For ordered minimizable cost functions univariate microaggregation can be solved in O(n2). If additionally splitting is beneficial, then one can actually solve UM in O(kn) while for concave cost functions UM can be solved in O(n). For cost functions that have both properties we provide the Simple+ and the Staggered algorithm which have faster empirical runtime without sacrificing worst case bounds.

optimal UM-clusterings for various cost functions and large values of n and k. An implementation of the presented ideas is available at https://github.com/Feelx234/microagg1d and the code can also be found at https://doi.org/10.5281/zenodo.10459327.

Outline In Section 2 we first outline definitions associated with UM and the related least weight subsequence (LWSS) problem. In Section 3 we then show that for ordered minimizable cost functions, we can reformulate UM as LWSS problem which allows for O(n2) algorithms. For cost functions with the splitting is beneficial property, this runtime complexity can be improved to O(kn). For concave cost functions this can be futher improved to O(n). Closing Section 3 we introduce a simpler and empirically faster O(n) algorithm for UM. In Section 4 we present considerations for running UM on real hardware. First, we present two strategies to avoid issues arising from finite float precision. Second, we illustrate a small algorithmic trick to use fewer instructions when computing the cluster cost, which is the main computation carried out within UM. We verify that the presented theoretical considerations lead to significant empirical performance improvements in Section 5.

1.1 Related Work

The question of how to publicly release data, while retaining certain levels of privacy, has become increasingly important in recent years. Privacy preservation for data sets is most relevant if data entries contain both public as well as sensitive private information. Note that what information is considered public and private can depend on the application scenario. For instance, we may have tabular data, where each row contains public information such as name or age, in conjunction with private information such as medical diagnoses. In this case, an important objective is to design a surrogate data set, such that it is virtually impossible to use public information to deduce private information about an individual.

The public attributes of a database can usually be split into identifiers and quasi-identifiers. To privacyharden a database it is typically insufficient to merely remove identifiers such as name or social security number. It can often remain possible to identify an individual in the data set by a combination of quasiidentifiers (Rocher et al., 2019), e.g., there is only one 30 year old female in the data. Thus other privacy protection measures need to be employed.

To combat the leakage of sensitive information, different privacy concepts such as l-diversity (Machanavajjhala et al., 2007), t-closeness (Li et al., 2007), and the popular k-anonymity (Samarati & Sweeney, 1998; Samarati, 2001; Sweeney, 2002) have been proposed. The latter is the privacy concept we are concerned with in this work. If a data set is k-anonymous, it is not possible to use any combination of quasi-identifiers to narrow down the set of individuals described by those quasi-identifiers to less than k individuals. Because there will always be k 1 other rows indistinguishable by quasi-identifiers alone, it remains impossible to deduce which data item corresponds to a specific identity, no matter how much knowledge about the quasiidentifiers of an individual is available. Unfortunately, data sets are usually not inherently k-anonymous but the released data needs to be altered to become k-anonymous distorting the statistics of the released data.

Published in Transactions on Machine Learning Research (10/2024)

Microaggregation (Domingo-Ferrer & Mateo-Sanz, 2002; Defays & Anwar, 1998; Domingo-Ferrer & Torra, 2005) was proposed as a method to make a data set k-anonymous. Therefore data entries are grouped into groups of size at least k, and the quasi-identifiers of entries are replaced with their group-centroid values. To uphold the utility of the released data, the grouping is conducted by minimizing a distortion metric subject to a minimal group size constraint. Solving microaggregation with more than one variable, so called multivariate microaggregation, is known to be NP-hard for the sum-of-squares cost function (Oganian & Domingo-Ferrer, 2001), a popular distortion metric. To perform multivariate microaggregation in practice, several algorithms have been proposed. Some of these alogorithms come with approximation guarantees, while others are simply heuristics. We refer to Yan et al. (2022) for many references to algorithms for multivariate microaggregation.

Microaggregation with only one variable, so called univariate microaggregation, is known to be solvable in polynomial time. More precisely, O(kn) for the sum-of-squares cost function (Hansen & Mukherjee, 2003) by constructing a graph and solving a shortest path problem on this graph. Algorithms which solve univariate microaggregation may also be used to solve multivariate microaggregation problems. Recently, other cost functions besides SSE have been used for univariate microaggregation tasks such as degree-sequence anonymisation (Casas-Roma et al., 2017).

We remark that the term univariate microaggregation is also used in the literature to describe a microaggregation problem of ordered high dimensional data: In that scenario we are provided with data vectors xi Rd, with d 2 and an extrinsically defined total order of the points, which must be respected when creating clusters. Hence, any cluster that contains the points xi and xj must also contain all points greater than xi and smaller than xj according to the provided ordering. While some of the ideas we develop here may be extended to this scenario, this is in general not a trivial task and beyond the scope of this work.

2 Problem Formulation and Preliminaries

In univariate microaggregation (UM) we consider a set of scalar data points xi R for i = 1, . . . , n. To keep the treatment general we allow for repeated elements xi, i.e., we consider our universe Ω= {{x1, . . . , xn}} to be a multi-set of points. The goal of UM can now be formulated as follows.

Definition 1 (Univariate Microaggregation problem). Given a multi-set of scalars Ω= {{x1, . . . , xn}}, find a non-empty partition of the points into clusters Xj, such that each cluster has size at least k and a total additive cost function TC({{X1, . . .}}) := P j C(Xj) is minimized, where C denotes a cost function that measures the distortion within each cluster.

Formally, we look for a partition, i.e., a split of the multi-set Ωinto nonempty multi-sets Xi = , such that Sn i=1 Xi = Ω. To avoid confusion, in this work the union of two multi-sets is an additive union, i.e., the multiplicities of the elements add up when forming the union of two multi-sets. Notice, that there might be multiple partitions minimizing the total cost even if all xi are unique. This is usually not a problem as we are typically interested in finding any partition minimizing the total cost.

A brute-force approach, neglecting possible structure in the cost function C needs at least exponential time, as all possible partitions have to be assessed. However, two important observations have been employed in the literature to lower this complexity. First, for many cost functions it can be shown that a minimal cost partition has no interleaved clusters (see Corollary 7 or Domingo-Ferrer & Mateo-Sanz (2002) for the special case of SSE). This substantially restricts the search space that needs to be explored in order to find an optimal solution, as it implies that only pairwise-ordered clusterings need to be considered. From now on we thus assume that data points xi are sorted in ascending order, i.e. i < j xi xj. For a given cost function C, this allows us to denote by C(i, j) := C({{xi+1, . . . , xj}}), the cost of a cluster with starting point xi+1 and endpoint xj. Second, for many cost functions, there always exist an optimal cluster assignment in which no cluster has size larger than 2k 1, i.e., it is never detrimental to split clusters of size at least 2k into smaller clusters (see eq. 2 or Domingo-Ferrer & Mateo-Sanz (2002) for the special case of SSE). This also reduces the size of the search space, as there is no need to check possible clusterings containing clusters of size greater 2k 1.

Published in Transactions on Machine Learning Research (10/2024)

Previously, these observations were used to solve univariate micoraggregation for the SSE cost function on sorted data in O(kn) time (Hansen & Mukherjee, 2003; Mortazavi, 2020). In the following we will exploit the fact that many popular cost functions are concave cost functions, which can be used to create even faster algorithms. For these concave cost functions it is possible to achieve O(n) runtime algorithms for the UM problem on sorted data by rephrasing UM as a least weight subsequence problem.

2.1 The Least-Weight Subsequence Problem

As we will see in the next sections, univariate microaggregation can actually be phrased such that established algorithms which solve the least-weight subsequence problem can be applied to UM. Hence, we briefly recall the relevant aspects of the least-weight subsequence problem Definition 2 (Least weight subsequence problem (Wilber, 1988) ). Given an integer n and a cost function C(i, j), find an integer q 1 and a sequence of integers 0 = l0 < l1 < < lq 1 < lq = n such that the total cost TC = Pq 1 i=0 C(li, li+1) is minimized.

2.1.1 A Standard Algorithm (Wilber, 1988)

To gain some intuition about how to solve the least-weight subsequence problem, we first outline an algorithm with an O(n2) time, O(n) memory requirement. To this end we introduce the auxiliary variables mij, which is the sum of the cost obtained after solving the LWSS problem for the points x1 . . . xi plus the cost of C(i, j) := C({{xi+1, . . . , xj}}). We call mij the conditional minimal cost. More specifically, m0j := C(0, j) is simply the cost of considering the first j points to be in a single cluster. The remaining conditional minimal costs are then recursively defined as mij := minl<i mli + C(i, j). We notice that minl<i mli is the minimal total cost required to solve the least-weight subsequence problem on the points x1 . . . xi, and so mij can be interpreted as the sum of the costs of (i) assigning the first i points optimally and (ii) assigning the points {{xi+1, . . . , xj}} to a single cluster.

To efficiently compute all mij we make use of a lookup array A which we fill with the relevant mij entries as follows: We first compute m01 and store it in array A at position 1. Next, we compute the conditional minimal costs m02 and m12. To compute m12 we read the entry m01 from our lookup array A at position 1. We store the minimum of m02 and m12 in array A at position two. We can continue in the same way iteratively. In each iteration we increase j by one and compute all the conditional minimal costs mij with i < j. Each mij requires the value minl<i ml,i which we can read from our array A at position i. At the end of each such iteration we store mini mij in our array A at position j such that it is available for further computations. The algorithm terminates once we have found mini min which corresponds to the minimal total cost. Overall, this takes O(n2) time if we can compute the cost function C in O(1).

In univariate microaggregation we are usually more interested in a minimal cost clustering rather than the minimal cost itself. This can be achieved by slight adaptation of the above algorithm. Whenever we compute mini mij we also store the index i = arg mini mij of the minimum in a linear array Aind at position j. The full pseudo code for this algorithm can be found in Alg. 2. The result of this standard algorithm implicitly holds the information about a minimal cost clustering which can be obtained by backtracking in O(n). Starting with j = n we consider imin = arg mini mij which we can read from our array Aind at position j. All points with index l with imin < l j belong to one cluster in the optimal costs clustering. If imin > 0 we repeat the same process with j = imin to obtain the next cluster. Full pseudo code of this backtracking procedure is provided in Alg. 3 in the appendix.

2.1.2 Algorithms for Concave Cost Functions

More efficient algorithms than the standard algorithm outlined above are known for the concave least weight subsequence problem. The problem is called concave if the cost function C satisfies for all 0 a < b < c < d n:

C(a, c) + C(b, d) C(a, d) + C(b, c). (1)

The above constraint, is sometimes called quadrangle inequality . If a cost function fulfills this requirement it is called concave Wilber (1988); Galil & Park (1990).

Published in Transactions on Machine Learning Research (10/2024)

If we consider non negative cost functions (i.e. XC(X) 0) with the property that a one-element cluster has no cost (C(i, i + 1) = 0), we can derive a few interesting implications about concave cost functions. The first observation is, that wide is worse. Mathematically this means that for a < b < c

C(a, b) C(a, c)

C(b, c) C(a, c).

In simple words, this means that if we widen a cluster by adding more points to the left or to the right the cost does not decrease. A similar consequence of concavity is that splitting is beneficial, i.e. for all a < b < c

C(a, b) + C(b, c) C(a, c). (2)

Note that the above equation is effectively an opposite of the triangle inequality (which would hold for so-called convex cost functions).

Most important for solving least weight subsequence problems is the consequence that the associated matrix C with entries Ci,j = C(i, j) is a Monge matrix (where for notational convenience we index the rows starting from zero and the columns starting from 1). Most relevant for our purposes is that Monge matrices and their transpose are totally monotone (Burkard et al., 1996), i.e., it holds for each 2 2 submatrix α β γ δ

that γ < δ implies that α < β and, similarly, we have γ = δ α β. It is easy to see that if we add a constant value to a column of a totally monotone matrix, the resulting matrix is again totally monotone. This implies that the transposed conditional minimal cost matrix MT with entries Mij = mij is totally monotone, as it is obtained from the transposed cost matrix CT (which is Monge) by adding constant values (minl mli) to the columns. Finding all minima within the rows of implicitly defined totally monotone matrices of shape n m (n m) is possible in O(m) using the SMAWK algorithm (Aggarwal et al., 1987).

Using the SMAWK algorithm, dynamic programming (Wilber, 1988; Galil & Park, 1990) allows us to solve the concave least weight subsequence problem in O(n) if the concave cost function C can be computed in O(1) time using O(n) time for preprocessing. While the works (Wilber, 1988; Galil & Park, 1990) focus on concave cost functions C, we notice that the presented algorithms in (Wilber, 1988; Galil & Park, 1990) will also work if the corresponding transposed cluster cost matrix MT is only totally monotone. As we will see in the next chapter, univariate microaggregation for concave cost functions can be phrased as a least weight subsequence problem with an non-concave adapted cost function Cadapt. The corresponding transposed cluster cost matrix MT is nonetheless totally monotone which allows for the algorithms from (Wilber, 1988; Galil & Park, 1990) to be applied to univariate microaggregation.

3 Faster Univariate Microaggregation

In the following we present what properties of the cost function can be used to efficiently solve the univariate microaggregation problem. We first reformulate UM as a least weight subsequence problem (LWSS) which allows for O(n2) algorithms. We call cost functions which allow this reformulation ordered minimizable. If cost functions additionally also have the splitting is beneficial property, O(kn) algorithms are possible. For the very common type of concave cost functions, we show that UM can be solved in O(n). Lastly, we present an O(n) algorithm, the staggered algorithm, which shows faster empirical running for the concave UM problem compared to more generic O(n) LWSS algorithms presented in (Wilber, 1988; Galil & Park, 1990).

3.1 Reformulating Univariate Microaggregation as a Least Weight Subsequence Problem

When comparing UM and LWSS, the most apparent issue is, that in the LWSS problem the values are inherently ordered while for UM no such inherent order is apparent for all cost functions. We will thus consider properties of cost functions that imply an inherent order. We will now characterize those cost

Published in Transactions on Machine Learning Research (10/2024)

functions that allow a reformulation of the univariate microaggregation problem as a least weight subsequence problem. Parts of the below work is inspired by the treatment of SSE by Domingo-Ferrer & Mateo-Sanz (2002).

Definition 3 (Pairwise ordered sets). We call two (multi-)sets A and B pairwise ordered iff all elements in one set are smaller or equal than those in the other, i.e. a A,b B a b or a A,b B b a. We call a partition pairwise ordered iff all pairs of multi-sets in the partition are pairwise ordered.

Equipped with this definition we can characterize those cost functions whose UM problem can be reformulated as a least weight subsequence problem.

Definition 4 (Ordered minimizable). We call a total cost TC ordered minimizable if it is sufficient to consider only pairwise ordered partitions when minimizing the total cost TC over all relevant1 partitions of the universe Ω.

To grasp the importance of definition 4, lets consider the case of minimizing an ordered minimizable total cost for a bipartition. If there are n elements there are n 1 pairwise ordered partitions splitting the data in two parts, but there are 2n 2 partitions in total. So being able to only check all pairwise ordered partitions makes a huge computational difference.

While Definition 4 is probably the least restrictive definition, most widely used cost functions fulfill the more restrictive following definition which mixes both the UM and k-means restrictions.

Definition 5 (q-partition ordered minimizable). A total cost function TC is q-partition ordered minimizable iff for all partitions P = {{X1, . . . , Xq}} with multi-sets of cardinalities SP := {{|Xi| for i {1, . . . , q} }} there is a pairwise ordered partition P consisting of multi-set with cardinalities SP = SP that fulfills TC(P ) TC(P).

Theorem 6. If a total cost function is 2-partition ordered minimizable, then it is ordered minimizable.

In the proof of Theorem 6 we show that if a cost function is 2-partition ordered minimizable it is q-partition ordered minimizable for all q. This implies that it is ordered minimizable. For the full proof see Appendix B.5.1 in the appendix. Using Theorem 6, it can be shown that the following cost functions are ordered minimizable.

Corollary 7. The total costs TC = P C(X) associated to following cost functions C are ordered minimizable:

Sum of Squares Error C(X) = SSE(X) = P x X(x X)2

Sum of Absolute Error C(X) = SAE(X) = P

x X |x M(X)|, where M is the median of X

Maximum distance C(X) = C (X) = maxx X |x X|, where X = (max(X) + min(X))/2

Round up error C(X) = C (X) = P

x X |x max(X)|

Round down error C(X) = C (X) = P

x X |x min(X)|

For the proof of Corollary 7 see Appendices B.5.3 to B.5.6. The sum of absolute error cost has been previously proposed under the name absolute deviation from median (ADM) (Kabir & Wang, 2011). For the scenario in which all values xi are integers, a sum-of-squares distortion metric with rounded mean was also proposed (Mortazavi, 2020) to be able to perform all operations on integers only.

When dealing with any ordered minimizable cost function, we thus can sort the input data either in ascending or descending order (ascending is usually preferred for numerical reasons). Depending on the data, sorting can be done in O(n log n) or O(n) (e.g. radix-sort on small integers). Currently, sorting is part of any practical algorithm to solve UM exactly and we thus consider the time complexity of all algorithms on sorted data, thereby neglecting the potential additional cost of O(n log n).

1For UM the relevant partitions are those which respect the minimum cluster size constraint. For k-means the relevant partitions are those of size k.

Published in Transactions on Machine Learning Research (10/2024)

If cost functions are 1) ordered minimizable, and 2) can be computed in O(1) based on an O(n) preprocessed data, we solve UM in O(n2) time and O(n) space. This can be done by adapting the cost function of the classic algorithm for LWSS problems such that large cluster costs are returned if the cluster size is smaller than k:

Cadapt(i, j) =

( VAL j i < k, C(i, j) else. (3)

Here VAL is an arbitrary large enough number, such that the cluster cost C is always smaller than VAL, i.e., C(i, j) < VAL for all valid choices i < j. The following lemma ensures that we can use this adaptation for all previously considered cost functions. Lemma 8. The cost functions introduced in corollary 7 can be computed in O(1) using O(n) time for preprocessing and consuming no more than O(n) additional memory.

For the proof of Lemma 8 see Appendix B.3.

Many cost functions are even more structured, which allows faster algorithms as we will see in the next subsection.

3.2 Univariate Microaggregation for Cost Functions where Splitting is Beneficial

If the ordered minimizable cost function also has the splitting is beneficial property (i.e., eq. 2 holds), we can improve upon the O(n2) time complexity for the classical algorithm. The main insight is that we do not need to consider clusters that contain less than k or more than 2k 1 points. We can forbid these clusterings by adapting the normal cluster cost. For too small/too large clusters we return large costs which ensure the forbidden assignments are never selected. Thus we adapt any cost function with the splitting is beneficial property as

Cadapt(i, j) =

VAL j i < k, VAL j i 2k, C(i, j) else. (4)

Similar to LWSS we can define a matrix M with entries Mi,j = minl<i mli + Cadapt(i, j). If we consider the structure of the M matrix (see colors in Figure 3), we find that there are only k entries per column that need to be considered (green entries in Figure 3). In particular, when computing the minimum value in each column of the matrix M only entries Mij with j 2k + 1 i j k are necessary to be computed to find the minima of column j. This allows a complexity improvement of the classic algorithm from O(n2) to O(kn). We conclude this subsection by noting the introduced considerations are applicable to the presented cost functions. Corollary 9. All the costs presented in Corollary 7 have the splitting is beneficial property.

This is true as they are concave (see next subsection, Lemma 10), they have C 0 and C(i, i + 1) = 0.

3.3 Univariate Microaggregation for Concave Cost Functions

Besides the classical algorithm used to solve least weight subsequence problems, more advanced algorithms (Wilber, 1988; Galil & Park, 1990) are available to solve concave least weight subsequence problems in O(n). We observe that those algorithms will also work if the associated cluster cost matrix MT is only totally monotone. The previous definition of Cadapt (eq. (3)) needs to be slightly modified such that the matrix MT is totally monotone for these algorithms to be applicable.

Cadapt(i, j) =

( VAL(i) j i < k, C(i, j) else. (5)

Published in Transactions on Machine Learning Research (10/2024)

We again choose the large costs VAL(l) large enough such that C(i, j) < VAL(l) for any l (1 l n) and any valid i, j. Further, we require VAL(l) < VAL(m) for l < m to ensure total monotonicity of the transposed cluster cost matrix MT .

If our cost function is concave this definition of the adapted cost Cadapt ensures that, the transposed cluster cost matrix MT is totally monotone (see appendix for a proof, a visualization of such a matrix is provided in Figure 3 if we assume the orange entries are also green). For cost functions C in that can be evaluated in O(1) time (for O(n) preprocessed data) we can use Cadapt as cost function and solve this problem using dynamic programming algorithms, e.g., via the algorithm by Wilber (1988), or an algorithm proposed by Galil & Park (1990), to obtain a solution for univariate microaggregation on sorted input in O(n) time and space.

For the special case of the sum of squares cost, it has been previously shown, that the cost is concave (Grønlund et al., 2018; Wu, 1991) and can be computed in O(1) (using O(n) time and space for preprocessing) using a cumulative sum approach. These results have been used to speed up the 1D k-means problem (Grønlund et al., 2018), but have not yet been used to solve univariate microaggregation tasks. Let us assert that all of the considered cost functions in Corollary 7 are indeed concave:

Lemma 10. The cost functions from Corollary 7 are concave.

For the proof of Lemma 10 we refer to the appendix Appendix B.4.

Putting together Corollary 7, Lemma 8, and Lemma 10 we obtain the following result:

Theorem 11. On sorted data, we can compute optimal univariate microaggregation in O(n) time and space for the cost functions introduced in Corollary 7.

Naturally, Theorem 11 extends to unsorted data if we pay at most an additional O(n log n) cost for sorting.

3.4 Univariate Microaggregation for Concave Costs where Splitting is Beneficial

If a cost function is concave and splitting is beneficial we can combine restricting the cluster size and preserving total monotonicity to arrive at the following adapted cost

Cadapt(i, j) =

VAL(i) j i < k VAL( i) j i 2k C(i, j) else (6)

As before VAL(l) should be large enough such that C(i, j) < VAL(l) for all valid i, j, but this time not only for positive l, but also for negative l with values beteen n l n. We require again that VAL(l) < VAL(m) for l < m to ensure total monotonicity of the transposed cluster cost matrix MT . We proove that for this definition of the adapted cost Cadapt the transposed cluster cost matrix MT is totally monotone in appendix Appendix B.2.2. The structure of the cluster cost matrix correspond to this definition of Cadapt is displayed in Figure 3.

When a cost function has both the splitting is beneficial property and is concave, it is possible to significantly decrease the runtime of the classical algorithm introduced in Section 3.2. To this end it is instrumental to observe that in totally monotone matrices, the positions of the minima of the rows are non decreasing (see Alg. 2 for pseudo-code). Although this does not provide an asymptotic runtime improvement, it greatly decreases empirical runtime for larger k (see Figure 4).

By incorporating the cluster size constraints, it is moreover possible to improve the generic algorithms for least weight subsequence problems (Wilber, 1988; Galil & Park, 1990) for concave cost functions with the splitting is beneficial property. We present an algorithm for this problem in the next subsection.

3.5 The Staggered Algorithm

When solving UM, the algorithm by Wilber (Wilber, 1988) and the algorithm by Galil and Park (Galil & Park, 1990) are both generic dynamic programming algorithms that work with any kind of ordered minimizable concave cost function. They achieve the linear running time by repeatedly calling the SMAWK

Published in Transactions on Machine Learning Research (10/2024)

algorithm Aggarwal et al. (1987) to compute the minima of rows of submatrices of the totally monotone matrix MT . These algorithm only use the concavity and are oblivious to the restriction on the cluster size and the resulting structure of the MT matrix (compare green entries in Figure 3). In contrast, we have the O(kn) algorithm (Domingo-Ferrer et al., 2008) or the adapted classical algorithm (Section 3.2) which are unaware of the concavity of the cost function and use the restrictions on cluster sizes to achieve speed. By using both the restrictions on the cluster sizes and the concavity of the cost function we can obtain an algorithm that empirically runs faster than the other O(n) algorithm on UM tasks, and the asymptotic runtime is still O(n). We present the pseudo code for one such algorithm, the staggered algorithm , in Alg. 1. The main idea of the staggered algorithm is to apply the SMAWK algorithm to matrices along the diagonals that correspond to valid cluster assignments (see Figure 3 for a visualization of this idea). For UM tasks, the algorithm is empirically faster as fewer calls of Cadapt(i, j) are wasted on combinations of i, j which are forbidden by cluster size constraints. Lastly, the staggered algorithm is conceptually simpler than the other two O(n) algorithms which employ a guessing and recovery strategy to cope with unavailable entries minl<i mli. For the staggered algorithm, no such guessing needs to be employed, as all necessary entries are available when making calls to the SMAWK algorithm.

This algorithm has space and runtime complexity of O(n) for cost functions that can be computed in O(1) using O(n) preprocessing. To see this, not that any call to the SMAWK algorithm is O(k) and there at most n

k calls to the SMAWK algorithm. This means that all the calls to the SMAWK algorithm are at most O(n).

Input: sorted array v, minimum size k, cost function calculator C Output: array implictly representing the optimal univariate microaggregation of v

1 let n = length(v)

2 if n 2k 1 then

3 return all zeros array of length n

4 C.do_preprocessing(v, k)

5 SMAWK.cost = C

6 for i {k, . . . , 2k 1} do

7 SMAWK.Min Total Cost[i] = C.calc(0, i)

8 SMAWK.Argmin Total Cost[i] = 0

9 SMAWK.col_min(2k, min(3k 1, n), k, 2k 1)

10 if n 3k 1 then

11 return SMAWK.Argmin Total Cost

12 f, R = (n 3k + 1) divmod k

13 for i {3, . . . , 2 + f} do

14 SMAWK.col_min(ik, (i + 1)k 1, (i 2)k + 1, ik 1)

15 if R > 0 then

16 i = 3 + f

17 SMAWK.col_min(ik, n, (i 2)k + 1, n 1)

18 return SMAWK.Argmin Total Cost

Algorithm 1: Pseudo code for the staggered algorithm. This algorithm uses both the concavity of the cost function through the use of the SMAWK algorithm and the restrictions, that clusters are of size at least k and at most 2k 1. When calling SMAWK.col_min(i,j,k,l), the SMAWK algorithm is applied to compute the column minima of a submatrix of the implicitly defined cost matrix M which contains columns i through j and rows k to l (see boxes in Figure 3). After executing the SMAWK algorithm, the minimal total cost for each column is stored in Min Total Cost and the row that corresponds to that value is stored in Argmin Total Cost.

Published in Transactions on Machine Learning Research (10/2024)

m01 m02 m03 m04 m05 m06 m07 m08 m11 m12 m13 m14 m15 m16 m17 m18 m21 m22 m23 m24 m25 m26 m27 m28 m31 m32 m33 m34 m35 m36 m37 m38 m41 m42 m43 m44 m45 m46 m47 m48 m51 m52 m53 m54 m55 m56 m57 m58 m61 m62 m63 m64 m65 m66 m67 m68 m71 m72 m73 m74 m75 m76 m77 m78

Figure 3: Visualization of how the staggered algorithm processes the cluster cost matrix M for a cost function that is concave and has the splitting is beneficial. The matrix is shown for n = 8 and k = 2. Red entries correspond to clusterings which contain clusters that are to small. Orange entries correspond to entries that contain unnecessarily large clusters and green entries need to be processed to find an optimal univariate microaggregation clustering. In Alg. 1 lines 5-7 initialize m02 and m03. Line 9 processes the first submatrix indicated by the square overlay in columns 4 and 5. For most inputs the loop in lines 12 & 13 processes most of the matrix (here only columns 6 & 7). Lastly, potentially remaining columns are processed in line 15-17.

4 Univariate Microaggregation on real hardware

So far we have considered the mathematical foundations underlying fast UM algorithms. In practice though, the computations involved are not happening at arbitrary precision. On most current hardware, calculations involving reals are executed with fixed width floating point numbers (e.g. 32 or 64 bits). We notice that utilizing the presented algorithms/cost functions can result in suboptimal clusterings, due to finite precision of these floating point numbers. We thus dedicate the next two subsections to the analysis and mitigation of errors caused by finite precision floating point numbers. At the end of the section we also highlight a real world runtime improvement possible for some of the cost functions. As a running example we will consider the sum of squares cost function, though many of the considerations carry over to other cost functions as well.

The default approach of computing the SSE in O(1) is to first precompute (in O(n)) the cumulative sums

s(1)(j) = X

i j xi and s(2)(j) = X

with s(1)(0) = s(2)(0) = 0. Then we can compute for i < j

SSE(i, j) = s(2)(j) s(2)(i)

s(1)(j) s(1)(i) 2

which is constant time irrespective of the values of i and j.

4.1 Floating Point Errors During Preprocessing

When computing the cumulative sum during preprocessing, the running sum can grow and reach quantities which are no longer represented well numerically. This can lead to erroneous cluster cost being calculated which in turn may lead to suboptimal clusterings. As an example, computing the sum of squares error according to eq. (7) on the integers (i.e. the universe is Ω= {0, 1, 2, . . . }) using 64bit floats returns the wrong result SSE(300 079, 300 082) = 1 (which should be 2). We can do better than the simple cumulative sum approach, by observing that we only need to consider C(i, j) with j i 2k 1 for UM. This means we don t need the full cumulative sum and can get away with sums of fewer elements.

We therefore consider partial cumulative sums which are cumulative sums which we restart every k-th time. So in preprocessing we compute the partial cumulative sums

s(1) i,j = X

Published in Transactions on Machine Learning Research (10/2024)

for i = 0, k, 2k, . . . and j(i) = i + 1, i + 2, . . . , i + k, and we set s(1) i,i = 0. We can do this similarly for the cumulative sums of the squared entries. Both of these preprocessing steps take O(n) time and O(n) space.

Using these partial cumulative sums, we can also compute the sum of squares error for univariate microaggregation. Let us denote with m, r = i mod j the modulus m and remainder r of the integer division of i by j. Then we can compute the SSE from the si,j for a univariate microaggregation problem with min size k: For i < j compute mi, ri = i mod k and mj, rj = j mod k. We now notice that for valid entries i, j we have mj mi 1 as j i 2k 1. Lets first consider mj = mi, then we compute the SSE as

SSE(i, j) = s(2) mj,rj s(2) mj,ri s(1) mj,rj s(1) mj,ri 2 /(j i).

In the case that mj = mi + 1 we compute the SSE as

SSE(i, j) = s(2) mj,rj + s(2) mi,k s(2) mi,ri s(1) mj,rj + s(1) mi,k s(1) mj,ri 2 /(j i).

Both of these are O(1) but the numbers involved in the cumulative sums are much smaller which can decrease floating point representation errors. This consideration is particularly important for the case of SSE and MAE both of which use cumulative sums in their computations.

4.2 Floating Point Errors when Computing the Total Cost Function

When computing the total cost function TC we are computing the sum of at least n 2k 1 values with equal sign, thus TC is increasing in magnitude with each summand. Yet, floating point numbers have a higher absolute resolution close to zero. It would thus be numerically advantageous if we could keep the total cost TC close to zero. For cost function with the splitting is beneficial property this can be achieved without increasing the runtime complexity by regularly resetting the stored TC values. Most algorithms work by first computing and storing the minimal total cluster cost TCmin(q) := mini miq of the first q points (see Section 2.1 for the definition of mij). Then the algorithms find the optimal clustering of the next q + p points based on the optimal clustering of those previous values up to q. We notice that TCmin(l) increases as l increases but for cost functions where splitting is beneficial we only need to know the last 2k 1 total cost values, i.e., we need to know TCmin(q) for q l (2k 1) to compute the total cost TCmin(l). Thus for certain algorithms we can reduce the growth of the total cost as follows. On a high level, we subtract the smallest stored and still needed minimal total cost from the other stored and still needed minimal total cost values at regular intervals. As an example, consider the classical algorithms simple and simple+ with their pseudocode in Alg. 2. To implement the strategy, we could expand the loop spanning lines 7-9 in Alg. 2:

1 [continuing the loop in lines 7-9 in Alg. 2]

2 Let imin be the lower bound of the argmin in line 8 of Alg. 2

3 Min Needed Min Cost = minimin+1 i j Min Cost[i]

4 for i {imin + 1, . . . , j} do

5 Min Cost[j] = Min Cost[j] - Min Needed Min Cost

Here we subtract the still needed minimal total cost Min Needed Min Cost from those Min Cost values that are still needed in future iterations. The values that are still needed in future iterations of the main loop are designated mostly by the lower bound imin. This lower bound differs depending on whether we consider the simple or simple+ algorithm. It is possible to do a similar procedure for the staggered algorithm without changing the runtime complexity of either algorithm. Using such a procedure, we can ensure that the total cost TC doesn t grow as a function of n 2k 1 but remains nearly constant (in the case of simple/simple+) or growth as a function k in the case of the staggered algorithm. Thus for small k and large n the described procedure allows us to make use of the higher absolute floating point resolution near zero.

4.3 Faster Cost Function Evaluations

As a final consideration for implementing UM on real hardware, we consider how to compute the cost function with fewer instructions. All the here presented algorithms work by computing the optimal partitions of the

Published in Transactions on Machine Learning Research (10/2024)

sorting simple simple+ galil park wilber staggered

101 102 103 104

minimum group size 푘

runtime [s]

Figure 4: Runtime on random data of size 1 million as a function of minimum group size k. Colors indicate different algorithms as displayed in the legend above. The time to sort the elements is included as a baseline (black). Shown are mean and standard deviation over 10 repeats. As cost function, the sum-of-squares cost function as introduced in section 4.1 is used. The simple dynamic program (denoted as simple , pseudocode in Alg. 2) is representative of algorithms previously proposed (e.g. Domingo-Ferrer & Mateo-Sanz 2002). It is slower than the newly proposed dynamic program (simple+, pseudocode in Alg. 2) for any minimum group size k. For small values of k, the O(kn) dynamic programs simple and simple+ are faster than the O(n) algorithms staggered, Wilber, and Galil Park. For large values of the minimum group size k 30 the O(n) dynamic programs are faster than even the improved O(kn) program.

first q elements of the universe Ω. Let us denote these first q elements with Ω[. . . , q]. From the optimal partition of the first q points, the optimal partition of the first q + 1 points of the universe can then be computed. During this process we only compare the total cost of partitions of the same set Ω[. . . , q]. Stated differently, we compare the total cost TC(PΩ[...,q]) of one partition PΩ[...,q] of the first points Ω[. . . , q] with the total cost of another partition P Ω[...,q] of the same set Ω[. . . , q]. If we are only interested in the resulting clustering and not in the actual cost of the clustering, we can minimize cost functions which require fewer operations to compute than the original cost functions. This idea is more easily understood with an example. Lets assume we want to minimize the SSE on partitions of the first q elements of the universe Ωthen for two partitions PΩ[...,q] and P Ω[...,q] of this set,

TCSSE(PΩ[...,q]) TCSSE(P Ω[...,q])

x Ω[...,q] x2 X

x Ω[...,q] x2 X

CSSE(X) = X2

Overall, we find the clustering that minimizes SSE is equivalent to finding the clustering minimizing CSSE if we only use algorithms that compare clusterings of the same set Ω[. . . , q]. We can find similar expressions

Published in Transactions on Machine Learning Research (10/2024)

101 102 103 104

minimum group size 푘

runtime [s]

Figure 5: Runtime impact of different methods to compute the cost functions. Colors indicate algorithms, linestyles indicate methods to compute the cost functions. Shown are mean and standard deviation aggregated over 10 runs. The colors red/blue indicate the staggered/simple+ algorithm respectively (same colors as Figure 4). The dashed lines are the default cumulative sum approach (eq. 7), the solid lines are the partial cumulative sum method introduced in section 4.1, and the dotted lines indicate the alternative cost function approach introduced in section 4.3. We see that overall for the same algorithm the alternative cost function is fastest with the default cumulative sum approach being not much slower. The partial cumulative sum approach comes with a big runtime penalty.

for some cost functions, as well:

C (X) = |X| max(X) C (X) = |X| min(X)

5 Experiments

In the following we show to kinds of experiments. Firstly, we perform runtime experiments to evaluate whether the theoretical considerations made lead to performance improvements on real hardware. In the second set of experiments we evaluate whether projection based approaches are a possible competitor to solve the multivariate microaggregation task.

5.1 Runtime Experiments

We compare the algorithms by Galil and Park, Wilbers algorithm, and two versions of a simple dynamic program, one which does only use the restrictions on cluster sizes (simple) and another one which additionally uses that the row minima of the M T matrix are non decreasing (simple+). All the methods were implemented in python and compiled with the numba compiler. We additionally include the time to sort the initially unsorted data of the same size for comparison. We generate our synthetic data set by sampling one million reals uniformly at random between zero and one.

For low values of the minimum group size k the simple dynamic programs are faster than the O(n) algorithms, with the simple+ algorithm outperforming the simple algorithm in terms of computation time. These simple algorithm become slower than the more complex dynamic programs when the minimum group size k exceeds 250. When comparing the O(n) algorithms, the staggered algorithm is overall about 20 percent faster than other algorithms. Overall, we see that the algorithms including both the concavity of the cost function and the minimum / maximum group size constraints, i.e., the simple+ and staggered algorithm, are faster than those algorithms that don t include both constraints.

Published in Transactions on Machine Learning Research (10/2024)

PCA near-neigh random 10 random 50 random 100 kmeans 0.5 kmeans 1 kmeans 2

0 10 20 30 40 50 minimum group size k

Reconstruction Error (SSE)

(a) Dataset EIA

0 10 20 30 40 50 minimum group size k

Reconstruction Error (SSE)

(b) Dataset Tarragona

Figure 6: Reconstruction Error for the Multivariate Microaggregation task on real world datasets. We show the Reconstruction Error or loss (lower is better) for the SSE cost function for different values of minimum group size k. We compare four different approaches, two projection based approaches PCA and random projection as well as two approaches designed for the multivariate case the kmeans based approach and a nearest-neigh based approach by (Domingo-Ferrer et al., 2008). For the PCA and random approach, we use a vector to project the multivariate task into one dimension and solve the univariate microaggregation task with the methods introduced here. For the random projection approach one run consist of making 10, 50, and 100 random projections (as indicated in the legend) and taking their minimum. For the k-means + cleanup based approach, we first performed k-means where the number of clusters k is f dataset_size/k where f is a factor of 0.5, 1 or 2 as indicated in the legend. To make sure that the obtained data fulfills the minimum group size constraint we merge clusters of to small size with the closest clusters until all clusters are large enough. For the random, nearest-neighbor and k-means based approach we show means and standard deviation over 10 runs. On the EIA dataset (a), the k-means based approaches outperform the projection based approaches by a large margin. The nearest-neighbor based approach is performing on par with the k-means based approach for low k but gets increasingly worse with increasing k. On the Tarragona dataset (b) the picture is not as clear. For low values of k 5 the k-means based approach is better or equally good as the projection based approaches while for larger values of k the projection based approaches show less reconstruction error. The PCA based approach is usually slighlty worse than the random projection based approach. The nearest-neighbor approach on par with the k-means based approach for k less than 5 but it gets increasingly worse with increasing values of k.

In Figure 5 we compare the different ways of computing the cost functions exemplary for the simple+ and staggered algorithm. Within the same algorithm we find that the method from section 4.3 is fastest with the default cumulative sum (eq. 7) being a close competitor. The numerically more stable method introduced in section 4.1 is about four times slower than the default method.

5.2 Solving Multivariate Microaggregation through projection

In practice one is frequently met with Multivariate Microaggregation problems, i.e. there are multiple descriptors that need to be aggregated simultaneously. In this section we compare approaches designed for the multivariate problem with approaches which work by projecting the multivariate problem into 1d, solving the univariate problem with the methods discussed here and then project the solution back to the multivariate setting.

The most simple approach to project the multivariate data into one dimension is to use principal components analysis (PCA). PCA identifies the axes of maximal variance. We then use the axis with maximal variance to project the multivariate data into a single dimension, we then solve the 1d problem to obtain the cluster assignment. A similar projection based approach uses random projections, i.e. we sample random vectors

Published in Transactions on Machine Learning Research (10/2024)

with random values between zero and one as our projection vector. To achieve reasonable results, we try multiple random vectors and take the one with lowest multivariate loss. We compare these two projection based approaches with two approaches designed for the multivariate case. The first one works by obtaining a heuristic solution to the k-means problem. Because there might be clusters which are to small, we merge the clusters of size less than k with the closest other cluster (measured by the distance of the centroids) until no cluster of too small size remains. We randomize the order of this merging procedure. As a second method designed for the multivariate case we consider the nearest-neighbor based approach by Domingo-Ferrer et al. (2008). While the original proposed method allows for overlapping clusters2, i.e. the subsets Xi (compare Definition 1) need not be pairwise disjoint, we slightly modified their procedure such that they are pairwise disjoint for a more fair comparison. We found experimentally that whether we allow for overlapping clusters or not made pretty much no difference in the reconstruction error3 (the mean reconstruction errors are within 6% of each other, none of the two clearly beating the other and the standard deviations are twice as large).

To evaluate these approaches we use two real world datasets that have been used in previous research on multivariate microaggregation (Domingo-Ferrer & Mateo-Sanz, 2002). The EIA dataset (Brand et al., 2002)4 consist of 4092 instances and 15 columns, of which we use the 10 numeric columns. The Tarragona dataset (Brand et al., 2002)5 has 834 instances and 13 columns. We average the results of the k-means based approach, the nearest-neighbor based approach, and the random projection approach over 10 runs and show the mean and standard deviation. We measure the quality of the obtained clustering by the multivariate loss (SSE). The results are shown in Figure 6.

We find that on the EIA dataset the k-means based approaches greatly outperform, i.e. have lower reconstruction error, compared to the projection based approaches for almost all values of minimum group size k we considered. It is just that when k approaches 50 the two approaches become more similar in terms of their reconstruction error. The nearest neighbor based approach performs on par with the k-means based approach for values of k less than 10. For larger values of k it becomes increasingly worse compared to k-means based approach even overtaking the projection based approaches at k = 30. On the Tarragona dataset and low values of k > 5 the k-means based approaches are winning by a small margin but for larger values of k the projection based approach are performin better than the k means based approaches. When comparing the PCA approach to the random prohjection approach, the random projection approach is usually slightly better than the PCA based approach. From these preliminary experiments, it seems, that the projection based approaches might be a valid competitor on some datasets, while on other datasets, they are not viable. Overall the nearest-neighbor based method performs on par with the k-means based approach for small values of k (e.g. less than 10) and is inferior in reconstruction error for larger values of k.

6 Conclusions

We considered the problem of univariate microaggregation. We characterized properties of cost functions that lead to complexity improvements over the naive algorithm with exponential runtime. By mapping univariate microaggregation to a least weight subsequence problem we showed that UM for ordered minimizable cost functions is solvable in O(n2). If splitting is beneficial, a maximum group size constraint allows to improve this complexity to O(kn). For ordered minimizable, concave cost functions a different strategy allows to find an optimal solution to univariate microaggregation in O(n) time and space on sorted data. These results apply to many popular cost functions such as sum of squares, sum absolute error, maximum difference as well as round up/down costs. By incorporating both a maximum group size constraint and the concavity of cost functions, we are able to provide an improved O(kn) algorithm (simple+) and an improved O(n)

2The definition of the algorithm is not sound unless this is assumed, even though this is not explicitly mentioned in the original source. In spirit of the presented algorithm we considered the below highlighted overlapping case. 3For the overlapping case we assume that there are not only the overlapping sets Xi used to compute the cluster centroids but for each of them a set Yi Xi exists which highlights the non-auxillary nodes in Xi. The Yi are pairwise disjoint and unify to Ω. We obtain the reconstruction error the following way. For each element we find the only set Yi it is contained in and compute the (squared) distance to the centroid of the correspond set Xi. To obtain the final reconstruction error we sum up those distances for all elements. This naturally captures a slightly different notion of k-privacy as the elements of the new dataset Ω are obtained by aggregating at least k elements of the original dataset. 4https://github.com/sdc Tools/sdc Micro/blob/master/data/EIA.rda 5https://github.com/sdc Tools/sdc Micro/blob/master/data/Tarragona.rda

Published in Transactions on Machine Learning Research (10/2024)

algorithm (staggered) which demonstrate faster empirical runtime in comparison to other algorithms in their respective complexity class.

Limitations In practice one is frequently met with multivariate microaggregation (MM) problems and one of the heuristics used to solve MM problems is to devise an order on the points (see (Mortazavi, 2020) and references therein). The resulting ordered multivariate microaggregation problem is sometimes referred to as a univariate microaggregation problem even though it really is a least-weight subsequence problem with multi-dimensional cost functions. These multi-dimensional cost functions may no longer be concave as the concavity of the cost functions relies on the sorted (i.e. arranged in increasing/decreasing order rather than any fixed but arbitrary order) 1D nature of the points. Thus the presented O(n) algorithms are not applicable to such ordered multivariate microaggregation problems. As the simple O(kn) algorithm does not rely on the concavity of cost functions, it can still be used to solve the ordered multivariate microaggregation problem if a multi dimensional equivalent of the splitting is beneficial property holds.

It is well known, that the privacy concept of k-anonymity may provide insufficient protection if for example the confidential attributes are not sufficiently diverse. Caution should thus be taken if k-anonymity, or microaggregation to achieve k-anonymity, is applied in practice. For practical anonymity guarantees, the concept of differential privacy (Dwork et al., 2006; Dwork, 2008) provides a more robust protection.

While k-anonymity relies on the intuition hidden in a crowd of size at least k , the key idea of differential privacy is plausible deniability , i.e. whether or not any individual is included in the data cannot be said with certainty. To achieve differential privacy guarantees, usually the output of an algorithm is sufficiently perturbed. Li et al. (2012) showed that using k-anonymity, it is possible to essentially perform input perturbation and still achieve differential privacy if the k-anonymity type algorithm is augmented with two steps. First, the k-anonymity algorithm needs to be ϵ-safe: Intuitively this means the k-anonymity algorithm needs to output non-optimal clusterings with nonzero probabilities. Second, k-anonymity is applied to a subsample from the original dataset instead of the full dataset (for the full details see (Li et al., 2012)).

Concave costs where splitting is not beneficial The majority of cost functions considered are concave and have the splitting is beneficial property which allows all the presented algorithms to be applied. Yet, there are concave cost functions which do not have the splitting is beneficial property. As a simple example consider adding a constant group cost δ to a concave cost function C which also has the splitting is beneficial property to penalize having too many clusters (i.e., we obtain the adjusted cost function ˆC(i, j) := C(i, j) + δ). This adjusted cost function is still concave but no longer has the splitting is beneficial property. Thus the generic O(n) algorithms (Wilber, 1988; Galil & Park, 1990) may still be used to solve UM problems for the modified cost function ˆC while the O(kn) algorithms (simple, simple+) and the staggered algorithm are not applicable.

Relation to k-means The definitions presented here also allow to have a more fine grained understanding of other 1D clustering problem with different cost functions. As an example let us consider the k-means problem which restricts the number of clusters to be exactly k. An existing O(kn2) algorithm (Grønlund et al., 2018) solves the 1D k-means problem if the cost function considered is ordered minimizable. If the cost is additionally also concave, O(kn) and O(n log U)6 algorithms exist (Grønlund et al., 2018). The splitting is beneficial property, while not explicitly used in algorithms, seems to be implicitly used when formulating the k-means problem. The most popular formulation of the k-means problem asks to minimize the clustering cost subject to exactly k clusters, while minimizing costs subject to at most k clusters could also be thinkable. However, these two formulations will result in identical optimal clusterings if the cost function has the splitting is beneficial property.

Mixed UM and k-means scenario Another way to avoid using overly many clusters in a UM setting could be to impose a maximum number of clusters constraint in addition to the usual minimal group size constraint. This would correspond to a mixed UM and k-means scenario. We implicitly showed that the considered cost functions are ordered minimizable even in a mixed UM and k-means scenario, as cost functions that are q-partition ordered minimizable also encompass this mixed scenario. Hence, algorithms used to solve the 1D k-means problem (see (Grønlund et al., 2018) for many such algorithms) work correctly in the mixed UM and k-means scenario if we adapt the cost function as explained in eq. (3).

6log U is a number of bits used

Published in Transactions on Machine Learning Research (10/2024)

Regularized UM problem The usual UM problem forbids any clustering with too few entries, which might be overly restrictive when k is larger. In such cases it might be worth considering a regularized UM problem, i.e., we don t entirely forbid invalid clusterings but instead return an additional cost when the minimal cluster size constraint is violated. This can for example be achieved by defining the regularized cluster cost as

Creg(i, j) = C(i, j) +

( λ (k (j i)) j i < k 0 else

with regularization parameter λ 0. By adjusting the regularization parameter λ to be large we can make it more and more unlikely that the cluster size constraint is violated. If C was concave, the regularized UM problem may also be minimized by the generic concave algorithms (Wilber, 1988; Galil & Park, 1990). Similarly, if splitting was beneficial for C, the regularized UM may be solved by a slightly adjusted7 standard algorithm in O(kn).

Alok Aggarwal, Maria M Klawe, Shlomo Moran, Peter Shor, and Robert Wilber. Geometric applications of a matrix-searching algorithm. Algorithmica, 2(1-4):195 208, 1987.

Charu C. Aggarwal and Philip S. Yu. A General Survey of Privacy-Preserving Data Mining Models and Algorithms, pp. 11 52. Springer US, Boston, MA, 2008.

Ruth Brand, J. Domingo-Ferrer, and J. M. Mateo-Sanz. Reference data sets to test and compare SDC methods for protection of numerical microdata. European Project IST-2000-25069 CASC, 2002. URL https://research.cbs.nl/casc/CASCrefmicrodata.pdf.

Rainer E Burkard, Bettina Klinz, and Rüdiger Rudolf. Perspectives of monge properties in optimization. Discrete Applied Mathematics, 70(2):95 161, 1996.

Jordi Casas-Roma, Jordi Herrera-Joancomartí, and Vicenç Torra. k-degree anonymity and edge selection: improving data utility in large networks. Knowledge and Information Systems, 50:447 474, 2017.

D Defays and MN Anwar. Masking microdata using micro-aggregation. Journal of Official Statistics, 14(4): 449, 1998.

Josep Domingo-Ferrer and Josep Maria Mateo-Sanz. Practical data-oriented microaggregation for statistical disclosure control. IEEE Transactions on Knowledge and data Engineering, 14(1):189 201, 2002.

Josep Domingo-Ferrer and Vicenç Torra. Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Mining and Knowledge Discovery, 11:195 212, 2005.

Josep Domingo-Ferrer, Francesc Sebé, and Agusti Solanas. A polynomial-time approximation to optimal multivariate microaggregation. Computers & Mathematics with Applications, 55(4):714 732, 2008.

Cynthia Dwork. Differential privacy: A survey of results. In Manindra Agrawal, Dingzhu Du, Zhenhua Duan, and Angsheng Li (eds.), Theory and Applications of Models of Computation, pp. 1 19. Springer, 2008. ISBN 978-3-540-79228-4. doi: 10.1007/978-3-540-79228-4_1.

Cynthia Dwork, Frank Mc Sherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Shai Halevi and Tal Rabin (eds.), Theory of Cryptography, pp. 265 284. Springer, 2006. ISBN 978-3-540-32732-5. doi: 10.1007/11681878_14.

Z Galil and K Park. A linear-time algorithm for concave one-dimensional dynamic programming. Information Processing Letters, 33(6):309 311, 1990.

7In the regularized UM scenario, we need to also consider clusterings in which some clusters contain less than k points. This corresponds to the red entries in Figure 3 of which there are only k-1 per column. We thus process 2k-1 entries per column resulting in overall O(kn) runtime.

Published in Transactions on Machine Learning Research (10/2024)

Allan Grønlund, Kasper Green Larsen, Alexander Mathiasen, Jesper Sindahl Nielsen, Stefan Schneider, and Mingzhou Song. Fast exact k-means, k-medians and bregman divergence clustering in 1d, 2018.

Stephen Lee Hansen and Sumitra Mukherjee. A polynomial algorithm for optimal univariate microaggregation. IEEE transactions on Knowledge and Data Engineering, 15(4):1043 1044, 2003.

Md Enamul Kabir and Hua Wang. Microdata protection method through microaggregation: A median-based approach. Information Security Journal: A Global Perspective, 20(1):1 8, 2011.

Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In 2007 IEEE 23rd International Conference on Data Engineering, pp. 106 115, Istanbul, Turkey, 2007. IEEE.

Ninghui Li, Wahbeh Qardaji, and Dong Su. On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security, ASIACCS 12, pp. 32 33. Association for Computing Machinery, 2012. doi: 10.1145/2414456.2414474.

Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):3 es, 2007.

Reza Mortazavi. Improved univariate microaggregation for integer values. ISe Cure, 12(1):35 43, 2020.

Anna Oganian and Josep Domingo-Ferrer. On the complexity of optimal microaggregation for statistical disclosure control. Statistical Journal of the United Nations Economic Commission for Europe, 18(4): 345 353, 2001.

Luc Rocher, Julien M Hendrickx, and Yves-Alexandre De Montjoye. Estimating the success of reidentifications in incomplete datasets using generative models. Nature communications, 10(1):1 9, 2019.

P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, Computer Science Laboratory, SRI International, 1998. URL http://www.csl.sri.com/papers/sritr-98-04/.

Pierangela Samarati. Protecting respondents identities in microdata release. IEEE transactions on Knowledge and Data Engineering, 13(6):1010 1027, 2001.

Latanya Sweeney. k-anonymity: A model for protecting privacy. International journal of uncertainty, fuzziness and knowledge-based systems, 10(05):557 570, 2002.

Robert Wilber. The concave least-weight subsequence problem revisited. Journal of Algorithms, 9(3):418 425, 1988.

Xiaolin Wu. Optimal quantization by matrix searching. Journal of algorithms, 12(4):663 673, 1991.

Yan Yan, Anselme Herman Eyeleko, Adnan Mahmood, Jing Li, Zhuoyue Dong, and Fei Xu. Privacy preserving dynamic data release against synonymous linkage based on microaggregation. Scientific Reports, 12(1):2352, 2022.

Published in Transactions on Machine Learning Research (10/2024)

A Additional algorithms

A.1 Pseudo code for the classic algorithms (simple/ simple+)

Input: sorted array v, minimum size k, cost function calculator C Output: array implictly representing the optimal univariate microaggregation of v

1 let n = length(v)

2 if n 2k 1 then

3 return all zeros array of length n

4 C.do_preprocessing(v, k)

5 Min Cost/Argmin = zero based arrays of size n+1

6 Min Cost[0] = 0

7 for j {1, . . . , n} do

8 Argmin[j] = arg min0 i<j Min Cost[i] + C.calc(i, j)

9 Min Cost[j] = Min Cost[Argmin[j]] + C.calc(Argmin[j], j)

10 return Argmin[1:n]

Algorithm 2: Pseudo code for the classical algorithm used to solve least weight subsequence problems. If this algorithm is used to solve univariate microaggregation with ordered minimizable cost functions, the constraint on the arg min in line 8 becomes 0 i < j k. If splitting is beneficial the constraint on the arg min in line 8 is max(0, j 2k + 2) i j k (this is the simple algorithm). Lastly, for concave cost functions that also have the splitting is beneficial property, the bound on the arg min is max(Argmin[j-1], j-2k+2) i j k with Argmin[0]=0 (this is the simple+ algorithm).

A.2 Pseudo code for the backtrack algorithm

Input: zero indexing based implicit cluster array b Output: array explictly representing the clustering

1 out = zero indexing based array of same length as b

2 p=length(b); Num Clusters=0

3 while true do

4 Num Clusters +=1

5 for i {b[p 1], . . . , p 1} do

6 out[i] = Num Clusters

8 if p = 0 then

10 for i {0, . . . , length(b) 1} do

11 out[i] = Num Clusters - out[i]

12 return out

Algorithm 3: The backtrack algorithm converts an implicit cluster representation array b into an explicit cluster representation. All the presented algorithms return such an implicit representation. For example an implicit cluster array [0, 0, 1, 1, 2, 2] represents the clustering [0, 1, 2, 2, 2, 2] that is the first and second element are their own cluster, and the remaining elements are in one cluster. The backtrack algorithm runs in O(n).

Published in Transactions on Machine Learning Research (10/2024)

B Mathematical details and proofs

B.1 Multivariate microaggregation is NP-hard for many cost functions

It was previously known that multivariate microaggregation is NP-hard for the sum of squares error cost function. We noticed that the proof provided by Oganian & Domingo-Ferrer (2001) could be easily adapted for other cost functions as well.

Let li(C), i {3, 4, 5} denote the minimal cluster cost for cost function C computed for the structures of size i outlined in (Oganian & Domingo-Ferrer, 2001). For a (multi)set X = {{x1, . . . , xp}} with xi Rd let us denote with X(i) = {{x1 ei, . . . , xp ei}} the multi-set of values projected onto the i-th standard basis vector.

Theorem 12 (Generalizing the result from Oganian & Domingo-Ferrer (2001)). Microaggregation for cost functions C(d)(X) := Pd i=1 C(X(i)) is NP-hard in dimensions d 2 and k = 3 if splitting is beneficial for cost function C, and the ratios of minimal structure cost are l4(C)

3, and l5(C)

Computing the li, i {3, 4, 5} values for the cost functions in corollary 7, we get the following:

Corollary 13. Microaggregation is NP-hard in dimensions 2 and k = 3 for the cost functions SSE, SAE, and Round up/Round down cost.

The original result by Oganian & Domingo-Ferrer (2001) dealt with SSE only but the proof allows easy adaption for SAE, and round up/round down cost. Unfortunately, theorem 12 is not applicable to the maximum distance cost as both li ratios are too small.

B.2 Total monotonicity and concave cost functions

B.2.1 Monge matrices and their transpose are totally monotone

Before we start with the proof for concavity we recap the relationship of concave cost functions and totally monotone matrices. For a concave cost function we start by rearranging the quandrangle inequality eq. (1) as

C(a, c) C(a, d) C(b, c) C(b, d) for 0 a < b < c < d n

We then notice that now C(b, c) C(b, d) < 0 C(a, c) C(a, d) < 0. If we now consider the matrix C with entries Cij = C(i, j), all 2 by 2 submatrix of C have the form C(a, c) C(a, d) C(b, c) C(b, d)

for suitable choice of of the integers a < b, c < d. To check for monotonicity we see that C(b, c) < C(b, d) which is equivalent to C(b, c) C(b, d) < 0 which by our previous equality implies C(a, c) C(a, d) < 0 C(a, c) < C(a, d). That means that any 2 by 2 submatrix is totally monotone. Thus the entire matrix is totally monotone.

For the transpose CT we find that 2 by 2 submatrices are of the form C(a, c) C(b, c) C(a, d) C(b, d)

To check for total monotonicity we need to show that C(a, d) < C(b, d) C(a, d) C(b, d) < 0 implies C(a, c) < C(b, c) C(a, c) C(b, c) < 0. By rearranging the quadrangle inequality we find

C(a, c) C(b, c) C(a, d) C(b, d) for 0 a < b < c < d n.

Now C(a, d) C(b, d) 0 C(a, c) C(b, c) which is exactly what we need such that any 2 by 2 submatrix of CT is totally monotone. Thus the entire matrix CT is totally monotone.

Published in Transactions on Machine Learning Research (10/2024)

B.2.2 Proof that MT is totally monotone

We prove total monotonicity of MT by showing that all 2 by 2 submatrices fullfill the necessary equations. We never explicitly consider a submatrix but we choose two entries in the same row but different columns (col1 < col2). We then compare the values of MT at those locations (val1:= MT row,col1, val2:= MT row,col2). Depending on the size of the values, this imposes requirements on either the values below (in case of val1 > val2) or the entries above (in case of val1 val2). If row is such a row either above or below, we refer to the entries in the same columns as before as val1 := MT row ,col1, val2 := MT row ,col2.

MT of the adapted cost in eq. (5) Instead of referring to explicit values, we refer to the values by their colors. If the cluster corresponding to the entry is too small, the color is red else the color is green (these are the two cases in eq. (5)).

Case val1=green, val2 = green: val1 val2: The pairs above are either green-green, green-red, or red-red. For green-green pairs monotonicity is guaranteed by eq. (1). This implies total monotonicity as the transpose of monge matrices are also monge Burkard et al. (1996), monge matrices are totally monotone Burkard et al. (1996) and if you add constant values to the columns of totally monotone matrices they remain totally monotone Burkard et al. (1996). For green-red, and red-red pairs by design val1 < val2 . Overall values above col1 are less than or equal to those in col2. val1 > val2: The only case if green-green which is correct by monotonitcity.

Case val1=green, val2=red: The only case is val1 < val2, the pairs above are either green-red or red-red in either cases the values above col1 are smaller than those above col2.

Case val1=red, val2=red: The only case here is val1 < val2: all entries above are red-red as well for which val1 < val2

MT of the adapted cost in eq. (6) As before, instead of referring to explicit entries, we refer to the entries by their colors as visible in Figure 3, i.e. the first case in eq. (6) is red, the second orange and the third is green.

Case val1=green, val2 = green: val1 val2: The cases above are the same as in the previous consideration. val1 > val2: The pairs below are either green-green, orange-green or orange-orange. For green-green pairs monotonicity is guaranteed as in the val1 val2 case. For orange-green and orange-orange pairs, val1 > val2 by design.

Case val1=red, val2=red: the only case here is val1 < val2: all entries above are red-red as well for which val1 < val2

Case val1=orange, val2=green: The only case is val1 > val2, the possible pairings below are orangeorange or orange-green. We see that the values below col1 are larger than those below col2.

Case val1=orange, val2=red: The only case is val1 < val2, the pairs above are either orange-red or green-red. In both cases the values above column 1 are smaller than those above column 2.

Case val1=green, val2=red: The only case is val1 < val2, the pairs above are either green-red or red-red in either cases the values above col1 are smaller than those above col2.

Case val1=orange, val2=orange: The only case is val1 > val2, the pairs below are orange-orange. Thus the values below column 1 are larger than those in column 2.

B.3 Computational complexity of computing cost functions

This section shows, that all the cost functions presented in corollary 7 can be computed in O(1) performing at most O(n) time for preprocessing and using at most O(n) space. Throughout this section we assume that the input data is sorted, i.e. i < j xi xj. We also assume that i < j such that Xi,j = {{xi+1, . . . , xj}} is well defined.

Published in Transactions on Machine Learning Research (10/2024)

B.3.1 Sum of absolute errors

During preprocessing we compute and store the cummulative sum s(i) = Pi l=1 xl. The computation of that cummulative sum is O(n).

For a (multi-)set A we denote with (A)1/(A)2 the j |A|

2 k smallest/largest elements of A. Let l = j i

2 then we note, that

SAE(i, j) = X

x (Xi,j)2 x X

x (Xi,j)1 x

= s(j) s(j l) (s(i + l) s(i))

which is O(1) if the s(i) are stored in an array.

B.3.2 Maximum distance cost

For the maximum distance cost C we have

C (i, j) = max x Xi,j |x xi+1 + xj

2 | = xj xi+1

which is O(1) if the xi are stored in an array.

B.3.3 Round down/ round up cost

Similar as for SAE for precomputation we compute the cummulative sum s(i) = Pi l=1 xl, s(0) = 0 in O(n) and store all of the s(i) values. Then the round down cost is

l=i+1 xl xi+1 = s(j) s(i) (j i)xi+1.

The round up cost is

l=i+1 xj xl = (j i)xj (s(j) s(i)) .

Both are O(1) if the xi and s(i) are stored in an array.

B.4 Concavity of cost functions

B.4.1 Maximum distance cost is concave

In 1D the maximum distance cost is defined as C (a, b) = maxb l=a+1 |xl x a..b| where x a..b = xb+xa+1

2 is chosen such that it minimizes the max expression. Thus in 1D, the maximum distance cost can be written as C (a, b) = |xb x a..b| = |xa+1 x a..b| = xb xa+1

2 . We can now show, that the maximum distance cost is concave: C (a, c) + C (b, d) = xc xa+1

2 + xd xb+1

2 = xd xa+1

2 + xc xb+1

2 = C (a, d) + C (b, c).

Published in Transactions on Machine Learning Research (10/2024)

B.4.2 Round up/round down costs are concave

In 1D the round up cost is C (a, b) = Pb l=a+1 |xb xl| = (b a)xb Pb l=a+1 xl. We can now show, that the round up cost is concave:

C (a, c) + C (b, d) C (a, d) + C (b, c)

(c a)xc + (d b)xd

(d a)xd+(c b)xc

(c a)xc + (d b)xd (d a)xd + (c b)xc axc bxd axd bxc (b a)xc (b a)xd xc xd

which is true as the points are ordered in increasing order.

We can do a similar argument for the round down cost C (a, b) = Pb i=a+1 |xi xa| = Pb i=a+1 xi (b a)xa.

C (a, c) + C (b, d) C (a, d) + C (b, c)

i=b+1 xi (c a)xa+1 (d b)xb+1

i=b+1 xi (d a)xa+1 (c b)xb+1

(c a)xa+1 (d b)xb+1 (d a)xa+1 (c b)xb+1 cxa+1 + dxb+1 dxa+1 cxb+1 (d c)xa+1 (d c)xb+1 xa+1 xb+1

which is true as the points are sorted in increasing order.

B.4.3 SAE cost is concave

For a (multi-)set A let us denote with (A)1/2 the j |A|

2 k smallest/largest elements. For brevity let us define

x A x. Let A, B, C be the three sets obtained by the cut indicies a, b, c, d (A = {{xa+1, . . . , xb}}, B = {{xb+1, . . . , xc}}, C = {{xc+1, . . . , xd}}). Then quadrangle inequality for SAE reads:

C (a, c) + C (b, d) C (a, d) + C (b, c)

s((A B)1) + s((A B)2) s((B C)1) + s((B C)2)

s((A B C)1) + s((A B C)2) s((B)1) + s((B)1)

Published in Transactions on Machine Learning Research (10/2024)

To simplify the above expression we observe the following relations:

(A B)1 (A B C)1 RABC AB,1 := (A B C)1 \ (A B)1 (B C)2 (A B C)2 RABC BC,2 := (A B C)2 \ (B C)2 (B)1 (B C)1 RBC B,1 := (B C)1 \ (B)1 (B)2 (A B)2 RAB B,2 := (A B)2 \ (B)2

Then we have

C (a, c) + C (b, d) C (a, d) + C (b, c)

s(RABC AB,1) + s(RAB B,2) s(RABC BC,2) + s(RBC B,1)

which is true as s(RABC AB,1) s(RBC B,1) and s(RAB B,2) s(RABC BC,2). One can see that the first expression is true by increasing the size of A gradually. If |A| = 0 then certainly the lhs and rhs are the same. If we now add elements to A which are no larger than those in B then the lhs expression will only get smaller. One can employ a similar trick for the second expression by adding elements to C which no smaller than those in B, which only increase the rhs expression.

B.4.4 SSE cost is concave

We already explained how to compute the SSE cost in O(1) using O(n) for preprocessing in section 4.

B.5 Ordered minimizable

B.5.1 Proof of Theorem 6

Before we start with the actual proof lets provide some clarifications regarding multi-sets. When calling for the l < |X| smallest elements of a multi-set X, we mean the multi-set S X of size l which fullfills s S x (X\S) s x.

We conduct the proof of theorem 6 in two steps. First we show that q-partition ordered minimizable for all q implies ordered minimizable. We then show that if a cost is 2 partition ordered minimizable it is 2-partition ordered minimizable for all q. Lemma 14. If a cost function is q-partition ordered minimizable for all q then it is ordered minimizable.

We prove lemma 14 by contradiction. Lets assume that it is not enough to consider only pairwise ordered partitions when finding any partition minimizing a total cost which is q-partition ordered minimizabe for all q. But then there is a partition P (wlog. |P| = q) which minimizes the total cost TC subject to the either the UM or k-means constraint. But then as the total cost TC is q-partition ordered minimizable we are guaranteed a pairwise ordered partition P with has the same multiset of sizes as P but no larger total cost. Thus P also respects the UM or k-means constraint. Thus it would be enough to have considered only pairwise ordered partitions when minimizing the total cost on the relevant partitions, which concludes our contradiction.

Proof. Now we show that if a cost is 2-partition ordered minimizable, then it is q-partition ordered minimizable for all q by induction on q.

The base case is the assumption. Assume the TC is q-ordered minimizable. Let P = {{X0, . . . , Xq+1}} be a (q+1)-partition. We can construct a pairwise ordered partition P which has no worse TC than P: Let A1 = X1. From 2-partition ordered minimizable we know that we can obtain sets Bi+1, Ci+1 = pairwise_order(Ai, Xi+1) (see corollary 7). We then define Ai+1 := min(Bi+1, Ci+1) and Xi := max(Bi+1, Ci+1). Note that Ai+1 contains the smallest elements in Si+1 j=1 Xj. We can then define the partition Pi := {{ X1, . . . , Xi 1, Ai, Xi+1, . . . , Xq+1}}.

Published in Transactions on Machine Learning Research (10/2024)

We see that P1 = P and TC( Pi+1) TC( Pi). The latter is consequence from

TC( Pi+1) =

j=1 C( Xj) + C(Ai+1) +

j=i+2 C(Xj)

j=1 C( Xj) + C(Ai) +

j=i+1 C(Xj)

which is equivalent to C( Xi) + C(Ai+1) C(Ai) + C(Xi+1).

Here we note, that either |Ai+1| = |Xi+1| and | Xi| = |Ai| or |Ai+1| = |Ai| and | Xi| = |Xi+1|. In either case the equations can be written as

TC({L, R}) TC({Y, Z}) (9)

with |L| = |Y |, |R| = |Z| and L, R being pairwise ordered but this is exactly being 2-partition ordered minimizable which we have by assumption. Overall we have now found a partition Pq+1 = {{ X1, . . . , Xq, Aq+1}} with TC no worse than the TC of P. Now we consider the minimization problem of TC on Ω\Aq+1 subject to partitions of size q. By the induction hypothesis we are guaranteed a pairwise ordered partition {{ ˆX1, . . . , ˆXq}} that minimizes TC on Ω\ Aq+1. Now we have found a (q+1) ordered partition P = {{ ˆX1, . . . , ˆXq, Aq+1}} which has the same size multiset and no worse TC than P but is totally ordered (remember the ˆXi are pairwise ordered and all elements in Aq+1 are no larger than any elements in any of the ˆXi). This concludes the induction.

B.5.2 Ordered minimizable and quadrangle inequality

In this subsection we provide evidence that quadrangle inequality and ordered minimizable are indeed not the same concepts by providing a simple counter example.

Lemma 15. Quadrangle inequality does not imply ordered minimizable.

Proof. Let our universe Ω R be a finite set and

Cmin(X) := min(X).

Then by the following Cmin fullfills the quadrangle inequality (a < b < c < d)

Cmin(a, c) + Cmin(b, d) Cmin(a, d) + Cmin(b, c)

xa+1 + xb+1 xa+1 + xb+1

The last line is certainly true.

We will now show, that Cmin is not ordered minimizable. Let |Ω| 4 and let ω1 < ω2 < ω3 be the smallest, second smallest, and third smallest element of the universe Ω. Let the multi-set R := Ω\ {ω1, ω2, ω3}. Then for A = {ω1, ω3} and B = {ω2} R be two sets then TC({A, B}) = ω1 + ω2 but in every ordered partition with min(|L|, |R|) 2 we have that ω1 and ω2 are in the same set. But this means for any pairwise ordered sets L and R with min(|L|, |R|) 2 we have TC({L, R}) ω1 +ω3 > TC({A, B}). Thus Cmin is not ordered minimizable.

B.5.3 SSE is 2-partition ordered minimizable

Looking at the variance var(X) = 1 |X| P

x X(x X)2 = 1 |X| P

x X x2 X2 where the latter is a common

simplification. We now note, that SSE(X) = n var(X) = P

i x2 i n X2 thus when comparing two partitions

Published in Transactions on Machine Learning Research (10/2024)

P1 and P2 of the same original set we see, that

TSSE(P1) TSSE(P2)

X P1 n X X2 X

X P2 n X X2

For the case of two partitions P1 = {{A + A, B + B}} and P2 = {{A + B, B + A}} with | A| = | B| but A = B (think of exchanging elements A from A + A with the B elements in B + B). This means

TSSE(P1) TSSE(P2)

n A+ Aµ2 A+ A n B+ Bµ2 B+ B n A+ Bµ2 A+ B n B+ Aµ2 B+ A n A+ Aµ2 A+ A n A+ Bµ2 A+ B n B+ Aµ2 B+ A n B+ Bµ2 B+ B

Where n S denotes the size of set S and µS denotes the average of the set µ. To treat the lhs and rhs simultaneously, lets consider multisets X, Y, Z with |Y | = |Z|:

n X+ Y µ2 X+ Y n X Zµ2 X+ Z

s2 X + s2 Y + 2s Xs Y s2 X s2 Z 2s Xs Z

s2 Y + 2s Xs Y s2 Z 2s Xs Z

= 1 n X+ Y (s Y s Z) (s Y + s Z + 2s X)

= (s Y s Z) (µX+Y + µX+Z)

Bringing this together with the previous equations yields

TSSE(P1) TSSE(P2)

(s A s B)(µA+ A + µA+ B) (s A s B)(µB+ A + µB+ B)

µA+ A + µA+ B µB+ A + µB+ B µA+ B µB+ A µB+ B µA+ A

Where the the second last line is true if s A < s B. If we can maximize the right hand side of the last expression which is only a function of the partition P1, this is the same as minimizing the TSSE of P1 with respect to P2. The maximum of the rhs is achieved by choosing that B and B contain the largest elements while A and A contain the smallest elements. This naturally fullfills s A < s B as | A| = | B| (remember A = B). Thus we conclude that for all 2-partitions P2 we can find pairwise orderd P1 which has no worse TSSE, i.e. TSSE is 2-partition ordered minimizable.

B.5.4 Infinity norm is 2-partition ordered minimizable

We denote the smallest/largest element in Ωwith l/r. Lets assume we are given a partition P of the elements in Ωinto A and B. Wlog lets assume l A. We choose two sets L and R such that L contains the |A| smallest elements and R the remaining elements, then

C (L) + C (R) C (A) + C (B)

2 + r min(R)

2 + max(B) min(B)

max(L) min(R)

( max(A) min(B) if max(B) = r max(B) min(B) if max(A) = r

In the first case, we have that max(L) max(A) as L contains the smallest elements and |A| = |L|. Further, min(B) min(R) as R contains the largest elements and |B| = |R|. In the second case we observe that the left hand side is always non positive while the right hand side is always non negative.

Published in Transactions on Machine Learning Research (10/2024)

B.5.5 SAE is 2-partition ordered minimizable

For a multiset A we denote with a1/a2 the multisets of size j |A|

2 k containing the smallest/largest elements

of A. We define bi, li, ri similar for multisets B, L, R. We then note, that SAE(A) = P

x a2 x irrespective of whether |A| is even or odd. Lets consider a partition of the elements in Ωinto two sets A, B. Let us further the multisets c1/2 := l1/2 r1/2 and d1/2 := a1/2 b1/2. In a slight abuse of notation let us denote the index (i.e. position in an order of Ω) of the i-th smallest element in c1/2 as c1/2(i). Then xc1/2(1) is the smallest, xc1/2(2) is the second smallest element of c1/2 and so on. We similarly do that with the sets d1/2 (i.e. xd1/2(i) is the i-th smallest element of d1/2). We define L/R as the sets containing the smallest/largest elements of Ω. The size of L is min(|A|, |B|) if d2(1) max(|a1|, |b1|) + 1 and max(|A|, |B|) otherwise. The size of R is |Ω| |A|.

Lets us first consider the first |l1| of the sets c1/2 and d1/2. For i |l1| certainly c1(i) = d1(i) (these are the |l1| smallest elements of Ω). Further, c2(i) d2(i) as c2(1) d2(1) and the c2 continue consecutively for the next |l1| 1 elements. In the case |L| = min(|A|, |B|) the relation c2(1) d2(1) is pretty obvious as in this case c2(1) is minimal among all partitions with the same size. In the other case, we use that now d2(1) > max(|a1|, |b1|) + 1 c2(1).

For the remaining |r1| elements in c1/2 or d1/2, let i |r1| and we can do a similar argument but from the other direction. For ease of notation we set n 1

2 := |l1|+|r1|. We observe that c2(n 1

2 i) = d2(n 1

2 i) as these are the |r1| largest elements. Also c1(n 1

2 i) d1(n 1

2 i) as c1(n 1

2 ) and the next |r1| 1 smaller elements are consecutively. In the case |L| = max(|A|, |B|) the relation c1(n 1

2 ) is pretty obvious as in this case c1(n 1

2 ) is maximal among all partitions with the same size. For the other case let |A| |B|. We use that now d2(1) max(|a1|, |b1|) + 1 which implies that max(a1) d2(1) 1 max(|a1|, |b1|) or in other words all indices of elements in d1 that are larger than max(|a1|, |b1|) are from b1. Because there are at least |b1| elements larger than those in b1, we can conclude that d1(n 1

2 ) n |b1| c1(n 1

Turning back to the SAE we find that

SAE(L) + SAE(R) SAE(A) + SAE(B)

|a1|+|b1| X

i=1 xc2(i) xc1(i)

|a1|+|b1| X

i=1 xd2(i) xd1(i)

Looking at individual terms we find that xc2(i) xc1(i) xd2(i) xd1(i) because c2(i) d2(i) and c1(i) d1(i). If the individual lhs terms are no larger than the rhs term, then certainly the lhs sum is no larger than the rhs sum.

B.5.6 Round up/down costs are 2-partition ordered minimizable

Let {A, B} be a 2-partition of Ω. Wlog max(A) max(B) then certainly max(B) = max(Ω). We now choose L, R to be the smallest/largest elements of Ωwith sizes |A| and |B| respectively. Then the round-up cost is

C (L) + C (R) C (A) + C (B)

x L max(L) x+P

x R max(R) x P

x A max(A) x+P

x B max(B) x

|A| max(L) |A| max(A)

Where a lot of terms cancel as each x appears exactly once, and max B = max R. Certainly max(L) max(A) as both have the same number of elements and max L is minimal over all sets of that size.

We can do a similar argument for the round down cost. Let {A, B} be a 2-partition of Ω. Wlog min(A) min(B) then certainly min(A) = min(Ω). We now choose L, R to be the smallest/largest elements of Ωwith

Published in Transactions on Machine Learning Research (10/2024)

sizes |A| and |B| respectively. Then the round-down cost is

C (L) + C (R) C (A) + C (B)

x L x min(L)+P

x R x min(R) P

x A min(A) x+P

x B min(B) x

|B| min(R) |B| min(B)

min(R) min(B)

Which is true as min(R) is maximal among all sets of its size.

B.6 Negative results on ordered minimizable

The mean absolute error (MAE(X) = 1 |X| P

x X |x Median(X)|) is not ordered minimizable. One counter example is: For Ω= {{ 1, 0, 0, 0, 0, 1}}, the partition {{{ 1, 1, 0, 0}}, {{0, 0}}} has cost 1

2 but the two pairwise ordered partitions {{{ 1, 0, 0, 0}}, {0, 1}} and {{ 1, 0}, {{0, 0, 0, 1}}} both have costs 3

The ℓ2 cost (Cℓ2(X) = p

SSE(X)) is not ordered minimizable. One counter example is: For Ω= {{ 1, 0, 0, 0, 1}}, the partition {{ 1, 1, 0}, {{0, 0}}} has cost

2 1.41 but the two pairwise ordered par-

titions {{{ 1, 0, 0}}, {0, 1}} and {{{ 1, 0}}, {0, 0, 1}} both have costs q

The mean round down error (d (X) = 1 |X| P

x X x min(X)) is not ordered minimizable. One counter example is: For Ω= {{0, 0, 0, 1, 1, 2}}, the partition {{{0, 0, 0, 2}}, {{1, 1}}} has cost 1

2 but the two pairwise ordered partitions {{{0, 0, 0, 1}}, {1, 2}} and {{0, 0}}, {{0, 1, 1, 2}}} have costs 3

4 and 1 respectively.

One can construct a similar counter example for mean round up cost. The mean round up error (d (X) = 1 |X| P

x X max(X) x) is not ordered minimizable. One counter example is: For Ω= {0, 1, 1, 2, 2, 2}, the partition {{{0, 2, 2, 2}}, {{1, 1}}} has cost 1

2 but the two pairwise ordered partitions {{{0, 1, 1, 2}}, {{2, 2}}} and {0, 1}, {{1, 2, 2, 2}}} have costs 1 and 3

4 respectively.

B.7 Ordered minimizable but splitting is not beneficial

We were wondering whether we could provide a cost function which is ordered minimizable but does not have the property, that splitting is beneficial. Let us denote with X>M(X) the set of elements in the set X larger than median of X. Let our universe consist only of non negative reals i.e. Ω R . Then we can define the a cost function C (X) = 2 |X|α X

x X>M(X) x.

If you consider your universe to be a set of item prices, the cost function describes a scenario where you pay a discounted price on only the most expensive half of items. The discount parameter 0 α 1 controls the amount of discount provided. Values of α close to 1 indicate that you get a significant discount if you buy in bulk, while low values of α indicate very low discount when buying bulk.

The cost function C (X) is ordered minimizable but it does not have the splitting is beneficial property for α > 0. For α = 1 an optimal UM clustering is just a single cluster containing all points independent of the actual universe Ω.