# Learning Multiple Secrets in Mastermind

Milind Prabhu (Department of Computer Science and Engineering, University of Michigan) and David Woodruff (Department of Computer Science, Carnegie Mellon University). Equal contribution.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024.

Abstract

In the Generalized Mastermind problem, there is an unknown subset $H$ of the hypercube $\{0,1\}^d$ containing $n$ points. The goal is to learn $H$ by making a few queries to an oracle which, given a point $q \in \{0,1\}^d$, returns the point in $H$ nearest to $q$. We give a two-round adaptive algorithm for this problem that learns $H$ while making at most $\exp(\tilde{O}(\sqrt{d \log n}))$ queries. Furthermore, we show that any $r$-round adaptive randomized algorithm that learns $H$ with constant probability must make $\exp(\Omega(d^{3^{-(r-1)}}))$ queries even when the input has $\mathrm{poly}(d)$ points; thus, any $\mathrm{poly}(d)$-query algorithm must necessarily use $\Omega(\log\log d)$ rounds of adaptivity. We give optimal query complexity bounds for the variant of the problem where queries are allowed to be from $\{0,1,2\}^d$. We also study a continuous variant of the problem in which $H$ is a subset of unit vectors in $\mathbb{R}^d$, and one can query unit vectors in $\mathbb{R}^d$. For this setting, we give an $O(n^{\lceil d/2 \rceil})$-query deterministic algorithm to learn the hidden set of points.

1. Introduction

Mastermind is a classic codebreaking board game that originated in 1970 and over the past few decades has inspired several lines of research in theoretical computer science. The game is played by two players: the codemaker, who chooses a secret sequence of four code pegs, each of which has one of six colors; and the codebreaker, who wishes to determine the sequence by guessing various four-peg patterns. For each pattern the codebreaker guesses, they learn the number of pegs in their guess that are of the correct color and appear in the correct position, as well as the number of pegs that are of the correct color but are in the wrong position. The codebreaker wishes to learn the pattern while making as few guesses as possible.

The work of (Knuth, 1977) showed that the original Mastermind game has an optimal strategy that uses 5 queries in the worst case. The obvious generalization of the Mastermind game to $n$ pegs and $k$ colors was studied in (Chvátal, 1983), which gave an optimal strategy up to constant factors when $k \le n^{1-\varepsilon}$ for any fixed constant $\varepsilon > 0$ and $n$ arbitrarily large. Better strategies for the regime when $k \ge n$ were proposed in (Chen et al., 1996) and subsequently improved in (Goodrich, 2009).

In this paper, we study variants of the Mastermind problem in which the codemaker chooses not one hidden point but a set $H$ of $n$ hidden points. The codebreaker operates under a certain query model and has to learn all the points in $H$ while minimizing the number of queries they make. Specifically, we consider three variants of this problem, each of which is defined by the space from which $H$ is chosen and the query model under which the codebreaker operates. In the first variant, we consider the setting of the problem where the hidden points and queries lie in the hypercube.

Problem 1. The hidden set $H$ is a subset of $\{0,1\}^d$ of size $n$. The codebreaker queries $q \in \{0,1\}^d$ and learns $\arg\min_{x \in H} \mathrm{dist}(x, q)$, i.e., the point in $H$ with the least Hamming distance to $q$.
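For concreteness, the following is a minimal Python sketch of the Problem 1 oracle. The class name `NearestPointOracle` and the brute-force implementation are our own illustrative choices, not part of the paper; it is only meant to make the query model concrete on toy instances.

```python
import random

def hamming(x, y):
    """Hamming distance between two equal-length 0/1 tuples."""
    return sum(a != b for a, b in zip(x, y))

class NearestPointOracle:
    """Oracle for Problem 1: answers a query q with the hidden point nearest to q."""
    def __init__(self, hidden_points):
        self.H = [tuple(p) for p in hidden_points]

    def query(self, q):
        return min(self.H, key=lambda x: hamming(x, tuple(q)))

# Tiny usage example on a random instance.
if __name__ == "__main__":
    d, n = 16, 4
    H = {tuple(random.randint(0, 1) for _ in range(d)) for _ in range(n)}
    oracle = NearestPointOracle(H)
    q = tuple(random.randint(0, 1) for _ in range(d))
    print(oracle.query(q))
```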
Next, we consider a continuous variant of the problem.

Problem 2. The hidden set $H$ is an $n$-point subset of $S^{d-1}$, the unit sphere in $d$ dimensions. In each query, the codebreaker picks $q \in S^{d-1}$ and learns $\arg\min_{x \in H} \|x - q\|_2$, i.e., the point in $H$ that has the least Euclidean distance to $q$.

We also study the hypercube version of the problem with a stronger query model.

Problem 3. The hidden set $H$ is a subset of $\{0,1\}^d$ of size $n$. The codebreaker queries $q \in \{0,1,2\}^d$ and learns $\min_{x \in H} \mathrm{dist}(x, q)$, i.e., the Hamming distance of $q$ to the point in $H$ nearest to it.

One motivation for the study of such a query model with extra characters is the work of (Hu et al., 2022). They study the problem in which a secret binary string in $\{0,1\}^d$ has to be recovered via access to distance oracles for non-decomposable distance metrics such as edit distance, $p$-Dynamic Time Warping, and Fréchet distances. They showed that for the exact recovery problem under some distance metrics, such as Dynamic Time Warping, recovery is provably impossible unless extra characters are allowed in the queries. For recovery using the edit distance oracle, they gave algorithms with better query complexity when queries were allowed to have extra characters. The question of what advantage allowing extra characters gives us in our setting is also of interest.

While our main goal is to design query-efficient algorithms for Problems 1-3, we are also interested in the adaptivity of algorithms. The adaptivity of an algorithm is the number of rounds over which the algorithm makes queries and is a measure of its parallelizability. Formally, we have the following definition:

Definition 1.1 (r-Round Adaptive Algorithm). An algorithm is said to be $r$-round adaptive if it makes queries in $r$ batches $B_1, B_2, \ldots, B_r$ and, for each $j \in [r] \setminus \{1\}$, the queries made by the algorithm in the $j$-th batch $B_j$ are a function only of the responses to queries in the batches $B_1, \ldots, B_{j-1}$.

While it would be ideal to design algorithms that are both query-efficient and use few rounds of adaptivity (i.e., are very parallelizable), we show that such algorithms do not exist for Problem 1. Given this reality, one of our main objectives is to understand the tradeoff between the query complexity of algorithms and their adaptivity.

1.1. Motivation for Generalized Mastermind

One strong motivation for studying these problems is that they are very fundamental algorithmic questions and, to the best of our knowledge, have not been studied previously. Furthermore, several other questions similar in spirit have been of interest to the community (see Section 1.3). We now list some areas where the generalized Mastermind problem has applications or is relevant.

Data Reconstruction from a Nearest Neighbor Data Structure. An obvious application of the generalized Mastermind problem is the efficient recovery of data given access to its nearest neighbor data structure. In particular, a better understanding of these problems may help prevent adversarial attacks from learning private data using a nearest neighbor data structure.

Optimization. Problem 1 can be formulated as the algorithmic task of recovering all global minima of the function $f(x) = \mathrm{dist}(x, H)$, whose input is a point $x \in \{0,1\}^d$ and whose output is the Hamming distance of $x$ to the hidden set $H$. Our lower bounds show that recovering all global minima even in such a structured and simple setting is hard.
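To make the optimization view concrete, here is a minimal Python sketch (the helper names are our own, not the paper's) of the objective $f(x) = \mathrm{dist}(x, H)$, together with a brute-force check that its global minima are exactly the hidden points.

```python
import itertools

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def f(x, H):
    """Objective from the optimization view: Hamming distance from x to the hidden set H."""
    return min(hamming(x, h) for h in H)

# Brute-force check on a toy instance: the zeros (global minima) of f are exactly H.
d = 6
H = {(0, 0, 0, 0, 0, 0), (1, 1, 1, 0, 0, 0), (1, 1, 1, 1, 1, 1)}
zeros = {x for x in itertools.product((0, 1), repeat=d) if f(x, H) == 0}
assert zeros == H
```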
The key difficulty in designing algorithms for Problem 1 stems from a phenomenon akin to getting trapped in local minima in optimization. In the lower bound for Problem 1 (i.e., Theorem 3.8), the hardness arises from a small set of hidden points $H' \subseteq H$ preventing the remaining points $H \setminus H'$ from being discovered. More precisely, $H'$ has the property that, with high probability, a uniformly random point from $\{0,1\}^d$ is closer to some point in $H'$ than to any point in $H \setminus H'$. Therefore, the algorithm has to actively evade points in $H'$ to discover points in $H \setminus H'$.

Adversarially Robust Learning. Deep learning methods, in general, are not robust to adversarial examples (Zhang et al., 2020; Miller et al., 2020; Akhtar & Mian, 2018). Recent developments such as (Lecuyer et al., 2019; Cohen et al., 2019) have led to methods achieving provable robustness with respect to $\ell_p$-norm perturbations for continuous domains. A standard procedure to adapt these ideas to discrete domains is to pre-process the inputs by first mapping them to $\ell_p$ space before classification. As noted by (Hu et al., 2022), efficient non-adaptive algorithms for exact recovery problems can be used to do exactly this. A non-adaptive algorithm for the generalized Mastermind problem, for instance, can be used to losslessly map the input to a $t$-dimensional vector of responses to the $t$ queries made by the algorithm.

1.2. Our Contributions

For Problem 1, we present a two-round adaptive randomized algorithm which makes $2^{O(\sqrt{d \log d \log n})}$ queries to retrieve $H$. On the negative side, we show that any $r$-round adaptive algorithm must make $\exp(\Omega(d^{3^{-(r-1)}}))$ queries even when the input has $O(d^3)$ points. Firstly, this implies that algorithms with $\mathrm{poly}(d)$ query complexity must use $\Omega(\log\log d)$ rounds of adaptivity. Also, for the special case of non-adaptive algorithms, this corresponds to an $\exp(\Omega(d))$ query lower bound, and for the case of two-round adaptive algorithms, a lower bound of $\exp(\Omega(d^{1/3}))$ queries.

For Problem 2, we propose a deterministic algorithm in Section 4. The algorithm makes $O(n^{\lceil d/2 \rceil})$ queries over $O(n + d)$ adaptive rounds.

Finally, we give a deterministic algorithm for Problem 3 that makes $O(nd)$ queries in Section 5. This shows that allowing extra characters in the queries of Problem 1 makes the problem much easier. A simple information-theoretic lower bound shows that this is essentially optimal; any randomized algorithm that succeeds with constant probability must make $\Omega(nd/\log d)$ queries.

1.3. Related Work

Past work has studied several variants of the Mastermind problem, sometimes under different names. The work of (Fernández et al., 2019) studied the Mastermind problem in which the codemaker picks $x \in \{-k, -k+1, \ldots, k-1, k\}^d$ and the codebreaker tries to infer $x$ by making $\ell_p$ distance queries: they pick $y \in \{-k, -k+1, \ldots, k-1, k\}^d$ of their choice and are told $\|x - y\|_p$. Their techniques give non-adaptive algorithms for separable distance metrics (metrics that decompose into coordinate sums) such as $\ell_p$-norms, smooth max, Huber loss, etc. The study of the Mastermind problem with non-separable distance metrics such as edit distance, $p$-Dynamic Time Warping, and Fréchet distances was initiated by (Hu et al., 2022). Inference problems involving graph metrics have been considered in (Jiang & Polyanskii, 2019; Rodríguez-Velázquez et al., 2014). Permutation-based variants of Mastermind are studied in (Afshani et al., 2019; El Ouali & Sauerland, 2020).
The well-known coin-weighing problem is equivalent to the original Mastermind problem with two colors (Alon et al., 1996). In this problem, given $n$ coins, each of whose weight is either $w_0$ or $w_1$, the goal is to determine the weight of each coin in a small number of weighings using a spring balance. Optimal bounds for this problem were achieved by (Bshouty, 2009). The problem of group testing, introduced by (Dorfman, 1943), is also closely related. In this problem, there is a group of people out of which some have been infected by a disease. There is a testing procedure that, when queried with a subset of people, can determine either that there is someone in the subset who is infected or that no one is. The goal of the problem is to identify the infected population using the testing procedure as few times as possible. The work of (Coja-Oghlan et al., 2020) obtains optimal non-adaptive and two-round adaptive group testing algorithms. Several other results for group testing are surveyed in (Aldridge et al., 2019).

2. Notation

We use $\tilde{O}(\cdot)$ to hide factors that are logarithmic in the dimension $d$. The set $\{1, \ldots, l\}$ is denoted by $[l]$. We use $\{0, 1, \ldots, k\}^d$ to denote the set of vectors each of whose coordinates is in $[k] \cup \{0\}$. For $x, y \in \{0, 1, \ldots, k\}^d$, the Hamming distance between $x$ and $y$, denoted by $\mathrm{dist}(x, y)$, is the number of coordinates $i \in [d]$ such that $x[i] \neq y[i]$. For a subset $S \subseteq \{0,1\}^d$ and a point $x \in \{0,1\}^d$ we use the notation $\mathrm{dist}(x, S)$ to denote $\min_{y \in S} \mathrm{dist}(x, y)$. The support of $x$, denoted by $\mathrm{Supp}(x)$, is the set of non-zero coordinates of $x$, and the Hamming weight of $x$, denoted by $\mathrm{wt}(x)$, is the cardinality of the support. For a set of coordinates $I \subseteq [d]$, the restriction of $x$ to $I$ is the vector in $\{0, 1, \ldots, k\}^{|I|}$ obtained by deleting all the coordinates of $x$ except those in $I$. We denote the restriction of $x$ to the set $I$ by $x_I$. For a set $S \subseteq \mathbb{R}^d$, the convex hull of $S$ will be denoted by $\mathrm{CONV}(S)$ and the span of the points in $S$ will be denoted by $\mathrm{Span}(S)$. For a subspace $W$ of $\mathbb{R}^d$, its orthogonal complement is denoted by $W^{\perp}$. We use $S^{d-1}$ to denote the unit sphere in $\mathbb{R}^d$.

3. Results for Problem 1

In this section, we present results for Problem 1. Recall that, in this setting, the hidden set $H$ consists of $n$ points from the $d$-dimensional hypercube $\{0,1\}^d$. The query model allows the codebreaker to query $q \in \{0,1\}^d$ to learn the point in $H$ with the smallest Hamming distance to $q$. We use $\mathrm{query}(q)$ to denote this point. In the following sections, we give upper and lower bounds on the query complexity of Problem 1.

3.1. A Simple 2-Round Adaptive Algorithm

We analyze the natural algorithm which makes random queries in the first round and then uses the information revealed to adaptively make queries in the second round.

Overview of Algorithm 1. In the first round, the algorithm queries $t = 2^{O(\sqrt{d \log d \log n})}$ points $y_1, y_2, \ldots, y_t$ uniformly at random from $\{0,1\}^d$. In the second round, the algorithm queries all points within Hamming distance $r$ of points discovered in the first round.

Algorithm 1 Two-Round Adaptive Algorithm
1: For $t := 2^{O(\sqrt{d \log d \log n})}$, query $t$ points $y_1, \ldots, y_t$ which are sampled independently and uniformly at random from $\{0,1\}^d$. Let $z_1, \ldots, z_t$ be the responses to the queries, where $z_i = \mathrm{query}(y_i)$.
2: For each $z_i$, query all points at Hamming distance at most $r$ from it, where $r = O(\sqrt{(d \log n)/\log d})$.
3: Output the set of all points discovered by the queries made by the algorithm.
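A minimal Python sketch of Algorithm 1 follows. The oracle interface, the helper names, and the explicit constants `c_t` and `c_r` are our own illustrative choices (the paper only fixes $t$ and $r$ asymptotically); the sketch is exponential-time and meant only to mirror the two-round structure on small instances.

```python
import itertools
import math
import random

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def hamming_ball(center, radius):
    """All points of {0,1}^d within Hamming distance `radius` of `center`."""
    d = len(center)
    for k in range(radius + 1):
        for idx in itertools.combinations(range(d), k):
            p = list(center)
            for i in idx:
                p[i] ^= 1
            yield tuple(p)

def two_round_recover(oracle, d, n, c_t=0.5, c_r=1.0):
    """oracle(q) must return the hidden point nearest to q (Problem 1 oracle)."""
    # Round 1: t uniformly random queries (constants chosen small enough for toy runs).
    t = int(2 ** (c_t * math.sqrt(d * math.log2(d + 2) * math.log2(n + 2))))
    first_round = {oracle(tuple(random.randint(0, 1) for _ in range(d)))
                   for _ in range(t)}
    # Round 2: query every point within Hamming distance r of each discovered point.
    r = int(c_r * math.sqrt(d * math.log2(n + 2) / math.log2(d + 2)))
    recovered = set(first_round)
    for z in first_round:
        for q in hamming_ball(z, min(r, d)):
            recovered.add(oracle(q))
    return recovered
```

With `oracle = lambda q: min(H, key=lambda x: hamming(x, q))` for a small hidden set `H`, the function returns a subset of `H` that, by Theorem 3.1, equals `H` with constant probability for suitable constants.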
Theorem 3.1. Let $H$ be a hidden set with $n = 2^{o(d/\log d)}$ points. Algorithm 1 makes at most $2^{O(\sqrt{d \log d \log n})}$ queries and recovers $H$ with probability at least $2/3$.

Our main lemma shows that, for any fixed $x \in H$, the first round recovers a point $z \in H$ such that $\mathrm{dist}(x, z) \le r = O(\sqrt{(d \log n)/\log d})$ with probability at least $1 - 1/(3n)$. It then follows by a union bound that, with probability at least $2/3$, each point in the hidden set is at distance at most $r$ from some point recovered in round 1. Thus, the algorithm recovers all points after the second round of queries with constant probability.

Lemma 3.2. Fix a hidden point $x \in H$ and let $z$ be the nearest point to $x$ among the points $z_i$ learned after the first round of queries. With probability at least $1 - 1/(3n)$, we have $\mathrm{dist}(x, z) \le r$.

Proof. Let $y = \arg\min_{y_i} \mathrm{dist}(x, y_i)$ be the nearest query to $x$ among the queries made in the first round. Let $x' \in H$ be an arbitrary but fixed hidden point satisfying $\mathrm{dist}(x, x') > r$ (the lemma trivially follows if no such $x'$ exists). We show that $\mathrm{dist}(x, y) < \mathrm{dist}(x', y)$ holds with probability at least $1 - 1/(3n^2)$. A union bound then shows that, with probability at least $1 - 1/(3n)$, we simultaneously have $\mathrm{dist}(x, y) < \mathrm{dist}(x', y)$ for all $x' \in H$ satisfying $\mathrm{dist}(x', x) > r$. It then follows that $\mathrm{dist}(x, \mathrm{query}(y)) \le r$ holds with probability at least $1 - 1/(3n)$, thus proving the lemma.

We have reduced our task to showing that $\mathrm{dist}(x, y) < \mathrm{dist}(x', y)$ holds with probability at least $1 - 1/(3n^2)$. We now make some assumptions that simplify the analysis. For the rest of the analysis, we assume that $x$ is the origin (i.e., all $d$ coordinates of $x$ are 0). This corresponds to shifting the origin to $x$ by XOR-ing all hypercube points with $x$. This operation preserves distances between any two points, so our analysis does not lose any generality. We now have $\mathrm{dist}(x, y) = \mathrm{wt}(y)$, where $\mathrm{wt}(y) = |\mathrm{Supp}(y)|$ is the number of coordinates in the support of $y$. It is also easy to see that $\mathrm{dist}(x', y) = \mathrm{wt}(x') + \mathrm{wt}(y) - 2I$ where $I := |\mathrm{Supp}(x') \cap \mathrm{Supp}(y)|$. It follows that $\mathrm{dist}(x, y) < \mathrm{dist}(x', y)$ is equivalent to

$$\mathrm{wt}(x') > 2I. \qquad (1)$$

In the rest of the proof, we focus on showing that Equation (1) holds with probability at least $1 - 1/(3n^2)$. We now define some useful events that will help us prove this upper bound on $I$. Informally, we let $G_1$ be the good event that $\mathrm{wt}(y) \le d/2 - \Omega(\sqrt{d \log t})$; this is useful because the fact that $y$ has small weight lets us upper bound the size $I$ of the common support of $x'$ and $y$. Also, we let $G_2$ be the good event that $I$ is not too much more than its expectation $\mathrm{wt}(y)\mathrm{wt}(x')/d$. For an integer $w \in [0, d]$, let $E_w$ denote the event that $\mathrm{wt}(y) = w$. Formally, let $G_1$ denote the event that $\mathrm{wt}(y) \in [d/4,\ d/2 - \sqrt{d \log t}/8]$, and let $G_2$ denote the event that $I < \mu + \sqrt{3\mu \log(12n^2)}$ where $\mu = \mathrm{wt}(x')\mathrm{wt}(y)/d$. Using some simple algebra, we can now show the following claim. The details are provided in Appendix B.

Claim 3.3. Suppose that the number of queries $t$ satisfies $t = 2^{\Omega((d \log n)/r)}$. If the good events $G_1, G_2$ both occur, then $\mathrm{wt}(x') > 2I$.

Note that our choice of the parameters $t, r$ in Algorithm 1 satisfies the condition of the above claim. Therefore, our task reduces to proving that $G_1 \cap G_2$ holds with probability at least $1 - 1/(3n^2)$, which finishes the proof of Lemma 3.2. This is what we do next.

First, we show that $G_1$ holds with high probability. We observe that $\mathrm{wt}(y)$ is the minimum of $t$ i.i.d. binomial random variables distributed as $\mathrm{Bin}(d, 1/2)$. The claim below then follows from standard concentration inequalities. Its proof is deferred to Appendix B.

Claim 3.4. $\Pr[G_1] \ge 1 - 1/(6n^2)$.
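As a quick numerical sanity check of the event $G_1$ (not part of the paper; numpy-based, with arbitrary toy parameters and natural logarithms, since the constant in the bound is immaterial here):

```python
import numpy as np

# wt(y) is the minimum of t independent Bin(d, 1/2) variables; empirically it should
# land in [d/4, d/2 - sqrt(d log t)/8] essentially always for these parameters.
rng = np.random.default_rng(0)
d, t, trials = 256, 4096, 200
mins = rng.binomial(d, 0.5, size=(trials, t)).min(axis=1)
lo, hi = d / 4, d / 2 - np.sqrt(d * np.log(t)) / 8
print("empirical Pr[G1] ~", np.mean((mins >= lo) & (mins <= hi)))
```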
Next, we want to show that the upper bound on $I$ guaranteed by event $G_2$ also holds with high probability. The key observation is the following: conditioned on $\mathrm{wt}(y) = w$, the distribution of $y$ is uniform over points in $\{0,1\}^d$ with weight $w$. Hence, the distribution of $I$ is hypergeometric with mean $\mu = w\,\mathrm{wt}(x')/d$. Using tail bounds for the hypergeometric distribution we then show that $I \le \mu + O(\sqrt{\mu \log n})$ holds with high probability. The full details are provided in Appendix B.

Claim 3.5. For any integer $w \in [d/4, d/2]$ we have $\Pr[G_2 \mid E_w] \ge 1 - 1/(6n^2)$.

The claim below follows from Claim 3.4, Claim 3.5, and the law of total probability. The details are again deferred to Appendix B.

Claim 3.6. $\Pr[G_1 \cap G_2] \ge 1 - 1/(3n^2)$.

This completes the proof of Lemma 3.2.

The following lemma bounds the number of queries made by Algorithm 1, concluding the proof of Theorem 3.1.

Lemma 3.7. Algorithm 1 makes at most $2^{O(\sqrt{d \log d \log n})}$ queries.

Proof. The number of queries in the first round is clearly $2^{O(\sqrt{d \log d \log n})}$. In round 2, the algorithm queries all points at Hamming distance at most $r = O(\sqrt{(d \log n)/\log d})$ from the points discovered in round 1. This is at most $t \sum_{i=0}^{r} \binom{d}{i} \le t\, d^{r+1} = 2^{O(\sqrt{d \log d \log n})}$.

3.2. A Lower Bound Against r-Round Adaptive Algorithms

The main goal of this section is to prove query complexity lower bounds. In particular, we study the trade-off between query complexity and adaptivity. We say that a randomized algorithm has success probability $p$ if, for each input $H \subseteq \{0,1\}^d$, it correctly learns all the points in $H$ with probability at least $p$. The main result of this section is a lower bound on the query complexity of $r$-round adaptive randomized algorithms which have constant success probability.

Theorem 3.8. Let $d$ be a sufficiently large integer and let $r = O(\log\log d)$ be a positive integer. Any $r$-round adaptive randomized algorithm for Problem 1 with success probability at least $2/3$ must make $\exp(\Omega(d^{3^{-(r-1)}}))$ queries, even when the size of the hidden set $H$ is promised to be $O(d^3)$.

Let $\mathrm{DET}(q, r)$ denote the set of all deterministic algorithms that make at most $q$ queries over $r$ adaptive rounds. We shall use the following version of Yao's lemma, which reduces proving randomized query complexity lower bounds to designing hard distributions for deterministic algorithms.

Lemma 3.9 (Yao's Principle). Suppose that there exists a distribution $D$ over instances of Problem 1 such that for every deterministic algorithm $A \in \mathrm{DET}(q, r)$ we have $\Pr_{H \sim D}[A \text{ succeeds on } H] < \delta$; then any $r$-round randomized algorithm which makes at most $q$ queries has success probability at most $\delta$.

This motivates the following definition of input distributions which are hard for deterministic algorithms.

Definition 3.10 ((d, m, q, r, δ)-Hard Distribution). Let $D$ be a distribution over subsets of $\{0,1\}^d$ that contain at most $m$ points. Such a distribution $D$ is called $(d, m, q, r, \delta)$-hard if any algorithm $A \in \mathrm{DET}(q, r)$ has at most a $\delta$ probability of learning a hidden set drawn from $D$.

It follows from Yao's lemma that if a $(d, m, q, r, \delta)$-hard distribution exists, then any $r$-round randomized algorithm with query complexity $q$ has success probability at most $\delta$. Next, we construct a hard distribution for non-adaptive algorithms.

Lemma 3.11 (Hard Distribution for 1-Round Algorithms). For some constant $c \in (0,1)$, there exists a distribution $D_1$ which is $(d, d^2, 2^{cd}, 1, 2^{-\Omega(d)})$-hard.

Proof. For a point $x \in \{0,1\}^d$ let $N(x, r) \subseteq \{0,1\}^d$ be the set of points at Hamming distance exactly $r$ from $x$.
The hard distribution $D_1$ generates an input $H$ as follows: (i) Sample a uniformly random point $u$ from $\{0,1\}^d$. (ii) Add all the points in $N(u, 2)$ to $H$. (iii) Add a uniformly random subset $S$ of $N(u, 1)$ to $H$.

We now show that for some constant $c \in (0,1)$, any algorithm in $\mathrm{DET}(2^{cd}, 1)$ learns $H \sim D_1$ with probability at most $2^{-\Omega(d)}$. The intuition is that if the number of queries is at most $2^{cd}$, then with high probability all the queries of the algorithm are at distance at least 2 from $u$. In this case, the points in $N(u, 2)$ block the algorithm from learning which subset $S$ of $N(u, 1)$ was added to $H$ in step (iii). We formalize this argument below.

Let $A \in \mathrm{DET}(s, 1)$ be a non-adaptive deterministic algorithm that queries the points $y_1, \ldots, y_s$. Let $E$ be the event that for all $i \in [s]$ we have $\mathrm{dist}(y_i, u) \ge 2$. Since $u$ is sampled uniformly at random, for a fixed $y_i$, $\mathrm{dist}(y_i, u) \ge 2$ holds with probability $1 - (1 + d)2^{-d} \ge 1 - 2d\,2^{-d}$. A union bound then gives $\Pr[E] \ge 1 - 2sd\,2^{-d}$. Now, observe that if event $E$ occurs, for each query $y_i$ we have $\mathrm{dist}(y_i, N(u, 2)) < \mathrm{dist}(y_i, N(u, 1))$. Moreover, since all the points in $N(u, 2)$ are in $H$, the responses to the queries are entirely determined by the choice of $u$ (and, in particular, are independent of $S$). It follows that if the event $E$ occurs, then the algorithm has at most a $2^{-d}$ probability of correctly guessing $S$. The success probability of the algorithm, therefore, is at most $\Pr[\bar{E}] + \Pr[E]\,2^{-d} \le 2sd\,2^{-d} + 2^{-d} = 2^{-\Omega(d)}$ if $s \le 2^{cd}$ for a sufficiently small constant $c \in (0,1)$.

[Figure 1: Illustration of how a hard distribution for r-round algorithms can be used to create a hard distribution for (r+1)-round algorithms.]

We now show that hard distributions can be constructed recursively: a hard distribution for $(r+1)$-round algorithms can be constructed using a hard distribution for $r$-round algorithms. The new distribution will, however, be over points whose dimension is significantly larger (in fact, it will be approximately the cube of the dimension of the former).

Lemma 3.12 (Inductive Step). Suppose that a $(t, m, q, r, \delta_1)$-hard distribution exists; then a $(t', m', q, r+1, \delta_1 + \delta_2)$-hard distribution also exists for some $t' \le 2500t^2(t + \log_2(q/\delta_2)) + t$ and $m' \le m + 25(t + \log_2(q/\delta_2))$.

Proof. Let $D_r$ be the $(t, m, q, r, \delta_1)$-hard distribution. Let $t' = 2500t^2(t + \log_2(q/\delta_2)) + t$. The hard distribution $D_{r+1}$ generates an input $H_{r+1} \subseteq \{0,1\}^{t'}$ as follows (see the illustration in Figure 1):

(i) Sample $H_r \subseteq \{0,1\}^t$ from the distribution $D_r$.
(ii) Let $A \subseteq \{0,1\}^{t'}$ be the set of points obtained by padding $(t' - t)$ zeros to the end of each point in $H_r$.
(iii) Let $\ell = 100t^2$ and $m_1 = 25(t + \log_2(q/\delta_2))$. Define $B = \{x_1, \ldots, x_{m_1}\}$ where $x_i$ is given by $x_i[j] = 1$ if $j \in [t + (i-1)\ell + 1,\ t + i\ell]$ and $x_i[j] = 0$ otherwise.
(iv) For $u$ sampled uniformly at random from $\{0,1\}^{t'}$, let $A_u = \{u \oplus x \mid x \in A\}$ and $B_u = \{u \oplus x \mid x \in B\}$, where $\oplus$ denotes the bitwise XOR operation.
(v) Let $H_{r+1} = A_u \cup B_u$.

It is easy to check that the bounds on the size of $H_{r+1}$ and the dimension $t'$ satisfy the lemma statement. Next, we prove that $D_{r+1}$ is indeed a hard distribution for $(r+1)$-round adaptive algorithms. Consider an algorithm in $\mathrm{DET}(q, r+1)$ and an input $H_{r+1}$ drawn from $D_{r+1}$. We show that after the first round of queries, with high probability, the algorithm gains no information about the set $H_r$. However, the input $H_r$ is itself drawn from a distribution that is hard for $r$-round algorithms; thus the algorithm has a low probability of learning $A_u$ (and hence $H_r$) in the remaining $r$ rounds.
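Before continuing with the proof, here is a minimal Python transcription of the two samplers, purely to make the constructions concrete; the function names are ours, and the recursive step is only practical for very small parameters since $t'$ grows cubically.

```python
import itertools
import math
import random

def sample_D1(d):
    """Sampler for the 1-round hard distribution D1 (Lemma 3.11)."""
    u = [random.randint(0, 1) for _ in range(d)]
    def flip(pt, idxs):
        q = list(pt)
        for i in idxs:
            q[i] ^= 1
        return tuple(q)
    H = {flip(u, idx) for idx in itertools.combinations(range(d), 2)}   # all of N(u, 2)
    H |= {flip(u, (i,)) for i in range(d) if random.random() < 0.5}     # random subset of N(u, 1)
    return H

def lift(H_r, t, q, delta2):
    """Recursive step of Lemma 3.12: lifts an r-round hard instance over {0,1}^t
    to an (r+1)-round hard instance over {0,1}^{t'}."""
    ell = 100 * t * t
    m1 = 25 * (t + math.ceil(math.log2(q / delta2)))   # log term rounded up for concreteness
    t_prime = t + m1 * ell                              # = 2500 t^2 (t + log2(q/delta2)) + t
    A = [tuple(x) + (0,) * (t_prime - t) for x in H_r]  # zero-padded copy of H_r
    B = [tuple(1 if t + (i - 1) * ell < j <= t + i * ell else 0
               for j in range(1, t_prime + 1)) for i in range(1, m1 + 1)]
    u = tuple(random.randint(0, 1) for _ in range(t_prime))
    xor = lambda p: tuple(a ^ b for a, b in zip(p, u))
    return {xor(p) for p in A} | {xor(p) for p in B}    # H_{r+1} = A_u ∪ B_u
```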
Let $C \subseteq \{0,1\}^{t'}$ denote the set of $2^t$ points whose support is a subset of the first $t$ coordinates. The main claim we show is that if $z$ is a uniformly random point in $\{0,1\}^{t'}$, then with high probability its distance to $B$ is less than its distance to any point in $C$. To prove this claim, we will require the following simple observation, whose proof is provided in Appendix B.

Claim 3.13. Let $x$ be a point in $B$ and $y$ be a point in $C$. If $z$ is a point in $\{0,1\}^{t'}$ satisfying $|\mathrm{Supp}(x) \cap \mathrm{Supp}(z)| > \ell/2 + t$, then $\mathrm{dist}(z, x) < \mathrm{dist}(z, y)$.

We now prove the main claim of the argument.

Claim 3.14. If $z$ is sampled uniformly at random from $\{0,1\}^{t'}$, then $\mathrm{dist}(z, C) > \mathrm{dist}(z, B)$ with probability at least $1 - \delta_2/q$.

Proof. Fix a point $y \in C$. We shall show that for a uniformly random point $z$, we have $\mathrm{dist}(z, y) \le \mathrm{dist}(z, B)$ with probability at most $2^{-t} q^{-1} \delta_2$. The claim then follows by a union bound over the $|C| = 2^t$ points in $C$. Let $x$ be a point in $B$. Since $z$ is chosen uniformly at random from $\{0,1\}^{t'}$ and $\mathrm{wt}(x) = \ell$, we have that $|\mathrm{Supp}(x) \cap \mathrm{Supp}(z)|$ is a $\mathrm{Bin}(\ell, 1/2)$ random variable. Using Fact A.2, the probability that $|\mathrm{Supp}(x) \cap \mathrm{Supp}(z)| \ge \ell/2 + 2t$ is at least $\frac{1}{15}\exp(-64t^2/\ell) = \frac{1}{15}\exp(-16/25) > 1/30$. It follows from Claim 3.13 that $\mathrm{dist}(z, x) < \mathrm{dist}(z, y)$ holds with probability at least $1/30$. Since $B = \{x_1, \ldots, x_{m_1}\}$ consists of points with pairwise disjoint supports, the random variables $|\mathrm{Supp}(x_i) \cap \mathrm{Supp}(z)|$ corresponding to different $i$ are mutually independent. Thus, the probability that $\mathrm{dist}(z, B) \ge \mathrm{dist}(z, y)$ is at most $(29/30)^{m_1} < 2^{-t} q^{-1} \delta_2$.

We now use the claim above to show that after the first round of queries, with high probability, the algorithm learns no information about the set of points $H_r$ sampled in step (i). Suppose that the algorithm queries the points $z_1, \ldots, z_s$ in the first round. Let $E$ be the event that $\mathrm{dist}(z_i \oplus u, C) > \mathrm{dist}(z_i \oplus u, B)$ holds for all $z_i$; equivalently, $\mathrm{dist}(z_i, u \oplus C) > \mathrm{dist}(z_i, B_u)$ for all $z_i$, where $u \oplus C = \{u \oplus x \mid x \in C\}$. Since $z_i \oplus u$ is distributed uniformly over $\{0,1\}^{t'}$, it follows from Claim 3.14 and a union bound that $\Pr[\bar{E}] \le \delta_2 s/q \le \delta_2$. Conditioned on event $E$, every query $z_i$ is strictly closer to $B_u$ than to $u \oplus C \supseteq A_u$, so the responses to the queries in round 1 are points of $B_u$; they are determined by $u$ alone and, in particular, are independent of the set $A_u$ and thus also of the points $H_r$ sampled in step (i). Therefore, if the event $E$ occurs, the conditional distribution of $H_r$ after the first round of queries is identical to the distribution of $H_r$ prior to the queries. Even if the algorithm is informed of $u$ after the first round of queries, it still has to learn the set of points $H_r$ in the remaining $r$ rounds. Since the points in $H_r$ are drawn from a $(t, m, q, r, \delta_1)$-hard distribution, the algorithm learns $H_r$ with probability at most $\delta_1$. Therefore the success probability of the algorithm is bounded by $\Pr[\bar{E}] + \Pr[E]\,\delta_1 \le \delta_1 + \delta_2$.

We now combine Lemma 3.11 and Lemma 3.12 to complete the induction and show the existence of a hard distribution against $r$-round adaptive algorithms. Due to space constraints, the proof of the lemma below has been moved to Appendix B.

Lemma 3.15. For any integer $t \ge c_1$ and any integer $r$ satisfying $1 \le r \le t$, there exists a distribution $D_r$ which is $((100t)^{3^{r-1}}, t^{3^r}, 2^{ct}, r, 1/3)$-hard, where $c_1$ is a sufficiently large constant and $c$ is a constant in $(0,1)$.

We now complete the proof of Theorem 3.8.

Proof of Theorem 3.8. Pick $d$ sufficiently large and $r = O(\log\log d)$ so that $r \le d^{3^{-(r-1)}}/100$. Setting $t = d^{3^{-(r-1)}}/100$ in Lemma 3.15 shows the existence of a $(d, d^3, 2^{\Omega(d^{3^{-(r-1)}})}, r, 1/3)$-hard distribution. Theorem 3.8 then follows by Yao's lemma.
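For intuition, instantiating Lemma 3.15 for two-round algorithms ($r = 2$) with $t = d^{1/3}/100$ recovers the bound quoted in Section 1.2:

$$\underbrace{(100t)^{3^{r-1}}}_{\text{dimension}} = (100t)^{3} = d, \qquad \underbrace{t^{3^{r}}}_{\text{number of points}} = \frac{d^{3}}{100^{9}} \le d^{3}, \qquad \underbrace{2^{ct}}_{\text{number of queries}} = 2^{c\,d^{1/3}/100} = 2^{\Omega(d^{1/3})},$$

so any two-round algorithm with success probability at least $2/3$ must make $2^{\Omega(d^{1/3})}$ queries on inputs of size $O(d^3)$.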
4. A Deterministic Algorithm for Problem 2

In Problem 2, there is a hidden set $H = \{x_1, x_2, \ldots, x_n\}$ of $n$ points on the unit sphere $S^{d-1} \subseteq \mathbb{R}^d$. The goal is to learn the set $H$ while making queries to unit vectors in $S^{d-1}$, where a query to the unit vector $q$ returns the point $x^* \in H$ which has the least Euclidean distance to $q$; in other words, $x^* = \arg\min_{x_i \in H} \|x_i - q\|_2$. Note that since $\|x - q\|_2^2 = \|x\|_2^2 + \|q\|_2^2 - 2\langle x, q\rangle = 2 - 2\langle x, q\rangle$, the point $x^*$ also has the maximum inner product with $q$ among points in $H$. We present a deterministic algorithm that learns the set $H$ using $n^{O(d)}$ queries over $O(n + d)$ rounds of adaptivity.

Overview of Algorithm 2: Suppose that the points in $H$ have rank $k$. The algorithm first finds a basis $B$ of $k$ linearly independent points in $H$. This is done by repeatedly querying points in the orthogonal complement of the points discovered thus far until no new points are discovered. Henceforth, the algorithm only queries points in $\mathrm{Span}(B)$, effectively reducing the problem to $k$ dimensions. In each iteration, the algorithm constructs the convex hull of the points of $H$ found so far and queries the normal vectors of all the $(k-1)$-dimensional faces of the convex hull thus constructed. If no new points are recovered, the algorithm terminates. A simplified code sketch of this procedure appears after the proof of Lemma 4.3 below.

Algorithm 2 Deterministic Algorithm for Points on $S^{d-1}$.
1: Initialize $B := \emptyset$. {When the loop on line 2 terminates, $B$ will be a set of vectors which form a basis of $H$.}
2: repeat
3:   Find an orthonormal basis $\{v_1, v_2, \ldots, v_t\}$ of $\mathrm{Span}(B)^{\perp}$.
4:   Query $v_j$ and $-v_j$ for $1 \le j \le t$. Let $C \subseteq H$ be the points learned.
5:   if there exists a point $x$ in $C$ such that $\langle x, v_j \rangle \neq 0$ for some $v_j$ then
6:     $B = B \cup \{x\}$.
7:   end if
8: until no new point is added to $B$
9: Define $k := |B|$. {$k$ is the rank of the points in $H$.}
10: Initialize $\hat{H} := B$. {When the loop on line 11 terminates, $\hat{H}$ will equal $H$. This loop finds all points in $H \setminus B$.}
11: repeat
12:   Find all the $(k-1)$-dimensional faces of $\mathrm{CONV}(\hat{H})$.
13:   Let $D$ be the set of unit vectors in $\mathrm{Span}(\hat{H})$ normal to some $(k-1)$-face of $\mathrm{CONV}(\hat{H})$. {Note that each $(k-1)$-face of $\mathrm{CONV}(\hat{H})$ has two unit normal vectors.}
14:   Let $E \subseteq H$ denote the set of new points learned on querying the points in $D$.
15:   $\hat{H} = \hat{H} \cup E$.
16: until $|E| = 0$
17: Output $\hat{H}$.

Theorem 4.1. Algorithm 2 recovers the set $H \subseteq S^{d-1}$ containing $n$ points with rank $k$ while making at most $O(n^{\lfloor k/2 \rfloor + 1} + kd)$ queries.

It is easy to see that at the end of the first loop, Algorithm 2 recovers a basis $B \subseteq H$. We now show that the algorithm discovers all remaining points in the second loop. Towards this end, we state the following simple fact and defer its proof to Appendix C.

Fact 4.2. Let $T$ be a finite subset of $S^{d-1}$; then $\mathrm{CONV}(T) \cap S^{d-1} = T$.

Lemma 4.3. The output $\hat{H}$ of Algorithm 2 is $H$.

Proof. Consider the stage of the algorithm when it is executing an iteration of the loop on line 11, and suppose that there is an undiscovered point $y \in H \setminus \hat{H}$ at this stage. By Fact 4.2, $\mathrm{CONV}(\hat{H}) \cap S^{d-1} = \hat{H}$, and therefore $y$ is not in $\mathrm{CONV}(\hat{H})$. Hence there is a $(k-1)$-face of $\mathrm{CONV}(\hat{H})$ which separates $y$ from all the points in $\hat{H}$, and the unit vector $m$ normal to this face satisfies $\langle m, y \rangle > \langle m, x \rangle$ for all $x \in \hat{H}$. Querying the unit vectors corresponding to all $(k-1)$-faces of $\mathrm{CONV}(\hat{H})$ (line 14) therefore recovers a point in $H \setminus \hat{H}$. This proves that the algorithm terminates only after all the points in $H$ are recovered.
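The following is a simplified Python sketch of the second loop of Algorithm 2, written under the assumption that the discovered points already have a full-dimensional convex hull in $\mathbb{R}^d$ (so the basis-finding loop and the restriction to $\mathrm{Span}(\hat{H})$ are omitted). The function names and the use of scipy's `ConvexHull` are our own choices; `hull.equations` provides one outward facet normal per facet, and we also query its negation, mirroring line 13.

```python
import numpy as np
from scipy.spatial import ConvexHull, QhullError

def recover_on_sphere(oracle, seed_points, max_rounds=1000):
    """oracle(q) must return the hidden unit vector with largest <x, q>.
    seed_points: some already-discovered hidden points whose hull is full-dimensional."""
    found = {tuple(np.round(p, 12)) for p in seed_points}
    for _ in range(max_rounds):
        pts = np.array(sorted(found))
        try:
            hull = ConvexHull(pts)
        except QhullError:
            break  # degenerate hull; the full algorithm would restrict to Span(found)
        new = set()
        for eq in hull.equations:            # each row: outward facet normal and offset
            a = eq[:-1] / np.linalg.norm(eq[:-1])
            for q in (a, -a):                # query both unit normals of the facet
                x = tuple(np.round(oracle(q), 12))
                if x not in found:
                    new.add(x)
        if not new:
            break
        found |= new
    return found
```

With a toy oracle such as `oracle = lambda q: max(H, key=lambda x: float(np.dot(x, q)))` for a list `H` of unit vectors, seeding with any $d+1$ affinely independent hidden points makes the sketch recover the rest, as in Lemma 4.3.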
To bound the query complexity of the algorithm, we use the following standard result about the number of facets of the convex hull of $n$ points in $d$ dimensions.

Lemma 4.4 (Upper Bound Theorem, Theorem 5.5.1 in (Matousek, 2013)). The number of $(d-1)$-faces of the convex hull of $n$ points in $d$ dimensions is $O(n^{\lfloor d/2 \rfloor})$.

Lemma 4.5. Algorithm 2 makes at most $O(n^{\lfloor k/2 \rfloor + 1} + kd)$ queries, where $k = \mathrm{rank}(H)$.

Proof. Algorithm 2 uses $O(kd)$ queries to recover a basis of $H$. Each iteration of the loop beginning on line 11 queries the two unit normal vectors of each face of the convex hull of a set of at most $n$ points in $k$ dimensions. By Lemma 4.4, each iteration therefore makes at most $O(n^{\lfloor k/2 \rfloor})$ queries. Since there are at most $n$ iterations of this loop, the total number of queries made by the algorithm is $O(n^{\lfloor k/2 \rfloor + 1} + kd)$.

5. A Near Optimal Algorithm for Problem 3

In this section, we present results for Problem 3. In this problem, the hidden set $H$ is a subset of $\{0,1\}^d$ of cardinality $n$, and in each query the codebreaker learns $\min_{x \in H} \mathrm{dist}(x, q)$ for a $q \in \{0,1,2\}^d$ of their choice. The codebreaker's goal is to learn $H$ with as few queries as possible. For $x \in \{0,1\}^d$ and $I \subseteq [d]$, $x_I$ denotes the restriction of $x$ to $I$ (see Section 2 for a definition). Observe that the query model defined above is equivalent to the following query model, which is more convenient to work with: in each query, the codebreaker picks $I \subseteq [d]$ and a vector $q \in \{0,1\}^{|I|}$, and as a response learns $\min_{x \in H} \mathrm{dist}(x_I, q)$, which is the minimum distance of $q$ to points in $H$ when restricted to the indices in $I$. We will denote such a query by $\mathrm{query}(I, q)$.

5.1. The Algorithm

In this section, we present a deterministic algorithm for Problem 3 that finds $H$ using $O(nd)$ queries. Throughout this section, we define for each $i \in [d]$ the set $C_i = \{1, \ldots, i\}$. In the following lemma, we show that a point in $H$ with a given prefix can be recovered using $O(d)$ queries. This lemma will be used as a subroutine in Algorithm 3.

Lemma 5.1. Given $x' \in \{0,1\}^i$, with $i$ satisfying $0 \le i \le d$, we can recover a point $x \in H$ with prefix $x'$ (given that such an $x$ exists in $H$) using at most $d$ queries.

Proof. Suppose that $x$ is a point in $H$ with prefix $x'$; since $x'$ is a prefix of $x$, we have $\mathrm{query}(C_i, x') = 0$. Let $(x', 0)$ denote the point in $\{0,1\}^{i+1}$ obtained by appending a 0 to $x'$. To determine the $(i+1)$-th coordinate we perform $\mathrm{query}(C_{i+1}, (x', 0))$. If the response to the query is 0, then $x_{i+1} = 0$; otherwise $x_{i+1} = 1$. We now know a length-$(i+1)$ prefix of a point of $H$. Repeating this for the subsequent coordinates reveals a point $x \in H$ with prefix $x'$ using at most $d$ queries in total.

Algorithm 3 A Deterministic Algorithm for the Stronger Query Model.
1: For $1 \le i \le d$, let $C_i$ denote the set of coordinates $\{1, 2, \ldots, i\}$.
2: Initialize $H' := \emptyset$. {$H'$ is the subset of $H$ learned by the algorithm.}
3: Initialize $N := \emptyset$. {$N$ is the most recently learned subset of $H$.}
4: Use Lemma 5.1 to find some point $x_0$ in $H$ (the prefix in the lemma can be set to be the null prefix).
5: Add $x_0$ to $N$ and $H'$.
6: while $N$ is non-empty do
7:   for each $x \in N$ do
8:     for $i$ from 1 to $d$ do
9:       Let $\bar{x}^{(i)} \in \{0,1\}^i$ be the $i$-th prefix of $x$ but with the $i$-th coordinate negated.
10:      if $\bar{x}^{(i)}$ is not the prefix of any point in $H'$ and $\mathrm{query}(C_i, \bar{x}^{(i)}) = 0$ then
11:        Use Lemma 5.1 to find a point in $H$ with prefix $\bar{x}^{(i)}$ and add it to $H'$.
12:      end if
13:    end for
14:  end for
15:  Let $N$ be the set of new points learned in this iteration of the while loop; $H' \leftarrow H' \cup N$.
16: end while
17: Output $H'$.

Overview of Algorithm 3: Consider the execution of the algorithm when it has discovered a subset $H'$ of $H$. It finds a prefix $p$ of some point in $H \setminus H'$ such that $p$ is not the prefix of any point in $H'$. It then uses the subroutine of Lemma 5.1 to learn a point in $H$ with this prefix, thereby learning a previously unknown point of $H$. To find $p$, the algorithm iterates over all prefixes of recently learned points in $H'$. For each such prefix $q$, the algorithm negates the last coordinate of $q$ to obtain $\bar{q}$. It then checks whether $\bar{q}$ is the prefix of any point in $H'$, and if it is not, queries $\bar{q}$. If the query returns 0, then the algorithm has found the required prefix of an undiscovered point. A Python sketch of this procedure (including the subroutine of Lemma 5.1) is given below.
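The following sketch is our own transcription, not the paper's code: the helper names (`restricted_query`, `find_with_prefix`, `recover_all`) are assumptions, and the restricted query model is implemented on top of the $\{0,1,2\}^d$ oracle by masking the unused coordinates with 2 and subtracting the fixed offset they contribute.

```python
def full_oracle(H):
    """Problem 3 oracle: q is over {0,1,2}; returns min Hamming distance from q to H."""
    return lambda q: min(sum(a != b for a, b in zip(x, q)) for x in H)

def restricted_query(oracle, d, I, q):
    """query(I, q): distance of q to H restricted to coordinates I (0-indexed, sorted).
    Coordinates outside I are set to 2 and always mismatch, adding d - |I| to the answer."""
    full = [2] * d
    for pos, bit in zip(I, q):
        full[pos] = bit
    return oracle(tuple(full)) - (d - len(I))

def find_with_prefix(oracle, d, prefix):
    """Lemma 5.1: recover some point of H whose first len(prefix) bits equal prefix,
    assuming one exists, using at most d further queries."""
    x = list(prefix)
    for i in range(len(prefix), d):
        resp = restricted_query(oracle, d, list(range(i + 1)), x + [0])
        x.append(0 if resp == 0 else 1)
    return tuple(x)

def recover_all(oracle, d):
    """Algorithm 3: recover the whole hidden set with O(nd) queries."""
    x0 = find_with_prefix(oracle, d, [])
    learned, frontier = {x0}, {x0}
    while frontier:
        new = set()
        for x in frontier:
            for i in range(1, d + 1):
                flipped = list(x[:i])
                flipped[i - 1] ^= 1
                if any(list(h[:i]) == flipped for h in learned | new):
                    continue  # this prefix already leads to a known point
                if restricted_query(oracle, d, list(range(i)), flipped) == 0:
                    new.add(find_with_prefix(oracle, d, flipped))
        learned |= new
        frontier = new
    return learned
```

For example, `recover_all(full_oracle({(0, 1, 0, 1), (1, 1, 1, 1), (0, 0, 0, 0)}), 4)` returns the full hidden set.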
Theorem 5.2. Algorithm 3 recovers the set $H \subseteq \{0,1\}^d$ while making at most $O(nd)$ queries.

The following lemma proves the correctness of Algorithm 3.

Lemma 5.3. The set $H'$ output by Algorithm 3 equals $H$.

Proof. First, we prove that the algorithm terminates. The while loop ends as soon as no new points are learned in its most recent iteration. Therefore, there are at most $n$ iterations of the while loop, after which the algorithm terminates. Next, we show that the algorithm learns all points in $H$. Suppose that the execution of the algorithm is at the while loop on line 6 and that there exists $y \in H \setminus H'$; we show that the algorithm learns a point in $H \setminus H'$ in this iteration of the while loop. Among the points in $H'$, let $z$ be the point that has the longest common prefix with $y$, and let the length of this prefix be $l$. Observe that $\bar{z}^{(l+1)}$ is not the prefix of any point in $H'$ and that $\mathrm{query}(C_{l+1}, \bar{z}^{(l+1)}) = 0$, since $\bar{z}^{(l+1)}$ is a prefix of $y$. Therefore, we are done if we argue that $z$ is contained in $N$. This is true since, if $z \in H' \setminus N$, then a previous iteration of the while loop would have retrieved a point whose common prefix with $y$ has length at least $(l + 1)$, a contradiction.

Next, we bound the query complexity of Algorithm 3, completing the proof of Theorem 5.2.

Lemma 5.4. Algorithm 3 makes $O(nd)$ queries.

Proof. For each point in $H$, the algorithm makes at most one query for each of its prefixes. Having learned the prefix of some undiscovered point in line 10, the algorithm uses Lemma 5.1 to learn a new point, making at most $d$ further queries. Therefore, the total number of queries made is $O(nd)$.

A simple information-theoretic lower bound shows that Algorithm 3 is almost optimal.

Theorem 5.5. Consider a randomized algorithm for Problem 3 that, for any set $H \subseteq \{0,1\}^d$ containing $n$ points, learns all the points in $H$ with probability at least $2/3$. The worst-case query complexity of such an algorithm is $\Omega\!\left(\frac{nd}{\log d}\right)$.

Proof. Let $D$ denote the uniform distribution over all $n$-subsets of $\{0,1\}^d$. Let $A$ be a deterministic algorithm that learns an input $H \sim D$ with success probability at least $2/3$ while making at most $t$ queries. By Yao's principle, it suffices to show that $t = \Omega(nd/\log d)$. Since $A$ is deterministic, its output is completely determined by the responses to its queries. For each query, it receives a response of length $O(\log d)$ bits. It therefore outputs one of at most $2^{O(t \log d)}$ subsets of $\{0,1\}^d$ and can be correct on at most $2^{O(t \log d)}$ inputs. Since it succeeds with probability at least $2/3$, we must have $t = \Omega\!\left(\frac{1}{\log d}\log\binom{2^d}{n}\right) = \Omega\!\left(\frac{nd}{\log d}\right)$.
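For completeness, a short supporting calculation for the final estimate, under the assumption (ours, for concreteness) that $n \le 2^{d/2}$:

$$\log \binom{2^d}{n} \;\ge\; \log\left(\frac{2^d}{n}\right)^{\!n} \;=\; n\,(d - \log n) \;\ge\; \frac{nd}{2}, \qquad\text{so}\qquad t \;=\; \Omega\!\left(\frac{1}{\log d}\log\binom{2^d}{n}\right) \;=\; \Omega\!\left(\frac{nd}{\log d}\right).$$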
Adaptivity. While Algorithm 3 is almost optimal in terms of query complexity, it requires several rounds of adaptivity. In particular, Algorithm 3 as described can be implemented so as to use $O(n)$ rounds of adaptivity, i.e., $O(1)$ rounds to discover each new point. Alternatively, one can also get a $d$-round adaptive algorithm with the same query complexity as follows: for $i = 1, \ldots, d$, in the $i$-th round the algorithm learns the length-$i$ prefix of each point in $H$. Given all length-$i$ prefixes, the algorithm can figure out all length-$(i+1)$ prefixes using at most $2n$ queries. Therefore, overall, we can achieve a query complexity of $O(nd)$ using $O(\min(n, d))$ adaptive rounds. It would be interesting to determine whether query-efficient algorithms that use fewer rounds of adaptivity exist. In any case, non-adaptive algorithms for Problem 3 can be ruled out via an argument almost identical to the one used to prove a lower bound for Problem 1 against non-adaptive algorithms.

Theorem 5.6. Consider a non-adaptive randomized algorithm for Problem 3 that, for any set $H \subseteq \{0,1\}^d$ containing $n$ points, learns all the points in $H$ with probability at least $2/3$. Such an algorithm must make $\Omega(2^d)$ queries.

6. Conclusions and Open Problems

This work introduces the natural extension of the Mastermind problem to the setting where there are multiple secrets. In the context of Problem 1, our work focused on understanding the query complexity of two-round adaptive algorithms. We proposed a $2^{\tilde{O}(\sqrt{d \log n})}$ query upper bound and showed that any algorithm with $\mathrm{poly}(d)$ query complexity must use $\Omega(\log\log d)$ rounds of adaptivity. Designing such an algorithm with polynomial query complexity is a natural and interesting open question. For Problem 2 we gave a simple deterministic algorithm with $n^{O(d)}$ query complexity. This query complexity arose from the fact that convex polytopes with $n$ vertices in $d$ dimensions can have as many as $n^{\Omega(d)}$ faces. For this problem, we do not know of any non-trivial lower bounds, and it would be nice to make progress in this direction. In the case of Problem 3, we have an almost optimal algorithm that recovers the hidden set while making at most $O(nd)$ queries. It, however, suffers from having to use $O(\min(n, d))$ rounds of adaptive queries. While we show that any non-adaptive algorithm must make $\Omega(2^d)$ queries, it would be interesting to establish the correct tradeoff between query complexity and adaptivity for Problem 3. Another interesting direction to explore is the study of the generalized Mastermind problem for non-separable distance metrics such as edit distance, Fréchet distance, and dynamic time warping.

Acknowledgements

We thank Rahul Ilango for helpful discussions during the initial phases of this work. We would also like to thank Rajiv Gandhi for making the collaboration between the authors possible. David P. Woodruff was supported in part by a Simons Investigator Award and NSF Grant No. CCF-2335412.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Afshani, P., Agrawal, M., Doerr, B., Doerr, C., Larsen, K. G., and Mehlhorn, K. The query complexity of a permutation-based variant of Mastermind. Discrete Applied Mathematics, 260:28-50, 2019.

Akhtar, N. and Mian, A. Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access, 6:14410-14430, 2018. doi: 10.1109/ACCESS.2018.2807385.

Aldridge, M., Johnson, O., Scarlett, J., et al. Group testing: an information theory perspective. Foundations and Trends in Communications and Information Theory, 15(3-4):196-392, 2019.

Alon, N., Kozlov, D., and Vu, V. The geometry of coin-weighing problems. In Proceedings of the 37th Conference on Foundations of Computer Science, pp. 524-532, 1996. doi: 10.1109/SFCS.1996.548511.

Bshouty, N. H.
Optimal algorithms for the coin weighing problem with a spring scale. In COLT, volume 2009, pp. 82, 2009.

Chen, Z., Cunha, C., and Homer, S. Finding a hidden code by asking questions. In International Computing and Combinatorics Conference, pp. 50-55. Springer, 1996.

Chvátal, V. Mastermind. Combinatorica, 3(3):325-329, 1983.

Cohen, J., Rosenfeld, E., and Kolter, Z. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310-1320. PMLR, 2019.

Coja-Oghlan, A., Gebhard, O., Hahn-Klimroth, M., and Loick, P. Optimal group testing. In Conference on Learning Theory, pp. 1374-1388. PMLR, 2020.

Dorfman, R. The detection of defective members of large populations. The Annals of Mathematical Statistics, 14(4):436-440, 1943.

El Ouali, M. and Sauerland, V. The exact query complexity of yes-no permutation Mastermind. Games, 11(2):19, 2020.

Fernández, M., Woodruff, D. P., and Yasuda, T. The query complexity of Mastermind with $\ell_p$ distances. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 2019.

Goodrich, M. T. On the algorithmic complexity of the Mastermind game with black-peg results. Information Processing Letters, 109(13):675-678, 2009.

Hu, Z., Li, X., Woodruff, D. P., Zhang, H., and Zhang, S. Recovery from non-decomposable distance oracles. arXiv preprint arXiv:2209.05676, 2022.

Janson, S. Large deviation inequalities for sums of indicator variables. arXiv preprint arXiv:1609.00533, 2016.

Jiang, Z. and Polyanskii, N. On the metric dimension of Cartesian powers of a graph. Journal of Combinatorial Theory, Series A, 165:1-14, 2019.

Knuth, D. E. The computer as master mind. Journal of Recreational Mathematics, 9:1-6, 1977.

Lecuyer, M., Atlidakis, V., Geambasu, R., Hsu, D., and Jana, S. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 656-672. IEEE, 2019.

Matousek, J. Lectures on Discrete Geometry, volume 212. Springer Science & Business Media, 2013.

Miller, D. J., Xiang, Z., and Kesidis, G. Adversarial learning targeting deep neural network classification: A comprehensive review of defenses against attacks. Proceedings of the IEEE, 108(3):402-433, 2020. doi: 10.1109/JPROC.2020.2970615.

Rodríguez-Velázquez, J. A., Yero, I. G., Kuziak, D., and Oellermann, O. R. On the strong metric dimension of Cartesian and direct products of graphs. Discrete Mathematics, 335:8-19, 2014.

Zhang, W. E., Sheng, Q. Z., Alhazmi, A., and Li, C. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Transactions on Intelligent Systems and Technology (TIST), 11(3):1-41, 2020.

A. Some Useful Concentration Inequalities

The binomial distribution with $n$ trials and success probability $p$ is denoted by $\mathrm{Bin}(n, p)$. The following are some standard concentration inequalities for the binomial distribution.

Fact A.1 (Hoeffding's Inequality). Let $Y$ be a random variable sampled from $\mathrm{Bin}(d, 1/2)$. For $t > 0$ we have $\Pr[\,|Y - d/2| \ge t\,] \le 2\exp(-2t^2/d)$.

The following is an anti-concentration inequality for binomial random variables.

Fact A.2 (Anti-Concentration Inequality for the Binomial Distribution). Let $Y$ be a random variable sampled from $\mathrm{Bin}(d, 1/2)$. When $0 \le t \le d/8$ we have $\Pr[\,Y - d/2 \ge t\,] \ge \frac{1}{15}\exp(-16t^2/d)$.

The above fact implies the following inequality for the minimum of $m$ independent binomial random variables.

Corollary A.3.
Suppose that $Y_1, Y_2, \ldots, Y_m$ are $m$ independent $\mathrm{Bin}(d, 1/2)$ random variables. Define $Y_{\min}$ to be the minimum of the random variables $Y_i$, i.e., $Y_{\min} = \min_i Y_i$. For $m = 2^{o(d)}$ and $\frac{1}{2^m} < \delta < 1$ we have

$$\Pr\!\left[\,Y_{\min} \le \frac{d}{2} - \frac{1}{4}\sqrt{d\,\bigl(\log m - \log(15\log(1/\delta))\bigr)}\,\right] \ge 1 - \delta.$$

We now define the hypergeometric distribution and state a corresponding concentration inequality.

Definition A.4 (Hypergeometric Distribution). Suppose that $d$ balls are drawn randomly without replacement from an urn containing $N$ balls, out of which $K$ are white. The distribution of the random variable $X$ which equals the number of white balls drawn is denoted by $\mathrm{Hypergeometric}(N, K, d)$.

Fact A.5 (Theorem 4 of (Janson, 2016)). Let $Y \sim \mathrm{Hypergeometric}(N, K, d)$ and $\mu = \mathbb{E}[Y] = \frac{dK}{N}$. For $\varepsilon > 0$ less than a sufficiently small constant, $\Pr[\,|Y - \mu| \ge \varepsilon\mu\,] \le 2\exp\!\left(-\frac{\varepsilon^2\mu}{3}\right)$.

B. Missing Details from Section 3

Proof of Claim 3.3. If the event $G_1 \cap G_2$ occurs then we have

$$2I < \frac{2\,\mathrm{wt}(y)\,\mathrm{wt}(x')}{d} + 2\sqrt{\frac{3\,\mathrm{wt}(y)\,\mathrm{wt}(x')\log(12n^2)}{d}} \le \mathrm{wt}(x') - \frac{\sqrt{\log t}}{4\sqrt{d}}\,\mathrm{wt}(x') + \sqrt{6\,\mathrm{wt}(x')\log(12n^2)},$$

where we used $\mathrm{wt}(y) \le d/2 - \sqrt{d\log t}/8$ in the first term and $\mathrm{wt}(y) \le d/2$ in the second. Therefore, $\mathrm{wt}(x') > 2I$ holds if $\frac{\sqrt{\log t}}{4\sqrt{d}}\,\mathrm{wt}(x') \ge \sqrt{6\,\mathrm{wt}(x')\log(12n^2)}$. The above inequality, on rearranging, is equivalent to $t \ge \exp(96\, d \log(12n^2)/\mathrm{wt}(x'))$. Since $\mathrm{wt}(x') > r$, we conclude that $t \ge \exp(96\, d \log(12n^2)/r)$ is a sufficient condition for $\mathrm{wt}(x') > 2I$.

Proof of Claim 3.4. The queries $y_1, \ldots, y_t$ are chosen independently and uniformly at random, and therefore the distribution of the random variable $\mathrm{dist}(x, y_i) = \mathrm{wt}(y_i)$ is $\mathrm{Bin}(d, 1/2)$. Since $y$ is the nearest query to $x$, it follows that $\mathrm{wt}(y)$ is the minimum of $t$ independent $\mathrm{Bin}(d, 1/2)$ random variables. If $n = 2^{o(d/\log d)}$ we have that $t = 2^{O(\sqrt{d \log d \log n})} = 2^{o(d)}$. Applying Corollary A.3, we conclude that $\mathrm{wt}(y) < d/2 - \frac{1}{4}\sqrt{d\,(\log t - \log(15\log(12n^2)))}$ holds with probability at least $1 - 1/(12n^2)$. For sufficiently large $d$ we have $\log(15\log(12n^2)) \le 2\log d \le \frac{3}{4}\log t$. Therefore, $\mathrm{wt}(y) < d/2 - \sqrt{d\log t}/8$ holds with probability at least $1 - 1/(12n^2)$. To show the lower bound on $\mathrm{wt}(y)$ we use Fact A.1. The probability that a $\mathrm{Bin}(d, 1/2)$ random variable is less than $d/4$ is at most $2\exp(-d/8)$. By a union bound, the probability that the minimum of $t = 2^{o(d)}$ such binomial random variables is less than $d/4$ is at most $\exp(-\Omega(d)) \le 1/(12n^2)$. Therefore, $\mathrm{wt}(y) \ge d/4$ with probability at least $1 - 1/(12n^2)$. A final union bound shows that both the lower and upper bounds on $\mathrm{wt}(y)$ hold with probability at least $1 - 1/(6n^2)$.

Proof of Claim 3.5. Conditioning on $\mathrm{wt}(y) = w$ (the event $E_w$), by symmetry, the distribution of $y$ is uniform over the points in the hypercube with weight $w$. It follows that, conditioned on $E_w$, the random variable $I$ is distributed according to $\mathrm{Hypergeometric}(d, \mathrm{wt}(y), \mathrm{wt}(x'))$ (see Definition A.4). The claim then follows by setting $\varepsilon = \sqrt{3\log(12n^2)/\mu}$ and $\mu = \mathrm{wt}(y)\,\mathrm{wt}(x')/d$ in Fact A.5.

Proof of Claim 3.6. Let $J$ denote the set of integers in the interval $[d/4,\ d/2 - \sqrt{d\log t}/8]$. Observe that $G_1 = \bigcup_{w \in J} E_w$. It follows that:

$$\Pr[G_1 \cap G_2] = \sum_{w \in J} \Pr[G_2 \cap E_w] = \sum_{w \in J} \Pr[G_2 \mid E_w]\,\Pr[E_w] \ge \left(1 - \frac{1}{6n^2}\right)\sum_{w \in J}\Pr[E_w] = \left(1 - \frac{1}{6n^2}\right)\Pr[G_1] \ge \left(1 - \frac{1}{6n^2}\right)^2 \ge 1 - \frac{1}{3n^2}.$$

Proof of Claim 3.13. Note that we have the following formula for the Hamming distance between $a, b \in \{0,1\}^{t'}$: $\mathrm{dist}(a, b) = \mathrm{wt}(a) + \mathrm{wt}(b) - 2|\mathrm{Supp}(a) \cap \mathrm{Supp}(b)|$. Using this formula and the facts that $|\mathrm{Supp}(x)| = \ell$ and $0 \le |\mathrm{Supp}(y)| \le t$, we obtain

$$\mathrm{dist}(z, x) - \mathrm{dist}(z, y) = \mathrm{wt}(x) - 2|\mathrm{Supp}(x) \cap \mathrm{Supp}(z)| - \mathrm{wt}(y) + 2|\mathrm{Supp}(y) \cap \mathrm{Supp}(z)| < \ell - 2(\ell/2 + t) - 0 + 2t = 0.$$

Proof of Lemma 3.15. We prove this by the following induction.
For each $i \in [r]$, we show that there is a $(d(i), m(i), 2^{ct}, i, i/(3r))$-hard distribution where $d(i) \le 20^{1.5(3^{i-1}-1)}\, t^{3^{i-1}}$ and $m(i) \le t^{3^i}$. The lemma then follows by setting $i = r$, since $d(r) \le 20^{1.5(3^{r-1}-1)}\, t^{3^{r-1}} \le (100t)^{3^{r-1}}$. The bounds on $m(1)$ and $d(1)$ in the base case $i = 1$ follow from Lemma 3.11, which guarantees the existence of a $(t, t^2, 2^{ct}, 1, 2^{-\Omega(t)})$-hard distribution. Suppose that the claim holds up to $i = j$ for some $j \ge 1$. We now use Lemma 3.12 to perform the induction step. Setting $\delta_2 = \frac{1}{3r} \ge \frac{1}{t}$ in this lemma, we deduce that there is a $(d(j+1), m(j+1), 2^{ct}, j+1, (j+1)/(3r))$-hard distribution with $d(j+1) \le 2500\, d(j)^2 (d(j) + \log_2(q/\delta_2)) + d(j)$. Noting that $q \le 2^{ct} \le 2^t$ and $1/\delta_2 \le t \le 2^t$, we conclude that $\log_2(q/\delta_2) \le 2t \le 2d(j)$. This implies that $d(j+1) \le 2500\, d(j)^2 (3d(j)) + d(j) \le (20\, d(j))^3$. Using the induction hypothesis for $i = j$, we further conclude that $d(j+1) \le (20 \cdot 20^{1.5(3^{j-1}-1)}\, t^{3^{j-1}})^3 = 20^{1.5(3^j - 1)}\, t^{3^j}$. Next, we use Lemma 3.12 to bound $m(j+1)$. We have

$$m(j+1) \le m(j) + 25(d(j) + \log_2(q/\delta_2)) \le m(j) + 75\, d(j) \le t^{3^j} + 75\,(100t)^{3^{j-1}} \le t^{3^{j+1}},$$

where the last inequality holds for $t$ sufficiently large.

C. Missing Details from Section 4

Proof of Fact 4.2. Let $x$ be a point in $\mathrm{CONV}(T)$. By definition, $x = \sum_{i=1}^m \alpha_i u_i$ where $u_i \in T$, $\alpha_i \in [0, 1]$ for $1 \le i \le m$, and $\sum_{i=1}^m \alpha_i = 1$. If additionally $\|x\|_2 = 1$, then by the Cauchy-Schwarz inequality,

$$1 = \|x\|_2^2 = \sum_{1 \le i,j \le m} \alpha_i \alpha_j \langle u_i, u_j \rangle \le \sum_{1 \le i,j \le m} \alpha_i \alpha_j \|u_i\|_2 \|u_j\|_2 = \sum_{1 \le i,j \le m} \alpha_i \alpha_j = 1.$$

Since the $u_i$ are distinct unit vectors, the equality above holds only if $\alpha_i \alpha_j = 0$ whenever $i \neq j$. This is only possible if there exists $k$ such that $\alpha_k = 1$ and $\alpha_i = 0$ for $i \neq k$. It therefore follows that if $x \in \mathrm{CONV}(T)$ and $\|x\|_2 = 1$ then $x \in T$.