# truthful_data_acquisition_via_peer_prediction__474f477d.pdf

Truthful Data Acquisition via Peer Prediction

Yiling Chen Harvard University yiling@seas.harvard.edu

Yiheng Shen Tsinghua University shen-yh17@mails.tsinghua.edu.cn

Shuran Zheng Harvard University shuran_zheng@seas.harvard.edu

We consider the problem of purchasing data for machine learning or statistical estimation. The data analyst has a budget to purchase datasets from multiple data providers. She does not have any test data that can be used to evaluate the collected data and can assign payments to data providers solely based on the collected datasets. We consider the problem in the standard Bayesian paradigm and in two settings: (1) data are only collected once; (2) data are collected repeatedly and each day s data are drawn independently from the same distribution. For both settings, our mechanisms guarantee that truthfully reporting one s dataset is always an equilibrium by adopting techniques from peer prediction: pay each provider the mutual information between his reported data and other providers reported data. Depending on the data distribution, the mechanisms can also discourage misreports that would lead to inaccurate predictions. Our mechanisms also guarantee individual rationality and budget feasibility for certain underlying distributions in the ﬁrst setting and for all distributions in the second setting.

1 Introduction

Data has been the fuel of the success of machine learning and data science, which is becoming a major driving force for technological and economic growth. An important question is how to acquire high-quality data to enable learning and analysis when data are private possessions of data providers.

Naively, we could issue a constant payment to data providers in exchange for their data. But data providers can report more or less data than they actually have or even misreport values of their data without affecting their received payments. Alternatively, if we have a test dataset, we could reward data providers according to how well the model trained on their reported data performs on the test data. However, if the test dataset is biased, this could potentially incentivize data providers to bias their reported data toward the test set, which will limit the value of the acquired data for other learning or analysis tasks. Moreover, a test dataset may not even be available in many settings. In this work, we explore the design of reward mechanisms for acquiring high-quality data from multiple data providers when a data buyer doesn t have access to a test dataset. The ultimate goal is that, with the designed mechanisms, strategic data providers will ﬁnd that truthfully reporting their possessed dataset is their best action and manipulation will lead to lower expected rewards. To make the mechanisms practical, we also require our mechanisms to always have non-negative and bounded payments so that data providers will ﬁnd it beneﬁcial to participate in (a.k.a. individual rationality) and the data buyer can afford the payments.

In a Bayesian paradigm where data are generated independently conditioned on some unknown parameters, we design mechanisms for two settings: (1) data are acquired only once, and (2) data

34th Conference on Neural Information Processing Systems (Neur IPS 2020), Vancouver, Canada.

are acquired repeatedly and each day s data are independent from the previous days data. For both settings, our mechanisms guarantee that truthfully reporting the datasets is always an equilibrium. For some models of data distributions, data providers in our mechanisms receive strictly lower rewards in expectation if their reported dataset leads to an inaccurate prediction of the underlying parameters, a property we called sensitivity.1 While sensitivity doesn t strictly discourage manipulations of datasets that do not change the prediction of the parameters, it is a signiﬁcant step toward achieving strict incentives for truthful reporting one s datasets, an ideal goal, especially because ﬁnding a manipulation without affecting the prediction of the parameters can be difﬁcult. Our mechanisms guarantee IR and budget feasibility for certain underlying distributions in the ﬁrst setting and for any underlying distributions in the second setting.

Our mechanisms are built upon recent developments [17, 16] in the peer prediction literature. The insight is that if we reward a data provider the mutual information [17] between his data and other providers data, then by the data processing inequality, if other providers report their data truthfully, this data provider will only decrease the mutual information, hence his reward, by manipulating his dataset. We extend the peer prediction method developed by [16] to the data acquisition setting, and to further guarantee IR and budget feasibility. One of our major technical contributions is the explicit sensitivity guarantee of the peer-prediction style mechanisms, which is absent in the previous work.

2 Related Work

The problem of purchasing data from people has been investigated with different focuses, e.g. privacy concerns [14, 10, 13, 23, 7, 28], effort and cost of data providers[25, 4, 1, 5, 30, 6], reward allocation [12, 2]. Our work is the ﬁrst to consider rewarding data without (good) test data that can be used evaluate the quality of reported data. Similar to our setting, [12, 2] consider paying to multiple data providers in a machine learning task. They use a test set to assess the contribution of subsets of data and then propose a fair measurement of the value of each data point in the dataset, which is based on the Shapley value in game theory. Both of the works do not formally consider the incentive compatibility of payment allocation. [28] proposes a market framework that purchases hypotheses for a machine learning problem when the data is distributed among multiple agents. Again they assume that the market has access to some true samples and the participants are paid with their incremental contributions evaluated by these true samples. Besides, there is a small literature (see [9] and subsequent work) on aggregating datasets using scoring rules that also considers signal distributions in exponential families.

The main techniques of this work come from the literature of peer prediction [19, 24, 8, 11, 26, 17, 16, 18, 15]. Peer prediction is the problem of information elicitation without veriﬁcation. The participants receive correlated signals of an unknown ground truth and the goal is to elicit the true signals from the participants. In our problem, the dataset can be viewed as a signal of the ground truth. What makes our problem more challenging than the standard peer prediction problem is that (1) the signal space is much larger and (2) the correlation between signals is more complicated. Standard peer prediction mechanisms either require the full knowledge of the underlying signal distribution, or make assumptions on the signal distribution that are not applicable to our problem. [16] applies the peer prediction method to the co-training problem, in which two participants are asked to submit forecasts of latent labels in a machine learning problem. Our work is built upon the main insights of [16]. We discuss the differences between our model and theirs in the model section, and show how their techniques are applied in the result sections.

Our work is also related to Multi-view Learning (see [29] for a survey). But our work focuses on the data acquisition, but not the machine learning methods used on the (multi-view) data.

A data analyst wants to gather data for some future statistical estimation or machine learning tasks. There are n data providers. The i-th data provider holds a dataset Di consisting of Ni data points d(1) i , . . . , d(Ni) i with support Di. The data generation follows a standard Bayesian process. For each

1This means that a data provider can report a different dataset without changing his reward as long as the dataset leads to the same prediction for the underlying parameters as his true dataset.

data set Di, data points d(j) i Di are i.i.d. samples conditioned on some unknown parameters θ Θ. Let p(θ, D1, . . . , Dn) be the joint distribution of θ and n data providers datasets. We consider two types of spaces for Θ in this paper: (1) θ has ﬁnite support, i.e., |Θ| = m is ﬁnite, and (2) θ has continuous support, i.e. θ Rm and Θ Rm. For the case of continuous support, to alleviate computational issues, we consider a widely used class of distributions, an exponential family.

The data analyst s goal is to incentivize the data providers to give their true datasets with a budget B. She needs to design a payment rule ri( e D1, . . . , e Dn) for i [n] that decides how much to pay data provider i according to all the reported datasets e D1, . . . , e Dn. The payment rule should ideally incentivize truthful reporting, that is, e Di = Di for all i.

Before we formally deﬁne the desirable properties of a payment rule, we note that the analyst will have to leverage the correlation between people s data to distinguish a misreported dataset from a true dataset because all she has access to is the reported datasets. To make the problem tractable, we thus make the following assumption about the data correlation: parameters θ contains all the mutual information between the datasets. More formally, the datasets are independent conditioned on θ.

Assumption 3.1. D1, . . . , Dn are independent conditioned on θ,

p(D1, . . . , Dn|θ) = p(D1|θ) p(Dn|θ).

This is deﬁnitely not an assumption that would hold for arbitrarily picked parameters θ and any datasets. One can easily ﬁnd cases where the datasets are correlated to some parameters other than θ. So the data analyst needs to carefully decide what to include in θ and Di, by either expanding θ to include all relevant parameters or reducing the content of Di to exclude all redundant data entries that can cause extra correlations.

Example 3.1. Consider the linear regression model where provider i s data points d(j) i = (z(j) i , y(j) i ) consist of a feature vector z(j) i and a label y(j) i . We have a linear model

y(j) i = θT z(j) i + ε(j) i .

Then datasets D1, . . . , Dn will be independent conditioning on θ as long as (1) different data providers draw their feature vectors independently, i.e., z(j1) 1 , . . . , z(jn) n are independent for all j1 [N1], . . . , jn [Nn], and (2) the noises are independent.

We further assume that the data analyst has some insight about the data generation process.

Assumption 3.2. The data analyst possesses a commonly accepted prior p(θ) and a commonly accepted model for data generating process so that she can compute the posterior p(θ|Di), i, Di.

When |Θ| is ﬁnite, p(θ|Di) can be computed as a function of p(θ|di) using the method in Appendix B. For a model in the exponential family, p(θ|Di) can be computed as in Deﬁnition 4.2.

Note that we do not always require the data analyst to know the whole distribution p(Di|θ), it sufﬁces for the data analyst to have the necessary information to compute p(θ|Di).

Example 3.2. Consider the linear regression model in Example 3.1. We use zi to represent all the features in Di and use yi to represent all the labels in Di. If the features zi are independent from θ, the data analyst does not need to know the distribution of zi. It sufﬁces to know p(yi|zi, θ) and p(θ) to know p(θ|Di) because

p(θ|(zi, yi)) p((zi, yi)|θ)p(θ) = p(yi|zi, θ)p(zi|θ)p(θ) = p(yi|zi, θ)p(zi)p(θ) p(yi|zi, θ)p(θ).

Finally we assume that the identities of the providers can be veriﬁed.

Assumption 3.3. The data analyst can verify the data providers identities, so one data provider can only submit one dataset and get one payment.

We now formally introduce some desirable properties of a payment rule. We say that a payment rule is truthful if reporting true datasets is a weak equilibrium, that is, when the others report true datasets, it is also (weakly) optimal for me to report the true dataset (based on my own belief).

Deﬁnition 3.1 (Truthfulness). Let D i be the datasets of all providers except i. A payment rule r(D1, . . . , Dn) is truthful if: for any (commonly accepted model of) underlying distribution p(θ, D1, . . . , Dn), for every data provider i and any realization of his dataset Di, when all other data providers truthfully report D i, truthfully reporting Di leads to the highest expected payment, where the expectation is taken over the distribution of D i conditioned on Di, i.e.,

ED i p(D i|Di)[ri(Di, D i)] ED i p(D i|Di)[ri(D i, D i)], i, Di, D i.

Note that this deﬁnition does not require the agents to actually know the conditional distribution and to be able to evaluate the expectation themselves. It is a guarantee that no matter what the underlying distribution is, truthfully reporting is an equilibrium.

Because truthfulness is deﬁned as a weak equilibrium, it does not necessarily discourage misreporting.2 What it ensures is that the mechanism does not encourage misreporting.3 So, we want a stronger guarantee than truthfulness. We thus deﬁne sensitivity: the expected payment should be strictly lower when the reported data does not give the accurate prediction of θ.

Deﬁnition 3.2 (Sensitivity). A payment rule r(D1, . . . , Dn) is sensitive if for any (commonly accepted model of) underlying distribution p(θ, D1, . . . , Dn), for any provider i and any realization of his dataset Di, when all other providers j = i report e Dj(Dj) with accurate posterior p(θ| e Dj(Dj)) = p(θ|Dj), we have (1) truthfully reporting Di leads to the highest expected payment

ED i p(D i|Di)[ri(Di, e D i(D i))] ED i p(D i|Di)[ri(D i, e D i(D i))], D i

and (2) reporting a dataset D i with inaccurate posterior p(θ|D i) = p(θ|Di) is strictly worse than reporting a dataset e Di with accurate posterior p(θ| e Di) = p(θ|Di),

ED i p(D i|Di)[ri( e Di, e D i(D i))] > ED i p(D i|Di)[ri(D i, e D i(D i))],

Furthermore, let i = p(θ|D i) p(θ|Di), a payment rule is α-sensitive for agent i if

ED i p(D i|Di)[ri(Di, e D i(D i))] ED i p(D i|Di)[ri(D i, e D i(D i))] α i ,

for all Di, D i and reports e D i(D i) that give the accurate posteriors.

Our deﬁnition of sensitivity guarantees that at an equilibrium, the reported datasets must give the correct posteriors p(θ| e Di) = p(θ|Di). We can further show that at an equilibrium, the analyst will get the accurate posterior p(θ|D1, . . . , Dn).

Lemma 3.1. When D1, . . . , Dn are independent conditioned on θ, for any (D1, . . . , Dn) and ( e D1, . . . , e Dn), if p(θ|Di) = p(θ| e Di) i, then p(θ|D1, . . . , Dn) = p(θ| e D1, . . . , e Dn).

A more ideal property would be that the expected payment is strictly lower for any dataset D i = Di. Mechanisms that satisfy sensitivity can be viewed as an important step toward this ideal goal, as the only possible payment-maximizing manipulations are to report a dataset e Di that has the correct posterior p(θ| e Di) = p(θ|Di). Arguably, ﬁnding such a manipulation can be challenging. Sensitivity guarantees the accurate prediction of θ at an equilibrium.

Second, we want a fair payment rule that is indifferent to data providers identities.

Deﬁnition 3.3 (Symmetry). A payment rule r is symmetric if for all permutation of n elements π( ), ri(D1, . . . , Dn) = rπ(i)(Dπ(1), . . . , Dπ(n)) for all i.

Third, we want non-negative payments and the total payment should not exceed the budget.

Deﬁnition 3.4 (Individual rationality and budget feasibility). A payment rule r is individually rational if ri(D1, . . . , Dn) 0, i, D1, . . . , Dn. A payment rule r is budget feasible if Pn i=1 ri(D1, . . . , Dn) B, D1, . . . , Dn.

We will consider two acquisition settings in this paper:

2A constant payment rule is just a trivial truthful payment rule. 3Using a ﬁxed test set may encourage misreporting.

One-time data acquisition. The data analyst collects data in one batch. In this case, our problem is very similar to the single-task forecast elicitation in [16]. But our model considers the budget feasibility and the IR, whereas they only consider the truthfulness of the mechanism.

Multiple-time data acquisition. The data analyst repeatedly collects data for T 2 days. On day t, (θ(t), D(t) 1 , . . . , D(t) n ) is drawn independently from the same distribution p(θ, D1, . . . , Dn). The analyst has a budget B(t) and wants to know the posterior of θ(t), p(θ(t)|D(t) 1 , . . . , D(t) n ). In this case, our setting differs from the multi-task forecast elicitation in [16] because providers can decide their strategies on a day based on all the observed historical data before that day.4 The multi-task forecast elicitation in [16] asks the agents to submit forecasts of latent labels in multiple similar independent tasks. It is assumed that the agent s forecast strategy for one task only depends on his information about that task but not the information about other tasks.

4 Preliminaries

In this section, we introduce some necessary background for developing our mechanisms. We ﬁrst give the deﬁnitions of exponential family distributions. Our designed mechanism will leverage the idea of mutual information between reported datasets to incentivize truthful reporting.

4.1 Exponential Family

Deﬁnition 4.1 (Exponential family [21]). A likehihood function p(x|θ), for x = (x1, . . . , xn) X n and θ Θ Rm is said to be in the exponential family in canonical form if it is of the form

p(x|θ) = 1 Z(θ)h(x) exp θT φ(x) or p(x|θ) = h(x) exp θT φ(x) A(θ) (1)

Here φ(x) Rm is called a vector of sufﬁcient statistics, Z(θ) = R

X n h(x) exp θT φ(x) is called the partition function, A(θ) = ln Z(θ) is called the log partition function.

In Bayesian probability theory, if the posterior distributions p(θ|x) are in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. Deﬁnition 4.2 (Conjugate prior for the exponential family [21]). For a likelihood function in the exponential family p(x|θ) = h(x) exp θT φ(x) A(θ) . The conjugate prior for θ with parameters ν0, τ 0 is of the form

p(θ) = P(θ|ν0, τ 0) = g(ν0, τ 0) exp ν0θT τ 0 ν0A(θ) . (2)

n Pn i=1 φ(xi). Then the posterior of θ can be represented in the same form as the prior

p(θ|x) exp θT (ν0τ 0 + ns) (ν0 + n)A(θ) = P θ|ν0 + n, ν0τ 0 + ns

where P θ|ν0 + n, ν0τ 0+ns

ν0+n is the conjugate prior with parameters ν0 + n and ν0τ 0+ns

A lot of commonly used distributions belong to the exponential family. Gaussian, Multinoulli, Multinomial, Geometric, etc. Due to the space limit, we introduce only the deﬁnitions and refer the readers who are not familiar with the exponential family to [21] for more details.

4.2 Mutual Information

We will use the point-wise mutual information and the f-mutual information gain deﬁned in [16]. We introduce this notion of mutual information in the context of our problem.

4This is not to say that the providers will update their prior for θ(t) using the data on ﬁrst t 1 days. Because we assume that θ(t) is independent from θ(t 1), . . . , θ(t 1), so the data on ﬁrst t 1 days contains no information about θ(t). We use the same prior p(θ) throughout all T days. What it means is that when the analyst decides the payment for day t not only based on the report on day t but also the historical reports, the providers may also use different strategies for different historical reports.

Deﬁnition 4.3 (Point-wise mutual information). We deﬁne the point-wise mutual information between two datasets D1 and D2 to be

PMI(D1, D2) = Z

p(θ|D1)p(θ|D2)

p(θ) dθ. (3)

For ﬁnite case, we deﬁne PMI(D1, D2) = P

θ Θ p(θ|D1)p(θ|D2)

When |Θ| is ﬁnite or a model in the exponential family is used, the PMI will be computable.

Lemma 4.1. When |Θ| is ﬁnite, PMI( ) can be computed in O(|Θ|) time. If a model in exponential family is used, so that the prior and all the posterior of θ can be written in the form

p(θ) = P(θ|ν0, τ 0) = g(ν0, τ 0) exp ν0θT τ 0 ν0A(θ) ,

p(θ|Di) = P(θ|νi, τ i) and p(θ|D i) = P(θ|ν i, τ i), then the point-wise mutual information can be computed as

PMI(Di, D i) = g(νi, τ i)g(ν i, τ i)

g(ν0, τ 0)g(νi + ν i ν0, νiτ i+ν iτ i ν0τ 0

νi+ν i ν0 ) .

For single-task forecast elicitation, [16] proposes a truthful payment rule.

Deﬁnition 4.4 (log-PMI payment [16]). Suppose there are two data providers reporting e DA and e DB respectively. Then the log-PMI rule pays them r A = r B = log(PMI( e DA, e DB)).

Proposition 4.1. When the log-PMI rule is used, the expected payment equals the mutual information between e DA and e DB, where the expectation is taken over the distribution of e DA and e DB.

We now give the deﬁnition of the f-mutual information. An f-divergence is a function that measures the difference between two probability distributions.

Deﬁnition 4.5 (f-divergence). Given a convex function f with f(1) = 0, for two distributions over

Ω, p, q Ω, deﬁne the f-divergence of p and q to be Df(p, q) = R

ω Ωp(ω)f q(ω) p(ω) .

The f-mutual information of two random variables is a measure of the mutual dependence of two random variables, which is deﬁned as the f-divergence between their joint distribution and the product of their marginal distributions.

In duality theory, the convex conjugate of a function is deﬁned as follows.

Deﬁnition 4.6 (Convex conjugate). For any function f : R R, deﬁne the convex conjugate function of f as f (y) = supx xy f(x).

The following inequality ([22, 16]) will be used in our proof.

Lemma 4.2 (Lemma 1 in [22]). For any differentiable convex function f with f(1) = 0, any two distributions over Ω, p, q Ω, let G be the set of all functions from Ωto R, then we have

Df(p, q) sup g G

ω Ω g(ω)q(ω) f (g(ω))p(ω) dω = sup g G Eqg Epf (g).

A function g achieves equality if and only if g(ω) f q(ω) p(ω) ω with p(ω) > 0, where f q(ω)

represents the subdifferential of f at point q(ω)/p(ω).

5 One-time Data Acquisition

In this section we apply [16] s log-PMI payment rule to our one-time data acquisition problem. The log-PMI payment rule ensures truthfulness, but its payment can be negative or unbounded or even ill-deﬁned. So we mainly focus on the mechanism s sensitivity, budget feasibility and IR. To guarantee budget feasibility and IR, our mechanism requires a lower bound and an upper bound of PMI, which may be difﬁcult to ﬁnd for some models in the exponential family.

If the analyst knows the distribution p(Di|θ), then she will be able to compute p(D i|Di) = P

θ p(D i|θ)p(θ|Di). In this case, we can employ peer prediction mechanisms [19] to design payments and guarantee truthfulness. In Appendix C.1, we give an example of such mechanisms.

In this work we do not assume that p(Di|θ) is known (see Example 3.2). When p(Di|θ) is unknown but the analyst can compute p(θ|Di), our idea is to use the log-PMI payment rule in [16] and then add a normalization step to ensure budget feasibility and IR. However the log-PMI will be ill-deﬁned if PMI = 0. To avoid this, for each possible D i, we deﬁne set Di(D i) = {Di|PMI(Di, D i) > 0} and the log-PMI will only be computed for e Di Di( e D i).

The normalization step will require an upper bound R and lower bound L of the log-PMI payment.5 If |Θ| is ﬁnite, we can ﬁnd a lower bound and an upper bound in polynomial time, which we prove in Appendix C.2. When a model in the exponential family is used, it is more difﬁcult to ﬁnd L and R. By Lemma 4.1, if the g function is bounded, we will be able to bound the payment. For example, if we are estimating the mean of a univariate Gaussian with known variance, L and R will be bounded if the number of data points is bounded. Details can be found in Appendix C.3. Our mechanism works as follows.

Mechanism 1: One-time data collecting mechanism.

(1) Ask all data providers to report their datasets e D1, . . . , e Dn. (2) If data provider i s reported dataset e Di Di( e D i), we compute a score for his dataset si = log PMI( e Di, e D i). (3) The ﬁnal payment for data provider i is:

ri( e D1, . . . , e Dn) = B

R L if e Di Di( e D i) 0 otherwise.

Theorem 5.1. Mechanism 1 is IR, truthful, budget feasible, symmetric.

Note that by Proposition 4.1, the expected payment for a data provider is decided by the mutual information between his data and other people s data. The payments are efﬁciently computable for ﬁnite-size Θ and for models in exponential family (Lemma 4.1).

Next, we discuss the sensitivity. In [16], checking whether the mechanism will be sensitive requires knowing whether a system of linear equations (which has an exponential size in our problem) has a unique solution. So it is not clear how likely the mechanisms will be sensitive. In our data acquisition setting, we are able to give much stronger and more explicit guarantees. This kind of stronger guarantee is possible because of the special structure of the reports (or the signals) that each dataset consists of i.i.d. samples.

We ﬁrst deﬁne some notations. When |Θ| is ﬁnite, let Q i be a (Πj [n],j =i|Dj|Nj) |Θ| matrix that represents the conditional distribution of θ conditioning on every realization of D i. So the element in row D i and column θ is equal to p(θ|D i). We also deﬁne the data generating matrix Gi with |Di| rows and |Θ| columns. Each row corresponds to a possible data point di Di in the dataset and each column corresponds to a θ Θ. The element in the row corresponding to data point di and the column θ is p(θ|di). We give the sufﬁcient condition for the mechanism to be sensitive. Theorem 5.2. When |Θ| is ﬁnite, Mechanism 1 is sensitive if for all i, Q i has rank |Θ|.

Since the size of Q i can be exponentially large, it may be computationally infeasible to check the rank of Q i. We thus give a simpler condition that only uses Gi, which has a polynomial size. Deﬁnition 5.1. The Kruskal rank (or k-rank) of a matrix M, denoted by rankk(M), is the maximal number r such that any set of r columns of M is linearly independent. Corollary 5.1. When |Θ| is ﬁnite, Mechanism 1 is sensitive if for all i, P j =i (rankk(Gj) 1) Nj + 1 |Θ|, where Nj is the number of data points in Dj.

In Appendix C.5.1, we also give a lower bound for α so that Mechanism 1 is α-sensitive.

Our sensitivity results (Theorem 5.2 and Corollary 5.1) basically mean that when there is enough correlation between other people s data D i and θ, the mechanism will be sensitive. Corollary 5.1

5WLOG, we can assume that L < R here. Because L = R implies that all agents datasets are independent.

quantiﬁes the correlation using the k-rank of the data generating matrix. It is arguably not difﬁcult to have enough correlation: a naive relaxation of Corollary 5.1 says that assuming different θ lead to different data distributions (so that rankk(Gj) 2), the mechanism will be sensitive if for any provider i, the total number of other people s data points |Θ| 1.

When Θ Rm, it becomes more difﬁcult to guarantee sensitivity. Suppose the data analyst uses a model from the exponential family so that the prior and all the posterior of θ can be written in the form in Lemma 4.1. The sensitivity of the mechanism will depend on the normalization term g(ν, τ) (or equivalently, the partition function) of the pdf. More speciﬁcally, deﬁne

h D i(νi, τ i) = g(νi, τ i)

g(νi + ν i ν0, νiτ i+ν iτ i ν0τ 0

νi+ν i ν0 ) , (4)

then we have the following sufﬁcient and necessary conditions for the sensitivity of the mechanism. Theorem 5.3. When Θ Rm, if the data analyst uses a model in the exponential family, then Mechanism 1 is sensitive if and only if for any (ν i, τ i) = (νi, τ i), we have Pr D i[h D i(ν i, τ i) = h D i(νi, τ i)] > 0.

The theorem basically means that the mechanism will be sensitive if any pairs of different reports that will lead to different posteriors of θ can be distinguished by h D i( ) with non-zero probability. However, for different models in the exponential family, this is not always true. For example, if we estimate the mean µ of a univariate Gaussian with a known variance and the Gaussian conjugate prior is used, then the normalization term only depends on the variance but not the mean, so in this case h( ) can only detect the change in variance, which means that the mechanism will be sensitive to replication and withholding, but not necessarily other types of manipulations. But if we estimate the mean of a Bernoulli distribution whose conjugate prior is the Beta distribution, then the partition function will be the Beta function, which can detect different posteriors and thus the mechanism will be sensitive. See Appendix C.4 for more details. The missing proofs can be found in Appendix C.5.

6 Multiple-time Data Acquisition

Now we consider the case when the data analyst needs to repeatedly collect data for the same task. At day t, the analyst has a budget B(t) and a new ensemble (θ(t), D(t) 1 , . . . , D(t) n ) is drawn from the same distribution p(θ, D1, . . . , Dn), independent of the previous data. Again we assume that the data generating distribution p(Di|θ) can be unknown, but the analyst is able to compute p(θ|Di)) after seeing the data(See Example 3.2). The data analyst can use the one-time purchasing mechanism (Section 5) at each round. But we show that if the data analyst can give the payment one day after the data is reported, a broader class of mechanisms can be applied to guarantee the desirable properties, which ensures bounded payments without any assumptions on the underlying distribution. Our method is based on the f-mutual information gain in [16] for multi-task forecast elicitation. The payment function in [16] has a minor error. We correct the payment function in this work.6

Our mechanism (Mechanism 2) works as follows. On day t, the data providers are ﬁrst asked to report their data for day t. Then for each provider i, we use the other providers reported data on day t 1 and day t to evaluate provider i s reported data on day t 1. A score si will be computed for each provider s e D(t 1) i . The score si is deﬁned in the same way as the f-mutual information gain in [16], which is speciﬁed by a differentiable convex function f : R R and its convex conjugate f (Deﬁnition 4.6),

PMI( e D(t 1) i , e D(t) i)

PMI( e D(t 1) i , e D(t 1) i )

The score is deﬁned in this particular way because it can guarantee truthfulness according to Lemma 4.2. It can be proved that when the agents truthfully report Di, the expectation of si will reach the supremum in Lemma 4.2, and will then be equal to the f-mutual information of D(t 1) i

6The term PMI( ) in the payment function of [16] should actually be 1/PMI( ). This is because when [16] cited Lemma 1 in [22], q( )/p( ) is mistakenly replaced by p( )/q( ). 7Here we assume that PMI( ) is non-zero. For PMI( ) = 0, we can just do the same as in the one-time acquisition mechanism.

and D(t 1) i . We can further prove that if data provider i reports a dataset e D(t 1) i that leads to a

different posterior p(θ| e D(t 1) i ) = p(θ|D(t 1) i ), the expectation of si will deviate from the supremum in Lemma 4.2 and thus get lower.

According to the deﬁnition (5), if we carefully choose the convex function f to be a differentiable convex function with a bounded derivative f [0, U] and with the convex conjugate f bounded on [0, U], then the scores s1, . . . , sn will always be bounded. We can then normalize s1, . . . , sn so that the payments are non-negative and the total payment does not exceed B(t 1). Here we give one possible choice of f that can guarantee bounded scores: the Logistic function 1 1+e x .

f(x) f (x) range of f (x) f (x) range of f (x) ln(1 + ex) 1 1+e x [ 1

2, 1) on R 0 x ln x + (1 x) ln(1 x) [ ln 2, 0] on [ 1

Finally, if day t is the last day, we adopt the one-time mechanism to pay for day t s data as well.

Mechanism 2: Multi-time data collecting mechanism.

Given a differentiable convex function f with f [0, U] and f bounded on [0, U] for t = 1, . . . , T do

(1) On day t, ask all data providers to report their datasets e D(t) 1 , . . . , e D(t) n . (2) If t is the last day t = T, use the payment rule of Mechanism 1 to pay for day T s data or just give each data provider B(T )/n. (3) If t > 1, give the payments for day t 1 as follows. First compute all the scores si as in (5). Then normalize the scores so that the total payment is no more than B(t 1). Let the range of the scores be [L, R]. Assign payments

ri( e D(t 1) 1 , . . . , e D(t 1) n ) = B(t 1)

R L . end for

Our ﬁrst result is that Mechanism 2 guarantees all the basic properties of a desirable mechanism. Theorem 6.1. Given any differentiable convex function f that has (1) a bounded derivative f [0, U] and (2) the convex conjugate f bounded on [0, U], Mechanism 2 is IR, budget feasible, truthful and symmetric in all T rounds.

If we choose computable f and f (e.g. f equal to the Logistic function), the payments will also be computable for ﬁnite-size Θ and for models in exponential family (Lemma 4.1). If we use a strictly convex function f with f > 0, then Mechanism 2 has basically the same sensitivity guarantee as Mechanism 1 in the ﬁrst T 1 rounds. We defer the sensitivity analysis to Appendix D.1. The missing proofs in this section can be found in Appendix D.2.

7 Discussion

Our work leaves some immediate open questions. Our one-time data acquisition mechanism requires a lower bound and an upper bound of PMI, which may be difﬁcult to ﬁnd for some models in the exponential family. Can we ﬁnd a mechanism that would work for any data distribution, just as our multi-time data acquisition mechanism? Another interesting direction is to design stronger mechanisms to strengthen the sensitivity guarantees. Finally, our method incentivizes truthful reporting, but it is not guaranteed that datasets that give more accurate posteriors will receive higher payments (in expectation). It would be desirable if the mechanism could have this property as well.

An observation of our mechanism also raises an important issue of adversarial attacks for data acquisition. Our mechanism guarantees that the data holders who have some data will truthfully report the data at the equilibrium. But an adversary without any real data can submit some fake data and get a positive payoff. The situation is even worse if the adversary can create multiple fake accounts to submit data. This is tied to the individual rationality guarantee that we placed on our mechanism, which requires the payments to always be non-negative and which is generally considered an important property for incentivizing desirable behavior of strategic agents. However, this observation suggests that future work needs to carefully explore the tradeoff between guarantees for strategic agents and guarantees for adversarial agents for data acquisition problems.

Broader Impact

The results in this work are mainly theoretical. They contribute to the ongoing efforts on encouraging data sharing, ensuring data quality and distributing values of generated from data back to data contributors.

Acknowledgments and Disclosure of Funding

This work is supported by the National Science Foundation under grants CCF-1718549 and IIS2007887.

[1] Jacob Abernethy, Yiling Chen, Chien-Ju Ho, and Bo Waggoner. Low-cost learning via active data procurement. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, EC 15, pages 619 636, New York, NY, USA, 2015. ACM.

[2] Anish Agarwal, Munther Dahleh, and Tuhin Sarkar. A marketplace for data: An algorithmic solution. In Proceedings of the 2019 ACM Conference on Economics and Computation, pages 701 726, 2019.

[3] Glenn W Brier. Veriﬁcation of forecasts expressed in terms of probability. Monthly weather review, 78(1):1 3, 1950.

[4] Y. Cai, C. Daskalakis, and C. H. Papadimitriou. Optimum statistical estimation with strategic data sources. In Proceedings of the 28th Conference on Learning Theory, pages 280 296, 2015.

[5] Yiling Chen, Nicole Immorlica, Brendan Lucier, Vasilis Syrgkanis, and Juba Ziani. Optimal data acquisition for statistical estimation. In Proceedings of the 2018 ACM Conference on Economics and Computation, EC 18, pages 27 44, New York, NY, USA, 2018. ACM.

[6] Yiling Chen and Shuran Zheng. Prior-free data acquisition for accurate statistical estimation. In Proceedings of the 2019 ACM Conference on Economics and Computation, pages 659 677, 2019.

[7] Rachel Cummings, Stratis Ioannidis, and Katrina Ligett. Truthful linear regression. In Proceedings of the 28th Conference on Learning Theory, pages 448 483, 2015.

[8] Anirban Dasgupta and Arpita Ghosh. Crowdsourced judgement elicitation with endogenous proﬁciency. In Proceedings of the 22nd international conference on World Wide Web, pages 319 330, 2013.

[9] Fang Fang, Maxwell Stinchcombe, and Andrew Whinston. " putting your money where your mouth is"-a betting platform for better prediction. Review of Network Economics, 6(2), 2007.

[10] Lisa Fleischer and Yu-Han Lyu. Approximately optimal auctions for selling privacy when costs are correlated with data. Co RR, abs/1204.4031, 2012.

[11] Rafael Frongillo and Jens Witkowski. A geometric perspective on minimal peer prediction. ACM Transactions on Economics and Computation (TEAC), 5(3):1 27, 2017.

[12] Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. ar Xiv preprint ar Xiv:1904.02868, 2019.

[13] Arpita Ghosh, Katrina Ligett, Aaron Roth, and Grant Schoenebeck. Buying private data without veriﬁcation. Co RR, abs/1404.6003, 2014.

[14] Arpita Ghosh and Aaron Roth. Selling privacy at auction. In Proceedings of the 12th ACM Conference on Electronic Commerce, EC 11, pages 199 208, New York, NY, USA, 2011. ACM.

[15] Yuqing Kong. Dominantly truthful multi-task peer prediction with a constant number of tasks. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2398 2411. SIAM, 2020.

[16] Yuqing Kong and Grant Schoenebeck. Water from two rocks: Maximizing the mutual information. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 177 194, 2018.

[17] Yuqing Kong and Grant Schoenebeck. An information theoretic framework for designing information elicitation mechanisms that reward truth-telling. ACM Transactions on Economics and Computation (TEAC), 7(1):1 33, 2019.

[18] Yang Liu, Juntao Wang, and Yiling Chen. Surrogate scoring rules. ar Xiv preprint ar Xiv:1802.09158, 2018.

[19] Nolan Miller, Paul Resnick, and Richard Zeckhauser. Eliciting informative feedback: The peer-prediction method. Management Science, 51(9):1359 1373, 2005.

[20] Kevin P Murphy. Conjugate bayesian analysis of the gaussian distribution. 1(2σ2):16.

[21] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.

[22] Xuan Long Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847 5861, 2010.

[23] Kobbi Nissim, Salil Vadhan, and David Xiao. Redrawing the boundaries on purchasing data from privacy-sensitive individuals. In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, ITCS 14, pages 411 422, New York, NY, USA, 2014. ACM.

[24] Dražen Prelec. A bayesian truth serum for subjective data. science, 306(5695):462 466, 2004.

[25] Aaron Roth and Grant Schoenebeck. Conducting truthful surveys, cheaply. In Proceedings of the 13th ACM Conference on Electronic Commerce, EC 12, 2012.

[26] Victor Shnayder, Arpit Agarwal, Rafael Frongillo, and David C Parkes. Informed truthfulness in multi-task peer prediction. In Proceedings of the 2016 ACM Conference on Economics and Computation, pages 179 196, 2016.

[27] Nicholas D Sidiropoulos and Rasmus Bro. On the uniqueness of multilinear decomposition of n-way arrays. Journal of Chemometrics: A Journal of the Chemometrics Society, 14(3):229 239, 2000.

[28] Bo Waggoner, Rafael Frongillo, and Jacob D Abernethy. A market framework for eliciting private data. In Advances in Neural Information Processing Systems, pages 3510 3518, 2015.

[29] Chang Xu, Dacheng Tao, and Chao Xu. A survey on multi-view learning. ar Xiv preprint ar Xiv:1304.5634, 2013.

[30] Shuran Zheng, Bo Waggoner, Yang Liu, and Yiling Chen. Active information acquisition for linear optimization. ar Xiv preprint ar Xiv:1709.10061, 2017.