# Privacy-aware Synthesizing for Crowdsourced Data

Mengdi Huai¹, Di Wang², Chenglin Miao², Jinhui Xu² and Aidong Zhang¹
¹Department of Computer Science, University of Virginia, VA, USA
²Department of Computer Science and Engineering, SUNY at Buffalo, NY, USA
mh6ck@virginia.edu, {dwang45, cmiao, jinhui}@buffalo.edu, aidong@virginia.edu

## Abstract

Although releasing crowdsourced data brings many benefits to data analyzers who conduct statistical analysis, it may violate crowd users' data privacy. A potential way to address this problem is to employ traditional differential privacy (DP) mechanisms and perturb the data with some noise before releasing them. However, considering that there usually exist conflicts among the crowdsourced data and that these data are usually large in volume, directly using these mechanisms cannot guarantee good utility in the setting of releasing crowdsourced data. To address this challenge, in this paper we propose a novel privacy-aware synthesizing method (i.e., PrisCrowd) for crowdsourced data, based on which the data collector can release users' data with strong privacy protection for their private information, while at the same time the data analyzer can achieve good utility from the released data. Both theoretical analysis and extensive experiments on real-world datasets demonstrate the desired performance of the proposed method.

## 1 Introduction

In recent years, crowdsourcing has emerged as a popular and fast paradigm to solve many challenging data analysis tasks. Through the power of the crowd, data collectors (e.g., hospitals, foundations and government agencies) can easily obtain large amounts of useful information. At the same time, the proliferation of new information techniques enables these data collectors to easily share the data they collect from a crowd of users (e.g., patients, customers) with researchers or data analyzers. From such a wealth of shared data, researchers or data analyzers can discover useful knowledge or patterns to improve the quality of products, the management of public health, and so on. For example, in healthcare applications, the adverse events about a new drug can be easily collected by hospitals from different patients. If the hospitals are willing to share these medical data, it would be very useful for drug makers or medical research institutions to understand the efficacy of the drug.

Although the sharing of crowdsourced data brings many benefits, it may introduce privacy issues [Miao et al., 2015; Shi and Wu, 2017; Miao et al., 2017; Feng et al., 2017]. Consider the above example, where the hospital aims to collect the adverse events about a new drug from different patients. The patients usually trust the hospital and are willing to provide all the requested information. But if the hospital directly releases the patients' medical data to the drug makers, the private information of the patients would be disclosed. Without effective privacy-preserving mechanisms, the patients may not allow their data to be released. Thus, it is essential to address how to enable data collectors to release crowdsourced data without disclosing users' private information. Among existing privacy-preserving techniques, differential privacy (DP) has drawn significant attention as it can provide very rigorous privacy and utility guarantees [Dwork et al., 2006].
However, this technique has several practical limitations when it is applied in the setting of releasing crowdsourced data. First, since the crowdsourced data on an object (e.g., the new drug) are usually collected from multiple users or sources, there inevitably exist conflicts among these data. The reasons include incomplete views of observations, environment noise, different knowledge bases and even the intent to deceive. Directly applying DP to these data cannot eliminate the conflicts, and this will certainly degrade the accuracy of the data analysis results. Additionally, DP is usually achieved by adding noise via the Laplace or exponential mechanisms [Dwork et al., 2006]. The noise scale introduced by the Laplace mechanism is proportional to the number of data records, and such noise may make the data useless considering that a crowdsourced dataset usually contains large amounts of data records. Although the noise introduced by the exponential mechanism does not depend on the number of data records, it depends on the domain of the input data [Dwork et al., 2014], which may also make the crowdsourced data useless because these data usually have large domains.

To address the above challenges, in this paper we propose a novel sampling-based privacy-aware synthesizing method for crowdsourced data (PrisCrowd). In this method, the data collector first learns the underlying patterns (i.e., densities) of the data for the objects by assigning each user a fine-grained weight (or reliability degree) on each object. Then, for each object, the data collector samples a set of candidate synthetic data from the learned density. Finally, these synthetic data are subjected to our proposed privacy test, and the data collector only releases the synthetics that pass the privacy test. The proposed method can not only extract high-quality crowdsourced data by differentiating each user's fine-grained reliability degrees on different objects, but also achieve DP without injecting noise into the data. Both theoretical analysis and extensive experiments on real-world datasets are provided to verify the desirable performance of the proposed method.

## 2 Problem Setting

This paper considers a data releasing scenario in which a crowd of users and a data collector are involved. The users (or data sources) are the individuals (e.g., patients, customers) who can observe some objects (e.g., drugs, commodities) and provide claims for them. The data collector is an individual or institution (e.g., a hospital, an online store) who collects the claims for these objects from a crowd of users and then releases these claims to the public, either voluntarily or for financial incentives. Here, we assume that the collector is trusted and the security threats mainly come from the public.

**Problem formulation.** Suppose there are $N$ objects $\mathcal{O} = \{o_i\}_{i=1}^{N}$ which are observed by $M$ users $\mathcal{U} = \{1, 2, \ldots, M\}$. For each object $o_i$, the claims of the users are denoted as $\mathcal{X}_i = \{x_{i,s}\}_{s \in \mathcal{U}_i}$, where $x_{i,s}$ represents the claim provided by user $s$ for object $o_i$ and $\mathcal{U}_i$ represents the set of users who provide claims for this object. The claims collected by the data collector from all users are denoted as $\mathcal{X} = \{\mathcal{X}_i\}_{i=1}^{N}$, which need to be released to the public.
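As a concrete illustration of this formulation, the following minimal sketch (names are hypothetical, not from the paper) shows one way the collected claims $\mathcal{X}$ could be held in memory as a nested mapping from objects to users to claims.

```python
# Illustrative sketch only (not from the paper): the collected claims X held
# as a mapping object_id -> {user_id -> claim}, i.e., claims[i][s] = x_{i,s}.
from typing import Dict

Claims = Dict[int, Dict[int, float]]

claims: Claims = {
    0: {1: 12.3, 4: 11.8, 7: 30.0},   # object o_0, observed by users 1, 4, 7
    1: {2: 0.52, 4: 0.49},            # object o_1, observed by users 2 and 4
}

def users_of(claims: Claims, i: int) -> set:
    """U_i: the set of users who provided a claim for object o_i."""
    return set(claims[i].keys())
```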
Our goal in this paper is to design a mechanism based on which the data collector can release users' claims with strong privacy protection for their private information, while at the same time the data analyzer can achieve good utility from the released data.

## 3 Preliminary

**Definition 1 (Differential Privacy [Dwork et al., 2006]).** A randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private if for all neighboring datasets $D, D'$ and for all events $S$ in the output space of $\mathcal{A}$, the following holds: $\Pr(\mathcal{A}(D) \in S) \le e^{\epsilon} \Pr(\mathcal{A}(D') \in S) + \delta$.

Kernel density estimation (KDE) is a statistically sound method for estimating a continuous distribution. Suppose there are $n$ independent observations $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$ following an unknown true density $f(x)$. The standard KDE $\hat{f}(x)$ for the estimation of $f(x)$ is defined as $\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n} K(x, x_i)$. The following assumption will be used throughout the paper.

**Assumption 1.** For a vector $x_i \in \mathbb{R}^d$, we assume that the kernel function satisfies $K(x, x_i) = K(x - x_i)$. Furthermore, $K(x - x_i)$ is essentially a bump centered at $x_i$. More specifically, we take $K(x) = |H|^{-\frac{1}{2}} k(H^{-\frac{1}{2}} x)$, where $H$ is the bandwidth matrix and the kernel $k$ itself is a probability density with zero mean and identity covariance satisfying $\lim_{\|x\| \to \infty} \|x\|^{d} k(x) = 0$.

Common choices for $k$ that satisfy the above assumption include the Gaussian and Epanechnikov kernels. As an example, Fig. 1 visualizes the construction of the standard KDE of 5 data points (black circles) using the well-known Gaussian kernel, defined as $K(x - x_i) = \big(\frac{1}{\sqrt{2\pi}\,h}\big)^{d} \exp\big(-\frac{\|x - x_i\|^{2}}{2h^{2}}\big)$, where $h$ is the bandwidth.

Figure 1: An example of the standard KDE. The red curves are the component densities; each red curve is a scaled version of the normal density curve centered at a datum. The standard KDE is obtained by summing these five scaled components.

## 4 Methodology

### 4.1 Overview

To achieve the goal described in Section 2, we propose a novel privacy-aware synthesizing method for crowdsourced data (i.e., PrisCrowd), which contains two phases. In the first phase, we propose to use the weighted KDE as an intermediate representation of the raw data. This intermediate representation can well capture the statistical properties of the raw data. In the second phase, we first sample a set of candidate synthetic claims from the densities learned in the first phase; then each of these candidate claims is subjected to the proposed privacy test. If a claim passes the privacy test, it will be released; otherwise it will be discarded. The flowchart of the proposed two-phase method is shown in Fig. 2.

Figure 2: Privacy-aware synthesizing for crowdsourced data.

### 4.2 Weighted KDE-based Data Representation

In order to share the wealth of collected data with the data analyzer, the data collector first needs to learn the characteristics or underlying patterns of the original data, i.e., the informative density distributions of the objects. To estimate the density of each object, standard kernel density estimation (KDE) can be adopted. Additionally, since different users may provide different claims for the same object, the reliability degrees (or weights) of these users should be taken into account when estimating the densities [Li et al., 2014b; Li et al., 2014a; Li et al., 2016; Miao et al., 2019]. However, the standard KDE cannot differentiate the importance of users (i.e., user reliability degrees).
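To make the role of such weights concrete, below is a minimal sketch of a weighted Gaussian KDE, of which the standard KDE is the uniform-weight special case. It assumes one-dimensional claims and a Gaussian kernel; the function names are illustrative, not the authors' code.

```python
import numpy as np

def weighted_kde(x_grid, claims, weights, h):
    """Evaluate the weighted KDE sum_s w_s * K_h(x - x_s) on x_grid.

    claims  : 1-D array of claims x_{i,s} for one object.
    weights : nonnegative weights summing to 1 (uniform weights -> standard KDE).
    h       : bandwidth of the Gaussian kernel.
    """
    x_grid = np.asarray(x_grid, dtype=float)
    claims = np.asarray(claims, dtype=float)
    weights = np.asarray(weights, dtype=float)
    diffs = (x_grid[:, None] - claims[None, :]) / h               # shape (grid, n)
    kernels = np.exp(-0.5 * diffs ** 2) / (np.sqrt(2 * np.pi) * h)
    return kernels @ weights                                      # weighted sum over claims

# The standard KDE is recovered with uniform weights.
x_grid = np.linspace(0, 15, 200)
claims = [4.0, 5.5, 6.0, 9.0, 12.0]
uniform = np.full(len(claims), 1.0 / len(claims))
density = weighted_kde(x_grid, claims, uniform, h=1.0)
```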
In order to learn users' reliability and compute the densities of objects simultaneously, we propose a novel method which estimates users' global and local weights, and then combines them to learn the objects' informative density distributions. A user's global weight reflects his capability to provide truthful information for all the objects, while the local weights reflect that this user may have different confidence when providing claims for different objects. The advantage of the proposed method is that it can estimate a reasonable reliability for each user and, in turn, learn accurate informative density distributions for the objects.

#### Global Weight Estimation

To evaluate the overall importance of users, the data collector assigns a global weight $g_s \in \mathbb{R}$ to each user $s$. Meanwhile, we can obtain a global density $f^g_i$ for each object $o_i$, which should be close to the distribution of claims from reliable users. The distribution of an input claim $x_{i,s}$ can be represented by $K_i(x, x_{i,s})$ (where $x \in \mathbb{R}$ is a variable), i.e., the kernel function associated with a reproducing kernel Hilbert space $\mathcal{H}_i$. To minimize the weighted deviation from the estimated densities $Q = \{f^g_i(x)\}_{i=1}^{N}$ to the multi-user input $\mathcal{X} = \{\mathcal{X}_i\}_{i=1}^{N}$, we propose the following optimization framework:
$$\min_{G,\, Q} \sum_{s} g_s \sum_{i \in \mathcal{O}_s} d_{\mathcal{H}_i}\big(K_i(x, x_{i,s}), f^g_i(x)\big) \quad \text{s.t.} \; \sum_{s} \exp(-g_s) = 1, \tag{1}$$
where $\mathcal{O}_s$ denotes the set of objects observed by user $s$, $G = \{g_s\}_{s}$, and the normalized squared loss $d_{\mathcal{H}_i}(\cdot, \cdot)$ is defined as $d_{\mathcal{H}_i}\big(K_i(x, x_{i,s}), f^g_i(x)\big) = \big\| K_i(x, x_{i,s}) - f^g_i(x) \big\|^2_{\mathcal{H}_i}$.

The global loss function (i.e., Eq. (1)) extends the framework in [Li et al., 2014b] from real space to Hilbert space. We can use an iterative procedure to solve it. Specifically, in the $k$-th iteration, $g_s$ is updated as
$$g^{(k+1)}_s = -\log \frac{\sum_{i \in \mathcal{O}_s} d_{\mathcal{H}_i}\big(K_i(x, x_{i,s}), f^{g^{(k)}}_i(x)\big)}{\sum_{s'} \sum_{i \in \mathcal{O}_{s'}} d_{\mathcal{H}_i}\big(K_i(x, x_{i,s'}), f^{g^{(k)}}_i(x)\big)}, \tag{2}$$
where $f^{g^{(k)}}_i(x) = \sum_{t \in \mathcal{U}_i} g^{(k)}_t K_i(x, x_{i,t}) \big/ \big(\sum_{t \in \mathcal{U}_i} g^{(k)}_t\big)$. Eq. (2) shows that a user's global weight is inversely proportional, at the log scale, to the distance between his claims and the estimated global densities. Users whose claims are close to the derived global densities will have higher global weights.

#### Local Weight Estimation

As described above, each user may have different confidence when providing claims for different objects. Thus, we need to model the local weight of each user on every object, which will in turn help to infer accurate density estimations. A potential way to achieve this is to adopt a squared loss function. However, this leads to a problem: each user would receive the same local weight, and the trustworthiness of the claims provided by different users would be equal. In order to address this problem, we use the Hampel loss function [Hampel et al., 2011]:
$$\zeta_{q_1,q_2,q_3}(y) = \begin{cases} \frac{y^2}{2}, & 0 \le y < q_1, \\ q_1 y - \frac{q_1^2}{2}, & q_1 \le y < q_2, \\ \frac{q_1 (y - q_3)^2}{2(q_2 - q_3)} + \frac{q_1(q_2 + q_3 - q_1)}{2}, & q_2 \le y < q_3, \\ \frac{q_1(q_2 + q_3 - q_1)}{2}, & q_3 \le y, \end{cases}$$
where $q_1 < q_2 < q_3$ are predefined nonnegative parameters. These parameters allow us to decrease the trustworthiness of bad claims and increase that of good ones for each object, so the importance of users can be well distinguished. Since we incorporate users' reliability into estimating the local densities, the local kernel density of object $o_i$ can be defined as $f^l_i(x) = \sum_{s \in \mathcal{U}_i} l_{i,s} K_i(x, x_{i,s})$, where $l_{i,s}$ is the local weight of user $s$ on object $o_i$.
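For reference, the piecewise Hampel loss above translates directly into code; the sketch below is only an illustration, with parameter names mirroring the formula.

```python
def hampel_loss(y: float, q1: float, q2: float, q3: float) -> float:
    """Hampel loss zeta_{q1,q2,q3}(y) for y >= 0 with 0 <= q1 < q2 < q3.

    Quadratic near zero, linear in the middle, then smoothly flattening to a
    constant, so claims that deviate badly stop contributing additional loss.
    """
    if y < q1:
        return 0.5 * y ** 2
    if y < q2:
        return q1 * y - 0.5 * q1 ** 2
    if y < q3:
        return q1 * (y - q3) ** 2 / (2.0 * (q2 - q3)) + q1 * (q2 + q3 - q1) / 2.0
    return q1 * (q2 + q3 - q1) / 2.0
```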
Thus, the objective function for estimating $l_i = \{l_{i,s}\}_{s \in \mathcal{U}_i}$ is
$$J(l_i) = \min_{l_i} \sum_{s \in \mathcal{U}_i} \zeta\Big(\big\| K_i(x, x_{i,s}) - \sum_{s' \in \mathcal{U}_i} l_{i,s'} K_i(x, x_{i,s'}) \big\|_{\mathcal{H}_i}\Big), \tag{3}$$
where the argument of $\zeta$ is the Hilbert-space distance between a user's claim and the estimated local density $f^l_i(x)$. This objective function is not convex, and Eq. (3) does not have a closed-form solution. Fortunately, it is possible to approximate $l_i = \{l_{i,s}\}_{s \in \mathcal{U}_i}$ with a standard iteratively re-weighted least squares (IRWLS) algorithm. The iterative procedure for computing $\{l_{i,s}\}_{s \in \mathcal{U}_i}$ is
$$l^{(k+1)}_{i,s} = \frac{\varphi\Big(\big\|K_i(x, x_{i,s}) - \sum_{t \in \mathcal{U}_i} l^{(k)}_{i,t} K_i(x, x_{i,t})\big\|_{\mathcal{H}_i}\Big)}{\sum_{s' \in \mathcal{U}_i} \varphi\Big(\big\|K_i(x, x_{i,s'}) - \sum_{t \in \mathcal{U}_i} l^{(k)}_{i,t} K_i(x, x_{i,t})\big\|_{\mathcal{H}_i}\Big)},$$
where $\varphi$ is the IRWLS weight function induced by the loss $\zeta$ and $k$ denotes the number of iterations. This equation shows that users receive lower weights when they provide bad claims that deviate largely from the center $f^l_i(x) = \sum_{t \in \mathcal{U}_i} l_{i,t} K_i(x, x_{i,t})$.

#### Combined Weight Estimation

For each user $s$, to measure the consistency degree of the global and local weights (i.e., $g_s$ and $l_{i,s}$), we define a mixture weight, named the combined weight $c_{i,s}$. To learn the combined weight, the relative entropy is employed, which minimizes the information loss between a user's global weight and local weight: the smaller the relative entropy value of these weights, the higher the degree of their consistency. The objective of the combined model is
$$\min_{\{c_{i,s}\}_{s \in \mathcal{U}_i}} \sum_{s \in \mathcal{U}_i} c_{i,s} \log \frac{c_{i,s}}{l_{i,s}} + \sum_{s \in \mathcal{U}_i} c_{i,s} \log \frac{c_{i,s}}{g_s} \quad \text{s.t.} \; \sum_{s \in \mathcal{U}_i} c_{i,s} = 1, \; c_{i,s} \ge 0. \tag{4}$$
By solving Eq. (4), we can obtain the combined weight $c_{i,s}$ of user $s$ on object $o_i$ as $c_{i,s} = l_{i,s} g_s \big/ \big(\sum_{t \in \mathcal{U}_i} l_{i,t} g_t\big)$. Based on the learned combined weights, we can obtain the density of object $o_i$, which is the weighted sum of claims in Hilbert space and is given as
$$f_i(x) = \sum_{s \in \mathcal{U}_i} \frac{l_{i,s} g_s}{\sum_{t \in \mathcal{U}_i} l_{i,t} g_t} K_i(x, x_{i,s}). \tag{5}$$

### 4.3 Privacy Test-based Synthetics Release

To provide strong privacy protection for users' private information, in this section we propose a privacy test-based synthetics release method, which contains two steps: candidate synthetics generation and privacy test for candidate synthetics. In the first step, we sample a set of synthetic claims from the density learned in Eq. (5) as the candidate data to release. Then, in the second step, these sampled synthetics are subjected to a privacy test. If a synthetic claim passes the test, it will be released; otherwise it will be discarded.

#### Candidate Synthetics Generation

We first discuss how to generate the synthetic claims $\tilde{\mathcal{X}}_i$ for each object $o_i$. Specifically, we generate each element in $\tilde{\mathcal{X}}_i$ as follows:

1. Select a random integer $s \in \mathcal{U}_i$ with probability $c_{i,s}$;
2. Generate a synthetic claim $\tilde{x}_{i,s}$ by sampling from the probability distribution $K_i(x, x_{i,s})$.

Here $c_{i,s}$ can be treated as the sampling probability that determines whether $x_{i,s}$ is selected or not. In this step, we aim to select some seed data (e.g., $x_{i,s}$) and then probabilistically transform them into synthetic data. The sampling mechanism used here increases the adversary's uncertainty about whether a user's data is in the released dataset or not, and thus it helps to protect users' privacy to some extent. However, it is not enough to use only the sampling mechanism; directly releasing the sampled data can still violate users' privacy [Gehrke et al., 2012]. To tackle this problem, we design the privacy test mechanism described next to further prevent users' private information from being disclosed.
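Before turning to the privacy test, here is a minimal sketch of the two-step candidate generation just described, assuming a one-dimensional Gaussian kernel with bandwidth $h$; the names are illustrative rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_candidate(claims, combined_weights, h):
    """One candidate synthetic claim for an object: pick a seed claim x_{i,s}
    with probability c_{i,s}, then sample from the kernel centered at it.

    Returns (synthetic_claim, seed_index).
    """
    claims = np.asarray(claims, dtype=float)
    probs = np.asarray(combined_weights, dtype=float)
    probs = probs / probs.sum()                      # combined weights c_{i,s}
    s = rng.choice(len(claims), p=probs)             # step 1: pick a seed
    synthetic = rng.normal(loc=claims[s], scale=h)   # step 2: sample K_i(x, x_{i,s})
    return synthetic, s
```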
#### Privacy Test for Candidate Synthetics

To prevent an adversary from deducing that a particular claim in $\mathcal{X}_i$ is more responsible than the other claims for generating the released synthetic data, we propose the following randomized privacy test mechanism. Each candidate synthetic data record in $\tilde{\mathcal{X}}_i$ is subjected to the randomized privacy test, and it is released only when it passes this test. Suppose $k \ge 1$ and $\gamma > 1$ are the privacy parameters, and $\epsilon_0$ is the randomness parameter. Let $\mathcal{M}(\cdot)$ denote the above synthetic data generation procedure, which samples a candidate synthetic claim based on a seed data record. Given $x_{i,s} \in \mathcal{X}_i$, we use $\Pr\{\tilde{x}_{i,s} = \mathcal{M}(x_{i,s})\}$ to denote the probability that the synthetic data record $\tilde{x}_{i,s}$ is generated from $x_{i,s}$ by $\mathcal{M}(\cdot)$. The privacy test procedure for $\tilde{x}_{i,s}$ is as follows:

1. Randomize $k$ by adding noise: $\tilde{k} = k + z$, where $z \sim \mathrm{Lap}(1/\epsilon_0)$ is sampled from the Laplace distribution.
2. Let $a \ge 0$ be the integer that satisfies the inequalities $\gamma^{-a-1} < \Pr\{\tilde{x}_{i,s} = \mathcal{M}(x_{i,s})\} \le \gamma^{-a}$.
3. Let $k'$ be the number of records $x_{i,s'} \in \mathcal{X}_i$ that satisfy $\gamma^{-a-1} < \Pr\{\tilde{x}_{i,s} = \mathcal{M}(x_{i,s'})\} \le \gamma^{-a}$.
4. If $k' \ge \tilde{k}$, $\tilde{x}_{i,s}$ passes the test; otherwise it fails.

Note that $k'$ denotes the number of possible data seeds that can generate $\tilde{x}_{i,s}$ with a probability value falling into a very stringent interval $[\gamma^{-a-1}, \gamma^{-a}]$. The threshold parameter $k$ prevents releasing sensitive synthetic data: under this randomized privacy test, a candidate synthetic data record is released only when there are at least $\tilde{k}$ possible data seeds that can generate $\tilde{x}_{i,s}$. Intuitively, the larger the value of $k$, the larger the number of possible seed data records that are indistinguishable from $\tilde{x}_{i,s}$. Also, the smaller the value of $\gamma$, the more difficult it is to distinguish $\tilde{x}_{i,s}$ from the other possible seed data. Algorithm 1 summarizes the proposed privacy test-based synthetics release procedure, in which $m$ denotes the number of synthetic claims that need to be released for object $o_i$.

**Algorithm 1** Privacy test-based synthetics release for $o_i$

**Input:** $\{c_{i,s}\}_{s \in \mathcal{U}_i}$, $\mathcal{X}_i = \{x_{i,s}\}_{s \in \mathcal{U}_i}$, $k$, $\gamma$, $\epsilon_0$, and $m$.
**Output:** The dataset $\tilde{\mathcal{X}}_i$ that can be released.

1: $\tilde{\mathcal{X}}_i = \emptyset$
2: **while** $|\tilde{\mathcal{X}}_i| < m$ **do**
3: Select a random integer $s \in \mathcal{U}_i$ with probability $c_{i,s}$;
4: Generate a synthetic claim $\tilde{x}_{i,s}$ based on the probability distribution $K_i(x, x_{i,s})$;
5: Conduct the randomized privacy test for $\tilde{x}_{i,s}$;
6: **if** $\tilde{x}_{i,s}$ passes the privacy test **then**
7: $\tilde{\mathcal{X}}_i = \tilde{\mathcal{X}}_i \cup \{\tilde{x}_{i,s}\}$;
8: **end if**
9: **end while**
10: **return** $\tilde{\mathcal{X}}_i$;
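As a concrete illustration, the sketch below implements the randomized privacy test and the release loop of Algorithm 1, assuming the one-dimensional Gaussian kernel from the earlier sketches and treating the kernel density value at a seed as the generation probability $\Pr\{\tilde{x}_{i,s} = \mathcal{M}(x_{i,s})\}$; this is an illustrative simplification, not the authors' implementation.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def kernel_prob(y, seed, h):
    """Illustrative stand-in for Pr{y = M(seed)}: Gaussian kernel density."""
    return math.exp(-0.5 * ((y - seed) / h) ** 2) / (math.sqrt(2 * math.pi) * h)

def passes_privacy_test(y, seeds, seed_idx, k, gamma, eps0, h):
    """Release y only if enough seeds could generate it with a probability in
    the same gamma-bucket as the actual seed (randomized threshold k)."""
    k_tilde = k + rng.laplace(scale=1.0 / eps0)        # step 1: randomize k
    p = kernel_prob(y, seeds[seed_idx], h)
    a = math.floor(-math.log(p, gamma))                # step 2: gamma^{-a-1} < p <= gamma^{-a}
    lo, hi = gamma ** (-(a + 1)), gamma ** (-a)
    k_prime = sum(lo < kernel_prob(y, s, h) <= hi for s in seeds)  # step 3
    return k_prime >= k_tilde                          # step 4

def release_synthetics(seeds, combined_weights, m, k, gamma, eps0, h):
    """Algorithm 1 style loop: keep sampling candidates until m pass the test."""
    seeds = np.asarray(seeds, dtype=float)
    probs = np.asarray(combined_weights, dtype=float)
    probs = probs / probs.sum()
    released = []
    while len(released) < m:
        idx = rng.choice(len(seeds), p=probs)          # seed with probability c_{i,s}
        y = rng.normal(loc=seeds[idx], scale=h)        # candidate from K_i(x, x_{i,s})
        if passes_privacy_test(y, seeds, idx, k, gamma, eps0, h):
            released.append(y)
    return released
```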
### 4.4 Theoretical Analysis

#### Consistency Analysis

In Section 4.3, we generate the synthetic claims for object $o_i$ by sampling from the mixture distribution $f_i(x)$, i.e., $\tilde{x}_{i,s} \sim f_i(x)$. After obtaining the dataset $\tilde{\mathcal{X}}_i = \{\tilde{x}_{i,s}\}_{s=1}^{m}$, a basic question is how well the generated dataset reflects the original density function $f_i(x)$. Since each $\tilde{x}_{i,s}$ is sampled from $f_i(x)$ independently, the density function over $\{\tilde{x}_{i,s}\}_{s=1}^{m}$ can be denoted as $\tilde{f}_i(x) = \frac{1}{m} \sum_{s=1}^{m} K_i(x, \tilde{x}_{i,s})$. In Theorem 1, we provide the expected squared $L_2$-norm distance between $\tilde{f}_i(x)$ and $f_i(x)$.

**Theorem 1.** Under Assumption 1 for $K_i$ with the diagonal bandwidth matrix $H_i = h^2 I_d$, we further assume that the support of $k(z)$ satisfies $\|z\| \le 1$. Then, the expected squared $L_2$-norm distance between $\tilde{f}_i(x)$ and $f_i(x)$, i.e., $J = \mathbb{E}\big[\int (\tilde{f}_i(x) - f_i(x))^2 \, dx\big]$, satisfies
$$J \le 4 A h + A^2 h^2 V + \frac{B}{m h^d} + \frac{A B V}{m h^{d-1}}, \tag{6}$$
where $A = \sup_{x \in \mathbb{R}^d} \|\nabla f_i(x)\|$, $B = \int (k(z))^2 \, dz$, and $V$ is the volume of the support of $f_i(x)$. The expectation is with respect to $\{\tilde{x}_{i,s}\}_{s=1}^{m} \sim f_i(x)$.

This theorem is a general result for the $d$-dimensional case; in this paper, the value of $d$ is 1.

#### Privacy Analysis

Next, we conduct privacy analysis for Algorithm 1. Based on Theorem 2, we know that the proposed algorithm is differentially private.

**Theorem 2.** Note that the input parameters of Algorithm 1 include $\{c_{i,s}\}_{s \in \mathcal{U}_i}$, $k \ge 1$, $\gamma > 1$, and $\epsilon_0$. For any neighboring datasets $\mathcal{X}_i$ and $\mathcal{X}'_i$ such that $|\mathcal{X}_i|, |\mathcal{X}'_i| \ge k$ and any integer $1 \le t < k$, Algorithm 1 is $(\epsilon, \delta)$-differentially private, where $\epsilon = \epsilon_0 + \log\big(1 + \frac{\gamma}{\min_{s \in \mathcal{U}_i} c_{i,s}}\big)$ and $\delta = |\mathcal{U}_i| \max_{s \in \mathcal{U}_i} c_{i,s} \, e^{-\epsilon (k - t)}$.

**Remark 1.** Note that the proposed Algorithm 1 is different from the mechanism in [Bindschaedler et al., 2017]: the probability of choosing the seed $x_{i,s}$ is non-uniform in Algorithm 1, while it is uniform in [Bindschaedler et al., 2017]. The non-uniform property may lead to different differential privacy parameters. When $\max_{s \in \mathcal{U}_i} c_{i,s} = \min_{s \in \mathcal{U}_i} c_{i,s} = \frac{1}{|\mathcal{U}_i|}$ (i.e., we uniformly sample the seed $x_{i,s}$), the above Theorem 2 reduces to Theorem 1 in [Bindschaedler et al., 2017]. Thus, Theorem 2 in our paper is a generalization of Theorem 1 in [Bindschaedler et al., 2017]. Although the main idea of the proof of Theorem 2 is similar to that in [Bindschaedler et al., 2017], the details are quite different: the proof in [Bindschaedler et al., 2017] considers $\mathcal{X}'_i = \mathcal{X}_i \cup \{x_{i,s'}\}$ as the neighboring dataset, while ours considers $\mathcal{X}'_i = (\mathcal{X}_i \setminus \{x_{i,s}\}) \cup \{x_{i,s'}\}$. This is because if we add one data record, the probabilities of sampling the seeds, i.e., $\{c_{i,s}\}$, change entirely, so the proof in [Bindschaedler et al., 2017] does not apply to our case.

## 5 Experiments

**Performance measure.** To evaluate the performance of our method, we adopt two metrics: the integrated squared error (ISE) and the squared integrated squared error (SISE). ISE is defined as $\sum_{i=1}^{N} \int_{-\infty}^{+\infty} (\hat{f}_i - f_i)^2 \, dx$, where $f_i$ and $\hat{f}_i$ are, respectively, the original density and the density derived from the synthetic data for object $o_i$. SISE is defined as $\sum_{i=1}^{N} \big( \int_{-\infty}^{+\infty} (\hat{f}_i - f_i)^2 \, dx \big)^2$. Compared with ISE, SISE penalizes large distances more heavily and small distances less. Since the goal of the collector is to release data whose pattern is similar to the true underlying pattern of the objects, the lower the ISE or SISE, the better the method.
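For concreteness, both metrics can be approximated numerically on an evaluation grid, as in the following sketch (trapezoidal integration; an illustrative implementation, not the authors' evaluation code).

```python
import numpy as np

def ise_sise(true_densities, synth_densities, x_grid):
    """ISE and SISE over all objects, approximated on a shared grid.

    Each entry of the two lists is a density evaluated on x_grid.
    """
    per_object = [
        np.trapz((np.asarray(f_hat) - np.asarray(f)) ** 2, x_grid)
        for f, f_hat in zip(true_densities, synth_densities)
    ]
    ise = float(np.sum(per_object))
    sise = float(np.sum(np.square(per_object)))
    return ise, sise
```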
**Datasets.** We adopt the following real-world datasets for our experiments: the Population dataset [Pasternack and Roth, 2010; Wan et al., 2016], the Stock dataset [Li et al., 2012], and the Indoor Floorplan dataset [Li et al., 2014a]. The statistics of these datasets are provided in Table 1.

| Dataset | # users | # objects |
| --- | --- | --- |
| Population | 2,344 | 1,124 |
| Stock | 55 | 5,521 |
| Indoor Floorplan | 247 | 129 |

Table 1: The statistics of the adopted datasets.

**Baselines.** Here, we adopt two baselines, i.e., Basic and Uniform. In the Basic method, the data collector adds noise to the original data at three privacy levels: $\epsilon = 0.1$ (Strong), $\epsilon = 1$ (Normal) and $\epsilon = 10$ (Weak). In the Uniform method, the collector treats all users equally and the objects' densities are learned with the uniformly weighted kernel density estimation. Here, the synthetic data generation and privacy test procedures are the same as those in our proposed method.

**Case study.** In order to investigate the advantages of the users' combined weights, we conduct case studies on the three real-world datasets. For each dataset, we randomly select two objects as the cases and then estimate their densities. The estimated densities are shown in Fig. 3. The red line in each subfigure represents the density estimated based on users' combined weights. The black line represents the result estimated only based on the global weight of each user. We also conduct estimations without considering user quality, i.e., treating all users equally, and the estimated density for each object is then represented with the green line. The results in Fig. 3 show that the densities estimated based on users' combined weights are the closest to the true densities, which are represented with the blue lines. Additionally, we show the claims of each object in this figure with magenta circles and crosses. We can see that some claims (i.e., the magenta crosses) are far away from the others (i.e., the magenta circles). These claims are usually provided by users with low weights, and they can be treated as outliers when estimating each object's density. The results show that the method based on the combined weight is more robust to outliers than the methods which only adopt users' global weights or treat all users equally. In other words, the estimated density for each object based on PrisCrowd can well reflect the underlying true density of this object.

Figure 3: Case study on real-world datasets. (a) and (b): the two cases for the Population dataset. (c) and (d): the two cases for the Stock dataset. (e) and (f): the two cases for the Indoor Floorplan dataset.

**Accuracy comparison.** In this experiment, we evaluate the accuracy (or quality) of the published synthetic data and explore whether these data can well reflect the underlying true densities of the objects. Here we assume that the data collector releases 30 synthetic claims for each object to the public. The parameters $\gamma$ and $k$ are set to 4 and 5, respectively. In order to evaluate the accuracy of the synthetic claims, we first derive the density (i.e., $\hat{f}_i$) of each object from the synthetic data, and then calculate the ISE and SISE for each dataset. The results are shown in Table 2, from which we can see that the proposed approach performs much better than the baseline methods on all the real-world datasets. That is to say, the synthetic data generated based on our proposed method preserve well the characteristics of the underlying pattern of the objects. Additionally, the results also show that the advantage of our proposed approach on the Stock dataset is larger than that on the Population and Indoor Floorplan datasets. The reason is that there are more outlying data points in the Stock dataset, and our proposed approach is robust to these outliers while the baseline methods are very sensitive to them.

| Measure | Method | Population | Stock | Indoor |
| --- | --- | --- | --- | --- |
| ISE | PrisCrowd | 0.479 | 1.699 | 1.051 |
| ISE | Uniform | 0.628 | 17.628 | 1.220 |
| ISE | Basic (Strong) | 1.183 | 12.799 | 1.943 |
| ISE | Basic (Normal) | 1.119 | 9.867 | 1.937 |
| ISE | Basic (Weak) | 0.866 | 2.125 | 1.882 |
| SISE | PrisCrowd | 6.209 | 11.430 | 8.405 |
| SISE | Uniform | 8.502 | 47.217 | 11.112 |
| SISE | Basic (Strong) | 12.013 | 40.420 | 15.111 |
| SISE | Basic (Normal) | 11.768 | 35.391 | 15.046 |
| SISE | Basic (Weak) | 10.149 | 15.412 | 14.723 |

Table 2: Accuracy comparison on the real-world datasets.

**The effect of the number of sampled claims.** In this experiment, we evaluate the effect of the number of sampled claims for each object on the performance of the proposed method.
Here we vary the number of sampled claims for each object from 1 to 30 and then calculate the ISE and SISE on the three real-world datasets. The results are shown in Fig. 4, from which we can see that the ISE and SISE gradually flatten as the number of sampled claims for each object increases. Taking the Population dataset as an example, when the number of sampled claims is larger than 10, the values of ISE and SISE are almost the same. That is to say, the released data generated based on our proposed method can well reflect the underlying patterns of the objects even if only a few claims are sampled for each object.

Figure 4: Accuracy w.r.t. the number of sampled claims. (a) and (b): Population. (c) and (d): Stock. (e) and (f): Indoor Floorplan.

**Computational cost.** Next, we evaluate the computational cost of the synthetic claims generation procedure, i.e., the second phase of our proposed method. In this experiment, we only generate synthetic claims for the objects whose ground truths can be obtained from the original datasets, and we consider two scenarios, i.e., with privacy tests and without privacy tests. We then vary the number of sampled claims for each object from 1 to 30. The running time of the synthetic claims generation procedure for the Population and Indoor Floorplan datasets is shown in Fig. 5, from which we can see that the running time in the two scenarios is approximately linear with respect to the number of sampled claims for each object. Additionally, the results also show that the privacy test step introduces extra computational cost during the released data generation procedure. This is because each candidate synthetic data record needs to be tested in the privacy test step. Since good utility can be achieved based on our proposed method even if only a few synthetic claims are generated for each object, the computational cost is tolerable in practice.

Figure 5: Running time vs. number of sampled claims for each object. (a): Population. (b): Indoor Floorplan.

## 6 Related Work

Recently, various differentially private data release approaches have been proposed. These methods can be roughly partitioned into two categories: interactive ones and non-interactive ones. In an interactive method [Li et al., 2010; Hardt and Rothblum, 2010; Roth and Roughgarden, 2010], a data analyzer can pose queries via a private mechanism, and the dataset owner answers these queries in response.
In the non-interactive framework [Nissim et al., 2007; Bindschaedler and Shokri, 2016; Blum et al., 2013; Wang et al., 2018; Wang et al., 2019], a data owner releases a private version of the original data. Once the data are published, the owner has no further control over them. The method in our paper is non-interactive. The typical approach to protecting data privacy in the non-interactive context is to directly add noise, which is taken by [Bindschaedler and Shokri, 2016; Blum et al., 2013]. These works are either computationally infeasible on high-dimensional data or practically ineffective because of their large utility costs. There are also some other works [Bindschaedler and Shokri, 2016] which release private data without adding noise, but they are unsuitable for the newly appearing crowdsourcing setting considered in this paper, where multiple sources provide multiple observations for multiple objects.

## 7 Conclusions

In this paper, we propose a novel privacy-aware synthesizing method for crowdsourced data. Based on this method, the data collector can release the crowdsourced data with strong privacy protection for users' private information, while at the same time the data analyzer can achieve good utility from the released data. Both theoretical analysis and extensive experiments on real-world datasets verify the effectiveness of the proposed method.

## Acknowledgements

This work was supported in part by the US National Science Foundation (NSF) under grants IIS-1514204 and CCF-1716400. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

## References

[Bindschaedler and Shokri, 2016] Vincent Bindschaedler and Reza Shokri. Synthesizing plausible privacy-preserving location traces. In Proceedings of S&P, pages 546-563, 2016.

[Bindschaedler et al., 2017] Vincent Bindschaedler, Reza Shokri, and Carl A. Gunter. Plausible deniability for privacy-preserving data synthesis. Proceedings of the VLDB Endowment, 10(5):481-492, 2017.

[Blum et al., 2013] Avrim Blum, Katrina Ligett, and Aaron Roth. A learning theory approach to noninteractive database privacy. Journal of the ACM (JACM), 60(2):12, 2013.

[Dwork et al., 2006] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265-284, 2006.

[Dwork et al., 2014] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.

[Feng et al., 2017] Wei Feng, Zheng Yan, Hengrun Zhang, Kai Zeng, Yu Xiao, and Y. Thomas Hou. A survey on security, privacy, and trust in mobile crowdsourcing. IEEE Internet of Things Journal, 5(4):2971-2992, 2017.

[Gehrke et al., 2012] Johannes Gehrke, Michael Hay, Edward Lui, and Rafael Pass. Crowd-blending privacy. In Proceedings of CRYPTO, pages 479-496, 2012.

[Hampel et al., 2011] Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley Online Library, 2011.

[Hardt and Rothblum, 2010] Moritz Hardt and Guy N. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In Proceedings of FOCS, pages 61-70, 2010.
[Li et al., 2010] Chao Li, Michael Hay, Vibhor Rastogi, Gerome Miklau, and Andrew McGregor. Optimizing linear counting queries under differential privacy. In Proceedings of PODS, pages 123-134, 2010.

[Li et al., 2012] Xian Li, Xin Luna Dong, Kenneth Lyons, Weiyi Meng, and Divesh Srivastava. Truth finding on the deep web: Is the problem solved? Proceedings of the VLDB Endowment, 6(2):97-108, 2012.

[Li et al., 2014a] Qi Li, Yaliang Li, Jing Gao, Lu Su, Bo Zhao, Murat Demirbas, Wei Fan, and Jiawei Han. A confidence-aware approach for truth discovery on long-tail data. PVLDB, 2014.

[Li et al., 2014b] Qi Li, Yaliang Li, Jing Gao, Bo Zhao, Wei Fan, and Jiawei Han. Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation. In Proceedings of SIGMOD, pages 1187-1198, 2014.

[Li et al., 2016] Yaliang Li, Qi Li, Jing Gao, Lu Su, Bo Zhao, Wei Fan, and Jiawei Han. Conflicts to harmony: A framework for resolving conflicts in heterogeneous data by truth discovery. IEEE Transactions on Knowledge and Data Engineering (TKDE), 28(8):1986-1999, 2016.

[Miao et al., 2015] Chenglin Miao, Wenjun Jiang, Lu Su, Yaliang Li, Suxin Guo, Zhan Qin, Houping Xiao, Jing Gao, and Kui Ren. Cloud-enabled privacy-preserving truth discovery in crowd sensing systems. In Proceedings of SenSys, pages 183-196, 2015.

[Miao et al., 2017] Chenglin Miao, Lu Su, Wenjun Jiang, Yaliang Li, and Miaomiao Tian. A lightweight privacy-preserving truth discovery framework for mobile crowd sensing systems. In Proceedings of INFOCOM, pages 1-9, 2017.

[Miao et al., 2019] Chenglin Miao, Wenjun Jiang, Lu Su, Yaliang Li, Suxin Guo, Zhan Qin, Houping Xiao, Jing Gao, and Kui Ren. Privacy-preserving truth discovery in crowd sensing systems. ACM Transactions on Sensor Networks (TOSN), 15(1):9, 2019.

[Nissim et al., 2007] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of STOC, pages 75-84, 2007.

[Pasternack and Roth, 2010] Jeff Pasternack and Dan Roth. Knowing what to believe (when you already know something). In Proceedings of COLING, pages 877-885, 2010.

[Roth and Roughgarden, 2010] Aaron Roth and Tim Roughgarden. Interactive privacy via the median mechanism. In Proceedings of STOC, pages 765-774, 2010.

[Shi and Wu, 2017] Xinghua Shi and Xintao Wu. An overview of human genetic privacy. Annals of the New York Academy of Sciences, 1387(1):61-72, 2017.

[Wan et al., 2016] Mengting Wan, Xiangyu Chen, Lance Kaplan, Jiawei Han, Jing Gao, and Bo Zhao. From truth discovery to trustworthy opinion discovery: An uncertainty-aware quantitative modeling approach. In Proceedings of SIGKDD, pages 1885-1894, 2016.

[Wang et al., 2018] Di Wang, Marco Gaboardi, and Jinhui Xu. Empirical risk minimization in non-interactive local differential privacy revisited. In Proceedings of NeurIPS, pages 965-974, 2018.

[Wang et al., 2019] Di Wang, Adam Smith, and Jinhui Xu. Noninteractive locally private learning of linear models via polynomial approximations. In Algorithmic Learning Theory, pages 897-902, 2019.