The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Federated Patient Hashing

Jie Xu,1 Zhenxing Xu,1 Peter Walker,2 Fei Wang1
1 Weill Cornell Medical College, Cornell University, USA
2 U.S. Department of Defense Joint Artificial Intelligence Center, USA
{jix4002, zhx2005, few2001}@med.cornell.edu, peter.b.walker.mil@mail.mil

Abstract

Privacy concerns about sharing sensitive data across institutions are particularly paramount in the medical domain, which hinders the research and development of many applications, such as cohort construction for cross-institution observational studies and disease surveillance. Moreover, the large volume and heterogeneity of patient data pose great challenges for retrieval and analysis. To address these challenges, in this paper we propose a Federated Patient Hashing (FPH) framework, which collaboratively trains a retrieval model stored in a shared memory while keeping all the patient-level information in local institutions. Specifically, the objective function is constructed from a similarity preserving loss and a heterogeneity digging loss, which together preserve both inter-data and intra-data relationships. Then, by leveraging the concept of Bregman divergence, we implement the optimization in a federated manner in both centralized and decentralized learning settings, without accessing the raw training data across institutions. In addition, we analyze the convergence rate of the FPH framework. Extensive experiments on a real-world clinical data set from critical care demonstrate the effectiveness of the proposed method on similar patient matching across institutions.

Introduction

Nowadays more and more complex and heterogeneous healthcare data are becoming readily available. This provides an unprecedented opportunity for digging out insights from those data with the development of machine learning algorithms. However, due to privacy and other considerations, data access is very restricted, which limits research and methodology development in many medical applications, such as cross-institution observational studies on cohort construction, clinical trial recruitment, and disease surveillance (Lee et al. 2018; Vatsalan and Christen 2016). Therefore, it is crucial to develop an effective way to analyze fragmented and heterogeneous patient data from multiple institutions. Hashing functions, due to their noninvertible nature, are attractive in this scenario because they map the sensitive data into compact binary codes. In contrast to data-independent hashing methods (Athitsos et al. 2008; Chi, Li, and Zhu 2013; Indyk and Motwani 1998), there has been a large body of research on learning hashing functions (e.g., Laplacian Co-Hashing (LCH) (Zhang et al. 2010), Collaborative Hashing (CH) (Liu et al. 2014), Semantic Correlation Maximization (SCM) (Zhang and Li 2014), Spectral Hashing (SH) (Weiss, Fergus, and Torralba 2012), Anchor Graph Hashing (AGH) (Liu, Wang, and Chang 2011), and the latest deep hashing methods (Li, Wang, and Kang 2015; Zhu et al. 2016)) from a given set of training data so that certain data characteristics (such as similarity) are preserved. Analytics with the learned binary hash codes leads to less memory consumption and short query time (Chi and Zhu 2017).
However, the training process of these hashing functions is usually conducted on the entire data set. Moreover, most methods focus on image or video retrieval, especially the deep hashing approaches. Hence, they may not be suitable for healthcare settings, where patient data are heterogeneous and usually scattered across multiple institutions (Denny et al. 2013; Newton et al. 2013). Efficient learning of the retrieval model across institutions without exchanging raw patient data is also challenging. Although some data-independent hashing algorithms, e.g., locality sensitive hashing (Indyk and Motwani 1998), can avoid accessing data, their computational complexity increases exponentially with the feature dimension and their retrieval accuracy is usually low compared with learning-to-hash approaches. The most intuitive solution is to collaboratively learn a shared model stored on a central server while keeping all the raw training data on local clients, a.k.a. federated learning (Konečný et al. 2016; Lee et al. 2018). Much research has recently been conducted on different aspects of federated learning. For example, Konečný et al. (2016) and Caldas et al. (2018) worked on reducing the communication cost between local clients and the server. Beyond that, aiming at the non-independent identically distributed (non-IID) and unbalanced properties of the optimization, McMahan et al. (2016) proposed a practical method for the federated learning of deep networks based on iterative model averaging. Smith et al. (2017) and Sahu et al. (2018) analyzed the convergence guarantees of federated optimization under some strong assumptions.

In this paper, we propose a Federated Patient Hashing (FPH) model, which is trained in a distributed manner without data sharing across different institutions. The goal is to improve the local retrieval models by leveraging a more expansive view of patients. We also provide a convergence analysis. The main contributions are summarized as follows. Firstly, we formulate the general federated patient hashing problem, equipped with a similarity preserving loss and a heterogeneity digging loss; the former preserves the similarity order, while the latter captures the potential relationship within the heterogeneous data. Secondly, by leveraging the Bregman divergence, we develop both centralized and decentralized learning strategies to optimize our model in a federated manner without accessing patient-level information across institutions, and we provide a convergence analysis under some assumptions. Finally, we provide a specific loss function including a triplet ranking loss and a multi-modal consistency loss, and conduct empirical evaluations to validate the proposed framework. Our results on patients with Acute Kidney Injury (AKI) in critical care demonstrate the effectiveness of the proposed method.

Problem Definition

Notations

Let boldface lowercase letters like $\mathbf{z} \in \mathbb{R}^d$ denote vectors with dimension $d$ and boldface uppercase letters like $\mathbf{Z} \in \mathbb{R}^{d \times c}$ denote matrices of size $d \times c$. The transpose of $\mathbf{Z}$ is denoted as $\mathbf{Z}^\top$, and real-valued scalars are denoted by uppercase letters like $Z \in \mathbb{R}$.

Problem Setting

Assume there are $Q$ institutions (e.g., hospitals), where the $q$-th institution contains the patient data $\mathbf{X}^{(q)} = [\mathbf{X}^{(q)}_1, \ldots, \mathbf{X}^{(q)}_M] \in \mathbb{R}^{d \times N_q}$ with $M$ modalities.
The so-called modality arises from the heterogeneity of patient data; e.g., the lab measures and the clinical notes in electronic health records (EHRs) are usually regarded as different modalities. $N_q$ is the number of patients in the $q$-th institution and $d = \sum_{m=1}^{M} d_m$ is the feature dimension. $\mathbf{x}^{(q)}_{im}$, the $i$-th column of $\mathbf{X}^{(q)}_m$, represents the $m$-th modality of the $i$-th patient in the $q$-th institution, and $N = \sum_{q=1}^{Q} N_q$.

The task is to learn a hashing function that maps a $d$-dimensional input $\mathbf{x} \in \mathbb{R}^d$ onto the $c$-dimensional Hamming space $\mathbf{h} \in \mathcal{H} \triangleq \{-1,+1\}^c$ through $\mathbf{h} = \operatorname{sign}(f(\mathbf{x};\mathbf{W}))$, where $\operatorname{sign}(\cdot)$ denotes the element-wise sign function and $f(\mathbf{x};\mathbf{W}) = \mathbf{W}^\top\mathbf{x} : \mathbb{R}^d \rightarrow \mathbb{R}^c$ is a real-valued transformation (assuming the data are zero-centered, we omit the bias term). During the training process, exchanging patient information across institutions is not allowed due to privacy policies. Meanwhile, the learnt mapping is stored in a shared memory which can be accessed by any institution.

Overall Objective

As discussed earlier, the main objective is to learn compact hash codes to represent heterogeneous data across institutions without exchanging patient-level information. Besides the federated learning manner, there are two other things to consider: first, the similarity relationship between patients' data should be preserved in the projected binary space; second, the heterogeneity of the patient data should be kept in view. In the following, we formulate the federated patient hashing problem in detail based on these two concepts. Let $\mathbf{H}^{(q)}$ be the matrix with columns $\mathbf{h}^{(q)}_i$, $i = 1, \ldots, N_q$. After we access all the patients in the $Q$ institutions, the cumulative loss suffered on all the patients can be formulated as:

$$\min_{\mathbf{W}\in\mathcal{W}}\ \frac{1}{Q}\sum_{q=1}^{Q}\Big[\mathcal{L}_s\big(\mathbf{X}^{(q)};\mathbf{H}^{(q)}\big) + \sum_{m,m'=1}^{M}\mathcal{L}_h\big(\mathbf{H}^{(q)}_m;\mathbf{H}^{(q)}_{m'}\big)\Big], \qquad (1)$$

where $\mathcal{L}_s$ is a general similarity preserving loss that preserves the similarity order, i.e., minimizes the gap between the approximate nearest neighbor search result obtained from the input patient data space and the one obtained from the projected binary space. $\mathcal{L}_h$ can be seen as a heterogeneity digging loss, which captures the potential relationship between different modalities of the same patient. Simple examples of these two kinds of losses are described next.

Similarity Preserving Loss $\mathcal{L}_s$. Suppose we denote the positive pairs by $(\mathbf{x},\mathbf{x}^+)\in\mathcal{P}$ and the negative pairs by $(\mathbf{x},\mathbf{x}^-)\in\mathcal{N}$, where $(\mathbf{x},\mathbf{x}^+)$ have the same label and $\mathbf{x}^+$ belongs to the $k$-neighborhood of $\mathbf{x}$, while $(\mathbf{x},\mathbf{x}^-)$ have different labels. A simple example of the similarity preserving loss is the pairwise similarity preserving loss:

$$\mathcal{L}_s = \mathbb{E}\{d_H(\mathbf{h},\mathbf{h}^+)\mid\mathcal{P}\} - \lambda\,\mathbb{E}\{d_H(\mathbf{h},\mathbf{h}^-)\mid\mathcal{N}\}, \qquad (2)$$

where $\lambda$ is a regularization parameter and $(\mathbf{h},\mathbf{h}^+,\mathbf{h}^-)$ are the binary embeddings of $(\mathbf{x},\mathbf{x}^+,\mathbf{x}^-)$. The Hamming metric $d_H(\mathbf{h},\mathbf{h}^+) = c - \frac{1}{2}\sum_{i=1}^{c}|h_i + h_i^+|$ computes the embedding dissimilarity of the sample pair $(\mathbf{x},\mathbf{x}^+)$. In practice, the conditional expectations $\mathbb{E}\{\cdot\mid\mathcal{P}\}$ and $\mathbb{E}\{\cdot\mid\mathcal{N}\}$ are replaced by averages over a training set of positive and negative embedding pairs, respectively. If we further denote the triplet constraints by $(\mathbf{x},\mathbf{x}^+,\mathbf{x}^-)\in\mathcal{R}$, the mapping function is constructed such that embeddings from the same class are forced to be close to each other and those from different classes are pushed far away.
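To make the hash mapping and the pairwise loss above concrete, the following minimal NumPy sketch computes codes $\mathbf{h} = \operatorname{sign}(\mathbf{W}^\top\mathbf{x})$, the Hamming metric, and an empirical version of Eq. (2). The function names, the column-wise data layout, and the use of Python/NumPy are illustrative assumptions of ours rather than the paper's implementation.

```python
import numpy as np

def hash_codes(X, W):
    """Binary codes h = sign(W^T x) for every column x of X; output shape (c, n)."""
    return np.sign(W.T @ X)

def hamming_dist(h1, h2):
    """d_H(h, h') = c - (1/2) * sum_i |h_i + h'_i| for codes in {-1, +1}^c."""
    c = h1.shape[0]
    return c - 0.5 * np.abs(h1 + h2).sum(axis=0)

def pairwise_similarity_loss(X, W, pos_pairs, neg_pairs, lam=1.0):
    """Empirical form of Eq. (2): average Hamming distance over positive pairs
    minus lam times the average Hamming distance over negative pairs."""
    H = hash_codes(X, W)
    pos = np.mean([hamming_dist(H[:, i], H[:, j]) for i, j in pos_pairs])
    neg = np.mean([hamming_dist(H[:, i], H[:, j]) for i, j in neg_pairs])
    return pos - lam * neg

# Toy usage: 5 patients with d = 6 features mapped to c = 4 bits.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 5))
W = rng.standard_normal((6, 4))
print(pairwise_similarity_loss(X, W, pos_pairs=[(0, 1)], neg_pairs=[(0, 3)]))
```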
Building on the triplet constraints, another widely used triplet loss can be derived (Weinberger, Blitzer, and Saul 2006), the triplet ranking loss:

$$\mathcal{L}_s = \mathbb{E}\big\{\big[d_H(\mathbf{h},\mathbf{h}^+) - d_H(\mathbf{h},\mathbf{h}^-) + \varepsilon\big]_+ \,\big|\, \mathcal{R}\big\}. \qquad (3)$$

More complicated non-convex problems occur in neural networks, where the network makes predictions through a non-convex function of the feature vector instead of the linear-in-the-features mapping $\mathbf{W}^\top\mathbf{x}$; however, the resulting loss can still be written as $\mathcal{L}_s$, with the gradients computed efficiently using back-propagation (Konečný 2017).

Heterogeneity Digging Loss $\mathcal{L}_h$. For an individual patient, the data usually come in multiple forms, e.g., structured data (including patient demographics, medications, lab results, etc.), clinical notes (recorded by physicians and other medical professionals), medical images, etc. A natural way is to view these multiple forms of data as multiple modalities. Ideally, for each patient, we want the same retrieval results from any form of data. This can be satisfied by making the different modalities of each sample share the same hash codes, which yields the following multi-modal consistency loss:

$$\sum_{m,m'=1}^{M}\mathcal{L}_h = 2\sum_{m=1}^{M-1}\sum_{m'=m+1}^{M}\mathbb{E}\{d_H(\mathbf{h}_m,\mathbf{h}_{m'})\mid\mathbf{x}\}, \qquad (4)$$

where $\mathbf{h}_m$ and $\mathbf{h}_{m'}$ are the hash codes of the $m$-th and $m'$-th modality of data $\mathbf{x}\in\mathbb{R}^d$, respectively. Similarly, we can also construct the pairwise or the triplet ranking loss with regard to different modalities of positive and negative pairs. Taking the triplet loss as an example, the triplet modality ranking loss is:

$$\mathcal{L}_h = \mathbb{E}\big\{\big[d_H(\mathbf{h}_m,\mathbf{h}^+_{m'}) - d_H(\mathbf{h}_m,\mathbf{h}^-_{m'}) + \varepsilon\big]_+ \,\big|\, \mathcal{R}\big\}. \qquad (5)$$

In this case, the triplet loss implies that different modalities of similar sample pairs should also be close to each other, and, accordingly, those from different classes are pushed far away. After constructing an appropriate model for the patient retrieval task and the heterogeneous data, another important issue is how to apply this model to $Q$ institutions in consideration of privacy concerns. We solve this problem in the next section.

Optimization

To avoid exchanging patient-level information across institutions during the training process, we propose to optimize (1) in a federated manner by two mechanisms, i.e., a centralized and a decentralized learning strategy, as shown in Fig. 1.

Figure 1: Two learning strategies. (a) Centralized. (b) Decentralized.

Centralized Learning Strategy

Algorithm Description. In the centralized learning strategy, we assume there is one server (i.e., a shared memory which can be accessed by any institution) besides the $Q$ institutions. First, denote the local accumulated loss of the $q$-th institution in Eq. (1) as:

$$\mathcal{L}\big(\mathbf{X}^{(q)};\mathbf{W}\big) = \mathcal{L}_s\big(\mathbf{X}^{(q)};\mathbf{H}^{(q)}\big) + \sum_{m,m'=1}^{M}\mathcal{L}_h\big(\mathbf{H}^{(q)}_m;\mathbf{H}^{(q)}_{m'}\big). \qquad (6)$$

We assume that $\mathcal{L}(\mathbf{X};\mathbf{W})$ is differentiable in $\mathbf{W}\in\mathcal{W}$ after appropriate relaxation, and that $\mathcal{W}$ is a closed convex set. We denote the gradient of $\mathcal{L}$ with respect to $\mathbf{W}$ as $\nabla\mathcal{L}(\mathbf{X};\mathbf{W})$. At iteration $t$, the server is connected with a set of institutions. After broadcasting the model $\mathbf{W}^t$ to the $Q$ institutions, the server waits until it receives the updated models $\mathbf{W}^{t+1}_q$:

$$\mathbf{W}^{t+1}_q = \arg\min_{\mathbf{W}\in\mathcal{W}}\big\{\langle\nabla\mathcal{L}(\mathbf{X}^{(q)};\mathbf{W}^t),\mathbf{W}\rangle + \beta D(\mathbf{W},\mathbf{W}^t)\big\}, \qquad (7)$$

where $q = 1, \ldots, Q$ and $D(\cdot,\cdot)$ is the Bregman divergence generated by a differentiable 1-strongly convex function $h:\mathcal{W}\rightarrow\mathbb{R}$ with $\min_{\mathbf{W}\in\mathcal{W}} h(\mathbf{W}) = 0$, where $\mathcal{W}\subseteq\{\mathbf{W}:\|\mathbf{W}\|\le B\}$ (Lan, Lu, and Monteiro 2011). Also, suppose $D^2 = \max_{\mathbf{U},\mathbf{V}\in\mathcal{W}} D(\mathbf{U},\mathbf{V})$. In this paper, we choose:

$$D(\mathbf{W},\mathbf{W}_q) = \frac{1}{2}\|\mathbf{W}-\mathbf{W}_q\|^2. \qquad (8)$$

Finally, we update the model $\mathbf{W}^t$ on the server via model averaging (Konečný et al. 2016):

$$\mathbf{W}^{t+1} = \frac{1}{Q}\sum_{q=1}^{Q}\mathbf{W}^{t+1}_q. \qquad (9)$$

Our centralized learning strategy can be viewed as a stochastic Mirror Descent (SMD) method in a federated environment, which we call Federated SMD (FSMD).
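With the squared-norm divergence in Eq. (8), the local subproblem (7) has a closed-form solution: a gradient step of size $1/\beta$ followed by a projection back onto $\{\mathbf{W}:\|\mathbf{W}\|\le B\}$. The sketch below is our own illustration of one round of the centralized strategy; `local_grads` stands for the per-institution gradients $\nabla\mathcal{L}(\mathbf{X}^{(q)};\mathbf{W}^t)$, which each site would compute privately on its own data.

```python
import numpy as np

def bregman_step(W_t, grad_q, beta, B=None):
    """Closed form of Eq. (7) with D(W, W') = 0.5 * ||W - W'||^2:
    argmin_W <grad_q, W> + beta * D(W, W_t)  =>  W = W_t - grad_q / beta,
    optionally projected back onto the ball {W : ||W|| <= B}."""
    W_new = W_t - grad_q / beta
    if B is not None:
        norm = np.linalg.norm(W_new)
        if norm > B:
            W_new = W_new * (B / norm)
    return W_new

def centralized_round(W_t, local_grads, beta, B=None):
    """One outer iteration of Algorithm 1: every institution takes a Bregman step
    from the broadcast model using only its local gradient, and the server
    averages the returned models as in Eq. (9)."""
    updates = [bregman_step(W_t, g_q, beta, B) for g_q in local_grads]
    return np.mean(updates, axis=0)
```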
The specific procedures are summarized in Alg. 1.

Algorithm 1 Centralized Learning Strategy to Minimize Eq. (1)
1: Input: $\mathcal{S}$, $\mathcal{R}$, $\mathbf{X} = [\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \ldots, \mathbf{X}^{(Q)}] \in \mathbb{R}^{d\times N}$
2: Output: $\mathbf{W}\in\mathbb{R}^{d\times c}$
3: for $t = 0, 1, 2, \ldots$ do
4:   for each institution $q$ in parallel do
5:     Pull $\mathbf{W}^t$ from the master;
6:     $\mathbf{W}^{t+1}_q = \arg\min_{\mathbf{W}\in\mathcal{W}}\{\langle\nabla\mathcal{L}(\mathbf{X}^{(q)};\mathbf{W}^t),\mathbf{W}\rangle + \beta D(\mathbf{W},\mathbf{W}^t)\}$;
7:     Push $\mathbf{W}^{t+1}_q$ to the master;
8:   end for
9:   Update $\mathbf{W}^{t+1} = \frac{1}{Q}\sum_{q=1}^{Q}\mathbf{W}^{t+1}_q$.
10: end for

Convergence Analysis. In order to analyze the convergence of the centralized learning strategy, we adopt the following commonly used assumptions.

Assumption 1. (Bounded Second Moment)
$$\|\nabla\mathcal{L}(\mathbf{X}^{(q)};\mathbf{W})\|^2 \le G^2, \quad \forall\,\mathbf{W}, q. \qquad (10)$$

Assumption 2. (Bounded Gradient Variance)
$$\|\nabla\mathcal{L}(\mathbf{X}^{(q)};\mathbf{W}) - \nabla\mathcal{L}(\mathbf{X};\mathbf{W})\|^2 \le \sigma^2, \quad \forall\,\mathbf{W}, q. \qquad (11)$$

Before introducing the main convergence result in Theorem 1, we first introduce a lemma that describes the behavior of Algorithm 1 at each iteration in the $q$-th institution.

Lemma 1. Under Assumptions 1 and 2, suppose that $\mathbf{W}^* = \arg\min_{\mathbf{W}\in\mathcal{W}}\mathcal{L}(\mathbf{X};\mathbf{W})$. Then Algorithm 1, run with convex $\mathcal{L}(\mathbf{X}^{(q)};\mathbf{W})$ and step size $\eta_s = \frac{\log S_q}{2\mu S_q}$ on the $q$-th institution, satisfies the following inequality:

$$\mathbb{E}\big\|\mathbf{W}^{S_q}_q - \mathbf{W}^*\big\|^2 \le \frac{\|\mathbf{W}^0_q - \mathbf{W}^*\|^2}{S_q} + \frac{(G^2 + 2\beta^2 D^2)\log S_q}{S_q}, \qquad (12)$$

where $S_q$ is the total number of iterations used to optimize (7) in the $q$-th institution.

It is easy to see that the right-hand side of inequality (12) goes to zero as $S_q\rightarrow\infty$. That is, Algorithm 1 run on the $q$-th institution (i.e., the inner for loop) is guaranteed to converge to a local minimum at a rate of $\mathcal{O}\big(\frac{\log S_q}{S_q}\big)$. Next, we derive the global convergence of Algorithm 1.

Theorem 1. Under Assumptions 1 and 2, suppose that $\mathbf{W}^* = \arg\min_{\mathbf{W}\in\mathcal{W}}\mathcal{L}(\mathbf{X};\mathbf{W})$. Algorithm 1 has the following convergence rate:

$$\mathbb{E}\big\|\mathbf{W}^{T} - \mathbf{W}^*\big\|^2 \le \frac{\|\mathbf{W}^0 - \mathbf{W}^*\|^2}{S^T} + \frac{(G^2 + 2\beta^2 D^2)\log S}{S}, \qquad (13)$$

where $S = \min\{S_1, S_2, \ldots, S_Q\}$.

According to Theorem 1, the outer for loop of Algorithm 1 converges exponentially to a neighborhood of the local minimum. It is worth noting that our centralized learning strategy requires no communication assumptions, so it is well suited to the proposed patient hashing scenario. However, patient data are continuously accumulated, so it is natural to update the model using new patient data. To this end, in the next section we introduce a decentralized learning strategy to update our model, which makes the most of newly arriving data.

Decentralized Learning Strategy

Algorithm Description. In our decentralized learning strategy, we assume the algorithm sequentially goes over the $Q$ institutions and, before processing the $q$-th institution, produces an iterate $\mathbf{W}_q\in\mathcal{W}$. We assume that $\mathcal{L}(\mathbf{X};\mathbf{W})$ is $L$-smooth for any realization of $\mathbf{W}$, namely that $\mathcal{L}(\cdot;\mathbf{W})$ is differentiable and that:

$$\|\nabla\mathcal{L}(\mathbf{W}_1) - \nabla\mathcal{L}(\mathbf{W}_2)\| \le L\,\|\mathbf{W}_1 - \mathbf{W}_2\|. \qquad (14)$$

Then, we solve the following problem to obtain the updated model $\mathbf{W}_{q+1}$:

$$\mathbf{W}_{q+1} = \arg\min_{\mathbf{W}\in\mathcal{W}}\big\{\langle\nabla\mathcal{L}(\mathbf{X}^{(q)};\mathbf{W}_q),\mathbf{W}\rangle + (L+\beta_{q+1})D(\mathbf{W},\mathbf{W}_q)\big\}. \qquad (15)$$

The specific procedures are summarized in Alg. 2.

Algorithm 2 Decentralized Learning Strategy to Minimize Eq. (1)
1: Input: $\mathcal{S}$, $\mathcal{R}$, $\mathbf{X} = [\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \ldots, \mathbf{X}^{(Q)}] \in \mathbb{R}^{d\times N}$
2: Output: $\mathbf{W}\in\mathbb{R}^{d\times c}$
3: for $q = 1, 2, \ldots, Q$ do
4:   $\mathbf{W}_{q+1} = \arg\min_{\mathbf{W}\in\mathcal{W}}\{\langle\nabla\mathcal{L}(\mathbf{X}^{(q)};\mathbf{W}_q),\mathbf{W}\rangle + (L+\beta_{q+1})D(\mathbf{W},\mathbf{W}_q)\}$.
5: end for
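As a rough sketch of Algorithm 2 under the same squared-norm choice of $D$, one sequential pass over the institutions reduces each update (15) to a gradient step of size $1/(L+\beta_{q+1})$ followed by a projection. The names `grad_fn`, `institutions`, and `betas` are placeholders introduced here for illustration only.

```python
import numpy as np

def decentralized_pass(W0, institutions, grad_fn, L_smooth, betas, B=None):
    """One full pass of Algorithm 2: visit the Q institutions in sequence and apply
    Eq. (15) with D(W, W') = 0.5 * ||W - W'||^2, i.e. a step along the local
    gradient of size 1/(L + beta_{q+1}); no raw data leaves institution q."""
    W = W0
    for q, data_q in enumerate(institutions):
        g_q = grad_fn(data_q, W)                  # gradient computed locally at institution q
        W = W - g_q / (L_smooth + betas[q + 1])   # closed form of Eq. (15); betas is nondecreasing
        if B is not None:                         # keep the iterate inside {W : ||W|| <= B}
            norm = np.linalg.norm(W)
            if norm > B:
                W = W * (B / norm)
    return W
```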
Convergence Analysis. Each iteration of Algorithm 2 clearly has the same convergence rate as the inner for loop of Algorithm 1 as long as $\mathcal{L}(\mathbf{X};\mathbf{W})$ satisfies the same assumptions. To introduce the main convergence result in Theorem 2, we provide another lemma that describes the behavior of Algorithm 2 at each iteration in more detail.

Lemma 2. Under Assumptions 1 and 2, suppose that $\mathbf{W}^* = \arg\min_{\mathbf{W}\in\mathcal{W}}\mathcal{L}(\mathbf{X};\mathbf{W})$, and that Algorithm 2 runs with a nondecreasing step size $\beta_{q+1}\ge\beta_q$. The iterates of Algorithm 2 satisfy the following inequality for all $q$:

$$\mathcal{L}(\mathbf{W}_{q+1}) \le \mathcal{L}(\mathbf{W}^*) + (L+\beta_q)D(\mathbf{W}^*,\mathbf{W}_q) - (L+\beta_{q+1})D(\mathbf{W}^*,\mathbf{W}_{q+1}) + \frac{\|\mathbf{g}_q\|^2}{2\beta_q} + (\beta_{q+1}-\beta_q)D^2 + \langle\mathbf{g}_q,\mathbf{W}^*-\mathbf{W}_q\rangle, \qquad (16)$$

where $\mathbf{g}_q = \nabla\mathcal{L}(\mathbf{X}^{(q)};\mathbf{W}_q) - \nabla\mathcal{L}(\mathbf{X};\mathbf{W}_q)$.

With Lemma 2, we can further derive the convergence rate of Algorithm 2.

Theorem 2. Under Assumptions 1 and 2, suppose that each $\mathbf{W}_q$ is chosen from a fixed domain $\mathcal{W}\subseteq\{\mathbf{W}:\|\mathbf{W}\|\le B\}$. Algorithm 2 has the following convergence rate (Shamir 2016):

$$\frac{1}{Q}\sum_{q=1}^{Q}\mathbb{E}\big[\mathcal{L}(\mathbf{X};\mathbf{W}_q) - \mathcal{L}(\mathbf{X};\mathbf{W}^*)\big] \le \frac{2\sigma D}{\sqrt{Q}} + \mathcal{O}\Big(\frac{1}{\sqrt{Q}}\Big). \qquad (17)$$

According to Theorem 2 and the definitions of $N$ and $Q$, the right-hand side of inequality (17) vanishes as $Q\rightarrow\infty$. In practice, however, the number of institutions is usually limited; we can then sequentially go over the $Q$ institutions for more than one pass to update model (1).

Experiments

In this section, we evaluate the proposed model on a real-world clinical data set from critical care.

Table 1: MAP (%) of different hashing methods using 16-64 bits on the AKI case task.

| # of Bits | LCH | CH | CCA | SCM | MLPcen | FPHde | FPHcen |
|---|---|---|---|---|---|---|---|
| 16 | 46.62 ± 0.17 | 47.74 ± 0.24 | 49.20 ± 0.21 | 53.52 ± 0.69 | 53.09 ± 3.96 | 53.30 ± 0.55 | 53.74 ± 1.57 |
| 32 | 47.28 ± 0.49 | 48.96 ± 0.05 | 49.60 ± 0.12 | 54.29 ± 0.36 | 54.46 ± 2.77 | 53.99 ± 0.90 | 54.40 ± 1.34 |
| 48 | 48.21 ± 0.39 | 49.39 ± 0.14 | 49.86 ± 0.11 | 53.94 ± 0.49 | 53.95 ± 2.29 | 53.91 ± 1.05 | 53.48 ± 0.52 |
| 64 | 48.71 ± 0.27 | 49.74 ± 0.05 | 50.07 ± 0.11 | 53.67 ± 0.50 | 54.58 ± 2.16 | 53.91 ± 0.97 | 53.26 ± 1.10 |

Table 2: MAP (%) of different hashing methods using 16-64 bits on the AKI stage task.

| # of Bits | LCH | CH | CCA | SCM | MLPcen | FPHde | FPHcen |
|---|---|---|---|---|---|---|---|
| 16 | 42.43 ± 0.61 | 44.90 ± 0.16 | 43.71 ± 0.19 | 45.37 ± 0.22 | 47.49 ± 3.42 | 47.39 ± 0.61 | 48.30 ± 0.90 |
| 32 | 43.00 ± 0.67 | 46.27 ± 0.18 | 44.16 ± 0.16 | 45.31 ± 0.16 | 47.92 ± 1.66 | 47.93 ± 0.30 | 47.80 ± 1.04 |
| 48 | 43.44 ± 0.74 | 46.68 ± 0.14 | 44.62 ± 0.12 | 45.31 ± 0.13 | 48.25 ± 1.25 | 48.35 ± 0.65 | 48.66 ± 0.51 |
| 64 | 44.06 ± 0.59 | 47.10 ± 0.13 | 44.94 ± 0.07 | 45.27 ± 0.11 | 49.53 ± 0.60 | 48.55 ± 0.85 | 48.72 ± 0.63 |

Datasets

We evaluate the proposed model on EHR data acquired from the Medical Information Mart for Intensive Care III (MIMIC-III) (Johnson et al. 2016). MIMIC-III is a freely and publicly available database which encompasses a diverse and very large population of Intensive Care Unit (ICU) patients. Our experiments are designed to perform similar patient matching with regard to the onset of AKI and the corresponding AKI stages in terms of severity during an ICU admission. As in (Li et al. 2018), a total of 16,558 ICU stays of 14,469 patients with structured data and electronic documentation are included in this study.

Evaluation Setup

We compare the proposed Federated Patient Hashing (FPH) model with the following methods: Canonical Correlation Analysis (CCA) (Hotelling 1992), Laplacian Co-Hashing (LCH) (Zhang et al. 2010), Collaborative Hashing (CH) (Liu et al. 2014), Semantic Correlation Maximization (SCM) (Zhang and Li 2014), and MLP (Zhang, Lai, and Feng 2018). CCA and SCM are two multi-modal hashing methods: CCA maps two views into a common latent space and SCM seamlessly integrates semantic labels into the hashing learning procedure. LCH simultaneously hashes both terms and documents according to their semantic similarities, and CH fully explores the duality between two views. MLP is widely used in deep hashing methods to process text data. For convenience, we denote our FPH method trained by the centralized and the decentralized strategy as FPHcen and FPHde, respectively. Next, we describe in detail the specific loss functions adopted in the experiments.
Similarity Preserving Loss $\mathcal{L}_s$. We choose the triplet ranking loss (3) with the addition of a positive-pair similarity preserving loss. Direct minimization of $\mathcal{L}_s$ is difficult, since the term $\mathbf{h}$ involves a non-differentiable sign function, which is also inconsistent with our convergence analysis. Thus, we relax the problem by directly dropping the sign function, $\operatorname{sign}(\mathbf{z})\approx\mathbf{z}$. Letting $\mathbf{D}_{\mathbf{x},\mathbf{y}} = (\mathbf{x}-\mathbf{y})(\mathbf{x}-\mathbf{y})^\top$, we then have $d_H(\mathbf{h},\mathbf{h}^+) \approx \|\mathbf{W}^\top\mathbf{x} - \mathbf{W}^\top\mathbf{x}^+\|^2 = \operatorname{Tr}(\mathbf{W}^\top\mathbf{D}_{\mathbf{x},\mathbf{x}^+}\mathbf{W})$. Therefore, the similarity preserving loss $\mathcal{L}_s$ for the $q$-th institution can be formulated as:

$$\mathcal{L}_s\big(\mathbf{X}^{(q)};\mathbf{W}\big) = \lambda_1\,\mathbb{E}\big\{\operatorname{Tr}\big(\mathbf{W}^\top\mathbf{D}_{\mathbf{x},\mathbf{x}^+}\mathbf{W}\big)\,\big|\,\mathcal{S}^{(q)}\big\} + \lambda_2\,\mathbb{E}\big\{\big[\operatorname{Tr}\big(\mathbf{W}^\top(\mathbf{D}_{\mathbf{x},\mathbf{x}^+}-\mathbf{D}_{\mathbf{x},\mathbf{x}^-})\mathbf{W}\big) + \varepsilon\big]_+\,\big|\,\mathcal{R}^{(q)}\big\}, \qquad (18)$$

where $\mathcal{S}^{(q)}$ and $\mathcal{R}^{(q)}$ are the similar pairs and triplet constraints constructed from the patient data in the $q$-th institution, respectively. In the experiments, we choose one neighbor for each sample, i.e., $k = 1$. It is worth mentioning that patient data usually suffer from a serious imbalance problem; for example, in the task of predicting the onset of a disease, the number of controls is far larger than the number of cases. In this case, we can reduce the imbalance problem to a certain extent by choosing more neighbors for the classes with fewer samples.

Heterogeneity Digging Loss $\mathcal{L}_h$. We choose the multi-modal consistency loss (4) in the experiments. Similarly, we relax the loss by dropping the sign function; then, for any two modalities, we need to minimize $\|\mathbf{W}_i^\top\mathbf{X}_i - \mathbf{W}_j^\top\mathbf{X}_j\|_F^2$, $i, j = 1, \ldots, M$. Apparently, this leads to $M$ dependent problems with $M$ parameters $\mathbf{W}_i\in\mathbb{R}^{d_i\times c}$, $i = 1, \ldots, M$, which are tedious to solve. For convenience, we introduce the following two concatenated matrices:

$$\mathbf{W} = [\mathbf{W}_1^\top,\mathbf{W}_2^\top,\ldots,\mathbf{W}_M^\top]^\top \in \mathbb{R}^{(d_1+\cdots+d_M)\times c}, \quad \widetilde{\mathbf{X}}_m = [\mathbf{0},\ldots,\mathbf{X}_m^\top,\ldots,\mathbf{0}]^\top \in \mathbb{R}^{(d_1+\cdots+d_M)\times N}. \qquad (19)$$

It is clear that $\mathbf{W}_i^\top\mathbf{X}_i = \mathbf{W}^\top\widetilde{\mathbf{X}}_i$. Let $\mathbf{Z}_{ij} = [\mathbf{0},\ldots,\mathbf{0},\mathbf{I}_i,\mathbf{0},\ldots,\mathbf{0},-\mathbf{I}_j,\mathbf{0},\ldots,\mathbf{0}]^\top$ and let $\bar{\mathbf{X}} = [\widetilde{\mathbf{X}}_1,\ldots,\widetilde{\mathbf{X}}_M] = \operatorname{diag}(\mathbf{X}_1,\ldots,\mathbf{X}_M)$ be a block-diagonal matrix, so that $\bar{\mathbf{X}}\mathbf{Z}_{ij} = \widetilde{\mathbf{X}}_i - \widetilde{\mathbf{X}}_j$. Thus, we have the following heterogeneity digging loss for the $q$-th institution:

$$\sum_{m,m'=1}^{M}\mathcal{L}_h = 2\sum_{m=1}^{M-1}\sum_{m'=m+1}^{M}\mathbb{E}\,\big\|\mathbf{W}^\top\bar{\mathbf{X}}^{(q)}\mathbf{Z}_{mm'}\big\|_F^2. \qquad (20)$$

Eq. (20) has only one parameter $\mathbf{W}$ to be solved. Such processing is mainly for the convenience of analysis; if space complexity becomes a problem in practice, this loss can still be treated as multiple sub-problems.

Figure 2: Precision curves on similar patient matching with regard to the onset of AKI (AKI case). Panels: FPH variants trained on subsets (de5, cen5, de10, cen10, de20, cen20, de) at 64 bits; method comparison (LCH, CH, CCA, SCM, MLP, FPHde, FPHcen) at 32 and 64 bits; x-axis: number of retrieved samples.

Figure 3: Precision curves on similar patient matching with regard to AKI stages in terms of severity (AKI stage). Panels: FPH variants trained on subsets at 64 bits; method comparison at 32 and 64 bits; x-axis: number of retrieved samples.

The specific subgradient of $\mathcal{L}(\mathbf{X}^{(q)};\mathbf{W})$ with respect to $\mathbf{W}$ used in the experiments for the two learning strategies is:

$$\nabla\mathcal{L}\big(\mathbf{X}^{(q)};\mathbf{W}\big) = \alpha\sum_{i=1}^{M-1}\sum_{j=i+1}^{M}\mathbb{E}\big\{\bar{\mathbf{X}}^{(q)}\mathbf{Z}_{ij}\mathbf{Z}_{ij}^\top\bar{\mathbf{X}}^{(q)\top}\mathbf{W}\big\} + \lambda_1\,\mathbb{E}\big\{\mathbf{D}_{\mathbf{x},\mathbf{x}^+}\mathbf{W}\,\big|\,\mathcal{S}^{(q)}\big\} + \lambda_2\,\mathbb{E}\big\{(\mathbf{D}_{\mathbf{x},\mathbf{x}^+}-\mathbf{D}_{\mathbf{x},\mathbf{x}^-})\mathbf{W}\,\big|\,\mathcal{R}^{(q)}_+\big\}, \qquad (21)$$

where $\mathcal{R}^{(q)}_+$ denotes the subset of constraints in $\mathcal{R}^{(q)}$ for which the hinge term in (18) is larger than 0.
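For concreteness, the sketch below assembles a relaxed subgradient in the spirit of Eq. (21) for the two-modality case used in the experiments (structured data and notes stacked feature-wise). It is only an illustration under our reading of Eqs. (18)-(21): the function name, the explicit two-block layout, and the per-constraint averaging are our own simplifying assumptions.

```python
import numpy as np

def relaxed_subgradient(X1, X2, W, sim_pairs, triplets, lam1, lam2, alpha, eps):
    """Relaxed per-institution subgradient for M = 2 modalities.
    X1: (d1, n) structured features, X2: (d2, n) note features, W: (d1 + d2, c).
    sim_pairs: (i, j) similar-pair indices; triplets: (anchor, positive, negative)."""
    X = np.vstack([X1, X2])                    # concatenated features, one column per patient
    grad = np.zeros_like(W)

    # Multi-modal consistency term: gradient of ||W1^T X1 - W2^T X2||_F^2 w.r.t. the stacked W.
    Xt1 = np.vstack([X1, np.zeros_like(X2)])   # X~_1: only the first modality block is nonzero
    Xt2 = np.vstack([np.zeros_like(X1), X2])   # X~_2: only the second modality block is nonzero
    diff_mod = Xt1 - Xt2
    grad += alpha * 2.0 * diff_mod @ (diff_mod.T @ W)

    # Positive-pair term: D_{x,x+} W averaged over the similar pairs.
    for i, j in sim_pairs:
        d = (X[:, i] - X[:, j])[:, None]
        grad += lam1 * d @ (d.T @ W) / len(sim_pairs)

    # Triplet term: only constraints whose hinge Tr(W^T (D+ - D-) W) + eps is active.
    for i, j, k in triplets:
        dp = (X[:, i] - X[:, j])[:, None]
        dn = (X[:, i] - X[:, k])[:, None]
        hinge = np.trace(W.T @ (dp @ (dp.T @ W) - dn @ (dn.T @ W)))
        if hinge + eps > 0:
            grad += lam2 * (dp @ (dp.T @ W) - dn @ (dn.T @ W)) / len(triplets)
    return grad
```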
Patient Matching: Onset of AKI & Corresponding Stage

Among the 16,558 ICU stays, 2,785 were case patients who were diagnosed with AKI within a 7-day window after a 24-hour observation period; the remaining 13,773 stays were controls (Li et al. 2018). In accordance with the AKI staging criteria from KDIGO (Kellum et al. 2012), the 2,785 cases were further divided into three stages, i.e., 1,499 stays in stage I, 720 stays in stage II, and 566 stays in stage III. In our setting, we treat structured data (e.g., lab results, chart events, etc.) and unstructured data (i.e., clinical notes) as two modalities. The preprocessing details are similar to (Xu et al. 2019). For the unstructured data, the clinical notes are first represented as bag-of-words, and latent semantic analysis (LSA) is then applied to find the underlying meaning.

For all hashing function learners, the dataset is split into 5 folds based on sample proportion, where 4 folds are used for training and 1 fold for testing. We repeat this process five times and report the mean of the Mean Average Precision (MAP) (Buckley and Voorhees 2000) and its standard deviation as the final performance, as shown in Table 1 and Table 2. We abbreviate the two tasks, i.e., similar patient matching on the onset of AKI and on the corresponding stage, as "AKI case" and "AKI stage". In the experiments, we set $\lambda_1 = \lambda_2 = 0.5$; the other regularization parameters are tuned over the range $\{10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^2, 10^3\}$. In addition, to simulate the cross-institution scenario for our algorithm, we randomly divide the training data into 20 subsets and forbid communication across subsets during the training process. For the decentralized learning strategy, we access the full set of training subsets in only one full pass.

From Tables 1 and 2, we can see that our model obtains good results even though it uses less information (i.e., it avoids exchanging patient-level information across institutions) than the other compared methods. We also compare precision performance using 32 and 64 bits on the two tasks in the left two subfigures of Figs. 2 and 3. The precision curves are plotted based on the first 300 retrieved patients. Our method achieves the best results with both the centralized and the decentralized learning strategy.

Figure 4: Objective function value vs. number of iterations for the decentralized and centralized strategies on the Case vs. Control and Stage I, II, III tasks. (a) The number of samples in each subset is the same. (b) The number of samples in each subset is different.

Figure 5: Precision curves on the AKI stage task (32, 48, and 64 bits) after accessing the data for different numbers of full passes with the decentralized learning strategy.

For the right subfigures in Figs. 2 and 3, we choose a part of the training subsets and optimize the model to convergence (i.e., we access the subsets for several passes). For example, "de5" means we choose 5 of the 20 subsets and train the model to convergence using the decentralized learning strategy. The dashed red line "de" represents accessing the data for one full pass (i.e., the FPHde case reported in Tables 1 and 2). From the figure, we can see that our method obtains promising results even with a small part of the data. Also, if we could access the data multiple times and optimize the model to convergence using the decentralized strategy, the results would be even better.
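A minimal sketch of the evaluation protocol described above, namely the random 20-subset institutional split and MAP computed from Hamming-ranked retrieval, might look as follows (the function names, the random seed, and the use of label agreement as the relevance criterion are our assumptions):

```python
import numpy as np

def simulate_institutions(n_train, n_subsets=20, seed=0):
    """Randomly partition training indices into disjoint 'institutions' so that
    no raw rows are shared across subsets during training."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_train), n_subsets)

def mean_average_precision(H_query, H_db, y_query, y_db):
    """MAP for Hamming-ranking retrieval: rank database codes (columns of H_db, in
    {-1,+1}^c) by Hamming distance to each query code and average the precision at
    every rank where a same-label item appears."""
    c = H_db.shape[0]
    aps = []
    for i in range(H_query.shape[1]):
        dist = 0.5 * (c - H_db.T @ H_query[:, i])        # Hamming distances to all database codes
        order = np.argsort(dist, kind="stable")
        rel = (y_db[order] == y_query[i]).astype(float)  # 1 where the retrieved label matches
        if rel.sum() == 0:
            continue
        cum_hits = np.cumsum(rel)
        prec_at_hits = cum_hits[rel == 1] / (np.flatnonzero(rel) + 1.0)
        aps.append(prec_at_hits.mean())
    return float(np.mean(aps))
```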
Further Analysis

Convergence of the Learning Strategies. As described before, we adopt two mechanisms, i.e., the centralized and the decentralized learning strategy, to optimize our algorithm while accounting for the privacy of the sensitive data, and we theoretically prove their convergence. To further illustrate the convergence of our learning strategies intuitively, we show how the objective values change as the number of iterations increases in Fig. 4. When we randomly divide the training data into 20 subsets to simulate the cross-institution scenario, we consider two settings, i.e., the number of samples in each subset is either the same or different. The objective function values versus the number of iterations for the two settings are shown in Fig. 4. From the figure, we can see that whether the number of samples in each subset is the same or not, the objective values of the proposed two learning strategies decline steadily. This property makes our model more applicable in practice.

Feasibility of Model Updating with New Samples. To test the feasibility of updating the model with new samples by the decentralized learning strategy, we plot the precision curves using 32, 48, and 64 bits on the AKI stage task in Fig. 5. The different colored lines in each subfigure represent the number of full passes over the training data subsets when using the decentralized learning strategy. If we could optimize the model to convergence, better results could be obtained; but the previous results have already shown that the proposed algorithm obtains good results even if we access the data subsets for only one full pass. Recalling the convergence analysis in the previous section, the convergence rate of the decentralized learning strategy we proved is related to the number of institutions, so it is natural to conclude that, as the number of institutions increases, the model becomes better and closer to the optimum. In addition, to intuitively show the gap produced by insufficient iterations, consider Fig. 4(a) as an example. Since we divide the training data into 20 subsets, the objective value at the 20th iteration corresponds to accessing the data for one full pass. Apparently, the objective declines the most during the first full pass. Therefore, if multiple accesses to the data are not allowed in practice, we can sacrifice some of the performance obtained by sufficient iterations and use the model updated by accessing the data for only one full pass.

Conclusion

In this paper, we focus on designing a federated patient hashing framework, which queries the potentially distributed, heterogeneous patient data scattered across multiple institutions without exchanging patient-level information. Our framework includes a similarity preserving loss and a heterogeneity digging loss in order to preserve both intra-data and inter-data relationships. Meanwhile, two learning mechanisms, a decentralized and a centralized strategy, are adopted to optimize the proposed model in a federated manner with proofs of convergence. Finally, extensive experiments on a real-world clinical data set with a simulated federated environment demonstrate the effectiveness and convergence of the proposed method.

Acknowledgments

This work is supported by NSF IIS 1716432, 1750326, ONR N00014-18-1-2585, and an Amazon Web Services (AWS) Machine Learning for Research Award.

References
Athitsos, V.; Alon, J.; Sclaroff, S.; and Kollios, G. 2008. BoostMap: An embedding method for efficient nearest neighbor retrieval. IEEE TPAMI 30(1):89-104.
Buckley, C., and Voorhees, E. M. 2000. Evaluating evaluation measure stability. In SIGIR, 33-40. ACM.
Caldas, S.; Konečný, J.; McMahan, H. B.; and Talwalkar, A. 2018. Expanding the reach of federated learning by reducing client resource requirements. arXiv preprint arXiv:1812.07210.
Chi, L., and Zhu, X. 2017. Hashing techniques: A survey and taxonomy. ACM Computing Surveys (CSUR) 50(1):11.
Chi, L.; Li, B.; and Zhu, X. 2013. Fast graph stream classification using discriminative clique hashing. In PAKDD, 225-236. Springer.
Denny, J. C.; Bastarache, L.; Ritchie, M. D.; Carroll, R. J.; Zink, R.; Mosley, J. D.; Field, J. R.; Pulley, J. M.; Ramirez, A. H.; Bowton, E.; et al. 2013. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nature Biotechnology 31(12):1102.
Hotelling, H. 1992. Relations between two sets of variates. In Breakthroughs in Statistics. Springer. 162-190.
Indyk, P., and Motwani, R. 1998. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, 604-613. ACM.
Johnson, A. E.; Pollard, T. J.; Shen, L.; Li-wei, H. L.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L. A.; and Mark, R. G. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data 3:160035.
Kellum, J. A.; Lameire, N.; Aspelin, P.; Barsoum, R. S.; Burdmann, E. A.; Goldstein, S. L.; Herzog, C. A.; Joannidis, M.; Kribben, A.; Levey, A. S.; et al. 2012. Kidney Disease: Improving Global Outcomes (KDIGO) Acute Kidney Injury Work Group. KDIGO clinical practice guideline for acute kidney injury. Kidney International Supplements 2(1):1-138.
Konečný, J.; McMahan, H. B.; Yu, F. X.; Richtárik, P.; Suresh, A. T.; and Bacon, D. 2016. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.
Konečný, J. 2017. Stochastic, distributed and federated optimization for machine learning. arXiv preprint arXiv:1707.01155.
Lan, G.; Lu, Z.; and Monteiro, R. D. 2011. Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming. Mathematical Programming 126(1):1-29.
Lee, J.; Sun, J.; Wang, F.; Wang, S.; Jun, C.-H.; and Jiang, X. 2018. Privacy-preserving patient similarity learning in a federated environment: Development and analysis. JMIR Medical Informatics 6(2):e20.
Li, Y.; Yao, L.; Mao, C.; Srivastava, A.; Jiang, X.; and Luo, Y. 2018. Early prediction of acute kidney injury in critical care setting using clinical notes. In BIBM, 683-686. IEEE.
Li, W.-J.; Wang, S.; and Kang, W.-C. 2015. Feature learning based deep supervised hashing with pairwise labels. arXiv preprint arXiv:1511.03855.
Liu, X.; He, J.; Deng, C.; and Lang, B. 2014. Collaborative hashing. In CVPR, 2139-2146.
Liu, W.; Wang, J.; and Chang, S.-F. 2011. Hashing with graphs. In ICML.
McMahan, H. B.; Moore, E.; Ramage, D.; Hampson, S.; et al. 2016. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.
Newton, K. M.; Peissig, P. L.; Kho, A. N.; Bielinski, S. J.; Berg, R. L.; Choudhary, V.; Basford, M.; Chute, C. G.; Kullo, I. J.; Li, R.; et al. 2013. Validation of electronic medical record-based phenotyping algorithms: Results and lessons learned from the eMERGE network. JAMIA 20(e1):e147-e154.
Sahu, A. K.; Li, T.; Sanjabi, M.; Zaheer, M.; Talwalkar, A.; and Smith, V. 2018. On the convergence of federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127.
Shamir, O. 2016. Without-replacement sampling for stochastic gradient methods: Convergence results and application to distributed optimization. arXiv preprint arXiv:1603.00570.
Smith, V.; Chiang, C.-K.; Sanjabi, M.; and Talwalkar, A. S. 2017. Federated multi-task learning. In NeurIPS, 4424-4434.
Vatsalan, D., and Christen, P. 2016. Privacy-preserving matching of similar patients. JBI 59:285-298.
Weinberger, K. Q.; Blitzer, J.; and Saul, L. K. 2006. Distance metric learning for large margin nearest neighbor classification. In NeurIPS, 1473-1480.
Weiss, Y.; Fergus, R.; and Torralba, A. 2012. Multidimensional spectral hashing. In ECCV, 340-353. Springer.
Xu, Z.; Chou, J.; Zhang, X. S.; Luo, Y.; Isakova, T.; Adekkanattu, P.; Ancker, J. S.; Jiang, G.; Kiefer, R. C.; Pacheco, J. A.; et al. 2019. Identification of predictive sub-phenotypes of acute kidney injury using structured and unstructured electronic health record data with memory networks. arXiv preprint arXiv:1904.04990.
Zhang, D., and Li, W.-J. 2014. Large-scale supervised multimodal hashing with semantic correlation maximization. In AAAI.
Zhang, D.; Wang, J.; Cai, D.; and Lu, J. 2010. Laplacian co-hashing of terms and documents. In ECIR, 577-580. Springer.
Zhang, X.; Lai, H.; and Feng, J. 2018. Attention-aware deep adversarial hashing for cross-modal retrieval. In ECCV, 591-606.
Zhu, H.; Long, M.; Wang, J.; and Cao, Y. 2016. Deep hashing network for efficient similarity retrieval. In AAAI.