# GlucoSynth: Generating Differentially-Private Synthetic Glucose Traces

Josephine Lamp1,2, Mark Derdzinski2, Christopher Hannemann2, Joost van der Linden2, Lu Feng1, Tianhao Wang1, David Evans1

1University of Virginia, Charlottesville, VA, USA; 2Dexcom, USA

jl4rj@virginia.edu; {mark.derdzinski; christopher.hannemann; joost.vanderlinden}@dexcom.com; {lu.feng; tianhao; evans}@virginia.edu

We focus on the problem of generating high-quality, private synthetic glucose traces, a task generalizable to many other time series sources. Existing methods for time series data synthesis, such as those using Generative Adversarial Networks (GANs), are not able to capture the innate characteristics of glucose data and cannot provide any formal privacy guarantees without severely degrading the utility of the synthetic data. In this paper we present GlucoSynth, a novel privacy-preserving GAN framework to generate synthetic glucose traces. The core intuition behind our approach is to conserve relationships amongst motifs (glucose events) within the traces, in addition to temporal dynamics. Our framework incorporates differential privacy mechanisms to provide strong formal privacy guarantees. We provide a comprehensive evaluation of the real-world utility of the data using 1.2 million glucose traces; GlucoSynth outperforms all previous methods in its ability to generate high-quality synthetic glucose traces with strong privacy guarantees.

## 1 Introduction

The sharing of medical time series data can facilitate therapy development. As a motivating example, sharing glucose traces can contribute to the understanding of diabetes disease mechanisms and the development of artificial insulin delivery systems that improve the quality of life of people with diabetes.
Unsurprisingly, there are serious legal and privacy concerns (e.g., HIPAA, GDPR) with the sharing of such granular, longitudinal time series data in a medical context [1]. One solution is to generate a set of synthetic traces from the original traces. In this way, the synthetic data may be shared publicly in place of the real data with significantly reduced privacy and legal concerns.

This paper focuses on the problem of generating high-quality, privacy-preserving synthetic glucose traces, a task which generalizes to other time series sources and application domains, including activity sequences, inpatient events, hormone traces and cyber-physical systems. Specifically, we focus on long (over 200 timesteps), bounded, univariate time series glucose traces. We assume that the available data does not have any labels or extra information such as features or metadata, a situation that is quite common, especially in diabetes. Continuous Glucose Monitors (CGMs) easily and automatically send glucose measurements taken subcutaneously at fixed intervals (e.g., every 5 minutes) to data storage facilities, but tracking other sources of diabetes-related data is challenging [2].

We characterize the quality of the generated traces based on three criteria. Synthetic traces should (1) conserve characteristics of the real data, i.e., glucose dynamics and control-related metrics (fidelity); (2) contain representation of diverse types of realistic traces, without the introduction of anomalous patterns that do not occur in real traces (breadth); and (3) be usable in place of the original data for real-world use cases (utility).

37th Conference on Neural Information Processing Systems (NeurIPS 2023).
Figure 1: Example Real Glucose Traces and Glucose Motifs from our Dataset. [Panels: Sample Glucose Traces (timesteps 0 to 288); Motif 1: High Peak; Motif 2: Deep Trough; Motif 3: Stable Line (timesteps 0 to 48 each). Axes: Timesteps vs. Glucose (mg/dL).]

Generative Adversarial Networks (GANs) [3] have shown promise in the generation of time series data. However, previous methods for time series synthesis, e.g., [4, 5, 6], suffer from one or more of the following issues when applied to glucose traces: 1) surprisingly, they do not generate realistic synthetic glucose traces; in particular, they produce physiologically impossible phenomena in the traces; 2) they require additional information (features, metadata or labels) to guide the model learning, which is not available for our traces; 3) they do not include any privacy guarantees, or, in order to uphold a strong formal privacy guarantee, they severely degrade the utility of the synthetic data.

Generating high-quality synthetic glucose traces is a difficult task due to the innate characteristics of glucose data. Glucose traces can be best understood as sequences of events, which we call motifs, shown in Figure 1, and they are more event-driven than many other types of time series. As such, a current glucose value may be more influenced by an event that occurred in the far past than by values from the immediately preceding timesteps. For example, a large meal eaten earlier in the day (30-90 minutes ago) may influence a patient's glucose more than the glucose values from the past 15 minutes. As a result, although there is some degree of temporal dependence within the traces, only conserving the immediate temporal relationships amongst values at previous timesteps does not adequately capture the dynamics of this type of data.
In particular, we find that the main reason previous methods fail is that they may not sufficiently learn event-related characteristics of glucose traces.

Contributions. We present GlucoSynth, a privacy-preserving GAN framework to generate synthetic glucose traces. The core intuition behind our approach is to conserve relationships amongst motifs (events) within the traces, in addition to the typical temporal dynamics contained within time series. We formalize the concept of motifs and define a notion of motif causality, inspired by Granger causality [7], which characterizes relationships amongst sequences of motifs within time series traces (Section 4). We define a local motif loss to first train a motif causality block that learns the motif causal relationships amongst the sequences of motifs in the real traces. The block outputs a motif causality matrix that quantifies the causal value of seeing one particular motif after some other motif. Unrealistic motif sequences (such as a peak followed by an immediate drop in glucose values) will have causal relationships close to 0 in the causality matrix. We build a novel GAN framework that is trained to optimize motif causality within the traces in addition to temporal dynamics and distributional characteristics of the data (Section 5). Explicitly, the generator computes a motif causality matrix from each batch of synthetic data it generates, and compares it with the real causality matrix. As the generator learns to generate synthetic data that yields a realistic causal matrix (thereby identifying appropriate causal relationships from the motifs), it implicitly learns not to generate unrealistic motif sequences. We also integrate differential privacy (DP) [8] into the framework (Section 6), which provides an intuitive bound on how much information may be disclosed about any individual in the dataset, allowing the GlucoSynth model to be trained with privacy guarantees.
Finally, in Section 7, we present a comprehensive evaluation using 1.2 million glucose traces from individuals with diabetes collected across 2022, showing that our model outperforms all previous models and generates high-quality synthetic glucose traces with strong privacy guarantees.

## 2 Related Work

We focus the scope of our comparison on current state-of-the-art methods for synthetic time series, which all build upon Generative Adversarial Networks (GANs) [3] and transformation-based approaches [9]. An extended related work is in Appendix A.

Figure 2: Temporal Distributions of Sample Motifs. Panels: (a) Glucose Motif 1, (b) Glucose Motif 2, (c) Temporal Motif 1, (d) Temporal Motif 2. Each radial graph displays the temporal distribution of a motif; there are 24 radial bars from 00:00 to 23:00, and each segment displays the percentage of motif occurrences in each hour. Glucose motifs 1 and 2 are from Fig. 1; they are not temporally dependent and show up across the day. Temporal motifs 1 and 2 are from a cardiology dataset [15].

Time Series. Brophy et al. [10] provide a survey of GANs for time series synthesis. TimeGAN [4] is a popular benchmark that jointly learns an embedding space using supervised and adversarial objectives in order to capture the temporal dynamics amongst traces. Esteban et al. [11] develop two time series GAN models (RGAN/RCGAN) with RNN architectures, conditioned on auxiliary information provided at each timestep during training. TTS-GAN [5] trains a GAN model that uses a transformer encoding architecture in order to best preserve temporal dynamics. Transformation-based approaches, such as real-valued non-volume preserving transformations (NVP) [9] and Fourier Flows (FF) [12], have also had success for time series data. These methods model the underlying distribution of the real data to transform the input traces into a synthetic data set.
Methods that only focus on learning the temporal or distributional dynamics in time series are not sufficient for generating realistic synthetic glucose traces due to the lack of temporal dependence within sequences of glucose motifs.

Differentially-Private GANs. To protect sensitive data, several GAN architectures have been designed to incorporate the privacy-preserving noise needed to satisfy differential privacy guarantees [13]. Frigerio et al. [14] extend a simple differentially-private architecture (dpGAN) to time series data, and RDP-CGAN [6] develops a convolutional GAN architecture specifically for medical data. These methods find large gaps in performance between the non-private and private models. Providing strong theoretical DP guarantees using these methods often results in synthetic data with too little fidelity for use in real-world scenarios. Our framework carefully integrates DP into the motif causality block and each network of the GAN, resulting in a better utility-privacy tradeoff than previous methods.

## 3 Preliminaries

### 3.1 Motifs

Glucose (and many other) traces can be best understood as sequences of events or motifs. Motifs characterize phenomena in the traces, such as peaks or troughs. We define a motif, $\mu$, as a short, ordered sequence of values ($v$) of specified length $\tau$, $\mu = [v_i, v_{i+1}, \ldots, v_{i+\tau}]$, and $\sigma$ is a tolerance value that allows approximate matching (within $\sigma$ for each value). Some examples of glucose traces and motifs are shown in Figure 1. We denote a set of $n$ time series traces as $X = [x_1, \ldots, x_n]$. Each time series may be represented as a sequence of motifs, $x_i = [\mu_{i_1}, \mu_{i_2}, \ldots]$, where each $i_j$ gives the index of the motif in the set that matches $x_{i,j\tau}, \ldots, x_{i,(j+1)\tau-1}$. Given the motif length $\tau$, the motif set is the union of all size-$\tau$ chunks in the traces. This definition is chosen for a straightforward implementation, but motifs can be generated in other ways, such as through the use of rolling windows or signal processing techniques [16, 17].
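
The chunk-based motif definition can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_motif_set` and `to_motif_sequence` are hypothetical, dropping a trailing partial chunk is an assumption, and using Euclidean distance to pick the "closest" motif among those within tolerance is also an assumption.

```python
import numpy as np

def build_motif_set(traces, tau):
    """Collect every non-overlapping length-tau chunk from all traces."""
    chunks = []
    for x in traces:
        n_chunks = len(x) // tau  # assumption: a trailing partial chunk is dropped
        arr = np.asarray(x[: n_chunks * tau], dtype=float)
        chunks.extend(arr.reshape(n_chunks, tau))
    # The motif set is the union of all chunks, so duplicates are removed.
    return np.unique(np.stack(chunks), axis=0)

def to_motif_sequence(x, motif_set, tau, sigma):
    """Map one trace to the indices of its matching motifs.

    Chunk j covers positions j*tau .. (j+1)*tau - 1. A motif matches when every
    value is within sigma; if several match, the closest one (Euclidean
    distance, an assumption) is chosen.
    """
    x = np.asarray(x, dtype=float)
    sequence = []
    for j in range(len(x) // tau):
        chunk = x[j * tau : (j + 1) * tau]
        worst_dev = np.abs(motif_set - chunk).max(axis=1)
        candidates = np.flatnonzero(worst_dev <= sigma)
        if candidates.size == 0:
            raise ValueError("no motif within tolerance for chunk %d" % j)
        dists = np.linalg.norm(motif_set[candidates] - chunk, axis=1)
        sequence.append(int(candidates[np.argmin(dists)]))
    return sequence

# Toy data (made up for illustration), with tau = 3 and sigma = 2 mg/dL.
traces = [[100, 110, 120, 100, 110, 120], [100, 111, 119, 200, 210, 220]]
motifs = build_motif_set(traces, tau=3)
sequences = [to_motif_sequence(x, motifs, tau=3, sigma=2) for x in traces]
```

Because the motif set is built from the same traces it is matched against, every chunk has at least one exact match, so the error branch only triggers when a trace from outside the set is passed in.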
Motifs are pulled from the data such that there is always a match from a trace motif to a motif from the set (if there are multiple matches, the closest one is chosen).

### 3.2 Glucose Dynamics (Why Standard Approaches Fail)

We first present a study of the characteristics of glucose data in order to motivate the development of our framework. Although there are general patterns in sequences of glucose motifs (e.g., motif patterns corresponding to patients that eat 2x vs. 3x a day), individual glucose motifs are typically not time-dependent, as illustrated in Figure 2. The radial graphs display the temporal distribution of the first two glucose motifs from Figure 1 and two temporally-dependent motifs from a cardiology dataset [15]. There are 24 radial bars from 00:00 to 23:00, one for each hour of the day, and the bar value is the percentage of total motif occurrences at that hour across the entire dataset (i.e., a value of 10 would indicate that 10% of the time that motif occurs during that hour). Note that the glucose motifs show up fairly evenly across all hours of the day, whereas the motifs from the cardiology dataset have shifts in their distribution and show up frequently at specific hours of the day.

The lack of temporal dependence in glucose motifs is likely due to the diverse patient behaviors within a patient population. Glucose in particular is highly variable and influenced by many factors including eating, exercise, stress levels, and sleep patterns. Moreover, due to innate variability within human physiology, motif occurrences can differ even for the same patient across weeks or months. These findings indicate that only conserving the temporal relationships within glucose traces (as many previous methods do) may not be sufficient to properly learn glucose dynamics and output realistic synthetic traces.
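
The per-hour percentages plotted in the radial graphs can be computed as in the following sketch. It assumes each motif occurrence carries a timestamp; the function name `hourly_motif_distribution` and the sample timestamps are hypothetical and are not taken from the paper's dataset.

```python
from collections import Counter
from datetime import datetime

def hourly_motif_distribution(occurrence_times):
    """Return 24 values: the percentage of motif occurrences in each hour."""
    counts = Counter(t.hour for t in occurrence_times)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * 24
    return [100.0 * counts.get(hour, 0) / total for hour in range(24)]

# Made-up occurrence times for a single motif.
occurrences = [
    datetime(2022, 1, 1, 8, 5),
    datetime(2022, 1, 2, 8, 40),
    datetime(2022, 1, 2, 13, 15),
    datetime(2022, 1, 3, 20, 30),
]
distribution = hourly_motif_distribution(occurrences)
```

A motif that is not time-dependent would give values close to 100/24 (about 4.2) in every hour, while a temporally-dependent motif would concentrate its mass in a few hours.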
### 3.3 Granger Causality

Granger causality [7] is commonly used to quantify relationships amongst time series without limiting the degree to which temporal relationships may be understood, as is done in other time series models, e.g., pure autoregressive ones. In this framework, an entire system (set of traces) is studied together, allowing for a broader characterization of their relationships, which may be advantageous, especially for long time series. We define $x_t \in \mathbb{R}^n$ as an $n$-dimensional vector of time series observed across $n$ traces and $T$ timesteps. To study causality, a vector autoregressive model (VAR) [18] may be used. A set of traces at time $t$ is represented as a linear combination of the previous $K$ lags in the series:

$$x_t = \sum_{k=1}^{K} A^{(k)} x_{t-k} + e_t$$

where each $A^{(k)}$ is an $n \times n$ matrix that describes how lag $k$ affects the future timepoints in the series and $e_t$ is zero-mean noise. Given this framework, we state that time series $q$ does not Granger-cause time series $p$ if and only if, for all $k$, $A^{(k)}_{p,q} = 0$. To better represent nonlinear dynamics amongst traces, a nonlinear autoregressive model (NAR) [19], $g$, may be defined, in which xt = g (x1