# Neural Universal Discrete Denoiser

Taesup Moon (DGIST, Daegu, Korea 42988), tsmoon@dgist.ac.kr
Seonwoo Min, Byunghan Lee, Sungroh Yoon (Seoul National University, Seoul, Korea 08826), {mswzeus, styxkr, sryoon}@snu.ac.kr

## Abstract

We present a new framework for applying deep neural networks (DNN) to devise a universal discrete denoiser. Unlike other approaches that utilize supervised learning for denoising, we do not require any additional training data. In this setting, while the ground-truth label, i.e., the clean data, is not available, we devise pseudo-labels and a novel objective function such that the DNN can be trained in the same way as in supervised learning to become a discrete denoiser. We experimentally show that our resulting algorithm, dubbed Neural DUDE, significantly outperforms the previous state-of-the-art in several applications, with a systematic rule for choosing the hyperparameter, which is an attractive feature in practice.

## 1 Introduction

Cleaning noise-corrupted data, i.e., denoising, is a ubiquitous problem in signal processing and machine learning. Discrete denoising, in particular, focuses on the cases in which both the underlying clean and noisy data take their values in some finite set. This setting covers several applications in different domains, such as image denoising [1, 2], DNA sequence denoising [3], and channel decoding [4].

A conventional approach for addressing the denoising problem is the Bayesian approach, which can often yield a computationally efficient algorithm with reasonable performance. However, limitations can arise when the assumed stochastic models do not accurately reflect the real data distribution. In particular, while the model for the noise can often be obtained relatively reliably, obtaining an accurate model for the original clean data is trickier; the model for the clean data may be wrong, changing, or may not exist at all.

In order to alleviate the above limitations, [5] proposed a universal approach for discrete denoising. Namely, they first considered a general setting in which the clean finite-valued source symbols are corrupted by a discrete memoryless channel (DMC), a noise mechanism that corrupts each source symbol independently and statistically identically. They then devised an algorithm called DUDE (Discrete Universal DEnoiser) and showed rigorous performance guarantees for the semi-stochastic setting, namely, the setting in which no stochastic modeling assumptions are made on the underlying source data, while the corruption mechanism is assumed to be governed by a known DMC. DUDE is shown to universally attain the optimum denoising performance for any source data as the data size grows. In addition to the strong theoretical performance guarantee, DUDE can be implemented as a computationally efficient sliding window denoiser; hence, it has been successfully applied and extended to several practical applications, e.g., [1, 3, 4, 2]. However, it also has limitations; namely, the performance is sensitive to the choice of the sliding window size $k$, which has to be hand-tuned without any systematic rule. Moreover, when $k$ becomes large and the alphabet size of the signal increases, DUDE suffers from a data sparsity problem, which significantly deteriorates its performance.
In this paper, we present a novel framework for addressing the above limitations of DUDE by adopting the machinery of deep neural networks (DNN) [6], which have recently seen great empirical success in many practical applications. While there have been some previous attempts to apply neural networks to grayscale image denoising [7, 8], they all remained in the supervised learning setting, i.e., large-scale training data consisting of clean and noisy image pairs was necessary. Such an approach requires significant computational resources and training time and is not always transferable to other denoising applications, in which collecting massive training data is often expensive, e.g., DNA sequence denoising [9].

Hence, we stick to the setting of DUDE, which requires no additional data other than the given noisy data. In this case, however, it is not straightforward to adopt a DNN, since there is no ground-truth label for supervised training of the network. Namely, the target label that a denoising algorithm tries to estimate from the observation is the underlying clean signal; hence, it can never be observed by the algorithm. Therefore, we carefully exploit the known DMC assumption and the finiteness of the data values, and devise pseudo-labels for training the DNN. They are based on an unbiased estimate of the true loss a denoising algorithm is incurring, and we show that it is possible to train a DNN as a universal discrete denoiser using the devised pseudo-labels and a generalized cross-entropy objective function. As a by-product, we also obtain an accurate estimator of the true denoising performance, with which we can systematically choose the appropriate window size $k$. As a result, we experimentally verify that our DNN-based denoiser, dubbed Neural DUDE, can achieve significantly better performance than DUDE while maintaining robustness with respect to $k$. Furthermore, although the work in this paper focuses on discrete denoising, we believe the proposed framework can be extended to the denoising of continuous-valued signals as well; we defer this to future work.

## 2 Notations and related work

### 2.1 Problem setting of discrete denoising

Throughout this paper, we will generally denote a sequence ($n$-tuple) as, e.g., $a^n = (a_1, \ldots, a_n)$, and $a_i^j$ refers to the subsequence $(a_i, \ldots, a_j)$. In the discrete denoising problem, we denote the clean, underlying source data as $x^n$ and assume each component $x_i$ takes a value in some finite set $\mathcal{X}$. The source sequence is corrupted by a DMC and results in a noisy version of the source, $z^n$, of which each component $z_i$ takes a value in, again, some finite set $\mathcal{Z}$. The DMC is completely characterized by the channel transition matrix $\Pi \in \mathbb{R}^{|\mathcal{X}| \times |\mathcal{Z}|}$, of which the $(x, z)$-th element, $\Pi(x, z)$, stands for $\Pr(Z_i = z \mid X_i = x)$, i.e., the conditional probability of the noisy symbol taking value $z$ given that the original source symbol was $x$. An essential but natural assumption we make is that $\Pi$ has full row rank. Upon observing the entire noisy data $z^n$, a discrete denoiser reconstructs the original data with $\hat{X}^n = (\hat{X}_1(z^n), \ldots, \hat{X}_n(z^n))$, where each reconstructed symbol $\hat{X}_i(z^n)$ also takes its value in a finite set $\hat{\mathcal{X}}$. The goodness of the reconstruction by a discrete denoiser $\hat{X}^n$ is measured by the average loss,
$$L_{\hat{X}^n}(x^n, z^n) = \frac{1}{n}\sum_{i=1}^{n} \Lambda\big(x_i, \hat{X}_i(z^n)\big),$$
where $\Lambda(x_i, \hat{x}_i)$ is a single-letter loss function that measures the loss incurred by estimating $x_i$ with $\hat{x}_i$ at location $i$. The loss function can also be represented with a loss matrix $\Lambda \in \mathbb{R}^{|\mathcal{X}| \times |\hat{\mathcal{X}}|}$. Throughout the paper, for simplicity, we will assume $\mathcal{X} = \mathcal{Z} = \hat{\mathcal{X}}$ and, thus, that $\Pi$ is invertible.
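To make the setting concrete, the following is a minimal NumPy sketch (in Python, the language used for the experiments of Section 5) of the binary instance studied in Section 5.1: a binary symmetric channel $\Pi$ with crossover probability $\delta = 0.1$, the Hamming loss matrix $\Lambda$, and a noisy sequence $z^n$ obtained by passing a clean binary symmetric Markov chain $x^n$ through the DMC. The function and variable names are ours, not part of any released code.

```python
import numpy as np

def make_bsc(delta):
    """Channel matrix Pi of a binary symmetric channel with crossover probability delta."""
    return np.array([[1 - delta, delta],
                     [delta, 1 - delta]])

def hamming_loss(size=2):
    """Loss matrix Lambda: 0 on the diagonal, 1 elsewhere."""
    return 1.0 - np.eye(size)

def sample_markov_source(n, alpha, rng):
    """Binary symmetric Markov chain with transition probability alpha (as in Section 5.1)."""
    x = np.empty(n, dtype=np.int64)
    x[0] = rng.integers(2)
    flips = rng.random(n) < alpha
    for i in range(1, n):
        x[i] = x[i - 1] ^ int(flips[i])
    return x

def corrupt(x, Pi, rng):
    """Pass x^n through the DMC given by Pi, corrupting each symbol independently."""
    return np.array([rng.choice(Pi.shape[1], p=Pi[xi]) for xi in x])

rng = np.random.default_rng(0)
Pi = make_bsc(delta=0.1)
Lam = hamming_loss()
x = sample_markov_source(n=10**5, alpha=0.1, rng=rng)
z = corrupt(x, Pi, rng)
print("empirical error rate of z:", np.mean(x != z))  # approximately delta
```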
### 2.2 Discrete Universal DEnoiser (DUDE)

DUDE [5] is a two-pass algorithm with linear complexity in the data size $n$. During the first pass, the algorithm with window size $k$ collects the statistics vector
$$m[z^n, l^k, r^k](a) = \big|\{\, i : k+1 \le i \le n-k,\ z_{i-k}^{i+k} = l^k a r^k \,\}\big| \tag{1}$$
for all $a \in \mathcal{Z}$, which is the count of the occurrences of the symbol $a \in \mathcal{Z}$ along the noisy sequence $z^n$ that have the double-sided context $(l^k, r^k) \in \mathcal{Z}^{2k}$. Once the $m$ vector is collected, for the second pass, DUDE applies the rule
$$\hat{X}_{i,\mathrm{DUDE}}(z^n) = \arg\min_{\hat{x} \in \hat{\mathcal{X}}}\; m[z^n, c_i]^\top \Pi^{-1} \big[\lambda_{\hat{x}} \odot \pi_{z_i}\big] \quad \text{for each } k+1 \le i \le n-k, \tag{2}$$
where $c_i \triangleq (z_{i-k}^{i-1}, z_{i+1}^{i+k})$ is the context of $z_i$, $\pi_{z_i}$ is the $z_i$-th column of the channel matrix $\Pi$, $\lambda_{\hat{x}}$ is the $\hat{x}$-th column of the loss matrix $\Lambda$, and $\odot$ stands for the element-wise product. The form of (2) shows that DUDE is a sliding window denoiser with window size $2k+1$; namely, DUDE returns the same denoised symbol at all locations $i$ with the same value of $z_{i-k}^{i+k}$. We will call such denoisers the $k$-th order sliding window denoisers from now on. DUDE is shown to be universal, i.e., for any underlying clean sequence $x^n$, it can always attain the performance of the best $k$-th order sliding window denoiser as long as $k|\mathcal{Z}|^{2k} = o(n/\log n)$ holds [5, Theorem 2]. For more rigorous analyses, we refer to the original paper [5].
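To illustrate the two-pass structure of (1) and (2), here is a compact, readability-first sketch of DUDE for the finite-alphabet setting above. It reuses the `Pi`, `Lam`, `x`, and `z` arrays from the previous sketch; locations outside $k+1 \le i \le n-k$ are simply left unchanged, and the function name is ours.

```python
import numpy as np
from collections import defaultdict

def dude(z, Pi, Lam, k):
    """Two-pass DUDE: count contexts (eq. (1)), then apply rule (2) at every location."""
    n = len(z)
    Pi_inv = np.linalg.inv(Pi)
    nz = Pi.shape[1]
    # First pass: m[c][a] = number of occurrences of symbol a with double-sided context c.
    m = defaultdict(lambda: np.zeros(nz))
    for i in range(k, n - k):
        c = (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))
        m[c][z[i]] += 1
    # Second pass: x_hat_i = argmin_xhat  m[c]^T Pi^{-1} (lambda_xhat * pi_{z_i}).
    x_hat = z.copy()  # boundary symbols (i < k or i >= n - k) are left as observed
    for i in range(k, n - k):
        c = (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))
        scores = [m[c] @ Pi_inv @ (Lam[:, xh] * Pi[:, z[i]]) for xh in range(Lam.shape[1])]
        x_hat[i] = int(np.argmin(scores))
    return x_hat

# Usage with the arrays from the previous sketch (hypothetical choice of k):
# x_hat = dude(z, Pi, Lam, k=5); print("DUDE BER:", np.mean(x_hat != x))
```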
### 2.3 Deep neural networks (DNN) and related work

Deep neural networks (DNN), often dubbed deep learning algorithms, have recently made significant impact in several practical applications, such as speech recognition, image recognition, and machine translation. For a thorough review of recent progress on DNN, we refer the readers to [6] and references therein. Regarding denoising, [7, 8, 10] have successfully applied DNNs to grayscale image denoising by utilizing supervised learning at the small image patch level. Namely, they generated clean and noisy image patches and trained neural networks to learn a mapping from noisy to clean patches. While such an approach attained state-of-the-art performance, as mentioned in the Introduction, it has several limitations. That is, it typically requires a massive amount of training data, and multiple copies of the data need to be generated for different noise types and levels to achieve robust performance. Such a requirement of large training data cannot always be met in other applications; e.g., in DNA sequence denoising, collecting large-scale clean DNA sequences is much more expensive than obtaining training images on the web. Moreover, for image denoising, working at the small patch level makes sense since the image patches may share some textural regularities, but in other applications, the characteristics of the given data for denoising could differ from those in the pre-collected training set. For instance, the characteristics of substrings of DNA sequences vary widely across different species and genes; hence, the universal setting makes more sense in DNA sequence denoising.

## 3 An alternative interpretation of DUDE

### 3.1 Unbiased estimated loss

In order to give an alternative interpretation of DUDE, which can also be found in [11], we need the tool developed in [12]. To be self-contained, we recap the idea here. Consider a single-letter case, namely, a clean symbol $x$ is corrupted by $\Pi$, resulting in the noisy observation $Z$. (We use the uppercase letter $Z$ to stress that it is a random variable.) Then, suppose a single-symbol denoiser $s : \mathcal{Z} \to \hat{\mathcal{X}}$ is applied and obtains the denoised symbol $\hat{X} = s(Z)$. In this case, the true loss incurred by $s$ for the clean symbol $x$ and the noisy observation $Z$ is $\Lambda(x, s(Z))$. It is clear that $s$ cannot evaluate its loss since it does not know what $x$ is, but the following shows that an unbiased estimate of the expected true loss, based only on $Z$ and $s$, can be derived.

First, denote by $\mathcal{S}$ the set of all possible single-symbol denoisers, and note $|\mathcal{S}| = |\hat{\mathcal{X}}|^{|\mathcal{Z}|}$. Then, we define a matrix $\boldsymbol{\rho} \in \mathbb{R}^{|\mathcal{X}| \times |\mathcal{S}|}$ with
$$\boldsymbol{\rho}(x, s) = \sum_{z \in \mathcal{Z}} \Pi(x, z)\,\Lambda(x, s(z)) = \mathbb{E}_x\,\Lambda(x, s(Z)), \qquad x \in \mathcal{X},\ s \in \mathcal{S}. \tag{3}$$
Then, we can define an estimated loss matrix $\mathbf{L} \triangleq \Pi^{-1}\boldsymbol{\rho} \in \mathbb{R}^{|\mathcal{Z}| \times |\mathcal{S}|}$. (For the general case in which $\Pi$ is not a square matrix, $\Pi^{-1}$ can be replaced with the right inverse of $\Pi$.) With this definition, we can show that $\mathbf{L}(Z, s)$ is an unbiased estimate of $\mathbb{E}_x\,\Lambda(x, s(Z))$ as follows (as shown in [12]):
$$\mathbb{E}_x\,\mathbf{L}(Z, s) = \sum_{z} \Pi(x, z) \sum_{x'} \Pi^{-1}(z, x')\,\boldsymbol{\rho}(x', s) = \sum_{x'} \delta(x, x')\,\boldsymbol{\rho}(x', s) = \boldsymbol{\rho}(x, s) = \mathbb{E}_x\,\Lambda(x, s(Z)).$$
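To make the construction above concrete, the sketch below enumerates the $|\hat{\mathcal{X}}|^{|\mathcal{Z}|} = 4$ single-symbol denoisers for the binary/Hamming case, builds $\boldsymbol{\rho}$ from (3), and forms the estimated loss matrix $\mathbf{L} = \Pi^{-1}\boldsymbol{\rho}$. The helper names are ours.

```python
import itertools
import numpy as np

def single_symbol_denoisers(nz, nxhat):
    """All mappings s: Z -> X_hat, each represented as a tuple (s(0), ..., s(|Z|-1))."""
    return list(itertools.product(range(nxhat), repeat=nz))

def estimated_loss_matrix(Pi, Lam):
    """rho(x, s) = sum_z Pi(x, z) * Lam(x, s(z))  (eq. (3));  L = Pi^{-1} rho."""
    nx, nz = Pi.shape
    S = single_symbol_denoisers(nz, Lam.shape[1])
    rho = np.array([[sum(Pi[x, z] * Lam[x, s[z]] for z in range(nz)) for s in S]
                    for x in range(nx)])
    L = np.linalg.inv(Pi) @ rho
    return L, S

Pi = np.array([[0.9, 0.1], [0.1, 0.9]])   # BSC with delta = 0.1
Lam = 1.0 - np.eye(2)                     # Hamming loss
L, S = estimated_loss_matrix(Pi, Lam)
print(np.round(L, 3))   # the 2 x 4 estimated loss matrix
print(S)                # [(0,0), (0,1), (1,0), (1,1)]: always-0, say-what-you-see, flip, always-1
```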
### 3.2 DUDE: Minimizing the sum of estimated losses

As mentioned in Section 2.2, DUDE with context size $k$ is a $k$-th order sliding window denoiser. Generally, we can denote such a $k$-th order sliding window denoiser as $s_k : \mathcal{Z}^{2k+1} \to \hat{\mathcal{X}}$, which obtains the reconstruction at the $i$-th location as
$$\hat{X}_i(z^n) = s_k\big(z_{i-k}^{i+k}\big) = s_k(c_i, z_i). \tag{4}$$
To recall, $c_i = (z_{i-k}^{i-1}, z_{i+1}^{i+k})$. Now, from the formulation (4), we can interpret $s_k$ as defining a single-symbol denoiser at location $i$, i.e., $s_k(c_i, \cdot)$, depending on $c_i$. With this view on $s_k$, as derived in [11], we can show that the DUDE defined in (2) is equivalent to finding a single-symbol denoiser
$$s_{k,\mathrm{DUDE}}(c, \cdot) = \arg\min_{s \in \mathcal{S}} \sum_{\{i : c_i = c\}} \mathbf{L}(z_i, s) \tag{5}$$
for each context $c \in \mathcal{C}_k \triangleq \{(l^k, r^k) : (l^k, r^k) \in \mathcal{Z}^{2k}\}$ and obtaining the reconstruction at location $i$ as $\hat{X}_{i,\mathrm{DUDE}}(z^n) = s_{k,\mathrm{DUDE}}(c_i, z_i)$.

The interpretation (5) gives some intuition on why DUDE enjoys strong theoretical guarantees in [5]: since $\mathbf{L}(Z_i, s)$ is an unbiased estimate of $\mathbb{E}_{x_i}\Lambda(x_i, s(Z_i))$, the sum $\sum_{\{i : c_i = c\}} \mathbf{L}(Z_i, s)$ will concentrate on $\sum_{\{i : c_i = c\}} \Lambda(x_i, s(Z_i))$ as long as $|\{i : c_i = c\}|$ is sufficiently large. Hence, the single-symbol denoiser that minimizes the sum of the estimated losses for each $c$, i.e., (5), will also make the sum of the true losses small, which is the goal of a denoiser.

We can also express (5) using vector notation, which will become useful for deriving Neural DUDE in the next section. That is, we let $\Delta^{|\mathcal{S}|}$ be the probability simplex in $\mathbb{R}^{|\mathcal{S}|}$. (Suppose we have uniquely assigned each coordinate of $\mathbb{R}^{|\mathcal{S}|}$ to each single-symbol denoiser in $\mathcal{S}$ from now on.) Then, we can define a probability vector for each $c$,
$$\hat{\mathbf{p}}(c) \triangleq \arg\min_{\mathbf{p} \in \Delta^{|\mathcal{S}|}} \sum_{\{i : c_i = c\}} \mathbf{1}_{z_i}^\top \mathbf{L}\, \mathbf{p}, \tag{6}$$
which will be on the vertex of $\Delta^{|\mathcal{S}|}$ that corresponds to $s_{k,\mathrm{DUDE}}(c, \cdot)$ in (5), because the objective function in (6) is linear in $\mathbf{p}$. Hence, we can simply obtain $s_{k,\mathrm{DUDE}}(c, \cdot) = \arg\max_s \hat{\mathbf{p}}(c)_s$, where $\hat{\mathbf{p}}(c)_s$ stands for the $s$-th coordinate of $\hat{\mathbf{p}}(c)$.

## 4 Neural DUDE: A DNN based discrete denoiser

As seen in the previous section, DUDE utilizes the estimated loss matrix $\mathbf{L}$, which does not depend on the clean sequence $x^n$. However, the main drawback of DUDE is that, as can be seen in (5), it treats each context $c$ independently from the others. Namely, when the context size $k$ grows, the number of different contexts $|\mathcal{C}_k| = |\mathcal{Z}|^{2k}$ grows exponentially with $k$; hence, the sample size for each context, $|\{i : c_i = c\}|$, decreases exponentially for a given sequence length $n$. Such a phenomenon hinders the concentration of $\sum_{\{i : c_i = c\}} \mathbf{L}(Z_i, s)$ mentioned in the previous section, which causes the performance of DUDE to deteriorate when $k$ grows too large.

In order to resolve the above problem, we develop Neural DUDE, which adopts a single neural network such that the information from similar contexts can be shared via the network parameters. We note that our usage of a DNN resembles that of the neural language model (NLM) [13], which improved upon the conventional N-gram models. The difference is that NLM is essentially a prediction problem, hence the ground-truth label for supervised training is readily available, whereas in denoising this is not the case. Before describing the algorithm in more detail, we need the following lemma.

### 4.1 A lemma

Let $\mathbb{R}^{|\mathcal{S}|}_+$ be the space of all $|\mathcal{S}|$-dimensional vectors with nonnegative elements. Then, for any $\mathbf{g} \in \mathbb{R}^{|\mathcal{S}|}_+$ and any $\mathbf{p} \in \Delta^{|\mathcal{S}|}$, define the cost function $\mathcal{C}(\mathbf{g}, \mathbf{p}) \triangleq -\sum_{i=1}^{|\mathcal{S}|} g_i \log p_i$, i.e., a generalized cross-entropy function with the first argument not normalized to a probability vector. Note that $\mathcal{C}(\mathbf{g}, \mathbf{p})$ is linear in $\mathbf{g}$ and convex in $\mathbf{p}$. The following lemma shows another way of obtaining DUDE.

**Lemma 1** Define $\mathbf{L}_{\mathrm{new}} \triangleq -\mathbf{L} + L_{\max}\mathbf{1}\mathbf{1}^\top$, in which $L_{\max} \triangleq \max_{z,s} \mathbf{L}(z, s)$ is the maximum element of $\mathbf{L}$. Using the cost function $\mathcal{C}(\cdot, \cdot)$ defined above, for each $c \in \mathcal{C}_k$, let us define
$$\mathbf{p}^*(c) \triangleq \arg\min_{\mathbf{p} \in \Delta^{|\mathcal{S}|}} \sum_{\{i : c_i = c\}} \mathcal{C}\big(\mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i},\, \mathbf{p}\big).$$
Then, we have $s_{k,\mathrm{DUDE}}(c, \cdot) = \arg\max_s \mathbf{p}^*(c)_s$.

*Proof:* The proof of the lemma is given in the Supplementary Material.
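The following sketch illustrates Lemma 1 numerically for the binary case, reusing `L` and `S` from the previous sketch: $\mathbf{L}_{\mathrm{new}} = -\mathbf{L} + L_{\max}\mathbf{1}\mathbf{1}^\top$ is entrywise nonnegative, each noisy symbol $z$ contributes the pseudo-label $\mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_z$, and for a fixed context the minimizer of the summed cost $\mathcal{C}(\cdot, \mathbf{p})$ over the simplex is the normalized sum of pseudo-labels, whose argmax coincides with the DUDE choice in (5). Names and the example symbol counts are ours.

```python
import numpy as np

def pseudo_labels(L):
    """L_new = -L + L_max * 11^T (Lemma 1); row z of L_new is the pseudo-label for symbol z."""
    return -L + L.max()

def dude_choice_via_lemma1(L, noisy_symbols_in_context):
    """For one context c: p*(c) is proportional to the summed pseudo-labels, and
    argmax_s p*(c)_s equals argmin_s sum_i L(z_i, s), i.e., the DUDE rule (5)."""
    L_new = pseudo_labels(L)
    g = sum(L_new[z] for z in noisy_symbols_in_context)   # summed pseudo-labels, a vector in R^{|S|}_+
    p_star = g / g.sum()                                   # minimizer of the summed cost C(., p)
    assert np.argmax(p_star) == np.argmin(sum(L[z] for z in noisy_symbols_in_context))
    return int(np.argmax(p_star))

# e.g., noisy symbols observed at the locations sharing some context c (hypothetical):
# best_s = dude_choice_via_lemma1(L, [0, 0, 1, 0, 0])
```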
### 4.2 Neural DUDE

The main idea of Neural DUDE is to use a single neural network to learn the $k$-th order sliding window denoising rule for all $c$'s. Namely, we define $\mathbf{p}(\mathbf{w}, \cdot) : \mathcal{Z}^{2k} \to \Delta^{|\mathcal{S}|}$ as a feed-forward neural network that takes the context vector $c \in \mathcal{C}_k$ as input and outputs a probability vector on $\Delta^{|\mathcal{S}|}$. We let $\mathbf{w}$ stand for all the parameters in the network. The network architecture of $\mathbf{p}(\mathbf{w}, \cdot)$ has a softmax output layer and is analogous to that used for multi-class classification. Thus, when the parameters are properly learned, we expect that $\mathbf{p}(\mathbf{w}, c_i)$ will give predictions on which single-symbol denoiser to apply at location $i$ with the context $c_i$.

#### 4.2.1 Learning

When not resorting to the supervised learning framework, learning the network parameters $\mathbf{w}$ is not straightforward, as mentioned in the Introduction. However, inspired by Lemma 1, we define the objective function to minimize for learning $\mathbf{w}$ as
$$\mathcal{L}(\mathbf{w}, z^n) \triangleq \frac{1}{n}\sum_{i=1}^{n} \mathcal{C}\big(\mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i},\, \mathbf{p}(\mathbf{w}, c_i)\big), \tag{7}$$
which resembles the widely used cross-entropy objective function in supervised multi-class classification. Namely, in (7), $\{(c_i, \mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i})\}_{i=1}^{n}$, which solely depends on the noisy sequence $z^n$, can be analogously thought of as the input-label pairs in supervised learning. (Note that for $i \le k$ and $i > n-k$, dummy variables are padded for obtaining $c_i$.) But, unlike classification, in which the ground-truth label is given as a one-hot vector, we treat $\mathbf{L}_{\mathrm{new}}^\top \mathbf{1}_{z_i} \in \mathbb{R}^{|\mathcal{S}|}_+$ as a target pseudo-label on $\mathcal{S}$.

Once the objective function is set as in (7), we can use the widely used optimization techniques, namely back-propagation and Stochastic Gradient Descent (SGD)-based methods, for learning the parameters $\mathbf{w}$. In fact, most of the well-known improvements to the SGD method, such as momentum [14], mini-batch SGD, and several others [15, 16], can all be used for learning $\mathbf{w}$. Note that there is no notion of generalization in our setting, since the goal of denoising is simply to achieve as small an average loss as possible for the given noisy sequence $z^n$, rather than to perform well on separate unseen test data. Hence, we do not use any regularization techniques, such as dropout, in our learning, but simply try to minimize the objective function.

#### 4.2.2 Denoising

After sufficiently many iterations of weight updates, the objective function (7) will converge, and we denote the converged parameters as $\mathbf{w}^*$. The Neural DUDE algorithm then applies the resulting network $\mathbf{p}(\mathbf{w}^*, \cdot)$ to the exact same noisy sequence $z^n$ used for learning in order to denoise it. Namely, for each $c \in \mathcal{C}_k$, we obtain a single-symbol denoiser
$$s_{k,\mathrm{Neural\ DUDE}}(c, \cdot) = \arg\max_{s} \mathbf{p}(\mathbf{w}^*, c)_s \tag{8}$$
and the reconstruction at location $i$ as $\hat{X}_{i,\mathrm{Neural\ DUDE}}(z^n) = s_{k,\mathrm{Neural\ DUDE}}(c_i, z_i)$.

From the objective function (7) and the definition (8), it is apparent that Neural DUDE does share information across different contexts, since $\mathbf{w}^*$ is learnt from all the data and shared across all contexts. Such a property enables Neural DUDE to run robustly with much larger $k$'s than DUDE without running into the data sparsity problem. As shown in the experimental section, Neural DUDE with large $k$ can significantly improve the denoising performance compared to DUDE. Furthermore, in the experimental section, we show that the concentration
$$\frac{1}{n}\sum_{i=1}^{n} \mathbf{L}\big(Z_i, s_{k,\mathrm{Neural\ DUDE}}(c_i, \cdot)\big) \;\approx\; \frac{1}{n}\sum_{i=1}^{n} \Lambda\big(x_i, s_{k,\mathrm{Neural\ DUDE}}(c_i, Z_i)\big) \tag{9}$$
holds with high probability even for very large $k$'s, whereas such concentration quickly breaks for DUDE as $k$ grows. While deferring the analysis of why such concentration always holds to future work, we can use this property to provide a systematic mechanism for choosing the best context size $k$ for Neural DUDE: simply choose $k^* = \arg\min_k \frac{1}{n}\sum_{i=1}^{n} \mathbf{L}(Z_i, s_{k,\mathrm{Neural\ DUDE}}(c_i, \cdot))$. As shown in the experiments, this choice of $k$ gives excellent denoising performance. Algorithm 1 summarizes the Neural DUDE algorithm.

**Algorithm 1** Neural DUDE algorithm
- **Input:** Noisy sequence $z^n$, $\Pi$, $\Lambda$, maximum context size $k_{\max}$
- **Output:** Denoised sequence $\hat{X}^n_{\mathrm{Neural\ DUDE}} = \{\hat{X}_{i,\mathrm{Neural\ DUDE}}(z^n)\}_{i=1}^{n}$
- Compute $\mathbf{L} = \Pi^{-1}\boldsymbol{\rho}$ as in Section 3.1 and $\mathbf{L}_{\mathrm{new}}$ as in Lemma 1
- **for** $k = 1, \ldots, k_{\max}$ **do**
  - Initialize $\mathbf{p}(\mathbf{w}, \cdot)$ with input dimension $2k|\mathcal{Z}|$ (using one-hot encoding of each noisy symbol)
  - Obtain $\mathbf{w}^*_k$ minimizing $\mathcal{L}(\mathbf{w}, z^n)$ in (7) using an SGD-like optimization method
  - Obtain $s_{k,\mathrm{Neural\ DUDE}}(c, \cdot)$ for all $c \in \mathcal{C}_k$ as in (8) using $\mathbf{w}^*_k$
  - Compute $\hat{L}_k \triangleq \frac{1}{n}\sum_{i=1}^{n} \mathbf{L}(z_i, s_{k,\mathrm{Neural\ DUDE}}(c_i, \cdot))$
- **end for**
- Set $k^* = \arg\min_k \hat{L}_k$ and obtain $\hat{X}_{i,\mathrm{Neural\ DUDE}}(z^n) = s_{k^*,\mathrm{Neural\ DUDE}}(c_i, z_i)$ for $i = 1, \ldots, n$

**Remark:** We note that using the cost function in (7) is important. That is, if we use a simpler objective like (6), $\frac{1}{n}\sum_{i=1}^{n} (\mathbf{L}^\top \mathbf{1}_{z_i})^\top \mathbf{p}(\mathbf{w}, c_i)$, it becomes highly non-convex in $\mathbf{w}$, and the solution $\mathbf{w}^*$ becomes very unstable. Moreover, using $\mathbf{L}_{\mathrm{new}}$ instead of $\mathbf{L}$ in the cost function is important as well, since it guarantees that the cost function $\mathcal{C}(\cdot, \cdot)$ is always convex in the second argument.
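For concreteness, here is one possible rendering of the inner loop of Algorithm 1 in modern Keras. Section 5 notes that the original experiments used Python 2.7 with Keras on a Theano backend; the exact architecture, padding choice, and helper names below are our assumptions, so this is a sketch rather than the authors' implementation. It trains $\mathbf{p}(\mathbf{w}, \cdot)$ on the pseudo-labels with a categorical cross-entropy loss, which for a softmax output and nonnegative (unnormalized) targets matches the generalized cross-entropy objective (7), assuming the backend implements the plain $-\sum_s g_s \log p_s$ form; otherwise a custom loss implementing $\mathcal{C}(\cdot,\cdot)$ can be substituted. It then denoises and reports the estimated average loss used to select $k$.

```python
import numpy as np
from tensorflow import keras

def one_hot_contexts(z, k, nz):
    """Input features: one-hot encoding of the 2k context symbols around each location
    (boundary locations are padded with a dummy symbol, here 0)."""
    n = len(z)
    zp = np.concatenate([np.zeros(k, dtype=z.dtype), z, np.zeros(k, dtype=z.dtype)])
    X = np.zeros((n, 2 * k * nz), dtype=np.float32)
    for i in range(n):
        ctx = np.concatenate([zp[i:i + k], zp[i + k + 1:i + 2 * k + 1]])  # left + right context
        X[i, np.arange(2 * k) * nz + ctx] = 1.0
    return X

def neural_dude(z, L, S, k, hidden=(40, 40, 40)):
    """One pass of the loop in Algorithm 1 for a fixed k: train p(w, .) with the
    pseudo-labels of Lemma 1, denoise z^n, and return the estimated average loss L_k."""
    nz = L.shape[0]
    L_new = -L + L.max()                       # pseudo-label matrix (Lemma 1)
    X = one_hot_contexts(z, k, nz)
    Y = L_new[z].astype(np.float32)            # row z_i of L_new, i.e., L_new^T 1_{z_i}
    model = keras.Sequential(
        [keras.layers.Dense(h, activation="relu") for h in hidden]
        + [keras.layers.Dense(len(S), activation="softmax")]
    )
    # With a softmax output, categorical cross-entropy on the unnormalized, nonnegative
    # targets corresponds to the generalized cross-entropy objective (7) averaged over i.
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    model.fit(X, Y, batch_size=100, epochs=10, verbose=0)
    s_idx = np.argmax(model.predict(X, verbose=0), axis=1)   # s_k(c_i, .) for every i
    x_hat = np.array(S)[s_idx, z]                            # reconstruction via (8)
    est_loss = float(np.mean(L[z, s_idx]))                   # (1/n) sum_i L(z_i, s_k(c_i, .))
    return x_hat, est_loss

# k selection as in Algorithm 1 (L, S from the earlier sketch; z from the Section 5.1 setup):
# runs = {k: neural_dude(z, L, S, k) for k in range(1, 11)}
# k_star = min(runs, key=lambda k: runs[k][1])
# x_hat = runs[k_star][0]
```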
## 5 Experimental results

In this section, we show the denoising results of Neural DUDE for synthetic binary data, real binary images, and real Oxford Nanopore MinION DNA sequence data. All of our experiments were done with Python 2.7 and the Keras package (http://keras.io) with the Theano [17] backend.

### 5.1 Synthetic binary data

We first experimented with simple synthetic binary data to highlight the core strength of Neural DUDE. That is, we assume $\mathcal{X} = \mathcal{Z} = \hat{\mathcal{X}} = \{0, 1\}$ and $\Pi$ is a binary symmetric channel (BSC) with crossover probability $\delta = 0.1$. We set $\Lambda$ to the Hamming loss. We generated the clean binary sequence $x^n$ of length $n = 10^6$ from a binary symmetric Markov chain (BSMC) with transition probability $\alpha = 0.1$. The noise-corrupted sequence $z^n$ was generated by passing $x^n$ through $\Pi$. Since we use the Hamming loss, the average loss of a denoiser $\hat{X}^n$, $\frac{1}{n}\sum_{i=1}^{n}\Lambda(x_i, \hat{X}_i(z^n))$, is equal to the bit error rate (BER). Note that in this setting, the noisy sequence $z^n$ is a hidden Markov process. Therefore, when the stochastic model of the clean sequence is exactly known to the denoiser, the Viterbi-like Forward-Backward (FB) recursion algorithm can attain the optimum BER.

[Figure 1: Denoising results of DUDE and Neural DUDE for the synthetic binary data with $n = 10^6$. (a) BER/$\delta$ vs. window size $k$ for DUDE, Neural DUDE (1L)-(4L), and the FB recursion; (b) true and estimated BER vs. $k$ for DUDE; (c) true and estimated BER vs. $k$ for Neural DUDE (4L).]

Figure 1 shows the denoising results of DUDE and Neural DUDE, which know nothing about the characteristics of the clean sequence $x^n$. For DUDE, the window size $k$ is the single hyperparameter to choose. For Neural DUDE, we used feed-forward fully connected neural networks for $\mathbf{p}(\mathbf{w}, \cdot)$ and varied the depth of the network between 1 and 4 while also varying $k$. Neural DUDE (1L) corresponds to the simple linear softmax regression model. For the deeper models, we used 40 hidden nodes in each layer with Rectified Linear Unit (ReLU) activations. We used Adam [16] with the default setting in Keras as the optimizer to minimize (7). We used a mini-batch size of 100 and ran 10 epochs for learning. The performance of Neural DUDE was robust to the initialization of the parameters $\mathbf{w}$.

Figure 1(a) shows the BERs of DUDE and Neural DUDE with respect to varying $k$. Firstly, we see that the minimum BERs of both DUDE and Neural DUDE (4L), i.e., $0.563\delta$ with $k = 5$, get very close to the optimum BER ($0.558\delta$) obtained by the Forward-Backward (FB) recursion. Secondly, we observe that Neural DUDE quickly approaches the optimum BER as we increase the depth of the network. This shows that as the discriminative power of the model increases with the depth of the network, $\mathbf{p}(\mathbf{w}, \cdot)$ can successfully learn the denoising rule for each context $c$ with a shared parameter $\mathbf{w}$. Thirdly, we clearly see that, in contrast to the performance of DUDE being sensitive to $k$, that of Neural DUDE (4L) is robust to $k$ by sharing information across contexts. Such robustness with respect to $k$ is obviously a very desirable property in practice.

Figure 1(b) and Figure 1(c) plot the average estimated BER, $\frac{1}{n}\sum_{i=1}^{n}\mathbf{L}(Z_i, s_k(c_i, \cdot))$, against the true BER for DUDE and Neural DUDE (4L), respectively, to show the concentration phenomenon described in (9). From the figures, we can see that while the estimated BER drastically diverges from the true BER for DUDE as $k$ increases, it strongly concentrates on the true BER for Neural DUDE (4L) for all $k$. This result supports the concrete rule for selecting the best $k$ described in Algorithm 1. This rule is used for the experiments on real data in the following subsections.
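For reference, the Forward-Backward baseline of Figure 1 can be reproduced with a standard two-state HMM smoother: the clean source is a BSMC($\alpha$) and the channel a BSC($\delta$), so the symbol-by-symbol MAP estimate from the posterior $P(x_i \mid z^n)$ is optimal under the Hamming loss. The sketch below is our own rendering of that standard recursion, not the authors' code; the function name is ours.

```python
import numpy as np

def fb_denoise(z, Pi, alpha):
    """Forward-Backward baseline for the BSMC(alpha) + BSC model of Section 5.1:
    symbol-by-symbol MAP (minimizing the BER) from the posterior P(x_i | z^n)."""
    n = len(z)
    A = np.array([[1 - alpha, alpha], [alpha, 1 - alpha]])   # Markov transition matrix
    pi0 = np.array([0.5, 0.5])                               # stationary initial distribution
    fwd = np.zeros((n, 2))
    bwd = np.zeros((n, 2))
    fwd[0] = pi0 * Pi[:, z[0]]
    fwd[0] /= fwd[0].sum()
    for i in range(1, n):                                     # forward pass (normalized)
        fwd[i] = (fwd[i - 1] @ A) * Pi[:, z[i]]
        fwd[i] /= fwd[i].sum()
    bwd[-1] = 1.0
    for i in range(n - 2, -1, -1):                            # backward pass (normalized)
        bwd[i] = A @ (Pi[:, z[i + 1]] * bwd[i + 1])
        bwd[i] /= bwd[i].sum()
    post = fwd * bwd
    post /= post.sum(axis=1, keepdims=True)                   # posterior P(x_i = . | z^n)
    return np.argmax(post, axis=1)

# x_hat_fb = fb_denoise(z, Pi, alpha=0.1); print("FB BER:", np.mean(x_hat_fb != x))
```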
### 5.2 Real binary image denoising

In this section, we experiment with real, binary image data. The settings of $\Pi$ and $\Lambda$ are identical to Section 5.1, while the clean sequence was generated by converting each image to a 1-D sequence via raster scanning. We tested with 5 representative binary images with varied characteristics: Einstein, Lena, Barbara, Cameraman, and a scanned page of the Shannon paper. The Einstein and Shannon images had a resolution of 256x256 and the rest had 512x512. For Neural DUDE, we tested a 4-layer model with 40 hidden nodes with ReLU activations in each layer.

[Figure 2: Einstein image (256x256) denoising results with $\delta = 0.1$. (a) Clean image; (b) BER/$\delta$ vs. window size $k$ for DUDE and Neural DUDE (4L), together with the Neural DUDE (4L) estimated BER; horizontal lines mark $0.404\delta$ and $0.563\delta$.]

Figure 2(b) shows the result of denoising the Einstein image in Figure 2(a) for $\delta = 0.1$. We see that the BER of Neural DUDE (4L) continues to drop as we increase $k$, whereas DUDE quickly fails to denoise for larger $k$'s. Furthermore, we observe that the estimated BER of Neural DUDE (4L) again strongly correlates with the true BER. Note that when $k = 36$, there are $2^{72}$ possible different contexts, which is far more than the number of pixels, $2^{16}$ (256x256). However, we see that Neural DUDE can still learn a good denoising rule from so many different contexts by aggregating information from similar contexts.

| $\delta$ | Scheme | Einstein | Lena | Barbara | Cameraman | Shannon |
|---|---|---|---|---|---|---|
| 0.15 | DUDE | 0.578 (5) | 0.494 (6) | 0.492 (5) | 0.298 (6) | 0.498 (5) |
| | Neural DUDE | 0.384 (38) | 0.405 (38) | 0.448 (33) | 0.264 (39) | 0.410 (38) |
| | Improvement | 33.6% | 18.0% | 9.0% | 11.5% | 17.7% |
| 0.1 | DUDE | 0.563 (5) | 0.495 (6) | 0.506 (6) | 0.310 (5) | 0.475 (5) |
| | Neural DUDE | 0.404 (36) | 0.403 (38) | 0.457 (27) | 0.268 (35) | 0.402 (35) |
| | Improvement | 28.2% | 18.6% | 9.7% | 13.6% | 15.4% |

Table 1: BER results for the binary images. Each number is the BER relative to $\delta$, and Improvement stands for the relative BER improvement of Neural DUDE (4L) over DUDE. The numbers in parentheses are the $k$ values achieving the result.

Table 1 summarizes the denoising results on the five binary images for $\delta = 0.1$ and $\delta = 0.15$. We see that Neural DUDE always significantly outperforms DUDE, using a much larger context size $k$. We believe this is a significant result, since DUDE has been shown to outperform many state-of-the-art sliding window denoisers in practice, such as median filters [5, 1]. Furthermore, following DUDE's extension to grayscale image denoising [2], this result gives strong motivation for extending Neural DUDE to grayscale image denoising.

### 5.3 Nanopore DNA sequence denoising

We now go beyond binary data and apply Neural DUDE to DNA sequence denoising. As surveyed in [9], denoising DNA sequences is becoming increasingly important as sequencing devices are getting cheaper but inject more noise than before. For our experiment, we used simulated MinION nanopore reads, which were generated as follows: we obtained 16S rDNA reference sequences for 20 species [18] and randomly generated noiseless template reads from them. The number of reads and the read length for each species were set to be identical to those of real MinION nanopore reads [18]. Then, based on the $\Pi$ of the MinION nanopore sequencer (Figure 3(a)) obtained in [19] (with a 20.375% average error rate), we induced substitution errors into the reads and obtained the corresponding noisy reads. Note that we only consider substitution errors, while there also exist insertion/deletion errors in real nanopore sequenced data. The reason is that substitution errors can be directly handled by DUDE and Neural DUDE, so we focus on quantitatively evaluating the performance on those errors. We sequentially merged 2,372 reads from the 20 species and formed a 1-D sequence of 2,469,111 base pairs. We used two Neural DUDE (4L) models, with 40 and 80 hidden nodes in each layer, denoted as (40-40-40) and (80-80-80), respectively.
[Figure 3: Nanopore DNA sequence denoising results. (a) $\Pi$ for the MinION nanopore sequencer; (b) relative error rate vs. window size $k$ for DUDE, Neural DUDE (40-40-40), and Neural DUDE (80-80-80).]

Figure 3(b) shows the denoising results. We observe that Neural DUDE with large $k$'s (around $k = 100$) can achieve less than half the error rate of DUDE. Furthermore, as the complexity of the model increases, the performance of Neural DUDE gets significantly better. We could not find a comparable baseline scheme, since most nanopore error correction tools, e.g., Nanocorr [20], do not produce read-by-read corrected sequences but return downstream analysis results after denoising. Coral [21], which gives read-by-read denoising results for Illumina data, completely failed on the nanopore data. Given that DUDE outperforms state-of-the-art schemes, including Coral, on Illumina sequenced data, as shown in [3], we expect the improvement of Neural DUDE over DUDE to translate into fruitful downstream analysis gains for nanopore data.

## 6 Concluding remark and future work

We showed that Neural DUDE significantly improves upon DUDE and has a systematic mechanism for choosing the best $k$. There are several future research directions. First, we plan to do thorough experiments on DNA sequence denoising and quantify the impact of Neural DUDE on the downstream analyses. Second, we plan to give theoretical analyses of the concentration (9) and justify the derived $k$ selection rule. Third, extending the framework to deal with continuous-valued signals and finding connections with the SURE principle [22] would be fruitful. Finally, applying recurrent neural networks (RNN) in place of DNNs could be another promising direction.

## Acknowledgments

T. Moon was supported by the DGIST Faculty Start-up Fund (2016010060) and the Basic Science Research Program through the National Research Foundation of Korea (2016R1C1B2012170), both funded by the Ministry of Science, ICT and Future Planning. S. Min, B. Lee, and S. Yoon were supported in part by the Brain Korea 21 Plus Project (SNU ECE) in 2016.

## References

[1] E. Ordentlich, G. Seroussi, S. Verdú, M. J. Weinberger, and T. Weissman. A universal discrete image denoiser and its application to binary images. In IEEE ICIP, 2003.
[2] G. Motta, E. Ordentlich, I. Ramirez, G. Seroussi, and M. J. Weinberger. The iDUDE framework for grayscale image denoising. IEEE Trans. Image Processing, 20:1-21, 2011.
[3] B. Lee, T. Moon, S. Yoon, and T. Weissman. DUDE-Seq: Fast, flexible, and robust denoising of nucleotide sequences. arXiv:1511.04836, 2016.
[4] E. Ordentlich, G. Seroussi, S. Verdú, and K. Viswanathan. Universal algorithms for channel decoding of uncompressed sources. IEEE Trans. Inform. Theory, 54(5):2243-2262, 2008.
[5] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdú, and M. J. Weinberger. Universal discrete denoising: Known channel. IEEE Trans. Inform. Theory, 51(1):5-28, 2005.
[6] G. Hinton, Y. LeCun, and Y. Bengio. Deep learning. Nature, 521:436-444, 2015.
[7] H. Burger, C. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In CVPR, 2012.
[8] J. Xie, L. Xu, and E. Chen. Image denoising and inpainting with deep neural networks. In NIPS, 2012.
[9] D. Laehnemann, A. Borkhardt, and A. C. McHardy. Denoising DNA deep sequencing data - high-throughput sequencing errors and their corrections. Brief Bioinform, 17(1):154-179, 2016.
[10] V. Jain and H. S. Seung. Natural image denoising with convolutional networks. In NIPS, 2008.
[11] T. Moon and T. Weissman. Discrete denoising with shifts. IEEE Trans. Inform. Theory, 2009.
[12] T. Weissman, E. Ordentlich, M. Weinberger, A. Somekh-Baruch, and N. Merhav. Universal filtering via prediction. IEEE Trans. Inform. Theory, 53(4):1253-1264, 2007.
[13] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. JMLR, 3:1137-1155, 2003.
[14] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27:372-376, 1983.
[15] T. Tieleman and G. Hinton. RMSProp: Divide the gradient by a running average of its recent magnitude. Lecture Note 6-5, University of Toronto, 2012.
[16] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[17] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio. Theano: new features and speed improvements. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2012.
[18] A. Benitez-Paez, K. Portune, and Y. Sanz. Species-level resolution of 16S rRNA gene amplicons sequenced through the MinION portable nanopore sequencer. bioRxiv:021758, 2015.
[19] M. Jain, I. Fiddes, K. Miga, H. Olsen, B. Paten, and M. Akeson. Improved data analysis for the MinION nanopore sequencer. Nature Methods, 12:351-356, 2015.
[20] S. Goodwin, J. Gurtowski, S. Ethe-Sayers, P. Deshpande, M. Schatz, and W. R. McCombie. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res., 2015.
[21] L. Salmela and J. Schroder. Correcting errors in short reads by multiple alignments. Bioinformatics, 27(11):1455-1461, 2011.
[22] C. Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9(6):1135-1151, 1981.