Published as a conference paper at ICLR 2018

UNSUPERVISED CIPHER CRACKING USING DISCRETE GANS

Aidan N. Gomez 1,2  aidan@cs.toronto.edu
Sicong Huang 1,2  huang@cs.toronto.edu
Ivan Zhang 2  ivan@ivanzhang.ca
Bryan M. Li 1,2  bryan@bryanli.io
Muhammad Osama 1,2  muhammad.osama@mcode.ca
Łukasz Kaiser 3  lukaszkaiser@google.com

1 Department of Computer Science, University of Toronto; 2 FOR.ai; 3 Google Brain

ABSTRACT

This work details CipherGAN, an architecture inspired by CycleGAN, used for inferring the underlying cipher mapping given banks of unpaired ciphertext and plaintext. We demonstrate that CipherGAN is capable of cracking language data enciphered using shift and Vigenère ciphers to a high degree of fidelity and for vocabularies much larger than previously achieved. We present how CycleGAN can be made compatible with discrete data and trained in a stable way. We then prove that the technique used in CipherGAN avoids the common problem of uninformative discrimination associated with GANs applied to discrete data.

1 INTRODUCTION

Humans have been encoding messages for secrecy since before the ancient Greeks, and for the same amount of time have been fascinated with trying to crack these codes using brute force, frequency analysis, crib-dragging and even espionage. Simple ciphers have, in the past century, been rendered irrelevant in favor of the more secure encryption schemes enabled by modern computational resources. However, the question of cipher cracking remains an interesting problem, since it requires an intimate understanding of the structure in a language. Nearly all automated cipher-cracking techniques have had to rely on a human in the loop, grounding the automated techniques in a human's preexisting knowledge of language to clean up the errors made by simple algorithms such as frequency analysis. Across a number of domains, the use of hand-crafted features has often been replaced by automatic feature extraction directly from data using end-to-end learning frameworks (Goodfellow et al., 2016). The question to be addressed is as follows: can a neural network be trained to deduce withheld ciphers from unaligned text, without the supplementation of preexisting human knowledge?

The implications of such a general framework would be far-reaching in the field of unsupervised translation, where each language can be treated as an enciphering of the other. The decoding of the Copiale cipher (Knight et al., 2011) stands as an excellent example of the potential for machine learning techniques to decode enciphered texts by treating the problem as language translation. The CycleGAN (Zhu et al., 2017) architecture is extremely general, and we demonstrate that our adaptation, CipherGAN, is capable of cracking ciphers to an extremely high degree of accuracy. CipherGAN (code available at github.com/for-ai/ciphergan) requires little or no modification to be applied to plaintext and ciphertext banks generated by the user's cipher of choice.

In addition to presenting a GAN that can crack ciphers, we contribute the following techniques:

- We show how to stabilize CycleGAN training: our CipherGAN achieves good performance in all training runs, compared to approximately 50% of runs for the original CycleGAN.
- We provide a theoretical description and analysis of the uninformative discrimination problem that impacts GANs applied to discrete data.
- We introduce a solution to the above problem by operating in the embedding space, and show that it works in practice.

1.1 SHIFT AND VIGENÈRE CIPHERS

The shift and Vigenère ciphers are well-known historical substitution ciphers. The earliest known record of a substitution cipher is believed to date back to 58 BCE, when Julius Caesar replaced each letter in a message with the letter three places further down the alphabet (Singh, 2000).

Plain alphabet:   a b c d e f g h i j k l m n o p q r s t u v w x y z
Shifted alphabet: D E F G H I J K L M N O P Q R S T U V W X Y Z A B C

Figure 1: An example of a right-shift-3 cipher.

Using Figure 1, the message "attackatdawn" can be encrypted to "DWWDFNDWGDZQ". This message can be easily deciphered by the intended recipient (who is aware of the particular shift number used) but looks meaningless to a third party. The shift cipher ensured secure communication between sender and receiver for centuries, until the ninth-century polymath Al-Kindi introduced the concept of frequency analysis (Singh, 2000). He suggested that it would be possible to crack a cipher simply by analyzing the frequencies of individual characters. For instance, in English the most frequently occurring letters are "e" (12.7%), "t" (9.1%) and "a" (8.2%), whereas "q", "x" and "z" each have a frequency of less than 1%. Moreover, the code-breaker can also focus on bigrams of repeated letters; "ss", "ee", and "oo" are the most common in English. This structure in language gives the code-breaker an efficient exploit.

Polyalphabetic substitution ciphers, including the Vigenère cipher, were introduced to inhibit the use of n-gram frequency analysis in determining the cipher mapping. Instead, the encrypter further scrambles the message by using a separate shift cipher for each element of a key that is tiled to match the length of the plaintext. Increasing the key length greatly increases the number of possible combinations and thus protects against basic frequency analysis. In the mid-nineteenth century, Charles Babbage recognized that the length of the key could be determined by counting the repetitions and spacing of sequences of letters in the ciphertext (Singh, 2000). Using the determined key length, frequency analysis can then be applied at each index of the key. This method makes it possible to break the Vigenère cipher, but it is very time consuming and requires strong knowledge of the language itself.
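For concreteness, the two ciphers above can be sketched in a few lines of Python. This is an illustrative sketch only: the alphabet handling and function names are ours and are not the preprocessing used in our experiments, and the key "345" is interpreted as a sequence of per-position shifts.

```python
# Minimal sketch of the shift and Vigenere ciphers described above.
# The alphabet, key and function names are illustrative placeholders,
# not the data pipeline used in the CipherGAN experiments.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def shift_encipher(plaintext: str, shift: int = 3) -> str:
    """Replace each letter with the letter `shift` places further down the alphabet."""
    out = []
    for ch in plaintext:
        idx = ALPHABET.index(ch)
        out.append(ALPHABET[(idx + shift) % len(ALPHABET)].upper())
    return "".join(out)


def vigenere_encipher(plaintext: str, key: str = "345") -> str:
    """Tile the key over the plaintext and apply a separate shift at each position."""
    out = []
    for i, ch in enumerate(plaintext):
        shift = int(key[i % len(key)])          # per-position shift from the tiled key
        idx = ALPHABET.index(ch)
        out.append(ALPHABET[(idx + shift) % len(ALPHABET)].upper())
    return "".join(out)


if __name__ == "__main__":
    print(shift_encipher("attackatdawn"))       # DWWDFNDWGDZQ
    print(vigenere_encipher("attackatdawn"))    # DXYDGPDXIDAS
```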
There is a rich literature of automated shift-cipher cracking techniques (Ramesh et al., 1993; Forsyth & Safavi-Naini, 1993; Hasinoff, 2003; Knight et al., 2006; Verma et al., 2007; Raju et al., 2010; Knight et al., 2011), many of which achieve excellent results, as one would expect from hand-crafted algorithms targeting specific ciphers and vocabularies. Work on automated cracking of polyalphabetic ciphers (Carroll & Martin, 1986; Toemeh & Arumugam, 2008; Omran et al., 2011) has seen similar success on small vocabularies. It is difficult to compare the results of previous work with our own, as their focus ranges from inferring cipher keys (Carroll & Martin, 1986; Ramesh et al., 1993; Omran et al., 2011), to inferring the mappings given limited quantities of ciphertext (determining unicity distance) (Carroll & Martin, 1986; Ramesh et al., 1993; Hasinoff, 2003; Verma et al., 2007), to analyzing the unicity distance required to solve small percentages of the cipher mappings (e.g. 20% in Carroll & Martin (1986)).

In comparison to these past works, we afford ourselves the advantage of an unconstrained corpus of ciphertext; however, we prescribe ourselves the following constraints: our model is not provided any prior knowledge of vocabulary element frequencies, and no information about the cipher key is provided. Another complexity our work must overcome is our significantly larger vocabulary sizes: all previous work has addressed vocabularies of approximately 26 characters, while our model is capable of solving word-level ciphers with over 200 distinct vocabulary elements. As such, our methodology is notably hands-off in comparison to previous work, and can be easily applied to different forms of cipher, different underlying data, and unsupervised text alignment tasks.

1.2 GANS AND WASSERSTEIN GANS

Generative Adversarial Networks (GANs) are a class of neural network architectures introduced by Goodfellow et al. (2014) as an alternative to optimizing likelihood under a true data distribution. Instead, GANs balance the optimization of a generator network, which attempts to produce convincing samples from the data distribution, and a discriminator, which is trained to distinguish between samples from the true data distribution and the generator's synthetic samples. GANs have been shown to produce compelling results in the domain of image generation, but comparatively weak performance in domains using discrete data (discussed in Section 2). The original GAN discriminator objective as introduced in Goodfellow et al. (2014) is:

$$D^\ast = \arg\max_{D}\; \mathbb{E}_{x \sim X}[\log D(x)] + \mathbb{E}_{z \sim Z}[\log(1 - D(F(z)))] \tag{1}$$

where F is the generator network and D is the discriminator network. This loss is vulnerable to mode collapse, where the generator collapses to a generating distribution with low diversity. In order to more broadly distribute the mass, the Wasserstein GAN (WGAN) objective (Arjovsky et al., 2017) considers the set of K-Lipschitz discriminator functions $D : X \to \mathbb{R}$ and minimizes the earth mover's (first Wasserstein) distance. The Lipschitz condition is enforced by clipping the discriminator's weights to fall within a predefined range.

$$D^\ast = \arg\max_{D \in \mathcal{L}_K}\; \mathbb{E}_{x \sim X}[D(x)] - \mathbb{E}_{z \sim Z}[D(F(z))] \tag{2}$$

An improved WGAN objective, introduced by Gulrajani et al. (2017), enforced the Lipschitz condition using a Jacobian regularization term instead of the originally proposed weight-clipping solution. This resulted in more stable training, avoiding capacity under-use and exploding gradients, and improved network performance over weight-clipping.

$$D^\ast = \arg\max_{D}\; \mathbb{E}_{x \sim X}[D(x)] - \mathbb{E}_{z \sim Z}[D(F(z))] + \alpha\, \mathbb{E}_{\hat{x} \sim \hat{X}}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\big] \tag{3}$$

Here $\hat{X}$ are samples taken along a line between the true data distribution $X$ and the generator's data distribution $X_g = \{F(z)\,|\,z \in Z\}$.
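As a concrete reference for the gradient penalty term in Equation 3, the following is a minimal sketch written in PyTorch purely for illustration; it is not our implementation, and `critic` stands in for any discriminator mapping a batch of samples to scalars.

```python
# Minimal sketch of the WGAN gradient penalty of Equation 3 (illustrative only,
# written in PyTorch; not the implementation used in this paper).
import torch


def gradient_penalty(critic, real, fake, alpha=10.0):
    """alpha * E_xhat[(||grad_xhat D(xhat)||_2 - 1)^2], with xhat on lines between real and fake."""
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    # Sample xhat along the line between a real sample and a (detached) generated sample.
    xhat = (eps * real + (1.0 - eps) * fake.detach()).requires_grad_(True)
    d_xhat = critic(xhat)
    grads, = torch.autograd.grad(outputs=d_xhat.sum(), inputs=xhat, create_graph=True)
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return alpha * ((grad_norm - 1.0) ** 2).mean()


# Usage inside a critic update corresponding to Equation 3 (maximized over D):
# d_loss = -(critic(real).mean() - critic(fake).mean()) + gradient_penalty(critic, real, fake)
```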
1.3 CYCLEGAN

CycleGAN (Zhu et al., 2017) is a generative adversarial network designed to learn a mapping between two data distributions without supervision. Three separate works (Zhu et al., 2017; Yi et al., 2017; Liu et al., 2017) share many of the core features we describe below; however, for simplicity we will refer to CycleGAN as the basis for our work, as it is the most similar to our model. It acts on distributions X and Y by using two mapping generators, $F : X \to Y$ and $G : Y \to X$, and two discriminators, $D_X : X \to [0, 1]$ and $D_Y : Y \to [0, 1]$.

CycleGAN optimizes the standard GAN loss $\mathcal{L}_{\text{GAN}}$:

$$\mathcal{L}_{\text{GAN}}(F, D_Y, X, Y) = \mathbb{E}_{y \sim Y}[\log D_Y(y)] + \mathbb{E}_{x \sim X}[\log(1 - D_Y(F(x)))] \tag{4}$$

while also considering a reconstruction loss, or cycle loss $\mathcal{L}_{\text{cyc}}$:

$$\mathcal{L}_{\text{cyc}}(F, G, X, Y) = \mathbb{E}_{x \sim X}[\|G(F(x)) - x\|_1] + \mathbb{E}_{y \sim Y}[\|F(G(y)) - y\|_1] \tag{5}$$

Taken together, the losses are balanced using a hyperparameter λ:

$$\mathcal{L}(F, G, D_X, D_Y, X, Y) = \mathcal{L}_{\text{GAN}}(F, D_Y, X, Y) + \mathcal{L}_{\text{GAN}}(G, D_X, Y, X) + \lambda\, \mathcal{L}_{\text{cyc}}(F, G, X, Y)$$

This leads to the training objectives:

$$F^\ast = \arg\min_{F}\; \mathcal{L}_{\text{cyc}}(F, G, X, Y) + \mathcal{L}_{\text{GAN}}(F, D_Y, X, Y)$$
$$G^\ast = \arg\min_{G}\; \mathcal{L}_{\text{cyc}}(F, G, X, Y) + \mathcal{L}_{\text{GAN}}(G, D_X, Y, X)$$
$$D_X^\ast = \arg\max_{D_X}\; \mathcal{L}_{\text{GAN}}(G, D_X, Y, X)$$
$$D_Y^\ast = \arg\max_{D_Y}\; \mathcal{L}_{\text{GAN}}(F, D_Y, X, Y)$$

CycleGAN uses $\mathcal{L}_{\text{cyc}}$ to avoid mode collapse by preserving reconstruction of mapping inputs from outputs. It has demonstrated excellent results in unpaired image translation between two visually similar categories. Our architecture is the first example of this unsupervised learning framework being successfully applied to discrete data such as language.

Figure 2: Discriminators trained on the toy example of recognizing the bottom-right corner of a simplex as true data. From left to right, the discriminators were regularized using: nothing; WGAN Jacobian norm regularization; and the relaxed sampling technique.

2 DISCRETE GANS

Applying GANs to discrete data generation is still an open research problem that has seen great interest and development. The primary difficulty with training discrete data generators in a GAN setting is the lack of a gradient through a discrete node in the computation graph. The alternatives to producing discrete outputs (for instance, generators producing a categorical distribution over discrete elements) are prone to uninformative discrimination (described below), in that the discriminator may use an optimal discrimination criterion that is unrelated to the correctness of the re-discretized generated data.

In our example of a continuous distribution over discrete elements, the produced samples all lie within the standard simplex $\Delta^k$, with dimension k equal to the number of elements in the distribution. In this case, samples from the true data distribution always lie on a vertex $v_i$ of the simplex, while any sub-optimal generator will produce samples within $\Delta^k \setminus \{v_1, \ldots, v_k\}$. In this example, a discriminator which performs uninformative discrimination might evaluate a sample's membership in the set of simplex vertices as an optimal discrimination criterion, which is entirely uninformative of the correctness of re-discretized samples from the generator.

A number of solutions to training generators with discrete outputs have been proposed: SeqGAN (Yu et al., 2017) uses the REINFORCE gradient estimate to train the generator; boundary-seeking GANs (Hjelm et al., 2017) and maximum-likelihood augmented GANs (Che et al., 2017) proposed a gradient approximation with low bias and variance that resembles the REINFORCE (Williams, 1992) estimator; Gumbel-softmax GANs (Kusner & Hernández-Lobato, 2016) replace discrete variables in the simplex with continuous relaxations called Concrete (Maddison et al., 2016) or Gumbel-softmax (Jang et al., 2016) variables; and WGANs (Arjovsky et al., 2017) were suggested as a remedy to the uninformative discrimination problem by ensuring that the discriminator's rate of change with respect to its input is bounded by some constant.
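For reference, the Gumbel-softmax (Concrete) relaxation mentioned above can be sketched as follows. This is a minimal NumPy illustration of replacing a one-hot vertex with a nearby interior point of the simplex; it is not the relaxation CipherGAN ultimately relies on, which operates in embedding space (Section 3.1).

```python
# Minimal sketch of Gumbel-softmax (Concrete) sampling: a one-hot categorical
# sample on a simplex vertex is replaced by a nearby point inside the simplex.
# Illustrative only; CipherGAN instead relies on embedding vectors acting as
# continuous relaxations (Section 3.1).
import numpy as np


def gumbel_softmax_sample(logits, temperature=0.5, rng=np.random.default_rng()):
    """Draw a relaxed categorical sample lying in the interior of the simplex."""
    gumbel_noise = -np.log(-np.log(rng.uniform(size=logits.shape)))
    scores = (logits + gumbel_noise) / temperature
    scores -= scores.max()                       # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()


logits = np.log(np.array([0.7, 0.2, 0.1]))       # a categorical over 3 tokens
relaxed = gumbel_softmax_sample(logits)          # e.g. [0.93, 0.05, 0.02]
hard = np.eye(3)[relaxed.argmax()]               # rediscretized one-hot vertex
```

As the temperature tends to zero the relaxed samples approach the simplex vertices; at higher temperatures they spread into the interior, which is what gives the discriminator a non-trivial gradient to pass back.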
Our work utilizes both the Wasserstein GAN (Gulrajani et al., 2017; Arjovsky et al., 2017) and relaxation of discrete random variables (such as Concrete/Gumbel-softmax). It has been noted multiple times, in implementations of CycleGAN as well as in the original paper itself, that the architecture is sensitive to initialization and requires repeated attempts in order to converge to a satisfactory mapping (Bansal & Rathore, 2017; Sari, 2017). Our architecture suffered the same instability before the WGAN Jacobian norm regularization term was added to the discriminator's loss. In addition, we found that having the discriminator operate over embedding space, instead of directly over the softmax vectors produced by our generator, improved performance.

Our hypothesis, which is justified by Proposition 1 below, is that the embedding vectors may act as continuous relaxations of discrete random variables as small, noisy updates are applied throughout training; Proposition 1 asserts that by replacing discrete random variables with continuous ones, our discriminator is prevented from arbitrarily approximating a Dirac delta distribution.

Figure 2 shows simple discriminators trained on the toy task of identifying a single vertex of a simplex as true data. It is clear that a lack of regularization leads to the discriminator collapsing onto the vertex of the simplex, leaving approximately zero gradient everywhere. The Jacobian regularization of Wasserstein GANs leads to the space covered by lines from the true-data vertex to the generated data having a lower rate of change (note that the gradient is still close to zero in the remaining area of the simplex). Finally, replacing the discrete random variables of the true data with continuous samples about the vertex results in a much more gradual transition, which is desirable since it provides a stronger gradient signal from which to learn.

Unique to CycleGAN is the auxiliary cycle loss described in Section 1.3. The effect of this additional objective is that generated samples are regularly forced away from the discriminator's minimum in favor of a mapping that better satisfies reconstruction. For instance, it may be the case that the discriminator favors a particular cipher mapping that is not bijective; in this case, the model will receive a strong signal from the cycle loss away from the discriminator's minimum. In these cases, where the model moves against the gradient it receives from the discriminator, the region it moves into may have near-zero curvature (as is visually discernible from Figure 2), because the WGAN curvature regularization (see Equation 3) has not been applied in this region. Curvature here refers to the curvature of the discriminator's output with respect to its input; this curvature determines the strength of the training signal received by the generator. Low curvature means little information for the generator to improve itself with. This motivates the benefits of having strong curvature globally, as opposed to only linearly between the generator's samples and the true data. Kodali et al. (2017) propose regularizing in all directions about the generated samples, which would likely remedy the vanishing gradient in our case as well; for our experiments, the relaxed sampling technique proved effective.
It should also be noted that Luc et al. (2016) propose something similar to relaxed sampling, whereby they replace the ground-truth discrete tokens with a distribution over the vocabulary that distributes some of the mass across the remaining incorrect tokens.

Let us now introduce the definitions needed for the formal presentation of Proposition 1.

Definitions.

(Continuous relaxation of a discrete set). A continuous relaxation of a discrete set $X$ is a proper, path-connected metric space $\tilde{X}$ satisfying $X \subset \tilde{X}$.

(Rediscretization function). A rediscretization function is a surjective function $R : \tilde{X} \to X$ from a continuous relaxation $\tilde{X}$ of a discrete space $X$ satisfying $\forall x \in X,\ \exists \epsilon > 0$ s.t. $R \equiv x$ on $B_\epsilon[x]$. Note that $R$ defines an equivalence relation in $\tilde{X}$.

(Uninformative discrimination). A discriminator $D_{\tilde{X}}$ is said to perform uninformative discrimination under a rediscretization function $R$ if: $\forall x \in X,\ \exists \tilde{x} \in \tilde{X}$ s.t. $(R(\tilde{x}) = x) \wedge (D_{\tilde{X}}(x) \neq D_{\tilde{X}}(\tilde{x}))$.

(Continuous relaxation of a function). A continuous relaxation of a function over discrete sets $F : X \to Y$ is another function $\tilde{F} : \tilde{X} \to \tilde{Y}$ (where $\tilde{X}, \tilde{Y}$ are continuous relaxations of $X, Y$) such that $\tilde{F}$ is continuous and $\tilde{F}(x) = F(x),\ \forall x \in X$.

The following proposition (proved in the Appendix) forms the theoretical basis of the technique.

Proposition 1 (Reliable Fooling Via Relaxation). Given:
- discrete spaces $X, Y$ and continuous relaxations $\tilde{X}, \tilde{Y}$
- generators $\tilde{F} : \tilde{X} \to \tilde{Y}$, $\tilde{G} : \tilde{Y} \to \tilde{X}$, bijections satisfying $\tilde{F} = \tilde{G}^{-1}$
- discrete discriminators $D_X, D_Y$, both optimal for the fixed $F, G$
- rediscretization functions $R_X, R_Y$

Suppose: $\tilde{F}$ is approximately volume preserving in a small region about each $x \in X$. Consequently, the same is true for $\tilde{G}$ about each $y \in Y$.

If: during training, we replace discrete random variables from $X$, which lie in the continuous metric space $\tilde{X}$, with samples from regions about them.

Then: the optimal relaxed discriminators $D_{\tilde{X}}$ and $D_{\tilde{Y}}$ have a non-empty region about each $x \in X$ and $y \in Y$ where they are expected to assign values close to $D_X(x)$ and $D_Y(y)$.

Figure 3 compares a model trained with embedding vectors versus one with only the softmax outputs. It becomes clear on a harder task, such as Vigenère, that the embeddings vastly outperform softmax in terms of speed of convergence and final accuracy; however, we found that the simpler task of a shift cipher showed little difference between embeddings and softmax, suggesting that an increase in task complexity increases the benefits provided by the stronger gradient signal of embeddings.

3.1 CIPHERGAN

GANs applied to text data have yet to produce truly convincing results (Kawthekar et al.). Previous attempts at discrete sequence generation with GANs have generally utilized a generator outputting a probability distribution over the token space (Gulrajani et al., 2017; Yu et al., 2017; Hjelm et al., 2017). This leads to the discriminator receiving a sequence of discrete random variables from the data distribution, and a sequence of continuous random variables from the generator distribution, making the task of discrimination trivial and uninformative of the underlying data distribution. In order to avoid such a scenario, we perform all discrimination within the embedding space by allowing the generator's output distribution to define a convex combination of the corresponding embeddings.
This leads to the following losses:

$$\mathcal{L}_{\text{GAN}}(F, D_Y, X, Y) = \mathbb{E}_{y \sim Y}[\log D_Y(y \cdot W_{\text{Emb}})] + \mathbb{E}_{x \sim X}[\log(1 - D_Y(F(x \cdot W_{\text{Emb}}) \cdot W_{\text{Emb}}))]$$

$$\mathcal{L}_{\text{cyc}}(F, G, X, Y) = \mathbb{E}_{x \sim X}[\|G(F(x \cdot W_{\text{Emb}}) \cdot W_{\text{Emb}}) - x\|_1] + \mathbb{E}_{y \sim Y}[\|F(G(y \cdot W_{\text{Emb}}) \cdot W_{\text{Emb}}) - y\|_1]$$

We perform an inner product between the embeddings $W_{\text{Emb}}$ and the one-hot vectors in x, as well as between the embeddings and the softmax vectors produced by the generators F and G. The former is equivalent to a lookup operation over the table of embedding vectors, while the latter is a convex combination of all vectors in the vocabulary. The embeddings $W_{\text{Emb}}$ are trained at each step to minimize $\mathcal{L}_{\text{cyc}}$ and maximize $\mathcal{L}_{\text{GAN}}$, meaning the embeddings are both easy to map from and easy to discriminate.

As was discussed in Section 2, training with the above loss functions was unstable, with approximately three of every four experiments failing to produce compelling results. This is a problem we observed with the original CycleGAN horse-zebra experiment, and one that has been noted by multiple re-implementations online (Bansal & Rathore, 2017; Sari, 2017). We were able to significantly increase stability by augmenting the discriminator loss with the Lipschitz conditioning term from the improved Wasserstein GAN (Gulrajani et al., 2017) (see Equation 3 and Fedus et al. (2017)), resulting in the following loss (DualGAN (Yi et al., 2017) opted to use weight-clipping to enforce the Lipschitz condition):

$$\mathcal{L}_{\text{GAN}}(F, D_Y, X, Y) = \mathbb{E}_{y \sim Y}[D_Y(y \cdot W_{\text{Emb}})] - \mathbb{E}_{x \sim X}[D_Y(F(x \cdot W_{\text{Emb}}) \cdot W_{\text{Emb}})] + \alpha\, \mathbb{E}_{\hat{y} \sim \hat{Y}}\big[(\|\nabla_{\hat{y}} D_Y(\hat{y})\|_2 - 1)^2\big]$$

As a consequence of Proposition 1, discriminators trained on non-stationary embeddings will be unable to approximate Dirac delta distributions to arbitrary accuracy, implying there are dedicated safe zones about members of X where the generator can reliably fool the discriminator and uninformative discrimination is prevented. In our experiments, we jointly train the embedding vectors as parameters of the model. The gradient updates applied to these vectors introduce noise between training iterations; we observed that embedding vectors tend to remain in a bounded region after the initial steps of training. We found that simply replacing the data with embedding vectors had a similar effect to performing the random sampling described in Proposition 1 (see Figure 3).

| Work | Ciphertext Length | Accuracy |
|---|---|---|
| Hasinoff (2003) | 500 | 97% |
| Forsyth & Safavi-Naini (1993) | 5000 | 100% |
| Ramesh et al. (1993) | 160 | 78.5% |
| Verma et al. (2007) | 1000 | 87% |

Table 1: Previous results on automated shift cipher cracking with limited ciphertext length.
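To make the embedding-space inner products of Section 3.1 concrete, the following is a minimal NumPy sketch; names, shapes and the random data are illustrative only.

```python
# Minimal sketch of discrimination in embedding space (Section 3.1), assuming
# vocabulary size K and embedding dimension E; names and shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N, K, E = 8, 200, 256                     # sequence length, vocab size, embedding dim
W_emb = rng.normal(size=(K, E))           # jointly trained embedding table

# True data: a sequence of one-hot vectors; x @ W_emb is an embedding lookup.
tokens = rng.integers(0, K, size=N)
x_onehot = np.eye(K)[tokens]              # (N, K), vertices of the simplex
x_embedded = x_onehot @ W_emb             # (N, E), rows of W_emb

# Generator output: a softmax distribution per position; softmax @ W_emb is a
# convex combination of embedding vectors rather than a single row.
logits = rng.normal(size=(N, K))
softmax = np.exp(logits - logits.max(axis=-1, keepdims=True))
softmax /= softmax.sum(axis=-1, keepdims=True)
g_embedded = softmax @ W_emb              # (N, E), interior of the embeddings' convex hull

# Both sequences now live in the same continuous space, so the discriminator
# cannot separate them merely by testing membership in a discrete set of vertices.
```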
4 EXPERIMENTS

Our experiments use plaintext natural language samples from the Brown English text dataset (Francis & Kucera, 1979). We generate 2 × batch-size plaintext samples; the first half is fed as the CycleGAN's X distribution, and the second half is passed through the cipher of choice and fed as the Y distribution. For our natural language plaintext data we used the Brown English-language corpus, which consists of over one million words in 57,340 sentences. We experiment with both word-level (Brown-W) and character-level (Brown-C) vocabularies. For word-level vocabularies, we control the size of the vocabulary by taking the top k most frequent words and introducing an unknown token, which we use to replace all words that are not within the taken vocabulary. We demonstrate our method's ability to scale to large vocabularies using the word-level vocabularies; more modern enciphering techniques rely on large substitution boxes (S-boxes) with many (often hundreds of) elements.

4.2 TRAINING

As in Zhu et al. (2017), we replace the log-likelihood loss with a squared difference loss, which was originally introduced by Mao et al. (2016). The original motivation for this replacement was improved stability in training and avoidance of the vanishing gradients problem. In this work we found the effect on training stability substantial.

$$\mathcal{L}_{\text{GAN}}(F, D_Y, X, Y) = \mathbb{E}_{y \sim Y}[(D_Y(y \cdot W_{\text{Emb}}))^2] + \mathbb{E}_{x \sim X}[(1 - D_Y(F(x \cdot W_{\text{Emb}}) \cdot W_{\text{Emb}}))^2] + \alpha\, \mathbb{E}_{\hat{y} \sim \hat{Y}}\big[(1 - \|\nabla_{\hat{y}} D_Y(\hat{y})\|_2)^2\big]$$

Hence, our total loss is:

$$\mathcal{L}_{\text{Total}}(F, G, D_X, D_Y, X, Y) = \mathcal{L}_{\text{cyc}}(F, G, X, Y) + \mathcal{L}_{\text{GAN}}(F, D_Y, X, Y) + \mathcal{L}_{\text{GAN}}(G, D_X, X, Y)$$

We adapted the convolutional architecture for the generator and discriminator directly from Zhu et al. (2017). We simply replace all two-dimensional convolutions with the one-dimensional variant and reduce the filter sizes in our generators to 1 (pointwise convolutions). Convolutional neural networks have recently been shown to be highly effective on language tasks and can speed up training significantly (Zhang & LeCun, 2015; Kalchbrenner et al., 2016; Yu et al., 2017). Both our generators and discriminators receive a sequence of vectors in embedding space; our generators produce a softmax distribution over the vocabulary, while our discriminator produces a scalar output.

For all our experiments we use a cycle loss with regularization coefficient λ = 1. In order to be compatible with the WGAN, we replace batch normalization (Ioffe & Szegedy, 2015) with layer normalization (Ba et al., 2016). We train using the Adam optimizer (Kingma & Ba, 2014) with batch size 64 and learning rate 2e−4, β1 = 0 and β2 = 0.9. Our learning rate is exponentially warmed up to 2e−4 over 2500 steps, and held constant thereafter. We use learned embedding vectors with 256 dimensions. The WGAN Lipschitz conditioning parameter was set to α = 10, as was prescribed in Gulrajani et al. (2017).

Figure 3: Left: comparison of different timing techniques for Brown-C Vigenère. Right: comparison of embedding vs. raw softmax on Brown-W with vocab size of 200.

| Data | Brown-W | Brown-W | Brown-C | Freq. Analysis (With Key) | Freq. Analysis (With Key) |
|---|---|---|---|---|---|
| Vocab size | 10 | 200 | 58 | 58 | 200 |
| Shift/Permutation cipher acc. | 100% | 98.7% | 99.8% | 80.9% | 44.5% |
| Vigenère cipher (key "345") acc. | 99.7% | 75.7% | 99.0% | 9.6% (78.1%) | <0.1% (44.3%) |

Table 2: Average proportion of characters correctly mapped in a given sequence. The Freq. Analysis columns show simple frequency analysis applied to the same corpus our model observes. For Vigenère we also show the score if the key were known (note: the key is left unknown to our model).

For the Vigenère cipher, positional information is critical to the network being able to perform the mapping. In order to facilitate this, we experimented with adding the timing signal described in Vaswani et al. (2017) ("Transformer Timing" in Figure 3) and found that performance increased relative to no explicit timing signal. We found that the best option was concatenating a learned positional embedding vector specific to each position onto the sequence ("Concat Timing" in Figure 3); this dramatically improved performance, however it means that the architecture cannot generalize to sequences longer than those in the training set.
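The two timing variants compared in Figure 3 can be sketched roughly as follows, using the notation of Appendix A (sequence length N, feature width K, concat width T); the constants and names here are illustrative rather than the exact implementation.

```python
# Rough sketch of the two timing signals compared in Figure 3, following the
# notation of Appendix A (sequence length N, feature width K, concat width T).
# Names, shapes and constants are illustrative, not the exact implementation.
import numpy as np

N, K, T = 64, 256, 100
x = np.zeros((N, K))                                  # a sequence of feature vectors

# "Transformer Timing": a fixed sinusoidal signal added to the input,
# signal[n, 2k] = sin(n / 1e5^(2k/K)), signal[n, 2k+1] = cos(n / 1e5^(2k/K)).
positions = np.arange(N)[:, None]                     # (N, 1)
rates = 1e5 ** (np.arange(0, K, 2) / K)               # (K/2,)
signal = np.zeros((N, K))
signal[:, 0::2] = np.sin(positions / rates)
signal[:, 1::2] = np.cos(positions / rates)
transformer_timed = x + signal                        # defined for any position index

# "Concat Timing": a learned per-position embedding W_time in R^{N x T},
# concatenated along the feature axis.
W_time = np.random.default_rng(0).normal(size=(N, T))
concat_timed = np.concatenate([x, W_time], axis=1)    # (N, K + T)
```

Note that the concatenation variant fixes the maximum usable sequence length through the shape of W_time.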
A potential solution to the issue of generalizing to longer sequences would be making a soft choice at each position over which positional embedding vector to concatenate, using a softmax distribution over a set of embedding vectors larger than the expected key length; however, we leave this to future work.

4.3 DISCUSSION

Table 2 shows that CipherGAN was able to solve shift ciphers to near-flawless accuracy, with all three vocabulary sizes being easily decoded by the model. CipherGAN performs extremely well on Vigenère, achieving excellent results on the character-level cipher and strong results on the challenging word-level cipher with a vocabulary size of 200. The vocabulary size of 58 for our character-level data, containing punctuation and special characters, is more than double what has been previously explored. In comparison to the original CycleGAN architecture, we found CipherGAN to be extremely consistent in training and notably insensitive to the random initialization of weights; we attribute this stability to the Jacobian norm regularization term.

For both ciphers, the first mappings to be correctly determined were those of the most frequently occurring vocabulary elements, suggesting that the network does indeed perform some form of frequency analysis to distinguish outlier frequencies in the two banks of text. Another interesting observation concerns the mistakes made by the network: the network would frequently confuse punctuation marks with one another, perhaps suggesting that these vocabulary elements' skip-gram signatures were similar enough to lead to the repeated confusion observed across many training runs.

5 CONCLUSION

CipherGAN is a compelling demonstration of the potential generative adversarial networks hold to act on discrete data to solve difficult tasks that rely on an extremely sensitive and nuanced discrimination criterion. Our work serves to redouble the promise of the CycleGAN architecture for unsupervised alignment tasks across multiple classes of data. CipherGAN presents an algorithm that is both stable and consistent in training, improving upon past implementations of the CycleGAN architecture. Our work theoretically motivates and empirically confirms the use of continuous relaxations of discrete variables, not only to facilitate the flow of gradients through discrete nodes, but also to prevent the oft-observed phenomenon of uninformative discrimination. CipherGAN is highly general in its structure and can be directly applied to a variety of unsupervised text alignment tasks without excess burden of adaptation. On the one hand, CipherGAN is an early step towards the goal of unsupervised translation between languages, and has shown excellent performance on the simplified task of cipher map inference. On the other hand, the methods we introduce can be used more broadly in the field of text generation with adversarial networks.

ACKNOWLEDGMENTS

Our thanks go to Roger Grosse and Kelvin Shuangjian Zhang for their advice and support throughout. We also thank Otavio Good and Ian Goodfellow for meaningful early discussions and direction, as well as Michal Wiszniewski for his assistance in developing the code upon which the experiments were run. This work was made possible thanks to the AI Grant, which provided generous support throughout.

REFERENCES

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Hardik Bansal and Archit Rathore. Understanding and implementing CycleGAN in TensorFlow. https://hardikbansal.github.io/CycleGANBlog/, 2017.

John M. Carroll and Steve Martin. The automated cryptanalysis of substitution ciphers. Cryptologia, 10(4):193-209, 1986.

Tong Che, Yanran Li, Ruixiang Zhang, R. Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983, 2017.

William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M. Dai, Shakir Mohamed, and Ian Goodfellow. Many paths to equilibrium: GANs do not need to decrease a divergence at every step, 2017.

William S. Forsyth and Reihaneh Safavi-Naini. Automated cryptanalysis of substitution ciphers. Cryptologia, 17(4):407-418, 1993.

W. Nelson Francis and Henry Kucera. Brown corpus manual. Brown University, 2, 1979.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.

Sam Hasinoff. Solving substitution ciphers. Department of Computer Science, University of Toronto, Tech. Rep., 2003.

R. Devon Hjelm, Athul Paul Jacob, Tong Che, Kyunghyun Cho, and Yoshua Bengio. Boundary-seeking generative adversarial networks. arXiv preprint arXiv:1702.08431, 2017.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448-456, 2015.

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099, 2016.

Prasad Kawthekar, Raunaq Rewari, and Suvrat Bhooshan. Evaluating generative models for text generation.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.

Kevin Knight, Anish Nair, Nishit Rathod, and Kenji Yamada. Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL Main Conference Poster Sessions, pp. 499-506. Association for Computational Linguistics, 2006.

Kevin Knight, Beáta Megyesi, and Christiane Schaefer. The Copiale cipher. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pp. 2-9. Association for Computational Linguistics, 2011.

Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. How to train your DRAGAN. arXiv preprint arXiv:1705.07215, 2017.

Matt J. Kusner and José Miguel Hernández-Lobato. GANs for sequences of discrete elements with the Gumbel-softmax distribution. arXiv preprint arXiv:1611.04051, 2016.

Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. arXiv preprint arXiv:1703.00848, 2017.
Pauline Luc, Camille Couprie, Soumith Chintala, and Jakob Verbeek. Semantic segmentation using adversarial networks. arXiv preprint arXiv:1611.08408, 2016.

Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076, 2016.

S. S. Omran, A. S. Al-Khalid, and D. M. Al-Saady. A cryptanalytic attack on Vigenère cipher using genetic algorithm. In Open Systems (ICOS), 2011 IEEE Conference on, pp. 59-64. IEEE, 2011.

Bhadri MSVS Raju et al. Decipherment of substitution cipher using enhanced probability distribution. International Journal of Computer Applications, 5(8):34-40, 2010.

R. S. Ramesh, G. Athithan, and K. Thiruvengadam. An automated approach to solve simple substitution ciphers. Cryptologia, 17(2):202-218, 1993.

Eyyub Sari. tensorflow-cyclegan. https://github.com/Eyyub/tensorflow-cyclegan, 2017.

Simon Singh. The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography. Anchor, 2000.

Ragheb Toemeh and Subbanagounder Arumugam. Applying genetic algorithms for searching key-space of polyalphabetic substitution ciphers. International Arab Journal of Information Technology (IAJIT), 5(1), 2008.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.

A. K. Verma, Mayank Dave, and R. C. Joshi. Genetic algorithm and tabu search attack on the monoalphabetic substitution cipher in adhoc networks. In Journal of Computer Science. Citeseer, 2007.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.

Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. arXiv preprint arXiv:1704.02510, 2017.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, pp. 2852-2858, 2017.

Xiang Zhang and Yann LeCun. Text understanding from scratch. arXiv preprint arXiv:1502.01710, 2015.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.

A ARCHITECTURAL DETAILS

For both the generator and discriminator architectures below, we assume the network is provided with a sequence of N tokens, denoted "data", lying in a K-simplex; therefore, data is an N by K matrix. For simplicity we define the following notation:

$$\text{ConvLayer}_{fc,\, fs=1,\, s=1}(x) = \text{ReLU}(\text{LayerNorm}(\text{Conv1D}_{fc,\, fs}(x)))$$
$$\text{ResBlock}_{fc,\, fs=1}(x) = \text{ReLU}(\text{LayerNorm}(x + \text{Conv1D}_{fc,\, fs}(\text{Conv1D}_{fc,\, fs}(x))))$$
$$\text{ConvStack}_{n,\, fc,\, fs=1}(x) = \text{ConvLayer}_{1,\, fs} \circ \text{ConvLayer}_{2^1 fc,\, fs,\, 2} \circ \cdots \circ \text{ConvLayer}_{2^n fc,\, fs,\, 2} \circ \text{ConvLayer}_{fc,\, fs,\, 2}(x)$$

A.1 GENERATOR

We define the following constants:
- fc = 32: the base filter count, or number of filters in a convolution layer
- fs = 1: the filter size, or width of the weight kernel used in convolution layers
- vs = 1: the vocab size, or number of elements in the vocabulary
- s = 1: the stride of the convolution layers
- E = 100: the dimensionality of the embedding vectors
- T = 100: the dimensionality of the concat timing weights

First, we look up the embedding vectors corresponding to the observed tokens.
This is performed as a simple inner product between data and a learned weight matrix of embeddings $W_{\text{Emb}} \in \mathbb{R}^{K \times E}$:

$$x = \text{data} \cdot W_{\text{Emb}}$$

Next, a timing signal is added using either:

The Transformer method:
$$\text{signal}_{n,\, 2k} = \sin\!\big(n / 1\mathrm{e}5^{\,2k/K}\big), \qquad \text{signal}_{n,\, 2k+1} = \cos\!\big(n / 1\mathrm{e}5^{\,2k/K}\big), \qquad \text{timing}(x) = x + \text{signal}$$

The concat method:
$$W_{\text{time}} \in \mathbb{R}^{N \times T}, \qquad t = \text{timing}(x) = [x \,\|_2\, W_{\text{time}}]$$

where $W_{\text{time}}$ are trained parameters and $[\,\cdot \,\|_k\, \cdot\,]$ denotes concatenation along the k-th axis.

Then the generator is defined as follows:

$$a(x) = \text{ConvLayer}_{4fc}(\text{ConvLayer}_{2fc,\, fs}(\text{ConvLayer}_{fc}(x)))$$
$$b(x) = \text{ResBlock}^{(5)}_{4fc}(x)$$
$$\text{out}(t) = \text{Softmax}(\text{ConvLayer}_{vs}(b(a(t))))$$

A.2 DISCRIMINATOR

We define the following constants:
- fc = 32: the base filter count, or number of filters in a convolution layer
- fs = 15: the filter size, or width of the weight kernel used in convolution layers
- n = 5: the depth of the convolutional stack

Similar to the generator, a timing signal is added first, leading to the following discriminator:

$$\text{out}(t) = \text{ConvStack}_{n,\, fc,\, fs}(\text{dropout}_{0.5}(t))$$

B PROOF OF PROPOSITION 1

Proposition 1 (Reliable Fooling Via Relaxation). Given:
- discrete spaces $X, Y$ and continuous relaxations $\tilde{X}, \tilde{Y}$
- generators $\tilde{F} : \tilde{X} \to \tilde{Y}$, $\tilde{G} : \tilde{Y} \to \tilde{X}$, bijections satisfying $\tilde{F} = \tilde{G}^{-1}$
- discrete discriminators $D_X, D_Y$, both optimal for the fixed $F, G$
- rediscretization functions $R_X, R_Y$

Suppose: $\tilde{F}$ is approximately volume preserving in a small region about each $x \in X$. Consequently, the same is true for $\tilde{G}$ about each $y \in Y$.

If: during training, we replace discrete random variables from $X$, which lie in the continuous metric space $\tilde{X}$, with samples from regions about them.

Then: the optimal relaxed discriminators $D_{\tilde{X}}$ and $D_{\tilde{Y}}$ have a non-empty region about each $x \in X$ and $y \in Y$ where they are expected to assign values close to $D_X(x)$ and $D_Y(y)$.

Proof. We prove one side of the CipherGAN, as the proofs for both sides are similar. Given:

- a bijective function between the continuous relaxations of $X$ and $Y$, $\tilde{F} : (\tilde{X}, d_{\tilde{X}}) \to (\tilde{Y}, d_{\tilde{Y}})$, where $\tilde{X}$ and $\tilde{Y}$ contain finite sequences (length n) of vectors lying in the simplex $\Delta^k$, and are the supports of the data distributions $p_{\tilde{X}}, p_{\tilde{Y}}$ respectively:
$$\tilde{X} = \tilde{Y} = \underbrace{\Delta^k \times \cdots \times \Delta^k}_{n}$$
with k equal to the number of elements in our vocabulary;
- the rediscretization function $R_X : \tilde{X} \to X$, $R_X(\tilde{x}) = \arg\min_{x \in X} d_{\tilde{X}}(\tilde{x}, x)$; similarly for $R_Y$.

Now, for each $x \in X$ consider the infinite set $S_x$, with the cardinality of the continuum, constructed according to:
$$\tilde{x} \in S_x \iff \tilde{x} \in \tilde{X} \text{ s.t. } R_X(\tilde{x}) = x \text{ and } R_Y(\tilde{F}(\tilde{x})) = F(x)$$
Equivalently, $S_x = R_X^{-1}(x) \cap \tilde{F}^{-1}(R_Y^{-1}(F(x)))$.

Note. $S_x$ is never of cardinality less than the continuum, since the following is implied by the definitions of $R_X$, $S_x$ and the fact that $\tilde{F}$ is continuous:
$$x \in X \implies x \in S_x \implies \exists \text{ closed ball } B_\epsilon[x] \subseteq S_x \text{ with radius } 0 < \epsilon < \min_{z \in \tilde{X}:\, R_Y(\tilde{F}(z)) \neq R_Y(\tilde{F}(x))} d_{\tilde{X}}(x, z)$$
So, for each element $x \in X$ there exists a closed set of points in $\tilde{X}$ which are rediscretized, under $R_X$, to x. Since $S_x$ is a Borel set, we can sample uniformly from it. Therefore, during training, suppose we replace each element $x \in X$ with a sample $\tilde{x} \sim S_x$.

We begin with the discrete objective:
$$\sum_{x \in X} p_X(x) \log(D_X(x)) + p_G(x) \log(1 - D_X(x))$$
As was noted in Goodfellow et al. (2014), this objective is optimized in $D_X : X \to [0, 1]$ when:
$$D_X = \frac{p_X}{p_X + p_G}$$
which is undesirable, as $p_X$ is a sum of Dirac delta distributions, $p_X(x) = \sum_{x_i \in X} \delta_{x_i}(x)$, and lacks a non-zero gradient to train the generator function with. Instead, let us consider a continuous relaxation $D_{\tilde{X}} : \tilde{X} \to [0, 1]$ of the discriminator $D_X$ and observe where it optimizes.
Suppose:
$$\forall x \in X,\ \exists \epsilon_x \geq 0,\ \forall \tilde{x} \in S_x:\quad (1 - \epsilon_x) \leq |\nabla_{\tilde{x}} \tilde{F}(\tilde{x})| \leq (1 + \epsilon_x) \tag{6}$$
That is, suppose $\tilde{F}$ is approximately volume preserving within a small region about each $x \in X$.

Lemma 1. $\forall y \in Y,\ \tilde{G}(S_y) = S_{G(y)}$.

Proof of Lemma 1. For all $y \in Y$ with $G(y) = x \iff y = F(x)$:
$$\tilde{G}(S_y) = \tilde{G}\big[R_Y^{-1}(y) \cap \tilde{G}^{-1}(R_X^{-1}(G(y)))\big] = \tilde{G}(R_Y^{-1}(y)) \cap \tilde{G}(\tilde{G}^{-1}(R_X^{-1}(G(y)))) = \tilde{G}(R_Y^{-1}(y)) \cap R_X^{-1}(G(y))$$
$$= \tilde{F}^{-1}(R_Y^{-1}(F(x))) \cap R_X^{-1}(x) = S_x = S_{G(y)}$$

Corollary 1.
$$\int_{S_y} \frac{1}{|S_{G(y)}|}\, d\tilde{y} = \int_{S_{G(y)}} \frac{|\nabla_{\tilde{x}} \tilde{G}^{-1}(\tilde{x})|}{|S_{G(y)}|}\, d\tilde{x}$$

Proof of Corollary 1. By Lemma 1, $\tilde{G}(S_y) = S_{G(y)}$, so the change of variables $\tilde{x} = \tilde{G}(\tilde{y})$, with $d\tilde{y} = |\nabla_{\tilde{x}} \tilde{G}^{-1}(\tilde{x})|\, d\tilde{x}$, gives the result.

Corollary 2.
$$\frac{1}{(1 - \epsilon_x)|S_{G(y)}|} \geq \frac{1}{|S_y|} \geq \frac{1}{(1 + \epsilon_x)|S_{G(y)}|}$$

Proof of Corollary 2. By Equation 6 and Corollary 1, with $\tilde{G}^{-1} = \tilde{F}$:
$$\frac{|S_y|}{|S_{G(y)}|} = \int_{S_y} \frac{1}{|S_{G(y)}|}\, d\tilde{y} = \int_{S_{G(y)}} \frac{|\nabla_{\tilde{x}} \tilde{F}(\tilde{x})|}{|S_{G(y)}|}\, d\tilde{x}$$
$$\implies (1 - \epsilon_x)|S_{G(y)}| \leq |S_y| \leq (1 + \epsilon_x)|S_{G(y)}| \implies \frac{1}{(1 - \epsilon_x)|S_{G(y)}|} \geq \frac{1}{|S_y|} \geq \frac{1}{(1 + \epsilon_x)|S_{G(y)}|}$$

Corollary 1 leads to the following:
$$\mathbb{E}_{x \sim X}\big[\mathbb{E}_{\tilde{x} \sim S_x}[\log(D_{\tilde{X}}(\tilde{x}))]\big] + \mathbb{E}_{y \sim Y}\big[\mathbb{E}_{\tilde{y} \sim S_y}[\log(1 - D_{\tilde{X}}(\tilde{G}(\tilde{y})))]\big]$$
$$= \sum_{x \in X} p_X(x) \int_{S_x} \frac{1}{|S_x|} \log(D_{\tilde{X}}(\tilde{x}))\, d\tilde{x} + \sum_{y \in Y} p_Y(y) \int_{S_y} \frac{1}{|S_y|} \log(1 - D_{\tilde{X}}(\tilde{G}(\tilde{y})))\, d\tilde{y}$$
$$= \sum_{x \in X} p_X(x) \int_{S_x} \frac{1}{|S_x|} \log(D_{\tilde{X}}(\tilde{x}))\, d\tilde{x} + \sum_{y \in Y} p_Y(y) \int_{S_{G(y)}} \frac{|\nabla_{\tilde{x}} \tilde{F}(\tilde{x})|}{|S_y|} \log(1 - D_{\tilde{X}}(\tilde{x}))\, d\tilde{x} \tag{7}$$

Using Lemma 1 and Corollary 2, we obtain the following lower bound of Equation 7:
$$\sum_{x \in X} p_X(x) \int_{S_x} \frac{1}{|S_x|} \log(D_{\tilde{X}}(\tilde{x}))\, d\tilde{x} + \sum_{y \in Y} p_Y(y) \int_{S_{G(y)}} \frac{|\nabla_{\tilde{x}} \tilde{F}(\tilde{x})|}{|S_y|} \log(1 - D_{\tilde{X}}(\tilde{x}))\, d\tilde{x}$$
$$\geq \sum_{x \in X} \Big[ p_X(x) \int_{S_x} \frac{1}{|S_x|} \log(D_{\tilde{X}}(\tilde{x}))\, d\tilde{x} + p_Y(F(x)) \int_{\tilde{x} \in S_{G(F(x))}} \frac{1 - \epsilon_x}{(1 + \epsilon_x)|S_{G(F(x))}|} \log(1 - D_{\tilde{X}}(\tilde{x}))\, d\tilde{x} \Big]$$
$$= \sum_{x \in X} \int_{S_x} \frac{1}{|S_x|}\Big[ p_X(x) \log(D_{\tilde{X}}(\tilde{x})) + \frac{1 - \epsilon_x}{1 + \epsilon_x}\, p_G(x) \log(1 - D_{\tilde{X}}(\tilde{x})) \Big] d\tilde{x}$$
which is maximal at:
$$D_{\tilde{X}}(\tilde{x} \in S_x) = \frac{p_X}{p_X + \frac{1 - \epsilon_x}{1 + \epsilon_x} p_G} \quad\implies\quad \mathbb{E}_{\tilde{x} \sim S_x}[D_{\tilde{X}}(\tilde{x})] \xrightarrow{\ \epsilon_x \to 0\ } D_X$$

And similarly, we find Equation 7 is upper-bounded by:
$$\sum_{x \in X} \Big[ p_X(x) \int_{S_x} \frac{1}{|S_x|} \log(D_{\tilde{X}}(\tilde{x}))\, d\tilde{x} + p_Y(F(x)) \int_{\tilde{x} \in S_{G(F(x))}} \frac{1 + \epsilon_x}{(1 - \epsilon_x)|S_{G(F(x))}|} \log(1 - D_{\tilde{X}}(\tilde{x}))\, d\tilde{x} \Big]$$
$$= \sum_{x \in X} \int_{S_x} \frac{1}{|S_x|}\Big[ p_X(x) \log(D_{\tilde{X}}(\tilde{x})) + \frac{1 + \epsilon_x}{1 - \epsilon_x}\, p_G(x) \log(1 - D_{\tilde{X}}(\tilde{x})) \Big] d\tilde{x}$$
which is maximal in $D_{\tilde{X}}$ at:
$$D_{\tilde{X}}(\tilde{x} \in S_x) = \frac{p_X}{p_X + \frac{1 + \epsilon_x}{1 - \epsilon_x} p_G} \quad\implies\quad \mathbb{E}_{\tilde{x} \sim S_x}[D_{\tilde{X}}(\tilde{x})] \xrightarrow{\ \epsilon_x \to 0\ } D_X$$

Hence, asymptotically, as $\tilde{F}$ becomes approximately volume preserving about each discrete $x \in X$, the bounds are maximized to the same loss value by the same class of functions, $\mathbb{E}_{\tilde{x} \sim S_x}[D_{\tilde{X}}(\tilde{x})] = D_X$; application of the squeeze theorem concludes the proof. ∎