# Transformation Autoregressive Networks

Junier B. Oliva 1 2, Avinava Dubey 2, Manzil Zaheer 2, Barnabás Póczos 2, Ruslan Salakhutdinov 2, Eric P. Xing 2, Jeff Schneider 2

The fundamental task of general density estimation $p(x)$ has been of keen interest to machine learning. In this work, we attempt to systematically characterize methods for density estimation. Broadly speaking, most existing methods can be categorized as using either: a) autoregressive models to estimate the conditional factors of the chain rule, $p(x_i \mid x_{i-1}, \dots)$; or b) non-linear transformations of variables of a simple base distribution. Based on a study of the characteristics of these categories, we propose multiple novel methods for each category. For example, we propose RNN-based transformations to model non-Markovian dependencies. Further, through a comprehensive study over both real-world and synthetic data, we show that jointly leveraging transformations of variables and autoregressive conditional models results in a considerable improvement in performance. We illustrate the use of our models in outlier detection and image modeling. Finally, we introduce a novel data-driven framework for learning a family of distributions.

## 1. Introduction

Density estimation is at the core of a multitude of machine learning applications. However, this fundamental task is difficult in the general setting due to issues like the curse of dimensionality. Furthermore, for general data, unlike spatial/temporal data, we do not have a priori known correlations among covariates that may be exploited. For example, image data has known correlations among neighboring pixels that may be hard-coded into a model through convolutions, whereas one must find such correlations in a data-driven fashion with general data.

1 Computer Science Department, University of North Carolina, Chapel Hill, NC 27599 (work completed while at CMU). 2 Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213. Correspondence to: Junier Oliva.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

[Figure 1: scatter plot of relative test likelihood for NADE, NICE, ARC, NLT, and TAN on the datasets forrest (d=10), pendigits (d=16), susy (d=18), higgs (d=28), hepmass (d=28), satimage2 (d=36), and music (d=90).]

Figure 1. The proposed TAN models for density estimation, which jointly leverage non-linear transformations and autoregressive conditionals, show considerable improvement over other methods across datasets of varying dimensions. The scatter plot shows that only utilizing autoregressive conditionals (ARC) without transformations (e.g., existing works like NADE (Uria et al., 2014) and other variants) or only using non-linear transformations (NLT) with simple restricted conditionals (e.g., existing works like NICE (Dinh et al., 2014) and other variants) is not sufficient for all datasets.

In order to model high dimensional data, the main challenge lies in constructing models that are flexible enough while having tractable learning algorithms. A variety of diverse solutions exploiting different aspects of the problem have been proposed in the literature. A large number of methods have considered autoregressive models to estimate the conditional factors $p(x_i \mid x_{i-1}, \dots, x_1)$, for $i \in \{1, \dots, d\}$, in the chain rule (Larochelle & Murray, 2011; Uria et al., 2013; 2016; Germain et al., 2015; Gregor et al., 2014). While some methods directly model the conditionals $p(x_i \mid x_{i-1}, \dots)$
using sophisticated semiparametric density estimates, other methods apply sophisticated transformations of variables $x \mapsto z$ and take the conditionals over $z$ to be a restricted, often independent, base distribution $p(z_i \mid z_{i-1}, \dots) \equiv f(z_i)$ (Dinh et al., 2014; 2016). Further related works are discussed in Sec. 3. However, looking across a diverse set of datasets, as in Fig. 1, neither of these approaches has the flexibility required to accurately model real-world data.

In this paper we take a step back and start from the basics. If we only model the conditionals, the conditional factors $p(x_i \mid x_{i-1}, \dots)$ may become increasingly complicated as $i$ increases to $d$. On the other hand, if we use a complex transformation with restricted conditionals, then the transformation has to ensure that the transformed variables are independent. This requirement of independence on the transformed variables can be very restrictive. Now note that the transformed space is homeomorphic to the original space, and a simple relationship between the densities of the two spaces exists through the Jacobian. Thus, we can employ conditional modeling on the transformed variables to alleviate the independence requirement, while being able to recover the density in the original space in a straightforward fashion. In other words, we propose transformation autoregressive networks (TANs), which compose complex transformations with autoregressive modeling of the conditionals. The composition not only increases the flexibility of the model but also reduces the expressive power needed from each of the individual components. This leads to improved performance, as can be seen from Fig. 1.

In particular, we first propose two flexible autoregressive models for modeling conditional distributions: the linear autoregressive model (LAM) and the recurrent autoregressive model (RAM) (Sec. 2.1). Second, we introduce several novel transformations of variables: 1) an efficient method for learning a linear transformation on covariates; 2) an invertible RNN-based transformation that directly acts on covariates; 3) an additive RNN-based transformation (Sec. 2.2). Extensive experiments on both synthetic (Sec. 4.1) and real-world (Sec. 4.2) datasets show the power of TANs for capturing complex dependencies between the covariates. We run an ablation study to demonstrate the contributions of the various components of TANs (Sec. 4.3). Moreover, we show that the learned model can be used for anomaly detection (Sec. 4.4) and learning a family of distributions (Sec. 4.5).

## 2. Transformation Autoregressive Networks

As mentioned above, TANs are composed of two modules: a) an autoregressive module for modeling conditional factors, and b) transformations of variables. We first introduce our two proposed autoregressive models to estimate the conditional distribution of input covariates $x \in \mathbb{R}^d$. Later, we show how to use such models over a transformation $z = q(x)$, while renormalizing to obtain density values for $x$.

### 2.1. Autoregressive Models

Autoregressive models decompose density estimation of a multivariate variable $x \in \mathbb{R}^d$ into multiple conditional tasks on a growing set of inputs through the chain rule:

$$p(x_1, \dots, x_d) = \prod_{i=1}^{d} p(x_i \mid x_{i-1}, \dots, x_1). \tag{1}$$

That is, autoregressive models will estimate the $d$ conditional distributions $p(x_i \mid x_{i-1}, \dots)$.
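Before detailing the conditional models, here is a minimal sketch of the renormalization mentioned above (assuming an invertible transformation $q$ with a tractable log-determinant; the function names are illustrative, not the paper's implementation):

```python
import torch

def tan_log_density(x, transform, ar_log_prob):
    """Sketch: TAN log density via the change-of-variables formula.

    transform   : invertible q mapping x -> (z, log|det dq/dx|)
    ar_log_prob : autoregressive model returning sum_i log p(z_i | z_{i-1}, ..., z_1)
    """
    z, log_det_jac = transform(x)        # z = q(x)
    return ar_log_prob(z) + log_det_jac  # log p(x) = log p_z(q(x)) + log|det J_q(x)|
```

Under this factorization, the transformation and the autoregressive conditional model can be trained jointly by maximizing this log density over observed data.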
A class of autoregressive models can be defined by approximating conditional distributions through a mixture model, $\mathrm{MM}(\theta(x_{i-1}, \dots, x_1))$, with parameters $\theta$ depending on $x_{i-1}, \dots, x_1$:

$$p(x_i \mid x_{i-1}, \dots, x_1) = p(x_i \mid \mathrm{MM}(\theta(x_{i-1}, \dots, x_1))), \tag{2}$$

$$\theta(x_{i-1}, \dots, x_1) = f(h_i), \tag{3}$$

$$h_i = g_i(x_{i-1}, \dots, x_1), \tag{4}$$

where $f(\cdot)$ is a fully connected network that may use an element-wise non-linearity on inputs, and $g_i(\cdot)$ is some general mapping that computes a hidden state of features, $h_i \in \mathbb{R}^p$, which helps in modeling the conditional distribution of $x_i \mid x_{i-1}, \dots, x_1$. One can control the flexibility of the model through $g_i$. It is important that $g_i$ be powerful enough to model our covariates while still generalizing. In order to achieve this we propose two methods for modeling $g_i$.

Linear Autoregressive Model (LAM): This uses a straightforward linear map as $g_i$ in (4):

$$g_i(x_{i-1}, \dots, x_1) = W^{(i)} x_{1:i-1}$$
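To make Eqs. (2)-(4) concrete, here is a minimal sketch of a single conditional using a Gaussian mixture for MM and a linear map for $g_i$, as in LAM (the mixture family, layer sizes, and all names are illustrative assumptions, not the paper's exact parameterization):

```python
import torch
import torch.nn.functional as F

class LAMConditional(torch.nn.Module):
    """Sketch: p(x_i | x_{i-1}, ..., x_1) as a K-component Gaussian mixture
    whose parameters come from a linear hidden state (Eqs. 2-4)."""

    def __init__(self, i, p=32, K=10):
        super().__init__()
        assert i >= 2                        # x_1 would need an unconditional mixture
        self.g = torch.nn.Linear(i - 1, p)   # Eq. (4): h_i = g_i(x_{i-1}, ..., x_1), linear map
        self.f = torch.nn.Linear(p, 3 * K)   # Eq. (3): theta = f(h_i)

    def log_prob(self, x_prev, x_i):
        h = self.g(x_prev)                                            # hidden state h_i in R^p
        logits, mu, log_sig = self.f(torch.relu(h)).chunk(3, dim=-1)  # mixture parameters theta
        log_w = F.log_softmax(logits, dim=-1)                         # component weights
        comp = torch.distributions.Normal(mu, log_sig.exp())
        # Eq. (2): log p(x_i | x_{<i}) = logsumexp_k [ log w_k + log N(x_i; mu_k, sigma_k) ]
        return torch.logsumexp(log_w + comp.log_prob(x_i.unsqueeze(-1)), dim=-1)
```

Summing such terms over $i = 1, \dots, d$ gives the chain-rule log-likelihood of Eq. (1).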