# rectified_factor_networks__2f3eb239.pdf Rectified Factor Networks Djork-Arn e Clevert, Andreas Mayr, Thomas Unterthiner and Sepp Hochreiter Institute of Bioinformatics, Johannes Kepler University, Linz, Austria {okko,mayr,unterthiner,hochreit}@bioinf.jku.at We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We proof convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data s covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. RFN package for GPU/CPU is available at http://www.bioinf.jku.at/software/rfn. 1 Introduction The success of deep learning is to a large part based on advanced and efficient input representations [1, 2, 3, 4]. These representations are sparse and hierarchical. Sparse representations of the input are in general obtained by rectified linear units (Re LU) [5, 6] and dropout [7]. The key advantage of sparse representations is that dependencies between coding units are easy to model and to interpret. Most importantly, distinct concepts are much less likely to interfere in sparse representations. Using sparse representations, similarities of samples often break down to co-occurrences of features in these samples. In bioinformatics sparse codes excelled in biclustering of gene expression data [8] and in finding DNA sharing patterns between humans and Neanderthals [9]. Representations learned by Re LUs are not only sparse but also non-negative. Non-negative representations do not code the degree of absence of events or objects in the input. As the vast majority of events is supposed to be absent, to code for their degree of absence would introduce a high level of random fluctuations. We also aim for non-linear input representations to stack models for constructing hierarchical representations. Finally, the representations are supposed to have a large number of coding units to allow coding of rare and small events in the input. Rare events are only observed in few samples like seldom side effects in drug design, rare genotypes in genetics, or small customer groups in e-commerce. Small events affect only few input components like pathways with few genes in biology, few relevant mutations in oncology, or a pattern of few products in e-commerce. In summary, our goal is to construct input representations that (1) are sparse, (2) are non-negative, (3) are non-linear, (4) use many code units, and (5) model structures in the input data (see next paragraph). Current unsupervised deep learning approaches like autoencoders or restricted Boltzmann machines (RBMs) do encode all peculiarities in the data (including noise). Generative models can be design to model specific structures in the data, but their codes cannot be enforced to be sparse and nonnegative. The input representation of a generative model is its posterior s mean, median, or mode, which depends on the data. Therefore, sparseness and non-negativity cannot be guaranteed independent of the data. For example, generative models with rectified priors, like rectified factor analysis, have zero posterior probability for negative values, therefore their means are positive and not sparse [10, 11]. Sparse priors like Laplacian and Jeffrey s do not guarantee sparse posteriors (see experiments in Tab. 1). To address the data dependence of the code, we employ the posterior regularization method [12]. This method separates model characteristics from data dependent characteristics that are enforced by constraints on the model s posterior. We aim at representations that are feasible for many code units and massive datasets, therefore the computational complexity of generating a code is essential in our approach. For non-Gaussian priors, the computation of the posterior mean of a new input requires either to numerically solve an integral or to iteratively update variational parameters [13]. In contrast, for Gaussian priors the posterior mean is the product between the input and a matrix that is independent of the input. Still the posterior regularization method leads to a quadratic (in the number of coding units) constrained optimization problem in each E-step (see Eq. (3) below). To speed up computation, we do not solve the quadratic problem but perform a gradient step. To allow for stochastic gradients and fast GPU implementations, also the M-step is a gradient step. These E-step and M-step modifications of the posterior regularization method result in a generalized alternating minimization (GAM) algorithm [12]. We will show that the GAM algorithm used for RFN learning (i) converges and (ii) is correct. Correctness means that the RFN codes are non-negative, sparse, have a low reconstruction error, and explain the covariance structure of the data. 2 Rectified Factor Network Our goal is to construct representations of the input that (1) are sparse, (2) are non-negative, (3) are non-linear, (4) use many code units, and (5) model structures in the input. Structures in the input are identified by a generative model, where the model assumptions determine which input structures to explain by the model. We want to model the covariance structure of the input, therefore we choose maximum likelihood factor analysis as model. The constraints on the input representation are enforced by the posterior regularization method [12]. Non-negative constraints lead to sparse and non-linear codes, while normalization constraints scale the signal part of each hidden (code) unit. Normalizing constraints avoid that generative models explain away rare and small signals by noise. Explaining away becomes a serious problem for models with many coding units since their capacities are not utilized. Normalizing ensures that all hidden units are used but at the cost of coding also random and spurious signals. Spurious and true signals must be separated in a subsequent step either by supervised techniques, by evaluating coding units via additional data, or by domain experts. A generative model with hidden units h and data v is defined by its prior p(h) and its likelihood p(v | h). The full model distribution p(h, v) = p(v | h)p(h) can be expressed by the model s posterior p(h | v) and its evidence (marginal likelihood) p(v): p(h, v) = p(h | v)p(v). The representation of input v is the posterior s mean, median, or mode. The posterior regularization method introduces a variational distribution Q(h | v) Q from a family Q, which approximates the posterior p(h | v). We choose Q to constrain the posterior means to be non-negative and normalized. The full model distribution p(h, v) contains all model assumptions and, thereby, defines which structures of the data are modeled. Q(h | v) contains data dependent constraints on the posterior, therefore on the code. For data {v} = {v1, . . . , vn}, the posterior regularization method maximizes the objective F [12]: i=1 log p(vi) 1 i=1 DKL(Q(hi | vi) p(hi | vi)) (1) Z Q(hi | vi) log p(vi | hi) dhi 1 i=1 DKL(Q(hi | vi) p(hi)) , where DKL is the Kullback-Leibler distance. Maximizing F achieves two goals simultaneously: (1) extracting desired structures and information from the data as imposed by the generative model and (2) ensuring desired code properties via Q Q. h1 h3 h2 h4 Figure 1: Factor analysis model: hidden units (factors) h, visible units v, weight matrix W , noise ϵ. The factor analysis model v = W h + ϵ extracts the covariance structure of the data. The prior h N (0, I) of the hidden units (factors) h Rl and the noise ϵ N (0, Ψ) of visible units (observations) v Rm are independent. The model parameters are the weight (loading) matrix W Rm l and the noise covariance matrix Ψ Rm m. We assume diagonal Ψ to explain correlations between input components by the hidden units and not by correlated noise. The factor analysis model is depicted in Fig. 1. Given the mean-centered data {v} = {v1, . . . , vn}, the posterior p(hi | vi) is Gaussian with mean vector (µp)i and covariance matrix Σp: (µp)i = I + W T Ψ 1W 1 W T Ψ 1 vi , Σp = I + W T Ψ 1W 1 . (2) A rectified factor network (RFN) consists of a single or stacked factor analysis model(s) with constraints on the posterior. To incorporate the posterior constraints into the factor analysis model, we use the posterior regularization method that maximizes the objective F given in Eq. (1) [12]. Like the expectation-maximization (EM) algorithm, the posterior regularization method alternates between an E-step and an M-step. Minimizing the first DKL of Eq. (1) with respect to Q leads to a constrained optimization problem. For Gaussian distributions, the solution with (µp)i and Σp from Eq. (2) is Q(hi | vi) N (µi, Σ) with Σ = Σp and the quadratic problem: i=1 (µi (µp)i)T Σ 1 p (µi (µp)i) , s.t. i : µi 0 , j : 1 i=1 µ2 ij = 1 , (3) where is component-wise. This is a constraint non-convex quadratic optimization problem in the number of hidden units which is too complex to be solved in each EM iteration. Therefore, we perform a step of the gradient projection algorithm [14, 15], which performs first a gradient step and then projects the result to the feasible set. We start by a step of the projected Newton method, then we try the gradient projection algorithm, thereafter the scaled gradient projection algorithm with reduced matrix [16] (see also [15]). If these methods fail to decrease the objective in Eq. (3), we use the generalized reduced method [17]. It solves each equality constraint for one variable and inserts it into the objective while ensuring convex constraints. Alternatively, we use Rosen s gradient projection method [18] or its improvement [19]. These methods guarantee a decrease of the E-step objective. Since the projection P by Eq. (6) is very fast, the projected Newton and projected gradient update is very fast, too. A projected Newton step requires O(nl) steps (see Eq. (7) and P defined in Theorem 1), a projected gradient step requires O(min{nlm, nl2}) steps, and a scaled gradient projection step requires O(nl3) steps. The RFN complexity per iteration is O(n(m2 + l2)) (see Alg. 1). In contrast, a quadratic program solver typically requires for the (nl) variables (the means of the hidden units for all samples) O(n4l4) steps to find the minimum [20]. We exemplify these values on our benchmark datasets MNIST (n = 50k, l = 1024, m = 784) and CIFAR (n = 50k, l = 2048, m = 1024). The speedup with projected Newton or projected gradient in contrast to a quadratic solver is O(n3l2) = O(n4l4)/O(nl2), which gives speedup ratios of 1.3 1020 for MNIST and 5.2 1020 for CIFAR. These speedup ratios show that efficient E-step updates are essential for RFN learning. Furthermore, on our computers, RAM restrictions limited quadratic program solvers to problems with nl 20k. Running times of RFNs with the Newton step and a quadratic program solver are given in the supplementary Section 15. The M-step decreases the expected reconstruction error Rl Q(hi | vi) log (p(vi | hi)) dhi (4) m log (2π) + log |Ψ| + Tr Ψ 1C 2 Tr Ψ 1W U T + Tr W T Ψ 1W S . from Eq. (1) with respect to the model parameters W and Ψ. Definitions of C, U and S are given in Alg. 1. The M-step performs a gradient step in the Newton direction, since we want to Algorithm 1 Rectified Factor Network. n Pn i=1 viv T i 2: while STOP=false do 3: E-step1 4: for all 1 i n do 5: (µp)i = I + W T Ψ 1W 1 W T Ψ 1vi 6: end for 7: Σ = Σp = I + W T Ψ 1W 1 8: Constraint Posterior 9: (1) projected Newton, (2) projected gradient, (3) scaled gradient projection, (4) generalized reduced method, (5) Rosen s gradient project. 10: E-step2 11: U = 1 n Pn i=1 vi µT i 12: S = 1 n Pn i=1 µi µT i + Σ 13: M-step 14: E = C U W T W U + W S W T 15: W = W + η U S 1 W 16: for all 1 k m do 17: Ψkk = Ψkk + η (Ekk Ψkk) 18: end for 19: if stopping criterion is met: STOP=true 20: end while Complexity: objective F: O(min{nlm, nl2} + l3); E-step1: O(min{m2(m + l), l2(m + l)} + nlm); projected Newton: O(nl); projected gradient: O(min{nlm, nl2}); scaled gradient projection: O(nl3); Estep2: O(nl(m+l)); M-step: O(ml(m+l)); overall complexity with projected Newton / gradient for (l+m) < n: O(n(m2 + l2)). allow stochastic gradients, fast GPU implementation, and dropout regularization. The Newton step is derived in the supplementary which gives further details, too. Also in the E-step, RFN learning performs a gradient step using projected Newton or gradient projection methods. These projection methods require the Euclidean projection P of the posterior means {(µp)i} onto the non-convex feasible set: i=1 (µi (µp)i)T (µi (µp)i) , s.t. µi 0 , 1 i=1 µ2 ij = 1 . (5) The following Theorem 1 gives the Euclidean projection P as solution to Eq. (5). Theorem 1 (Euclidean Projection). If at least one (µp)ij is positive for 1 j l, then the solution to optimization problem Eq. (5) is µij = [P((µp)i)]j = ˆµij q 1 n Pn i=1 ˆµ2 ij , ˆµij = 0 for (µp)ij 0 (µp)ij for (µp)ij > 0 . (6) If all (µp)ij are non-positive for 1 j l, then the optimization problem Eq. (5) has the solution µij = n for j = arg maxˆj{(µp)iˆj} and µij = 0 otherwise. Proof. See supplementary material. Using the projection P defined in Eq. (6), the E-step updates for the posterior means µi are: µnew i = P µold i + γ d µold i , d = P µold i + λ H 1 Σ 1 p ((µp)i µold i ) (7) where we set for the projected Newton method H 1 = Σp (thus H 1Σ 1 p = I), and for the projected gradient method H 1 = I. For the scaled gradient projection algorithm with reduced matrix, the ϵ-active set for i consists of all j with µij ϵ. The reduced matrix H is the Hessian Σ 1 p with ϵ-active columns and rows j fixed to unit vectors ej. The resulting algorithm is a posterior regularization method with a gradient based Eand M-step, leading to a generalized alternating minimization (GAM) algorithm [21]. The RFN learning algorithm is given in Alg. 1. Dropout regularization can be included before E-step2 by randomly setting code units µij to zero with a predefined dropout rate (note that convergence results will no longer hold). 3 Convergence and Correctness of RFN Learning Convergence of RFN Learning. Theorem 2 states that Alg. 1 converges to a maximum of F. Theorem 2 (RFN Convergence). The rectified factor network (RFN) learning algorithm given in Alg. 1 is a generalized alternating minimization (GAM) algorithm and converges to a solution that maximizes the objective F. Proof. We present a sketch of the proof which is given in detail in the supplement. For convergence, we show that Alg. 1 is a GAM algorithm which convergences according to Proposition 5 in [21]. Alg. 1 ensures to decrease the M-step objective which is convex in W and Ψ 1. The update with η = 1 leads to the minimum of the objective. Convexity of the objective guarantees a decrease in the M-step for 0 < η 1 if not in a minimum. Alg. 1 ensures to decrease the E-step objective by using gradient projection methods. All other requirements for GAM convergence are also fulfilled. Proposition 5 in [21] is based on Zangwill s generalized convergence theorem, thus updates of the RFN algorithm are viewed as point-to-set mappings [22]. Therefore, the numerical precision, the choice of the methods in the E-step, and GPU implementations are covered by the proof. Correctness of RFN Learning. The goal of the RFN algorithm is to explain the data and its covariance structure. The expected approximation error E is defined in line 14 of Alg. 1. Theorem 3 states that the RFN algorithm is correct, that is, it explains the data (low reconstruction error) and captures the covariance structure as good as possible. Theorem 3 (RFN Correctness). The fixed point W of Alg. 1 minimizes Tr (Ψ) given µi and Σ by ridge regression with i=1 ϵi 2 2 + W Σ1/2 2 where ϵi = vi W µi. The model explains the data covariance matrix by C = Ψ + W S W T (9) up to an error, which is quadratic in Ψ for Ψ W W T . The reconstruction error 1 n Pn i=1 ϵi 2 2 is quadratic in Ψ for Ψ W W T . Proof. The fixed point equation for the W update is W = US 1 W = 0 W = US 1. Using the definition of U and S, we have W = 1 n Pn i=1 vi µT i 1 n Pn i=1 µi µT i + Σ 1 . W is the ridge regression solution of i=1 vi W µi 2 2 + W Σ1/2 2 i=1 ϵi ϵT i + W Σ W T ! where Tr is the trace. After multiplying out all ϵiϵT i in 1/n Pn i=1 ϵiϵT i , we obtain: i=1 ϵi ϵT i + W Σ W T . (11) For the fixed point of Ψ, the update rule gives: diag (Ψ) = diag 1 n Pn i=1 ϵiϵT i + W ΣW T . Thus, W minimizes Tr (Ψ) given µi and Σ. Multiplying the Woodbury identity for W W T + Ψ 1 from left and right by Ψ gives W ΣW T = Ψ Ψ W W T + Ψ 1 Ψ. (12) Inserting this into the expression for diag (Ψ) and taking the trace gives i=1 ϵi ϵT i = Tr Ψ W W T + Ψ 1 Ψ Tr W W T + Ψ 1 Tr (Ψ)2 . (13) Therefore, for Ψ W W T the error is quadratic in Ψ. W U T = W SW T = UW T follows from fixed point equation U = W S. Using this and Eq. (12), Eq. (11) is i=1 ϵi ϵT i Ψ W W T + Ψ 1 Ψ = C Ψ W S W T . (14) Using the trace norm (nuclear norm or Ky-Fan n-norm) on matrices, Eq. (13) states that the left hand side of Eq. (14) is quadratic in Ψ for Ψ W W T . The trace norm of a positive semi-definite matrix is its trace and bounds the Frobenius norm [23]. Thus, for Ψ W W T , the covariance is approximated up to a quadratic error in Ψ according to Eq. (9). The diagonal is exactly modeled. Since the minimization of the expected reconstruction error Tr (Ψ) is based on µi, the quality of reconstruction depends on the correlation between µi and vi. We ensure maximal information in µi on vi by the I-projection (the minimal Kullback-Leibler distance) of the posterior onto the family of rectified and normalized Gaussian distributions. 4 Experiments RFNs vs. Other Unsupervised Methods. We assess the performance of rectified factor networks (RFNs) as unsupervised methods for data representation. We compare (1) RFN: rectified factor networks, (2) RFNn: RFNs without normalization, (3) DAE: denoising autoencoders with Re LUs, (4) RBM: restricted Boltzmann machines with Gaussian visible units, (5) FAsp: factor analysis with Jeffrey s prior (p(z) 1/z) on the hidden units which is sparser than a Laplace prior, (6) FAlap: factor analysis with Laplace prior on the hidden units, (7) ICA: independent component analysis by Fast ICA [24], (8) SFA: sparse factor analysis with a Laplace prior on the parameters, (9) FA: standard factor analysis, (10) PCA: principal component analysis. The number of components are fixed to 50, 100 and 150 for each method. We generated nine different benchmark datasets (D1 to D9), where each dataset consists of 100 instances. Each instance has 100 samples and 100 features resulting in a 100 100 matrix. Into these matrices, biclusters are implanted [8]. A bicluster is a pattern of particular features which is found in particular samples like a pathway activated in some samples. An optimal representation will only code the biclusters that are present in a sample. The datasets have different noise levels and different bicluster sizes. Large biclusters have 20 30 samples and 20 30 features, while small biclusters 3 8 samples and 3 8 features. The pattern s signal strength in a particular sample was randomly chosen according to the Gaussian N (1, 1). Finally, to each matrix, zero-mean Gaussian background noise was added with standard deviation 1, 5, or 10. The datasets are characterized by Dx=(σ, n1, n2) with background noise σ, number of large biclusters n1, and the number of small biclusters n2: D1=(1,10,10), D2=(5,10,10), D3=(10,10,10), D4=(1,15,5), D5=(5,15,5), D6=(10,15,5), D7=(1,5,15), D8=(5,5,15), D9=(10,5,15). We evaluated the methods according to the (1) sparseness of the components, the (2) input reconstruction error from the code, and the (3) covariance reconstruction error for generative models. For RFNs sparseness is the percentage of the components that are exactly 0, while for others methods it is the percentage of components with an absolute value smaller than 0.01. The reconstruction error is the sum of the squared errors across samples. The covariance reconstruction error is the Frobenius norm of the difference between model and data covariance. See supplement for more details on the data and for information on hyperparameter selection for the different methods. Tab. 1 gives averaged results for models with 50 (undercomplete), 100 (complete) and 150 (overcomplete) coding units. Results are the mean of 900 instances consisting of 100 instances for each dataset D1 to D9. In the supplement, we separately tabulate the results for D1 to D9 and confirm them with different noise levels. FAlap did not yield sparse codes since the variational parameter did not Table 1: Comparison of RFN with other unsupervised methods, where the upper part contains methods that yielded sparse codes. Criteria: sparseness of the code (SP), reconstruction error (ER), difference between data and model covariance (CO). The panels give the results for models with 50, 100 and 150 coding units. Results are the mean of 900 instances, 100 instances for each dataset D1 to D9 (maximal value: 999). RFNs had the sparsest code, the lowest reconstruction error, and the lowest covariance approximation error of all methods that yielded sparse representations (SP>10%). undercomplete 50 code units complete 100 code units overcomplete 150 code units SP ER CO SP ER CO SP ER CO RFN 75 0 249 3 108 3 81 1 68 9 26 6 85 1 17 6 7 6 RFNn 74 0 295 4 140 4 79 0 185 5 59 3 80 0 142 4 35 2 DAE 66 0 251 3 69 0 147 2 71 0 130 2 RBM 15 1 310 4 7 1 287 4 5 0 286 4 FAsp 40 1 999 63 999 99 63 0 999 65 999 99 80 0 999 65 999 99 FAlap 4 0 239 6 341 19 6 0 46 4 985 45 4 0 46 4 976 53 ICA 2 0 174 2 3 1 0 0 3 1 0 0 SFA 1 0 218 5 94 3 1 0 16 1 114 5 1 0 16 1 285 7 FA 1 0 218 4 90 3 1 0 16 1 83 4 1 0 16 1 263 6 PCA 0 0 174 2 2 0 0 0 2 0 0 0 (a) MNIST digits (b) MNIST digits with random image background (c) MNIST digits with random noise background (d) convex and concave shapes (e) tall and wide rectangular (f) rectangular images on background images (g) CIFAR-10 images (best viewed in color) (h) NORB images Figure 2: Randomly selected filters trained on image datasets using an RFN with 1024 hidden units. RFNs learned stroke, local and global blob detectors. RFNs are robust to background noise (b,c,f). push the absolute representations below the threshold of 0.01. The variational approximation to the Laplacian is a Gaussian [13]. RFNs had the sparsest code, the lowest reconstruction error, and the lowest covariance approximation error of all methods yielding sparse representations (SP>10%). RFN Pretraining for Deep Nets. We assess the performance of rectified factor networks (RFNs) if used for pretraining of deep networks. Stacked RFNs are obtained by first training a single layer RFN and then passing on the resulting representation as input for training the next RFN. The deep network architectures use a RFN pretrained first layer (RFN-1) or stacks of 3 RFNs giving a 3hidden layer network. The classification performance of deep networks with RFN pretrained layers was compared to (i) support vector machines, (ii) deep networks pretrained by stacking denoising autoencoders (SDAE), (iii) stacking regular autoencoders (SAE), (iv) restricted Boltzmann machines (RBM), and (v) stacking restricted Boltzmann machines (DBN). The benchmark datasets and results are taken from previous publications [25, 26, 27, 28] and contain: (i) MNIST (original MNIST), (ii) basic (a smaller subset of MNIST for training), (iii) bg-rand (MNIST with random noise background), (iv) bg-img (MNIST with random image background), (v) rect (tall or wide rectangles), (vi) rect-img (tall or wide rectangular images with random background images), (vii) convex (convex or concave shapes), (viii) CIFAR-10 (60k color images in 10 classes), and (ix) NORB (29,160 stereo image pairs of 5 categories). For each dataset its size of training, validation and test set is given in the second column of Tab. 2. As preprocessing we only performed median centering. Model selection is based on the validation set [26]. The RFNs hyperparameters are (i) the number of units per layer from {1024, 2048, 4096} and (ii) the dropout rate from {0.0, 0.25, 0.5, 0.75}. The learning rate was fixed to η = 0.01 (default value). For supervised fine-tuning with stochastic gradient descent, we selected the learning rate from {0.1, 0.01, 0.001}, the masking noise from {0.0, 0.25}, and the number of layers from {1, 3}. Fine-tuning was stopped based on the validation set, see [26]. Fig. 2 shows learned filters. Test error rates and the 95% Table 2: Results of deep networks pretrained by RFNs and other models (taken from [25, 26, 27, 28]). The test error rate is reported together with the 95% confidence interval. The best performing method is given in bold, as well as those for which confidence intervals overlap. The first column gives the dataset, the second the size of training, validation and test set, the last column indicates the number of hidden layers of the selected deep network. In only one case RFN pretraining was significantly worse than the best method but still the second best. In six out of the nine experiments RFN pretraining performed best, where in four cases it was significantly the best. Dataset SVM RBM DBN SAE SDAE RFN MNIST 50k-10k-10k 1.40 0.23 1.21 0.21 1.24 0.22 1.40 0.23 1.28 0.22 1.27 0.22 (1) basic 10k-2k-50k 3.03 0.15 3.94 0.17 3.11 0.15 3.46 0.16 2.84 0.15 2.66 0.14 (1) bg-rand 10k-2k-50k 14.58 0.31 9.80 0.26 6.73 0.22 11.28 0.28 10.30 0.27 7.94 0.24 (3) bg-img 10k-2k-50k 22.61 0.37 16.15 0.32 16.31 0.32 23.00 0.37 16.68 0.33 15.66 0.32 (1) rect 1k-0.2k-50k 2.15 0.13 4.71 0.19 2.60 0.14 2.41 0.13 1.99 0.12 0.63 0.06 (1) rect-img 10k-2k-50k 24.04 0.37 23.69 0.37 22.50 0.37 24.05 0.37 21.59 0.36 20.77 0.36 (1) convex 10k-2k-50k 19.13 0.34 19.92 0.35 18.63 0.34 18.41 0.34 19.06 0.34 16.41 0.32 (1) NORB 19k-5k-24k 11.6 0.40 8.31 0.35 - 10.10 0.38 9.50 0.37 7.00 0.32 (1) CIFAR 40k-10k-10k 62.7 0.95 40.39 0.96 43.38 0.97 43.25 0.97 - 41.29 0.95 (1) Micronuclei Figure 3: Examples of small and rare events identified by RFN in two drug design studies, which were missed by previous methods. Panel A and B: first row gives the coding unit, while the other rows display expression values of genes for controls (red), active drugs (green), and inactive drugs (black). Drugs (green) in panel A strongly downregulate the expression of tubulin genes which hints at a genotoxic effect by the formation of micronuclei (C). The micronuclei were confirmed by microscopic analysis (D). Drugs (green) in panel B show a transcriptional effect on genes with a negative feedback to the MAPK signaling pathway (E) and therefore are potential cancer drugs. confidence interval (computed according to [26]) for deep network pretraining by RFNs and other methods are given in Tab. 2. Best results and those with overlapping confidence intervals are given in bold. RFNs were only once significantly worse than the best method but still the second best. In six out of the nine experiments RFNs performed best, where in four cases it was significantly the best. Supplementary Section 14 shows results of RFN pretraining for convolutional networks, where RFN pretraining decreased the test error rates to 7.63% for CIFAR-10 and to 29.75% for CIFAR-100. RFNs in Drug Discovery. Using RFNs we analyzed gene expression datasets of two projects in the lead optimization phase of a big pharmaceutical company [29]. The first project aimed at finding novel antipsychotics that target PDE10A. The second project was an oncology study that focused on compounds inhibiting the FGF receptor. In both projects, the expression data was summarized by FARMS [30] and standardized. RFNs were trained with 500 hidden units, no masking noise, and a learning rate of η = 0.01. The identified transcriptional modules are shown in Fig. 3. Panels A and B illustrate that RFNs found rare and small events in the input. In panel A only a few drugs are genotoxic (rare event) by downregulating the expression of a small number of tubulin genes (small event). The genotoxic effect stems from the formation of micronuclei (panel C and D) since the mitotic spindle apparatus is impaired. Also in panel B, RFN identified a rare and small event which is a transcriptional module that has a negative feedback to the MAPK signaling pathway. Rare events are unexpectedly inactive drugs (black dots), which do not inhibit the FGF receptor. Both findings were not detected by other unsupervised methods, while they were highly relevant and supported decision-making in both projects [29]. 5 Conclusion We have introduced rectified factor networks (RFNs) for constructing very sparse and non-linear input representations with many coding units in a generative framework. Like factor analysis, RFN learning explains the data variance by its model parameters. The RFN learning algorithm is a posterior regularization method which enforces non-negative and normalized posterior means. We have shown that RFN learning is a generalized alternating minimization method which can be proved to converge and to be correct. RFNs had the sparsest code, the lowest reconstruction error, and the lowest covariance approximation error of all methods that yielded sparse representations (SP>10%). RFNs have shown that they improve performance if used for pretraining of deep networks. In two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that were so far missed by other unsupervised methods. These gene modules were highly relevant and supported the decision-making in both studies. RFNs are geared to large datasets, sparse coding, and many representational units, therefore they have high potential as unsupervised deep learning techniques. Acknowledgment. The Tesla K40 used for this research was donated by the NVIDIA Corporation. [1] G. E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504 507, 2006. [2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In B. Sch olkopf, J. C. Platt, and T. Hoffman, editors, NIPS, pages 153 160. MIT Press, 2007. [3] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85 117, 2015. [4] Y. Le Cun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436 444, 2015. [5] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807 814. Omnipress 2010, ISBN 978-1-60558-907-7, 2010. [6] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, volume 15, pages 315 323, 2011. [7] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929 1958, 2014. [8] S. Hochreiter, U. Bodenhofer, et al. FABIA: factor analysis for bicluster acquisition. Bioinformatics, 26(12):1520 1527, 2010. [9] S. Hochreiter. Hap FABIA: Identification of very short segments of identity by descent characterized by rare variants in large sequencing data. Nucleic Acids Res., 41(22):e202, 2013. [10] B. J. Frey and G. E. Hinton. Variational learning in nonlinear Gaussian belief networks. Neural Computation, 11(1):193 214, 1999. [11] M. Harva and A. Kaban. Variational learning for rectified factor analysis. Signal Processing, 87(3):509 527, 2007. [12] K. Ganchev, J. Graca, J. Gillenwater, and B. Taskar. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11:2001 2049, 2010. [13] J. Palmer, D. Wipf, K. Kreutz-Delgado, and B. Rao. Variational EM algorithms for non-Gaussian latent variable models. In NIPS, volume 18, pages 1059 1066, 2006. [14] D. P. Bertsekas. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Trans. Automat. Control, 21:174 184, 1976. [15] C. T. Kelley. Iterative Methods for Optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, 1999. [16] D. P. Bertsekas. Projected Newton methods for optimization problems with simple constraints. SIAM J. Control Optim., 20:221 246, 1982. [17] J. Abadie and J. Carpentier. Optimization, chapter Generalization of the Wolfe Reduced Gradient Method to the Case of Nonlinear Constraints. Academic Press, 1969. [18] J. B. Rosen. The gradient projection method for nonlinear programming. part ii. nonlinear constraints. Journal of the Society for Industrial and Applied Mathematics, 9(4):514 532, 1961. [19] E. J. Haug and J. S. Arora. Applied optimal design. J. Wiley & Sons, New York, 1979. [20] A. Ben-Tal and A. Nemirovski. Interior Point Polynomial Time Methods for Linear Programming, Conic Quadratic Programming, and Semidefinite Programming, chapter 6, pages 377 442. Society for Industrial and Applied Mathematics, 2001. [21] A. Gunawardana and W. Byrne. Convergence theorems for generalized alternating minimization procedures. Journal of Machine Learning Research, 6:2049 2073, 2005. [22] W. I. Zangwill. Nonlinear Programming: A Unified Approach. Prentice Hall, Englewood Cliffs, N.J., 1969. [23] N. Srebro. Learning with Matrix Factorizations. Ph D thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2004. [24] A. Hyv arinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Comput., 9(7):1483 1492, 1999. [25] Y. Le Cun, F.-J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Press, 2004. [26] P. Vincent, H. Larochelle, et al. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11:3371 3408, 2010. [27] H. Larochelle, D. Erhan, et al. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, pages 473 480, 2007. [28] A. Krizhevsky. Learning multiple layers of features from tiny images. Master s thesis, Deptartment of Computer Science, University of Toronto, 2009. [29] B. Verbist, G. Klambauer, et al. Using transcriptomics to guide lead optimization in drug discovery projects: Lessons learned from the {QSTAR} project. Drug Discovery Today, 20(5):505 513, 2015. [30] S. Hochreiter, D.-A. Clevert, and K. Obermayer. A new summarization method for Affymetrix probe level data. Bioinformatics, 22(8):943 949, 2006.