# SVM-Based Deep Stacking Networks

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Jingyuan Wang, Kai Feng, Junjie Wu

MOE Engineering Research Center of Advanced Computer Application Technology, School of Computer Science and Engineering, Beihang University, Beijing 100191, China
Beijing Key Laboratory of Emergency Support Simulation Technologies for City Operations, School of Economics and Management, Beihang University, Beijing 100191, China
Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing 100191, China
Email: {jywang, fengkai, wujj}@buaa.edu.cn

## Abstract

The deep network model, with the majority built on neural networks, has been proved to be a powerful framework to represent complex data for high-performance machine learning. In recent years, more and more studies have turned to non-neural-network approaches to build diverse deep structures, and the Deep Stacking Network (DSN) model is one such approach that uses stacked easy-to-learn blocks to build a parameter-training-parallelizable deep network. In this paper, we propose a novel SVM-based Deep Stacking Network (SVM-DSN), which uses the DSN architecture to organize linear SVM classifiers for deep learning. A BP-like layer tuning scheme is also proposed to ensure holistic and local optimization of the stacked SVMs simultaneously. Some good mathematical properties of SVM, such as convex optimization, are introduced into the DSN framework by our model. From a global view, SVM-DSN can iteratively extract data representations layer by layer as a deep neural network does, but with parallelizability; from a local view, each stacked SVM can converge to its optimal solution and obtain the support vectors, which, compared with neural networks, can lead to interesting improvements in anti-saturation and interpretability. Experimental results on both image and text data sets demonstrate the excellent performance of SVM-DSN compared with some competitive benchmark models.

## Introduction

Recent years have witnessed the tremendous interest of both academia and industry in building deep neural networks (Hinton and Salakhutdinov 2006; Bengio, Courville, and Vincent 2013). Many types of deep neural networks have been proposed for classification, regression and feature extraction tasks, such as Stacked Denoising Autoencoders (SAE) (Vincent et al. 2010), Deep Belief Networks (DBN) (Hinton 2011), deep Convolutional Neural Networks (CNN) (Krizhevsky, Sutskever, and Hinton 2012), Recurrent Neural Networks (RNN) (Medsker and Jain 2001), and so on. Meanwhile, the shortcomings of neural network based deep models, such as non-convex optimization, hard-to-parallelize training, and lacking model interpretation, are attracting more and more attention from the pertinent research communities. Some potential solutions have been proposed to build deep-structure models using non-neural-network approaches. For instance, in the literature, PCANet builds a deep model using an unsupervised convolutional principal component analysis (Chan et al. 2015). The gcForest builds a tree-based deep model using stacked random forests, which is regarded as a good alternative to deep neural networks (Zhou and Feng 2017). Deep Fisher Networks build deep networks by stacking Fisher vector encoding into multiple layers (Simonyan, Vedaldi, and Zisserman 2013).
Along this line, in this paper we propose a novel SVM-based Deep Stacking Network (SVM-DSN) for deep machine learning. On one hand, SVM-DSN belongs to the family of Deep Stacking Networks (DSN), which consist of many stacked multilayer base blocks that can be trained in a parallel way and achieve performance comparable to deep neural networks (Deng and Yu 2011; Deng, He, and Gao 2013). In this way, SVM-DSN gains the deep learning ability with extra scalability. On the other hand, we replace the traditional base blocks of a DSN, i.e., the perceptrons, with the well-known Support Vector Machine (SVM), which has long been regarded as a succinct model with appealing mathematical properties such as convexity in optimization, and is considered a different way to model complicated data distributions compared with deep neural networks (Bengio et al. 2009). In this way, SVM-DSN gains anti-saturation ability and enjoys improved interpretability, which are deemed tough challenges for deep neural networks. A BP-like Layered Tuning (BLT) algorithm is then proposed for SVM-DSN to conduct holistic and local optimization of all base-SVMs simultaneously.

Compared with traditional deep stacking networks and deep neural networks, the SVM-DSN model has the following advantages:

- The optimization of each base-SVM is convex. Using the proposed BLT algorithm, all base-SVMs are optimized as a whole, and meanwhile each base-SVM can also converge to its own optimum. The final solution of SVM-DSN is a group of optimized linear SVMs integrated as a deep model. This advantage allows SVM-DSN to avoid the neuron saturation problem of deep neural networks, and thus can improve performance.
- The SVM-DSN model is very easy to parallelize. The training parallelization in SVM-DSN can reach the base-SVM level due to the support-vector-oriented property of SVM, whereas the traditional DSN can only reach the block level.
- The SVM-DSN model has improved interpretability. The support vectors in base-SVMs can provide some insightful information about what a block has learned from the training data. This property empowers users to partially understand the feature extracting process of the SVM-DSN model.

Experimental results on image and sentiment classification tasks show that the SVM-DSN model obtains respectable improvements over neural networks. Moreover, compared with stacking models with strong base-learners, the SVM-DSN model also demonstrates significant advantages in performance.

*Figure 1: An illustration of the DSN architecture (Deng, He, and Gao 2013). The color is used to distinguish different blocks in a DSN; components in the same color belong to the same block.*

## The SVM-DSN Model

### Framework of Deep Stacking Network

The Deep Stacking Network is a scalable deep machine learning architecture (Deng, He, and Gao 2013; Deng and Yu 2011) that consists of easy-to-learn blocks stacked in a layer-by-layer manner. In the standard DSN, a block is a simplified multilayer perceptron with a single hidden layer. Let the input of a block be a vector $x$; the block uses a connection weight matrix $W$ to calculate the hidden layer vector $h$ as

$$h = \varphi\left(W^\top x\right), \tag{1}$$

where $\varphi(x) = 1/(1 + \exp(-x))$ is the sigmoid nonlinear activation function. Using a weight matrix $U$, the objective of the DSN block optimization is defined as

$$\min_{U} \ \left\| y - U^\top h \right\|^2. \tag{2}$$

As shown in Fig. 1, the blocks of a DSN are stacked layer by layer. For the block in the input layer, the input vector $x$ contains only the raw input features. For blocks in the middle layers, $x$ is a concatenated vector of the raw input features and the output representations of all previous-layer blocks.
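To make the block computation concrete, the following is a minimal NumPy sketch of one DSN block as described by Eqs. (1) and (2). The helper names (`dsn_block_fit`, `dsn_block_predict`) are our own; the hidden weights $W$ are simply random-initialized here for brevity (in practice they are also trained), and the convex objective of Eq. (2) is solved in closed form by least squares.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsn_block_fit(X, Y, n_hidden, seed=0):
    """One DSN block: h = sigmoid(W^T x) (Eq. 1), then min_U ||Y - H U||^2 (Eq. 2).

    X: (n_samples, n_features) block input; Y: (n_samples, n_classes) targets.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(X.shape[1], n_hidden))  # hidden weights (random here; trained in practice)
    H = sigmoid(X @ W)                                       # hidden representation of Eq. (1)
    U, *_ = np.linalg.lstsq(H, Y, rcond=None)                # closed-form solution of the convex Eq. (2)
    return W, U

def dsn_block_predict(X, W, U):
    return sigmoid(X @ W) @ U
```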
The training of a deep stacking network contains two steps: block training and fine-tuning. In the block training step, the DSN blocks are independently trained as supervised multilayer perceptrons. In the fine-tuning step, all the stacked blocks are considered as one multi-layer deep neural network, and the parameters of the DSN are trained end-to-end using the error Back Propagation (BP) algorithm.

### SVM-based DSN Blocks

In the SVM-DSN model, we adopt support vector machines to implement a DSN block. An SVM classifier is a hyperplane $\omega^\top x + b = 0$ that divides the feature space of a data sample $x$ into two parts, one for the positive class and the other for the negative class. The parameters $\omega$ and $b$ are optimized to maximize the minimum distance from the hyperplane to a set of training samples $T = \{(x_k, y_k) \mid y_k \in \{-1, 1\},\ k = 1, \dots, K\}$, i.e.,

$$\max_{\omega, b} \ \frac{2}{\|\omega\|} \quad \text{s.t.} \quad y_k\left(\omega^\top x_k + b\right) \ge 1, \quad k = 1, 2, \dots, K. \tag{3}$$

A training sample is called a support vector if the corresponding constraint in Eq. (3) holds with equality. For a multi-class problem with $N$ classes, we connect the input vector $x$ of a DSN block with $N$ binary SVM classifiers, each recognizing whether a sample belongs to the corresponding class, to predict the label of a sample. A binary SVM classifier in a DSN block is called a base-SVM, and the $N$ binary SVM classifiers for an $N$-class problem are called a base-SVM group. An SVM-DSN block can contain multiple base-SVM groups. Within the same block, all base-SVM groups share the same input vector $x$.

### Stacking Blocks

Given the classification hyperplane of an SVM, the decision function for a sample $x_k$ is expressed as

$$f(x_k) = \operatorname{sign}\left(\omega^\top x_k + b\right), \tag{4}$$

where $f(x_k) = 1$ for the positive class and $f(x_k) = -1$ for the negative class. The distance from a sample to the hyperplane can be considered as the confidence of a classification decision. For the samples behind the support vectors, i.e., $|\omega^\top x_k + b| > 1$, the confidence is 1; otherwise it is $|\omega^\top x_k + b|$. We therefore express the classification confidence of an SVM classifier for the sample $x_k$ as

$$g(x_k) = \min\left(1, \left|\omega^\top x_k + b\right|\right). \tag{5}$$

We denote the $i$-th base-SVM in layer $l$ as $\text{svm}(l, i)$ and its decision function and confidence as $f^{(l,i)}(\cdot)$ and $g^{(l,i)}(\cdot)$, respectively. For the base-SVM $\text{svm}(l, i)$, we define a confidence-weighted output $y^{(l,i)}$ as

$$y^{(l,i)} = f^{(l,i)}(x)\, g^{(l,i)}(x). \tag{6}$$

In layer $l+1$, SVM-DSN concatenates the confidence-weighted outputs of all base-SVMs in the previous layers with the raw inputs as

$$x^{(l+1)} = \left( y^{(l,1)}, \dots, y^{(l,i)}, \dots,\ y^{(l-1,1)}, \dots, y^{(l-1,i)}, \dots,\ y^{(1,1)}, \dots, y^{(1,i)}, \dots,\ x^{(1)}_1, \dots, x^{(1)}_i, \dots \right)^\top. \tag{7}$$

The base-SVMs in layer $l+1$ use $x^{(l+1)}$ as the input to generate their confidence-weighted outputs $y^{(l+1,i)}$. In this way, base-SVMs are stacked and connected layer by layer.

## Model Training

### Block Training

Similar to the standard deep stacking network, the training of the SVM-DSN model also contains a block training step and a fine-tuning step. In the block training step, the base-SVMs in a DSN block are trained as regular SVM classifiers. Given a set of training samples $T = \{(x_k, y_k) \mid k = 1, \dots, K\}$, where $y_k$ is the ground-truth label of $x_k$, the objective function of a base-SVM group for an $N$-class problem is defined as

$$\mathcal{J}(\theta) = \frac{1}{2}\|\Omega\|_F^2 + C \sum_{k=1}^{K} \sum_{i=1}^{N} \ell_{\text{hinge}}\!\left( y_k^{(i)} \left( \omega^{(i)\top} x_k + b^{(i)} \right) \right), \tag{8}$$

where $\Omega = \left( \omega^{(1)}, \dots, \omega^{(N)} \right)$, and $y_k^{(i)} = 1$ if $y_k = i$ and $-1$ otherwise. The function $\ell_{\text{hinge}}(\cdot)$ is the hinge loss, $\ell_{\text{hinge}}(z) = \max(0, 1 - z)$. The parameter set $\theta = \{(\omega^{(i)}, b^{(i)}) \mid \forall i\}$ is inferred as $\theta^* = \arg\min_{\theta} \mathcal{J}(\theta)$.
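As a concrete illustration of Eqs. (4)-(8), the sketch below trains one base-SVM group with per-class linear SVMs (scikit-learn's `LinearSVC` stands in for the hinge-loss objective of Eq. (8)), turns the raw margins into the confidence-weighted outputs of Eqs. (4)-(6), and assembles the next-layer input of Eq. (7). All function names are our own, and the snippet is a simplification rather than the authors' implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_base_svm_group(X, y, n_classes, C=1.0):
    """Block training in the spirit of Eq. (8): one linear SVM per class (one-vs-rest)."""
    group = []
    for i in range(n_classes):
        y_bin = np.where(y == i, 1, -1)                      # y_k^(i) in Eq. (8)
        clf = LinearSVC(C=C, loss="hinge", max_iter=20000).fit(X, y_bin)
        group.append((clf.coef_.ravel(), float(clf.intercept_[0])))
    return group                                              # [(omega^(i), b^(i)), ...]

def confidence_weighted_outputs(group, X):
    """Eqs. (4)-(6): y = sign(w.x + b) * min(1, |w.x + b|), i.e. the margin clipped to [-1, 1]."""
    Z = np.stack([X @ w + b for w, b in group], axis=1)       # one column of raw margins per class
    return np.clip(Z, -1.0, 1.0)

def next_layer_input(prev_outputs, X_raw):
    """Eq. (7): concatenate the confidence-weighted outputs of all previous layers with the raw input."""
    return np.concatenate(prev_outputs + [X_raw], axis=1)
```

For example, `next_layer_input([confidence_weighted_outputs(g, X_raw) for g in layer1_groups], X_raw)` would build $x^{(2)}$ under these assumptions.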
In order to increase the diversity of the base-SVM groups in a block, we adopt a bootstrap aggregating method in block training. For a block with $M$ base-SVM groups, we re-sample the training data into $M$ sets using the bootstrap method (Efron and Tibshirani 1994). Each base-SVM group is trained using one re-sampled data set.

### Fine Tuning

The traditional DSN model is based on neural networks and uses the BP algorithm in the fine-tuning step. For the SVM-DSN model, we introduce SVM training into the BP algorithm framework, and propose a BP-like Layered Tuning (BLT) algorithm to fine-tune the model parameters. Algorithm 1 gives the pseudocode of BLT. In general, BLT iteratively optimizes the base-SVMs from the output layer to the input layer. In each iteration, BLT optimizes $\text{svm}(l, i)$ by first generating a set of virtual training samples $T^{(l,i)} = \{(x_k^{(l)}, \tilde{y}_k^{(l,i)}) \mid k = 1, \dots, K\}$, and then training a new $\text{svm}(l, i)$ on $T^{(l,i)}$. According to Eq. (6) and Eq. (7), it is easy to obtain $x_k^{(l)} = \left( y_k^{(l-1,1)}, y_k^{(l-1,2)}, \dots, y_k^{(l-1,i)}, \dots \right)^\top$. However, the calculation of the virtual label $\tilde{y}_k^{(l,i)}$ is not that straightforward.

**Algorithm 1: BP-like Layered Tuning**

1. Initialization: initialize $\omega^{(l,i)}$ and $b^{(l,i)}$ for every $\text{svm}(l, i)$ with random values.
2. Repeat until the algorithm converges:
   1. Select a batch of training samples $T = \{(x_k, y_k) \mid k = 1, \dots, K\}$.
   2. For $l = L, L-1, \dots, 2, 1$ and, within each layer, for $i = 1, 2, \dots$:
      1. Use Eq. (6), Eq. (7), and Eq. (9) to calculate $T^{(l,i)} = \{(x_k^{(l)}, \tilde{y}_k^{(l,i)}) \mid k = 1, \dots, K\}$.
      2. Use $T^{(l,i)}$ to train $\text{svm}(l, i)$ as in Eq. (11).

Specifically, BLT adopts a gradient descent method to calculate $\tilde{y}_k^{(l,i)}$ as

$$\tilde{y}_k^{(l,i)} = \sigma\!\left( y_k^{(l,i)} - \eta \left. \frac{\partial \mathcal{J}^{(o)}}{\partial y^{(l,i)}} \right|_{y^{(l,i)} = y_k^{(l,i)}} \right), \tag{9}$$

where $\mathcal{J}^{(o)}$ is the objective function of the output layer, $y_k^{(l,i)}$ is the output of $x_k^{(l)}$ in the previous iteration, $\eta$ is the learning rate, and $\sigma(\cdot)$ is a shaping function defined as

$$\sigma(z) = \begin{cases} 1, & z > 1 \\ z, & |z| \le 1 \\ -1, & z < -1. \end{cases} \tag{10}$$

Note that since the term $-\eta\, \partial \mathcal{J}^{(o)} / \partial y^{(l,i)}$ in Eq. (9) is a negative gradient direction of $\mathcal{J}^{(o)}$, tuning the output $y^{(l,i)}$ toward the virtual label $\tilde{y}^{(l,i)}$ reduces the value of the output-layer objective $\mathcal{J}^{(o)}$. Therefore, it can be expected that BLT lowers the overall model prediction error iteratively by training base-SVMs on virtual training sets in each iteration.

Given the training set $T^{(l,i)} = \{(x_k^{(l)}, \tilde{y}_k^{(l,i)}) \mid k = 1, \dots, K\}$, the objective function for training $\text{svm}(l, i)$ is defined as

$$\min \ \mathcal{J}^{(l,i)} = \frac{1}{2}\left\|\omega^{(l,i)}\right\|^2 + C_1 \underbrace{\sum_{k \in \Theta} \ell_{\text{hinge}}\!\left( \tilde{y}_k^{(l,i)} \left( \omega^{(l,i)\top} x_k^{(l)} + b^{(l,i)} \right) \right)}_{\text{the SVM loss}} + C_2 \underbrace{\sum_{k \notin \Theta} \ell_{\epsilon}\!\left( \omega^{(l,i)\top} x_k^{(l)} + b^{(l,i)} - \tilde{y}_k^{(l,i)} \right)}_{\text{the SVR loss}}, \tag{11}$$

where $\Theta$ is the index set of the virtual labels with $|\tilde{y}_k^{(l,i)}| = 1$, and $\ell_{\epsilon}(\cdot)$ is the $\epsilon$-insensitive loss, $\ell_{\epsilon}(z) = \max(|z| - \epsilon, 0)$.

Note that the objective function in Eq. (11) contains two types of loss functions so as to adapt to the different conditions of $\tilde{y}_k^{(l,i)}$. When $\tilde{y}_k^{(l,i)} \in \{-1, 1\}$, i.e., the virtual labels are binary, BLT trains $\text{svm}(l, i)$ as a standard SVM classifier and thus uses the hinge loss to measure errors. When $\tilde{y}_k^{(l,i)} \in (-1, 1)$, the objective function adopts a Support Vector Regression loss term $\ell_{\epsilon}$ instead. In the Appendix, we prove that the problem defined in Eq. (11) is a quadratic convex optimization problem. The training of $\text{svm}(l, i)$ can thus reach an optimal solution using various quadratic programming methods such as sequential minimal optimization or gradient descent.
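The following NumPy sketch illustrates the virtual-label construction of Eqs. (9)-(10) and one subgradient of the mixed objective in Eq. (11). The paper solves Eq. (11) with quadratic programming; the plain subgradient shown here, and all function names, are our own simplification under the reconstruction of Eq. (11) above.

```python
import numpy as np

def shaping(z):
    """Shaping function sigma(.) of Eq. (10): clip to [-1, 1]."""
    return np.clip(z, -1.0, 1.0)

def virtual_labels(y_prev, grad_J_wrt_y, eta=0.0005):
    """Eq. (9): move the previous outputs along the negative gradient of the output-layer loss."""
    return shaping(y_prev - eta * grad_J_wrt_y)

def blt_loss_subgradient(w, b, X, y_virtual, C1=1.0, C2=1.0, eps=0.1):
    """One subgradient of the mixed objective in Eq. (11).

    Binary virtual labels (|y| = 1, the set Theta) get the hinge (SVM) loss;
    fractional ones get the eps-insensitive (SVR) loss.
    """
    z = X @ w + b
    binary = np.abs(y_virtual) == 1.0                 # index set Theta
    g_w, g_b = w.copy(), 0.0                          # from the 0.5 * ||w||^2 term
    # hinge part: max(0, 1 - y * z), active when y * z < 1
    viol = binary & (y_virtual * z < 1.0)
    g_w -= C1 * (X[viol].T @ y_virtual[viol])
    g_b -= C1 * y_virtual[viol].sum()
    # eps-insensitive part: max(|z - y| - eps, 0), active when |z - y| > eps
    resid = z - y_virtual
    out = (~binary) & (np.abs(resid) > eps)
    g_w += C2 * (X[out].T @ np.sign(resid[out]))
    g_b += C2 * np.sign(resid[out]).sum()
    return g_w, g_b
```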
We finally turn to the remaining problem of how to calculate the partial derivative $\partial \mathcal{J}^{(o)} / \partial y^{(l,i)}$ in Eq. (9). Based on the chain rule, the partial derivative can be recursively calculated as

$$\frac{\partial \mathcal{J}^{(o)}}{\partial y^{(l,i)}} = \sum_{(m,j)} \frac{\partial \mathcal{J}^{(o)}}{\partial y^{(m,j)}} \frac{\mathrm{d} y^{(m,j)}}{\mathrm{d} z^{(m,j)}} \frac{\partial z^{(m,j)}}{\partial y^{(l,i)}} = \sum_{(m,j)} \frac{\partial \mathcal{J}^{(o)}}{\partial y^{(m,j)}}\, y'\!\left(z^{(m,j)}\right) \omega_i^{(m,j)}, \tag{12}$$

where the sum runs over the base-SVMs $\text{svm}(m, j)$ in the layers above $l$ whose inputs contain $y^{(l,i)}$, $\omega_i^{(m,j)}$ is the connection weight of $y^{(l,i)}$ in $\text{svm}(m, j)$, and $z^{(m,j)} = \omega^{(m,j)\top} x^{(m)} + b^{(m,j)}$. The term $y'(z)$ is the derivative of the output function in Eq. (6), which takes the form

$$y'(z) = \begin{cases} 0, & |z| > 1 \\ 1, & |z| \le 1. \end{cases} \tag{13}$$

The principle of this chain derivation is similar to the error back-propagation used in neural network training. The difference lies in that the BP algorithm calculates the derivative for each neuron connection weight, whereas BLT calculates it for each base-SVM output. That is why we name our algorithm BP-like Layered Tuning.

## Model Properties

### Connection to Neural Networks

The SVM-DSN model has close relations with neural networks. If we view the base-SVM output function defined in Eq. (6) as a neuron, the SVM-DSN model can be regarded as a type of neural network. Specifically, we can rewrite the function in Eq. (6) in a neuron form as follows:

$$y^{(l,i)} = \sigma\!\left( \omega^{(l,i)\top} x^{(l)} + b^{(l,i)} \right), \tag{14}$$

where the shaping function $\sigma(\cdot)$ works as an activation function, with output $\sigma(z) \in \{-1, 1\}$ if $|z| \ge 1$ and $\sigma(z) = z$ if $|z| < 1$. As proved in (Hornik 1991), a multilayer feedforward neural network with an arbitrary bounded and non-constant activation function has universal approximation capability. As a consequence, we can expect that the proposed SVM-DSN model also has universal approximation capability in theory.

Nevertheless, the difference between the SVM-DSN model and neural networks is still significant. Indeed, we prove in the Appendix that the base-SVMs in our SVM-DSN model have the following property: given a set of virtual training samples $\{(x_k^{(l)}, \tilde{y}_k^{(l,i)}) \mid k = 1, \dots, K\}$ for $\text{svm}(l, i)$, minimizing the loss function defined in Eq. (11) is a convex optimization problem. Moreover, because the base-SVMs in the same block are mutually independent, the optimization of the whole block is a convex problem too. This implies that, in each iteration of the BLT algorithm, all blocks can converge to an optimal solution. In other words, SVM-DSN ensures that all blocks and their base-SVMs "do their own best" to minimize their own objective functions in each iteration, which is not the case for neural networks and MLP-based deep stacking networks. It is also worth noting that this "do their own best" property is compatible with the decrease of the overall prediction error measured by the global objective function $\mathcal{J}^{(o)}$.

An important advantage empowered by the "do their own best" property is the anti-saturation feature of SVM-DSN. In neural network models, the BP algorithm updates the parameter $\omega$ of a neuron as $\omega \leftarrow \omega - \eta\, \partial \mathcal{J} / \partial \omega$. Hence, the partial derivative for the $i$-th weight of the $j$-th neuron in layer $l$ is calculated as

$$\frac{\partial \mathcal{J}}{\partial \omega_i^{(l,j)}} = \frac{\partial \mathcal{J}}{\partial y^{(l,j)}} \frac{\mathrm{d} y^{(l,j)}}{\mathrm{d} z^{(l,j)}} \frac{\partial z^{(l,j)}}{\partial \omega_i^{(l,j)}} = \frac{\partial \mathcal{J}}{\partial y^{(l,j)}}\, y'\!\left(z^{(l,j)}\right) y^{(l-1,i)}, \tag{15}$$

where $y'(z^{(l,j)})$ is the derivative of the activation function. For the sigmoid activation function, if $|z|$ is very large then $y'$ becomes very small, and $\partial \mathcal{J} / \partial \omega \to 0$. In this condition, the BP algorithm cannot update $\omega$ any more, even if there is still much room for the optimization of $\omega$. This phenomenon is called neuron saturation in neural network training. For the ReLU activation function, a similar condition appears when $z < 0$, where $y' = 0$ and $\partial \mathcal{J} / \partial \omega = 0$; the neuron "dies" when it falls into this condition.

In the BLT algorithm of the SVM-DSN model, the update of a base-SVM is guided by $\partial \mathcal{J} / \partial y$, with the details given in Eq. (12). From Eq. (12), we can see that unless all base-SVMs in the upper layers are saturated, i.e., $y'(z^{(m,j)}) = 0$ for all $z^{(m,j)}$, the base-SVMs in layer $l$ do not fall into the saturation state. Therefore, we can expect that the saturation risk of a base-SVM in SVM-DSN is much lower than that of a neuron in a neural network.
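To illustrate the quantity back-propagated by Eqs. (12)-(13), and why it vanishes only when every upper-layer base-SVM is saturated, here is a small NumPy sketch of one layer-to-layer step; the array layout and the function names are our own assumptions.

```python
import numpy as np

def output_derivative(z):
    """Eq. (13): derivative of the clipped output y = sigma(z); 1 inside the margin, 0 outside."""
    return (np.abs(z) <= 1.0).astype(float)

def backprop_to_lower_layer(grad_upper, Z_upper, W_upper):
    """Eq. (12): dJ/dy^(l,i) = sum_j dJ/dy^(m,j) * y'(z^(m,j)) * w_i^(m,j).

    grad_upper: (n_upper,) gradients dJ/dy^(m,j) for the upper-layer base-SVMs.
    Z_upper:    (n_upper,) their pre-activation values z^(m,j).
    W_upper:    (n_upper, n_lower) rows w^(m,j), restricted to the lower-layer outputs.
    Returns the (n_lower,) gradients dJ/dy^(l,i).
    """
    return (grad_upper * output_derivative(Z_upper)) @ W_upper

# The returned gradient is all-zero only if every upper base-SVM has |z| > 1 (all saturated),
# whereas a single sigmoid or ReLU neuron can already block its own weight update.
```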
### Interpretation

In SVM classifiers, the support vector samples can provide some interpretation of what a classifier has learned from the data set. SVM-DSN inherits this interpretation property of SVM. In the base-SVM output function defined in Eq. (6), the confidence function $g(x)$ indicates whether a sample is clearly classified by the hyperplane of a base-SVM. For a sample to be classified, we can calculate its average confidence over all base-SVM groups in a block. The average confidence indicates whether the features extracted by the block and the previous layers offer enough representation to identify the label of the sample. Because samples with low confidence lie near the hyperplane, we can use the low-confidence samples to form a classifying plane map for each block. Comparing the classifying plane maps layer by layer, we can partly understand the feature extracting process of the stacked blocks in an SVM-DSN model.

*Figure 2: The classifying plane maps of each layer in the SVM-DSN: (a) Layer-1, (b) Layer-2, (c) Layer-3.*

*Figure 3: The circle data.*

We here give a showcase to explain the interpretation property of SVM-DSN in a direct way. In this case, we generate a circle data set containing samples of two classes, as shown in Fig. 3. The positive samples are generated from a circle with radius 0.5 plus Gaussian noise with variance 0.1. The negative samples are generated from a circle with radius 1 plus the same Gaussian noise. We use the circle data set to train a 3-layer SVM-DSN model, where the middle layers contain 40 and 60 base-SVM groups, respectively. Because this experiment is a binary classification, a base-SVM group contains only one base-SVM. In the experiment, we traverse the feature space from the coordinate point (-2, -2) to (2, 2). For each coordinate point, we calculate the average confidence of all base-SVMs in each layer. Figs. 2a-2c plot the confidence distribution maps of the different layers. The samples with low confidence lie near the SVM classification hyperplane, so the low-confidence areas in red form the classifying plane of a layer. There is a clear classifying plane generation process from Fig. 2a to Fig. 2c. In layer 1, the low confidence values concentrate in the center of the map. In layer 2, the low confidence values take a vague circle-like shape. In layer 3, the low confidence values are distributed as a circle that clearly divides the feature space into two parts. This process demonstrates how an SVM-DSN model extracts the data representations of the feature space layer by layer.
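The classifying plane maps of Fig. 2 can be reproduced along the following lines: traverse a 2-D grid, map each point to a layer's input, and average the confidence $g(x)$ of Eq. (5) over that layer's base-SVMs. The sketch below is our own illustration; `forward_to_layer` is a hypothetical callable that performs the stacked forward pass up to the chosen layer.

```python
import numpy as np

def average_confidence_map(layer_svms, forward_to_layer, grid_min=-2.0, grid_max=2.0, n=200):
    """Average confidence g(x) (Eq. 5) of one layer's base-SVMs over a 2-D grid.

    layer_svms:       list of (omega, b) pairs for the base-SVMs of the layer.
    forward_to_layer: callable mapping raw 2-D points to that layer's input x^(l).
    Low average confidence marks the layer's classifying plane (the red areas in Fig. 2).
    """
    xs = np.linspace(grid_min, grid_max, n)
    grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)   # all coordinate points
    X_l = forward_to_layer(grid)                                    # layer input for each point
    conf = [np.minimum(1.0, np.abs(X_l @ w + b)) for w, b in layer_svms]
    return np.mean(conf, axis=0).reshape(n, n)
```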
### Parallelization

The deep stacking network was proposed for parallel parameter learning. For a DSN with $L$ blocks, given a group of training samples, the DSN can resample the data set into $L$ batches, and each DSN block uses only one batch to update its parameters. In this way, the training of DSN parameters can be deployed over $L$ processing units. The parallelization of DSN training is at the block level.

This parallel parameter learning property can be further extended by using parallel support vector machines (Graf et al. 2005). Because only the support vector samples are crucial to an SVM classifier, we can divide a training set into $M$ sub-sets and use $M$ virtual SVM classifiers to select support vector candidates from each sub-set. Finally, we use the support vector candidates of all sub-sets to train the final model. In this way, the training of an SVM-DSN block can be deployed over $M$ processing units, and the whole SVM-DSN model can be deployed over $L \times M$ processors. That is to say, the training of the base-SVMs within a block is also parallelizable in SVM-DSN. The parallelization of SVM-DSN training is at the base-SVM level, so the parallel degree of the whole model is greatly improved. As reported in (Graf et al. 2005), the speed-up of a 5-layer cascade SVM (16 parallel SVMs) over a single SVM is about 10 times per pass and 5 times to full convergence.

## Experiments

### Image Classification Performance

We first test the performance of SVM-DSN on the MNIST image classification database (LeCun et al. 1998). The MNIST database contains 60,000 handwritten digit images of size 28x28 for training and validation, and 10,000 images for testing. The images are classified into ten classes according to the digits written on them. The SVM-DSN model used in this experiment consists of three layers: two middle layers and an output layer. Each block in the two middle layers contains 20 base-SVM groups, and each group contains 10 base-SVMs, i.e., 200 base-SVMs per layer. We also tried models with more layers, but the classification performance could not be improved significantly.

The benchmark models include: i) a 3-layer deep stacking network where each middle-layer block contains 200 hidden neurons; ii) an SVM-DSN with the same structure trained by the BP algorithm (as analyzed in the Model Properties section, SVM-DSN can be considered a type of neural network, so BP training is applicable); iii) 3-layer neural network models with the same neuron connection structure as the 3-layer SVM-DSN, using an SVM output layer (Tang 2013) and different activation functions in the neurons; iv) the best 3-layer NN benchmark listed on the MNIST homepage, a 3-layer NN with 500+300 hidden units, cross-entropy loss and weight decay (LeCun et al. 1998); v) a bagging of 41 base-SVM groups, each containing 10 base-SVMs.

In the SVM-DSN fine-tuning, we have two hyper-parameters to set, i.e., $C_1$ and $C_2$ in the base-SVM objective function. The two hyper-parameters balance structural risk and empirical risk in a base-SVM; setting them either too large or too small may degrade model performance. Therefore, we use trial and error to set these hyper-parameters. The learning rate $\eta$ is the other hyper-parameter, which could be set dynamically using elegant algorithms such as Adam. In our experiments, we directly set the learning rate to the fixed value $\eta = 0.0005$ to ensure experimental fairness. Table 1 gives the MNIST image classification results.

Table 1: MNIST Classification Performance

| Models | Error Rate (%) |
| --- | --- |
| 3-layer SVM-DSN | 1.49 |
| 3-layer DSN | 1.65 |
| 3-layer SVM-DSN, BP | 1.62 |
| 3-layer NN, SVM output, sigmoid | 1.74 |
| 3-layer NN, SVM output, tanh | 1.59 |
| 3-layer NN, SVM output, ReLU | 1.56 |
| Homepage benchmark (LeCun et al. 1998) | 1.53 |
| Bagging of base-SVMs | 5.41 |
The SVM-DSN model achieves the best performance among all the benchmarks, which verifies the effectiveness of SVM-DSN. Among the benchmarks, the 3-layer SVM-DSN, BP model has the same structure as SVM-DSN but is trained by the BP algorithm. The results show that SVM-DSN performs better than this benchmark, which indicates that the "base-SVMs do their own best" feature of SVM-DSN is beneficial to model performance. In fact, the idea of BLT fine-tuning can be extended to optimize deep stacking networks with any differentiable model as blocks, such as soft decision trees and linear discriminant analysis.

### Feature Extractor Compatibility

Currently, mainstream image classifiers usually adopt convolutional neural networks (CNN) as feature extractors. Table 2 reports the MNIST classification performance of SVM-DSN with a CNN feature extractor. In this experiment, we connect a CNN feature extractor with a 3-layer SVM-DSN model. The structure of the SVM-DSN is the same as in Table 1. The CNN feature extractor contains 3 convolutional layers, and each layer consists of 24 filters with a 5x5 receptive field. In the CNN and SVM-DSN mixture model, we first pre-train the CNN part using BP, and then use the features extracted by the CNN as the input of the SVM-DSN part to train the blocks. In the fine-tuning step, the CNN part is fine-tuned by BP and the SVM-DSN part is tuned by BLT.

The benchmark models include: i) the same CNN feature extractor connected with a 3-layer DSN, where the 3-layer DSN has the same structure as the 3-layer SVM-DSN model; ii) the CNN + SVM-DSN model trained by the BP algorithm; iii) neural networks consisting of 3 CNN layers and a 3-layer neural network with different activation functions, where the structures of the CNN and the neural network are the same as in the CNN + SVM-DSN model; iv) a trainable CNN feature extractor + SVMs with affine distortions, which is the best benchmark of a similar model scale listed on the MNIST homepage; v) the gcForest with convolutional kernels (Zhou and Feng 2017).

Table 2: MNIST Classification with CNN Feature Extractor

| Models | Error Rate (%) |
| --- | --- |
| CNN + SVM-DSN | 0.51 |
| CNN + DSN | 0.60 |
| CNN + SVM-DSN, BP | 0.72 |
| CNN + sigmoid activation | 0.80 |
| CNN + tanh activation | 0.67 |
| CNN + ReLU activation | 0.58 |
| Homepage benchmark (LeCun et al. 1998) | 0.54 |
| gcForest (Zhou and Feng 2017) | 0.74 |

As shown in Table 2, the CNN + SVM-DSN model achieves the best performance, which verifies the effectiveness of our model again. What is more, this experiment demonstrates that the SVM-DSN model is fully compatible with the neural network framework. Other types of networks, such as RNN and LSTM, could also be used as feature extractors of SVM-DSN to adapt to diverse application scenarios.

### Comparison with Ensemble Models

In this section, we compare the performance of SVM-DSN with several classical ensemble models. Because ensemble models usually do not perform very well on the MNIST data set, for a fair comparison we use the IMDB sentiment classification data set (Maas et al. 2011) in this experiment. Many tree-based ensemble methods have achieved good performance on this data set (Zhou and Feng 2017).
The IMDB dataset contains 25,000 movie reviews for training and 25,000 for testing. The movie reviews are represented by tf-idf features and labeled as positive or negative. The SVM-DSN model used in this experiment consists of 4 middle layers, and the numbers of base-SVMs in the middle layers are 1024-1024-512-256. The benchmark models include Random Forest, XGBoost, AdaBoost and SVM (linear kernel). The four benchmarks are also stacked as a stacking benchmark (Perlich and Swirszcz 2011). A bagging of base-SVMs and the gcForest (Zhou and Feng 2017) are also included as competitors. A 4-layer DSN is used as a benchmark, where the numbers of hidden neurons in the middle-layer blocks are the same as the numbers of base-SVMs in the SVM-DSN model. The SVM-DSN model trained by the BP algorithm is also used as a benchmark.

Table 3: IMDB Classification Performance

| Models | Error Rate (%) |
| --- | --- |
| SVM-DSN | 10.51 |
| DSN | 11.15 |
| SVM-DSN, BP | 11.42 |
| Random Forest | 14.68 |
| XGBoost | 14.77 |
| AdaBoost | 16.63 |
| SVM (linear kernel) | 12.43 |
| Stacking | 11.55 |
| Bagging of base-SVMs | 11.66 |
| gcForest (Zhou and Feng 2017) | 10.84 |

As shown in Table 3, the SVM-DSN model achieves the best performance again. In particular, the performance of SVM-DSN is better than the stacking benchmark, which indicates that holistically optimized multi-layer stacking of linear base-learners can beat the traditional two-layer stacking of strong base-learners. In this experiment, the BLT algorithm again outperforms the BP algorithm; the "do their own best" feature of base-SVMs in BLT is still effective in text sentiment classification.

## Related Works

This work has close relations with SVM, deep learning, and stacking. The support vector machine was first proposed by Vapnik (Vapnik 1998). Multi-layer structures in SVM were usually used as speed-up solutions. In cascade SVM (Graf et al. 2005), a multi-layer cascade SVM structure was used to select support vectors in a parallel way. In the literature (Collobert, Bengio, and Bengio 2002), a parallel mixture stacking structure was proposed to speed up SVM training for very-large-scale problems. Before our work, some studies proposed to use SVM to replace the output layer of a neural network (Wiering et al. 2013; Tang 2013).

In recent years, neural network based deep models have achieved great success in various applications (Hinton and Salakhutdinov 2006). The gcForest model (Zhou and Feng 2017) was proposed as a forest-based deep alternative to deep neural networks. PCANet builds a deep model using unsupervised convolutional principal component analysis (Chan et al. 2015). LDANet is a supervised extension of PCANet, which uses linear discriminant analysis (LDA) to replace the PCA parts of PCANet (Chan et al. 2015). Deep Fisher Networks build deep networks by stacking Fisher vector encoding into multiple layers (Simonyan, Vedaldi, and Zisserman 2013). The DSN framework adopted in this work is a scalable deep architecture amenable to parallel parameter training, and has been adopted in various applications, such as information retrieval (Deng, He, and Gao 2013), image classification (Li, Chang, and Yang 2015), and speech pattern classification (Deng and Yu 2011). T-DSN uses tensor blocks to incorporate higher-order statistics of the hidden binary features (Hutchinson, Deng, and Yu 2013). The CCNN model extends the DSN framework using convolutional neural networks (Zhang, Liang, and Wainwright 2016). To the best of our knowledge, very few works have introduced the advantages of SVM into the DSN framework.

Stacking was introduced by Wolpert (Wolpert 1992) as a scheme for combining multiple generalizers.
In many real-world applications, stacking methods have been used to integrate strong base-learners into an ensemble model to improve performance (Jahrer, Töscher, and Legenstein 2010). In the literature, most stacking works focus on designing elegant meta-learners and creating better base-learners, such as using class probabilities in stacking (Ting and Witten 1999), using a weighted average to combine stacked regressions (Rooney and Patterson 2007), training base-learners using cross-validation (Perlich and Swirszcz 2011), and applying ant colony optimization to configure base-learners (Chen, Wong, and Li 2014). To the best of our knowledge, very few works study how to optimize multi-layer stacked base-learners as a whole.

## Conclusion

In this paper, we proposed an SVM-DSN model in which linear base-SVMs are stacked and trained in a deep stacking network manner. In the SVM-DSN model, the good mathematical properties of SVMs and the flexible model structure of deep stacking networks are nicely combined in the same framework. The SVM-DSN model has many advantageous properties, including holistic and local optimization, parallelization and interpretability. The experimental results demonstrated the superiority of the SVM-DSN model over some benchmark methods.

## Acknowledgments

Prof. J. Wang's work was partially supported by the National Key Research and Development Program of China (No. 2016YFC1000307), the National Natural Science Foundation of China (NSFC) (61572059, 61202426), the Science and Technology Project of Beijing (Z181100003518001), and the CETC Union Fund (6141B08080401). Prof. J. Wu was partially supported by the National Natural Science Foundation of China (NSFC) (71531001, 71725002, U1636210, 71471009, 71490723).

## Appendix

Property: Given a set of virtual samples $T^{(l,i)} = \{(x_k^{(l)}, \tilde{y}_k^{(l,i)}) \mid k = 1, \dots, K\}$ for $\text{svm}(l, i)$, minimizing the loss function defined in Eq. (11) is a convex optimization problem.

Proof. For the sake of simplicity, we omit the superscripts of $x_k^{(l)}$ and $\tilde{y}_k^{(l,i)}$ in the proof. We define a constrained optimization problem of the form

$$\begin{aligned}
\min_{\omega, b, \xi_k, \hat{\xi}_k, \zeta_k} \quad & \frac{1}{2}\|\omega\|^2 + C_1 \sum_{k \in \Theta} \zeta_k + C_2 \sum_{k \notin \Theta} \left( \xi_k + \hat{\xi}_k \right) \\
\text{s.t.} \quad & 1 - \tilde{y}_k \left( \omega^\top x_k + b \right) \le \zeta_k, \quad \zeta_k \ge 0, \quad k \in \Theta; \\
& \omega^\top x_k + b - \tilde{y}_k \le \epsilon + \xi_k, \\
& \tilde{y}_k - \left( \omega^\top x_k + b \right) \le \epsilon + \hat{\xi}_k, \\
& \xi_k \ge 0, \quad \hat{\xi}_k \ge 0, \quad k \notin \Theta.
\end{aligned} \tag{16}$$

The constrained optimization problem in Eq. (16) has the quadratic programming form

$$\min_{a} \ \frac{1}{2} a^\top U a + c^\top a \quad \text{s.t.} \quad Q a \preceq p, \tag{17}$$

where $a = (\omega, b, \xi, \hat{\xi}, \zeta)$ and $U$ is a positive semi-definite diagonal matrix. Therefore, the constrained optimization problem is a quadratic convex optimization problem (Boyd and Vandenberghe 2004). It is easy to prove that the constrained problem defined in Eq. (16) is equivalent to the unconstrained problem defined in Eq. (11) (Zhang 2003). Therefore, the optimization problem of a base-SVM is equivalent to the problem defined in Eq. (16), and hence is a convex optimization problem.
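For concreteness, the constrained problem in Eq. (16) can be handed directly to a generic convex solver. The CVXPY sketch below mirrors the variables of the proof ($\omega$, $b$, $\zeta$, $\xi$, $\hat{\xi}$); it is our own illustration rather than the authors' code, and it assumes both index sets are non-empty.

```python
import numpy as np
import cvxpy as cp

def solve_eq16(X, y_virtual, C1=1.0, C2=1.0, eps=0.1):
    """Solve the constrained QP of Eq. (16) for one base-SVM on its virtual samples."""
    K, d = X.shape
    idx_t = np.where(np.abs(y_virtual) == 1.0)[0]   # Theta: binary virtual labels (hinge part)
    idx_o = np.where(np.abs(y_virtual) < 1.0)[0]    # fractional virtual labels (eps-insensitive part)
    w, b = cp.Variable(d), cp.Variable()
    zeta = cp.Variable(len(idx_t), nonneg=True)     # hinge slacks
    xi = cp.Variable(len(idx_o), nonneg=True)       # SVR slacks
    xi_hat = cp.Variable(len(idx_o), nonneg=True)
    m = X @ w + b                                    # margins for all samples
    constraints = [
        1 - cp.multiply(y_virtual[idx_t], m[idx_t]) <= zeta,
        m[idx_o] - y_virtual[idx_o] <= eps + xi,
        y_virtual[idx_o] - m[idx_o] <= eps + xi_hat,
    ]
    objective = 0.5 * cp.sum_squares(w) + C1 * cp.sum(zeta) + C2 * cp.sum(xi + xi_hat)
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return w.value, b.value
```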
## References

Bengio, Y., et al. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1):1-127.

Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 35(8):1798-1828.

Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.

Chan, T.-H.; Jia, K.; Gao, S.; Lu, J.; Zeng, Z.; and Ma, Y. 2015. PCANet: A simple deep learning baseline for image classification? IEEE Transactions on Image Processing 24(12):5017-5032.

Chen, Y.; Wong, M.-L.; and Li, H. 2014. Applying ant colony optimization to configuring stacking ensembles for data mining. Expert Systems with Applications 41(6):2688-2702.

Collobert, R.; Bengio, S.; and Bengio, Y. 2002. A parallel mixture of SVMs for very large scale problems. In NIPS, 633-640.

Deng, L., and Yu, D. 2011. Deep convex net: A scalable architecture for speech pattern classification. In INTERSPEECH, 2285-2288.

Deng, L.; He, X.; and Gao, J. 2013. Deep stacking networks for information retrieval. In ICASSP, 3153-3157. IEEE.

Efron, B., and Tibshirani, R. J. 1994. An Introduction to the Bootstrap. CRC Press.

Graf, H. P.; Cosatto, E.; Bottou, L.; Dourdanovic, I.; and Vapnik, V. 2005. Parallel support vector machines: The cascade SVM. In NIPS, 521-528.

Hinton, G. E., and Salakhutdinov, R. R. 2006. Reducing the dimensionality of data with neural networks. Science 313(5786):504-507.

Hinton, G. 2011. Deep belief nets. In Encyclopedia of Machine Learning. Springer. 267-269.

Hornik, K. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks 4(2):251-257.

Hutchinson, B.; Deng, L.; and Yu, D. 2013. Tensor deep stacking networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 35(8):1944-1957.

Jahrer, M.; Töscher, A.; and Legenstein, R. 2010. Combining predictions for accurate recommender systems. In SIGKDD, 693-702. ACM.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS, 1097-1105.

LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278-2324.

Li, J.; Chang, H.; and Yang, J. 2015. Sparse deep stacking network for image classification. In AAAI, 3804-3810.

Maas, A. L.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y.; and Potts, C. 2011. Learning word vectors for sentiment analysis. In ACL, 142-150.

Medsker, L., and Jain, L. 2001. Recurrent Neural Networks: Design and Applications. CRC Press.

Perlich, C., and Swirszcz, G. 2011. On cross-validation and stacking: Building seemingly predictive models on random data. ACM SIGKDD Explorations Newsletter 12(2):11-15.

Rooney, N., and Patterson, D. 2007. A weighted combination of stacking and dynamic integration. Pattern Recognition 40(4):1385-1388.

Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2013. Deep Fisher networks for large-scale image classification. In NIPS, 163-171.

Tang, Y. 2013. Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239.

Ting, K. M., and Witten, I. H. 1999. Issues in stacked generalization. Journal of Artificial Intelligence Research 10:271-289.

Vapnik, V. 1998. Statistical Learning Theory. Wiley, New York.

Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; and Manzagol, P.-A. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11:3371-3408.

Wiering, M.; Van der Ree, M.; Embrechts, M.; Stollenga, M.; Meijster, A.; Nolte, A.; and Schomaker, L. 2013. The neural support vector machine. In BNAIC.

Wolpert, D. H. 1992. Stacked generalization. Neural Networks 5(2):241-259.

Zhang, Y.; Liang, P.; and Wainwright, M. J. 2016. Convexified convolutional neural networks. arXiv preprint arXiv:1609.01000.

Zhang, T. 2003. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics 32(1):56-134.
Zhou, Z.-H., and Feng, J. 2017. Deep forest: Towards an alternative to deep neural networks. In IJCAI, 3553-3559.