# Learning Kernels with Random Features

Aman Sinha^1 and John Duchi^{1,2}, Departments of ^1 Electrical Engineering and ^2 Statistics, Stanford University. {amans,jduchi}@stanford.edu

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Randomized features provide a computationally efficient way to approximate kernel machines in machine learning tasks. However, such methods require a user-defined kernel as input. We extend the randomized-feature approach to the task of learning a kernel (via its associated random features). Specifically, we present an efficient optimization problem that learns a kernel in a supervised manner. We prove the consistency of the estimated kernel as well as generalization bounds for the class of estimators induced by the optimized kernel, and we experimentally evaluate our technique on several datasets. Our approach is efficient and highly scalable, and we attain competitive results with a fraction of the training cost of other techniques.

1 Introduction

An essential element of supervised learning systems is the representation of input data. Kernel methods [27] provide one approach to this problem: they implicitly transform the data to a new feature space, allowing non-linear data representations. This representation comes with a cost, as kernelized learning algorithms require time that grows at least quadratically in the data set size, and predictions with a kernelized procedure require the entire training set. This motivated Rahimi and Recht [24, 25] to develop randomized methods that efficiently approximate kernel evaluations with explicit feature transformations; this approach gives substantial computational benefits for large training sets and allows the use of simple linear models in the randomly constructed feature space.

Whether we use standard kernel methods or randomized approaches, using the right kernel for a problem can make the difference between learning a useful or useless model. Standard kernel methods as well as the aforementioned randomized-feature techniques assume the input of a user-defined kernel, a weakness if we do not a priori know a good data representation. To address this weakness, one often wishes to learn a good kernel, which requires substantial computation. We combine kernel learning with randomization, exploiting the computational advantages offered by randomized features to learn the kernel in a supervised manner. Specifically, we use a simple pre-processing stage for selecting our random features rather than jointly optimizing over the kernel and model parameters. Our workflow is straightforward: we create randomized features, solve a simple optimization problem to select a subset, then train a model with the optimized features. The procedure results in lower-dimensional models than the original random-feature approach for the same performance. We give empirical evidence supporting these claims and provide theoretical guarantees that our procedure is consistent with respect to the limits of infinite training data and infinite-dimensional random features.

1.1 Related work

To discuss related work, we first describe the supervised learning problem underlying our approach. We have a cost $c : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$, where $c(\cdot, y)$ is convex for $y \in \mathcal{Y}$, and a reproducing kernel Hilbert space (RKHS) of functions $\mathcal{F}$ with kernel $K$. Given a sample $\{(x_i, y_i)\}_{i=1}^n$, the usual $\ell_2$-regularized
learning problem is to solve the following (shown in primal and dual forms respectively):

$$\underset{f \in \mathcal{F}}{\text{minimize}} \;\; \sum_{i=1}^n c(f(x_i), y_i) + \frac{\lambda}{2}\|f\|_2^2, \quad \text{or} \quad \underset{\alpha \in \mathbb{R}^n}{\text{maximize}} \;\; -\sum_{i=1}^n c^*(\alpha_i, y_i) - \frac{1}{2\lambda}\alpha^T G \alpha, \qquad (1)$$

where $\|\cdot\|_2$ denotes the Hilbert space norm, $c^*(\alpha, y) = \sup_z\{\alpha z - c(z, y)\}$ is the convex conjugate of $c$ (for fixed $y$), and $G = [K(x_i, x_j)]_{i,j=1}^n$ denotes the Gram matrix.

Several researchers have studied kernel learning. As noted by Gönen and Alpaydın [14], most formulations fall into one of a few categories. In the supervised setting, one assumes a base class or classes of kernels and either uses heuristic rules to combine kernels [2, 23], optimizes structured (e.g. linear, nonnegative, convex) compositions of the kernels with respect to an alignment metric [9, 16, 20, 28], or jointly optimizes kernel compositions with empirical risk [17, 20, 29]. The latter approaches require an eigendecomposition of the Gram matrix or costly optimization problems (e.g. quadratic or semidefinite programs) [10, 14], but these models have a variety of generalization guarantees [1, 8, 10, 18, 19]. Bayesian variants of compositional kernel search also exist [12, 13]. In un- and semi-supervised settings, the goal is to learn an embedding of the input distribution followed by a simple classifier in the embedded space (e.g. [15]); the hope is that the input distribution carries the structure relevant to the task. Despite the current popularity of these techniques, especially deep neural architectures, they are costly, and it is difficult to provide guarantees on their performance.

Our approach optimizes kernel compositions with respect to an alignment metric, but rather than work with Gram matrices in the original data representation, we work with randomized feature maps that approximate RKHS embeddings. We learn a kernel that is structurally different from a user-supplied base kernel, and our method is an efficiently solvable (near linear-time) convex program.

2 Proposed approach

At a high level, we take a feature mapping, find a distribution that aligns this mapping with the labels $y$, and draw random features from the learned distribution; we then use these features in a standard supervised learning approach. For simplicity, we focus on binary classification: we have $n$ datapoints $(x_i, y_i) \in \mathbb{R}^d \times \{-1, 1\}$. Letting $\phi : \mathbb{R}^d \times \mathcal{W} \to [-1, 1]$ and $Q$ be a probability measure on a space $\mathcal{W}$, define the kernel

$$K_Q(x, x') := \int \phi(x, w)\phi(x', w)\,dQ(w). \qquad (2)$$

We want to find the best kernel $K_Q$ over all distributions $Q$ in some (large, nonparametric) set $\mathcal{P}$ of possible distributions on random features; we consider a kernel alignment problem of the form

$$\underset{Q \in \mathcal{P}}{\text{maximize}} \;\; \sum_{i,j} K_Q(x_i, x_j)\, y_i y_j. \qquad (3)$$

We focus on sets $\mathcal{P}$ defined by divergence measures on the space of probability distributions. For a convex function $f$ with $f(1) = 0$, the $f$-divergence between distributions $P$ and $Q$ is $D_f(P\|Q) = \int f\big(\tfrac{dP}{dQ}\big)\,dQ$. Then, for a base (user-defined) distribution $P_0$, we consider collections $\mathcal{P} := \{Q : D_f(Q\|P_0) \le \rho\}$, where $\rho > 0$ is a specified constant. In this paper, we focus on divergences $f(t) = t^k - 1$ for $k \ge 2$. Intuitively, the distribution $Q$ maximizing the alignment (3) gives a feature space in which pairwise distances are similar to those in the output space $\mathcal{Y}$. Unfortunately, problem (3) is generally intractable, as it is infinite dimensional. Using the randomized feature approach, we approximate the integral (2) as a discrete sum over samples $W^i \stackrel{iid}{\sim} P_0$, $i \in [N_w]$.
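The discrete approximation is easy to state in code. Below is a minimal numpy sketch (the function names and the specific cosine feature map are illustrative choices of ours, not code from the paper): it draws random features whose uniform average approximates a Gaussian-type kernel, and it accepts an optional weight vector q on the simplex, which is exactly the object optimized in problem (4) below.

```python
import numpy as np

def sample_gaussian_features(d, num_features, bandwidth=1.0, rng=None):
    """Draw (w, v) parameters for the bounded cosine feature map phi(x, (w, v)) = cos(w^T x + v)."""
    rng = np.random.default_rng(rng)
    W = rng.normal(scale=1.0 / bandwidth, size=(num_features, d))
    v = rng.uniform(0.0, 2 * np.pi, size=num_features)
    return W, v

def feature_map(X, W, v):
    """phi(x, w) = cos(w^T x + v), bounded in [-1, 1] as Section 2 requires."""
    return np.cos(X @ W.T + v)                      # shape (n, N_w)

def approx_kernel(X, Xp, W, v, q=None):
    """Discrete approximation of K_Q(x, x') = E_Q[phi(x, W) phi(x', W)].

    With q = None this is the uniform average over the N_w samples from P_0; a nonuniform
    weight vector q on the simplex gives the reweighted kernel appearing in problem (4).
    Since phi is not scaled by sqrt(2), the result is proportional to a Gaussian kernel.
    """
    Phi, Phip = feature_map(X, W, v), feature_map(Xp, W, v)
    if q is None:
        q = np.full(W.shape[0], 1.0 / W.shape[0])
    return (Phi * q) @ Phip.T                       # (n, n') approximate Gram matrix

# toy usage
X = np.random.default_rng(0).standard_normal((5, 3))
W, v = sample_gaussian_features(d=3, num_features=2000, rng=0)
print(np.round(approx_kernel(X, X, W, v), 2))
```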
Defining the discrete approximation $\mathcal{P}_{N_w} := \{q : D_f(q\,\|\,\mathbf{1}/N_w) \le \rho\}$ to $\mathcal{P}$, we have the following empirical version of problem (3):

$$\underset{q \in \mathcal{P}_{N_w}}{\text{maximize}} \;\; \sum_{i,j} y_i y_j \sum_{m=1}^{N_w} q_m \phi(x_i, w^m)\phi(x_j, w^m). \qquad (4)$$

Using randomized features, matching the input and output distances in problem (4) translates to finding a (weighted) set of points among $w^1, w^2, \ldots, w^{N_w}$ that best describe the underlying dataset or, more directly, finding weights $q$ so that the kernel matrix matches the correlation matrix $yy^T$.

Given a solution $\hat{q}$ to problem (4), we can solve the primal form of problem (1) in two ways. First, we can apply the approach of Rahimi and Recht [24] by drawing $D$ samples $W^1, \ldots, W^D \stackrel{iid}{\sim} \hat{q}$, defining features $\phi^i = [\phi(x_i, W^1) \cdots \phi(x_i, W^D)]^T$, and solving the risk minimization problem

$$\hat{\theta} = \underset{\theta}{\mathrm{argmin}} \;\; \sum_{i=1}^n c(\theta^T \phi^i, y_i) + r(\theta) \qquad (5)$$

for some regularization $r$. Alternatively, we may set $\phi^i = [\phi(x_i, w^1) \cdots \phi(x_i, w^{N_w})]^T$, where $w^1, \ldots, w^{N_w}$ are the original random samples from $P_0$ used to solve (4), and directly solve

$$\hat{\theta} = \underset{\theta}{\mathrm{argmin}} \;\; \sum_{i=1}^n c\big(\theta^T \operatorname{diag}(\hat{q})^{\frac{1}{2}} \phi^i, y_i\big) + r(\theta). \qquad (6)$$

Notably, if $\hat{q}$ is sparse, problem (6) need only store the random features corresponding to non-zero entries of $\hat{q}$. Contrast our two-phase procedure with that of Rahimi and Recht [25], which samples $W^1, \ldots, W^D \stackrel{iid}{\sim} P_0$ and solves the minimization problem

$$\underset{\alpha \in \mathbb{R}^{N_w}}{\text{minimize}} \;\; \sum_{i=1}^n c\Big(\sum_{m=1}^{N_w} \alpha_m \phi(x_i, w^m),\, y_i\Big) \quad \text{subject to} \quad \|\alpha\|_\infty \le C/N_w, \qquad (7)$$

where $C$ is a numerical constant. At first glance, it appears that we may suffer both in terms of computational efficiency and in classification or learning performance compared with the one-step procedure (7). However, as we show in the sequel, the alignment problem (4) can be solved very efficiently and often yields sparse vectors $\hat{q}$, substantially decreasing the dimensionality of problem (6). Additionally, we give experimental evidence in Section 4 that the two-phase procedure yields generalization performance similar to standard kernel and randomized feature methods.

2.1 Efficiently solving problem (4)

The optimization problem (4) has structure that enables efficient (near linear-time) solutions. Define the matrix $\Phi = [\phi^1 \cdots \phi^n] \in \mathbb{R}^{N_w \times n}$, where $\phi^i = [\phi(x_i, w^1) \cdots \phi(x_i, w^{N_w})]^T \in \mathbb{R}^{N_w}$ is the randomized feature representation for $x_i$ and $w^m \stackrel{iid}{\sim} P_0$. We can rewrite the optimization objective as

$$\sum_{i,j} y_i y_j \sum_{m=1}^{N_w} q_m \phi(x_i, w^m)\phi(x_j, w^m) = \sum_{m=1}^{N_w} q_m \Big(\sum_{i=1}^n y_i \phi(x_i, w^m)\Big)^2 = q^T\big((\Phi y) \odot (\Phi y)\big),$$

where $\odot$ denotes the Hadamard product. Constructing the linear objective requires the evaluation of $\Phi y$. Assuming that the computation of $\phi$ is $O(d)$, construction of $\Phi$ is $O(nN_w d)$ on a single processor. However, this construction is trivially parallelizable. Furthermore, computation can be sped up even further for certain distributions $P_0$. For example, the Fastfood technique can approximate $\Phi$ in $O(nN_w \log(d))$ time for the Gaussian kernel [21].
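To make the preceding construction concrete, here is a small numpy sketch, with names and toy placeholders of our own choosing, of the linear objective $v = (\Phi y) \odot (\Phi y)$ and of the two ways a solution $\hat{q}$ is consumed downstream: resampling $D$ feature indices from $\hat{q}$ as in estimator (5), or keeping only the nonzero features reweighted by $\operatorname{diag}(\hat{q})^{1/2}$ as in estimator (6).

```python
import numpy as np

def alignment_objective(Phi, y):
    """v = (Phi y) * (Phi y) elementwise: v_m measures how well feature m aligns with the labels."""
    u = Phi @ y                                  # shape (N_w,)
    return u * u

def resampled_features(Phi, q_hat, D, rng=None):
    """Estimator (5): draw D feature indices i.i.d. from q_hat and use those rows of Phi."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(q_hat), size=D, p=q_hat)
    return Phi[idx].T                            # (n, D) design matrix for a linear model

def reweighted_features(Phi, q_hat):
    """Estimator (6): keep only the nonzero-weight features, scaled by sqrt(q_hat)."""
    nz = np.flatnonzero(q_hat)
    return (np.sqrt(q_hat[nz])[:, None] * Phi[nz]).T   # (n, nnz(q_hat))

# toy usage with random placeholders standing in for Phi, y, and a solution q_hat of (4)
rng = np.random.default_rng(0)
Phi = np.sign(rng.standard_normal((1000, 50)))   # N_w = 1000 features, n = 50 datapoints
y = np.sign(rng.standard_normal(50))
v = alignment_objective(Phi, y)
q_hat = v / v.sum()                              # placeholder weights on the simplex
print(v.shape, resampled_features(Phi, q_hat, D=100, rng=1).shape, reweighted_features(Phi, q_hat).shape)
```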
Problem (4) is also efficiently solvable via bisection over a scalar dual variable. Using $\lambda \ge 0$ for the constraint $D_f(Q\|P_0) \le \rho$, a partial Lagrangian is

$$\mathcal{L}(q, \lambda) = q^T\big((\Phi y) \odot (\Phi y)\big) - \lambda\big(D_f(q\,\|\,\mathbf{1}/N_w) - \rho\big).$$

The corresponding dual function is $g(\lambda) = \sup_{q \in \Delta} \mathcal{L}(q, \lambda)$, where $\Delta := \{q \in \mathbb{R}^{N_w}_+ : q^T\mathbf{1} = 1\}$ is the probability simplex. Minimizing $g(\lambda)$ yields the solution to problem (4); this is a convex optimization problem in one dimension, so we can use bisection. The computationally expensive step in each iteration is maximizing $\mathcal{L}(q, \lambda)$ with respect to $q$ for a given $\lambda$. For $f(t) = t^k - 1$, we define $v := (\Phi y) \odot (\Phi y)$ and solve

$$\underset{q \in \Delta}{\text{maximize}} \;\; q^T v - \frac{\lambda}{N_w}\sum_{m=1}^{N_w}(N_w q_m)^k. \qquad (8)$$

This has a solution of the form $q_m = \big[\tfrac{v_m}{\lambda N_w^{k-1}} + \tau\big]_+^{\frac{1}{k-1}}$, where $\tau$ is chosen so that $\sum_m q_m = 1$. We can find such a $\tau$ by a variant of median-based search in $O(N_w)$ time [11]. Thus, for any $k \ge 2$, an $\epsilon$-suboptimal solution to problem (4) can be found in $O(N_w \log(1/\epsilon))$ time (see Algorithm 1 and the sketch that follows it).

Algorithm 1: Kernel optimization with $f(t) = t^k - 1$ as the divergence

INPUT: distribution $P_0$ on $\mathcal{W}$, sample $\{(x_i, y_i)\}_{i=1}^n$, $N_w \in \mathbb{N}$, feature function $\phi$, $\epsilon > 0$
OUTPUT: $q \in \mathbb{R}^{N_w}$, an $\epsilon$-suboptimal solution to (4)
SETUP: draw $N_w$ samples $w^m \stackrel{iid}{\sim} P_0$, build the feature matrix $\Phi$, compute $v := (\Phi y) \odot (\Phi y)$; set $\lambda_u \leftarrow \infty$, $\lambda_l \leftarrow 0$, $\lambda_s \leftarrow 1$
while $\lambda_u = \infty$:
    $q \leftarrow \mathrm{argmax}_{q \in \Delta}\, \mathcal{L}(q, \lambda_s)$  // solution to problem (8)
    if $D_f(q\,\|\,\mathbf{1}/N_w) < \rho$ then $\lambda_u \leftarrow \lambda_s$ else $\lambda_s \leftarrow 2\lambda_s$
while $\lambda_u - \lambda_l > \epsilon\lambda_s$:
    $\lambda \leftarrow (\lambda_u + \lambda_l)/2$
    $q \leftarrow \mathrm{argmax}_{q \in \Delta}\, \mathcal{L}(q, \lambda)$  // solution to problem (8)
    if $D_f(q\,\|\,\mathbf{1}/N_w) < \rho$ then $\lambda_u \leftarrow \lambda$ else $\lambda_l \leftarrow \lambda$
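Below is a minimal Python sketch of Algorithm 1 for the $\chi^2$ case ($k = 2$), assuming the vector $v = (\Phi y) \odot (\Phi y)$ has already been computed. The inner maximization is solved in closed form for the objective stated in its docstring (constant factors may be folded into $\lambda$ differently than in the display above, which does not change the bisection's output), and a simple sort-based search for $\tau$ stands in for the $O(N_w)$ median-based routine of [11].

```python
import numpy as np

def solve_inner(v, lam, Nw):
    """argmax over the simplex of q^T v - lam * Nw * ||q||^2  (a k = 2 instance of problem (8)).

    The maximizer has the form q_m = [v_m / (2 lam Nw) + tau]_+ with tau chosen so that
    sum_m q_m = 1; tau is found by sorting (a simpler alternative to the median search of [11]).
    """
    z = v / (2.0 * lam * Nw)
    u = np.sort(z)[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, Nw + 1)
    k = ks[u + (1.0 - css) / ks > 0][-1]        # largest index kept above the threshold
    tau = (1.0 - css[k - 1]) / k
    return np.maximum(z + tau, 0.0)

def chi2_divergence(q, Nw):
    """D_f(q || 1/Nw) with f(t) = t^2 - 1, which equals Nw * ||q||^2 - 1."""
    return Nw * np.dot(q, q) - 1.0

def optimize_kernel_weights(v, rho, eps=1e-4):
    """Bisection over the dual variable lambda, mirroring Algorithm 1 (chi-squared case)."""
    Nw = len(v)
    lam_l, lam_s, lam_u = 0.0, 1.0, np.inf
    # phase 1: double lambda_s until the inner solution is feasible, giving an upper bracket
    while np.isinf(lam_u):
        q = solve_inner(v, lam_s, Nw)
        if chi2_divergence(q, Nw) < rho:
            lam_u = lam_s
        else:
            lam_s *= 2.0
    # phase 2: bisect on lambda until the bracket is small relative to lambda_s
    while lam_u - lam_l > eps * lam_s:
        lam = 0.5 * (lam_u + lam_l)
        q = solve_inner(v, lam, Nw)
        if chi2_divergence(q, Nw) < rho:
            lam_u = lam
        else:
            lam_l = lam
    return solve_inner(v, lam_u, Nw)            # feasible endpoint of the bracket

# toy usage: a spiky alignment vector concentrates the learned weights
v = np.exp(np.random.default_rng(0).standard_normal(1000))
q = optimize_kernel_weights(v, rho=5.0)
print(q.sum(), int((q > 0).sum()))
```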
3 Consistency and generalization performance guarantees

Although the procedure (4) is a discrete approximation to a heuristic kernel alignment problem, we can provide guarantees on its performance as well as the generalization performance of our subsequent model trained with the optimized kernel.

Consistency. First, we provide guarantees that the solution to problem (4) approaches a population optimum as the data and random sampling increase ($n \to \infty$ and $N_w \to \infty$, respectively). We consider the following (slightly more general) setting: let $S : \mathcal{X} \times \mathcal{X} \to [-1, 1]$ be a bounded function, where we intuitively think of $S(x, x')$ as a similarity metric between labels for $x$ and $x'$, and denote $S_{ij} := S(x_i, x_j)$ (in the binary case with $y \in \{-1, 1\}$, we have $S_{ij} = y_i y_j$). We then define the alignment functions

$$T(P) := \mathbb{E}\big[S(X, X')\,K_P(X, X')\big], \qquad \hat{T}(P) := \frac{1}{n(n-1)}\sum_{i \ne j} S_{ij}\, K_P(x_i, x_j),$$

where the expectation is taken over $S$ and the independent variables $X, X'$. Lemmas 1 and 2 provide consistency guarantees with respect to the data sample ($x_i$ and $S_{ij}$) and the random feature sample ($w^m$); together they give us the overall consistency result of Theorem 1. We provide proofs in the supplement (Sections A.1, A.2, and A.3 respectively).

Lemma 1 (Consistency with respect to data). Let $f(t) = t^k - 1$ for $k \ge 2$, let $P_0$ be any distribution on the space $\mathcal{W}$, and let $\mathcal{P} = \{Q : D_f(Q\|P_0) \le \rho\}$. Then, with probability at least $1 - \delta$ over the data sample, $\sup_{Q \in \mathcal{P}} |\hat{T}(Q) - T(Q)| \le t_n(\delta, \rho)$, where the explicit error bound $t_n(\delta, \rho) \to 0$ as $n \to \infty$ is given in Section A.1.

Lemma 1 shows that the empirical quantity $\hat{T}$ is close to the true $T$. Now we show that, independent of the size of the training data, we can consistently estimate the optimal $Q \in \mathcal{P}$ via sampling (i.e. $Q \in \mathcal{P}_{N_w}$).

Lemma 2 (Consistency with respect to sampling features). Let the conditions of Lemma 1 hold. Then, with constants $C_\rho$ and $D_\rho = \sqrt{8(1+\rho)}$ depending only on $\rho$, we have

$$\sup_{Q \in \mathcal{P}_{N_w}} \hat{T}(Q) \;\ge\; \sup_{Q \in \mathcal{P}} \hat{T}(Q) - 4C_\rho\, t_{N_w}(\delta)$$

with probability at least $1 - \delta$ over the draw of the samples $W^m \stackrel{iid}{\sim} P_0$, where the sampling-error term $t_{N_w}(\delta)$ (which involves $D_\rho$) vanishes as $N_w \to \infty$; its explicit form is given in Section A.2.

Finally, we combine the consistency guarantees for data and sampling to reach our main result, which shows that the alignment provided by the estimated distribution $\hat{Q}$ is nearly optimal.

Theorem 1. Let $\hat{Q}_w$ maximize $\hat{T}(Q)$ over $Q \in \mathcal{P}_{N_w}$. Then, with probability at least $1 - 3\delta$ over the sampling of both $(x, y)$ and $W$, we have $T(\hat{Q}_w) \ge \sup_{Q \in \mathcal{P}} T(Q) - \Delta_{n, N_w}(\delta)$, where the combined error $\Delta_{n, N_w}(\delta)$ is a sum of the data and sampling error terms of Lemmas 1 and 2 (including the $4C_\rho$ term) and vanishes as $n, N_w \to \infty$; the explicit bound is given in Section A.3.

Generalization performance. The consistency results above show that our optimization procedure nearly maximizes the alignment $T(P)$, but they say little about the generalization performance of our model trained using the optimized kernel. We now show that the class of estimators employed by our method has strong performance guarantees. By construction, our estimator (6) uses the function class

$$\mathcal{F}_{N_w} := \Big\{ h(x) = \sum_{m=1}^{N_w} \alpha_m \sqrt{q_m}\,\phi(x, w^m) \;\Big|\; q \in \mathcal{P}_{N_w},\; \|\alpha\|_2 \le B \Big\},$$

and we provide bounds on its generalization via empirical Rademacher complexity. To that end, define $R_n(\mathcal{F}_{N_w}) := \frac{1}{n}\,\mathbb{E}\big[\sup_{f \in \mathcal{F}_{N_w}} \sum_{i=1}^n \sigma_i f(x_i)\big]$, where the expectation is taken over the i.i.d. Rademacher variables $\sigma_i \in \{-1, 1\}$. We have the following lemma, whose proof is in Section A.4.

Lemma 3. Under the conditions of the preceding paragraph, $R_n(\mathcal{F}_{N_w}) \le B\sqrt{C/n}$ for a constant $C$ that does not depend on $N_w$; the explicit constant is given in Section A.4.

Applying standard concentration results, we obtain the following generalization guarantee.

Theorem 2 ([8, 18]). Let the true misclassification risk and $\nu$-empirical misclassification risk for an estimator $h$ be defined as

$$R(h) := \mathbb{P}(Y h(X) < 0), \qquad \hat{R}_\nu(h) := \frac{1}{n}\sum_{i=1}^n \min\Big\{1, \big[1 - y_i h(x_i)/\nu\big]_+\Big\}.$$

Then $\sup_{h \in \mathcal{F}_{N_w}} \{R(h) - \hat{R}_\nu(h)\} \le \frac{2}{\nu}R_n(\mathcal{F}_{N_w}) + 3\sqrt{\frac{\log(2/\delta)}{2n}}$ with probability at least $1 - \delta$.

The bound is independent of the number of terms $N_w$, though in practice we let $B$ grow with $N_w$.

4 Empirical evaluations

We now turn to empirical evaluations, comparing our approach's predictive performance with that of Rahimi and Recht's randomized features [24] as well as a joint optimization over kernel compositions and empirical risk. In each of our experiments, we investigate the effect of increasing the dimensionality $D$ of the randomized feature space. For our approach, we use the $\chi^2$-divergence ($k = 2$, i.e. $f(t) = t^2 - 1$). Letting $\hat{q}$ denote the solution to problem (4), we use two variants of our approach: when $D < \mathrm{nnz}(\hat{q})$ we use estimator (5), and we use estimator (6) otherwise. For the original randomized feature approach, we relax the constraint in problem (7) with an $\ell_2$ penalty. Finally, for the joint optimization in which we learn the kernel and classifier together, we consider the kernel-learning objective, i.e. finding the best Gram matrix $G$ in problem (1) for the soft-margin SVM [14]:

$$\underset{q \in \mathcal{P}_{N_w}}{\text{minimize}} \;\; \sup_{\alpha}\Big\{\alpha^T\mathbf{1} - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \sum_{m=1}^{N_w} q_m\phi(x_i, w^m)\phi(x_j, w^m)\Big\} \;\;\text{subject to}\;\; 0 \le \alpha \le C\mathbf{1},\; \alpha^T y = 0. \qquad (9)$$

We use a standard primal-dual algorithm [4] to solve the min-max problem (9). While this is an expensive optimization, it is a convex problem and is solvable in polynomial time.

In Section 4.1, we visualize a particular problem that illustrates the effectiveness of our approach when the user-defined kernel is poor. Section 4.2 shows how learning the kernel can be used to quickly find a sparse set of features in high-dimensional data, and Section 4.3 compares our performance with unoptimized random features and the joint procedure (9) on benchmark datasets. The supplement contains more experimental results in Section C.

4.1 Learning a new kernel with a poor choice of P0

For our first experiment, we generate synthetic data $x_i \stackrel{iid}{\sim} N(0, I)$ with labels $y_i = \mathrm{sign}(\|x\|_2 - \sqrt{d})$, where $x \in \mathbb{R}^d$. The Gaussian kernel is ill-suited for this task, as the Euclidean distance used in this kernel does not capture the underlying structure of the classes. Nevertheless, we use the Gaussian kernel, which corresponds [24] to $\phi(x, (w, v)) = \cos((x, 1)^T(w, v))$ where $(W, V) \sim N(0, I) \times \mathrm{Uni}(0, 2\pi)$, to showcase the effects of our method. We consider a training set of size $n = 10^4$ and a test set of size $10^3$, and we employ logistic regression with $D = \mathrm{nnz}(\hat{q})$ for both our technique and the original random feature approach.¹

¹For $2 \le d \le 15$, $\mathrm{nnz}(\hat{q}) < 250$ when the kernel is trained with $N_w = 2 \times 10^4$ and $\rho = 200$.

Figure 1. Experiments with synthetic data. (a) Training data and optimized features for $d = 2$: positive and negative training examples are blue and red, and optimized randomized features ($w^m$) are yellow; all offset parameters $v^m$ were optimized to be near 0 or $\pi$ (not shown). (b) Misclassification error of the logistic regression model vs. dimensionality of the data.
GK denotes random features with a Gaussian kernel, and our optimized kernel is denoted OK.

Figure 2. Feature selection in sparse data. (a) Misclassification error of the ridge regression model vs. dimensionality of the data; LK denotes random features with a linear kernel, and OK denotes our method. Our error is fixed above $D = \mathrm{nnz}(\hat{q})$, after which we employ estimator (6). (b) Weight $\hat{q}_i$ of feature $i$ in the optimized kernel vs. $i$. Vertical bars delineate separations between $k$-grams, where $1 \le k \le 5$ is nondecreasing in $i$. Circled features are prefixes of GGTTG and GTTGG at indices 60 through 64.

Figure 1 shows the results of the experiments for $d \in \{2, \ldots, 15\}$. Figure 1(a) illustrates the output of the optimization when $d = 2$. The selected kernel features $w^m$ lie near $(1, 1)$ and $(-1, -1)$; the offsets $v^m$ are near 0 and $\pi$, giving the feature $\phi(\cdot, w, v)$ a parity flip. Thus, the kernel computes similarity between datapoints via neighborhoods of $(1, 1)$ and $(-1, -1)$ close to the classification boundary. In higher dimensions, this generalizes to neighborhoods of pairs of opposing points along the surface of the $d$-sphere; these features provide a coarse approximation to vector magnitude. Performance degradation with $d$ occurs because the neighborhoods grow exponentially larger and less dense (due to fixed $N_w$ and $n$). Nevertheless, as shown in Figure 1(b), this degradation occurs much more slowly than that of the Gaussian kernel, which suffers a similar curse of dimensionality due to its dependence on Euclidean distance. Although somewhat contrived, this example shows that even in situations with poor base kernels our approach learns a more suitable representation.

4.2 Feature selection and biological sequences

In addition to the computational advantages rendered by the sparsity of $q$ after performing the optimization (4), we can use this sparsity to gain insights about important features in high-dimensional datasets; this can act as an efficient filtering mechanism before further investigation. We present one example of this task, studying an aptamer selection problem [6]. In this task, we are given $n = 2900$ nucleotide sequences (aptamers) $x_i \in \mathcal{A}^{81}$, where $\mathcal{A} = \{A, C, G, T\}$, and labels $y_i$ indicate (thresholded) binding affinity of the aptamer to a molecular target. We create one-hot encoded forms of $k$-grams of the sequence, where $1 \le k \le 5$, resulting in $d = \sum_{k=1}^{5}|\mathcal{A}|^k(82 - k) = 105{,}476$ features (a sketch of this encoding follows Figure 3's caption below).

Figure 3. Performance analysis on benchmark datasets; panels (a)-(c) show error vs. $D$ and panels (d)-(f) show speedup vs. $D$ for the adult, reuters, and buzz datasets respectively. The top row shows training and test misclassification rates. Our method is denoted OK and is shown in red; the blue methods are random features with Gaussian, linear, or arc-cosine kernels (GK, LK, or ACK respectively). Our error and running time become fixed above $D = \mathrm{nnz}(\hat{q})$, after which we employ estimator (6). The bottom row shows the speedup factor of using our method over regular random features (speedup = $x$ indicates our method takes $1/x$ of the time required to use regular random features). Our method is faster at moderate to large $D$ and shows better performance than the random feature approach at small to moderate $D$.
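As a concrete illustration of the encoding described in Section 4.2 (the helper names are ours, not the authors' code), the following sketch builds the positional one-hot $k$-gram representation of a nucleotide sequence and reproduces the feature count $d = \sum_{k=1}^{5} 4^k(82 - k) = 105{,}476$.

```python
import itertools
import numpy as np

ALPHABET = "ACGT"
SEQ_LEN = 81
MAX_K = 5

def kgram_feature_count(seq_len=SEQ_LEN, max_k=MAX_K, alphabet=ALPHABET):
    """d = sum_k |A|^k * (seq_len - k + 1): one indicator per (position, k-gram) pair."""
    return sum(len(alphabet) ** k * (seq_len - k + 1) for k in range(1, max_k + 1))

def one_hot_kgrams(seq, max_k=MAX_K, alphabet=ALPHABET):
    """Positional one-hot k-gram encoding of a nucleotide sequence (a sparse 0/1 vector)."""
    x = np.zeros(kgram_feature_count(len(seq), max_k, alphabet), dtype=np.int8)
    offset = 0
    for k in range(1, max_k + 1):
        # index of each possible k-gram in lexicographic order
        grams = {"".join(g): i for i, g in enumerate(itertools.product(alphabet, repeat=k))}
        n_pos = len(seq) - k + 1
        for p in range(n_pos):
            x[offset + p * len(grams) + grams[seq[p:p + k]]] = 1
        offset += len(grams) * n_pos
    return x

print(kgram_feature_count())          # 105476, matching the count in Section 4.2
rng = np.random.default_rng(0)
seq = "".join(rng.choice(list(ALPHABET), size=SEQ_LEN))
print(one_hot_kgrams(seq).sum())      # one active indicator per (k, position) pair: 5*81 - (1+2+3+4) = 395
```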
Table 1: Best test results over benchmark datasets

| Dataset | $n$, $n_{\text{test}}$ | $d$ | Model | Our error (%), time (s) | Random error (%), time (s) |
|---|---|---|---|---|---|
| adult | 32561, 16281 | 123 | Logistic | 15.54, 3.6 | 15.44, 43.1 |
| reuters | 23149, 781265 | 47236 | Ridge | 9.27, 0.8 | 9.36, 295.9 |
| buzz | 105530, 35177 | 77 | Ridge | 4.92, 2.0 | 4.58, 11.9 |

We consider the linear kernel, i.e. $\phi(x, w) = x_w$, where $w \sim \mathrm{Uni}(\{1, \ldots, d\})$. Figure 2(a) compares the misclassification error of our method with that of random $k$-gram features, while Figure 2(b) indicates the weights $\hat{q}_i$ given to features by our method. In under 0.2 seconds, we whittle down the original feature space to 379 important features. By restricting random selection to just these features, we outperform the approach of selecting features uniformly at random when $D \ll d$. More importantly, however, we can derive insights from this selection. For example, the circled features in Figure 2(b) correspond to $k$-gram prefixes of the 5-grams GGTTG and GTTGG at indices 60 through 64; G-complexes are known to be relevant for binding affinities in aptamers [6], so this is reasonable.

4.3 Performance on benchmark datasets

We now show the benefits of our approach on large-scale datasets, where we combine the efficiency of random features with the performance of kernel-learning techniques. We perform experiments on three distinct types of datasets, tracking training/test error rates as well as total (training + test) time. For the adult² dataset we employ the Gaussian kernel with a logistic regression model, and for the reuters³ dataset we employ a linear kernel with a ridge regression model. For the buzz⁴ dataset we employ ridge regression with an arc-cosine kernel of order 2, i.e. $P_0 = N(0, I)$ and $\phi(x, w) = H(w^T x)(w^T x)^2$, where $H(\cdot)$ is the Heaviside step function [7]; the linear and arc-cosine feature maps are sketched at the end of this subsection.

²https://archive.ics.uci.edu/ml/datasets/Adult
³http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm. We consider predicting whether a document has a CCAT label.
⁴http://ama.liglab.fr/data/buzz/classification/. We use the Twitter dataset.

Table 2: Comparisons with joint optimization on subsampled data

| Dataset | Our training / test error (%), time (s) | Joint training / test error (%), time (s) |
|---|---|---|
| adult | 16.22 / 16.36, 1.8 | 14.88 / 16.31, 198.1 |
| reuters | 7.64 / 9.66, 0.6 | 6.30 / 8.96, 173.3 |
| buzz | 8.44 / 8.32, 0.4 | 7.38 / 7.08, 137.5 |

Comparison with unoptimized random features. Results comparing our method with unoptimized random features are shown in Figure 3 for many values of $D$, and Table 1 tabulates the best test error and corresponding time for each method. Our method outperforms the original random feature approach in terms of generalization error for small and moderate values of $D$; at very large $D$ the random feature approach either matches or surpasses our performance. The trends in speedup are opposite: our method requires extra optimizations that dominate training time at extremely small $D$; at very large $D$ we use estimator (6), so our method requires less overall time. The nonmonotonic behavior for reuters (Figure 3(e)) occurs for the following reasons: at $D \approx \mathrm{nnz}(\hat{q})$, sampling indices from the optimized distribution takes a non-negligible fraction of total time, and solving the linear system requires more time when rows of $\Phi$ are not unique (due to sampling). Performance improvements also depend on the kernel choice for a dataset.
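For reference, here is a minimal sketch, with our own naming, of the two remaining random feature maps used on the benchmarks: coordinate-sampling features for the linear kernel and order-2 arc-cosine features (the Gaussian random Fourier map was sketched in Section 2).

```python
import numpy as np

def linear_coordinate_features(X, D, rng=None):
    """phi(x, w) = x_w with w ~ Uni({1, ..., d}): each random feature is a sampled coordinate."""
    rng = np.random.default_rng(rng)
    idx = rng.integers(0, X.shape[1], size=D)
    return X[:, idx]                        # (n, D)

def arccos2_features(X, D, rng=None):
    """phi(x, w) = H(w^T x) (w^T x)^2 with w ~ N(0, I): order-2 arc-cosine features [7]."""
    rng = np.random.default_rng(rng)
    W = rng.normal(size=(D, X.shape[1]))
    Z = X @ W.T
    return np.where(Z > 0.0, Z ** 2, 0.0)   # Heaviside gate times squared activation, (n, D)

# toy usage
X = np.random.default_rng(1).standard_normal((4, 10))
print(linear_coordinate_features(X, D=6, rng=0).shape, arccos2_features(X, D=6, rng=0).shape)
```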
In particular, our method provides the most improvement, in terms of training time for a given amount of generalization error, over random features generated for the linear kernel on the reuters dataset; we are able to surpass the best results of the random feature approach 2 orders of magnitude faster. This makes sense when considering the ability of our method to sample from a small subset of important features. On the other hand, random features for the arc-cosine kernel are able to achieve excellent results on the buzz dataset even without optimization, so our approach only offers modest improvement at small to moderate $D$. For the Gaussian kernel employed on the adult dataset, our method is able to achieve the same generalization performance as random features in roughly 1/12 the training time. Thus, we see that our optimization approach generally achieves competitive results with random features at lower computational cost, and it offers the most improvement when the base kernel either is not well-suited to the data or requires a large number of random features (large $D$) for good performance. In other words, our method reduces the sensitivity of model performance to the user's selection of base kernels.

Comparison with joint optimization. Despite the fact that we do not choose empirical risk as our objective in optimizing kernel compositions, our optimized kernel enjoys competitive generalization performance compared to the joint optimization procedure (9). Because the joint optimization is very costly, we consider subsampled training datasets of 5000 training examples. Results are shown in Table 2, where it is evident that the efficiency of our method outweighs the marginal gain in classification performance for joint optimization.

5 Conclusion

We have developed a method to learn a kernel in a supervised manner using random features. Although we consider a kernel alignment problem similar to other approaches in the literature, we exploit computational advantages offered by random features to develop a much more efficient and scalable optimization procedure. Our concentration bounds guarantee that the results of our optimization procedure closely match the limits of infinite data ($n \to \infty$) and sampling ($N_w \to \infty$), and our method produces models that enjoy good generalization performance guarantees. Empirical evaluations indicate that our optimized kernels indeed learn structure from data, and we attain competitive results on benchmark datasets at a fraction of the training time of other methods. Generalizing the theoretical results for concentration and risk to other $f$-divergences is the subject of further research. More broadly, our approach opens exciting questions regarding the usefulness of simple optimizations on random features in speeding up other traditionally expensive learning problems.

Acknowledgements

This research was supported by a Fannie & John Hertz Foundation Fellowship and a Stanford Graduate Fellowship.

References

[1] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463-482, 2003.
[2] A. Ben-Hur and W. S. Noble. Kernel methods for predicting protein-protein interactions. Bioinformatics, 21(suppl 1):i38-i46, 2005.
[3] A. Ben-Tal, D. den Hertog, A. D. Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341-357, 2013.
[4] D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[5] S. Boucheron, G.
Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[6] M. Cho, S. S. Oh, J. Nie, R. Stewart, M. Eisenstein, J. Chambers, J. D. Marth, F. Walker, J. A. Thomson, and H. T. Soh. Quantitative selection and parallel characterization of aptamers. Proceedings of the National Academy of Sciences, 110(46), 2013.
[7] Y. Cho and L. K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342-350, 2009.
[8] C. Cortes, M. Mohri, and A. Rostamizadeh. Generalization bounds for learning kernels. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 247-254, 2010.
[9] C. Cortes, M. Mohri, and A. Rostamizadeh. Algorithms for learning kernels based on centered alignment. The Journal of Machine Learning Research, 13(1):795-828, 2012.
[10] N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor. On kernel target alignment. In Innovations in Machine Learning, pages 205-256. Springer, 2006.
[11] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[12] D. Duvenaud, J. R. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. arXiv preprint arXiv:1302.4922, 2013.
[13] M. Girolami and S. Rogers. Hierarchic Bayesian models for kernel learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 241-248. ACM, 2005.
[14] M. Gönen and E. Alpaydın. Multiple kernel learning algorithms. The Journal of Machine Learning Research, 12:2211-2268, 2011.
[15] G. E. Hinton and R. R. Salakhutdinov. Using deep belief nets to learn covariance kernels for Gaussian processes. In Advances in Neural Information Processing Systems, pages 1249-1256, 2008.
[16] J. Kandola, J. Shawe-Taylor, and N. Cristianini. Optimizing kernel alignment over combinations of kernels, 2002.
[17] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. Lp-norm multiple kernel learning. The Journal of Machine Learning Research, 12:953-997, 2011.
[18] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, pages 1-50, 2002.
[19] V. Koltchinskii, D. Panchenko, et al. Complexities of convex combinations and bounding the generalization error in classification. The Annals of Statistics, 33(4):1455-1496, 2005.
[20] G. R. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. The Journal of Machine Learning Research, 5:27-72, 2004.
[21] Q. Le, T. Sarlós, and A. Smola. Fastfood: computing Hilbert space expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning, pages 244-252, 2013.
[22] D. Luenberger. Optimization by Vector Space Methods. Wiley, 1969.
[23] S. Qiu and T. Lane. A framework for multiple kernel support vector regression and its applications to siRNA efficacy prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(2):190-199, 2009.
[24] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, 2007.
[25] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: replacing minimization with randomization in learning.
In Advances in Neural Information Processing Systems 21, 2008.
[26] P. Samson. Concentration of measure inequalities for Markov chains and φ-mixing processes. Annals of Probability, 28(1):416-461, 2000.
[27] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[28] Y. Ying, K. Huang, and C. Campbell. Enhanced protein fold recognition through a novel data integration approach. BMC Bioinformatics, 10(1):1, 2009.
[29] A. Zien and C. S. Ong. Multiclass multiple kernel learning. In Proceedings of the 24th International Conference on Machine Learning, pages 1191-1198. ACM, 2007.