# Localized Structured Prediction

Carlo Ciliberto¹ (c.ciliberto@imperial.ac.uk), Francis Bach² (francis.bach@inria.fr), Alessandro Rudi² (alessandro.rudi@inria.fr)

¹ Department of Electrical and Electronic Engineering, Imperial College London, UK.
² INRIA - Département d'informatique, École Normale Supérieure - PSL Research University, Paris, France.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Key to structured prediction is exploiting the problem's structure to simplify the learning process. A major challenge arises when data exhibit a local structure (i.e., are made of "parts") that can be leveraged to better approximate the relation between (parts of) the input and (parts of) the output. Recent literature on signal processing, and in particular computer vision, shows that capturing these aspects is indeed essential to achieve state-of-the-art performance. However, in this context algorithms are typically derived on a case-by-case basis. In this work we propose the first theoretical framework to deal with part-based data from a general perspective and study a novel method within the setting of statistical learning theory. Our analysis is novel in that it explicitly quantifies the benefits of leveraging the part-based structure of a problem on the learning rates of the proposed estimator.

## 1 Introduction

Structured prediction deals with supervised learning problems where the output space is not endowed with a canonical linear metric but has a rich semantic or geometric structure [5, 29]. Typical examples are settings in which the outputs correspond to strings (e.g., captioning [19]), images (e.g., segmentation [1]), rankings [16, 20], points on a manifold [33], probability distributions [24] or protein foldings [18]. While the lack of linearity poses several modeling and computational challenges, this additional complexity comes with a potentially significant advantage: when suitably incorporated within the learning model, knowledge about the structure allows one to capture key properties of the data. This can potentially lower the sample complexity of the problem, attaining better generalization performance with fewer training examples.

A natural scenario in this sense is the case where both input and output data are organized into parts that can interact with one another according to a specific structure. Examples can be found in computer vision (e.g., segmentation [1], localization [6, 22], pixel-wise classification [41]), speech recognition [4, 40], natural language processing [43], trajectory planning [31] or hierarchical classification [44]. Recent literature on the topic has empirically shown that the local structure in the data can indeed lead to significantly better predictions than global approaches [17, 45]. In practice, however, these ideas are typically investigated on a case-by-case basis, leading to ad-hoc algorithms that cannot be easily adapted to new settings. On the theoretical side, few works have considered less specific part-based factorizations [12], and a comprehensive theory analyzing the effect of local interactions between parts within the context of learning theory is still missing. In this paper, we propose: 1) a novel theoretical framework that can be applied to a wide family of structured prediction settings and is able to capture potential local structure in the data, and 2) a structured prediction algorithm, based on this framework, for which we prove universal consistency and generalization rates.
The proposed approach builds on recent results from the structured prediction literature that leverage the concept of implicit embeddings [8, 9, 28, 15, 25], also related to [30, 39]. A key contribution of our analysis is to quantify the impact of the part-based structure of the problem on the learning rates of the proposed estimator. In particular, we prove that, under natural assumptions on the local behavior of the data, our algorithm benefits adaptively from this underlying structure. We support our theoretical findings with experiments on the task of detecting the local orientation of ridges in images depicting human fingerprints.

Figure 1: (Left) Between-locality in a sequence-to-sequence setting: each window (part) $y_p$ of the output sequence $y$ is fully determined by the part $x_p$ of the input sequence $x$, for every $p \in P$. (Right) Empirical within-locality $C_{p,q}$ of 100 images sampled from ImageNet, between a $20 \times 20$ patch $q$ and the central patch $p$.

## 2 Learning with Between- & Within-locality

To formalize the concept of locality within a learning problem, in this work we assume that the data are structured in terms of "parts". Practical examples of this setting often arise in image/audio or language processing, where the signal has a natural factorization into patches or sub-sequences. Following these guiding examples, we assume every input $x \in X$ and output $y \in Y$ to be interpretable as a collection of (possibly overlapping) parts, and denote by $x_p$ (respectively $y_p$) its $p$-th part, with $p \in P$ a set of part identifiers (e.g., the position and size of a patch in an image). We assume input and output to share the same part structure with respect to $P$. To formalize the intuition that the learning problem should interact well with this structure of parts, we introduce two key assumptions: between-locality and within-locality. They characterize, respectively, the interplay between corresponding input-output parts and the correlation of parts within the same input.

**Assumption 1 (Between-locality).** $y_p$ is conditionally independent of $x$ given $x_p$; moreover, the probability of $y_p$ given $x_p$ is the same as that of $y_q$ given $x_q$, for any $p, q \in P$.

Between-locality (BL) assumes that the $p$-th part of the output $y \in Y$ depends only on the $p$-th part of the input $x \in X$; see Fig. 1 (Left) for an intuition in the case of sequence-to-sequence prediction. This is often verified in pixel-wise classification settings, where the class $y_p$ of a pixel $p$ is determined only by the sub-image in the corresponding patch $x_p$. BL essentially corresponds to assuming a joint graphical model on the parts of $x$ and $y$, where each $y_p$ is connected only to $x_p$ and not to other parts. BL motivates us to focus on a local level by directly learning the relation between input-output parts. This is often an effective strategy in computer vision [22, 45, 17] but, intuitively, one that provides significant advantages only when the input parts are not highly correlated with each other: in the extreme case where all parts are identical, there is no advantage in solving the learning problem locally. In this sense it can be useful to measure the amount of covariance
\[
C_{p,q} \;=\; \mathbb{E}_{x}\, S(x_p, x_q) \;-\; \mathbb{E}_{x,x'}\, S(x_p, x'_q) \tag{1}
\]
between two parts $p$ and $q$ of an input $x$, for $S(x_p, x_q)$ a suitable measure of similarity between parts (if $S(x_p, x_q) = x_p x_q$, with $x_p$ and $x_q$ scalar random variables, then $C_{p,q}$ is the $(p,q)$-th entry of the covariance matrix of the vector $(x_1, \dots, x_{|P|})$).
Here $\mathbb{E}_{x}\, S(x_p, x_q)$ and $\mathbb{E}_{x,x'}\, S(x_p, x'_q)$ measure the similarity between the $p$-th and the $q$-th part of, respectively, the same input and two independent ones (in particular, $C_{p,q} = 0$ when the $p$-th and $q$-th parts of $x$ are independent). In many applications, it is reasonable to assume that $C_{p,q}$ decays with the distance between $p$ and $q$.

**Assumption 2 (Within-locality).** There exist a distance $d : P \times P \to \mathbb{R}$ and $\gamma \ge 0$ such that
\[
|C_{p,q}| \;\le\; r^2\, e^{-\gamma\, d(p,q)}, \qquad r^2 = \sup_{x, x'} |S(x, x')|. \tag{2}
\]

Within-locality (WL) is always satisfied for $\gamma = 0$. However, when $x_p$ is independent of $x_q$ for $p \neq q$, it holds with $\gamma = +\infty$ and $d(p,q) = 1 - \delta_{p,q}$, with $\delta_{p,q}$ the Kronecker delta. Exponential decays of correlation are typically observed when the distribution of the parts of $x$ factorizes in a graphical model that connects parts which are close in terms of the distance $d$: although all parts depend on each other, the long-range dependence typically goes to zero exponentially fast in the distance (see, e.g., [26] for mixing properties of Markov chains). Fig. 1 (Right) reports the empirical WL measured on 100 images randomly sampled from ImageNet [13]: each pixel $(i,j)$ reports the value of $C_{p,q}$ of the central patch $p$ with respect to a $20 \times 20$ patch $q$ centered in $(i,j)$. Here $S(x_p, x_q) = x_p^\top x_q$. We note that $C_{p,q}$ decreases extremely fast as a function of the distance $d(p,q)$, suggesting that Assumption 2 holds for a large value of $\gamma$.

**Contributions.** In this work we present a novel structured prediction algorithm that adaptively leverages locality in the learning problem, when present (Sec. 4). We study the generalization properties of the proposed estimator (Sec. 5), showing that it is equivalent to the state of the art in the worst-case scenario. More importantly, if the locality Assumptions 1 and 2 are satisfied, we prove that our learning rates improve proportionally to the number $|P|$ of parts in the problem. Here we give an informal version of this main result, reported in more detail in Thm. 4 (Sec. 5). Below we denote by $\hat f$ the proposed estimator, by $\mathcal{E}(f)$ the expected risk of a function $f : X \to Y$ and by $f^\star = \operatorname{argmin}_f \mathcal{E}(f)$.

**Theorem 1 (Informal - Learning Rates & Locality).** Under mild assumptions on the loss and the data distribution, if the learning problem is local (Assumptions 1 and 2), there exists $c_0 > 0$ such that
\[
\mathbb{E}\,\mathcal{E}(\hat f\,) - \mathcal{E}(f^\star) \;\le\; c_0 \left(\frac{s}{|P|\, n}\right)^{1/4}, \qquad s \;=\; \sup_{p \in P}\ \sum_{q \in P} r^2\, e^{-\gamma\, d(p,q)}, \tag{3}
\]
where the expectation is taken with respect to the sample of $n$ input-output points used to train $\hat f$.

In the worst-case scenario $\gamma = 0$ (no exponential decay of the covariance between parts), the bound in (3) scales as $1/n^{1/4}$ (since $s = r^2 |P|$), recovering [8], where no structure is assumed on the parts. However, as soon as $\gamma > 0$, $s$ can be upper bounded by a constant independent of $|P|$ and thus the rate scales as $1/(|P| n)^{1/4}$, accelerating proportionally to the number of parts. In this sense, Thm. 1 shows the significant benefit of making use of locality. The following example focuses on the special case of sequence-to-sequence prediction.

**Example 1 (Locality on Sequences).** As depicted in Fig. 1, for discrete sequences we can consider parts (e.g., windows) indexed by $P = \{1, \dots, |P|\}$, with $d(p,q) = |p - q|$ for $p, q \in P$ (see Appendix K.1 for more details). In this case, since $\sum_{q \in P} e^{-\gamma |p - q|} \le 1 + 2 \sum_{k \ge 1} e^{-\gamma k} \le 2 (1 - e^{-\gamma})^{-1}$, Assumption 2 leads to
\[
s \;\le\; 2\, r^2 (1 - e^{-\gamma})^{-1}, \tag{4}
\]
which for $\gamma > 0$ is bounded by a constant not depending on the number of parts. Hence, Thm. 1 guarantees a learning rate of order $1/(n|P|)^{1/4}$, which is significantly faster than the rate $1/n^{1/4}$ of methods that do not leverage locality such as [8]. See Sec. 6 for empirical support for this observation.
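To make the quantity in (1) and the empirical maps of Fig. 1 (Right) concrete, the following NumPy sketch estimates the within-locality of the central patch of a batch of images. It is an illustration only: the linear similarity $S(a,b) = a^\top b$ and the $20 \times 20$ patches follow the description above, while the stride, the random permutation used as an "independent copy" $x'$, and the function name are our own choices.

```python
import numpy as np

def empirical_within_locality(images, patch=20, stride=5, seed=0):
    """Empirically estimate C_{p,q} in (1) for the central patch p against
    every patch q on a grid, with similarity S(a, b) = <a, b> on flattened
    patches (as for Fig. 1, Right).

    images: float array of shape (n, H, W), e.g. grayscale images.
    Returns C of shape (n_rows, n_cols), one entry per patch q.
    """
    rng = np.random.default_rng(seed)
    n, H, W = images.shape
    rows = list(range(0, H - patch + 1, stride))
    cols = list(range(0, W - patch + 1, stride))

    # Central patch p of every image, flattened to a vector.
    ci, cj = (H - patch) // 2, (W - patch) // 2
    xp = images[:, ci:ci + patch, cj:cj + patch].reshape(n, -1)
    # Pair each image with a randomly permuted one as a proxy for an
    # independent copy x' of the input.
    xp_prime = xp[rng.permutation(n)]

    C = np.empty((len(rows), len(cols)))
    for a, i in enumerate(rows):
        for b, j in enumerate(cols):
            xq = images[:, i:i + patch, j:j + patch].reshape(n, -1)
            same = np.mean(np.sum(xp * xq, axis=1))         # E_x S(x_p, x_q)
            indep = np.mean(np.sum(xp_prime * xq, axis=1))  # ~ E_{x,x'} S(x_p, x'_q)
            C[a, b] = same - indep
    return C
```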
## 3 Problem Formulation

We denote by $X$, $Y$ and $Z$ respectively the input space, label space and output space of a learning problem. Let $\rho$ be a probability measure on $X \times Y$ and $\triangle : Z \times Y \times X \to \mathbb{R}$ a loss measuring prediction errors between a label $y \in Y$ and an output $z \in Z$, possibly parametrized by an input $x \in X$. To stress this interpretation we adopt the notation $\triangle(z, y\,|\,x)$. Given a finite number of points $(x_i, y_i)_{i=1}^n$ independently sampled from $\rho$, our goal is to approximate the minimizer $f^\star$ of the expected risk
\[
\min_{f : X \to Z}\ \mathcal{E}(f), \qquad \mathcal{E}(f) = \int \triangle(f(x), y\,|\,x)\; d\rho(x, y). \tag{5}
\]

**Loss Made by Parts.** We formalize the intuition introduced in Sec. 2 that data are decomposable into parts: we denote the sets of parts of $X$, $Y$ and $Z$ by respectively $[X]$, $[Y]$ and $[Z]$. These are abstract sets that depend on the problem at hand (see examples below). We assume $P$ to be a set of part indices equipped with a selection operator $X \times P \to [X]$, denoted $(x, p) \mapsto [x]_p$ (analogously for $Y$ and $Z$). When clear from context, we will use the shorthand $x_p = [x]_p$. For simplicity, in the following we will assume $P$ to be finite; however, our analysis generalizes also to the infinite case (see the supplementary material). Let $\pi(\cdot\,|\,x)$ be a probability distribution over the set of parts $P$, conditioned with respect to an input $x \in X$. We study loss functions $\triangle$ that can be represented as
\[
\triangle(z, y\,|\,x) \;=\; \sum_{p \in P} \pi(p\,|\,x)\; L_p(z_p, y_p\,|\,x_p). \tag{6}
\]
The collection $(L_p)_{p \in P}$ is a family of loss functions $L_p : [Z] \times [Y] \times [X] \to \mathbb{R}$, each comparing the $p$-th parts of a label $y$ and of an output $z$. For instance, in an image processing scenario, $L_p$ could measure the similarity between the two images at different locations and scales, indexed by $p$. In this sense, the distribution $\pi(p\,|\,x)$ allows us to weigh each $L_p$ differently depending on the application (e.g., mistakes at large scales could be more relevant than at smaller scales). Various examples of parts and concrete cases are illustrated in the supplementary material; here we report an extract.

**Example 2 (Sequence to Sequence Prediction).** Let $X = A^k$, $Y = Z = B^k$ for two sets $A$, $B$ and $k \in \mathbb{N}$ a fixed length. We consider in this example parts that are windows of length $l \le k$. Then $P = \{1, \dots, k - l + 1\}$, where $p \in P$ indexes the window $x_p = (x^{(p)}, \dots, x^{(p + l - 1)})$, with $x \in X$ and $x^{(s)}$ denoting the $s$-th entry of the sequence $x \in X$ (analogous definitions hold for $y_p$, $z_p$). We choose the loss $L_p$ to be the 0-1 distance between two strings of the same length, $L_p(z_p, y_p\,|\,x) = \mathbf{1}(z_p \neq y_p)$. Finally, we choose $\pi(p\,|\,x) = 1/|P|$, leading to the loss function
\[
\triangle(z, y\,|\,x) \;=\; \frac{1}{|P|} \sum_{p \in P} \mathbf{1}(z_p \neq y_p),
\]
which is common in the context of Conditional Random Fields (CRFs) [21].

The example above highlights a tight connection between the framework considered in this work and the literature on CRFs. However, we care to stress that the two approaches differ in the way they interpret the concepts of loss (used to evaluate fitting errors at training time) and score function (used to estimate predictions at inference time). Specifically, while these functions are two separate entities in CRF settings, they essentially coincide in our framework (i.e., the score is a linear combination of loss functions). However, as shown in Example 2, the resulting score functions for both CRFs and our approach have essentially the same structure. Hence they ultimately lead to the same inference problem [40].
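As a concrete illustration of the factorization (6) in Example 2, the sketch below evaluates the windowed 0-1 loss with uniform weights $\pi(p\,|\,x) = 1/|P|$; the function names and the default window length are hypothetical, chosen only for illustration.

```python
from typing import Sequence

def window_parts(seq: Sequence, l: int):
    """All windows of length l of a sequence; parts indexed by P = {0, ..., k - l}."""
    return [tuple(seq[p:p + l]) for p in range(len(seq) - l + 1)]

def delta_by_parts(z: Sequence, y: Sequence, l: int = 3) -> float:
    """Loss (6) for Example 2: uniform weights pi(p|x) = 1/|P| and
    L_p(z_p, y_p) = 1 if the two windows differ, 0 otherwise."""
    zp, yp = window_parts(z, l), window_parts(y, l)
    assert len(zp) == len(yp), "input and output share the same part structure"
    return sum(a != b for a, b in zip(zp, yp)) / len(zp)

# Two sequences over B = {0, 1} of length k = 8 that differ in one entry:
# 3 of the 6 windows of length 3 are affected, so the loss is 0.5.
print(delta_by_parts([0, 1, 1, 0, 0, 1, 0, 1], [0, 1, 1, 1, 0, 1, 0, 1]))
```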
We conclude this section by providing additional examples of loss functions decomposable into parts.

**Remark 1 (Examples of Loss Functions by Parts).** Several loss functions used in machine learning have a natural formulation in terms of (6). Notable examples are the Hamming distance [10, 42, 11], used in settings such as hierarchical classification [44], computer vision [29, 45, 41] or trajectory planning [31], to name a few. Also, loss functions used in natural language processing, such as precision/recall and the F1 score, can be written in this form. Finally, we point out that multi-task learning settings [27] can be seen as problems by parts, with the loss corresponding to the sum of standard regression/classification loss functions (least-squares, logistic, etc.) over the tasks/parts.

## 4 Algorithm

In this section we introduce our estimator for structured prediction problems with parts. Our approach starts with an auxiliary step for dataset generation that explicitly extracts the parts from the data.

**Auxiliary Dataset Generation.** The locality assumptions introduced in Sec. 2 motivate us to learn the local relations between individual parts $p \in P$ of each input-output pair. In this sense, given a training dataset $D = (x_i, y_i)_{i=1}^n$, a first step would be to extract a new, part-based dataset $\{(x_p, p, y_p) \mid (x, y) \in D,\ p \in P\}$. However, in most applications the cardinality $|P|$ of the set of parts can be very large (possibly infinite, as we discuss in the Appendix), making this process impractical. Instead, we generate an auxiliary dataset by randomly sub-sampling $m \in \mathbb{N}$ elements from the part-based dataset. Concretely, for $j \in \{1, \dots, m\}$, we first sample $i_j$ according to the uniform distribution $U_n$ on $\{1, \dots, n\}$, set $\chi_j = x_{i_j}$, sample $p_j \sim \pi(\cdot\,|\,\chi_j)$ and finally set $\eta_j = [y_{i_j}]_{p_j}$. This leads to the auxiliary dataset $D' = (\chi_j, p_j, \eta_j)_{j=1}^m$, as summarized in the GENERATE routine of Alg. 1.

**Estimator.** Given the auxiliary dataset, we propose the estimator $\hat f : X \to Z$ such that, for all $x \in X$,
\[
\hat f(x) \;=\; \operatorname*{argmin}_{z \in Z}\ \sum_{p \in P} \pi(p\,|\,x) \sum_{j=1}^m \alpha_j(x, p)\; L_p(z_p, \eta_j\,|\,x_p). \tag{7}
\]
The functions $\alpha_j : X \times P \to \mathbb{R}$ are learned from the auxiliary dataset and are the fundamental components allowing our estimator to capture the part-based structure of the learning problem.

Figure 2: Illustration of the prediction process of the Localized Structured Prediction estimator (7) for a hypothetical computer vision application, highlighting the observed similarity between input parts and the implied similarity between output parts.

**Algorithm 1.**
*Input:* training set $(x_i, y_i)_{i=1}^n$, distribution $\pi(\cdot\,|\,x)$, a reproducing kernel $k$ on $X \times P$, hyperparameter $\lambda > 0$, auxiliary dataset size $m \in \mathbb{N}$.
*GENERATE* the auxiliary set $(\eta_j, \chi_j, p_j)_{j=1}^m$: for $j = 1, \dots, m$, sample $i_j \sim U_n$; set $\chi_j = x_{i_j}$; sample $p_j \sim \pi(\cdot\,|\,\chi_j)$; set $\eta_j = [y_{i_j}]_{p_j}$.
*LEARN* the coefficients for the map $\alpha$: set $K$ with $K_{jj'} = k((\chi_j, p_j), (\chi_{j'}, p_{j'}))$; $A = (K + m\lambda I)^{-1}$.
*Return* the map $\alpha : (x, p) \mapsto A\, v(x, p) \in \mathbb{R}^m$, with $v(x, p)_j = k\big((\chi_j, p_j), (x, p)\big)$.

Indeed, for any test point $x \in X$ and part $p \in P$, the value $\alpha_j(x, p)$ can be interpreted as a measure of how similar $x_p$ is to the $p_j$-th part of the auxiliary training point $\chi_j$. For instance, assume $\alpha_j(x, p)$ to be an approximation of the delta function that is $1$ when $x_p = [\chi_j]_{p_j}$ and $0$ otherwise. Then,
\[
\alpha_j(x, p)\, L_p(z_p, \eta_j\,|\,x_p) \;\approx\; \delta\big(x_p, [\chi_j]_{p_j}\big)\, L_p(z_p, \eta_j\,|\,x_p), \tag{8}
\]
which essentially implies that
\[
x_p \approx [\chi_j]_{p_j} \;\Longrightarrow\; z_p \approx \eta_j. \tag{9}
\]
In other words, if the $p$-th part of the test input $x$ and the $p_j$-th part of the auxiliary training input $\chi_j$ (i.e., the $p_j$-th part of the training input $x_{i_j}$) are deemed similar, then the estimator will encourage the $p$-th part of the test output $z$ to be similar to the auxiliary part $\eta_j$. This process is illustrated in Fig. 2 for an ideal computer vision application: for a given test image $x$, the scores $\alpha$ detect a similarity between the $p$-th patch of $x$ and the $p_j$-th patch of the training input $x_{i_j}$. Hence, the estimator will enforce the $p$-th patch of the output $z$ to be similar to the $p_j$-th patch of the training label $y_{i_j}$.
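A minimal sketch of the GENERATE routine of Alg. 1 is given below, under the interface assumptions stated in the comments; the function names `generate_auxiliary`, `parts_of` and `sample_part` are ours, not the authors'.

```python
import numpy as np

def generate_auxiliary(X, Y, parts_of, sample_part, m, seed=0):
    """GENERATE step of Alg. 1: sub-sample m triplets (chi_j, p_j, eta_j)
    from the (implicit) part-based dataset.

    X, Y        : lists of inputs and outputs of length n
    parts_of    : function (obj, p) -> its p-th part, i.e. [obj]_p
    sample_part : function (x, rng) -> a part index p drawn from pi(.|x)
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    aux = []
    for _ in range(m):
        i = rng.integers(n)            # i_j ~ uniform on {1, ..., n}
        chi = X[i]                     # chi_j = x_{i_j}
        p = sample_part(chi, rng)      # p_j ~ pi(. | chi_j)
        eta = parts_of(Y[i], p)        # eta_j = [y_{i_j}]_{p_j}
        aux.append((chi, p, eta))
    return aux
```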
**Learning $\alpha$.** In line with previous work on structured prediction [8], we learn each $\alpha_j$ by solving a linear system for a problem akin to kernel ridge regression (see Sec. 5 for the theoretical motivation). In particular, let $k : (X \times P) \times (X \times P) \to \mathbb{R}$ be a positive definite kernel; we define
\[
\big(\alpha_1(x, p), \dots, \alpha_m(x, p)\big)^\top \;=\; (K + m\lambda I)^{-1} v(x, p), \tag{10}
\]
where $K \in \mathbb{R}^{m \times m}$ is the empirical kernel matrix with entries $K_{jh} = k((\chi_j, p_j), (\chi_h, p_h))$ and $v(x, p) \in \mathbb{R}^m$ is the vector with entries $v(x, p)_j = k((\chi_j, p_j), (x, p))$. Training the proposed algorithm consists in precomputing $A = (K + m\lambda I)^{-1}$ to evaluate the coefficients $\alpha$, as detailed in the LEARN routine of Alg. 1. While computing $A$ amounts to solving a linear system, which requires $O(m^3)$ operations, we note that it is possible to achieve the same statistical accuracy with reduced complexity $O(m\sqrt{m})$ by means of low-rank approximations (see [14, 32]).

**Remark 2 (Evaluating $\hat f$).** According to (7), evaluating $\hat f$ on a test point $x \in X$ consists in solving an optimization problem over the output space $Z$. This is a standard strategy in structured prediction, where an optimization protocol is derived on a case-by-case basis depending on both $\triangle$ and $Z$ (see, e.g., [29]). Hence, from a computational viewpoint, the inference step in this work is not more demanding than in previous methods (while also enjoying strong theoretical guarantees on the prediction performance, as discussed in Sec. 5). Moreover, the specific form of our estimator suggests a general stochastic meta-algorithm to address the inference problem in special settings. In particular, we can reformulate (7) as
\[
\hat f(x) \;=\; \operatorname*{argmin}_{z \in Z}\ \mathbb{E}_{j, p}\; h_{j, p}(z\,|\,x), \tag{11}
\]
with $p$ sampled according to $\pi(\cdot\,|\,x)$, $j \in \{1, \dots, m\}$ sampled according to the weights $\alpha_j(x, p)$, and $h_{j,p}$ suitably defined in terms of $L_p$. When the $h_{j,p}$ are (sub)differentiable, (11) can be effectively addressed by stochastic gradient methods (SGM). In Alg. 3 in Appendix J we give an example of this strategy.
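Correspondingly, here is a minimal sketch of the LEARN routine of Alg. 1 and of the evaluation of the map $\alpha$ in (10), reusing the auxiliary triplets produced by the GENERATE sketch above; the names and the explicit matrix inversion are illustrative choices, not the authors' implementation.

```python
import numpy as np

def learn_alpha(aux, kernel, lam):
    """LEARN step of Alg. 1: precompute A = (K + m*lambda*I)^{-1} on the
    auxiliary set and return a function evaluating alpha(x, p) as in (10).

    aux    : list of triplets (chi_j, p_j, eta_j) from the GENERATE step
    kernel : function ((x, p), (x', p')) -> R, a p.d. kernel on X x P
    lam    : regularization parameter lambda > 0
    """
    m = len(aux)
    anchors = [(chi, p) for chi, p, _ in aux]
    K = np.array([[kernel(a, b) for b in anchors] for a in anchors])
    A = np.linalg.inv(K + m * lam * np.eye(m))  # or solve a system per query

    def alpha(x, p):
        v = np.array([kernel(a, (x, p)) for a in anchors])  # entries v(x, p)_j
        return A @ v                                        # weights alpha_j(x, p)

    return alpha
```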
## 5 Generalization Properties of Structured Prediction with Parts

In this section we study the statistical properties of the proposed algorithm, with particular attention to the impact of locality on learning rates; see Thm. 4 (for a complete analysis of universal consistency and learning rates without locality assumptions, see Appendices F and H). Our analysis leverages the assumption that the loss function $\triangle$ is a Structure Encoding Loss Function (SELF) by Parts.

**Definition 1 (SELF by Parts).** A function $\triangle : Z \times Y \times X \to \mathbb{R}$ is a Structure Encoding Loss Function (SELF) by Parts if it admits a factorization in the form of (6) with functions $L_p : [Z] \times [Y] \times [X] \to \mathbb{R}$, and there exist a separable Hilbert space $\mathcal{H}$ and two bounded maps $\psi : [Z] \times [X] \times P \to \mathcal{H}$ and $\varphi : [Y] \to \mathcal{H}$ such that, for any $\zeta \in [Z]$, $\upsilon \in [Y]$, $\xi \in [X]$ and $p \in P$,
\[
L_p(\zeta, \upsilon\,|\,\xi) \;=\; \big\langle \psi(\zeta, \xi, p),\ \varphi(\upsilon) \big\rangle_{\mathcal{H}}. \tag{12}
\]

The definition of SELF by Parts specializes the definition of SELF in [9], and in the following we will always assume $\triangle$ to satisfy it. Indeed, Def. 1 is satisfied when the spaces of parts involved are discrete sets, and it is rather mild in the general case (see [8] for an exhaustive list of examples).

Note that when $\triangle$ is SELF, the solution of (5) is completely characterized in terms of the conditional expectation (related to the conditional mean embedding [7, 23, 36, 34]) of $\varphi(y_p)$ given $x$, denoted $g_\star : X \times P \to \mathcal{H}$, as follows.

**Lemma 2.** Let $\triangle$ be SELF and $Z$ compact. Then, the minimizer of (5) is $X$-a.e. characterized by
\[
f^\star(x) = \operatorname*{argmin}_{z \in Z}\ \sum_{p \in P} \pi(p\,|\,x)\, \big\langle \psi(z_p, x_p, p),\ g_\star(x, p) \big\rangle_{\mathcal{H}}, \qquad g_\star(x, p) = \int \varphi(y_p)\; d\rho(y\,|\,x). \tag{13}
\]

Lemma 2 (proved in Appendix C) shows that $f^\star$ is completely characterized in terms of the conditional expectation $g_\star$, which indeed plays a key role in controlling the learning rates of $\hat f$. In particular, we investigate the learning rates in light of the two assumptions of between- and within-locality introduced in Sec. 2. To this end, we first study the direct effects of these two assumptions on the learning framework introduced in this work.

**The effect of Between-locality.** We start by observing that the between-locality between parts of the input and parts of the output allows for a refined characterization of the conditional mean $g_\star$.

**Lemma 3.** Let $g_\star$ be defined as in (13). Under Assumption 1, there exists $\bar g_\star : [X] \to \mathcal{H}$ such that
\[
g_\star(x, p) = \bar g_\star(x_p) \qquad \forall\, x \in X,\ p \in P. \tag{14}
\]

Lemma 3 shows that we can learn $g_\star$ by focusing on a simpler problem, identified by the function $\bar g_\star$ acting only on the parts $[X]$ of $X$ rather than on the whole input directly (for a proof see Lemma 21 in Appendix G). This motivates the adoption of the restriction kernel [6], namely a function $k : (X \times P) \times (X \times P) \to \mathbb{R}$ such that
\[
k\big((x, p), (x', q)\big) \;=\; \bar k(x_p, x'_q), \tag{15}
\]
which, for any pair of inputs $x, x' \in X$ and parts $p, q \in P$, measures the similarity between the $p$-th part of $x$ and the $q$-th part of $x'$ via a kernel $\bar k : [X] \times [X] \to \mathbb{R}$ on the parts of $X$. The restriction kernel is a well-established tool in structured prediction settings [6] and has been observed to be remarkably effective in computer vision applications [22, 45, 17].

**The effect of Within-locality.** We recall that within-locality characterizes the statistical correlation between two different parts of the input (see Assumption 2). To this end, we consider the simplified scenario where the parts are sampled from the uniform distribution on $P$, i.e., $\pi(p\,|\,x) = 1/|P|$ for any $x \in X$ and $p \in P$. While more general situations can be considered, this setting is useful to illustrate the effect we are interested in. We now define some important quantities that characterize the learning rates under locality,
\[
C_{p,q} \;=\; \mathbb{E}_{x}\, \bar k(x_p, x_q)^2 \;-\; \mathbb{E}_{x,x'}\, \bar k(x_p, x'_q)^2, \qquad r \;=\; \sup_{x \in X,\ p \in P} \bar k(x_p, x_p). \tag{16}
\]
The terms $C_{p,q}$ and $r$ above correspond, respectively, to the correlations introduced in (1) and the scale parameter introduced in (2), with similarity function $S = \bar k^2$. Let $\hat f$ be the structured prediction estimator in (7), learned using the restriction kernel in (15) based on $\bar k$, and denote by $\mathcal{G}$ the space of functions $\mathcal{G} = \mathcal{H} \otimes \mathcal{F}$, with $\mathcal{F}$ the reproducing kernel Hilbert space [3] associated with $\bar k$. In particular, in the following we will consider the standard assumption in the context of non-parametric estimation [7] on the regularity of the target function, which in our context reads as $\bar g_\star \in \mathcal{G}$. Finally, we introduce $c_\triangle^2 = \sup_{z \in Z,\, x \in X} \sum_{p \in P} \|\psi(z_p, x_p, p)\|_{\mathcal{H}}^2$ to measure the complexity of the loss $\triangle$ with respect to the representation induced by the SELF decomposition (Def. 1), analogously to Thm. 2 of [8].
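Before stating our main result, we note how the restriction kernel (15) can be instantiated in code; the sketch below is compatible with the `learn_alpha` sketch of Sec. 4 and assumes a user-provided part-selection function (all names are ours, for illustration only).

```python
import numpy as np

def restriction_kernel(get_part, base_kernel):
    """Restriction kernel (15): k((x, p), (x', q)) = kbar(x_p, x'_q),
    where kbar is a kernel on the parts [X] and get_part(x, p) = [x]_p."""
    def k(a, b):
        (x1, p), (x2, q) = a, b
        return base_kernel(get_part(x1, p), get_part(x2, q))
    return k

def gaussian_kbar(sigma=1.0):
    """A Gaussian kernel on flattened parts (e.g., image patches)."""
    def kbar(u, v):
        u, v = np.ravel(u), np.ravel(v)
        return float(np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2)))
    return kbar

# Usage with the learn_alpha sketch above (extract_patch is a placeholder
# for the user's own part-selection routine):
# k = restriction_kernel(get_part=extract_patch, base_kernel=gaussian_kbar(0.5))
# alpha = learn_alpha(aux, kernel=k, lam=1e-3)
```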
**Theorem 4 (Learning Rates & Locality).** Under Assumptions 1 and 2 with $S = \bar k^2$, let $\bar g_\star$ satisfy Lemma 3, with $\mathsf{g} = \|\bar g_\star\|_{\mathcal{G}} < \infty$. Let $s$ be as in (3). When $\lambda = \big(r^2/m + s/(|P|n)\big)^{1/2}$, then
\[
\mathbb{E}\,\mathcal{E}(\hat f\,) - \mathcal{E}(f^\star) \;\le\; 12\, c_\triangle\, \mathsf{g} \left(\frac{r^2}{m} + \frac{s}{|P|\,n}\right)^{1/4}. \tag{17}
\]

The proof of the result above can be found in Appendix G.1. We can see that between- and within-locality allow us to refine (and potentially improve) the bound of order $n^{-1/4}$ from structured prediction without locality [8] (see also Thm. 5 in Appendix F). In particular, we observe that the adoption of the restriction kernel in Thm. 4 allows the structured prediction estimator to leverage within-locality, gaining a benefit proportional to the magnitude of the parameter $\gamma$. Indeed, $r^2 \le s \le r^2 |P|$ by definition. More precisely, if $\gamma = 0$ (e.g., all parts are identical copies) then $s = r^2 |P|$ and we recover the rate $O(n^{-1/4})$ of [8], while if $\gamma$ is large (the parts are almost uncorrelated) then $s \approx r^2$ and we can take $m \propto n|P|$, achieving a rate of order $O\big((n|P|)^{-1/4}\big)$. We clearly see that, depending on the amount of within-locality in the learning problem, the proposed estimator is able to gain significantly in terms of finite sample bounds.

## 6 Empirical Evaluation

We evaluate the proposed estimator on simulated as well as real data. We highlight how locality leads to improved generalization performance, in particular when only few training examples are available.

**Learning the Direction of Ridges in Fingerprints.** Similarly to [37], we considered the problem of detecting the pointwise direction of ridges in a fingerprint image on the FVC04 dataset¹, comprising 80 grayscale 640×480 input images depicting fingerprints and corresponding output images encoding in each pixel the local direction of the ridges of the input fingerprint as an angle $\theta \in [-\pi, \pi]$. A natural loss function is the average pixel-wise error $\sin(\theta - \theta')^2$ between a ground-truth angle $\theta$ and the prediction $\theta'$, according to the geodesic distance on the sphere. To apply the proposed algorithm, we consider the following representation of the loss in terms of parts: let $P$ be the collection of patches of dimension $20 \times 20$, spaced every 5 pixels in each direction², so that each pixel belongs to exactly 16 patches. For all $z, y \in \mathbb{R}^{640 \times 480}$, the average pixel-wise error is
\[
\triangle(z, y) \;=\; \frac{1}{|P|} \sum_{p \in P} L(z_p, y_p), \qquad L(\zeta, \upsilon) \;=\; \frac{1}{20 \cdot 20} \sum_{i,j} \sin\big([\zeta]_{ij} - [\upsilon]_{ij}\big)^2, \tag{18}
\]
where $\zeta = z_p$ and $\upsilon = y_p \in [-\pi, \pi]^{20 \times 20}$ are the extracted patches and $[\zeta]_{ij}$ their value at pixel $(i, j)$.

¹ http://bias.csr.unibo.it/fvc2004, DB1_B. The output is obtained by applying 7×7 Sobel filtering.
² For simplicity we assume circular images, namely $[x]_{i,j} = [x]_{(i \bmod 640),(j \bmod 480)}$.
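For concreteness, here is a sketch of the patch-decomposed loss (18) under our reading of the setup (20×20 patches, stride 5, circular indexing as in footnote 2); the function name and the padding-based implementation are our own.

```python
import numpy as np

def angular_patch_loss(z, y, patch=20, stride=5):
    """Loss (18): average over 20x20 patches (stride 5, circular indexing) of
    the mean squared sine of the angular difference at each pixel.

    z, y: angle maps in [-pi, pi] of shape (H, W).
    """
    H, W = z.shape
    diff2 = np.sin(z - y) ** 2                       # pixel-wise sin(theta - theta')^2
    # Circular padding so that every patch is well defined ("circular images").
    pad = np.pad(diff2, ((0, patch - 1), (0, patch - 1)), mode="wrap")
    losses = [
        pad[i:i + patch, j:j + patch].mean()         # L(z_p, y_p) for patch p
        for i in range(0, H, stride)
        for j in range(0, W, stride)
    ]
    return float(np.mean(losses))                    # (1/|P|) sum_p L(z_p, y_p)
```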
We compared our approach using $\triangle$ (Local-$\triangle$) or least-squares (Local-LS) with competitors that do not take into account the local structure of the problem, namely standard vector-valued kernel ridge regression (KRLS) [7] and the structured prediction algorithm in [8] with the $\triangle$ loss ($\triangle$-Global). We used a Gaussian kernel on the input (for the local estimators, the restriction kernel in (15) with $\bar k$ Gaussian). We randomly sampled 50/30 images for training/testing, performing 5-fold cross-validation on $\lambda \in [10^{-6}, 10]$ (log spaced) and on the kernel bandwidth in $[10^{-3}, 1]$. For Local-$\triangle$ and Local-LS we built an auxiliary set with $m = 30000$ random patches (see Sec. 4), sampled from the 50 training images.

Figure 3: Learning the direction of ridges in fingerprint images. (Left) Examples of ground truths and predictions, with pixel color corresponding to the local direction of ridges. (Right) Test error according to $\triangle$ in (18).

**Results.** Fig. 3 (Left) reports the average prediction error across 10 random train-test splits. We make two observations. First, methods that leverage the locality in the data are consistently superior to their global counterparts, supporting our theoretical results in Sec. 5 that the proposed estimator can lead to significantly better performance, in particular when few training points are available. Second, the experiment suggests that choosing the right loss is critical, since exploiting locality without the right loss (i.e., Local-LS in the figure) generally leads to worse performance. The three sample predictions in Fig. 3 (Right) provide more qualitative insights on the models tested. In particular, while both locality-aware methods are able to recover the correct structure of the fingerprints, only combining this information with the loss $\triangle$ leads to accurate recovery of the ridge orientation.

Figure 4: Empirical estimation of within-locality for the central patch of the fingerprints dataset.

**Within-locality.** In Fig. 4 we visualize the (empirical) within-locality of the central patch $p$ for the fingerprint dataset. The figure depicts $C_{p,q}$ (defined in (16)) for $q \in P$, with the $(i,j)$-th pixel in the image corresponding to $C_{p,q}$ for $q$ the $20 \times 20$ patch centered in $(i,j)$. The fast decay of these values as the distance from the central patch $p$ increases suggests that within-locality holds for a large value of $\gamma$, possibly justifying the good performance exhibited by Local-$\triangle$ in light of Thm. 4.

**Simulation: Within-Locality.** We complement our analysis with synthetic experiments in which we control the amount of within-locality $\gamma$. We considered a setting where input points are vectors $x \in \mathbb{R}^{k|P|}$ comprising $|P|$ parts of dimension $k = 1000$. Inputs are sampled according to a normal distribution with zero mean and covariance $\Sigma(\gamma) = M(\gamma) \otimes I$, where $M(\gamma) \in \mathbb{R}^{|P| \times |P|}$ has entries $M(\gamma)_{pq} = e^{-\gamma\, d(p,q)}$ and $d(p,q) = |p - q|/|P|$. By design, as $\gamma$ grows, $M(\gamma)$ varies from being rank-one (all parts are identical copies) to diagonal (all parts are independently sampled). To isolate the effect of within-locality on learning, we tested our estimator on a linear multitask (actually vector-valued) regression problem with the least-squares loss as $\triangle$. We generated datasets $(x_i, y_i)_{i=1}^n$ of size $n = 100$ for training and $n = 1000$ for testing, with $x_i$ sampled as described above and $y_i = w^\top x_i + \epsilon$, with noise $\epsilon \in \mathbb{R}^{k|P|}$ sampled from an isotropic Gaussian with standard deviation 0.5. To guarantee that between-locality holds, we generated the target vector $w = [\bar w, \dots, \bar w] \in \mathbb{R}^{k|P|}$ by concatenating copies of a $\bar w \in \mathbb{R}^k$ sampled uniformly on the radius-one ball. We performed regression with the linear restriction kernel on the parts/subvectors (Local-LS) on the full auxiliary dataset $([x_i]_p, [y_i]_p)$ with $1 \le i \le n$ and $1 \le p \le |P|$, and compared it with standard linear regression (Global-LS) on the original dataset $(x_i, y_i)_{i=1}^n$ and with linear regression performed independently for each (local) sub-dataset $([x_i]_p, [y_i]_p)_{i=1}^n$ (Independent-Parts-LS). The parameter $\lambda$ was chosen by hold-out cross-validation in $[10^{-6}, 10]$ (log spaced).
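The synthetic data generation described above can be sketched as follows. This is an illustration under simplifying assumptions (a scalar response in place of the vector-valued target, and $\bar w$ sampled on the unit sphere); all names are ours.

```python
import numpy as np

def sample_synthetic(n, num_parts, k=1000, gamma=1.0, noise=0.5, seed=0):
    """Inputs x ~ N(0, M(gamma) kron I_k) with M(gamma)_{pq} = exp(-gamma |p-q| / |P|),
    and a linear target y = <w, x> + noise, with w = [w_bar, ..., w_bar]
    (copies of a unit vector) so that between-locality holds across parts."""
    rng = np.random.default_rng(seed)
    p_idx = np.arange(num_parts)
    M = np.exp(-gamma * np.abs(p_idx[:, None] - p_idx[None, :]) / num_parts)
    L = np.linalg.cholesky(M + 1e-10 * np.eye(num_parts))  # M = L L^T (jitter for gamma=0)

    # z has i.i.d. N(0,1) entries; x = (L kron I_k) z has covariance M kron I_k.
    z = rng.standard_normal((n, num_parts, k))
    x = np.einsum("pq,nqk->npk", L, z).reshape(n, num_parts * k)

    w_bar = rng.standard_normal(k)
    w_bar /= np.linalg.norm(w_bar)              # uniform on the unit sphere
    w = np.tile(w_bar, num_parts)               # w = [w_bar, ..., w_bar]
    y = x @ w + noise * rng.standard_normal(n)  # scalar response per input
    return x, y, w
```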
Figure 5: Effect of within-locality w.r.t. $\gamma$ and $|P|$: Global-LS vs. Independent-Parts-LS vs. Local-LS (ours).

Fig. 5 reports the (log-scale) mean squared error (MSE) across 100 runs of the three estimators for increasing values of $\gamma$ and $|P|$. In line with Thm. 4, when $\gamma$ and $|P|$ are large, Local-LS significantly outperforms both i) Global-LS, which solves one single problem jointly and does not benefit from within-locality, and ii) Independent-Parts-LS, which is insensitive to the between-locality across parts and solves each local prediction problem in isolation. For smaller $\gamma$, this advantage becomes less prominent even when the number of parts is large. This is expected, since for $\gamma = 0$ the input parts are extremely correlated and there is no within-locality that can be exploited.

## 7 Conclusion

We proposed a novel approach for structured prediction in the presence of locality in the data. Our method builds on [8] by incorporating knowledge of the parts directly within the learning model. We proved the benefits of locality by showing that, under a low-correlation assumption on the parts of the input (within-locality), the learning rates of our estimator can improve proportionally to the number of parts in the data. To obtain this result we additionally introduced a natural assumption on the conditional independence between input-output parts (between-locality), which also provides a formal justification for the adoption of the so-called restriction kernel, previously proposed in the literature, as a means to lower the sample complexity of the problem. Empirical evaluation on synthetic as well as real data shows that our approach offers significant advantages when few training points are available and leveraging structural information such as locality is crucial to achieve good prediction performance. We identify two main directions for future work: 1) consider settings where the parts are unknown (or latent) and need to be discovered/learned from data; 2) consider more general locality assumptions. In particular, we argue that Assumption 2 (WL) might be weakened to account for different (but related) local input-output relations across adjacent parts.

## References

[1] Karteek Alahari, Pushmeet Kohli, and Philip H. S. Torr. Reduce, reuse & recycle: Efficiently solving multi-label MRFs. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1-8, 2008.
[2] Charalambos D. Aliprantis and Kim Border. Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer Science & Business Media, 2006.
[3] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404, 1950.
[4] Lalit Bahl, Peter Brown, Peter De Souza, and Robert Mercer. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 11, pages 49-52, 1986.
[5] G. H. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and S. V. N. Vishwanathan. Predicting Structured Data. MIT Press, 2007.
[6] Matthew B. Blaschko and Christoph H. Lampert. Learning to localize objects with structured output regression. In European Conference on Computer Vision, pages 2-15. Springer, 2008.
[7] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331-368, 2007.
[8] Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A consistent regularization approach for structured prediction. In Advances in Neural Information Processing Systems 29 (NIPS), pages 4412-4420, 2016.
[9] Carlo Ciliberto, Alessandro Rudi, Lorenzo Rosasco, and Massimiliano Pontil. Consistent multi-task learning with nonlinear output relations. In Advances in Neural Information Processing Systems, pages 1983-1993, 2017.
[10] Michael Collins. Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In New Developments in Parsing Technology, pages 19-55. Springer, 2004.
[11] Corinna Cortes, Vitaly Kuznetsov, and Mehryar Mohri. Ensemble methods for structured prediction. In International Conference on Machine Learning, pages 1134-1142, 2014.
[12] Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. Structured prediction theory based on factor graph complexity. In Advances in Neural Information Processing Systems, pages 2514-2522, 2016.
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.
[14] Aymeric Dieuleveut, Nicolas Flammarion, and Francis Bach. Harder, better, faster, stronger convergence rates for least-squares regression. Journal of Machine Learning Research, 18(1):3520-3570, 2017.
[15] Moussab Djerrab, Alexandre Garcia, Maxime Sangnier, and Florence d'Alché-Buc. Output Fisher embedding regression. Machine Learning, 107(8-10):1229-1256, 2018.
[16] John C. Duchi, Lester W. Mackey, and Michael I. Jordan. On the consistency of ranking algorithms. In Proceedings of the International Conference on Machine Learning (ICML), pages 327-334, 2010.
[17] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627-1645, 2010.
[18] Thorsten Joachims, Thomas Hofmann, Yisong Yue, and Chun-Nam Yu. Predicting structured objects with support vector machines. Communications of the ACM, 52(11):97-104, 2009.
[19] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 3128-3137, 2015.
[20] Anna Korba, Alexandre Garcia, and Florence d'Alché-Buc. A structured prediction approach for label ranking. In Advances in Neural Information Processing Systems, pages 8994-9004, 2018.
[21] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), 2001.
[22] Christoph H. Lampert, Matthew B. Blaschko, and Thomas Hofmann. Efficient subwindow search: A branch and bound framework for object localization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(12):2129-2142, 2009.
[23] Guy Lever, Luca Baldassarre, Sam Patterson, Arthur Gretton, Massimiliano Pontil, and Steffen Grünewälder. Conditional mean embeddings as regressors. In International Conference on Machine Learning (ICML), volume 5, 2012.
[24] Giulia Luise, Alessandro Rudi, Massimiliano Pontil, and Carlo Ciliberto. Differential properties of Sinkhorn approximation for learning with Wasserstein distance. In Advances in Neural Information Processing Systems, pages 5859-5870, 2018.
[25] Giulia Luise, Dimitris Stamos, Massimiliano Pontil, and Carlo Ciliberto. Leveraging low-rank relations between surrogate tasks in structured prediction. In International Conference on Machine Learning (ICML), 2019.
[26] Sean P. Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. Springer Science & Business Media, 2012.
[27] Charles A. Micchelli and Massimiliano Pontil. Kernels for multi-task learning. In Advances in Neural Information Processing Systems, pages 921-928, 2004.
[28] Alex Nowak-Vila, Francis Bach, and Alessandro Rudi. Sharp analysis of learning with discrete losses. In Artificial Intelligence and Statistics (AISTATS), 2018.
[29] Sebastian Nowozin, Christoph H. Lampert, et al. Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 2011.
[30] Anton Osokin, Francis Bach, and Simon Lacoste-Julien. On structured prediction theory with calibrated convex surrogate losses. In Advances in Neural Information Processing Systems, pages 302-313, 2017.
[31] Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. Maximum margin planning. In Proceedings of the International Conference on Machine Learning, pages 729-736. ACM, 2006.
[32] Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, pages 3891-3901, 2017.
[33] Alessandro Rudi, Carlo Ciliberto, Gian Maria Marconi, and Lorenzo Rosasco. Manifold structured prediction. In Advances in Neural Information Processing Systems, pages 5610-5621, 2018.
[34] Rahul Singh, Maneesh Sahani, and Arthur Gretton. Kernel instrumental variable regression. In Advances in Neural Information Processing Systems, 2019.
[35] Steve Smale and Ding-Xuan Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26(2):153-172, 2007.
[36] Le Song, Kenji Fukumizu, and Arthur Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98-111, 2013.
[37] Florian Steinke, Matthias Hein, and Bernhard Schölkopf. Nonparametric regression between general Riemannian manifolds. SIAM Journal on Imaging Sciences, 3(3):527-563, 2010.
[38] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Information Science and Statistics. Springer New York, 2008.
[39] Kirill Struminsky, Simon Lacoste-Julien, and Anton Osokin. Quantifying learning guarantees for convex but inconsistent surrogates. In Advances in Neural Information Processing Systems, pages 669-677, 2018.
[40] Charles Sutton and Andrew McCallum. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4):267-373, 2012.
[41] Martin Szummer, Pushmeet Kohli, and Derek Hoiem. Learning CRFs using graph cuts. In European Conference on Computer Vision, pages 582-595. Springer, 2008.
[42] Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems, pages 25-32, 2004.
[43] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453-1484, 2005.
[44] Devis Tuia, Jordi Munoz-Mari, Mikhail Kanevski, and Gustavo Camps-Valls. Structured output SVM for remote sensing image classification. Journal of Signal Processing Systems, 65(3):301-310, 2011.
[45] Andrea Vedaldi and Andrew Zisserman. Structured output regression for detection with partial truncation. In Advances in Neural Information Processing Systems, pages 1928-1936, 2009.