# Resource Constrained Structured Prediction

Tolga Bolukbasi, Kai-Wei Chang,⁺ Joseph Wang, Venkatesh Saligrama
Boston University, Boston, MA; ⁺University of Virginia, Charlottesville, VA
tolgab@bu.edu, kw@kwchang.net, joewang@bu.edu, srv@bu.edu

**Abstract.** We study the problem of structured prediction under test-time budget constraints. We propose a novel approach based on selectively acquiring computationally costly features at test time in order to reduce the computational cost of prediction with minimal performance degradation. We formulate a novel empirical risk minimization (ERM) problem for policy learning and show that policy learning can be reduced to a series of structured learning problems, resulting in efficient training using existing structured learning algorithms. This framework provides theoretical justification for several existing heuristic approaches found in the literature. We evaluate our proposed adaptive system on two structured prediction tasks, optical character recognition (OCR) and dependency parsing, and show significant reductions in feature costs without degrading accuracy.

## Introduction

Structured prediction is a powerful and flexible framework for making a joint prediction over mutually dependent output variables. It has been successfully applied to a wide range of computer vision and natural language processing tasks, from text classification to human detection. However, the superior performance and flexibility of structured predictors come at the price of increased computation: to construct computationally efficient algorithms, a trade-off must be made between the expressiveness and the speed of structured models.

The cost of inference in structured prediction can be broken down into three parts: acquiring the features, evaluating the part responses, and solving a combinatorial optimization problem to make a prediction based on the part responses. Much of the past research has focused on efficient inference algorithms for specific structures (e.g., the Viterbi and CKY parsing algorithms) and for general structures (e.g., variational inference (Jordan et al. 1999)). However, these methods overlook feature acquisition and part-response evaluation, which are the bottlenecks when the underlying combinatorial problem is relatively simple or can be solved efficiently.

Consider the dependency parsing task, where the goal is to create a directed tree that describes the semantic relations between the words in a sentence. The task can be formulated as a structured prediction problem in which inference amounts to finding a maximum spanning tree (MST) in a directed graph (McDonald et al. 2005). Each node in the graph represents a word, and the score of a directed edge $(x_i, x_j)$ represents how likely $x_j$ is to depend on $x_i$. Fig. 1 shows an example of our selective feature acquisition scheme for this task. Intuitively, our goal is to learn a system that identifies the parts of each example that are incorrectly parsed using cheap features; acquiring expensive features for these parts yields a large reduction in error over the entire structure, thanks to improved distinguishability of the parts and their relationships to other parts.
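To make the MST formulation concrete, the following is a minimal sketch of edge-factored parsing, assuming the NetworkX library; the toy scorer stands in for the learned edge scores and is not from the paper.

```python
import networkx as nx

def parse_mst(n_words, edge_score):
    """Edge-factored dependency parsing: nodes 0..n (0 is the root),
    every directed edge (i, j) is scored, and the predicted parse is
    the maximum spanning arborescence (Chu-Liu/Edmonds)."""
    G = nx.DiGraph()
    for i in range(n_words + 1):          # candidate heads (0 = root)
        for j in range(1, n_words + 1):   # modifiers (the root has no head)
            if i != j:
                G.add_edge(i, j, weight=edge_score(i, j))
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    return sorted(tree.edges())           # list of (head, modifier) pairs

# Toy scorer standing in for the learned sub-scores psi_c(x, y_c):
# prefer short attachments, with a small bonus for attaching to the root.
score = lambda i, j: 10.0 - abs(i - j) + (0.5 if i == 0 else 0.0)
print(parse_mst(5, score))  # e.g. a parse of "I saw a friend today"
```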
We consider two forms of the budgeted structured learning problem: prediction under an expected budget constraint, and anytime prediction. In both cases we consider the streaming test-time scenario, where the system operates on each test example without observing or interacting with other test examples. For both settings, we propose a novel ERM formulation for learning the policy function and reduce it to a weighted structured learning problem. Our proposed approach is general and can be applied to any structured prediction task. Given a structured prediction model, we demonstrate that our approach yields state-of-the-art speedups while maintaining predictive power. We summarize our contributions as follows:

- A novel ERM formulation of structured prediction under expected budget constraints and of anytime structured prediction.
- A reduction of the ERM problem in both settings to conventional structured prediction problems.
- State-of-the-art speedups of existing structured prediction models on two real-world data sets.

Implementation details and proofs are available at https://arxiv.org/abs/1602.08761.

## Budgeted Structured Learning

In this section we formulate the problem of budgeted structured prediction for the expected-budget and anytime settings; each approach is described in more detail in the following sections.

**Structured prediction:** The goal in structured prediction is to learn a function $F$ that maps from an input space $\mathcal{X}$ to a structure space $\mathcal{Y}$. In contrast to multi-class classification, the space of outputs $\mathcal{Y}$ is not simply categorical but is assumed to be an exponentially large space of outputs (often of size varying with the input), containing underlying structure that is generally represented by multiple parts and relationships between the parts.

*Figure 1: A dependency tree for the sentence "Root → I saw a friend today." When predicting the dependency tree, some dependencies (e.g., the dashed edges) are easily resolved, and there is less need for expressive features in making a prediction. Our goal is to learn a system that identifies these dependencies to reduce test-time costs.*

For example, in dependency parsing, $x \in \mathcal{X}$ is a set of features representing a sentence (e.g., words, POS tags), and $y \in \mathcal{Y}$ is a parse tree. In a structured prediction model, the mapping function is often modeled as $F(x) = \arg\max_{y \in \mathcal{Y}} \Psi(x, y)$, where $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is a scoring function. We assume the score decomposes into sub-scores over components $\mathcal{C}$,
$$\Psi(x, y) = \sum_{c \in \mathcal{C}} \psi_c(x, y_c),$$
where $y_c$ is the output assignment associated with component $c$. The number of components, $|\mathcal{C}|$, varies across examples. In the dependency parsing example, each $c$ is an edge in the directed graph and $y_c$ is an indicator variable for whether the edge is in the parse tree; the score of a parse tree is the sum of the scores $\psi_c(x, y_c)$ of its edges.

### Structured Prediction Under an Expected Budget

Our goal is to reduce the cost of prediction at test time (representing computational time, energy consumption, etc.). We consider the case where a variety of scoring functions is available for each component, each with an associated evaluation cost (such as the time or energy required to extract the features it uses). For each example, we define a state $S \in \mathcal{S}$, where the space of states is $\mathcal{S} = \{0, 1\}^{K \times |\mathcal{C}|}$, representing which of the $K$ features are used for each of the $|\mathcal{C}|$ components during prediction. The element $S(k, c) = 1$ indicates that the $k$th feature will be used during prediction for component $c$.
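As a concrete illustration (ours, not the paper's), the state can be held as a boolean $K \times |\mathcal{C}|$ matrix that gates which feature templates contribute to each component score; the `scorers` callables below are hypothetical per-template models.

```python
import numpy as np

K, n_components = 2, 6             # two feature templates, six parts
S = np.zeros((K, n_components), dtype=bool)
S[0, :] = True                     # cheap template acquired everywhere
S[1, [2, 4]] = True                # expensive template only for parts 2 and 4

def component_score(x, y_c, c, S, scorers):
    """psi_c(x, y_c), summed over the feature templates enabled in S.
    scorers[k] is a (hypothetical) scoring model using template k."""
    return sum(scorers[k](x, y_c, c) for k in range(S.shape[0]) if S[k, c])
```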
For any state $S$, we define the evaluation cost
$$c(S) = \sum_{k=1}^{K} \sum_{c \in \mathcal{C}} S(k, c)\, \delta_k,$$
where $\delta_k$ is the (known) cost of evaluating the $k$th feature for a single part. We assume that we are given a structured prediction model $F : \mathcal{X} \times \mathcal{S} \to \mathcal{Y}$ that maps a set of features $X \in \mathcal{X}$ and a state $S \in \mathcal{S}$ to a structured label prediction $\hat{Y} \in \mathcal{Y}$. For a predicted label, we have a loss $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ that maps a predicted structured label $\hat{Y}$ and the true label $Y$ to an error cost, such as the Hamming error $L(\hat{Y}, Y) = \sum_i \mathbb{1}_{\hat{Y}(i) \neq Y(i)}$.

For an example $(X, Y)$ with initial state $S^0$, we define a policy $\pi : \mathcal{X} \to \mathcal{S}$ that maps the features $X$ and the initial state $S^0$ to a new state. For ease of reference, we refer to this policy as the feature selection policy. Our goal is to learn a policy $\pi$, chosen from a family of functions $\Pi$, that minimizes the expected loss subject to a budget constraint:
$$\min_{\pi \in \Pi}\ \mathbb{E}_X\big[L\big(F(X, \pi(X)), Y\big)\big] \quad \text{s.t.} \quad \mathbb{E}_X\big[c(\pi(X))\big] \le B,$$
where $B$ is a user-specified test-time budget. Rather than solving this constrained minimization directly, we define the modified loss
$$C(X, Y, S) = L\big(F(X, S), Y\big) + \lambda\, c(S) \tag{1}$$
which represents the error induced by predicting a label from $X$ using the sensors in $S$, combined with the cost of acquiring those sensors; here $\lambda$ is a trade-off parameter adjusted according to the budget $B$.¹ A small value of $\lambda$ encourages correct classification at the expense of feature cost, whereas a large value of $\lambda$ penalizes the use of costly features, enforcing a tighter budget constraint. We tune this parameter during training for the target budget $B$.

¹Our framework does not restrict the type of modified loss $C(X, Y, S)$ or the state cost $c(S)$, and extends to general losses.

Our goal is to learn a policy with minimal expected modified loss, $\pi^* = \arg\min_{\pi \in \Pi} \mathbb{E}_{\mathcal{D}}\left[C(X, Y, \pi(X))\right]$. In practice, $\mathcal{D}$ is replaced by a set of $n$ i.i.d. training examples:
$$\pi^* = \arg\min_{\pi \in \Pi} \sum_{i=1}^{n} C\big(X_i, Y_i, \pi(X_i)\big). \tag{2}$$
The objective of this minimization can be expanded over the space of states, allowing the optimization problem in (2) to be expressed as
$$\pi^* = \arg\min_{\pi \in \Pi} \sum_{i=1}^{n} \sum_{S \in \mathcal{S}} C(X_i, Y_i, S)\, \mathbb{1}_{\pi(X_i) = S}.$$
From this, we can reformulate the problem of learning a policy as a structured learning problem.

**Theorem 1.** *The minimization in (2) is equivalent to the structured learning problem*
$$\pi^* = \arg\min_{\pi \in \Pi} \sum_{i=1}^{n} \sum_{S \in \mathcal{S}} \Big[\max_{S' \in \mathcal{S}} C(X_i, Y_i, S') - C(X_i, Y_i, S)\Big]\, \mathbb{1}_{\pi(X_i) \neq S}.$$

Proofs can be found in the supplementary material. Theorem 1 maps the policy learning problem in (2) to a weighted structured learning problem: for each example $X_i$, an example/label pair is created for each state $S$, with an importance weight representing the savings lost by not choosing that state. Unfortunately, the expansion of the cost function over the space of states introduces a summation over a combinatorial number of states. To avoid this, we instead introduce an approximation to the objective in (2). Using a single indicator function, we formulate the approximate policy
$$\hat{\pi} = \arg\min_{\pi \in \Pi} \sum_{i=1}^{n} W(X_i, Y_i)\, \mathbb{1}_{\pi(X_i) \neq S^*(X_i, Y_i)} + C\big(X_i, Y_i, S^*(X_i, Y_i)\big), \tag{3}$$
where the pseudo-label is defined as
$$S^*(X_i, Y_i) = \arg\min_{S \in \mathcal{S}} C(X_i, Y_i, S) \tag{4}$$
and the example weight as
$$W(X_i, Y_i) = \max_{S' \in \mathcal{S}} C(X_i, Y_i, S') - C\big(X_i, Y_i, S^*(X_i, Y_i)\big).$$
This formulation reduces the objective from a summation over the combinatorial set of states to a single indicator function per example, and it upper-bounds the original risk.

**Theorem 2.** *The objective in (3) is an upper bound on the objective in (2).*

Note that the second term in (3) does not depend on $\pi$.
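For intuition, on a toy problem small enough to enumerate, the pseudo-label of Eqn. (4) and the weight $W$ can be computed by brute force. The sketch below assumes hypothetical `predict` and `loss` callables; realistic state spaces require the approximate searches described next.

```python
import itertools
import numpy as np

def modified_loss(X, Y, S, predict, loss, delta, lam):
    """Eqn (1): C(X, Y, S) = L(F(X, S), Y) + lambda * c(S)."""
    eval_cost = float(np.sum(S * delta[:, None]))  # c(S) = sum_{k,c} S(k,c) delta_k
    return loss(predict(X, S), Y) + lam * eval_cost

def pseudo_label_and_weight(X, Y, K, n_comp, **kw):
    """Brute force over all 2^(K*|C|) states: Eqn (4) and the weight W."""
    states = [np.array(bits, dtype=bool).reshape(K, n_comp)
              for bits in itertools.product([0, 1], repeat=K * n_comp)]
    costs = [modified_loss(X, Y, S, **kw) for S in states]
    S_star = states[int(np.argmin(costs))]         # pseudo-label S*(X, Y)
    W = max(costs) - min(costs)                    # savings lost by worst state
    return S_star, W
```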
Thus, Theorem 2 leads to an efficient algorithm for learning a policy function $\pi$ by solving the importance-weighted structured learning problem
$$\hat{\pi} = \arg\min_{\pi \in \Pi} \sum_{i=1}^{n} W(X_i, Y_i)\, \mathbb{1}_{\pi(X_i) \neq S^*(X_i, Y_i)}, \tag{5}$$
where each example $X_i$ has a pseudo-label $S^*(X_i, Y_i)$ and an importance weight $W(X_i, Y_i)$.

**Combinatorial search space:** Finding the pseudo-label in Eqn. (4) involves searching over the combinatorially large space of states $\mathcal{S}$, which is computationally intractable. In limited cases, one can apply greedy approaches, as in (Samdani and Roth 2012), for an exact solution. Instead, we present trajectory-based and parsimonious pseudo-labels for approximating $S^*$.

**Trajectory search:** The trajectory-based pseudo-label is a greedy approximation to the optimization in Eqn. (4). To this end, define $\mathcal{S}^t_i$ as the set of 1-stage feasible transitions,
$$\mathcal{S}^t_i = \big\{S \;\big|\; d(S, \hat{S}^{t-1}_i) \le 1,\; S \wedge \hat{S}^{t-1}_i = \hat{S}^{t-1}_i\big\},$$
where $d$ is the Hamming distance (the second condition ensures that previously acquired features are kept). We define a trajectory of states $\hat{S}^t_i$ by
$$\hat{S}^t_i = \arg\min_{S \in \mathcal{S}^t_i} C(X_i, Y_i, S),$$
with the initial state $\hat{S}^0_i = 0^{K \times |\mathcal{C}|}$, in which none of the $K$ features has been evaluated for any of the $|\mathcal{C}|$ components. For each example $i$, we obtain a trajectory $\hat{S}^0_i, \hat{S}^1_i, \ldots, \hat{S}^T_i$, where the terminal state $\hat{S}^T_i$ is the all-ones state. We then choose the pseudo-label from the trajectory:
$$\hat{S}^*_i = \arg\min_{S \in \{\hat{S}^0_i, \ldots, \hat{S}^T_i\}} C(X_i, Y_i, S).$$
By restricting the search to states differing by a single component, the approximation performs only a polynomial number of evaluations, as opposed to the exhaustive combinatorial search in Eqn. (4). Observe that the modified loss is not strictly decreasing along the trajectory, as the cost of adding features may at any point outweigh the reduction in loss. Empirically, this approach is computationally tractable and produces strong results; a sketch follows below.

**Parsimonious search:** Rather than a trajectory search, which requires an inference update each time more features are acquired, we consider an alternative one-stage update. The idea is to look for 1-step transitions that can potentially improve the cost and to apply all of them simultaneously, obviating the need for a trajectory search. In addition, we can demand a guaranteed loss improvement of at least a margin $\tau$, collecting the candidate transitions
$$\mathcal{S}^+_i = \big\{S \in \mathcal{S}^t_i \;\big|\; C(X_i, Y_i, \hat{S}^{t-1}_i) \ge C(X_i, Y_i, S) + \tau\big\}.$$
The candidate transitions are generally non-unique, so we generate the whole collection $\mathcal{S}^+_i$ and obtain the final state by taking the union (element-wise OR) over these transitions, $\hat{S} = \bigvee_{S \in \mathcal{S}^+_i} S$. If we set the margin $\tau = 0$ and replace the cost function with the loss function, this optimization becomes simple (assuming that acquiring more features does not increase the loss): the new state is just the union of the transitions at the sub-components that are currently incorrect. Finding the parsimonious pseudo-label is computationally efficient and empirically performs similarly to the trajectory-based pseudo-label.
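The trajectory search admits a compact implementation; a minimal sketch, assuming a `cost` callable that evaluates the modified loss $C(X_i, Y_i, S)$ for the example at hand (including the inference call it implies):

```python
import numpy as np

def trajectory_pseudo_label(K, n_comp, cost):
    """Greedy trajectory search approximating Eqn. (4).
    cost(S) -> float evaluates the modified loss C(X, Y, S)."""
    S = np.zeros((K, n_comp), dtype=bool)       # S^0: nothing acquired yet
    best_S, best_cost = S.copy(), cost(S)
    while not S.all():                          # until the all-ones state
        # 1-step transitions: flip exactly one unacquired entry to 1.
        moves = [(k, c) for k in range(K) for c in range(n_comp) if not S[k, c]]
        move_costs = []
        for k, c in moves:
            S[k, c] = True
            move_costs.append(cost(S))
            S[k, c] = False
        k, c = moves[int(np.argmin(move_costs))]
        S[k, c] = True                          # commit the cheapest move
        if min(move_costs) < best_cost:         # track the trajectory minimum
            best_cost, best_S = min(move_costs), S.copy()
    return best_S                               # pseudo-label along trajectory
```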
Choosing the pseudo-label requires knowledge of the budget $B$ in order to set the cost trade-off parameter $\lambda$. If the budget is unspecified or varies over time, a system capable of adapting to changing budget demands is necessary. To handle this scenario, we propose an anytime system in the next section.

### Anytime Structured Prediction

In many applications, the budget constraint is unknown a priori or varies from example to example due to changing resource availability, so an expected-budget system as in the previous section is not feasible. We instead consider the problem of learning an anytime system (Grubb and Bagnell 2012). In this setting, a single system is designed such that when a new example arrives at test time, features are acquired until an arbitrary budget constraint (which may vary across examples) is met for that particular example. Note that an anytime system is a special case of the expected-budget-constrained system: instead of an expected budget, a hard per-example budget is specified at test time, and a single system is applied for all feasible budgets, as opposed to learning a unique system for each budget constraint.

We model the anytime learning problem as sequential state selection. The goal is to select a trajectory of states, starting from an initial state $S^0 = 0^{K \times |\mathcal{C}|}$ in which all components use only features of negligible cost. To select this trajectory, we define policy functions $\pi^1, \ldots, \pi^T$, where $\pi^t : \mathcal{X} \times \mathcal{S} \to \mathcal{S}$ is a function that maps a set of structured features $X$ and a current state $S$ to a new state $S'$. The sequential selection system is then defined by the policy functions $\pi^1, \ldots, \pi^T$: for an example $X$, they produce a trajectory of states $S^1(X), \ldots, S^T(X)$ defined by
$$S^t(X) = \pi^t\big(X, S^{t-1}(X)\big), \qquad S^0(X) = S^0.$$
Our goal is to learn a system with small expected loss at any time $t \in [0, T]$. Formally, we define this as the average modified loss of the system over the trajectory of states:
$$\pi^{1*}, \ldots, \pi^{T*} = \arg\min_{\pi^1, \ldots, \pi^T \in \Pi} \frac{1}{T}\, \mathbb{E}_{\mathcal{D}}\left[\sum_{t=1}^{T} C\big(X, Y, S^t(X)\big)\right], \tag{6}$$
where $\Pi$ is a user-specified family of functions. The problem of learning the policy functions is highly coupled, owing to the dependence of the state trajectory on all of them. We therefore propose a greedy approximation that sequentially learns policy functions $\pi^1, \ldots, \pi^T$ minimizing the immediate modified loss,
$$\pi^t = \arg\min_{\pi \in \Pi} \mathbb{E}_{\mathcal{D}}\left[C\big(X, Y, S^t(X)\big)\right] \tag{7}$$
for $t \in \{1, \ldots, T\}$. Note that the $\pi^t$ selected in (7) does not take into account its future effect on the loss in (6); we consider it a greedy approximation because it is chosen to minimize the immediate loss at time $t$. As in Thm. 1, learning the sequence of policy functions $\pi^1, \ldots, \pi^T$ can be viewed as a weighted structured learning problem.

**Theorem 3.** *The optimization problem in (7) is equivalent to solving an importance-weighted structured learning problem with an indicator risk of the form*
$$\sum_{i=1}^{n} \sum_{s \in \mathcal{S}(S^{t-1}(X_i))} W(X_i, Y_i, s)\, \mathbb{1}_{\pi(X_i, S^{t-1}(X_i)) \neq s},$$
*where the weight is defined as*
$$W(X_i, Y_i, s) = \max_{s' \in \mathcal{S}(S^{t-1}(X_i))} C(X_i, Y_i, s') - C(X_i, Y_i, s).$$

This is an importance-weighted structured learning problem in which each state $s \in \mathcal{S}(S^{t-1}(X_i))$ defines a pseudo-label for example $X_i$ with associated importance $\max_{s'} C(X_i, Y_i, s') - C(X_i, Y_i, s)$. Theorem 3 thus reduces the problem of learning a policy to an importance-weighted structured learning problem. Replacing the indicators with upper-bounding convex surrogate functions yields a convex minimization problem for learning the policies $\pi^1, \ldots, \pi^T$; in particular, a hinge-loss surrogate converts the problem into the commonly used structural SVM.

**Algorithm 1: Anytime Policy Learning**
```
input:  training set {(X_i, Y_i)}, i = 1, ..., n
set S^0_i = 0 for all i; t = 1
while t <= T do
    train pi^t according to Thm. 3
    for i in [n]: update states S^t_i = pi^t(X_i, S^{t-1}_i)
    t = t + 1
return pi = {pi^1, ..., pi^T}
```

**Algorithm 2: Anytime Structured Prediction**
```
input:  policies pi^1, ..., pi^T, example X, budget B
set S^0 = 0, t = 0
while t < T and c(S^t) < B do
    S^{t+1} = pi^t(X, S^t); t = t + 1
return y = F(X, S^t)
```
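A rough Python rendering of Algorithm 2 (our sketch; `policies`, `predict`, `eval_cost`, and `init_state` are stand-ins for the learned policies, the structured predictor $F$, the cost $c(\cdot)$, and the all-zeros state):

```python
def anytime_predict(X, policies, predict, eval_cost, budget, init_state):
    """Algorithm 2 sketch: acquire features until the budget B is met."""
    S = init_state                  # S^0: the all-zeros K x |C| state
    for pi_t in policies:           # t = 1, ..., T
        if eval_cost(S) >= budget:  # stop once c(S^t) reaches the budget
            break
        S = pi_t(X, S)              # S^t = pi^t(X, S^{t-1})
    return predict(X, S)            # y = F(X, S^t)
```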
Experimental results show significant cost savings from applying this sequential policy. The training procedure is presented in Algorithm 1. At time $t = 0$, the policy $\pi^1$ is trained to minimize the immediate loss. Given this policy, the states of the examples at time $t = 1$ are fixed, and $\pi^2$ is trained to minimize the immediate loss given these states. The algorithm continues learning policies until every feature of every example has been acquired. At test time, the system sequentially applies the trained policy functions until the specified budget is reached, as shown in Algorithm 2.

## Related Work

Multi-class prediction with a test-time budget has received significant attention (see, e.g., (Viola and Jones 2001; Chen et al. 2012; Busa-Fekete, Benbouzid, and Kégl 2012; Karayev, Fritz, and Darrell 2013; Xu et al. 2013; Trapeznikov and Saligrama 2013; Kusner et al. 2014; Wang et al. 2014; Wang, Trapeznikov, and Saligrama 2014)). Fundamentally, multi-class classification approaches cannot be directly applied to structured settings for two reasons. (1) *Structured feature selection policy:* unlike multi-class prediction, a structured setting has many parts with associated features and costs for each part, which often requires a coupled, part-by-part feature selection policy applied to varying structures. (2) *Structured inference costs:* in contrast to multi-class prediction, structured prediction requires solving a constrained optimization problem at test time, which is often computationally expensive and must be taken into account.

Strubell et al. (2015) improve the speed of a parser that operates on search-based structured prediction models, where joint prediction is decomposed into a sequence of decisions. In such cases, resource-constrained multi-class approaches can be applied; however, this reduction only applies to search-based models, which are fundamentally different from the graph-based models we discuss (with different types of theoretical guarantees and use cases). Applying their policy to graphical models would require repeated inference, dramatically increasing the computational cost when inference is slow.² Similar observations apply to (Weiss, Sapp, and Taskar 2013; Weiss and Taskar 2013), who present a scheme for adaptive feature selection assuming the computational costs of policy execution and inference are negligible. Their approach uses reinforcement learning and requires inference at each step of the policy to estimate rewards. For complex tasks, repeatedly performing inference can negate any computational gains from adaptive feature selection (see Fig. 3 in (Weiss and Taskar 2013)).

He et al. (2013) use imitation learning to adaptively select features for dependency parsing. Their approach can be viewed as an approximation of Eqn. (5) with a parsimonious search. Although their policy avoids performing inference to estimate rewards, multiple inference calls are required for each instance due to the design of the action space. The overhead is avoided by exploiting the specific inference structure (maximum spanning tree over a fully connected graph), and it is unclear whether the approach generalizes.

Methods to increase the speed of inference (prediction given the part responses) have also been proposed (Weiss and Taskar 2010; Shi, Steinhardt, and Liang 2015). These approaches can be incorporated into ours to further reduce computational cost and are therefore complementary.
More focused research has improved the speed of individual algorithms, such as object detection using deformable part models (Felzenszwalb et al. 2010; Zhu et al. 2014) and dependency parsing (He, Daumé III, and Eisner 2013; Strubell et al. 2015). These methods are specialized: they fail to generalize to varying graph sizes and/or structures and rely on problem- and algorithm-specific properties. Adaptive feature approaches have also been designed to improve accuracy, including easy-first decoding strategies (Goldberg and Elhadad 2010; Stoyanov and Eisner 2012); however, these methods focus on performance rather than computational cost.

²The equivalent policy of (Strubell et al. 2015) applied to our inference algorithm is marked as the myopic policy in our experiments. Due to the high cost of repeated inference, the resulting policy is computationally intensive.

*Figure 2: An example word from the OCR test dataset. The word is initially incorrectly identified due to degradation in the letters "u" and "n". The letter classification accuracy increases after the policy acquires the HOG features at strategic positions.*

## Experiments

In this section, we demonstrate the effectiveness of the proposed algorithm on two structured prediction tasks in different domains: dependency parsing and OCR. We report results for both the anytime and the expected-budget policies, referring to the latter as the one-shot policy.

**Cost computation:** At a high level, policies for resource-constrained structured prediction must manage and trade off three resources: feature acquisition costs, intermediate inference costs, and the overhead cost of the policy that decides between feature acquisition and inference. Some methods described earlier account for feature costs but not for inference and overhead costs. Other methods incorporate inference into their policy (meta-features) for selecting new features but do not account for the resulting policy overhead. Our approach poses policy optimization as a structured learning problem and jointly optimizes all three resources.

**Comparisons to existing methods:** We compare our system to the Q-learning approach of (Weiss and Taskar 2013) and to two baselines: a uniform policy and a myopic policy. The uniform policy takes random part-level actions; it helps show that the performance of our policy comes not from removing redundant features but from their clever allocation among samples. As a second baseline, we adapt the myopic policy of (Trapeznikov and Saligrama 2013) to the structured prediction case. The myopic policy first runs the structured predictor on all cheap features, then examines the total confidence of the classifier normalized by the sample size (e.g., sentence length); if the confidence is below a threshold, it acquires expensive features for all positions (sketched below). Finally, we compare against the Q-learning method proposed by (Weiss and Taskar 2013), which requires global features for structures of varying size. From here on, we refer to features that require access to more than one part as complex features and to part-level features as simple features. In their case, confidence feedback from the structured predictor induces additional inference overhead for the policy; moreover, it is not straightforward to apply this approach to part-by-part feature selection on structures of varying size.
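The myopic baseline is simple enough to sketch directly (our rendering; the `predict_cheap`, `predict_full`, and `confidence` callables and the threshold are illustrative stand-ins):

```python
def myopic_predict(X, predict_cheap, predict_full, confidence, threshold):
    """Myopic baseline: predict with cheap features; if the normalized
    confidence is low, acquire expensive features for all positions."""
    y_cheap, score = predict_cheap(X)            # inference on cheap features
    n_parts = len(y_cheap)
    if confidence(score, n_parts) >= threshold:  # e.g., score / sentence length
        return y_cheap
    return predict_full(X)                       # second, full inference pass
```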
**Policy overhead:** Recall that policy evaluation must itself be factored into the test-time cost. Since our policy solves a structured inference problem, this overhead can be high when rich features and structure are used. Our approach allows maximum flexibility in choosing the trade-off between policy overhead and performance. Experiments show that complex features do benefit the policy, but simple features (e.g., without transition features) perform better when inference time and feature costs are comparable and the additional overhead is unwanted. In our OCR and dependency parsing experiments, we use simple features and model the policy with a weighted SVM. One-shot results are obtained by sweeping $\lambda$ in Equation (1); anytime results show the average performance when all test examples terminate at a particular budget. We discuss implementation details in the supplementary material.

*Figure 3: Comparison of our one-shot policy (red) with the uniform strategy (black) and the policy of (Weiss and Taskar 2013) on the OCR dataset. We obtained a top accuracy of 93%, which is not shown in the plots. Although the rate of accuracy gain is large for the policy that utilizes complex features, the policies that utilize simple features achieve lower absolute run-time in the low-budget region, due to the overhead of the additional inference that complex features require.*

### Optical Character Recognition

We tested our algorithm on a sequence-labeling problem, the OCR dataset (Taskar, Guestrin, and Koller 2003), composed of 6,877 handwritten words in which each word is represented as a sequence of 16×8 letter images. We use a linear-chain Markov model and, similar to the setup in (Weiss, Sapp, and Taskar 2013; Wang et al. 2014), use raw pixel values and HOG features with a 3×3 cell size as our feature templates. We split the data so that 90% is used for training and 10% for testing.

Fig. 3 shows the average letter accuracy versus total running time. Note that (Weiss and Taskar 2013) cannot operate at the part-by-part level when the graph structure varies. We see that clever part-by-part selection has a significant advantage over uniform feature templates. The one-shot policy performs better than the anytime policy because it has more information (the test-time budget) during training and is less constrained (samples can use varying budgets, whereas in the anytime case every sample uses the same budget). Finally, Fig. 2 shows the behavior of the anytime policy on an individual example: significant gains in accuracy are made in the first several steps by correctly identifying the noisy letters.

The significant speedup of our approach on OCR (and on dependency parsing) can be partially attributed to the fact that the feature acquisition cost is relatively close to the inference cost; methods such as (Weiss and Taskar 2013) that require inference feedback on acquired features can therefore perform poorly.
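As background for the OCR setup above, decoding the linear-chain model reduces to standard Viterbi; a minimal sketch (ours, not the paper's code), where the per-letter unary scores would come from whichever feature templates the state $S$ has enabled:

```python
import numpy as np

def viterbi(unary, transition):
    """Decode a linear chain: unary is a (T, L) array of per-letter scores,
    transition is an (L, L) array of label-pair scores. Returns the best path."""
    T, L = unary.shape
    score = unary[0].copy()
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transition + unary[t][None, :]
        back[t] = np.argmax(cand, axis=0)   # best previous label per label
        score = np.max(cand, axis=0)
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):           # backtrack to recover the labels
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```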
*Figure 4: Performance of various adaptive policies for varying budget levels (dependency-tree accuracy vs. total execution time), compared to uniform strategies at the word and sentence levels and to the myopic policy, on section 23 of the PTB dataset.*

### Dependency Parsing

We follow the setting of (He, Daumé III, and Eisner 2013) and conduct experiments on the English Penn Treebank (PTB) corpus (Marcus, Marcinkiewicz, and Santorini 1993). All algorithms are implemented on top of the graph-based dependency parser (McDonald et al. 2005) in the Illinois-SL library (Chang et al. 2015), whose code is optimized for speed. Two sets of feature templates are considered for the parser.³ The first ($\psi_{\text{Full}}$) considers the part-of-speech (POS) tags and lexicons of $x_i$, $x_j$, and their surrounding words (see (McDonald et al. 2005)); the other ($\psi_{\text{POS}}$) considers only the POS features. The policy assigns one of these two feature templates to each word in the sentence, such that all directed edges $(x_i, x_j)$ associated with the word $x_i$ share the same feature template. Extracting features and computing edge scores takes 165 μs per word with $\psi_{\text{POS}}$ and 275 μs per word with $\psi_{\text{Full}}$, while decoding with the Chu-Liu/Edmonds algorithm takes 75 μs per word, supporting our hypothesis that feature extraction makes up a significant portion of the total running time even though inference time is not negligible. Due to space limits, we present further details of the experimental setting in the appendix.

³Complex features often contribute only small performance improvements. Adding redundant features can easily yield arbitrarily large speedups, and comparing the speedups of systems at different accuracy levels is not meaningful (see Fig. 3 in (He, Daumé III, and Eisner 2013)). In addition, greedy-style parsers such as (Strubell et al. 2015) may be faster by nature. On the other hand, several recently developed neural dependency parsers achieve higher accuracy but are slower (Andor et al. 2016). Discussing different architectures is outside the scope of this paper.

Fig. 4 shows the test performance (unlabeled attachment accuracy) against inference time. All one-shot policies perform similarly, and the anytime policy lies below the one-shot policy at all budget levels; as discussed in the Anytime Structured Prediction section, the anytime policy is more constrained in that it must meet a fixed budget for every example. The naive myopic policy performs worse than uniform, since it must run inference twice on low-confidence samples, adding approximately 4.5 seconds of extra time over the full test set. The effect of the importance weights on the greedy policy appears small; we hypothesize that the functional complexity of the policy is the limiting factor.

When we apply the length-dictionary filtering heuristic of (He, Daumé III, and Eisner 2013; Rush and Petrov 2012), our parser achieves 89.7% UAS on PTB section 23 with an overall running time of merely 7.5 seconds (I/O excluded; 10 s with I/O), a 2.9× total speedup at a loss of only 1% UAS compared to the baseline.⁴ This significant speedup over an already efficient implementation is remarkable.

⁴This heuristic only works for parsing; we therefore exclude it from Figure 4, as it does not reflect the performance of the policies in general. In contrast, the baseline system of (He, Daumé III, and Eisner 2013) is about three times slower than ours when operating at 90% accuracy: Figure 3 in (He, Daumé III, and Eisner 2013) shows their final system taking about 20 s. We acknowledge that (He, Daumé III, and Eisner 2013) use different features, policy settings, and hardware, so these numbers may not be directly comparable.

Fig. 5 shows the distribution of depths for the words that use expensive versus cheap features in the ground-truth dependency tree. This result is particularly informative because a uniform feature selection policy would distribute the features uniformly among all bins; the result thus clearly shows the advantage of a part-by-part selection policy.

*Figure 5: Distribution of parse-tree depth for words that use cheap (green) or expensive (orange) features under the anytime policy; time increases from left to right. Each group of columns shows the distribution of depths from 0 (root) to 7. The policy concentrates on acquiring features for words at lower depths. A sentence example shows the same effect: it is easy to identify the parents of the adjectives and the determiner, but additional features (orange) are required for the root (verb), the subject, and the object.*
## Conclusion

We presented a novel method for reducing feature acquisition costs in structured learning while maintaining prediction performance. Our method jointly incorporates feature costs and prediction error in its objective. We proposed formulations for two test-time settings (one-shot and anytime) and showed that both reduce to instances of structured learning problems. Our policy sheds light on the utility of simple and complex features in different regimes (low budget vs. high budget), giving the user the flexibility to seek suitable trade-offs. Our method can utilize any structured inference algorithm and any available feature choices. Experimentally, we showed significant speedups on problems that require structured inference, where both feature acquisition cost and structured inference cost must be taken into account.

**Acknowledgment.** This material is based upon work supported in part by NSF grants CCF-1320566, CNS-1330008, and CCF-1527618, DHS grant 2013-ST-061-ED0001, ONR grant 50202168, and US AF contract FA8650-14-C-1728.

## References

Andor, D.; Alberti, C.; Weiss, D.; Severyn, A.; Presta, A.; Ganchev, K.; Petrov, S.; and Collins, M. 2016. Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042.

Busa-Fekete, R.; Benbouzid, D.; and Kégl, B. 2012. Fast classification using sparse decision DAGs. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), 951–958. Omnipress.

Chang, K.-W.; Upadhyay, S.; Chang, M.-W.; Srikumar, V.; and Roth, D. 2015. IllinoisSL: A Java library for structured prediction. arXiv preprint arXiv:1509.07179.

Chen, M.; Weinberger, K. Q.; Chapelle, O.; Kedem, D.; and Xu, Z. 2012. Classifier cascade for minimizing feature evaluation cost. In International Conference on Artificial Intelligence and Statistics, 218–226.

Felzenszwalb, P. F.; Girshick, R. B.; McAllester, D.; and Ramanan, D. 2010. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9):1627–1645.

Goldberg, Y., and Elhadad, M. 2010. An efficient algorithm for easy-first non-directional dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 742–750. Association for Computational Linguistics.

Grubb, A., and Bagnell, D. 2012. SpeedBoost: Anytime prediction with uniform near-optimality. In International Conference on Artificial Intelligence and Statistics, 458–466.

He, H.; Daumé III, H.; and Eisner, J. 2013. Dynamic feature selection for dependency parsing. In EMNLP, 1455–1464.

Jordan, M. I.; Ghahramani, Z.; Jaakkola, T. S.; and Saul, L. K. 1999. An introduction to variational methods for graphical models. Machine Learning 37(2):183–233.

Karayev, S.; Fritz, M.; and Darrell, T. 2013. Dynamic feature selection for classification on a budget. In International Conference on Machine Learning (ICML) Workshop on Prediction with Sequential Models.
Kusner, M. J.; Chen, W.; Zhou, Q.; Xu, Z. E.; Weinberger, K. Q.; and Chen, Y. 2014. Feature-cost sensitive learning with submodular trees of classifiers. In AAAI, 1939–1945.

Marcus, M. P.; Marcinkiewicz, M. A.; and Santorini, B. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2):313–330.

McDonald, R.; Pereira, F.; Ribarov, K.; and Hajič, J. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 523–530. Association for Computational Linguistics.

Rush, A., and Petrov, S. 2012. Vine pruning for efficient multi-pass dependency parsing. In NAACL.

Samdani, R., and Roth, D. 2012. Efficient decomposed learning for structured prediction. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), 217–224.

Shi, T.; Steinhardt, J.; and Liang, P. 2015. Learning where to sample in structured prediction. In AISTATS.

Stoyanov, V., and Eisner, J. 2012. Easy-first coreference resolution. In COLING, 2519–2534.

Strubell, E.; Vilnis, L.; Silverstein, K.; and McCallum, A. 2015. Learning dynamic feature selection for fast sequential prediction. arXiv preprint arXiv:1505.06169.

Taskar, B.; Guestrin, C.; and Koller, D. 2003. Max-margin Markov networks. In Advances in Neural Information Processing Systems.

Trapeznikov, K., and Saligrama, V. 2013. Supervised sequential classification under budget constraints. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, 581–589.

Viola, P., and Jones, M. 2001. Robust real-time object detection. International Journal of Computer Vision 4.

Wang, J.; Bolukbasi, T.; Trapeznikov, K.; and Saligrama, V. 2014. Model selection by linear programming. In Computer Vision – ECCV 2014, 647–662. Springer.

Wang, J.; Trapeznikov, K.; and Saligrama, V. 2014. An LP for sequential learning under budgets. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, 987–995.

Weiss, D., and Taskar, B. 2010. Structured prediction cascades. In International Conference on Artificial Intelligence and Statistics, 916–923.

Weiss, D. J., and Taskar, B. 2013. Learning adaptive value of information for structured prediction. In Advances in Neural Information Processing Systems, 953–961.

Weiss, D.; Sapp, B.; and Taskar, B. 2013. Dynamic structured model selection. In Proceedings of the IEEE International Conference on Computer Vision, 2656–2663.

Xu, Z.; Kusner, M.; Chen, M.; and Weinberger, K. Q. 2013. Cost-sensitive tree of classifiers. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 133–141.

Zhu, M.; Atanasov, N.; Pappas, G. J.; and Daniilidis, K. 2014. Active deformable part models inference. In Computer Vision – ECCV 2014, 281–296. Springer.