{"title": "Efficient Learning by Directed Acyclic Graph For Resource Constrained Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 2152, "page_last": 2160, "abstract": "We study the problem of reducing test-time acquisition costs in classification systems. Our goal is to learn decision rules that adaptively select sensors for each example as necessary to make a confident prediction. We model our system as a directed acyclic graph (DAG) where internal nodes correspond to sensor subsets and decision functions at each node choose whether to acquire a new sensor or classify using the available measurements. This problem can be naturally posed as an empirical risk minimization over training data. Rather than jointly optimizing such a highly coupled and non-convex problem over all decision nodes, we propose an efficient algorithm motivated by dynamic programming. We learn node policies in the DAG by reducing the global objective to a series of cost sensitive learning problems. Our approach is computationally efficient and has proven guarantees of convergence to the optimal system for a fixed architecture. In addition, we present an extension to map other budgeted learning problems with large number of sensors to our DAG architecture and demonstrate empirical performance exceeding state-of-the-art algorithms for data composed of both few and many sensors.", "full_text": "Ef\ufb01cient Learning by Directed Acyclic Graph For\n\nResource Constrained Prediction\n\nJoseph Wang\n\nDepartment of Electrical\n& Computer Engineering\n\nBoston University,\nBoston, MA 02215\njoewang@bu.edu\n\nKirill Trapeznikov\n\nSystems & Technology Research\n\nWoburn, MA 01801\n\nkirill.trapeznikov@\n\nstresearch.com\n\nAbstract\n\nVenkatesh Saligrama\nDepartment of Electrical\n& Computer Engineering\n\nBoston University,\nBoston, MA 02215\n\nsrv@bu.edu\n\nWe study the problem of reducing test-time acquisition costs in classi\ufb01cation sys-\ntems. 
Our goal is to learn decision rules that adaptively select sensors for each\nexample as necessary to make a con\ufb01dent prediction. We model our system as a\ndirected acyclic graph (DAG) where internal nodes correspond to sensor subsets\nand decision functions at each node choose whether to acquire a new sensor or\nclassify using the available measurements. This problem can be posed as an em-\npirical risk minimization over training data. Rather than jointly optimizing such\na highly coupled and non-convex problem over all decision nodes, we propose an\nef\ufb01cient algorithm motivated by dynamic programming. We learn node policies\nin the DAG by reducing the global objective to a series of cost sensitive learning\nproblems. Our approach is computationally ef\ufb01cient and has proven guarantees of\nconvergence to the optimal system for a \ufb01xed architecture. In addition, we present\nan extension to map other budgeted learning problems with large number of sen-\nsors to our DAG architecture and demonstrate empirical performance exceeding\nstate-of-the-art algorithms for data composed of both few and many sensors.\n\n1\n\nIntroduction\n\nMany scenarios involve classi\ufb01cation systems constrained by measurement acquisition budget. In\nthis setting, a collection of sensor modalities with varying costs are available to the decision system.\nOur goal is to learn adaptive decision rules from labeled training data that, when presented with an\nunseen example, would select the most informative and cost-effective acquisition strategy for this\nexample. In contrast, non-adaptive methods [24] attempt to identify a common sparse subset of\nsensors that can work well for all data. Our goal is an adaptive method that can classify typical cases\nusing inexpensive sensors while using expensive sensors only for atypical cases.\nWe propose an adaptive sensor acquisition system learned using labeled training examples. 
The system, modeled as a directed acyclic graph (DAG), is composed of internal nodes, which contain decision functions, and a single sink node (the only node with no outgoing edges), representing the terminal action of stopping and classifying (SC). At each internal node, a decision function routes an example along one of the outgoing edges. Sending an example to another internal node represents acquisition of a previously unacquired sensor, whereas sending an example to the sink node indicates that the example should be classified using the currently acquired set of sensors. The goal is to learn these decision functions such that the expected error of the system is minimized subject to an expected budget constraint.

First, we consider the case where the number of sensors available is small (as in [19, 23, 20]), though the dimensionality of data acquired by each sensor may be large (such as an image taken in different modalities). In this scenario, we construct a DAG that allows for sensors to be acquired in any order and classification to occur with any set of sensors. In this regime, we propose a novel algorithm to learn node decisions in the DAG by emulating dynamic programming (DP). In our approach, we decouple a complex sequential decision problem into a series of tractable cost-sensitive learning subproblems. Cost-sensitive learning (CSL) generalizes multi-decision learning by allowing decision costs to be data dependent [2]. Such a reduction enables us to employ computationally efficient CSL algorithms for iteratively learning node functions in the DAG. 
In our theoretical analysis, we show that, given a fixed DAG architecture, the policy risk learned by our algorithm converges to the Bayes risk as the size of the training set grows.

Next, we extend our formulation to the case where a large number of sensors exist, but the number of distinct sensor subsets that are necessary for classification is small (as in [25, 11], where the depth of the trees is fixed to 5). For this regime, we present an efficient subset selection algorithm based on sub-modular approximation. We treat each sensor subset as a new "sensor," construct a DAG over unions of these subsets, and apply our DP algorithm. Empirically, we show that our approach outperforms state-of-the-art methods in both small and large scale settings.

Related Work: There is an extensive literature on adaptive methods for sensor selection for reducing test-time costs. It arguably originated with detection cascades (see [26, 4] and references therein), a popular method for reducing computation cost in object detection for cases with highly skewed class imbalance and generic features. Computationally cheap features are used at first to filter out negative examples, and more expensive features are used in later stages.

Our technical approach is closely related to Trapeznikov et al. [19] and Wang et al. [23, 20]. Like us, they formulate an ERM problem, generalize detection cascades to classifier cascades and trees, and handle balanced and/or multi-class scenarios. Trapeznikov et al. [19] propose a similar training scheme for the case of cascades; however, they restrict their training to cascades and simple decision functions, which require alternating optimization to learn. Alternatively, Wang et al. [21, 22, 23, 20] attempt to jointly solve the decision learning problem by formulating a linear upper-bounding surrogate, converting the problem into a linear program (LP).

Conceptually, our work is closely related to Xu et al. 
[25] and Kusner et al. [11], who introduce Cost-Sensitive Trees of Classifiers (CSTC) and Approximately Submodular Trees of Classifiers (ASTC), respectively, to reduce test-time costs. Like our paper, they propose a global ERM problem. They solve for the tree structure, internal decision rules, and leaf classifiers jointly using alternating minimization techniques. ASTC, a variation of CSTC, provides robust performance with significantly reduced training time via greedy approximation. Recently, Nan et al. [14] proposed random forests to efficiently learn budgeted systems using greedy approximation over large data sets.

The subject of this paper is broadly related to other adaptive methods in the literature. Generative methods [17, 8, 9, 6] pose the problem as a POMDP, learn conditional probability models, and myopically select features based on the information gain of unknown features. MDP-based methods [5, 10, 7, 3] encode current observations as state, unused features as action space, and formulate various reward functions to account for classification error and costs. He et al. [7] apply imitation learning of a greedy policy with a single classification step as actions. Dulac-Arnold et al. [5] and Karayev et al. [10] apply reinforcement learning to solve this MDP. Benbouzid et al. [3] propose classifier cascades with an additional skip action within an MDP framework. Nan et al. [15] consider a nearest neighbor approach to feature selection, with confidence driven by margin magnitude.

Figure 1: A simple example of a sensor selection DAG for a three sensor system. At each state, represented by a binary vector indicating measured sensors, a policy π chooses between either adding a new sensor or stopping and classifying. 
Note that the state s_SC has been repeated for simplicity.

2 Adaptive Sensor Acquisition by DAG

In this section, we present our adaptive sensor acquisition DAG, which, during test-time, sequentially decides which sensors should be acquired for every new example entering the system. Before formally describing the system and our learning approach, we first provide a simple illustration for the 3 sensor DAG shown in Fig. 1. The state indicating acquired sensors is represented by a binary vector, with a 0 indicating that a sensor measurement has not been acquired and a 1 representing an acquisition.

Consider a new example that enters the system. Initially, it has a state of [0, 0, 0]^T (as do all samples during test-time) since no sensors have been acquired. It is routed to the policy function π_0, which makes a decision to measure one of the three sensors or to stop and classify. Let us assume that the function π_0 routes the example to the state [1, 0, 0]^T, indicating that the first sensor is acquired. At this node, the function π_1 has to decide whether to acquire the second sensor, acquire the third, or classify using only the first. If π_1 chooses to stop and classify, then this example will be classified using only the first sensor.

Such a decision process is performed for every new example. The system adaptively collects sensors until the policy chooses to stop and classify (we assume that when all sensors have been collected the decision function has no choice but to stop and classify, as shown for π_7 in Fig. 1).

Problem Formulation: A data instance, x ∈ X, consists of M sensor measurements, x = {x_1, x_2, ..., x_M}, and belongs to one of L classes indicated by its label y ∈ Y = {1, 2, ..., L}. Each sensor measurement, x_m, is not necessarily a scalar but may instead be multi-dimensional. Let the pair (x, y) be distributed according to an unknown joint distribution D. 
Additionally, associated\nwith each sensor measurement xm is an acquisition cost, cm.\nTo model the acquisition process, we de\ufb01ne a state space S = {s1, . . . , sK, sSC}. The states\n{s1, . . . , sK} represent subsets of sensors, and the stop-and-classify state sSC represents the action\nof stopping and classifying with a current subset. Let Xs correspond to the space of sensor mea-\nsurements in subset s. We assume that the state space includes all possible subsets1, K = 2M .\nFor example in Fig. 1, the system contains all subsets of 3 sensors. We also introduce the state\ntransition function, T : S \u2192 S, that de\ufb01nes a set of actions that can be taken from the current\nstate. A transition from the current sensor subset to a new subset corresponds to an acquisition of\nnew sensor measurements. A transition to the state sSC corresponds to stopping and classifying\nusing the available information. This terminal state, sSC, has access to a classi\ufb01er bank used to\npredict the label of an example. Since classi\ufb01cation has to operate on any sensor subset, there is one\nclassi\ufb01er for every sk: fs1 , . . . , fsK such that fs : Xs \u2192 Y. We assume the classi\ufb01er bank is given\nand pre-trained. Practically, the classi\ufb01ers can be either unique for each subset or a missing feature\n(i.e. sensor) classi\ufb01cation system as in [13]. We overload notation and use node, subset of sensors,\nand path leading up to that subset on the DAG interchangeably. In particular we let S denote the\ncollection of subsets of nodes. Each subset is associated with a node on the DAG graph. 
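To make the state space S and transition function T concrete, the construction above can be sketched in a few lines (a minimal illustration, not the authors' implementation; the frozenset states, the SC sentinel, and all function names are hypothetical):

```python
from itertools import combinations

SC = "stop-and-classify"  # terminal state s_SC

def make_states(M):
    """Enumerate all K = 2^M sensor subsets as frozensets of sensor indices."""
    return [frozenset(c) for r in range(M + 1) for c in combinations(range(M), r)]

def transitions(state, M):
    """T(s): acquire any one unmeasured sensor, or stop and classify."""
    actions = [state | {m} for m in range(M) if m not in state]
    actions.append(SC)
    return actions

states = make_states(3)
assert len(states) == 2 ** 3                          # K = 8 subsets for M = 3
assert len(transitions(frozenset(), 3)) == 4          # 3 acquisitions + stop
assert transitions(frozenset({0, 1, 2}), 3) == [SC]   # full set: must stop
```

The last assertion mirrors the convention in Fig. 1 that once all sensors are acquired, the only remaining action is to stop and classify.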
We refer to each node as a state since it represents the "state-of-information" for an instance at that node.

We define the loss associated with classifying an example/label pair (x, y) using the sensors in s_j as

L_{s_j}(x, y) = \mathbf{1}[f_{s_j}(x) \neq y] + \sum_{k \in s_j} c_k .   (1)

Using this convention, the loss is the sum of the empirical risk associated with classifier f_{s_j} and the cost of the sensors in the subset s_j. The expected loss over the data is defined as

L_D(\pi) = \mathbb{E}_{(x,y) \sim D}\left[ L_{\pi(x)}(x, y) \right] .   (2)

Our goal is to find a policy which adaptively selects subsets for examples such that their average loss is minimized:

\min_{\pi \in \Pi} L_D(\pi) ,   (3)

where π : X → S is a policy selected from a family of policies Π and π(x) is the state selected by the policy π for example x. We denote the quantity L_D as the value of (3) when Π is the family of all measurable functions. L_D is the Bayes cost, representing the minimum possible cost for any function given the distribution of data. In practice, the distribution D is unknown, and instead we are given training examples (x_1, y_1), ..., (x_n, y_n) drawn i.i.d. from D. The problem becomes an empirical risk minimization:

\min_{\pi \in \Pi} \sum_{i=1}^{n} L_{\pi(x_i)}(x_i, y_i) .   (4)

Recall that our sensor acquisition system is represented as a DAG. Each node in the graph corresponds to a state (i.e. sensor subset) in S, and the state transition function, T(s_j), defines the outgoing edges from every node s_j. We refer to the entire edge set in the DAG as E.

¹While enumerating all possible combinations is feasible for small M, for large M this problem becomes intractable. We will overcome this limitation in Section 3 by applying a novel sensor selection algorithm. For now, we remain in the small M regime.
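The loss in Eq. (1), classification error plus the summed cost of the acquired sensors, can be sketched as follows (a toy illustration; the classifier bank and cost values are hypothetical):

```python
def subset_loss(x, y, subset, classifiers, costs):
    """Eq. (1): 0/1 error of f_s plus the total cost of the sensors in s."""
    err = 1.0 if classifiers[subset](x) != y else 0.0
    return err + sum(costs[k] for k in subset)

# Toy check with a single-sensor subset (hypothetical classifier bank):
costs = {0: 0.5, 1: 2.0}
clf = {frozenset({0}): lambda x: int(x[0] > 0)}

# correct prediction -> loss is the sensor cost only
assert subset_loss([3.0, -1.0], 1, frozenset({0}), clf, costs) == 0.5
# wrong prediction -> loss is 1 plus the sensor cost
assert subset_loss([-3.0, -1.0], 1, frozenset({0}), clf, costs) == 1.5
```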
In such a system, the policy π is parameterized by the set of decision functions π_1, ..., π_K at every node in the DAG. Each function, π_j : X → T(s_j), maps an example to a new state (node) from the set specified by the outgoing edges. Rather than directly minimizing the empirical risk in (4), we first define a step-wise cost associated with all edges (s_j, s_k) ∈ E:

C(x, y, s_j, s_k) = \begin{cases} \sum_{t \in s_k \setminus s_j} c_t & \text{if } s_k \neq s_{SC} \\ \mathbf{1}[f_{s_j}(x) \neq y] & \text{otherwise.} \end{cases}   (5)

C(·) is either the cost of acquiring new sensors or, if s_k = s_SC, the classification error induced by classifying with the current subset. Using this step-wise cost, we define the empirical loss of the system w.r.t. a path for an example x:

R(x, y, \pi_1, \ldots, \pi_K) = \sum_{(s_j, s_{j+1}) \in \text{path}(x, \pi_1, \ldots, \pi_K)} C(x, y, s_j, s_{j+1}) ,   (6)

where path(x, π_1, ..., π_K) is the path on the DAG induced by the policy functions π_1, ..., π_K for example x. The empirical minimization equivalent to (4) for our DAG system is a sample average over all example-specific path losses:

\pi_1^*, \ldots, \pi_K^* = \operatorname*{argmin}_{\pi_1, \ldots, \pi_K \in \Pi} \sum_{i=1}^{n} R(x_i, y_i, \pi_1, \ldots, \pi_K) .   (7)

Next, we present a reduction to learn the functions π_1, ..., π_K that minimize the loss in (7).

2.1 Learning Policies in a DAG

Learning the functions π_1, ..., π_K that minimize the cost in (7) is a highly coupled problem. 
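The step-wise cost of Eq. (5) and the induced path loss of Eq. (6) can be made concrete with a short sketch: walk the DAG under the node policies, accumulating acquisition costs and a terminal 0/1 error (all names and the toy two-state policy are hypothetical):

```python
SC = "stop-and-classify"  # terminal state s_SC

def step_cost(x, y, s_j, s_k, classifiers, costs):
    """Eq. (5): cost of newly acquired sensors, or 0/1 error when stopping."""
    if s_k is not SC:
        return sum(costs[t] for t in s_k - s_j)
    return 1.0 if classifiers[s_j](x) != y else 0.0

def path_risk(x, y, policies, classifiers, costs):
    """Eq. (6): total cost along the path the node policies induce for x."""
    state, total = frozenset(), 0.0
    while state is not SC:
        nxt = policies[state](x)   # pi_j routes x to a successor state
        total += step_cost(x, y, state, nxt, classifiers, costs)
        state = nxt
    return total

# Toy run: pi at {} acquires sensor 0; pi at {0} stops; classifier predicts 1.
costs = {0: 0.5}
clf = {frozenset({0}): lambda x: 1}
pols = {frozenset(): lambda x: frozenset({0}), frozenset({0}): lambda x: SC}
assert path_risk([0.0], 1, pols, clf, costs) == 0.5   # correct: cost only
assert path_risk([0.0], 0, pols, clf, costs) == 1.5   # wrong: cost + error
```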
Learning a decision function π_j is dependent on the other functions in two ways: (a) π_j is dependent on functions at nodes downstream (nodes to which a path exists from π_j), as these determine the cost of each action taken by π_j on an individual example (the cost-to-go), and (b) π_j is dependent on functions at nodes upstream (nodes from which a path exists to π_j), as these determine the distribution of examples that π_j acts on. Consider a policy π_j at a node corresponding to state s_j such that all outgoing edges from j lead to leaves. Also, we assume all examples pass through this node π_j (we are ignoring the effect of upstream dependence (b)). This yields the following important lemma:

Lemma 2.1. Given the assumptions above, the problem of minimizing the risk in (6) w.r.t. a single policy function, π_j, is equivalent to solving a k-class cost sensitive learning (CSL) problem.²

Proof. Consider the risk in (6) with π_j such that all outgoing edges from j lead to a leaf. Ignoring the effect of other policy functions upstream from j, the risk w.r.t. π_j is:

R(x, y, \pi_j) = \sum_{s_k \in T(s_j)} C(x, y, s_j, s_k) \mathbf{1}[\pi_j(x) = s_k] \;\rightarrow\; \min_{\pi_j \in \Pi} \sum_{i=1}^{n} R(x_i, y_i, \pi_j) .

Minimizing the risk over training examples yields the optimization problem on the right-hand side. This is equivalent to a CSL problem over the space of "labels" T(s_j) with costs given by the transition costs C(x, y, s_j, s_k).

In order to learn the policy functions π_1, ..., π_K, we propose Algorithm 1, which iteratively learns policy functions using Lemma 2.1. We solve the CSL problem by using a filter-tree scheme [2] for Learn, which constructs a tree of binary classifiers. Each binary classifier can be trained using regularized risk minimization. 
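A filter tree arranges the k actions in a tournament and trains an importance-weighted binary classifier at each internal match. The core reduction step, shown here for the two-action case, turns a cost vector into a binary label (the cheaper action) and an importance weight (the regret of choosing the other action). This is an illustrative sketch of the reduction, not the authors' code:

```python
def csl_to_weighted_binary(examples):
    """Two-action cost-sensitive learning reduces to importance-weighted
    binary classification: label = cheaper action, weight = cost gap."""
    reduced = []
    for x, w in examples:                 # w = (cost of action 0, cost of action 1)
        label = 0 if w[0] <= w[1] else 1  # cheaper action becomes the target
        weight = abs(w[0] - w[1])         # regret for choosing the other action
        reduced.append((x, label, weight))
    return reduced

data = [([0.2], (0.0, 1.0)), ([0.9], (3.0, 0.5)), ([0.5], (1.0, 1.0))]
out = csl_to_weighted_binary(data)
assert out[0] == ([0.2], 0, 1.0)
assert out[1] == ([0.9], 1, 2.5)
assert out[2][2] == 0.0   # equal costs: the example carries no weight
```

Any importance-weighted binary learner (e.g. regularized logistic regression) can then be trained on the reduced data; a full filter tree applies this step at every node of a tournament over the k actions.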
For concreteness, we define the Learn algorithm as

Learn((x_1, \vec{w}_1), \ldots, (x_n, \vec{w}_n)) \triangleq FilterTree((x_1, \vec{w}_1), \ldots, (x_n, \vec{w}_n)) ,   (8)

where the binary classifiers in the filter tree are trained using an appropriately regularized calibrated convex loss function. Note that multiple schemes exist that map the CSL problem to binary classification.

²We consider the k-class CSL problem formulated by Beygelzimer et al. [2], where an instance of the problem is defined by a distribution D over X × [0, ∞)^k, a space of features and associated costs for predicting each of the k labels for each realization of features. The goal is to learn a function which maps each element of X to a label {1, ..., k} s.t. the expected cost is minimized.

A single iteration of Algorithm 1 proceeds as follows: (1) A node j is chosen whose outgoing edges connect only to leaf nodes. (2) The costs associated with each connected leaf node are found. (3) The policy π_j is trained on the entire set of training data according to these costs by solving a CSL problem. (4) The costs associated with taking the action π_j are computed for each example, and the costs of moving to state j are updated. (5) Outgoing edges from node j are removed (making it a leaf node), and (6) disconnected nodes (that were previously connected to node j) are removed. The algorithm iterates through these steps until all edges have been removed. We denote the policy functions trained on the empirical data using Alg. 1 as π^n_1, ..., π^n_K.

Algorithm 1 Graph Reduce Algorithm
Input: Data: (x_i, y_i)^n_{i=1},
DAG: (nodes S, edges E, costs C(x_i, y_i, e), ∀e ∈ E),
CSL alg: Learn((x_1, \vec{w}_1), ..., (x_n, \vec{w}_n)) → π(·)
while graph S is NOT empty do
  (1) Choose a node, j ∈ S, s.t. all children of j are leaf nodes
  for example i ∈ {1, ..., n} do
    (2) Construct the weight vector \vec{w}_i of edge costs per action.
  end for
  (3) π_j ← Learn((x_1, \vec{w}_1), ..., (x_n, \vec{w}_n))
  (4) Evaluate π_j and update edge costs to node j:
      C(x_i, y_i, s_n, s_j) ← \vec{w}^j_i(π_j(x_i)) + C(x_i, y_i, s_n, s_j)
  (5) Remove all outgoing edges from node j in E
  (6) Remove all disconnected nodes from S.
end while
Output: Policy functions, π_1, ..., π_K

2.2 Analysis

Our goal is to show that the expected risk of the policy functions π_1, ..., π_K learned by Alg. 1 converges to the Bayes risk. We first state our main result:

Theorem 2.2. Alg. 1 is universally consistent, that is,

\lim_{n \rightarrow \infty} L_D(\pi_1^n, \ldots, \pi_K^n) = L_D ,   (9)

where π^n_1, ..., π^n_K are the policy functions learned using Alg. 1, which in turn uses Learn described by Eq. (8).

Alg. 1 emulates a dynamic program applied in an empirical setting. Policy functions are decoupled and trained from leaf to root, conditioned on the output of descendant nodes. To adapt to the empirical setting, we optimize at each stage over all examples in the training set. The key insight is the fact that universally consistent learners output optimal decisions over subsets of the space of data, that is, they are locally optimal. To illustrate this point, consider a standard classification problem. Let X′ ⊂ X be the support (or region) of examples induced by upstream deterministic decisions. 
The Bayes optimal classifiers w.r.t. the full space and the reduced support,

d^*(x) = \operatorname*{argmin}_{d} \mathbb{E}\left[ \mathbf{1}[d(x) \neq y] \mid x \right] \quad \text{and} \quad f^*(x) = \operatorname*{argmin}_{f} \mathbb{E}\left[ \mathbf{1}[f(x) \neq y] \mid x, \, x \in X' \subset X \right] ,

are equal on the reduced support: d^*(x) = f^*(x) ∀ x ∈ X′.

From this insight, we decouple learning problems while still training a system that converges to the Bayes risk. This can be achieved by training universally consistent CSL algorithms such as filter trees [2] that reduce the problem to binary classification. By learning consistent binary classifiers [1, 18], the risk of the cost-sensitive function can be shown to converge to the Bayes risk [2]. Proof of Theorem 2.2 is included in the Supplementary Material.

Computational Efficiency: Alg. 1 reduces the problem to solving a series of O(KM) binary classification problems, where K is the number of nodes in the DAG and M is the number of sensors. Finding each binary classifier is computationally efficient, reducing to a convex problem with O(n) variables. In contrast, nearly all previous approaches require solving a non-convex problem and resort to alternating optimization [25, 19] or greedy approximation [11]. Alternatively, convex surrogates proposed for the global problem [23, 20] require solving large convex programs with Θ(n) variables, even for simple linear decision functions. 
Furthermore, existing off-the-shelf algorithms cannot be applied to train these systems, often leading to less efficient implementation.

2.3 Generalization to Other Budgeted Learning Problems

Although we presented our algorithm in the context of supervised classification and a uniform linear sensor acquisition cost structure, the above framework holds for a wide range of problems. In particular, any loss-based learning problem can be solved using the proposed DAG approach by generalizing the cost function:

\tilde{C}(x, y, s_j, s_k) = \begin{cases} c(x, y, s_j, s_k) & \text{if } s_k \neq s_{SC} \\ D(x, y, s_j) & \text{otherwise,} \end{cases}   (10)

where c(x, y, s_j, s_k) is the cost of acquiring the sensors in s_k \ s_j for example (x, y) given the current state s_j, and D(x, y, s_j) is some loss associated with applying sensor subset s_j to example (x, y). This framework allows significantly more complex budgeted learning problems to be handled. For example, the sensor acquisition cost, c(x, y, s_j, s_k), can be object dependent and non-linear, such as acquisition costs that increase over time (which can arise in image retrieval problems, where users are less likely to wait as time increases). The cost D(x, y, s_j) can include alternative costs such as ℓ2 error in regression, precision error in ranking, or model error in structured learning. As in the supervised learning case, the learning functions and example labels do not need to be explicitly known. 
Instead, the system requires only empirical performance to be provided, allowing complex decision systems (such as humans) to be characterized, or systems to be learned where the classifiers and labels are sensitive information.

3 Adaptive Sensor Acquisition in High-Dimensions

So far, we have considered the case where the DAG system allows for any subset of sensors to be acquired; however, this is often computationally intractable, as the number of nodes in the graph grows exponentially with the number of sensors. In practice, these complete systems are only feasible for data generated from a small set of sensors (10 or fewer).

3.1 Learning Sensor Subsets

Figure 2: An example of a DAG system using the 3 sensor subsets shown on the bottom left. The new states are the union of these sensor subsets, with the system otherwise constructed in the same fashion as the small scale system.

Although constructing an exhaustive DAG for data with a large number of sensors is computationally intractable, in many cases this is unnecessary. Motivated by previous methods [6, 25, 11], we assume that the number of "active" nodes in the exhaustive graph is small, that is, most nodes are either not visited by any examples or all examples that visit them acquire the same next sensor. Equivalently, this can be viewed as the system needing only a small number of sensor subsets to classify all examples with low acquisition cost.

Rather than attempt to build the entire combinatorially sized graph, we instead use this assumption to first find these "active" subsets of sensors and construct a DAG to choose between unions of these subsets. The step of finding these sensor subsets can be viewed as a form of feature clustering, with the goal of grouping features that are jointly useful for classification. 
By doing so, the size of the DAG is reduced from exponential in the number of sensors, 2^M, to exponential in a much smaller, user-chosen number of subsets, 2^t. In experimental results, we limit t = 8, which allows diverse subsets of sensors to be found while preserving computational tractability and efficiency.

Our goal is to learn sensor subsets with high classification performance and low acquisition cost (empirically low cost as defined in (1)). Ideally, we would jointly learn the subsets which minimize the empirical risk of the entire system as defined in (4); however, this presents a computationally intractable problem due to the exponential search space. Rather than attempt to solve this difficult problem directly, we minimize classification error over a collection of sensor subsets σ_1, ..., σ_t subject to a cost constraint on the total number of sensors used. We decouple the problem from the policy learning problem by assuming that each example is classified by the best possible subset. For a constant sensor cost, the problem can be expressed as a set constraint problem:

\min_{\sigma_1, \ldots, \sigma_t} \frac{1}{N} \sum_{i=1}^{N} \left[ \min_{j \in \{1, \ldots, t\}} \mathbf{1}[f_{\sigma_j}(x_i) \neq y_i] \right] \quad \text{such that:} \quad \sum_{j=1}^{t} |\sigma_j| \leq \frac{B}{\delta} ,   (11)

where B is the total sensor budget over all sensor subsets and δ is the cost of a single sensor.

Although minimizing this loss is still computationally intractable, consider instead the equivalent problem of maximizing the "reward" (the event of a correct classification) of the subsets, defined as

G = \frac{1}{N} \sum_{i=1}^{N} \left[ \max_{j \in \{1, \ldots, t\}} \mathbf{1}[f_{\sigma_j}(x_i) = y_i] \right] \;\rightarrow\; \max_{\sigma_1, \ldots, \sigma_t} G(\sigma_1, \ldots, \sigma_t) \quad \text{such that:} \quad \sum_{j=1}^{t} |\sigma_j| \leq \frac{B}{\delta} .   (12)

This problem is related to the knapsack problem with a non-linear objective. Maximizing the reward in (12) is still a computationally intractable problem; however, the reward function is structured to allow for efficient approximation.

Lemma 3.1. The objective of the maximization in (12) is sub-modular with respect to the set of subsets, such that adding any new set to the reward yields diminishing returns.

Theorem 3.2. Given that the empirical risk of each classifier f_{σ_k} is submodular and monotonically decreasing w.r.t. the elements in σ_k, and uniform sensor costs, the strategy in Alg. 2 is an O(1) approximation of the optimal reward in (12).

Proof of these statements is included in the Supplementary Material and centers on showing that the objective is sub-modular, and therefore applying a greedy strategy yields a 1 − 1/e approximation of the optimal strategy [16].

Algorithm 2 Sensor Subset Selection
Input: Number of subsets t, cost constraint B/δ
Output: Feature subsets, σ_1, ..., σ_t
Initialize: σ_1, ..., σ_t = ∅
(i, j) = argmax_{i ∈ {1,...,t}} argmax_{j ∈ σ_i^C} G(σ_1, ..., σ_i ∪ j, ..., σ_t)
while Σ_{j=1}^{t} |σ_j| ≤ B/δ do
  σ_i = σ_i ∪ j
  (i, j) = argmax_{i ∈ {1,...,t}} argmax_{j ∈ σ_i^C} G(σ_1, ..., σ_i ∪ j, ..., σ_t)
end while

3.2 Constructing DAG using Sensor Subsets

Alg. 2 requires computation of the reward G for only O((B/δ) tM) sensor subsets, where M is the number of sensors, to return a constant-order approximation to the NP-hard knapsack-type problem. Given the set of sensor subsets σ_1, ..., σ_t, we can now construct a DAG using all possible unions of these subsets, where each sensor subset σ_j is treated as a single new sensor, and apply the small scale system presented in Sec. 2. The result is an efficiently learned system with relatively low complexity yet a strong performance/cost trade-off. 
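The greedy step in Alg. 2 can be sketched as follows: repeatedly add the single (subset, sensor) pair with the largest marginal gain in the reward G until the sensor budget is spent. Here a toy coverage-style reward stands in for the empirical classification reward, and all names are hypothetical:

```python
def greedy_subsets(sensors, t, budget, reward):
    """Greedy sketch of subset selection: add the (subset, sensor) pair with
    the largest reward gain until the budget on total sensors is exhausted."""
    subsets = [set() for _ in range(t)]
    while sum(len(s) for s in subsets) < budget:
        best, best_gain = None, 0.0
        for i in range(t):
            for m in sensors - subsets[i]:
                trial = [s | {m} if j == i else s for j, s in enumerate(subsets)]
                gain = reward(trial) - reward(subsets)
                if gain > best_gain:
                    best, best_gain = (i, m), gain
        if best is None:      # no sensor improves the reward; stop early
            break
        subsets[best[0]].add(best[1])
    return subsets

# Toy reward: fraction of "examples" covered, where example e is covered
# iff some subset contains all the sensors e needs.
examples = [{0}, {0, 1}, {2}]
def reward(subsets):
    return sum(any(e <= set(s) for s in subsets) for e in examples) / len(examples)

sel = greedy_subsets({0, 1, 2}, t=2, budget=3, reward=reward)
assert reward(sel) == 1.0
```

By Lemma 3.1 the true reward is submodular, so this kind of greedy strategy inherits the 1 − 1/e approximation guarantee cited in the text.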
Additionally, this\nresult can be extended to the case of non-uniform\ncosts, where a simple extension of the greedy algorithm yields a constant-order approximation [12].\nA simple case where three subsets are used is shown in Fig. 2. The three learned subsets of sensors\nare shown on the bottom left of Fig. 2, and these three subsets are then used to construct the entire\nDAG in the same fashion as in Fig. 1. At each stage, the state is represented by the union of sensor\nsubsets acquired. Grouping the sensors in this fashion reduces the size of the graph to 8 nodes as\nopposed to 64 nodes required if any subset of the 6 sensors can be selected. This approach allows\nus to map high-dimensional adaptive sensor selection problems to small scale DAG in Sec. 2.\n\n\u03c3i = \u03c3i \u222a j\n(i, j) = argmaxi\u2208{1,...,t} argmaxj\u2208\u03c3C\nG(\u03c31, ..., \u03c3i \u222a j, ..., \u03c3t)\n\nend while\n\ni\n\ni\n\n4 Experimental Results\n\nTo demonstrate the performance of our DAG sensor acquisition system, we provide experimental\nresults on data sets previously used in budgeted learning. Three data sets previously used for budget\ncascades [19, 23] are tested. In these data sets, examples are composed of a small number of sensors\n(under 4 sensors). To compare performance, we apply the LP approach to learning sensor trees [20]\nand construct trees containing all subsets of sensors as opposed to \ufb01xed order cascades [19, 23].\nNext, we examine performance of the DAG system using 3 higher dimensional sets of data previ-\nously used to compare budgeted learning performance [11]. In these cases, the dimensionality of\nthe data (between 50 and 400 features) makes exhaustive subset construction computationally infea-\nsible. We greedily construct sensor subsets using Alg. 2, then learn a DAG over all unions of these\nsensor subsets. 
We compare performance with CSTC [25] and ASTC [11].

For all experiments, we use cost-sensitive filter trees [2], where each binary classifier in the tree is learned using logistic regression. Homogeneous polynomials are used as decision functions in the filter trees. For all experiments, a uniform sensor cost was varied in the range [0, M] to achieve systems with different budgets. Performance between the systems is compared by plotting the average number of features acquired during test-time vs. the average test error.

4.1 Small Sensor Set Experiments

Figure 3: Average number of sensors acquired vs. average test error comparison between LP tree systems and DAG systems on (a) letter, (b) pima, and (c) satimage.

We compare performance of our trained DAG with that of a complete tree trained using an LP surrogate [20] on the landsat, pima, and letter datasets. To construct each sensor DAG, we include all subsets of sensors (including the empty set) and connect any two nodes differing by a single sensor, with the edge directed from the smaller sensor subset to the larger sensor subset. By including the empty set, no initial sensor needs to be selected. 3rd-order homogeneous polynomials are used for both the classification and system functions in the LP and DAG.

As seen in Fig. 3, the systems learned with a DAG outperform the LP tree systems. Additionally, the performance of both systems is significantly better than previously reported performance on these data sets for budget cascades [19, 23]. This arises due to both the higher complexity of the classifiers and decision functions and the flexibility of sensor acquisition order in the DAG and LP tree compared to cascade structures.
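The exhaustive subset-lattice DAG used in the small sensor set experiments (nodes for every subset of sensors, edges between subsets differing by a single sensor) can be sketched as follows. This is a minimal illustration with hypothetical names, not the authors' implementation; decision and classification functions at each node are omitted.

```python
# Sketch of the exhaustive sensor-subset DAG: one node per subset of
# sensors (including the empty set), with a directed edge from each
# subset to every superset obtained by adding a single sensor.
from itertools import combinations

def build_subset_dag(num_sensors):
    sensors = range(num_sensors)
    # Each node is a frozenset recording which sensors have been acquired.
    nodes = [frozenset(c) for r in range(num_sensors + 1)
             for c in combinations(sensors, r)]
    # Edges point from the smaller subset to the larger subset.
    edges = [(u, u | {s}) for u in nodes for s in sensors if s not in u]
    return nodes, edges

nodes, edges = build_subset_dag(3)
print(len(nodes))  # 2^3 = 8 subset states
```

For 6 sensors this lattice has the 64 nodes mentioned in Sec. 3.2, which is what motivates grouping sensors into a few greedily chosen subsets before building the DAG.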
For this setting, the DAG approach appears superior to LP trees for learning budgeted systems.

4.2 Large Sensor Set Experiments

Figure 4: Comparison between CSTC, ASTC, and DAG of the average number of acquired features (x-axis) vs. test error (y-axis) on (a) MiniBooNE, (b) Forest, and (c) CIFAR.

Next, we compare performance of our trained DAG with that of CSTC [25] and ASTC [11] for the MiniBooNE, Forest, and CIFAR datasets. We use the validation data to find the homogeneous polynomial that gives the best classification performance using all features (MiniBooNE: linear, Forest: 2nd order, CIFAR: 3rd order). These polynomial functions are then used for all classification and policy functions. For each data set, Alg. 2 was used to find 7 subsets, with an 8th subset of all features added. An exhaustive DAG was trained over all unions of these 8 subsets.

Fig. 4 shows performance comparing the average cost vs. average error of CSTC, ASTC, and our DAG system. The systems learned with a DAG outperform both CSTC and ASTC on the MiniBooNE and Forest data sets, with comparable performance on CIFAR at low budgets and superior performance at higher budgets.

Acknowledgments

This material is based upon work supported in part by the U.S. National Science Foundation Grant 1330008, by the Department of Homeland Security, Science and Technology Directorate, Office of University Programs, under Grant Award 2013-ST-061-ED0001, by ONR Grant 50202168 and US AF contract FA8650-14-C-1728. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S.
DHS, ONR or AF.

References

[1] P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

[2] A. Beygelzimer, J. Langford, and P. Ravikumar. Multiclass classification with filter trees. 2007.

[3] R. Busa-Fekete, D. Benbouzid, and B. Kégl. Fast classification using sparse decision DAGs. In Proceedings of the 29th International Conference on Machine Learning, 2012.

[4] M. Chen, Z. Xu, K. Weinberger, O. Chapelle, and D. Kedem. Classifier cascade: Tradeoff between accuracy and feature evaluation cost. In International Conference on Artificial Intelligence and Statistics, 2012.

[5] G. Dulac-Arnold, L. Denoyer, P. Preux, and P. Gallinari. Datum-wise classification: A sequential approach to sparsity. In Machine Learning and Knowledge Discovery in Databases, pages 375–390, 2011.

[6] T. Gao and D. Koller. Active classification based on value of classifier. In Advances in Neural Information Processing Systems, volume 24, pages 1062–1070, 2011.

[7] H. He, H. Daume III, and J. Eisner. Imitation learning by coaching. In Advances in Neural Information Processing Systems, pages 3158–3166, 2012.

[8] S. Ji and L. Carin. Cost-sensitive feature acquisition and classification. Pattern Recognition, 40(5), 2007.

[9] P. Kanani and P. Melville.
Prediction-time active feature-value acquisition for cost-effective customer targeting. In Advances in Neural Information Processing Systems, 2008.

[10] S. Karayev, M. Fritz, and T. Darrell. Dynamic feature selection for classification on a budget. In International Conference on Machine Learning: Workshop on Prediction with Sequential Models, 2013.

[11] M. Kusner, W. Chen, Q. Zhou, Z. Xu, K. Weinberger, and Y. Chen. Feature-cost sensitive learning with submodular trees of classifiers. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

[12] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In International Conference on Knowledge Discovery and Data Mining, 2007.

[13] L. Maaten, M. Chen, S. Tyree, and K. Q. Weinberger. Learning with marginalized corrupted features. In Proceedings of the 30th International Conference on Machine Learning, 2013.

[14] F. Nan, J. Wang, and V. Saligrama. Feature-budgeted random forest. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

[15] F. Nan, J. Wang, K. Trapeznikov, and V. Saligrama. Fast margin-based cost-sensitive classification. In International Conference on Acoustics, Speech and Signal Processing, 2014.

[16] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of approximations for maximizing submodular set functions-I. Mathematical Programming, 14(1):265–294, 1978.

[17] V. S. Sheng and C. X. Ling. Feature value acquisition in testing: A sequential batch test algorithm. In Proceedings of the 23rd International Conference on Machine Learning, pages 809–816, 2006.

[18] I. Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1):128–142, 2005.

[19] K. Trapeznikov and V. Saligrama.
Supervised sequential classification under budget constraints. In International Conference on Artificial Intelligence and Statistics, pages 581–589, 2013.

[20] J. Wang, T. Bolukbasi, K. Trapeznikov, and V. Saligrama. Model selection by linear programming. In European Conference on Computer Vision, pages 647–662, 2014.

[21] J. Wang and V. Saligrama. Local supervised learning through space partitioning. In Advances in Neural Information Processing Systems, pages 91–99, 2012.

[22] J. Wang and V. Saligrama. Locally-linear learning machines (L3M). In Asian Conference on Machine Learning, pages 451–466, 2013.

[23] J. Wang, K. Trapeznikov, and V. Saligrama. An LP for sequential learning under budgets. In International Conference on Artificial Intelligence and Statistics, pages 987–995, 2014.

[24] Z. Xu, O. Chapelle, and K. Weinberger. The greedy miser: Learning under test-time budgets. In Proceedings of the 29th International Conference on Machine Learning, 2012.

[25] Z. Xu, M. Kusner, M. Chen, and K. Weinberger. Cost-sensitive tree of classifiers. In Proceedings of the 30th International Conference on Machine Learning, pages 133–141, 2013.

[26] C. Zhang and Z. Zhang. A survey of recent advances in face detection. Technical report, Microsoft Research, 2010.