{"title": "Controlled Recognition Bounds for Visual Learning and Exploration", "book": "Advances in Neural Information Processing Systems", "page_first": 2915, "page_last": 2923, "abstract": "We describe the tradeoff between the performance in a visual recognition problem and the control authority that the agent can exercise on the sensing process. We focus on the problem of \u201cvisual search\u201d of an object in an otherwise known and static scene, propose a measure of control authority, and relate it to the expected risk and its proxy (conditional entropy of the posterior density). We show this analytically, as well as empirically by simulation using the simplest known model that captures the phenomenology of image formation, including scaling and occlusions. We show that a \u201cpassive\u201d agent given a training set can provide no guarantees on performance beyond what is afforded by the priors, and that an \u201comnipotent\u201d agent, capable of infinite control authority, can achieve arbitrarily good performance (asymptotically).", "full_text": "Controlled Recognition Bounds for Visual Learning\n\nand Exploration\n\nVasiliy Karasev1 Alessandro Chiuso2\n\nStefano Soatto1\n\n1University of California, Los Angeles\n\n2University of Padova\n\nAbstract\n\nWe describe the tradeoff between the performance in a visual recognition problem\nand the control authority that the agent can exercise on the sensing process. We\nfocus on the problem of \u201cvisual search\u201d of an object in an otherwise known and\nstatic scene, propose a measure of control authority, and relate it to the expected\nrisk and its proxy (conditional entropy of the posterior density). 
We show this ana-\nlytically, as well as empirically by simulation using the simplest known model that\ncaptures the phenomenology of image formation, including scaling and occlusions.\nWe show that a \u201cpassive\u201d agent given a training set can provide no guarantees on\nperformance beyond what is afforded by the priors, and that an \u201comnipotent\u201d agent,\ncapable of in\ufb01nite control authority, can achieve arbitrarily good performance\n(asymptotically). In between these limiting cases, the tradeoff can be characterized\nempirically.\n\n1\n\nIntroduction\n\nWe are interested in visual learning for recognition of objects and scenes embedded in physical space.\nRather than using datasets consisting of collections of isolated snapshots, however, we wish to\nactively control the sensing process during learning. This is because, in the presence of nuisance\nfactors involving occlusion and scale changes, learning requires mobility [1]. Visual learning is thus a\nprocess of discovery, literally uncovering occluded portions of an object or scene, and viewing it from\nclose enough that all structural details are revealed.1 We call this phase of learning exploration or\nmapping, accomplished by actively controlling the sensor motion within a scene, or by manipulating\nan object so as to discover all aspects.2\nOnce exploration has been performed, one has a model (or \u201cmap\u201d or \u201crepresentation\u201d) of the scene or\nobject of interest. One can then attempt to detect, localize or recognize a particular object or scene, or\na class of them, provided intra-class variability has been exposed during exploration. This phase can\nyield localization \u2013 where one wishes to recognize a portion of a mapped scene and, as a byproduct,\ninfer the pose relative to the map \u2013 or search where a particular object mapped during the exploration\nphase is detected and localized within an otherwise known scene. 
This can also be interpreted as a\nchange detection problem, where one wishes to revisit a known map to detect changes. In the case\nwhere a known object is sought in an unknown map, exploration and search have to be conducted\nsimultaneously.\n\n1It has been shown [1] that mobility is required in order to reduce the Actionable Information Gap, the\ndifference between the complexity of a maximal invariant of the data and the minimal suf\ufb01cient statistic of a\ncomplete representation of the underlying scene.\n\n2Note that we are not suggesting that one should construct a three-dimensional (3-D) model of an object or a\nscene for recognition, as opposed to using collections of 2-D images. From an information perspective, there is\nno gain in replacing a collection of 2-D images with a 3-D model computed from them. What matters is how\nthese images are collected. The multiple images must portray the same scene or object; otherwise one cannot attribute\nthe variability in the data to nuisance factors as opposed to intrinsic variability of the object of interest. The\nmultiple images must enable establishing correspondence between different images of the same scene. Temporal\ncontinuity enables that.\n\nWithin this scenario, exploration and search can be framed as optimal control and optimal stopping\ntime problems. These relate to active vision (next-best-view generation), active learning, robotic\nmotion planning, sequential decision in the setting of partially-observable Markov decision processes\n(POMDP) and a number of related \ufb01elds (including Information Bottleneck, Value of Information)\nand a vast literature that we cannot extensively review here. 
As often in this class of problems,\ninference algorithms are essentially intractable, so we wish to design surrogate tasks and prove\nperformance bounds to ensure desirable properties of the surrogate solution.\nIn this manuscript we consider the problem of detecting and estimating discrete parameters of an\nunknown object in a known environment. To this purpose we:\n\n1. Describe the simplest model that includes scaling and occlusion nuisances, a two dimensional\n\u201ccartoon \ufb02atland,\u201d and a test suite to perform simulation experiments. We derive an explicit\nprobability model to compute the posterior density given photometric measurements.\n\n2. Discuss the tradeoff between performance in a visual decision task and the control authority\nthat the explorer possesses. This tradeoff is akin the tradeoff between rate and distortion\nin a communication system, but it pertains to decision and control tasks, as opposed to the\ntransmission of data. We characterize this tradeoff for the simple case of a static environment,\nwhere control authority relates to reachability and energy.\n\n3. Discuss and test algorithms for visual search based on the maximization of the conditional\nentropy of future measurements and the proxies of this quantity. These algorithms can\nbe used to locate an unknown object in unknown position of a known environment, or to\nperform change detection in an otherwise known map, for the purpose of updating it.\n\n4. Provide experimental validation of the algorithms, including regret and expected exploration\n\nlength.\n\n1.1 Related prior work\n\nActive search and recognition of objects in the scene has been one of the mainstays of Active\nPerception in the eighties [2, 3], and has recently resurged (see [4] and references therein). The\nproblem can be formulated as a POMDP [5], solving which requires developing approximate, near-\noptimal policies. 
Active recognition using next-best-view generation and object appearance is\ndiscussed in [6] where authors use PCA to embed object images in a linear, low dimensional space.\nThe scheme does not incorporate occlusions or scale changes. More recently, information driven\nsensor control for object recognition was used in [7, 8, 9], who deal with visual and sonar sensors,\nbut take features (e.g. SIFT, SURF) to be the observed data. A utility function that accounts for\nocclusions, viewing angle, and distance to the object is proposed in [10] who aim to actively learn\nobject classi\ufb01ers during the training stage. Exploration and learning of 3D object surface models by\nrobotic manipulation is discussed in [11]. The case of object localization (and tracking if the object is\nmoving) is discussed in [12]; an information-theoretic approach to solving this problem using a sensor\nnetwork is described in [13]. Both works use realistic, nonlinear sensor models, which however are\ndifferent from photometric sensors and are not affected by the same nuisances. Typically, information-\ntheoretic utility functions used in these problems are submodular and thus can be ef\ufb01ciently optimized\nby greedy heuristics [14, 15]. With regard to models, our work is different in several aspects. First, instead\nof choosing the next best view on a sphere centered at the object, we model a cluttered environment\nwhere the object of interest occupies a negligible volume and is therefore fully occluded when viewed\nfrom most locations. Second, we wish to operate in a continuous environment, rather than in a world\nthat is discretized at the outset. Third, given the signi\ufb01cance of quantization-scale and occlusions in a\nvisual recognition task, we model the sensing process such that it accounts for both.\n\n2 Preliminaries\n\nLet y \u2208 Y denote data3 (measurements) and x \u2208 X a hidden class variable from a \ufb01nite alphabet\nthat we are interested in inferring. 
If prior p(x) and conditional distributions p(y|x) are known, the\nexpected risk can be written as\n\n$$ P_e = \\int p(y) \\left( 1 - \\max_i p(x_i \\mid y) \\right) dy \\tag{1} $$\n\nand minimized by Bayes\u2019 decision rule, which chooses the class label with maximum a posteriori\nprobability. If the distributions above are estimated empirically, the expected risk depends on the data\nset. We are interested in controlling the data acquisition process so as to make this risk as small as\npossible. We use the problem of visual search (\ufb01nding a not previously seen object in a scene) as\na motivation. It is related to active learning and experimental design. In order to enforce temporal\ncontinuity, we model the search agent (\u201cexplorer\u201d) as a dynamical system of the form:\n\n$$ \\begin{cases} \\xi_{t+1} = \\xi_t \\\\ g_{t+1} = f(g_t, u_t) \\\\ y_t = h(g_t, \\xi) + n_t \\end{cases} \\tag{2} $$\n\nwhere gt denotes the pose state at time t, ut denotes the control, and \u03be denotes the scene that\ndescribes the search environment \u2013 a collection of objects (simply-connected surfaces supporting a\nradiance function) of which the target x is one instance. Constraints on the controller enter through f;\nphotometric nuisances, quantization and occlusions enter through the measurement map h. Additive\nand unmodeled phenomena that affect observed data are incorporated into nt, the \u201cnoise\u201d term.\n\n3Random variables will be displayed in boldface (e.g. y), and realizations in regular fonts (e.g. y).\n\n2.1 Signal models\n\nThe simplest model that includes both scaling and occlusion nuisances is the \u201ccartoon \ufb02atland\u201d,\nwhere a bounded subset of R2 is populated by self-luminous line segments, corresponding to clutter\nobjects. We denote an instance of this model, the scene, by \u03be = (\u03c91, . . . , \u03c9C), which is a collection\nof C objects \u03c9k. The number of objects in the scene C is the clutter density parameter that can\npossibly grow to be in\ufb01nite in the limit. 
Each object is described by its center (ck), length (lk),\nbinary orientation (ok), and radiance function supported on the segment \u03c1k. This is the \u201ctexture\u201d or\n\u201cappearance\u201d of the object, which in the simplest case can be assumed to be a constant function:\n\n$$ \\omega_k = (c_k, l_k, o_k, \\rho_k) \\in [0, 1]^3 \\times \\{0, 1\\} \\times [\\mathbb{R}^2 \\to \\mathbb{R}_+] \\tag{3} $$\n\nAn agent can move continuously throughout the search domain. We take the state gt \u2208 R2 to be its\ncurrent position, ut \u2208 R2 the currently exerted move, and assume trivial dynamics: gt+1 = gt + ut.\nMore complex agents where gt \u2208 SE(3) can be incorporated without conceptual dif\ufb01culties.\nThe measurement model is that of an omnidirectional m-pixel camera, with each entry of yt \u2208 Rm\nin (2) given by:\n\n$$ y_t(i) = \\int_{(i - \\frac{1}{2}) \\frac{2\\pi}{m}}^{(i + \\frac{1}{2}) \\frac{2\\pi}{m}} \\int_0^{\\infty} \\rho_{\\ell(\\theta, g_t)}(z) \\, d\\theta \\, d\\tau + n_t(i), \\quad \\text{with } z = (\\tau \\cos(\\theta), \\tau \\sin(\\theta)) \\tag{4} $$\n\nwhere 2\u03c0/m is the angle subtended by each pixel. The integrand is a collection of radiance functions\nwhich are supported on objects (line segments). Because of occlusions, only the closest objects that\nintersect the pre-image contribute to the image. The index of the object (clutter or object of interest)\nthat contributes to the image is denoted by \u2113(\u03b8, gt) and is de\ufb01ned as:\n\n$$ \\ell(\\theta, g_t) = \\arg\\min_k \\left\\{ \\lambda_k \\;\\middle|\\; \\exists (s_k, \\lambda_k) \\in \\left[ -\\frac{l_k}{2}, \\frac{l_k}{2} \\right] \\times \\mathbb{R}_+ \\text{ s.t. } c_k + \\begin{pmatrix} o_k \\\\ 1 - o_k \\end{pmatrix} s_k = g + \\hat{g}(\\theta) \\lambda_k \\right\\} \\tag{5} $$\n\nAbove, g and \u011d(\u03b8) = (cos(\u03b8), sin(\u03b8)) are current position and direction, respectively. ck, lk, and ok\nare k-th segment center, length, and orientation. Condition ck + sk = g + \u011d(\u03b8)\u03bbk encodes intersection\nof ray g + \u011d(\u03b8) with a point on a segment k. The segment closest to viewer, i.e. one that is visible, has\nthe smallest \u03bbk. Integration over 2\u03c0/m in (4) accounts for quantization, and the layer model (5) describes\nocclusions. 
While the measurement model is non-trivial (in particular, it is not differentiable), it is\nthe simplest that captures the nuisance phenomenology. All unmodeled phenomena are lumped in the\nadditive term nt, which we assume to be zero-mean Gaussian \u201cnoise\u201d with covariance \u03c3\u00b2I.\nIn order to design control sequences to minimize risk, we need to evaluate the uncertainty of future\nmeasurements, those we have not yet measured, which are a function of the control action to be taken.\nTo that end, we write the probability model for computing the posterior and the predictive density.\n\nWe \ufb01rst describe the general case of visual exploration where the environment is unknown. We begin\nwith noninformative prior for objects k = 1, . . . , C\n\n$$ p(\\omega_k) = p(c_k) p(l_k) p(o_k) p(\\rho_k) = U[0, N_c]^2 \\times \\mathrm{Exp}(\\lambda) \\times \\mathrm{Ber}(1/2) \\times U[0, N_\\rho] \\tag{6} $$\n\nwhere U, Exp and Ber denote uniform, exponential, and Bernoulli distributions parameterized by\nNc, \u03bb, and N\u03c1. Then p(\u03be) = p(\u03c91, ..., \u03c9C). The posterior is then computed by Bayes rule4:\n\n$$ p(\\xi \\mid y^t, g^t) \\propto \\prod_{\\tau=1}^{t} p(y_\\tau \\mid g_\\tau, \\xi) \\, p(\\xi) = \\prod_{\\tau=1}^{t} \\mathcal{N}(y_\\tau - h(g_\\tau, \\xi); \\sigma^2 I) \\, p(\\xi) \\tag{7} $$\n\nAbove, N (z; \u03a3) denotes the value of a zero-mean Gaussian density with covariance \u03a3 at z. The\ndensity can be decomposed as a product of likelihoods since knowledge of environment (\u03be) and\nlocation (gt) is suf\ufb01cient to predict measurement yt up to Gaussian noise. The predictive distribution\n(distribution of the next measurement conditioned on the past) is computed by marginalization:\n\n$$ p(y_{t+1} \\mid y^t, g^t, g_{t+1}) = \\int p(\\xi \\mid y^t, g^t, g_{t+1}) \\, p(y_{t+1} \\mid \\xi, y^t, g_{t+1}) \\, d\\xi \\tag{8} $$\n\n$$ = \\int p(\\xi \\mid y^t, g^t) \\, \\mathcal{N}(y_{t+1} - h(g_{t+1}, \\xi); \\sigma^2 I) \\, d\\xi \\tag{9} $$\n\nThe marginalization above is essentially intractable. In this paper we focus on visual search of a\nparticular object in an otherwise known environment, so marginalization is only performed with\nrespect to a single object in the environment, x, whose parameters are discrete, but otherwise\nanalogous to (6):\n\n$$ p(x) = U\\{0, \\ldots, N_c - 1\\}^2 \\times \\mathrm{Exp}(\\lambda) \\times \\mathrm{Ber}(1/2) \\times U\\{0, \\ldots, N_\\rho - 1\\} \\tag{10} $$\n\nWe denote by xi, i = 1, ..., |X| the object with parameters (ci, li, oi, \u03c1i) and write \u03bei = (xi, \u03c91, . . . , \u03c9C)\nto denote the scene with known clutter objects \u03c91, ..., \u03c9C augmented by an unknown object xi. In this\ncase, we have:\n\n$$ p(x \\mid y^t, g^t) \\propto \\prod_{\\tau=1}^{t} \\mathcal{N}(y_\\tau - h(g_\\tau, \\xi); \\sigma^2 I) \\, p(x) \\tag{11} $$\n\n$$ p(y_{t+1} \\mid y^t, g^t, g_{t+1}) = \\sum_{i=1}^{|X|} p(x_i \\mid y^t, g^t) \\, \\mathcal{N}(y_{t+1} - h(g_{t+1}, \\xi_i); \\sigma^2 I) \\tag{12} $$\n\n3 The role of control in active recognition\n\nIt is clear from equations (11) and (12) that the history of agent\u2019s positions gt plays a key role in\nthe process of acquiring new information on the object of interest x for the purpose of recognition.\nThis is encoded by the conditional density (11). In the context of the identi\ufb01cation of the model (2),\none would say that data yt (a function of the scene and the history of positions) must be suf\ufb01ciently\ninformative [16] on x, meaning that yt contains enough information to estimate x; this can be\nmeasured e.g. through the Fisher information matrix if x is deterministic but unknown, or by the\nposterior p(x|yt) in a probabilistic setting. This depends upon whether ut is suf\ufb01ciently exciting, a\n\u201crichness\u201d condition that has been extensively used in the identi\ufb01cation and adaptive control literature\n[17, 18], which guarantees that the state trajectory gt explores the space of interest. If this condition\nis not satis\ufb01ed, there are limitations on the performance that can be attained during the search process.\nThere are two extreme cases which set upper and lower bounds on recognition error:\n\n1. 
Passive recognition: there is no active control, and instead a collection of vantage points gt\nis given a-priori. Under this scenario it is easy to prove that, averaging over the possible\nscenes and initial agent locations, the probability of error approaches chance (i.e. that given\nby the prior distribution) as clutter density and/or the environment volume increase.\n\n2. Full control on gt: if the control action can take the \u201comnipotent agent\u201d anywhere, and\nin\ufb01nite time is available to collect measurements, then the conditional entropy H(x|yt)\ndecreases asymptotically to zero thus providing arbitrarily good recognition rate in the limit.\n\n4Superscript in e.g. y^t indicates history of y up to t, i.e. y^t \u2250 (y_1, . . . , y_t) and y_t^{t+T} \u2250 (y_t, . . . , y_{t+T}).\n\nIn general, there is a tradeoff between the ability to gather new information through suitable control\nactions, which we name \u201ccontrol authority\u201d, and the recognition rate. In the sequel we shall propose\na measure for the \u201ccontrol authority\u201d over the sensing process; later in the paper we will consider\nconditional entropy as a proxy (upper bound) on probability of error and evaluate empirically how\ncontrol authority affects the conditional entropy decrease.\n\n3.1 Control authority\n\nUnlike the passive case, in the controlled scenario time plays an important role. This happens in two\nways. One is related to the ability to visit previously unexplored regions and therefore is related to\nthe reachable space under input and time constraints, the other is the effect of noise which needs\nto be averaged. If objects in the scene move, this can be done only at an expense in energy, and\nachieving asymptotic performance may not be possible under control limitations. This considerably\nmore complex scenario is beyond our scope in this paper. 
We focus on the simplest case of a static\nenvironment.\nControl authority depends on (i) the controller u, as measured for instance by a norm5 \u2016u\u2016 :\nU[0, T ] \u2192 R and (ii) on the geometry of the state space, the input-to-state map and on the environment.\nWe propose to measure control authority in the following manner: associate to each pair of locations\nin the state space (go, gf ) and a given time horizon T the cost \u2016u\u2016 required to move from go at time\nt = 0 to gf at time t = T along a minimum cost path i.e.\n\n$$ J_\\xi(g_o, g_f, T) \\doteq \\inf_{u \\,:\\, g_u(0) = g_o,\\ g_u(T) = g_f \\,\\mid\\, \\xi} \\|u\\| \\tag{13} $$\n\nwhere gu(t) is the state vector at time t under control u. If gf is not reachable from go in time T we\nset J\u03be(go, gf , T ) = \u221e. This will depend on the dynamical properties of the agent \u0121 = f (g, u) (or\ngt+1 = f (gt, ut) for discrete time) as well as on the scene \u03be through which the agent has to navigate\nwhile avoiding obstacles.\nThe control authority (CA) can be measured via the volume of the reachable space for \ufb01xed control\ncost, and will be a function of the initial con\ufb01guration go and of the scene \u03be, i.e.\n\n$$ CA(k, g_o, \\xi) \\doteq \\mathrm{Vol}\\{ g_f : J_\\xi(g_o, g_f, k) \\leq 1 \\} \\tag{14} $$\n\nIf instead one is interested in average performance (e.g. w.r.t. the possible scene distributions with\n\ufb01xed clutter density), a reasonable measure is the average of smallest volume (as go varies) of the\nreachable space with a unit cost input\n\n$$ CA(k) \\doteq E_\\xi \\left[ \\inf_{g_o} CA(k, g_o, \\xi) \\right] \\tag{15} $$\n\nIf planning on an inde\ufb01nitely long time horizon is allowed, then one would minimize J(go, gf , T )\nover time T :\n\n$$ J(g_o, g_f) \\doteq \\inf_{T \\geq 0} J(g_o, g_f, T) \\tag{16} $$\n\nwith\n\n$$ CA_\\infty \\doteq \\inf_{g_o} \\left( \\mathrm{Vol}\\{ g_f : J(g_o, g_f) \\leq 1 \\} \\right) \\tag{17} $$\n\nThe \ufb01gures CA(k, go, \u03be) in (14), CA(k) in (15) and CA\u221e in (17) are proxies of the exploration ability which,\nin turn, is related to the ability to gather new information on the task at hand. 
The data acquisition\nprocess can be regarded as an experiment design problem [16] where the choice of the control signal\nguides the experiment. Control authority, as de\ufb01ned above, measures how much freedom one has\non the sampling procedure; the larger the CA, the more freedom the designer has. Hence, having\n\ufb01xed (say) the number of snapshots of the scene, one may consider the time interval over which these\nsnapshots can be taken: the designer tries to maximize the information the data contain on the\ntask (making a decision on class label); this information is of course a nondecreasing function of CA.\nMore control authority corresponds to more freedom in the choice of which samples one is taking\n(from which location and at which scale).\nTherefore the risk, considered against CA(k) in (15), CA(k, go, \u03be) in (14) or CA\u221e in (17) will follow\na surface that depends on the clutter: For any given clutter (or clutter density), the risk will be a\nmonotonically non-increasing function of control authority CA(k). This is illustrated in Fig. 4.\n\n5This could be, for instance, total energy, (average) power, maximum amplitude and so on. We can assume\nthat the control is such that \u2016u\u2016 \u2264 1.\n\n4 Control policy\n\nGiven gt, \u03be, and a \ufb01nite control authority CA(k, gt, \u03be), in order to minimize average risk (1) with\nrespect to a sequence of control actions we formulate a \ufb01nite k-step horizon optimal control problem:\n\n$$ {u^*}_{t}^{t+k-1} = \\arg\\min_{u_t^{t+k-1}} \\int p(y_{t+1}^{t+k} \\mid y^t, u_t^{t+k-1}) \\left( 1 - \\max_i p(x_i \\mid y^t, y_{t+1}^{t+k}, u_t^{t+k-1}) \\right) dy_{t+1}^{t+k} \\tag{18} $$\n\nwhich is unfortunately intractable. As is standard, we can settle for the greedy k = 1 case:\n\n$$ u_t^* = \\arg\\min_{u_t} \\int p(y_{t+1} \\mid y^t, u_t) \\left( 1 - \\max_i p(x_i \\mid y^t, y_{t+1}, u_t) \\right) dy_{t+1} \\tag{19} $$\n\nbut it is still often impractical. 
We relax the problem further by choosing to minimize an upper\nbound on Bayesian risk, of which a convenient one is the conditional entropy (see [19], which shows\nPe \u2264 (1/2) H(x|y)). This implies that control action can be chosen by entropy minimization:\n\n$$ u_t^* = \\arg\\min_{u_t} H(x \\mid y^t, y_{t+1}, u_t) \\tag{20} $$\n\nUsing chain rules of entropy, we can rewrite minimization of H(x|yt, yt+1, ut) as maximization of\nconditional entropy of next measurement:\n\n$$ \\begin{aligned} u_t^* &= \\arg\\min_{u_t} H(x \\mid y^t, y_{t+1}, u_t) = \\arg\\min_{u_t} H(x \\mid y^t) - I(y_{t+1}; x \\mid y^t, u_t) && (21) \\\\ &= \\arg\\max_{u_t} H(y_{t+1} \\mid y^t, u_t) - H(y_{t+1} \\mid y^t, u_t, x) && (22) \\\\ &= \\arg\\max_{u_t} H(y_{t+1} \\mid y^t, u_t) && (23) \\end{aligned} $$\n\nHere (22) follows because H(x|yt) does not depend on ut and the mutual information is a difference of entropies, and (23) follows\nbecause H(yt+1|yt, ut, x) = H(nt+1) is due to Gaussian noise alone and does not depend on ut, since yt+1 = h(gt+1; \u03be) + nt+1 and\nboth gt+1 and \u03be are known (the only unknown object in the scene is x, and it is conditioned on).\nH(yt+1|yt, ut) is the entropy of a Gaussian mixture distribution which can be easily approximated\nby Monte Carlo, and for which both lower [20] and upper bounds [21] are known.\nSince the controller has energy limitations, i.e. is unable to traverse the environment in one step,\noptimization is taken over a small ball in R2 centered at current location gt. In practice, the set\nof controls needs to be discretized and entropy computed for each action. However, rather than\nmyopically choosing the next control, we instead choose the next target position, in a \u201cbandit\u201d\napproach [22, 23]: maximization in (23) is taken with respect to all locations in the world, rather\nthan the set of controls (locations reachable in one step), and the agent takes the minimum energy\npath toward the most informative location. 
Since this location is typically not reachable in a single\nstep, one can adopt a \u201cstubborn\u201d strategy that follows the planned path to the target location before\nchoosing next action, or an \u201cindecisive\u201d strategy that replans as soon as additional information becomes\navailable as a consequence of motion. We demonstrate the characteristics of conditional entropy as a\ncriterion for planning in Fig. 1.\n\nFigure 1: \u201cValue of measurement\u201d described by conditional entropy H(yt+1|yt, g) as a function of\nlocation g. We focus on three special cases, and for each show entropy, its lower bound (see [20]),\nand upper bound (based on Gaussian approximation, see [24]). In all cases, the agent is at the bottom\nof the environment, and a small unknown object is at the top. The agent has made one measurement\n(y1) and must now determine the best location to visit. The left three panels demonstrate a case of\nscaling: object is seen, but due to noise and quantization its parameters are uncertain. Agent gains\ninformation if putative object location (top) is approached. Middle three panels demonstrate partial\nocclusion: a part of the object has been seen, and there is now a region (bottom right corner) that is\nuninformative \u2013 measurements taken there are predictable. Full occlusion is shown in the right three\npanels. The object has not been seen (due to occluder in the middle of the environment) and the best\naction is to visit new area. Notice that lower and upper bounds are maximized at the same point as\nactual entropy. This was a common occurrence in many experiments that we did. Because we are\ninterested in the maximizing point, rather than the maximizing value, even if the bounds are loose,\nusing them for navigation can lead to reasonable results.\n\n5 Experiments\n\nIn addition to evaluating \u201cindecisive\u201d and \u201cstubborn\u201d strategies, we also consider several different\nuncertainty measures. Section 4 provided arguments for H(yt+1|yt, g) (a \u201cmax-ent\u201d approach)\nwhich is a proxy for minimization of Bayesian risk. Another option is to maximize covariance of\np(yt+1|yt, g) (\u201cmax-var\u201d), for example due to reduced computational cost. Alternatively, if we do not\nwish to hypothesize future measurements and compute p(yt+1|yt, g), we may search by approaching\nthe mode of the posterior distribution p(x|yt) (\u201cmax-posterior\u201d). To test average performance of\nthese strategies, we consider search in 100 environment instances, each containing 40 known clutter\nobjects and one unknown object. Clutter objects are sampled from the continuous prior distribution\n(6) and unknown object is chosen from the prior (10) discretized to |X| \u2248 9000. Agent\u2019s sensor has\nm = 30 pixels, with additive noise \u03c3 set to half of the difference between object colors. Conditional\nentropy of the next measurement, H(yt+1|yt, gt+1), is calculated over the entire map, on a 16x16\ngrid. Search is terminated once residual entropy falls below a threshold value: H(x|yt) < 0.001. We\nare interested in average search time (expressed in terms of number of steps) and average regret, which\nwe de\ufb01ne as the excess fraction of the minimum energy path to the center of the unknown object (c0)\nthat the explorer takes:\n\n$$ \\text{regret} \\doteq \\frac{c_u(x_o, c_0) - J(x_o, c_0)}{J(x_o, c_0)} $$\n\nBecause it is not always necessary to reach the\nobject to recognize it (viewing it closely from multiple viewpoints may be suf\ufb01cient), this quantity is\nan approximation to minimum search effort. We show an example of a typical problem instance in\nFig. 2. Statistics of strategies\u2019 performance are shown in Fig. 3. Minimum energy path and random\nwalk strategy play roles of lower and upper bounds. For each of the three uncertainty measures,\n\u201cindecisive\u201d outperformed \u201cstubborn\u201d in terms of both average path length and average regret, as\nalso shown in Table 1. Notice however that for speci\ufb01c problem instances \u201cindecisive\u201d can be much\nworse than \u201cstubborn\u201d \u2013 the curves for the two strategy types cross. Generally, \u201cmax-ent\u201d strategy\nseems to perform best, followed by \u201cmax-var\u201d, and \u201cmax-posterior\u201d. \u201cRandom-walk\u201d strategy was\nunable to \ufb01nd the object unless it was visible initially or became visible by chance. We next\nempirically evaluated explorer\u2019s exploration ability under \ufb01nite control authority. Reachable volume\nwas computed by Monte Carlo sampling, following (14)-(15) for several clutter density values. For\neach clutter density, we generated 40 scene instances and tested \u201cindecisive\u201d max-entropy strategy\nwith respect to control authority. Here |X| \u2248 2000, and other parameters remained as in previous\nexperiment. Fig. 4 empirically veri\ufb01es discussion in Section 3.\n\nFigure 2: A typical run of \u201cindecisive\u201d (left) and \u201cstubborn\u201d (right) strategies. Objects are colored\naccording to their radiance and the unknown object is shown as a thick line. Traveled path is shown\nin black. The thinner lines are the planned paths that were not traversed to the end because of\nreplanning. Stubborn explorer traverses each planned segment to its end. Right: Residual entropy\nH(x|yt) shown over time for the two strategies (top: \u201cindecisive\u201d, bottom: \u201cstubborn\u201d). Lower and\nupper bounds on H(x|yt, yt+1) can be computed prior to measuring yt+1 using upper and lower\nbounds on H(yt+1|yt). Sharp decrease occurs when object becomes visible.\n\nTable 1: Search time statistics for different strategies.\n\nindecisive\nstubborn\n\nAverage search duration\n\nAverage regret\n\nmax-ent max-var max-p(x|yt) max-ent max-var max-p(x|yt)\n28.42\n34.26\n\n1.44\n1.78\n\n41.00\n41.49\n\n32.70\n36.17\n\n1.27\n1.71\n\n1.96\n2.19\n\nFigure 3: Search time statistics for a 100 world test suite. Left: cumulative distribution of distance\nuntil detection traveled by the max-entropy, max-posterior, max-variance explorers, and random\nwalker. Right: cumulative distribution of regret for the explorers.\n\nFigure 4: Left: Control authority. The red dashed curve corresponds to reachable volume in the\nabsence of clutter. The black dashed line is the normalized maximum reachable volume in the\nenvironment. Right: Residual entropy H(x|yt), as a function of control authority and clutter density.\nBlack dashed line indicates H(x), entropy prior to taking any measurements. Lines correspond to\nresidual entropy for a given control authority averaged over the test suite; markers \u2013 to residual\nentropy on a speci\ufb01c problem instance. For certain scenes, agent is unable to signi\ufb01cantly reduce\nentropy because the object never becomes unoccluded (once object is seen, there is a sharp drop in\nresidual entropy, as shown in Fig. 2).\n\n6 Discussion\n\nWe have described a simple model that captures the phenomenology of nuisances in a visual search\nproblem, that includes uncertainty due to occlusion, scaling, and other \u201cnoise\u201d processes, and used\nit to compute the entropy of the prediction density to be used as a utility function in the control\npolicy. We have then related the amount of \u201ccontrol authority\u201d the agent can exercise during the\ndata acquisition process with the performance in the visual search task. The extreme cases show\nthat if one is given a passively gathered dataset of an arbitrary number of images, performance\ncannot be guaranteed beyond what is afforded by the prior. 
In the limit of infinite control authority,\narbitrarily good decision performance can be attained. In between, we have empirically characterized\nthe tradeoff between decision performance and control authority. We believe this to be a natural\nextension of rate-distortion tradeoffs where the underlying task is not transmission and storage of\ndata, but usage of (visual) data for decision and control.\n\nAcknowledgments\n\nResearch supported by ARO W911NF-11-1-0391 and DARPA MSEE FA8650-11-1-7154.\n\nReferences\n\n[1] S. Soatto. Steps towards a theory of visual information: Active perception, signal-to-symbol conversion and the interplay between sensing and control. arXiv:1110.2053, 2011.\n[2] R. Bajcsy. Active perception. Proceedings of the IEEE, 76(8):996\u20131005, 1988.\n[3] D. H. Ballard. Animate vision. Artificial Intelligence, 48(1):57\u201386, 1991.\n[4] A. Andreopoulos and J. K. Tsotsos. A theory of active object localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009.\n[5] N. Roy, G. Gordon, and S. Thrun. Finding approximate POMDP solutions through belief compression. Journal of Artificial Intelligence Research, 23:1\u201340, 2005.\n[6] H. Kopp-Borotschnig, L. Paletta, M. Prantl, and A. Pinz. Appearance-based active object recognition. Image and Vision Computing, 18(9):715\u2013727, 2000.\n[7] R. Eidenberger and J. Scharinger. Active perception and scene modeling by planning with probabilistic 6d object poses. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), 2010.\n[8] J. Ma and J. W. Burdick. Dynamic sensor planning with stereo for model identification on a mobile platform. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2010.\n[9] G. A. Hollinger, U. Mitra, and G. S. Sukhatme. 
Active classification: Theory and application to underwater inspection. In International Symposium on Robotics Research, 2011.\n[10] Z. Jia, A. Saxena, and T. Chen. Robotic object detection: Learning to improve the classifiers using sparse graphs for path planning. In IJCAI, 2011.\n[11] M. Krainin, B. Curless, and D. Fox. Autonomous generation of complete 3d object models using next best view manipulation planning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2011.\n[12] F. Bourgault, A. G\u00f6ktogan, T. Furukawa, and H. F. Durrant-Whyte. Coordinated search for a lost target in a Bayesian world. Advanced Robotics, 18(10), 2004.\n[13] G. M. Hoffmann and C. J. Tomlin. Mobile sensor network control using mutual information methods and particle filters. IEEE Transactions on Automatic Control, 55(1), 2010.\n[14] A. Krause and C. Guestrin. Near-optimal nonmyopic value of information in graphical models. In Uncertainty in Artificial Intelligence, 2005.\n[15] J. L. Williams, J. W. Fisher III, and A. S. Willsky. Performance guarantees for information theoretic active inference. In AI & Statistics (AISTATS), 2007.\n[16] L. Pronzato. Optimal experimental design and some related control problems. Automatica, 44:303\u2013325, 2008.\n[17] R. Bitmead. Persistence of excitation conditions and the convergence of adaptive schemes. IEEE Transactions on Information Theory, 30(2):183\u2013191, 1984.\n[18] L. Ljung. System Identification, Theory for the User. Prentice Hall, 1997.\n[19] M. E. Hellman and J. Raviv. Probability of error, equivocation and the Chernoff bound. IEEE Transactions on Information Theory, 16:368\u2013372, 1970.\n[20] J. R. Hershey and P. A. Olsen. Approximating the Kullback Leibler divergence between Gaussian mixture models. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4(6), 2007.\n[21] M. F. 
Huber, T. Bailey, H. Durrant-Whyte, and U. D. Hanebeck. On entropy approximation for Gaussian mixture random vectors. In Proceedings of the IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), 2008.\n[22] L. Valente, R. Tsai, and S. Soatto. Information gathering control via exploratory path planning. In Proceedings of the Conference on Information Sciences and Systems, March 2012.\n[23] R. Vidal, O. Shakernia, H. J. Kim, D. H. Shim, and S. Sastry. Probabilistic pursuit-evasion games: theory, implementation, and experimental evaluation. IEEE Transactions on Robotics, 18(5), 2002.\n[24] T. M. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991.\n", "award": [], "sourceid": 1320, "authors": [{"given_name": "Vasiliy", "family_name": "Karasev", "institution": null}, {"given_name": "Alessandro", "family_name": "Chiuso", "institution": null}, {"given_name": "Stefano", "family_name": "Soatto", "institution": null}]}