{"title": "Noisy Generalized Binary Search", "book": "Advances in Neural Information Processing Systems", "page_first": 1366, "page_last": 1374, "abstract": "This paper addresses the problem of noisy Generalized Binary Search (GBS).  GBS is a well-known greedy algorithm for determining a binary-valued hypothesis through a sequence of strategically selected queries.  At each step, a query is selected that most evenly splits the hypotheses under consideration into two disjoint subsets, a natural generalization of the idea underlying classic binary search.  GBS is used in many applications, including fault testing, machine diagnostics, disease diagnosis, job scheduling, image processing, computer vision, and active learning. In most of these cases, the responses to queries can be noisy.  Past work has provided a partial characterization of GBS, but existing noise-tolerant versions of GBS are suboptimal in terms of sample complexity.  This paper presents the first optimal algorithm for noisy GBS and demonstrates its application to learning multidimensional threshold functions.", "full_text": "Noisy Generalized Binary Search\n\nUniversity of Wisconsin-Madison\n\n1415 Engineering Drive, Madison WI 53706\n\nRobert Nowak\n\nnowak@ece.wisc.edu\n\nAbstract\n\nThis paper addresses the problem of noisy Generalized Binary Search (GBS).\nGBS is a well-known greedy algorithm for determining a binary-valued hypothe-\nsis through a sequence of strategically selected queries. At each step, a query is\nselected that most evenly splits the hypotheses under consideration into two dis-\njoint subsets, a natural generalization of the idea underlying classic binary search.\nGBS is used in many applications, including fault testing, machine diagnostics,\ndisease diagnosis, job scheduling, image processing, computer vision, and active\nlearning. In most of these cases, the responses to queries can be noisy. 
Past work\nhas provided a partial characterization of GBS, but existing noise-tolerant versions\nof GBS are suboptimal in terms of query complexity. This paper presents\nan optimal algorithm for noisy GBS and demonstrates its application to learning\nmultidimensional threshold functions.\n\n1 Introduction\n\nThis paper studies learning problems of the following form. Consider a finite, but potentially very\nlarge, collection of binary-valued functions H defined on a domain X. In this paper, H will be called\nthe hypothesis space and X will be called the query space. Each h \u2208 H is a mapping from X to\n{\u22121, 1}. Assume that the functions in H are distinct and that one function, h\u2217 \u2208 H, produces the\ncorrect binary labeling. The goal is to determine h\u2217 through as few queries from X as possible. For\neach query x \u2208 X, the value h\u2217(x), corrupted with independently distributed binary noise, is\nobserved. When the responses are noiseless, such queries are usually called membership queries to distinguish\nthem from other types of queries [Ang01]; here we will simply refer to them as queries. Problems\nof this nature arise in many applications, including channel coding [Hor63], experimental design\n[R\u00e9n61], disease diagnosis [Lov85], fault-tolerant computing [FRPU94], job scheduling [KPB99],\nimage processing [KK00], computer vision [SS93, GJ96], computational geometry [AMM+98], and\nactive learning [Das04, BBZ07, Now08].\nPast work has provided a partial characterization of this problem. If the responses to queries are\nnoiseless, then selecting the optimal sequence of queries from X is equivalent to determining an\noptimal binary decision tree, where a sequence of queries defines a path from the root of the tree\n(corresponding to H) to a leaf (corresponding to a single element of H).\nIn general the determination\nof the optimal tree is NP-complete [HR76]. 
However, there exists a greedy procedure\nthat yields query sequences that are within an O(log |H|) factor of the optimal search tree depth\n[GG74, KPB99, Lov85, AMM+98, Das04], where |H| denotes the cardinality of H. The greedy\nprocedure is referred to as Generalized Binary Search (GBS) [Das04, Now08] or the splitting\nalgorithm [KPB99, Lov85, GG74], and it reduces to classic binary search in special cases [Now08].\nThe GBS algorithm is outlined in Figure 1(a). At each step GBS selects a query that results in\nthe most even split of the hypotheses under consideration into two subsets responding +1 and \u22121,\nrespectively, to the query. The correct response to the query eliminates one of these two subsets\nfrom further consideration. Since the hypotheses are assumed to be distinct, it is clear that GBS\nterminates in at most |H| queries (since it is always possible to find a query that eliminates at least\none hypothesis at each step).\n\n(a) Generalized Binary Search (GBS)\ninitialize: i = 0, H_0 = H.\nwhile |H_i| > 1\n1) Select x_i = arg min_{x\u2208X} |\u2211_{h\u2208H_i} h(x)|.\n2) Obtain response y_i = h\u2217(x_i).\n3) Set H_{i+1} = {h \u2208 H_i : h(x_i) = y_i}, i = i + 1.\n\n(b) Noisy Generalized Binary Search (NGBS)\ninitialize: p_0 uniform over H.\nfor i = 0, 1, 2, . . .\n1) x_i = arg min_{x\u2208X} |\u2211_{h\u2208H} p_i(h) h(x)|.\n2) Obtain noisy response y_i.\n3) Bayes update p_i \u2192 p_{i+1}; Eqn. (1).\nhypothesis selected at each step: \u0125_i := arg max_{h\u2208H} p_i(h)\n\nFigure 1: Generalized binary search (GBS) algorithm and a noise-tolerant variant (NGBS).\n\nIn fact, there are simple examples demonstrating that this is the best\none can hope to do in general [KPB99, Lov85, GG74, Das04, Now08]. However, it is also true that\nin many cases the performance of GBS can be much better [AMM+98, Now08]. 
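As an illustration of the procedure in Figure 1(a), the following Python sketch runs noiseless GBS on a finite query set. The representation of each hypothesis as a tuple of +1/-1 labels, the function name gbs, and the one-dimensional threshold example are assumptions made here for illustration; they are not taken from the paper.

```python
def gbs(hypotheses, oracle):
    # Noiseless GBS over a finite query set {0, ..., m-1} (illustrative sketch).
    # hypotheses: list of distinct tuples of +1/-1 labels; h[x] plays the role of h(x).
    # oracle: function x -> noiseless response of the true hypothesis.
    viable = list(hypotheses)
    n_queries = 0
    num_x = len(viable[0])
    while len(viable) > 1:
        # Step 1: choose the query that splits the viable set most evenly.
        x = min(range(num_x), key=lambda q: abs(sum(h[q] for h in viable)))
        # Step 2: obtain the response.
        y = oracle(x)
        n_queries += 1
        # Step 3: keep only the hypotheses that agree with the response.
        viable = [h for h in viable if h[x] == y]
    return viable[0], n_queries


# Hypothetical example: thresholds on 8 points, i.e. classic binary search.
# thresholds[t] is -1 at positions below t and +1 from position t onward.
thresholds = [tuple(-1 if x < t else 1 for x in range(8)) for t in range(9)]
h_star = thresholds[5]
found, used = gbs(thresholds, lambda x: h_star[x])
```

Because the hypotheses are distinct, some query always splits the viable set, so each pass of the loop removes at least one hypothesis and the sketch terminates.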
In general, the\nnumber of queries required can be bounded in terms of a combinatorial parameter of H called the\nextended teaching dimension [Ang01, Heg95] (also see [HPRW96] for related work). Alternatively,\nthere exists a geometric relation between X and H, called the neighborly condition, that is\nsufficient to bound the number of queries needed [Now08].\nThe focus of this paper is noisy GBS. In many (if not most) applications it is unrealistic to assume\nthat the responses to queries are without error. Noise-tolerant versions of classic binary search have\nbeen well studied. The classic binary search problem is equivalent to learning a one-dimensional\nbinary-valued threshold function by selecting point evaluations of the function according to a bisection\nprocedure. A noisy version of classic binary search was studied first in the context of channel\ncoding with feedback [Hor63]. Horstein\u2019s probabilistic bisection procedure [Hor63] was shown to\nbe optimal (optimal decay of the error probability) [BZ74] (also see [KK07]).\nOne straightforward approach to noisy GBS was explored in [Now08]. The idea is to follow the GBS\nalgorithm, but to repeat the query at each step multiple times in order to decide whether the response\nis more probably +1 or \u22121. The strategy of repeating queries has been suggested as a general\napproach for devising noise-tolerant learning algorithms [K\u00e4\u00e406]. This simple approach has been\nstudied in the context of noisy versions of classic binary search and shown to be suboptimal [KK07].\nSince classic binary search is a special case of the general problem, it follows immediately that the\napproach proposed in [Now08] is suboptimal. This paper addresses the open problem of determining\nan optimal strategy for noisy GBS. An optimal noise-tolerant version of GBS is developed here. 
The\nnumber of queries an algorithm requires to confidently identify h\u2217 is called the query complexity of\nthe algorithm. The query complexity of the new algorithm is optimal, and we are not aware of any\nother algorithm with this capability.\nIt is also shown that the optimal convergence rate and query complexity are achieved for a broad class\nof geometrical hypotheses arising in image recovery and binary classification. Edges in images and\ndecision boundaries in classification problems are naturally viewed as curves in the plane or surfaces\nembedded in higher-dimensional spaces and can be associated with multidimensional threshold\nfunctions valued +1 and \u22121 on either side of the curve/surface. Thus, one important setting for\nGBS is when X is a subset of d-dimensional Euclidean space and the set H consists of multidimensional\nthreshold functions. We show that our algorithm achieves the optimal query complexity for\nactively learning multidimensional threshold functions in noisy conditions.\nThe paper is organized as follows. Section 2 describes the Bayesian algorithm for noisy GBS and\npresents the main results. Section 3 examines the proposed method for learning multidimensional\nthreshold functions. Section 4 discusses an agnostic algorithm that performs well even if h\u2217 is not\nin the hypothesis space H. Proofs are given in Section 5.\n\n2 A Bayesian Algorithm for Noisy GBS\n\nIn noisy GBS, one must cope with erroneous responses. Specifically, assume that the binary response\ny \u2208 {\u22121, 1} to each query x \u2208 X is an independent realization of the random variable Y satisfying\nP(Y = h\u2217(x)) > P(Y = \u2212h\u2217(x)), where h\u2217 \u2208 H is fixed but unknown. In other words, the\nresponse is only probably correct. If a query x is repeated more than once, then each response is\nan independent realization of Y. 
Define the noise level for the query x as \u03b1_x := P(Y = \u2212h\u2217(x)).\nThroughout the paper we will let \u03b1 := sup_{x\u2208X} \u03b1_x and assume that \u03b1 < 1/2.\nA Bayesian approach to noisy GBS is investigated in this paper. Let p_0 be a known probability measure\nover H. That is, p_0 : H \u2192 [0, 1] and \u2211_{h\u2208H} p_0(h) = 1. The measure p_0 can be viewed as an\ninitial weighting over the hypothesis class, expressing the fact that all hypotheses are equally reasonable\nprior to making queries. After each query and response (x_i, y_i), i = 0, 1, . . . , the distribution\nis updated according to\n\np_{i+1}(h) \u221d p_i(h) \u03b2^{(1\u2212z_i(h))/2} (1 \u2212 \u03b2)^{(1+z_i(h))/2},    (1)\n\nwhere z_i(h) = h(x_i) y_i, h \u2208 H, \u03b2 is any constant satisfying 0 < \u03b2 < 1/2, and p_{i+1}(h) is\nnormalized to satisfy \u2211_{h\u2208H} p_{i+1}(h) = 1. The update can be viewed as an application of Bayes rule\nand its effect is simple; the probability masses of hypotheses that agree with the label y_i are boosted\nrelative to those that disagree. The parameter \u03b2 controls the size of the boost. The hypothesis with\nthe largest weight is selected at each step: \u0125_i := arg max_{h\u2208H} p_i(h). If the maximizer is not unique,\none of the maximizers is selected at random. The goal of noisy GBS is to drive the error P(\u0125_i \u2260 h\u2217)\nto zero as quickly as possible by strategically selecting the queries. A similar procedure has been\nshown to be optimal for the noisy (classic) binary search problem [BZ74, KK07]. The crucial distinction\nhere is that GBS calls for a fundamentally different approach to query selection.\nThe query selection at each step must be informative with respect to the distribution p_i. 
For example,\nif the weighted prediction \u2211_{h\u2208H} p_i(h) h(x) is close to zero for a certain x, then a label at that point is\ninformative due to the large disagreement among the hypotheses. This suggests the noise-tolerant\nvariant of GBS outlined in Figure 1(b). 
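To make the update in Eqn. (1) and the NGBS loop of Figure 1(b) concrete, here is a minimal Python sketch. The tuple representation of hypotheses, the function names bayes_update and ngbs, the simulated noise level 0.1, and the choice beta = 0.25 are illustrative assumptions; ties in the final arg max are broken by taking the first maximizer, whereas the text breaks them at random.

```python
import random


def bayes_update(p, hyps, x, y, beta):
    # Eqn. (1): multiply by (1 - beta) when h(x) agrees with y and by beta
    # otherwise, then renormalize so that the weights sum to one.
    w = [pi * ((1.0 - beta) if h[x] == y else beta) for pi, h in zip(p, hyps)]
    total = sum(w)
    return [wi / total for wi in w]


def ngbs(hyps, respond, beta, n_steps):
    # hyps: list of distinct tuples of +1/-1 labels; respond(x) returns a noisy label.
    p = [1.0 / len(hyps)] * len(hyps)
    for _ in range(n_steps):
        # Step 1: query where the weighted prediction is closest to zero.
        x = min(range(len(hyps[0])),
                key=lambda q: abs(sum(pi * h[q] for pi, h in zip(p, hyps))))
        # Steps 2 and 3: observe the noisy response and apply the Bayes update.
        p = bayes_update(p, hyps, x, respond(x), beta)
    # Index of the hypothesis with the largest weight (first maximizer on ties).
    best = max(range(len(hyps)), key=lambda j: p[j])
    return best, p


# Hypothetical example: thresholds on 8 points, labels flipped with probability 0.1.
rng = random.Random(0)
hyps = [tuple(-1 if x < t else 1 for x in range(8)) for t in range(9)]
truth = 5


def noisy(x):
    y = hyps[truth][x]
    return -y if rng.random() < 0.1 else y


best, p = ngbs(hyps, noisy, beta=0.25, n_steps=60)
```

The sketch uses the unmodified selection rule, so it illustrates the mechanics of the update rather than the exponential rate established for the modified algorithm.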
This paper shows that a slight variation of the query\nselection in the NGBS algorithm in Figure 1 yields an algorithm with optimal query complexity.\nIt is shown that as long as \u03b2 is larger than the noise level of each query, the NGBS algorithm produces\na sequence of hypotheses, \u0125_0, \u0125_1, . . . , such that P(\u0125_n \u2260 h\u2217) is bounded above by a monotonically\ndecreasing sequence (see Theorem 1). The main interest of this paper is an algorithm that drives the\nerror to zero exponentially fast, and this requires the query selection criterion to be modified slightly.\nTo see why this is necessary, suppose that at some step of the NGBS algorithm a single hypothesis\n(e.g., h\u2217) has the majority of the probability mass. Then the weighted prediction will be almost\nequal to the prediction of that hypothesis (i.e., close to +1 or \u22121 for all queries), and therefore the\nresponses to all queries are relatively certain and non-informative. Thus, the convergence of the\nalgorithm could become quite slow in such conditions. A similar effect occurs in the case of noisy\n(classic) binary search [BZ74, KK07]. To address this issue, the query selection criterion is modified\nvia randomization so that the response to the selected query is always highly uncertain.\nIn order to state the modified selection procedure and the main results, observe that the query space\nX can be partitioned into equivalence subsets such that every h \u2208 H is constant for all queries in\neach such subset. Let A denote the smallest such partition. Note that X = \u22c3_{A\u2208A} A. For every\nA \u2208 A and h \u2208 H, the value of h(x) is constant (either +1 or \u22121) for all x \u2208 A; denote this value\nby h(A). As first noted in [Now08], A can play an important role in GBS. In particular, observe that\nthe query selection step in NGBS is equivalent to an optimization over A rather than X itself. 
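The partition A can be computed directly when the query set is finite: two queries belong to the same set exactly when every hypothesis labels them identically. The Python sketch below groups queries by their label pattern and evaluates the weighted prediction on one set. The function names partition_queries and weighted_prediction and the small threshold example are assumptions introduced for illustration only.

```python
from collections import defaultdict


def partition_queries(hyps, num_queries):
    # Group queries by the vector of labels (h[x] for h in hyps); queries with
    # the same vector form one set A of the partition.
    groups = defaultdict(list)
    for x in range(num_queries):
        groups[tuple(h[x] for h in hyps)].append(x)
    return dict(groups)


def weighted_prediction(p, pattern):
    # Sum over h of p(h) * h(A), where pattern holds the labels h(A).
    return sum(pi * label for pi, label in zip(p, pattern))


# Hypothetical example: three threshold hypotheses on four query points.
hyps = [tuple(-1 if x < t else 1 for x in range(4)) for t in (0, 2, 4)]
cells = partition_queries(hyps, 4)
```

Since every query in a set yields the same weighted prediction, the selection step only needs to examine one representative per set.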
The\nrandomization of the query selection step is based on the notion of neighboring sets in A.\nDefinition 1 Two sets A, A\u2032 \u2208 A are said to be neighbors if only a single hypothesis (and its\ncomplement, if it also belongs to H) outputs a different value on A and A\u2032.\nThe modified NGBS algorithm is outlined in Figure 2. Note that the query selection step is identical\nto that of the original NGBS algorithm, unless there exist two neighboring sets with strongly bipolar\nweighted responses. In the latter case, a query is randomly selected from one of these two sets with\nequal probability, which guarantees a highly uncertain response.\n\nModified NGBS\ninitialize: p_0 uniform over H.\nfor i = 0, 1, 2, . . .\n1) Let b = min_{A\u2208A} |\u2211_{h\u2208H} p_i(h) h(A)|. If there exist neighboring sets A and A\u2032\nwith \u2211_{h\u2208H} p_i(h) h(A) > b and \u2211_{h\u2208H} p_i(h) h(A\u2032) < \u2212b, then select x_i from\nA or A\u2032 with probability 1/2 each. Otherwise select x_i from the set A_min =\narg min_{A\u2208A} |\u2211_{h\u2208H} p_i(h) h(A)|. In the case that the sets above are non-unique,\nchoose at random any one satisfying the requirements.\n2) Obtain noisy response y_i.\n3) Bayes update p_i \u2192 p_{i+1}; Eqn. (1).\nhypothesis selected at each step: \u0125_i := arg max_{h\u2208H} p_i(h)\n\nFigure 2: Modified NGBS algorithm.\n\nTheorem 1 Let P denote the underlying probability measure (governing noises and algorithm\nrandomization). If \u03b2 > \u03b1, then both the NGBS and modified NGBS algorithms, in Figure 1(b) and\nFigure 2, respectively, generate a sequence of hypotheses such that P(\u0125_n \u2260 h\u2217) \u2264 a_n < 1, where\n{a_n}_{n\u22650} is a monotonically decreasing sequence.\nThe condition \u03b2 > \u03b1 ensures that the update (1) is not overly aggressive. 
We now turn to the\nmatter of sufficient conditions guaranteeing that P(\u0125_n \u2260 h\u2217) \u2192 0 exponentially fast with n. The\nexponential convergence rate of classic binary search hinges on the fact that the hypotheses can be\nordered with respect to X. In general situations, the hypothesis space cannot be ordered in such a\nfashion, but the neighborhood graph of A provides a similar local structure.\nDefinition 2 The pair (X, H) is said to be neighborly if the neighborhood graph of A is connected\n(i.e., for every pair of sets in A there exists a sequence of neighboring sets that begins at one of the\npair and ends with the other).\n\nIn essence, the neighborly condition simply means that each hypothesis is locally distinguishable\nfrom all others. By \u2018local\u2019 we mean in the vicinity of points x where the output of the hypothesis\nchanges from +1 to \u22121. The neighborly condition was first introduced in [Now08] in the analysis\nof GBS. It is shown in Section 3 that the neighborly condition holds for the important case of\nhypothesis spaces consisting of multidimensional threshold functions. If (X, H) is neighborly, then\nthe modified NGBS algorithm guarantees that P(\u0125_i \u2260 h\u2217) \u2192 0 exponentially fast.\n\nTheorem 2 Let P denote the underlying probability measure (governing noises and algorithm\nrandomization). If \u03b2 > \u03b1 and (X, H) is neighborly, then the modified NGBS algorithm in Figure 2\ngenerates a sequence of hypotheses satisfying\n\nP(\u0125_n \u2260 h\u2217) \u2264 |H| (1 \u2212 \u03bb)^n \u2264 |H| e^{\u2212\u03bbn}, n = 0, 1, . . 
.\n\nwith exponential constant \u03bb = min{(1 \u2212 c\u2217)/2, 1/4} (1 \u2212 \u03b2(1\u2212\u03b1)/(1\u2212\u03b2) \u2212 \u03b1(1\u2212\u03b2)/\u03b2), where\n\nc\u2217 := min_P max_{h\u2208H} |\u222b_X h(x) dP(x)|.    (2)\n\nThe exponential convergence rate^1 is governed by the key parameter 0 \u2264 c\u2217 < 1. The minimizer in\n(2) exists because the minimization can be computed over the space of finite-dimensional probability\nmass functions over the elements of A. As long as no hypothesis is constant over the whole of\nX, the value of c\u2217 is typically a small constant much less than 1 that is independent of the size\nof H (see [Now08, Now09] and the next section for concrete examples). In such situations, the\nconvergence rate of modified NGBS is optimal, up to constant factors. No other algorithm can solve\nthe noisy GBS problem with a lower query complexity. The query complexity of the modified NGBS\nalgorithm can be derived as follows. Let \u03b4 > 0 be a prespecified confidence parameter. The number\nof queries required to ensure that P(\u0125_n \u2260 h\u2217) \u2264 \u03b4 is n \u2265 \u03bb^{\u22121} log(|H|/\u03b4) = O(log(|H|/\u03b4)), which is the\noptimal query complexity. Intuitively, O(log |H|) bits are required to encode each hypothesis. More\nformally, the classic noisy binary search problem satisfies the assumptions of Theorem 2 [Now08],\nand hence it is a special case of the general problem. It is known that the optimal query complexity\nfor noisy classic binary search is O(log(|H|/\u03b4)) [BZ74, KK07].\n\n^1 Note that the factor (1 \u2212 \u03b2(1\u2212\u03b1)/(1\u2212\u03b2) \u2212 \u03b1(1\u2212\u03b2)/\u03b2) in the exponential rate parameter \u03bb is a positive constant\nstrictly less than 1. 
For a noise level \u03b1 this factor is maximized by a value \u03b2 \u2208 (\u03b1, 1/2) which tends to\n(1/2 + \u03b1)/2 as \u03b1 tends to 1/2.\n\nWe contrast this with the simple noise-tolerant GBS algorithm based on repeating each query in the\nstandard GBS algorithm of Figure 1(a) multiple times to control the noise (see [K\u00e4\u00e406, Now08] for\nrelated derivations). It follows from Chernoff\u2019s bound that the query complexity of determining the\ncorrect label for a single query with confidence at least 1 \u2212 \u03b4 is O(log(1/\u03b4) / |1/2\u2212\u03b1|^2). Suppose that GBS\nrequires n_0 queries in the noiseless situation. Then using the union bound, we require O(log(n_0/\u03b4) / |1/2\u2212\u03b1|^2)\nqueries at each step to guarantee that the labels determined for all n_0 queries are correct with\nprobability 1 \u2212 \u03b4. If (X, H) is neighborly, then GBS requires n_0 = O(log |H|) queries in noiseless\nconditions [Now08]. Therefore, under the conditions of Theorem 2, the query complexity of the\nsimple noise-tolerant GBS algorithm is O(log |H| log(log |H| / \u03b4)), a logarithmic factor worse than the\noptimal query complexity.\n\n3 Noisy GBS for Learning Multidimensional Thresholds\n\nWe now apply the theory and modified NGBS algorithm to the problem of learning multidimensional\nthreshold functions from point evaluations, a problem that arises commonly in computer vision\n[SS93, GJ96, AMM+98], image processing [KK00], and active learning [Das04, BBZ07, CN08,\nNow08]. 
In this case, the hypotheses are determined by (possibly nonlinear) decision surfaces in\nd-dimensional Euclidean space (i.e., X is a subset of R^d), and the queries are points in R^d.\nIt\nsuffices to consider linear decision surfaces of the form h_{a,b}(x) := sign(\u27e8a, x\u27e9 + b), where a \u2208 R^d,\n\u2016a\u2016_2 = 1, b \u2208 R, |b| \u2264 c for some constant c < \u221e, and \u27e8a, x\u27e9 denotes the inner product in R^d.\nNote that hypotheses of this form can be used to represent nonlinear decision surfaces by applying\na nonlinear mapping to the query space.\nTheorem 3 Let H be a finite collection of hypotheses of form sign(\u27e8a, x\u27e9 + b), with |b| \u2264 c for some constant\nc < \u221e. Then the hypotheses selected by the modified NGBS algorithm with \u03b2 > \u03b1 satisfy\n\nP(\u0125_n \u2260 h\u2217) \u2264 |H| e^{\u2212\u03bbn},\n\nwith \u03bb = (1/4) (1 \u2212 \u03b2(1\u2212\u03b1)/(1\u2212\u03b2) \u2212 \u03b1(1\u2212\u03b2)/\u03b2). Moreover, \u0125_n can be computed in time polynomial in |H|.\n\nBased on the discussion at the end of the previous section, we conclude that the query complexity\nof the modified NGBS algorithm is O(log |H|); this is optimal up to constant factors. The only\nother algorithm with this capability that we are aware of was analyzed in [BBZ07], and it is based\non a quite different approach tailored specifically to the linear threshold problem.\n\n4 Agnostic Algorithms\n\nWe also mention the possibility of agnostic algorithms guaranteed to find the best hypothesis in H\neven if the optimal hypothesis h\u2217 is not in H and/or the assumptions of Theorem 2 or 3 do not hold.\nThe best hypothesis in H is the one that minimizes the error with respect to a given probability measure\non X, denoted by P_X. 
The following theorem, proved in [Now09], demonstrates an agnostic\nalgorithm that performs almost as well as empirical risk minimization (ERM) in general, and has\nthe optimal O(log(|H|/\u03b4)) query complexity when the conditions of Theorem 2 hold.\nTheorem 4 Let P_X denote a probability distribution on X and suppose we have a query budget\nof n. Let h_1 denote the hypothesis selected by modified NGBS using n/3 of the queries and let h_2\ndenote the hypothesis selected by ERM from n/3 queries drawn independently from P_X. Draw the\nremaining n/3 queries independently from P_\u2206, the restriction of P_X to the set \u2206 \u2282 X on which h_1\nand h_2 disagree, and let R\u0302_\u2206(h_1) and R\u0302_\u2206(h_2) denote the average number of errors made by h_1 and\nh_2 on these queries. Select \u0125 = arg min{R\u0302_\u2206(h_1), R\u0302_\u2206(h_2)}. Then, in general,\n\nE[R(\u0125)] \u2264 min{E[R(h_1)], E[R(h_2)]} + \u221a(3/n),\n\nwhere R(h), h \u2208 H, denotes the probability of error of h with respect to P_X and E denotes the\nexpectation with respect to all random quantities. Furthermore, if the assumptions of Theorem 2\nhold with noise bound \u03b1, then\n\nP(\u0125 \u2260 h\u2217) \u2264 N e^{\u2212\u03bbn/3} + 2e^{\u2212n|1\u22122\u03b1|^2/6}.\n\n5 Appendix: Proofs\n\n5.1 Proof of Theorem 1\nLet E denote expectation with respect to P, and define C_n := (1 \u2212 p_n(h\u2217))/p_n(h\u2217). Note that\nC_n \u2208 [0, \u221e) reflects the amount of mass that p_n places on the suboptimal hypotheses. 
First note\nthat\n\nP(\u0125_n \u2260 h\u2217) \u2264 P(p_n(h\u2217) < 1/2) = P(C_n > 1) \u2264 E[C_n], by Markov\u2019s inequality.\n\nNext, observe that\n\nE[C_n] = E[(C_n/C_{n\u22121}) C_{n\u22121}] = E[E[(C_n/C_{n\u22121}) C_{n\u22121} | p_{n\u22121}]]\n= E[C_{n\u22121} E[(C_n/C_{n\u22121}) | p_{n\u22121}]] \u2264 E[C_{n\u22121}] max_{p_{n\u22121}} E[(C_n/C_{n\u22121}) | p_{n\u22121}]\n\u2264 C_0 (max_{i=0,...,n\u22121} max_{p_i} E[(C_{i+1}/C_i) | p_i])^n.\n\nNote that because p_0 is assumed to be uniform, C_0 = |H| \u2212 1. A similar conditioning technique\nis employed for interval estimation in [BZ74]. The rest of the proof entails showing that\nE[(C_{i+1}/C_i) | p_i] < 1, which proves the result, and requires a very different approach than [BZ74].\nThe precise form of p_1, p_2, . . . is derived as follows. Let \u03b4_i = (1 + \u2211_h p_i(h) z_i(h))/2, the\nweighted proportion of hypotheses that agree with y_i. The factor that normalizes the updated\ndistribution in (1) is related to \u03b4_i as follows. Note that\n\n\u2211_h p_i(h) \u03b2^{(1\u2212z_i(h))/2} (1 \u2212 \u03b2)^{(1+z_i(h))/2} = \u2211_{h: z_i(h)=\u22121} p_i(h) \u03b2 + \u2211_{h: z_i(h)=1} p_i(h) (1 \u2212 \u03b2) = (1 \u2212 \u03b4_i)\u03b2 + \u03b4_i(1 \u2212 \u03b2).\n\nThus,\n\np_{i+1}(h) = p_i(h) \u03b2^{(1\u2212z_i(h))/2} (1 \u2212 \u03b2)^{(1+z_i(h))/2} / ((1 \u2212 \u03b4_i)\u03b2 + \u03b4_i(1 \u2212 \u03b2)).\n\nDenote the reciprocal of the update factor for p_{i+1}(h\u2217) by\n\n\u03b3_i := ((1 \u2212 \u03b4_i)\u03b2 + \u03b4_i(1 \u2212 \u03b2)) / (\u03b2^{(1\u2212z_i(h\u2217))/2} (1 \u2212 \u03b2)^{(1+z_i(h\u2217))/2}),    (3)\n\nwhere z_i(h\u2217) = h\u2217(x_i) y_i, and observe that p_{i+1}(h\u2217) = p_i(h\u2217)/\u03b3_i. 
Thus,\n\nC_{i+1}/C_i = ((1 \u2212 p_i(h\u2217)/\u03b3_i) p_i(h\u2217)) / ((p_i(h\u2217)/\u03b3_i)(1 \u2212 p_i(h\u2217))) = (\u03b3_i \u2212 p_i(h\u2217)) / (1 \u2212 p_i(h\u2217)).\n\nNow to bound max_{p_i} E[C_{i+1}/C_i | p_i] < 1 we will show that max_{p_i} E[\u03b3_i | p_i] < 1. To accomplish\nthis, we will assume that p_i is arbitrary.\nFor every A \u2208 A and every h \u2208 H let h(A) denote the value of h on the set A. Define \u03b4+_A =\n(1 + \u2211_h p_i(h) h(A))/2, the proportion of hypotheses that take the value +1 on A. Note that for\nevery A we have 0 < \u03b4+_A < 1, since at least one hypothesis has the value \u22121 on A and p(h) > 0 for\nall h \u2208 H. Let A_i denote the set from which x_i is selected, and consider the four possible situations:\n\nh\u2217(x_i) = +1, y_i = +1 : \u03b3_i = ((1 \u2212 \u03b4+_{A_i})\u03b2 + \u03b4+_{A_i}(1 \u2212 \u03b2)) / (1 \u2212 \u03b2)\nh\u2217(x_i) = +1, y_i = \u22121 : \u03b3_i = (\u03b4+_{A_i}\u03b2 + (1 \u2212 \u03b4+_{A_i})(1 \u2212 \u03b2)) / \u03b2\nh\u2217(x_i) = \u22121, y_i = +1 : \u03b3_i = ((1 \u2212 \u03b4+_{A_i})\u03b2 + \u03b4+_{A_i}(1 \u2212 \u03b2)) / \u03b2\nh\u2217(x_i) = \u22121, y_i = \u22121 : \u03b3_i = (\u03b4+_{A_i}\u03b2 + (1 \u2212 \u03b4+_{A_i})(1 \u2212 \u03b2)) / (1 \u2212 \u03b2)\n\nTo bound E[\u03b3_i | p_i] it is helpful to condition on A_i. Define q_i := P_{x,y|A_i}(h\u2217(x) \u2260 Y). 
If h\u2217(A_i) =\n+1, then\n\nE[\u03b3_i | p_i, A_i] = (((1 \u2212 \u03b4+_{A_i})\u03b2 + \u03b4+_{A_i}(1 \u2212 \u03b2)) / (1 \u2212 \u03b2)) (1 \u2212 q_i) + ((\u03b4+_{A_i}\u03b2 + (1 \u2212 \u03b4+_{A_i})(1 \u2212 \u03b2)) / \u03b2) q_i\n= \u03b4+_{A_i} + (1 \u2212 \u03b4+_{A_i}) [\u03b2(1 \u2212 q_i)/(1 \u2212 \u03b2) + q_i(1 \u2212 \u03b2)/\u03b2].\n\nDefine \u03b3+_i(A_i) := \u03b4+_{A_i} + (1 \u2212 \u03b4+_{A_i}) [\u03b2(1 \u2212 q_i)/(1 \u2212 \u03b2) + q_i(1 \u2212 \u03b2)/\u03b2]. Similarly, if h\u2217(A_i) = \u22121, then\n\nE[\u03b3_i | p_i, A_i] = (1 \u2212 \u03b4+_{A_i}) + \u03b4+_{A_i} [\u03b2(1 \u2212 q_i)/(1 \u2212 \u03b2) + q_i(1 \u2212 \u03b2)/\u03b2] =: \u03b3\u2212_i(A_i).\n\nBy assumption q_i \u2264 \u03b1 < 1/2, and since \u03b1 < \u03b2 < 1/2 the factor \u03b2(1 \u2212 q_i)/(1 \u2212 \u03b2) + q_i(1 \u2212 \u03b2)/\u03b2 \u2264 \u03b2(1 \u2212 \u03b1)/(1 \u2212 \u03b2) +\n\u03b1(1 \u2212 \u03b2)/\u03b2 < 1. Define\n\n\u03b5_0 := 1 \u2212 \u03b2(1 \u2212 \u03b1)/(1 \u2212 \u03b2) \u2212 \u03b1(1 \u2212 \u03b2)/\u03b2,\n\nto obtain the bounds\n\n\u03b3+_i(A_i) \u2264 \u03b4+_{A_i} + (1 \u2212 \u03b4+_{A_i})(1 \u2212 \u03b5_0),    (4)\n\u03b3\u2212_i(A_i) \u2264 \u03b4+_{A_i}(1 \u2212 \u03b5_0) + (1 \u2212 \u03b4+_{A_i}).    (5)\n\nSince both \u03b3+_i(A_i) and \u03b3\u2212_i(A_i) are less than 1, it follows that E[\u03b3_i | p_i] < 1. \u25a1\n\n5.2 Proof of Theorem 2\n\nThe proof amounts to obtaining upper bounds for \u03b3+_i(A_i) and \u03b3\u2212_i(A_i), defined above in (4) and\n(5). For every A \u2208 A and any probability measure p on H the weighted prediction on A is defined\nto be W(p, A) := \u2211_{h\u2208H} p(h) h(A), where h(A) is the constant value of h for every x \u2208 A. 
The\n\ni (Ai) and \u03b3\u2212\n\nh\u2208H p(h)|(cid:82)\n\nX h(x) dP (x)| \u2264 maxh\u2208H |(cid:82)\n\n(cid:80)\n(cid:80)\nh\u2208H p(h)h(x)dP (x) \u2264(cid:80)\n\ndistribution P on X we have(cid:82)\nc\u2217 since(cid:82)\n\nfollowing lemma plays a crucial role in the analysis of the modi\ufb01ed NGBS algorithm.\nLemma 1 If (X ,H) is neighborly, then for every probability measure p on H there either exists a set\nA \u2208 A such that |W (p, A)| \u2264 c\u2217 or a pair of neighboring sets A, A(cid:48) \u2208 A such that W (p, A) > c\u2217\nand W (p, A(cid:48)) < \u2212c\u2217.\nProof of Lemma 1: Suppose that minA\u2208A |W (p, A)| > c\u2217. Then there must exist A, A(cid:48) \u2208 A\nsuch that W (p, A) > c\u2217 and W (p, A(cid:48)) < \u2212c\u2217, otherwise c\u2217 cannot be the minimax moment\nof H. To see this suppose, for instance, that W (p, A) > c\u2217 for all A \u2208 A. Then for every\nh\u2208H p(h)h(x)dP (x) > c\u2217. This contradicts the de\ufb01nition of\nX h(x) dP (x)|.\nThe neighborly condition guarantees that there exists a sequence of neighboring sets beginning at A\nand ending at A(cid:48). Since |W (p, A)| > c\u2217 on every set and the sign of W (p,\u00b7) must change at some\n(cid:3)\npoint in the sequence, it follows that there exist neighboring sets satisfying the claim.\nNow consider two distinct situations. De\ufb01ne bi := minA\u2208A |W (pi, A)|. First suppose that there do\nnot exist neighboring sets A and A(cid:48) with W (pi, A) > bi and W (pi, A(cid:48)) < \u2212bi. Then by Lemma 1,\nthis implies that bi \u2264 c\u2217, and according the query selection step of the modi\ufb01ed NGBS algorithm,\nAi = arg minA |W (pi, A)|. 
Note that because |W (pi, Ai)| \u2264 c\u2217, (1\u2212 c\u2217)/2 \u2264 \u03b4+\n\u2264 (1 + c\u2217)/2.\nHence, both \u03b3+\nNow suppose that there exist neighboring sets A and A(cid:48) with W (pi, A) > bi and W (pi, A(cid:48)) < \u2212bi.\nRecall that in this case Ai is randomly chosen to be A or A(cid:48) with equal probability. Note that\nA > (1 + bi)/2 and \u03b4+\n\u03b4+\n) \u2264 1 \u2212 \u03b50/4 ,\nE[\u03b3i|pi, Ai \u2208 {A, A(cid:48)}] <\nsince bi > 0. Similarly, if h\u2217(A) = h\u2217(A(cid:48)) = \u22121, then (5) yields E[\u03b3i|pi, Ai \u2208 {A, A(cid:48)}] <\n1 \u2212 \u03b50/4. If h\u2217(A) = \u22121 on A and h\u2217(A(cid:48)) = +1, then applying (5) on A and (4) on A(cid:48) yields\n\nA(cid:48) < (1 \u2212 bi)/2. If h\u2217(A) = h\u2217(A(cid:48)) = +1, then applying (4) results in\n\ni (Ai) are bounded above by 1 \u2212 \u03b50(1 \u2212 c\u2217)/2.\n\ni (Ai) and \u03b3\u2212\n\n(1 \u2212 \u03b50)) =\n\n(2 \u2212 \u03b50\n\n1 \u2212 bi\n\n1 + bi\n\n1 + bi\n\n(1 +\n\n1\n2\n\n1\n2\n\n+\n\nAi\n\n2\n\n2\n\n2\n\n(cid:0)\u03b4+\n\n7\n\nE[\u03b3i|pi, Ai \u2208 {A, A(cid:48)}] \u2264 1\n2\n1\n2\n1\n2\n\n=\n\nA(cid:48) + (1 \u2212 \u03b4+\nA(cid:48)))\n\nA \u2212 \u03b4+\n\nA(1 \u2212 \u03b50) + (1 \u2212 \u03b4+\n\nA) + \u03b4+\n\n(1 \u2212 \u03b4+\nA + \u03b4+\n(2 \u2212 \u03b50(1 + \u03b4+\n(1 + \u03b4+\n\nA(cid:48) + (1 \u2212 \u03b50)(1 + \u03b4+\nA \u2212 \u03b4+\nA \u2212 \u03b4+\n\nA(cid:48)))\nA(cid:48)) \u2264 1 \u2212 \u03b50/2 ,\n\n=\n= 1 \u2212 \u03b50\n2\n\nA(cid:48))(1 \u2212 \u03b50)(cid:1)\n\n\f(cid:0)\u03b4+\n\n1\n2\n\nA(cid:48))(cid:1)\n\nsince 0 \u2264 \u03b4+\nA and (5) on A(cid:48) to obtain\n\nA \u2212 \u03b4+\n\nA(cid:48) \u2264 1. The \ufb01nal possibility is that h\u2217(A) = +1 and h\u2217(A(cid:48)) = \u22121. 
Apply (4) on $A$ and (5) on $A'$ to obtain\n\n$$E[\\gamma_i | p_i, A_i \\in \\{A,A'\\}] \\le \\frac{1}{2}\\left(\\delta^+_A + (1-\\delta^+_A)(1-\\epsilon_0) + \\delta^+_{A'}(1-\\epsilon_0) + (1-\\delta^+_{A'})\\right)\n= \\frac{1}{2}\\left(1 + \\delta^+_A - \\delta^+_{A'} + (1-\\epsilon_0)(1 - \\delta^+_A + \\delta^+_{A'})\\right) .$$\n\nNext, use the fact that because $A$ and $A'$ are neighbors, $\\delta^+_A - \\delta^+_{A'} = p_i(h^*) - p_i(-h^*)$; if $-h^*$ does not belong to $\\mathcal{H}$, then $p_i(-h^*) = 0$. Hence,\n\n$$E[\\gamma_i | p_i, A_i \\in \\{A,A'\\}] \\le \\frac{1}{2}\\left(1 + \\delta^+_A - \\delta^+_{A'} + (1-\\epsilon_0)(1 - \\delta^+_A + \\delta^+_{A'})\\right)\n= \\frac{1}{2}\\left(1 + p_i(h^*) - p_i(-h^*) + (1-\\epsilon_0)(1 - p_i(h^*) + p_i(-h^*))\\right)\n\\le \\frac{1}{2}\\left(1 + p_i(h^*) + (1-\\epsilon_0)(1 - p_i(h^*))\\right) = 1 - \\frac{\\epsilon_0}{2}(1 - p_i(h^*)) ,$$\n\nsince the bound is maximized when $p_i(-h^*) = 0$. Now bound $E[\\gamma_i | p_i]$ by the maximum of the conditional bounds above to obtain\n\n$$E[\\gamma_i | p_i] \\le \\max\\left\\{ 1 - \\frac{\\epsilon_0}{2}(1 - p_i(h^*)) ,\\; 1 - \\frac{\\epsilon_0}{4} ,\\; 1 - (1-c^*)\\frac{\\epsilon_0}{2} \\right\\} ,$$\n\nand thus it is easy to see that\n\n$$E\\left[ \\frac{C_{i+1}}{C_i} \\,\\Big|\\, p_i \\right] = \\frac{E[\\gamma_i | p_i] - p_i(h^*)}{1 - p_i(h^*)} \\le 1 - \\min\\left\\{ \\frac{\\epsilon_0}{2}(1-c^*) ,\\; \\frac{\\epsilon_0}{4} \\right\\} .$$\n\n\u25a1\n\n5.3 Proof of Theorem 3\n\nFirst we show that the pair $(\\mathbb{R}^d,\\mathcal{H})$ is neighborly (Definition 2). Each $A \\in \\mathcal{A}$ is a polytope in $\\mathbb{R}^d$. These polytopes are generated by intersections of the halfspaces corresponding to the hypotheses. Any two polytopes that share a common face are neighbors (the hypothesis whose decision boundary defines the face, and its complement if it exists, are the only ones that predict different values on these two sets). 
Since the polytopes tessellate $\\mathbb{R}^d$, the neighborhood graph of $\\mathcal{A}$ is connected.\n\nNext consider the final bound in the proof of Theorem 2, above. We show that the value of $c^*$, defined in (2), is 0. Since the offsets $b$ of the hypotheses are all less than $c$ in magnitude, it follows that the distance from the origin to the nearest point of the decision surface of every hypothesis is at most $c$. Let $P_r$ denote the uniform probability distribution on a ball of radius $r$ centered at the origin in $\\mathbb{R}^d$. Then for every $h$ of the form $\\mathrm{sign}(\\langle a, x \\rangle + b)$\n\n$$\\left| \\int_{\\mathbb{R}^d} h(x) \\, dP_r(x) \\right| \\le \\frac{c}{r} ,$$\n\nso that $\\lim_{r \\to \\infty} \\left| \\int_{\\mathcal{X}} h(x) \\, dP_r(x) \\right| = 0$, and hence $c^* = 0$.\n\nLastly, note that the modified NGBS algorithm involves computing $\\sum_{h \\in \\mathcal{H}} p_i(h) h(A)$ for all $A \\in \\mathcal{A}$ at each step. The computational complexity of each step is therefore proportional to the cardinality of $\\mathcal{A}$, which is equal to the number of polytopes generated by intersections of half-spaces. It is known that $|\\mathcal{A}| = \\sum_{i=0}^{d} \\binom{|\\mathcal{H}|}{i} = O(|\\mathcal{H}|^d)$ [Buc43]. \u25a1\n\nReferences\n\n[AMM+98] E. M. Arkin, H. Meijer, J. S. B. Mitchell, D. Rappaport, and S. S. Skiena. Decision trees for geometric models. Intl. J. Computational Geometry and Applications, 8(3):343\u2013363, 1998.\n[Ang01] D. Angluin. Queries revisited. Springer Lecture Notes in Comp. Sci.: Algorithmic Learning Theory, pages 12\u201331, 2001.\n[BBZ07] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Conf. on Learning Theory (COLT), 2007.\n[Buc43] R. C. Buck. Partition of space. The American Math. Monthly, 50(9):541\u2013544, 1943.\n[BZ74] M. V. Burnashev and K. Sh. Zigangirov. An interval estimation problem for controlled observations. Problems in Information Transmission, 10:223\u2013231, 1974.\n[CN08] R. Castro and R. Nowak. Minimax bounds for active learning. IEEE Trans. Info. Theory, pages 2339\u20132353, 2008.\n[Das04] S. Dasgupta. Analysis of a greedy active learning strategy. In Neural Information Processing Systems, 2004.\n[FRPU94] U. Feige, P. Raghavan, D. Peleg, and E. Upfal. Computing with noisy information. SIAM J. Comput., 23(5):1001\u20131018, 1994.\n[GG74] M. R. Garey and R. L. Graham. Performance bounds on the splitting algorithm for binary testing. Acta Inf., 3:347\u2013355, 1974.\n[GJ96] D. Geman and B. Jedynak. An active testing model for tracking roads in satellite images. IEEE Trans. PAMI, 18(1):1\u201314, 1996.\n[Heg95] T. Heged\u00fcs. Generalized teaching dimensions and the query complexity of learning. In 8th Annual Conference on Computational Learning Theory, pages 108\u2013117, 1995.\n[Hor63] M. Horstein. Sequential decoding using noiseless feedback. IEEE Trans. Info. Theory, 9(3):136\u2013143, 1963.\n[HPRW96] L. Hellerstein, K. Pillaipakkamnatt, V. Raghavan, and D. Wilkins. How many queries are needed to learn? J. ACM, 43(5):840\u2013862, 1996.\n[HR76] L. Hyafil and R. L. Rivest. Constructing optimal binary decision trees is NP-complete. Inf. Process. Lett., 5:15\u201317, 1976.\n[K\u00a8a\u00a8a06] M. K\u00e4\u00e4ri\u00e4inen. Active learning in the non-realizable case. In Algorithmic Learning Theory, pages 63\u201377, 2006.\n[KK00] A. P. Korostelev and J.-C. Kim. Rates of convergence for the sup-norm risk in image models under sequential designs. Statistics & Probability Letters, 46:391\u2013399, 2000.\n[KK07] R. Karp and R. Kleinberg. Noisy binary search and its applications. In Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA 2007), pages 881\u2013890, 2007.\n[KPB99] S. R. Kosaraju, T. M. Przytycka, and R. Borgstrom. On an optimal split tree problem. Lecture Notes in Computer Science: Algorithms and Data Structures, 1663:157\u2013168, 1999.\n[Lov85] D. W. Loveland. Performance bounds for binary testing with arbitrary weights. Acta Informatica, 22:101\u2013114, 1985.\n[Now08] R. Nowak. Generalized binary search. In Proceedings of the 46th Allerton Conference on Communications, Control, and Computing, pages 568\u2013574, 2008.\n[Now09] R. Nowak. The geometry of generalized binary search. 2009. Preprint available at http://arxiv.org/abs/0910.4397.\n[R\u00b4en61] A. R\u00e9nyi. On a problem in information theory. MTA Mat. Kut. Int. Kozl., pages 505\u2013516, 1961. Reprinted in Selected Papers of Alfred R\u00e9nyi, vol. 2, P. Turan, ed., pp. 631\u2013638. Akademiai Kiado, Budapest, 1976.\n[SS93] M. J. Swain and M. A. Stricker. Promising directions in active vision. Int. J. Computer Vision, 11(2):109\u2013126, 1993.\n", "award": [], "sourceid": 684, "authors": [{"given_name": "Robert", "family_name": "Nowak", "institution": null}]}