{"title": "Fast Template Evaluation with Vector Quantization", "book": "Advances in Neural Information Processing Systems", "page_first": 2949, "page_last": 2957, "abstract": "Applying linear templates is an integral part of many object detection systems and accounts for a significant portion of computation time. We describe a method that achieves a substantial end-to-end speedup over the best current methods, without loss of accuracy. Our method is a combination of approximating scores by vector quantizing feature windows and a number of speedup techniques including cascade. Our procedure allows speed and accuracy to be traded off in two ways: by choosing the number of Vector Quantization levels, and by choosing to rescore windows or not. Our method can be directly plugged into any recognition system that relies on linear templates. We demonstrate our method to speed up the original Exemplar SVM detector [1] by an order of magnitude and Deformable Part models [2] by two orders of magnitude with no loss of accuracy.", "full_text": "Fast Template Evaluation with Vector Quantization\n\nMohammad Amin Sadeghi\n\nDepartment of Computer Science\n\nDavid Forsyth\n\nDepartment of Computer Science\n\nUniversity of Illinois at Urbana-Champaign\n\nUniversity of Illinois at Urbana-Champaign\n\nmsadegh2@illinois.edu\n\ndaf@illinois.edu\n\nAbstract\n\nApplying linear templates is an integral part of many object detection systems and\naccounts for a signi\ufb01cant portion of computation time. We describe a method that\nachieves a substantial end-to-end speedup over the best current methods, without\nloss of accuracy. Our method is a combination of approximating scores by vector\nquantizing feature windows and a number of speedup techniques including cas-\ncade. Our procedure allows speed and accuracy to be traded off in two ways: by\nchoosing the number of Vector Quantization levels, and by choosing to rescore\nwindows or not. 
Our method can be directly plugged into any recognition system\nthat relies on linear templates. We demonstrate our method to speed up the orig-\ninal Exemplar SVM detector [1] by an order of magnitude and Deformable Part\nmodels [2] by two orders of magnitude with no loss of accuracy.\n\n1\n\nIntroduction\n\nOne core operation in computer vision involves evaluating a bank of templates at a set of sample\nlocations in an image. These sample locations are usually determined by sliding a window over the\nimage. This is by far the most computationally demanding task in current popular object detection\nalgorithms including canonical pedestrian [3] and face detection [4] methods (modern practice uses\na linear SVM); the deformable part models [2]; and exemplar SVMs [1]. The accuracy and \ufb02exibil-\nity of these algorithms has turned them into the building blocks of many modern computer vision\nsystems that would all bene\ufb01t from a fast template evaluation algorithm. There is a vast literature\nof models that are variants of these methods, but they mostly evaluate banks of templates at a set of\nsample locations in images.\nBecause this operation is important, there is now a range of methods to speed up this process,\neither by pruning locations to evaluate a template [7, 8] or by using fast convolution techniques.\nThe method we describe in this paper is signi\ufb01cantly faster than any previous method, at little or\nno loss of accuracy in comparison to the best performing reference implementations. Our method\ndoes not require retraining (it can be applied to legacy models). Our method rests on the idea\nthat it is suf\ufb01cient to compute an accurate, \ufb01xed-precision approximation to the value the original\ntemplate would produce. 
We use Vector Quantization speedups, together with a variety of evaluation\ntechniques and a cascade to exclude unpromising sample locations, to produce this approximation\nquickly.\nOur implementation is available online1 in the form of a MATLAB/C++ library. This library pro-\nvides simple interfaces for evaluating templates in dense or sparse grids of locations. We used this\nlibrary to implement a deformable part model algorithm that runs nearly two orders of magnitude\nfaster than the original implementation [2]. This library is also used to obtain an order of magnitude\nspeed-up for the exemplar SVM detectors of [1]. Our library could also be used to speed up various\nconvolution-based techniques such as convolutional neural networks.\n\n1http://vision.cs.uiuc.edu/ftvq\n\n1\n\n\fAs we discuss in section 4, speed comparisons in the existing literature are somewhat confusing.\nComputation costs break into two major terms: per image terms, like computing HOG features;\nand per (image\u00d7category) terms, where the cost scales with the number of categories as well as the\nnumber of images. The existing literature, entirely properly, focuses on minimizing the per (image\n\u00d7 category) terms, and as a result, various practical overhead costs are sometimes omitted. We feel\nthat for practical systems, all costs should be accounted for, and we do so.\n\n1.1 Prior Work\n\nAt heart, evaluating a deformable part model involves evaluating a bank of templates at a set of\nlocations in a scaled feature pyramid. There are a variety of strategies to speed up evaluation.\nCascades speed up evaluation by using cheap tests to identify sample points that do not require\nfurther evaluation. Cascades have been very successful in face detection algorithms (eg. [5, 6]) For\nexample, Felzenszwalb et al. [7] evaluate root models, and then evaluate the part scores iteratively\nonly in high-chance locations. 
At each iteration it evaluates the corresponding template only if\nthe current score of the object is higher than a certain threshold (trained in advance), resulting in an\norder of magnitude speed-up without signi\ufb01cant loss of accuracy. Pedersoli et al. [8] follow a similar\napproach but estimate the score of a location using a lower resolution version of the templates.\nTransform methods evaluate templates at all locations simultaneously by exploiting properties of\nthe Fast Fourier Transform. These methods, pioneered by Dubout et al. [9], result in a several fold\nspeed-up while being exact; however, there is the per image overhead of computing an FFT at the\nstart, and a per (image \u00d7 category) overhead of computing an inverse FFT at the end. Furthermore,\nthe approach computes the scores of all locations at once, and so is not random-access; it cannot be\nef\ufb01ciently combined with a cascade detection process. In contrast, our template evaluation algorithm\ndoes not require batching template evaluations. As a result, we can combine our evaluation speedups\nwith the cascade framework of [7]. We show that using our method in a cascade framework leads to\ntwo orders of magnitude speed-up comparing to the original deformable part model implementation.\nExtreme category scaling methods exploit locality sensitive hashing to get a system that can detect\n100,000 object categories in a matter of tens of seconds [10]. This strategy appears effective \u2014 one\ncan\u2019t tell precisely, because there is no ground truth data for that number of categories, nor are\ntheir baselines \u2014 and achieves a good speedup with very large numbers of categories. However,\nthe method cannot speedup detection of the 20 VOC challenge objects without signi\ufb01cant loss of\naccuracy. 
In contrast, because our method relies on evaluation speedups, it can speed up evaluation\nof even a single template.\nKernel approximation methods: Maji and Berg showed how to evaluate a histogram intersection\nkernel quickly [13]. Vedaldi et al. [12] propose a kernel approximation technique and use a new set\nof sparse features that are naturally faster to evaluate. This method provides a few folds speed-up\nwith manageable loss of accuracy.\nVector Quantization offers speedups in situations where arithmetic accuracy is not crucial\n(eg. [12, 14, 15, 16]). Jegou et al. [15] use Vector Quantization as a technique for approximate\nnearest neighbour search. They represent a vector by a short code composed of a number of sub-\nspace quantization indices. They ef\ufb01ciently estimate the euclidean distance between two vectors\nfrom their codes. This work has been very successful as it offers two orders of magnitude speedup\nwith a reasonable accuracy. Kokkinos [14] describes a similar approach to speed up dot-product.\nThis method can ef\ufb01ciently estimate the score of a template at a certain location by looking-up a\nnumber of tables. Vector Quantization is our core speedup technique.\nFeature quantization vs. Model quantization: Our method is similar to [12] as we both use Vector\nQuantization to speed up template evaluation. However, there is a critical difference in the way we\nquantize space. [12] quantizes the feature space and trains a new model using a high-dimensional\nsparse feature representation. In contrast, our method uses legacy models (that were trained on a\nlow-dimensional dense feature space) and quantizes the space only at the level of evaluating the\nscores. Our approach is simpler because it does not need to retrain a model; it also leads to higher\naccuracy as shown in Table 2.\n\n2\n\n\f(a) Input Image\n\n(b) Original HOG\n\n(c) 256 clusters\n\n(d) 16 clusters\n\nFigure 1: Visualization of Vector Quantized HOG features. 
(a) is the original image, (b) is the HOG\nvisualization, (c) is the visualization of Vector Quantized HOG feature into c = 256 clusters, (d)\nis the visualization of Vector Quantized HOG feature into c = 16 clusters. HOG visualizations are\nproduced using the inverse HOG algorithm from [19]. Vector Quantized HOG features into c = 256\nclusters can often preserve most of the visual information.\n\n2 Fast Approximate Scoring with Vector Quantization\n\nThe vast majority of modern object detectors work as follows:\n\nof the pyramid are computed.\n\n\u2022 In a preprocessing stage, an image pyramid and a set of underlying features for each layer\n\u2022 For each location in each layer of the pyramid, a \ufb01xed size window of the image fea-\ntures spanning the location is extracted. A set of linear functions of each such window is\ncomputed. The linear functions are then assembled into a score for each category at that\nlocation.\n\u2022 A post processing stage rejects scores that are either not local extrema or under threshold.\nPrecisely how the score is computed from linear functions varies from detector to detector. For\nexample, exemplar SVMs directly use the score; deformable part models summarize a score from\nseveral linear functions in nearby windows; and so on. The threshold for the post-processing stage\nis chosen using application loss criteria. Typically, detectors are evaluated by marking true windows\nin test data; establishing an overlap criterion to distinguish between false and true detects; plotting\nprecision as a function of recall; and then computing the average precision (AP; the integral of this\nplot). A detector that gets a good AP does so by assigning high values of the score to windows that\nstrongly overlap the right answer. 
Notice that what matters here is the ranking of windows, rather\nthan the actual value of the score; some inaccuracy in score computation might not affect the AP.\nIn all cases, the underlying features are the HOG features, originally described by Dalal and\nTriggs [3]. HOG features for a window consist of a grid of cells, where each cell contains a\nd-dimensional vector (typically d = 32) that corresponds to a small region of the image (typically\n8 \u00d7 8 pixels).\nThe linear template is usually thought of as an m \u00d7 n table of vectors. Each entry of the table\ncorresponds to a grid element, and contains a d-dimensional vector w. The score at location (x, y)\nis given by:\n\n$$S(x, y) = \\sum_{\\Delta x=1}^{m} \\sum_{\\Delta y=1}^{n} w(\\Delta x, \\Delta y) \\cdot h(x + \\Delta x - 1, y + \\Delta y - 1)$$\n\nwhere w is a weight vector and h is the feature vector at a certain cell (both d-dimensional vectors).\nWe wish to compute an approximation to this score where (a) the accuracy of the approximation is\nrelatively easily manipulated, so we can trade-off speed and performance and (b) the approximation\nis extremely fast.\n\nFigure 2: The plot on the left side illustrates the trade-off between computation time and estimation\nerror |S(x, y) \u2212 S'(x, y)| using two approaches: Principal Component Analysis and Vector\nQuantization. The time reported here is the average time required for estimating the score of a 12 \u00d7 12\ntemplate. The number of PCA dimensions and the number of clusters are indicated on the working\npoints. The two scatter-plots illustrate template score estimations using 10^7 sample points. The\nworking points D = 2 for PCA and c = 4096 for VQ are comparable in terms of running time.\n\nTo do so, we quantize the feature vectors in each cell h(x, y) into c clusters using a basic k-means\nprocedure and encode each quantized cell q(x, y) using its cluster ID (which can range from 1 to\nc). 
Figure 1 visualizes the original HOG features and our quantized ones. We pre-compute the partial dot\nproduct of each template cell w(\u2206x, \u2206y) with all 1 \u2264 i \u2264 c possible centroids and store them in a\nlookup table T(\u2206x, \u2206y, i). We then approximate the dot product by looking up the table:\n\n$$S'(x, y) = \\sum_{\\Delta x=1}^{m} \\sum_{\\Delta y=1}^{n} T(\\Delta x, \\Delta y, q(x + \\Delta x - 1, y + \\Delta y - 1)).$$\n\nThis reduces the per-template computation complexity of exhaustive search from \u0398(mnd) to \u0398(mn). In\npractice 32 multiplications and 32 additions are replaced by one lookup and one addition. This can\npotentially speed up the process by a factor of 32. Table lookup is often slower than multiplication;\ntherefore gaining the full speed-up requires certain implementation techniques that we will explain\nin the next section.\nThe cost of this approximation is that S'(x, y) \u2260 S(x, y), and tight bounds on the difference are\nunavailable. However, as c gets large, we expect the approximation to improve. As Figure 2\ndemonstrates, the approximation is good in practice, and improves quickly with larger c. A natural\nalternative, offered by Felzenszwalb et al. [7], is to use PCA to compress the cell vectors. This\napproximation should work well if high scoring vectors lie close to a low-dimensional affine space; the\napproximation can be improved by taking more principal components. However, the approximation\nwill work poorly if the cell vectors have a \u201cblobby\u201d distribution, which appears to be the case here.\nOur experimental analysis shows Vector Quantization is generally more effective than principal\ncomponent analysis for speeding up dot product estimation. Figure 2 compares the time-accuracy\ntrade-offs posed by both techniques.\nIt should be obvious that this VQ approximation technique is compatible with a cascade. 
As results\nbelow show, this approximate estimate of S(x, y) is in practice extremely fast, particularly when\nimplemented with a cascade. The value of c determines the trade-off between speed and accuracy.\nWhile the loss of accuracy is small, it can be mitigated. Most object detection algorithms only retain\nthe small fraction of scores that are higher than a certain threshold. Very low scores contribute\nlittle recall, and do not change AP significantly either (because the contribution to the integral is\ntiny). A further speed-accuracy tradeoff involves re-scoring the top scoring windows using the\nexact evaluation of S(x, y). Our experimental results show that the described Vector Quantized\nconvolution coupled with a re-estimation step significantly speeds up the detection process without\nany loss of accuracy.\n\n[Figure 2 panels: Computation Time vs. Estimation Error (axes: Computation Time (\u00b5s), Estimation Error; curves: PCA, VQ); Principal Component Analysis, D = 2 (axes: True Score, Estimated Score); Vector Quantization, C = 4096 (axes: True Score, Estimated Score)]\n\n[Figure 3 panels: Spatial Padding; Sapp; Sdef; S]\n\nFigure 3: Left: A single template can be zero-padded spatially to generate multiple larger templates.\nWe pack the spatially padded templates to evaluate several locations in one pass. Right: visualization\nof Sapp, Sdef and S. To estimate the maximum score we start from the center and move to the highest\nscoring neighbour until we reach a local maximum. In this example, we take three iterations to reach\nthe global maximum. 
In this example we compute the template on 17 locations in three steps (rightmost\nimage).\n\n3 Fast Score Estimation Techniques\n\nImplementing a Vector Quantization score estimation is straightforward, and is the primary source of\nour speedup. However, a straightforward implementation cannot leverage the full speed-up potential\navailable with Vector Quantization. In this section we describe a few important techniques we used\nto obtain further speed.\nExploiting Cascades: It should be obvious that our VQ approximation technique is compatible with\na cascade. We incorporated our Vector Quantization technique into the cascade detection algorithm\nof [7], resulting in a several-fold speed-up with no loss of accuracy. The cascade algorithm estimates\nthe root score and the part scores iteratively (based on a pre-trained order). At each iteration it\nprunes out the locations whose score is lower than a certain threshold. This process is done in two passes;\nthe first pass uses a fast score estimation technique while the second pass uses the original template\nevaluation. Felzenszwalb et al. [7] use PCA for the fast approximation stage. We instead use Vector\nQuantization to estimate the scores. In the case of deformable part models this procedure limits the\nprocess for both convolution and distance transform together. Furthermore, we use more aggressive\npruning thresholds because our estimation is more accurate.\nFast deformation estimates: To find the best deformation for a part template, Felzenszwalb et al. [7]\nperform an exhaustive search over a 9 \u00d7 9 grid of locations and find the deformation (\u2206x, \u2206y) that\nmaximizes:\n\n$$\\max_{\\Delta x, \\Delta y} S(\\Delta x, \\Delta y) = S_{app}(\\Delta x, \\Delta y) + S_{def}(\\Delta x, \\Delta y), \\quad -4 \\le \\Delta x, \\Delta y \\le 4$$\n\nwhere Sapp is the appearance score and Sdef is the deformation score. 
We observed that since Sdef\nis convex and significantly influences the score, searching for a local maximum would be a\nreasonable approximation. In a hill-climbing process we start from S(0, 0) and iteratively move to the\nneighbouring location that has the highest score among all neighbours. We stop when S(\u2206x, \u2206y)\nis larger than all its 8 neighbouring cells (Figure 3). This process considerably limits the number of\nlocations to be processed and further speeds up the process without any loss in accuracy.\nPacked Lookup Tables: Depending on the detailed structure of memory, a table lookup\ninstruction could be a few times slower than a multiplication instruction. When there are multiple\ntemplates to be evaluated at a certain location we pack their corresponding lookup tables and index\nthem all in one memory access, thereby reducing the number of individual memory references. This\nallows using SIMD instructions to run multiple additions in one CPU instruction.\nPadding Templates: Packing lookup tables appears unhelpful when there is only one template\nto evaluate. However, we can obtain multiple templates in this case by zero-padding the original\ntemplate (to represent various translates of that template; Figure 3). This allows packing the lookup\ntables to obtain the score of multiple locations in one pass.\n\nMethod              HOG features  per image  per (image\u00d7category)  per category\nOriginal DPM [2]    40ms          0ms        665ms                 0ms\nDPM Cascade [7]     40ms          6ms        84ms                  3ms\nFFLD [9]            40ms          7ms        91ms                  43ms\nOur+rescoring       40ms          76ms       21ms                  6ms\nOur-rescoring       40ms          76ms       9ms                   6ms\n\nTable 1: Average running time of the state-of-the-art detection algorithms on PASCAL VOC 2007\ndataset. The running time is broken into four major terms: feature computation, per image\npreprocess, per (image\u00d7category) process and per category preprocess. 
The running times refer to a\nparallel implementation using 6 threads on a XEON E5-1650 Processor.\n\nSparse lookup tables: Depending on the design of features and the clustering approach lookup\ntables can be sparse in some applications. Packing p dense lookup tables would require a dense\nc \u00d7 p table. However, if the lookup tables are sparse each row of the table could be stored in a\nsparse data structure. Thus, when indexing the table with a certain index, we just need to update the\nscores of a small fraction of templates. This would both limit the memory complexity and the time\ncomplexity for evaluating the templates.\nFixed point arithmetic: The most popular data type for linear classi\ufb01cation systems is 32-bit single\nprecision \ufb02oating point. In this architecture 24 bits are speci\ufb01ed for mantissa and sign. Since the\ntemplate evaluation process in this paper does not involve multiplication, the power datum would\nstay in about the same range so one could keep the data in \ufb01xed-point format as it requires simpler\naddition arithmetic. Our experiments have shown that using 16-bit \ufb01xed point precision speeds up\nevaluation without sacri\ufb01cing the accuracy.\n\n4 Computation Cost Model\n\nIn order to assess detection speed we need to understand the underlying computation cost. The\ncurrent literature is confusing because there is no established speed evaluation measure. Dean et\nal. [10] report a running time for all 20 PASCAL VOC categories that include all the preprocessing.\nDubout et al. [9] only report convolution time and distance transform time. Felzenszwalb et al. [7]\ncompare single-core running time while others report multi-core running times.\nComputation costs break into two major terms: per image terms, where the cost scales with the num-\nber of images and per (image\u00d7category) terms, where the cost scales with the number of categories\nas well as the number of images. 
The total time taken is the sum of four costs:\n\n\u2022 Computing HOG features is a mandatory, per image step, shared by all HOG-based\ndetection algorithms.\n\n\u2022 per image preprocessing is any process on image data-structure except HOG feature\nextraction. Examples include applying an FFT, or vector quantizing the HOG features.\n\n\u2022 per category preprocessing establishes the required detector data-structure. This is not\nusually a significant bottle-neck as there are often more images than categories.\n\n\u2022 per (image\u00d7category) processes include convolution, distance transform and any\npost-process that depends both on the image and the category.\n\nTable 1 compares the performance of our approach with four major state-of-the-art algorithms. The\nalgorithms described are evaluated on various scales of the image with various root templates. We\ncompared algorithms based on parallel implementation. Reference codes published by the authors\n(except [7]) were all implemented to use multiple cores. We parallelized [7] and the HOG feature\nextraction function for fair comparison. We evaluate all running times on a XEON E5-1650 Processor\n(6 Cores, 12MB Cache, 3.20 GHz).\n\nMethod                mAP    time\nHSC [20]              0.343  180s*\nWTA [10]              0.240  26s*\nDPM V5 [22]           0.330  13.3s\nDPM V4 [21]           0.301  13.2s\nDPM V3 [2]            0.268  11.6s\nRigid templates [23]  0.31   10s*\nVedaldi [12]          0.277  7s*\nDPM V4 -parts         0.214  2.8s\nFFLD [9]              0.323  1.8s\nDPM Cascade [7]       0.331  1.7s\nOur+rescoring         0.331  0.53s\nOur-rescoring         0.298  0.29s\n\nTable 2: Comparison of various different object detection methods on PASCAL VOC 2007 dataset.\nThe reported time here is the time to complete the detection of 20 categories starting from raw\nimage. The reference implementations of the marked (*) algorithms were not accessible so we used\npublished time statistics. 
These four works were published after 2012 and their baseline computers\nare comparable to ours in terms of speed.\n\n5 Experimental Results\n\nWe tested our template evaluation library on two well-known detection methods: (a) deformable\npart models and (b) exemplar SVM detectors. We used the PASCAL VOC 2007 dataset, which is an\nestablished benchmark for object detection algorithms. We also used legacy models from [1, 22] trained\non this dataset. We use the state-of-the-art baselines published in [1, 22].\nWe compare our algorithm using the 20 standard VOC objects. We report our average precision on\nall categories and compare them to the baselines. We also report mean average precision (mAP) and\nrunning time by averaging over categories (Table 3).\nWe run all of our experiments with c = 256 clusters. We perform an exhaustive search to find\nthe nearest cluster for all HOG pyramid cells, which takes on average 76ms for one image. The\ncomputation time of our exhaustive nearest neighbour search depends linearly on the number of clusters.\nIn our experiments c = 256 is shown to be enough for preserving detection accuracy. However, for\nmore general applications one might need to consider a different c.\n\n5.1 Deformable Part Models\n\nThe deformable part models algorithm is the standard object detection baseline. Although there is a\nsignificant difference between the latest version [22] and the earlier versions [2], various authors still\ncompare to the old versions. Table 2 compares our implementation to ten prominent methods\nincluding the original deformable part models versions 3, 4 and 5. In this paper we compare the average\nrunning time of the algorithms together with mean average precision of 20 categories. Detailed per\ncategory average precisions are published in the reference papers.\nThe original DPM package comes with a number of implementations for convolution (which is the\ndominant process). 
We compare to the fastest version that uses both CPU SIMD instructions and\nmulti-threading. All baseline algorithms are also multi-threaded. We present two versions of our\ncascade method. The first version (FTVQ+rescoring) selects a pool of candidate locations by quickly\nestimating scores. It then evaluates the original templates on the candidates to fine-tune the scores.\nThe second version (FTVQ-rescoring) purely relies on Vector Quantization to estimate scores and\ndoes not rescore templates. The second algorithm runs twice as fast with about a 3% drop in mean\naverage precision.\n\n5.2 Exemplar Detectors\n\nExemplar SVMs are important benchmarks as they deal with a large set of independent templates\nthat must be evaluated throughout the images. We first estimate template scores using our Vector\nQuantization based library. For the convolution we get a roughly 25-fold speedup compared to the\nbaseline implementation. Both our library and the baseline convolution make use of SIMD\noperations and multi-threading. We re-estimate the score of the top 1% of locations for each category and\nwe are virtually able to reproduce the original average precisions (Table 3). 
Including MATLAB\nimplementation overhead, our version of exemplar SVM is roughly 8-fold faster than the baseline\nwithout any loss in accuracy.\n\nMethod          aero  bicycle  bird  boat  bottle  bus  car  cat  chair  cow  diningtable  dog  horse  motorbike  person  pottedplant  sheep  sofa  train  tv   mAP    time\nDPM V5 [22]     .33   .59      .10   .18   .25     .51  .53  .19  .21    .24  .28          .12  .57    .48        .43     .14          .22    .36   .47    .39  0.330  665ms\nOurs+rescoring  .33   .59      .10   .16   .27     .51  .54  .22  .20    .24  .27          .13  .57    .49        .43     .14          .21    .36   .45    .42  0.331  21ms\nOurs-rescoring  .26   .58      .10   .11   .22     .45  .53  .20  .17    .19  .21          .11  .53    .44        .41     .11          .19    .32   .43    .41  0.298  9ms\nExemplar [1]    .19   .47      .03   .11   .09     .39  .40  .02  .06    .15  .07          .02  .44    .38        .13     .05          .20    .12   .36    .28  0.198  13.7ms\nOurs            .18   .47      .03   .11   .09     .39  .40  .02  .06    .15  .07          .02  .44    .38        .13     .05          .20    .12   .36    .28  0.197  1.7ms\n\nTable 3: Comparison of our method with two baselines on PASCAL VOC 2007. The top three rows\nrefer to DPM implementation while the last two rows refer to exemplar SVMs. We test our algorithm\nboth with and without accurate rescoring. The two bottom rows compare the performance of our\nexemplar SVM implementation with the baseline. For the top three rows running time refers to per\n(image\u00d7category) time. For the two bottom rows running time refers to per (image\u00d7exemplar) time\nthat includes MATLAB overhead.\n\n6 Discussion\n\nIn this paper we present a method to speed-up object detection by two orders of magnitude with little\nor no loss of accuracy. The main contribution of this paper lies in the right selection of techniques\nthat are compatible and together lead to a major speedup in template evaluation. The implementation\nof this work is available online to facilitate future research. 
This library is of special interest in large-\nscale and real-time object detection tasks.\nWhile our method is focussed on fast evaluation, it has implications for training. HOG features\nrequire 32 \u00d7 4 = 128 bytes to store the information in each cell (more than 60GB for the entire\nPASCAL VOC 2007 training set). This is why current detector training algorithms need to reload\nimages and recompute their feature vectors every time they are being used. Batching is not compat-\nible with the random-access nature of most training algorithms.\nIn contrast, Vector Quantized HOG features into 256 clusters would need 1 Byte per cell. This\nmakes storing the feature vectors of the whole PASCAL VOC 2007 training images in random access\nmemory entirely feasible (it would require about 1GB of memory). Doing so allows a SVM solver to\naccess points in the training set quickly. Our application speci\ufb01c implementation of PEGASOS [24]\nsolves a SVM classi\ufb01er for a 12 \u00d7 12 template with 108 training examples (uniformly distributed in\nthe training set) in a matter of one minute. Being able to access the whole training set plus faster\ntemplate evaluation could make hard negative mining either faster or unnecessary.\nThere are more opportunities for speedup. Notice that we pay a per image penalty computing the\nVector Quantization of the HOG features, on top of the cost of computing those features. We expect\nthat this could be sped up considerably, because we believe that estimating the Vector Quantized\ncenter to which an image patch goes should be much faster than evaluating the HOG features, then\nmatching.\n\nAcknowledgement\n\nThis work was supported in part by NSF Expeditions award IIS-1029035 and in part by ONR MURI\naward N000141010934.\n\nReferences\n[1] T. Malisiewicz and A. Gupta and A. Efros. Ensemble of Exemplar-SVMs for Object Detection\n\nand Beyond. In International Conference on Computer Vision, 2011.\n\n8\n\n\f[2] P. F. Felzenszwalb and R. B. 
Girshick and D. McAllester and D. Ramanan. Object Detection\nwith Discriminatively Trained Part Based Models. In IEEE Transactions on Pattern Analysis\nand Machine Intelligence, 2010.\n\n[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Confer-\n\nence on Computer Vision and Pattern Recognition, 2005.\n\n[4] H. Rowley and S. Baluja and T. Kanade. Neural Network-Based Face Detection.\n\nTransactions On Pattern Analysis and Machine intelligence, 1998.\n\nIn IEEE\n\n[5] P. Viola, M. Jones. Rapid object detection using a boosted cascade of simple features in Con-\n\nference on Computer Vision and Pattern Recognition, 2001\n\n[6] R. Sznitman, C. Becker, F. Fleuret, and P. Fua. Fast Object Detection with Entropy-Driven\n\nEvaluation. in Conference on Computer Vision and Pattern Recognition, 2013\n\n[7] P. F. Felzenszwalb and R. B. Girshick and D. McAllester. Cascade Object Detection with De-\nformable Part Models. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.\n[8] M. Pedersoli and J. Gonzalez and A. Bagdanov and and JJ. Villanueva. Recursive Coarse-to-\nFine Localization for fast Object Detection. In European Conference on Computer Vision, 2010.\n[9] C. Dubout and F. Fleuret. Exact Acceleration of Linear Object Detectors. In European Confer-\n\nence on Computer Vision, 2012.\n\n[10] T. Dean and M. Ruzon and M. Segal and J. Shlens and S. Vijayanarasimhan and J. Yagnik.\nFast, Accurate Detection of 100,000 Object Classes on a Single Machine. In IEEE Conference\non Computer Vision and Pattern Recognition, 2013.\n\n[11] P. Indyk and R. Motwani. Approximate nearest neighbours: Towards removing the curse of\n\ndimensionality. In ACM Symposium on Theory of Computing, 1998.\n\n[12] A. Vedaldi and A. Zisserman. Sparse Kernel Approximations for Ef\ufb01cient Classi\ufb01cation and\n\nDetection In IEEE Conference on Computer Vision and Pattern Recognition, 2012.\n\n[13] S. Maji and A. Berg, J. Malik. 
Efficient Classification for Additive Kernel SVMs. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.\n\n[14] I. Kokkinos. Bounding Part Scores for Rapid Detection with Deformable Part Models. In 2nd Parts and Attributes Workshop, in conjunction with ECCV, 2012.\n\n[15] Herv\u00e9 J\u00e9gou and Matthijs Douze and Cordelia Schmid. Product quantization for nearest neighbour search. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.\n\n[16] R. M. Gray and D. L. Neuhoff. Quantization. In IEEE Transactions on Information Theory, 1998.\n\n[17] S. Singh and A. Gupta and A. Efros. Unsupervised Discovery of Mid-level Discriminative Patches. In European Conference on Computer Vision, 2012.\n\n[18] I. Endres and K. Shih and J. Jiaa and D. Hoiem. Learning Collections of Part Models for Object Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.\n\n[19] C. Vondrick and A. Khosla and T. Malisiewicz and A. Torralba. Inverting and Visualizing Features for Object Detection. In arXiv preprint arXiv:1212.2278, 2012.\n\n[20] X. Ren and D. Ramanan. Histograms of Sparse Codes for Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.\n\n[21] P. Felzenszwalb and R. Girshick and D. McAllester. Discriminatively Trained Deformable Part Models, Release 4. In http://people.cs.uchicago.edu/ pff/latent-release4/.\n\n[22] R. Girshick and P. Felzenszwalb and D. McAllester. Discriminatively Trained Deformable Part Models, Release 5. In http://people.cs.uchicago.edu/ rbg/latent-release5/.\n\n[23] S. Divvala and A. Efros and M. Hebert. How important are \u2018Deformable Parts\u2019 in the Deformable Parts Model? In European Conference on Computer Vision, Parts and Attributes Workshop, 2012.\n\n[24] S. Shalev-Shwartz and Y. Singer and N. Srebro. 
Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Proceedings of the 24th International Conference on Machine Learning, 2007.", "award": [], "sourceid": 1346, "authors": [{"given_name": "Mohammad Amin", "family_name": "Sadeghi", "institution": "University of Illinois at Urbana-Champaign"}, {"given_name": "David", "family_name": "Forsyth", "institution": "University of Illinois at Urbana-Champaign"}]}