{"title": "Maximin affinity learning of image segmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 1865, "page_last": 1873, "abstract": "Images can be segmented by first using a classifier to predict an affinity graph that reflects the degree to which image pixels must be grouped together and then partitioning the graph to yield a segmentation. Machine learning has been applied to the affinity classifier to produce affinity graphs that are good in the sense of minimizing edge misclassification rates. However, this error measure is only indirectly related to the quality of segmentations produced by ultimately partitioning the affinity graph. We present the first machine learning algorithm for training a classifier to produce affinity graphs that are good in the sense of producing segmentations that directly minimize the Rand index, a well known segmentation performance measure. The Rand index measures segmentation performance by quantifying the classification of the connectivity of image pixel pairs after segmentation. By using the simple graph partitioning algorithm of finding the connected components of the thresholded affinity graph, we are able to train an affinity classifier to directly minimize the Rand index of segmentations resulting from the graph partitioning. Our learning algorithm corresponds to the learning of maximin affinities between image pixel pairs, which are predictive of the pixel-pair connectivity.", "full_text": "Maximin affinity learning of image segmentation\n\nSrinivas C. Turaga \u2217\n\nMIT\n\nKevin L. Briggman\n\nMax-Planck Institute for Medical Research\n\nMoritz Helmstaedter\n\nMax-Planck Institute for Medical Research\n\nWinfried Denk\n\nMax-Planck Institute for Medical Research\n\nH. 
Sebastian Seung\n\nMIT, HHMI\n\nAbstract\n\nImages can be segmented by \ufb01rst using a classi\ufb01er to predict an af\ufb01nity graph\nthat re\ufb02ects the degree to which image pixels must be grouped together and then\npartitioning the graph to yield a segmentation. Machine learning has been applied\nto the af\ufb01nity classi\ufb01er to produce af\ufb01nity graphs that are good in the sense of\nminimizing edge misclassi\ufb01cation rates. However, this error measure is only indi-\nrectly related to the quality of segmentations produced by ultimately partitioning\nthe af\ufb01nity graph. We present the \ufb01rst machine learning algorithm for training a\nclassi\ufb01er to produce af\ufb01nity graphs that are good in the sense of producing seg-\nmentations that directly minimize the Rand index, a well known segmentation\nperformance measure.\nThe Rand index measures segmentation performance by quantifying the classi\ufb01-\ncation of the connectivity of image pixel pairs after segmentation. By using the\nsimple graph partitioning algorithm of \ufb01nding the connected components of the\nthresholded af\ufb01nity graph, we are able to train an af\ufb01nity classi\ufb01er to directly\nminimize the Rand index of segmentations resulting from the graph partitioning.\nOur learning algorithm corresponds to the learning of maximin af\ufb01nities between\nimage pixel pairs, which are predictive of the pixel-pair connectivity.\n\n1 Introduction\n\nSupervised learning has emerged as a serious contender in the \ufb01eld of image segmentation, ever\nsince the creation of training sets of images with \u201cground truth\u201d segmentations provided by humans,\nsuch as the Berkeley Segmentation Dataset [15]. 
Supervised learning requires 1) a parametrized algorithm that maps images to segmentations, 2) an objective function that quantifies the performance of a segmentation algorithm relative to ground truth, and 3) a means of searching the parameter space of the segmentation algorithm for an optimum of the objective function.\nIn the supervised learning method presented here, the segmentation algorithm consists of a parametrized classifier that predicts the weights of a nearest neighbor affinity graph over image pixels, followed by a graph partitioner that thresholds the affinity graph and finds its connected components. Our objective function is the Rand index [18], which has recently been proposed as a quantitative measure of segmentation performance [23]. We \u201csoften\u201d the thresholding of the classifier output and adjust the parameters of the classifier by gradient learning based on the Rand index.\n\n\u2217sturaga@mit.edu\n\nFigure 1: (left) Our segmentation algorithm. We first generate a nearest neighbor weighted affinity graph representing the degree to which nearest neighbor pixels should be grouped together. The segmentation is generated by finding the connected components of the thresholded affinity graph. (right) Affinity misclassification rates are a poor measure of segmentation performance. 
Af\ufb01n-\nity graph #1 makes only 1 error (dashed edge) but results in poor segmentations, while graph #2\ngenerates a perfect segmentation despite making many af\ufb01nity misclassi\ufb01cations (dashed edges).\n\nBecause maximin edges of the af\ufb01nity graph play a key role in our learning method, we call it max-\nimin af\ufb01nity learning of image segmentation, or MALIS. The minimax path and edge are standard\nconcepts in graph theory, and maximin is the opposite-sign sibling of minimax. Hence our work can\nbe viewed as a machine learning application of these graph theoretic concepts. MALIS focuses on\nimproving classi\ufb01er output at maximin edges, because classifying these edges incorrectly leads to\ngenuine segmentation errors, the splitting or merging of segments.\nTo the best of our knowledge, MALIS is the \ufb01rst supervised learning method that is based on opti-\nmizing a genuine measure of segmentation performance. The idea of training a classi\ufb01er to predict\nthe weights of an af\ufb01nity graph is not novel. Af\ufb01nity classi\ufb01ers were previously trained to minimize\nthe number of misclassi\ufb01ed af\ufb01nity edges [9, 16]. This is not the same as optimizing segmentations\nproduced by partitioning the af\ufb01nity graph. There have been attempts to train af\ufb01nity classi\ufb01ers to\nproduce good segmentations when partitioned by normalized cuts [17, 2]. But these approaches do\nnot optimize a genuine measure of segmentation performance such as the Rand index. The work of\nBach and Jordan [2] is the closest to our work. However, they only minimize an upper bound to a\nrenormalized version of the Rand index. Both approaches require many approximations to make the\nlearning tractable.\nIn other related work, classi\ufb01ers have been trained to optimize performance at detecting image pixels\nthat belong to object boundaries [16, 6, 14]. 
Our classifier can also be viewed as a boundary detector, since a nearest neighbor affinity graph is essentially the same as a boundary map, up to a sign inversion. However, we combine our classifier with a graph partitioner to produce segmentations. The classifier parameters are not trained to optimize performance at boundary detection, but to optimize performance at segmentation as measured by the Rand index.\nThere are also methods for supervised learning of image labeling using Markov or conditional random fields [10]. But image labeling is more similar to multi-class pixel classification than to image segmentation, as the latter task may require distinguishing between multiple objects in a single image that all have the same label.\nIn the cases where probabilistic random field models have been used for image parsing and segmentation, the models have either been simplistic for tractability reasons [12] or have been trained piecemeal. For instance, Tu et al. [22] separately train low-level discriminative modules based on a boosting classifier, and train high-level modules of their algorithm to model the joint distribution of the image and the labeling. These models have never been trained to minimize the Rand index.\n\n2 Partitioning a thresholded affinity graph by connected components\n\nOur class of segmentation algorithms is constructed by combining a classifier and a graph partitioner (see Figure 1). The classifier is used to generate the weights of an affinity graph. The nodes of the graph are image pixels, and the edges are between nearest neighbor pairs of pixels. The weights of the edges are called affinities. A high affinity means that the two pixels tend to belong to the same segment. 
The classi\ufb01er computes the af\ufb01nity of each edge based on an image patch surrounding the\nedge.\nThe graph partitioner \ufb01rst thresholds the af\ufb01nity graph by removing all edges with weights less\nthan some threshold value \u03b8. The connected components of this thresholded af\ufb01nity graph are the\nsegments of the image.\nFor this class of segmentation algorithms, it\u2019s obvious that a single misclassi\ufb01ed edge of the af\ufb01nity\ngraph can dramatically alter the resulting segmentation by splitting or merging two segments (see\nFig. 1). This is why it is important to learn by optimizing a measure of segmentation performance\nrather than af\ufb01nity prediction.\nWe are well aware that connected components is an exceedingly simple method of graph partition-\ning. More sophisticated algorithms, such as spectral clustering [20] or graph cuts [3], might be more\nrobust to misclassi\ufb01cations of one or a few edges of the af\ufb01nity graph. Why not use them instead?\nWe have two replies to this question.\nFirst, because of the simplicity of our graph partitioning, we can derive a simple and direct method\nof supervised learning that optimizes a true measure of image segmentation performance. So far\nlearning based on more sophisticated graph partitioning methods has fallen short of this goal [17, 2].\nSecond, even if it were possible to properly learn the af\ufb01nities used by more sophisticated graph\npartitioning methods, we would still prefer our simple connected components. The classi\ufb01er in\nour segmentation algorithm can also carry out sophisticated computations, if its representational\npower is suf\ufb01ciently great. Putting the sophistication in the classi\ufb01er has the advantage of making it\nlearnable, rather than hand-designed.\nThe sophisticated partitioning methods clean up the af\ufb01nity graph by using prior assumptions about\nthe properties of image segmentations. But these prior assumptions could be incorrect. 
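As a concrete illustration of this partitioning step, here is a minimal sketch (our own illustration, not code from the paper) that thresholds an affinity graph, given as a map from nearest neighbor pixel-pair edges to affinities, and labels the resulting connected components with union-find:

```python
def segment(edges, num_pixels, theta):
    """Threshold the affinity graph at theta, then label connected components.

    edges: dict mapping a pixel-index pair (i, j) to its affinity.
    Returns a list assigning a segment id to every pixel.
    """
    parent = list(range(num_pixels))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for (i, j), affinity in edges.items():
        if affinity > theta:           # keep only high-affinity edges
            parent[find(i)] = find(j)  # union the two components

    # Relabel component roots as consecutive segment ids.
    root_to_id, labels = {}, []
    for i in range(num_pixels):
        r = find(i)
        root_to_id.setdefault(r, len(root_to_id))
        labels.append(root_to_id[r])
    return labels
```

For example, on a four-pixel chain with affinities {(0, 1): 0.9, (1, 2): 0.2, (2, 3): 0.8} and threshold 0.5, the low-affinity middle edge is removed and the pixels fall into two segments, [0, 0, 1, 1].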
The spirit of the machine learning approach is to use a large amount of training data and minimize the use of prior assumptions. If the sophisticated partitioning methods are indeed the best way of achieving good segmentation performance, we suspect that our classifier will learn them from the training data. If they are not the best way, we hope that our classifier will do even better.\n\n3 The Rand index quantifies segmentation performance\n\nImage segmentation can be viewed as a special case of the general problem of clustering, as image segments are clusters of image pixels. Long ago, Rand proposed an index of similarity between two clusterings [18]. Recently it has been proposed that the Rand index be applied to image segmentations [23]. Define a segmentation S as an assignment of a segment label s_i to each pixel i. The indicator function \delta(s_i, s_j) is 1 if pixels i and j belong to the same segment (s_i = s_j) and 0 otherwise. Given two segmentations S and \hat{S} of an image with N pixels, define the function\n\n1 - RI(\hat{S}, S) = \binom{N}{2}^{-1} \sum_{i<j} |\delta(s_i, s_j) - \delta(\hat{s}_i, \hat{s}_j)|    (1)\n\nwhich is the fraction of image pixel pairs on which the two segmentations disagree. We will refer to the function 1 - RI(\hat{S}, S) as the Rand index, although strictly speaking the Rand index is RI(\hat{S}, S), the fraction of image pixel pairs on which the two segmentations agree. In other words, the Rand index is a measure of similarity, but we will often apply that term to a measure of dissimilarity.\nIn this paper, the Rand index is applied to compare the output \hat{S} of a segmentation algorithm with a ground truth segmentation S, and will serve as an objective function for learning. Figure 1 illustrates why the Rand index is a sensible measure of segmentation performance. 
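The quantity in Eq. (1) can be computed directly from two label arrays. A minimal sketch (ours, for illustration only):

```python
from itertools import combinations

def rand_error(seg_a, seg_b):
    """Fraction of pixel pairs on which two segmentations disagree about
    same-segment membership, i.e. 1 - RI in the notation of Eq. (1)."""
    n = len(seg_a)
    disagreements = sum(
        (seg_a[i] == seg_a[j]) != (seg_b[i] == seg_b[j])
        for i, j in combinations(range(n), 2)
    )
    return disagreements / (n * (n - 1) / 2)
```

Note that only the partition matters, not the particular segment labels: relabeling the segments of either segmentation leaves the result unchanged.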
The segmentation of af\ufb01nity\ngraph #1 incurs a huge Rand index penalty relative to the ground truth. A single wrongly classi\ufb01ed\nedge of the af\ufb01nity graph leads to an incorrect merger of two segments, causing many pairs of\nimage pixels to be wrongly assigned to the same segment. On the other hand, the segmentation\ncorresponding to af\ufb01nity graph #2 has a perfect Rand index, even though there are misclassi\ufb01cations\nin the af\ufb01nity graph. In short, the Rand index makes sense because it strongly penalizes errors in the\naf\ufb01nity graph that lead to split and merger errors.\n\n3\n\n\f1\n\n2\n\n3\n\n4\n\n1\u2019\n\n2\u2019\n\n3\u2019\n\n4\u2019\n\ngroundtruth\n\ntest\n\n1\n1\u2019\n\nmerger\n\n2\n\n33333\n2\u201932\u2019332\u201933332\u2019333\n3\n\nsplit\n\n3\u2019\n\n4\n4\n4\n\n4\u2019\n\nrand index\n\nFigure 2: The Rand index quanti\ufb01es segmentation performance by comparing the difference in\npixel pair connectivity between the groundtruth and test segmentations. Pixel pair connectiv-\nities can be visualized as symmetric binary block-diagonal matrices (cid:31) (si, sj). Each diagonal block\ncorresponds to connected pixel pairs belonging to one of the image segments. The Rand index incurs\npenalties when pixels pairs that must not be connected are connected or vice versa. This corresponds\nto locations where the two matrices disagree. An erroneous merger of two groundtruth segments in-\ncurs a penalty proportional to the product of the sizes of the two segments. Split errors are similarly\npenalized.\n\n4 Connectivity and maximin af\ufb01nity\n\nRecall that our segmentation algorithm works by \ufb01nding connected components of the thresholded\naf\ufb01nity graph. Let \u02c6S be the segmentation produced in this way. To apply the Rand index to train\nour classi\ufb01er, we need a simple way of relating the indicator function (cid:31) (\u02c6si, \u02c6sj) in the Rand index\nto classi\ufb01er output. 
In other words, we would like a way of characterizing whether two pixels are connected in the thresholded affinity graph.\nTo do this, we introduce the concept of maximin affinity, which is defined for any pair of pixels in an affinity graph (the definition is generally applicable to any weighted graph). Let A_{kl} be the affinity of pixels k and l. Let \mathcal{P}_{ij} be the set of all paths in the graph that connect pixels i and j. For every path P in \mathcal{P}_{ij}, there is an edge (or edges) with minimal affinity. This is written as \min_{\langle k,l \rangle \in P} A_{kl}, where \langle k,l \rangle \in P means that the edge between pixels k and l is in the path P.\nA maximin path P^*_{ij} is a path between pixels i and j that maximizes the minimal affinity,\n\nP^*_{ij} = \arg\max_{P \in \mathcal{P}_{ij}} \min_{\langle k,l \rangle \in P} A_{kl}    (2)\n\nThe maximin affinity of pixels i and j is the affinity of the maximin edge, or the minimal affinity of the maximin path,\n\nA^*_{ij} = \max_{P \in \mathcal{P}_{ij}} \min_{\langle k,l \rangle \in P} A_{kl}    (3)\n\nWe are now ready for a trivial but important theorem.\nTheorem 1. A pair of pixels is connected in the thresholded affinity graph if and only if their maximin affinity exceeds the threshold value.\n\nProof. By definition, a pixel pair is connected in the thresholded affinity graph if and only if there exists a path between them. Such a path is equivalent to a path in the unthresholded affinity graph for which the minimal affinity is above the threshold value. This path in turn exists if and only if the maximin affinity is above the threshold value.\n\nAs a consequence of this theorem, pixel pairs can be classified as connected or disconnected by thresholding maximin affinities. Let \hat{S} be the segmentation produced by thresholding the affinity graph A_{ij} and then finding connected components. 
Then the connectivity indicator function is\n\n\delta(\hat{s}_i, \hat{s}_j) = H(A^*_{ij} - \theta)    (4)\n\nwhere H is the Heaviside step function.\nMaximin affinities can be computed efficiently using minimum spanning tree algorithms [8]. A maximum spanning tree is equivalent to a minimum spanning tree, up to a sign change of the weights. Any path in a maximum spanning tree is a maximin path. For our nearest neighbor affinity graphs, the maximin affinity of a pixel pair can be computed in O(|E| \u00b7 \u03b1(|V|)) time, where |E| is the number of graph edges, |V| is the number of pixels, and \u03b1(\u00b7) is the inverse Ackermann function, which grows sub-logarithmically. The full matrix A^*_{ij} can be computed in time O(|V|^2), since the computation can be shared. Note that maximin affinities are required for training, but not testing. For segmenting the image at test time, only a connected components computation need be performed, which takes time linear in the number of edges |E|.\n\n5 Optimizing the Rand index by learning maximin affinities\n\nSince the affinities and maximin affinities are both functions of the image I and the classifier parameters W, we will write them as A_{ij}(I; W) and A^*_{ij}(I; W), respectively. By Eq. (4) of the previous section, the Rand index of Eq. (1) takes the form\n\n1 - RI(S, I; W) = \binom{N}{2}^{-1} \sum_{i<j} |\delta(s_i, s_j) - H(A^*_{ij}(I; W) - \theta)|\n\nSince this is a discontinuous function of the maximin affinities, we make the usual relaxation by replacing |\delta(s_i, s_j) - H(A^*_{ij}(I; W) - \theta)| with a continuous loss function l(\delta(s_i, s_j), A^*_{ij}(I; W)). Any standard loss, such as the square loss \frac{1}{2}(x - \hat{x})^2 or the hinge loss, can be used for l(x, \hat{x}). 
Thus we obtain a cost function suitable for gradient learning,\n\nE(S, I; W) = \binom{N}{2}^{-1} \sum_{i<j} l(\delta(s_i, s_j), A^*_{ij}(I; W)) = \binom{N}{2}^{-1} \sum_{i<j} l(\delta(s_i, s_j), \max_{P \in \mathcal{P}_{ij}} \min_{\langle k,l \rangle \in P} A_{kl}(I; W))    (5)\n\nThe max and min operations are continuous and differentiable (though not continuously differentiable). If the loss function l is smooth, and the affinity A_{kl}(I; W) is a smooth function, then the gradient of the cost function is well-defined, and gradient descent can be used as an optimization method.\nDefine (k, l) = mm(i, j) to be the maximin edge for the pixel pair (i, j). If there is a tie, choose between the maximin edges at random. Then the cost function takes the form\n\nE(S, I; W) = \binom{N}{2}^{-1} \sum_{i<j} l(\delta(s_i, s_j), A_{mm(i,j)}(I; W))\n\nIt's instructive to compare this with the cost function for standard affinity learning\n\nE_{standard}(S, I; W) = \frac{2}{cN} \sum_{\langle i,j \rangle} l(\delta(s_i, s_j), A_{ij}(I; W))\n\nwhere the sum is over all nearest neighbor pixel pairs \langle i, j \rangle and c is the number of nearest neighbors [9]. In contrast, the sum in the MALIS cost function is over all pairs of pixels, whether or not they are adjacent in the affinity graph. Note that a single edge can be the maximin edge for multiple pairs of pixels, so its affinity can appear multiple times in the MALIS cost function. 
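The maximin affinities appearing in this cost function can be computed for all pixel pairs in one Kruskal-style sweep, following the maximum spanning tree property discussed in Section 4. A small union-find sketch (our illustration; assumes the graph is given as a dict of edge affinities):

```python
def maximin_affinities(edges, num_pixels):
    """All-pairs maximin affinities. Processing edges in decreasing
    affinity order, the affinity at which two pixels first fall into a
    common component is exactly their maximin affinity."""
    parent = list(range(num_pixels))
    members = {i: [i] for i in range(num_pixels)}
    mm = {}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), a in sorted(edges.items(), key=lambda e: -e[1]):
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # i and j already connected through higher affinities
        # Every pair straddling the two components first connects through
        # this edge, so it is their maximin edge.
        for u in members[ri]:
            for v in members[rj]:
                mm[min(u, v), max(u, v)] = a
        parent[ri] = rj
        members[rj] += members.pop(ri)
    return mm
```

Since each of the N-choose-2 pixel pairs is visited exactly once, at the merge that first connects it, the sweep realizes the shared O(|V|^2) computation mentioned in Section 4.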
Roughly speaking, the MALIS cost function is similar to the standard cost function, except that each edge in the affinity graph is weighted by the number of pixel pairs that it causes to be incorrectly classified.\n\n6 Online stochastic gradient descent\n\nComputing the cost function or its gradient requires finding the maximin edges for all pixel pairs. Such a batch computation could be used for gradient learning. However, online stochastic gradient learning is often more efficient than batch learning [13]. Online learning makes a gradient update of the parameters after each pair of pixels, and is implemented as described in the box.\n\nMaximin affinity learning\n1. Pick a random pair of (not necessarily nearest neighbor) pixels i and j from a randomly drawn training image I.\n2. Find a maximin edge mm(i, j).\n3. Make the gradient update: W \u2190 W + \u03b7 (d/dW) l(\delta(s_i, s_j), A_{mm(i,j)}(I; W))\n\nStandard affinity learning\n1. Pick a random pair of nearest neighbor pixels i and j from a randomly drawn training image I.\n2. Make the gradient update: W \u2190 W + \u03b7 (d/dW) l(\delta(s_i, s_j), A_{ij}(I; W))\n\nFor comparison, we also show the standard affinity learning [9]. For each iteration, both learning methods pick a random pair of pixels from a random image. Both compute the gradient of the weight of a single edge in the affinity graph. However, the standard method picks a nearest neighbor pixel pair and trains the affinity of the edge between them. The maximin method picks a pixel pair of arbitrary separation and trains the minimal affinity on a maximin path between them.\nEffectively, our connected components performs spatial integration over the nearest neighbor affinity graph to make connectivity decisions about pixel pairs at large distances. MALIS trains these global decisions, while standard affinity learning trains only local decisions. 
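To make the maximin learning procedure concrete, here is a toy version of one online update, with the convolutional network replaced by a hypothetical one-parameter logistic affinity model on a 1-d chain of pixels; on a chain the maximin path between two pixels is unique, so the maximin edge is simply their lowest-affinity intervening edge:

```python
import math

def malis_step_chain(w, features, i, j, target, lr=0.5):
    """One online MALIS update on a chain of pixels (a sketch, not the
    paper's network). The affinity of edge k is sigmoid(w * features[k]);
    'target' is the ground-truth connectivity indicator for pixels i < j."""
    def aff(k):
        return 1.0 / (1.0 + math.exp(-w * features[k]))

    # mm(i, j): the minimal-affinity edge on the unique path from i to j.
    k = min(range(i, j), key=aff)

    # Gradient descent on the square loss l(t, a) = 0.5 * (t - a)^2:
    # dl/dw = -(t - a) * a * (1 - a) * features[k]
    a = aff(k)
    grad = -(target - a) * a * (1.0 - a) * features[k]
    return w - lr * grad, k
```

For instance, with edge features [1.0, -2.0, 1.0] and w = 1.0, the pixel pair (0, 3) trains edge 1, the weakest affinity on the path between them; if the pair belongs to the same ground-truth segment, the update moves w in the direction that raises that affinity.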
MALIS is superior because it truly learns segmentation, but this superiority comes at a price. The maximin computation requires that on each iteration the affinity graph be computed for the whole image. Therefore it is slower than the standard learning method, which requires only a local affinity prediction for the edge being trained. Thus there is a computational price to be paid for the optimization of a true segmentation error.\n\n7 Application to electron microscopic images of neurons\n\n7.1 Electron microscopic images of neural tissue\n\nBy 3d imaging of brain tissue at sufficiently high resolution, as well as identifying synapses and tracing all axons and dendrites in these images, it is possible in principle to reconstruct connectomes, complete \u201cwiring diagrams\u201d for a brain or piece of brain [19, 4, 21]. Axons can be narrower than 100 nm in diameter, necessitating the use of electron microscopy (EM) [19]. At such high spatial resolution, just one cubic millimeter of brain tissue yields teravoxel scale image sizes. Recent advances in automation are making it possible to collect such images [19, 4, 21], but image analysis remains a challenge. Tracing axons and dendrites is a very large-scale image segmentation problem requiring high accuracy. The images used for this study were from the inner plexiform layer of the rabbit retina, and were taken using Serial Block-Face Scanning Electron Microscopy [5]. Two large image volumes of 100^3 voxels were hand segmented and reserved for training and testing purposes.\n\n7.2 Training convolutional networks for affinity classification\n\nAny classifier that is a smooth function of its parameters can be used for maximin affinity learning. We have used convolutional networks (CN), but our method is not restricted to this choice. 
Convolutional networks have previously been shown to be effective for similar EM images of brain tissue [11].\nWe trained two identical four-layer CNs, one with standard affinity learning and the second with MALIS. The CNs contained 5 feature maps in each layer with sigmoid nonlinearities. All filters in the CN were 5 \u00d7 5 \u00d7 5 in size. This led to an affinity classifier that uses a 17 \u00d7 17 \u00d7 17 cubic image patch to classify an affinity edge. We used the square-square loss function l(x, \hat{x}) = x \u00b7 max(0, 1 - \hat{x} - m)^2 + (1 - x) \u00b7 max(0, \hat{x} - m)^2, with a margin m = 0.3.\nAs noted earlier, maximin affinity learning can be significantly slower than standard affinity learning, due to the need for computing the entire affinity graph on each iteration, while standard affinity training need only predict the weight of a single edge in the graph. For this reason, we constructed a proxy training image dataset by picking all possible 21 \u00d7 21 \u00d7 21 sized overlapping sub-images from the original training set. Since each 21 \u00d7 21 \u00d7 21 sub-image is smaller than the original image, the size of the affinity graph needed to be predicted for the sub-image is significantly smaller, leading to faster training. A consequence of this approximation is that the maximum separation between image pixel pairs chosen for training is less than about 20 pixels. A second means of speeding up the maximin procedure is by pretraining the maximin CN for 500,000 iterations using the fast standard affinity classification cost function. At the end, both CNs were trained for a total of 1,000,000 iterations, by which point the training error plateaued.\n\n7.3 Maximin learning leads to dramatic improvement in segmentation performance\n\nFigure 3: Quantification of segmentation performance on 3d electron microscopic images of neural tissue. 
A) Clustering accuracy measuring the number of correctly classified pixel pairs. B) and C) ROC curve and precision-recall quantification of pixel-pair connectivity classification show near perfect performance. D) Segmentation error as measured by the number of splits and mergers.\n\nWe benchmarked the performance of the standard and maximin affinity classifiers by measuring the pixel-pair connectivity classification performance using the Rand index. After training the standard and MALIS affinity classifiers, we generated affinity graphs for the training and test images. In principle, the training algorithm suggests a single threshold for the graph partitioning. In practice, one can generate a full spectrum of segmentations leading from over-segmentations to under-segmentations by varying the threshold parameter. In Fig. 3, we plot the Rand index for segmentations resulting from a range of threshold values.\nIn images with large numbers of segments, most pixel pairs will be disconnected from one another, leading to a large imbalance between the number of connected and disconnected pixel pairs. This is reflected in the fact that the Rand index is over 95% for both segmentation algorithms. While this imbalance between positive and negative examples is not a significant problem for training the affinity classifier, it can make comparisons between classifiers difficult to interpret. Instead, we can use the ROC and precision-recall methodologies, which provide for accurate quantification of the accuracy of classifiers even in the presence of large class imbalance. From these curves, we observe that our maximin affinity classifier dramatically outperforms the standard affinity classifier.\nOur positive results have an intriguing interpretation. 
The poor performance of the connected components when applied to a standard learned affinity classifier could be interpreted to imply that 1) a local classifier lacks the context important for good affinity prediction; 2) connected components is a poor strategy for image segmentation, since mistakes in the affinity prediction of just a few edges can merge or split segments. On the contrary, our experiments suggest that when trained properly, thresholded affinity classification followed by connected components can be an extremely competitive method of image segmentation.\n\n8 Discussion\n\nIn this paper, we have trained an affinity classifier to produce affinity graphs that result in excellent segmentations when partitioned by the simple graph partitioning algorithm of thresholding followed by connected components. The key to good performance is the training of a segmentation-based cost function, and the use of a powerful trainable classifier to predict affinity graphs. Once trained, our segmentation algorithm is fast. In contrast to classic graph-based segmentation algorithms where\n\nFigure 4: A 2d cross-section through a 3d segmentation of the test image. The maximin segmentation correctly segments several objects which are merged in the standard segmentation, and even correctly segments objects which are missing in the groundtruth segmentation. Not all segments merged in the standard segmentation are merged at locations visible in this cross section. 
Pixels colored black in the machine segmentations correspond to pixels completely disconnected from their neighbors and represent boundary regions.\n\nthe partitioning phase dominates, our partitioning algorithm is simple and can partition graphs in time linearly proportional to the number of edges in the graph. We also do not require any prior knowledge of the number of image segments or image segment sizes at test time, in contrast to other graph partitioning algorithms [7, 20].\nThe formalism of maximin affinities used to derive our learning algorithm has connections to single-linkage hierarchical clustering, minimum spanning trees and ultrametric distances. Felzenszwalb and Huttenlocher [7] describe a graph partitioning algorithm based on a minimum spanning tree computation which resembles our segmentation algorithm, in part. The Ultrametric Contour Map algorithm [1] generates hierarchical segmentations nearly identical to those generated by varying the threshold of our graph partitioning algorithm. Neither of these methods incorporates a means for learning from labeled data, but our work shows how the performance of these algorithms can be improved by use of our maximin affinity learning.\n\nAcknowledgements\n\nSCT and HSS were supported in part by the Howard Hughes Medical Institute and the Gatsby Charitable Foundation.\n\nReferences\n[1] P. Arbelaez. Boundary extraction in natural images using ultrametric contour maps. Proc. POCV, 2006.\n[2] F. Bach and M. Jordan. Learning spectral clustering, with application to speech separation. The Journal of Machine Learning Research, 7:1963\u20132001, 2006.\n[3] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222\u20131239, 2001.\n[4] K. L. Briggman and W. Denk. Towards neural circuit reconstruction with volume electron microscopy techniques. 
Curr Opin Neurobiol, 16(5):562–70, 2006.

[5] W. Denk and H. Horstmann. Serial block-face scanning electron microscopy to reconstruct three-dimensional tissue nanostructure. PLoS Biol, 2(11):e329, 2004.

[6] P. Dollár, Z. Tu, and S. Belongie. Supervised learning of edges and object boundaries. In CVPR, June 2006.

[7] P. Felzenszwalb and D. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.

[8] B. Fischer, V. Roth, and J. Buhmann. Clustering with the connectivity kernel. In Advances in Neural Information Processing Systems 16, 2004.

[9] C. Fowlkes, D. Martin, and J. Malik. Learning affinity functions for image segmentation: combining patch-based and gradient-based approaches. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, 2003.

[10] X. He, R. Zemel, and M. Carreira-Perpinan. Multiscale conditional random fields for image labeling. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, 2004.

[11] V. Jain, J. Murray, F. Roth, S. Turaga, V. Zhigulin, K. Briggman, M. Helmstaedter, W. Denk, and H. Seung. Supervised learning of image restoration with convolutional networks. In ICCV, 2007.

[12] S. Kumar and M. Hebert. Discriminative random fields: a discriminative framework for contextual interaction in classification. In Proc. Ninth IEEE International Conference on Computer Vision, pages 1150–1157, 2003.

[13] Y. LeCun, L. Bottou, G. Orr, and K. Müller. Efficient backprop. Lecture Notes in Computer Science, pages 9–50, 1998.

[14] M. Maire, P. Arbelaez, C. Fowlkes, and J. Malik. Using contours to detect and localize junctions in natural images.
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.

[15] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. Eighth Int'l Conf. Computer Vision, volume 2, pages 416–423, 2001.

[16] D. R. Martin, C. C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5):530–549, May 2004.

[17] M. Meila and J. Shi. Learning segmentation by random walks. In Advances in Neural Information Processing Systems, pages 873–879, 2001.

[18] W. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, pages 846–850, 1971.

[19] H. Seung. Reading the book of memory: sparse sampling versus dense mapping of connectomes. Neuron, 62(1):17–29, 2009.

[20] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[21] S. J. Smith. Circuit reconstruction tools today. Curr Opin Neurobiol, 17(5):601–608, Oct 2007.

[22] Z. Tu, X. Chen, A. Yuille, and S. Zhu. Image parsing: unifying segmentation, detection, and recognition. International Journal of Computer Vision, 63(2):113–140, 2005.

[23] R. Unnikrishnan, C. Pantofaru, and M. Hebert. Toward objective evaluation of image segmentation algorithms.
IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 929–944, 2007.
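The two graph computations at the heart of the discussion above can be sketched in a few lines: partitioning by thresholding the affinity graph and taking connected components, and computing the maximin affinity of a pixel pair, i.e. the maximum over paths of the minimum affinity along the path. The maximin affinity equals the affinity of the edge that first connects the pair when edges are added in decreasing affinity order, so a Kruskal-style sweep with union-find computes it directly. This is a minimal illustration on a toy graph, not the paper's implementation; the function names and the example graph are ours.

```python
def connected_components(n_nodes, edges, affinities, threshold):
    """Segment by keeping edges with affinity above threshold, then labeling
    connected components with union-find; roughly linear in the edge count."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (u, v), a in zip(edges, affinities):
        if a > threshold:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
    return [find(x) for x in range(n_nodes)]  # segment label per node


def maximin_affinity(n_nodes, edges, affinities, u, v):
    """Maximin affinity of a distinct pixel pair (u, v): the affinity of the
    edge whose addition first connects u and v when edges are processed in
    decreasing affinity order (equivalently, the min edge on the maximin path)."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, (p, q) in sorted(zip(affinities, edges), reverse=True):
        rp, rq = find(p), find(q)
        if rp != rq:
            parent[rp] = rq
        if find(u) == find(v):
            return a
    return None  # u and v lie in different connected components
```

On a 4-pixel chain with edge affinities 0.9, 0.2, 0.8, thresholding at 0.5 yields the segments {0, 1} and {2, 3}, while the maximin affinity of the pair (0, 3) is 0.2, the weakest edge on the only path between them; this is the quantity the learning algorithm drives above or below threshold according to the desired pixel-pair connectivity.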