{"title": "Learning the Local Statistics of Optical Flow", "book": "Advances in Neural Information Processing Systems", "page_first": 2373, "page_last": 2381, "abstract": "Motivated by recent progress in natural image statistics, we use newly available datasets with ground truth optical flow to learn the local statistics of optical flow and rigorously compare the learned model to prior models assumed by computer vision optical flow algorithms. We find that a Gaussian mixture model with 64 components provides a significantly better model for local flow statistics when compared to commonly used models. We investigate the source of the GMM's success and show it is related to an explicit representation of flow boundaries. We also learn a model that jointly models the local intensity pattern and the local optical flow. In accordance with the assumptions often made in computer vision, the model learns that flow boundaries are more likely at intensity boundaries. However, when evaluated on a large dataset, this dependency is very weak and the benefit of conditioning flow estimation on the local intensity pattern is marginal.", "full_text": "Learning the Local Statistics of Optical Flow\n\nDan Rosenbaum1, Daniel Zoran2, Yair Weiss1,2\n1 CSE, 2 ELSC, Hebrew University of Jerusalem\n{danrsm,daniez,yweiss}@cs.huji.ac.il\n\nAbstract\n\nMotivated by recent progress in natural image statistics, we use newly available\ndatasets with ground truth optical \ufb02ow to learn the local statistics of optical \ufb02ow\nand compare the learned models to prior models assumed by computer vision\nresearchers. We \ufb01nd that a Gaussian mixture model (GMM) with 64 components\nprovides a signi\ufb01cantly better model for local \ufb02ow statistics when compared to\ncommonly used models. We investigate the source of the GMM\u2019s success and\nshow it is related to an explicit representation of \ufb02ow boundaries. 
We also learn\na model that jointly models the local intensity pattern and the local optical \ufb02ow.\nIn accordance with the assumptions often made in computer vision, the model\nlearns that \ufb02ow boundaries are more likely at intensity boundaries. However,\nwhen evaluated on a large dataset, this dependency is very weak and the bene\ufb01t of\nconditioning \ufb02ow estimation on the local intensity pattern is marginal.\n\n1 Introduction\n\nFigure 1 (panels: Sintel MPI, KITTI): Samples of frames and \ufb02ows from new \ufb02ow databases. We leverage these newly available\nresources to learn the statistics of optical \ufb02ow and compare them to the assumptions used by computer\nvision researchers.\n\nThe study of natural image statistics is a longstanding research topic with both scienti\ufb01c and\nengineering interest. Recent progress in this \ufb01eld has been achieved by approaches that systematically\ncompare different models of natural images with respect to numerical criteria such as log likelihood\non held-out data or coding ef\ufb01ciency [1, 10, 14]. Interestingly, the best models in terms of log\nlikelihood, when used as priors in image restoration tasks, also yield state-of-the-art performance [14].\nMany problems in computer vision require good priors. A notable example is the computation of\noptical \ufb02ow: a vector at every pixel that corresponds to the two dimensional projection of the motion\nat that pixel. Since local motion information is often ambiguous, nearly all optical \ufb02ow estimation\nalgorithms work by minimizing a cost function that has two terms: a local data term and a \u201cprior\u201d\nterm (see, e.g., [13, 11] for some recent reviews).\nGiven the success in image restoration tasks, where learned priors give state-of-the-art performance,\none might expect a similar story in optical \ufb02ow estimation. 
However, with the notable exception\nof [9] (which served as a motivating example for this work and is discussed below) there have been\nvery few attempts to learn priors for optical \ufb02ow by modeling local statistics. Instead, the\nstate-of-the-art methods still use priors that were formulated by computer vision researchers. In fact, two\nof the top performing methods in modern optical \ufb02ow benchmarks use a hand-de\ufb01ned smoothness\nconstraint that was suggested over 20 years ago [6, 2].\nOne big difference between image statistics and \ufb02ow statistics is the availability of ground truth\ndata. Whereas for modeling image statistics one merely needs a collection of photographs (so that\nthe amount of data is essentially unlimited these days), for modeling \ufb02ow statistics one needs to\nobtain the ground truth motion of the points in the scene. In the past, the lack of\nground truth data did not allow for learning an optical \ufb02ow prior from examples. In the last two\nyears, however, two ground truth datasets have become available. The Sintel dataset (\ufb01gure 1)\nconsists of roughly a thousand pairs of frames from a highly realistic computer graphics \ufb01lm with a wide\nvariety of locations and motion types. Although it is synthetic, the work in [3] convincingly shows\nthat both in terms of image statistics and in terms of \ufb02ow statistics, the synthetic frames are highly\nsimilar to real scenes. The KITTI dataset (\ufb01gure 1) consists of frames taken from a vehicle driving\nin a European city [5]. The vehicle was equipped with accurate range \ufb01nders as well as accurate\nlocalization of its own motion, and the combination of these two sources allows computing optical\n\ufb02ow for points that are stationary in the world. 
Although this is real data, it is sparse (only about\n50% of the pixels have ground truth \ufb02ow).\nIn this paper we leverage the availability of ground truth datasets to learn explicit statistical models\nof optical \ufb02ow. We compare our learned model to the assumptions made by computer vision\nalgorithms for estimating \ufb02ow. We \ufb01nd that a Gaussian mixture model with 64 components provides a\nsigni\ufb01cantly better model for local \ufb02ow statistics when compared to commonly used models. We\ninvestigate the source of the GMM\u2019s success and show that it is related to an explicit representation\nof \ufb02ow boundaries. We also learn a model that jointly models the local intensity pattern and the\nlocal optical \ufb02ow. In accordance with the assumptions often made in computer vision, the model\nlearns that \ufb02ow boundaries are more likely at intensity boundaries. However, when evaluated on a\nlarge dataset, this dependency is very weak and the bene\ufb01t of conditioning \ufb02ow estimation on the\nlocal intensity pattern is marginal.\n\n1.1 Priors for optical \ufb02ow\n\nOne of the earliest methods for optical \ufb02ow that is still used in applications is the celebrated\nLucas-Kanade algorithm [7]. It overcomes the local ambiguity of motion analysis by assuming that the\noptical \ufb02ow is constant within a small image patch and \ufb01nds this constant motion by least-squares\nestimation. Another early algorithm that is still widely used is the method of Horn and Schunck [6].\nIt \ufb01nds the optical \ufb02ow by minimizing a cost function that has a data term and a \u201csmoothness\u201d term.\nDenoting by u the horizontal \ufb02ow and v the vertical \ufb02ow, the smoothness term is of the form:\n\n$$J_{HS} = \\sum_{x,y} \\left( u_x^2 + u_y^2 + v_x^2 + v_y^2 \\right)$$\n\nwhere $u_x$, $u_y$ are the spatial derivatives of the horizontal \ufb02ow u and $v_x$, $v_y$ are the spatial derivatives\nof the vertical \ufb02ow v. 
When combined with modern optimization methods, this algorithm is often\namong the top performing methods on modern benchmarks [11, 5].\nRather than using a quadratic smoothness term, many authors have advocated using more robust\nterms that would be less sensitive to outliers in smoothness. Thus the Black and Anandan [2]\nalgorithm uses:\n\n$$J_{BA} = \\sum_{x,y} \\rho(u_x) + \\rho(u_y) + \\rho(v_x) + \\rho(v_y)$$\n\nwhere \u03c1(t) is a function that grows more slowly than a quadratic. Popular choices for \u03c1 include the\nLorentzian, the truncated quadratic and the absolute value \u03c1(x) = |x| (or a differentiable\napproximation to it, $\\rho(x) = \\sqrt{\\epsilon + x^2}$) [11]. Both the Lorentzian and the absolute value robust smoothness\nterms were shown to outperform quadratic smoothness in [11] and the absolute value was the better\nof the two robust terms.\nSeveral authors have also suggested that the smoothness term be based on the local intensity pattern,\nsince motion discontinuities are more likely to occur at intensity boundaries. Ren [8] modi\ufb01ed\nthe weights in the Lucas and Kanade least-squares estimation so that pixels that are on different\nsides of an intensity boundary will get lower weights. In the context of Horn and Schunck, several\nauthors have suggested applying weights to the horizontal and vertical \ufb02ow derivatives, where the weights have\nan inverse relationship with the image derivatives: large image derivatives lead to low weight in the\n\ufb02ow smoothness (see [13] and references therein for different variations on this idea). Perhaps the\nsimplest such regularizer is of the form:\n\n$$J_{HSI} = \\sum_{x,y} w(I_x)(u_x^2 + v_x^2) + w(I_y)(u_y^2 + v_y^2) \\qquad (1)$$\n\nAs we discuss below, this prior can be seen as a Gaussian prior on the \ufb02ow that is conditioned on\nthe intensity.\nIn contrast to all the previously discussed priors, Roth and Black [9] suggested learning a prior from\na dataset. 
They used a training set of optical \ufb02ow obtained by simulating the motion of a camera in\nnatural range images. The prior learned by their system was similar to a robust smoothness prior,\nexcept that the \ufb01lters were not local derivatives but rather more random-looking high pass \ufb01lters. They did not\nobserve a signi\ufb01cant improvement in performance when using these \ufb01lters, and standard derivative\n\ufb01lters are still used in most smoothness based methods.\nGiven the large number of suggested priors, a natural question to ask is: what is the best prior to use?\nOne way to answer this question is to use these priors as a basis for an optical \ufb02ow estimation\nalgorithm and see which algorithm gives the best performance. Although such an approach is certainly\ninformative, it is dif\ufb01cult to get a de\ufb01nitive answer using it. For example, Sun et al. [11] reported that\nadding a non-local smoothness term to a robust smoothness prior signi\ufb01cantly improved results on\nthe Middlebury benchmark, while Geiger et al. [5] reported that this term decreased performance on\nthe KITTI benchmark. Perhaps the main dif\ufb01culty with this approach is that the prior is only one part of\nan optical \ufb02ow estimation algorithm. It is always combined with a non-convex likelihood term and\noptimized using a nonlinear optimization algorithm. Often the parameters of the optimization have\na very large in\ufb02uence on the performance of the algorithm.\nIn this paper we take an alternative approach. Motivated by recent advances in natural image\nstatistics and the availability of new datasets, we compare different priors in terms of (1) log likelihood\non held-out data and (2) inference performance with tractable posteriors. 
Our results allow us to\nrigorously compare different prior assumptions.\n\n2 Comparing priors as density models\n\nIn order to compare different prior models as density models, we generate a training set and test\nset of optical \ufb02ow patches from the ground truth databases. We denote by f a single vector that\nconcatenates all the optical \ufb02ow in a patch (e.g. if we consider 8 \u00d7 8 patches, f is a vector of length\n128 where the \ufb01rst 64 components denote u and the last 64 components denote v). Given a prior\nprobability model Pr(f ; \u03b8) we use the training set to estimate the free parameters of the model \u03b8 and\nthen we measure the log likelihood of held out patches from the test set.\nFrom Sintel, we divided the pairs of frames for which ground truth is available into 708 pairs which\nwe used for training and 333 pairs which we used for testing. The data is divided into scenes and we\nmade sure that different scenes are used in training and testing. We created a second test set from\nthe KITTI dataset by choosing a subset of patches for which full ground truth \ufb02ow was available.\nSince we only consider full patches, this set is smaller and hence we use it only for testing, not for\ntraining.\nThe priors we compared are:\n\n\u2022 Lucas and Kanade. This algorithm is equivalent to the assumption that the observed \ufb02ow is\ngenerated by a constant (u0, v0) that is corrupted by IID Gaussian noise. If we also assume\nthat u0, v0 have a zero mean Gaussian distribution, Pr(f ) is a zero mean multidimensional\nGaussian with covariance given by $\\sigma_p^2 O O^T + \\sigma_n^2 I$, where O is a binary 128 \u00d7 2 matrix,\n$\\sigma_p$ is the standard deviation of u0, v0 and $\\sigma_n$ is the standard deviation of the noise.\n\u2022 Horn and Schunck. 
By exponentiating $J_{HS}$ we see that Pr(f ; \u03b8) is a multidimensional\nGaussian with inverse covariance (precision) matrix $\\lambda D^T D$, where D is a 256 \u00d7 128 derivative matrix that\ncomputes the derivatives of the \ufb02ow \ufb01eld at each pixel and \u03bb is the weight given to the\nprior relative to the data term. This matrix is not positive de\ufb01nite, so we use\n$\\lambda D^T D + \\epsilon I$ and determine \u03bb, \u03b5 using maximum likelihood.\n\u2022 L1. We exponentiate $J_{BA}$ and obtain a multidimensional Laplace distribution. As in Horn\nand Schunck, this distribution is not normalizable so we multiply it by an IID Laplacian\nprior on each component with variance 1/\u03b5. This again gives two free parameters (\u03bb, \u03b5)\nwhich we \ufb01nd using maximum likelihood. Unlike the Gaussian case, the\nML parameters and the normalization constant cannot be computed in closed form, and we use\nHamiltonian Annealed Importance Sampling [10].\n\n\u2022 Gaussian Mixture Models (GMM). Motivated by the success of GMMs in modeling natural\nimage statistics [14] we use the training set to estimate GMM priors for optical \ufb02ow. Each\nmixture component is a multidimensional Gaussian with full covariance matrix and zero\nmean and we vary the number of components between 1 and 64. We train the GMM using\nthe standard Expectation-Maximization (EM) algorithm with mini-batches. Even with a\nfew mixture components, the GMM has far more free parameters than the previous models\nbut note that we are measuring success on held out patches so that models that over\ufb01t\nshould be penalized.\n\nA summary of our results is shown in \ufb01gure 2 where we show the mean log likelihood on the\nSintel test set. One interesting thing that can be seen is that the local statistics validate some\nassumptions commonly used by computer vision researchers. 
For example, the Horn and Schunck\nsmoothness prior is as good as the optimal Gaussian prior (GMM1) even though it uses only local \ufb01rst\nderivatives. Also, the robust prior (L1) is much better than Horn and Schunck. However, as the\nnumber of Gaussians increases the GMM is signi\ufb01cantly better than a robust prior on local derivatives.\nA closer inspection of our results is shown in \ufb01gure 3. Each panel shows the histogram of log\nlikelihood of held out patches: the more shifted the histogram is to the right, the better the performance.\nIt can be seen that the GMM is indeed much better than the other priors, including cases where the\ntest set is taken from KITTI (rather than Sintel) and when the patch size is 12 \u00d7 12 rather than 8 \u00d7 8.\n\nFigure 2: Mean log likelihood of the different models for 8 \u00d7 8 patches extracted from held out data\nfrom Sintel. The GMM outperforms the models that are assumed by computer vision researchers.\n\n2.1 Comparing models using tractable inference\n\nA second way of comparing the models is by their ability to restore corrupted patches of optical\n\ufb02ow. We are not claiming that optical \ufb02ow restoration is a real-world application (although using\npriors to \u201c\ufb01ll in\u201d holes in optical \ufb02ow is quite common, e.g. [12, 8]). Rather, we use it because\nfor the models we are discussing the inference can either be done in closed form or using convex\noptimization, so we would expect that better priors will lead to better performance.\nWe perform two \ufb02ow restoration tasks. In \u201c\ufb02ow denoising\u201d we take the ground truth \ufb02ow and add\nIID Gaussian noise to all \ufb02ow vectors. 
In \u201c\ufb02ow inpainting\u201d we add a small amount of noise to all\n\ufb02ow vectors and a very large amount of noise to some of the \ufb02ow vectors (essentially meaning that\nthese \ufb02ow vectors are not observed). For the Gaussian models and the GMM models the Bayesian\nLeast Squares (BLS) estimator of f given the corrupted patch y can be computed in closed form. For the Laplacian\nmodel, we use MAP estimation which leads to a convex optimization problem. Since MAP may be\nsuboptimal for this case, we optimize the parameters \u03bb, \u03b5 for MAP inference performance.\n\n[Figure 2 plot: mean log-likelihood for the models LK, HS, L1, GMM1, GMM2, GMM4, GMM8, GMM16 and GMM64.]\n\nFigure 3 (columns: Sintel, KITTI; rows: 8 \u00d7 8 patches, 12 \u00d7 12 patches): Histograms of log-likelihood of different models on the KITTI and Sintel test sets with\ntwo different patch sizes. As can be seen, the GMM outperforms other models in all four cases.\n\nResults are shown in \ufb01gures 4 and 5. The standard deviation of the ground truth \ufb02ow is approximately\n11.6 pixels and we add noise with standard deviations of 10, 20 and 30 pixels. Consistent with the\nlog likelihood results, L1 outperforms the Gaussian methods but is outperformed by the GMM. For\nsmall noise values the difference between L1 and the GMM is small, but as the amount of noise\nincreases L1 becomes similar in performance to the Gaussian methods and is much worse than the\nGMM.\n\n3 The secret of the GMM\n\nWe now take a deeper look at how the GMM models optical \ufb02ow patches. The \ufb01rst (and not\nsurprising) thing we found is that the covariance matrices learned by the model are block diagonal (so that\nthe u and v components are independent given the assignment to a particular component).\nMore insight can be gained by considering the GMM as a local subspace model: a patch which\nis generated by component k is generated as a linear combination of the eigenvectors of the kth\ncovariance. 
The coef\ufb01cients of the linear combination have energy that decays with the eigenvalue,\nso each patch can be well approximated by the leading eigenvectors of the corresponding covariance.\nUnlike global subspace models, different subspace models can be used for different patches, and\nduring inference with the model one can infer which local subspace is most likely to have generated\nthe patch.\nFigure 6 shows the leading eigenvectors of all 32 covariance matrices in the GMM32\nmodel: the eigenvectors of u are followed by the eigenvectors of v. The number of eigenvectors\ndisplayed in each row is set so that they capture 99% of the variance in that component. The rows\nare organized by decreasing mixing weight. The right hand half of each row shows (u,v) patches\nthat are sampled from that Gaussian.\n\n[Figure 3 plots: log(fraction of patches) versus log-likelihood for LK, HS, L1 and GMM64.]\n\nFigure 4 (panels: denoising with \u03c3 = 10, 20 and 30; inpainting with 2 \u00d7 2, 4 \u00d7 4 and 6 \u00d7 6 holes): Denoising with different noise values and inpainting with different hole sizes.\n\nFigure 5: Visualizing denoising performance (\u03c3 = 30).\n\nIt can be seen that the \ufb01rst 10 or so components model very smooth \ufb02ow (in fact the samples\nappear to be completely \ufb02at). A closer examination of the eigenvalues shows that these ten\ncomponents correspond to smooth motions of different speeds. 
This can also be seen by comparing the\nv samples on the top row, which are close to gray, with those in the next two rows, which are much\ncloser to black or white (since the models are zero mean, black and white are equally likely for any\ncomponent).\nAs can be seen in the \ufb01gure, almost all the energy in the \ufb01rst components is captured by uniform\nmotions. Thus these components amount to a non-local smoothness assumption similar to\nthe one suggested in [11]: they not only assume that derivatives are small but they assume that the\nentire 8 \u00d7 8 patch is constant. However, unlike the suggestion in [11] to enforce non-local smoothness\nby applying a median \ufb01lter at all pixels, the GMM only applies non-local smoothness at a subset of\npatches that are inferred to be generated by such components.\nAs we go down in the \ufb01gure towards rarer components, we see that the components no longer\nmodel \ufb02at patches but rather motion boundaries. This can be seen both in the samples (right half of\neach row) and also in the leading eigenvectors (shown on the left) which each control one side of a\nboundary. For example, the bottom row of the \ufb01gure illustrates a component that seems to generate\nprimarily diagonal motion boundaries.\nInterestingly, such local subspace models of optical \ufb02ow have also been suggested by Fleet et al. [4].\nThey used synthetic models of moving occlusion boundaries and bars to learn linear subspace\nmodels of the \ufb02ow. The GMM seems to support their intuition that learning separate linear subspace\nmodels for \ufb02at vs. motion boundary patches is a good idea. However, unlike the work of Fleet et al.,\nthe separation into \u201c\ufb02at\u201d vs. 
\u201cmotion boundary\u201d was learned in an unsupervised fashion directly from\nthe data.\n\n[Figure 4 plots: log(fraction of patches) versus PSNR for LK, HS, L1 and GMM64.]\n\nFigure 6 (left: leading eigenvectors of u and v; right: patch samples of u and v): The eigenvectors and samples of the GMM components. The GMM is better because it\nexplicitly models edges and \ufb02at patches separately.\n\n4 A joint model for optical \ufb02ow and intensity\n\nAs mentioned in the introduction, many authors have suggested modifying the smoothness\nassumption by conditioning it on the local intensity pattern and giving a higher penalty for motion\ndiscontinuities in the absence of intensity discontinuities. We therefore ask: does conditioning on the local\nintensity give better log likelihood on held out \ufb02ow patches? Does it give better performance in\ntractable inference tasks?\nWe evaluated two \ufb02ow models that are conditioned on the local intensity pattern. The \ufb01rst one is a\nconditional Gaussian (eq. 1) with exponential weights, i.e. $w(I_x) = \\exp(-I_x^2/\\sigma^2)$, where the variance\nparameter $\\sigma^2$ is optimized to maximize performance. The second one is a Gaussian mixture model\nthat simultaneously models both intensity and \ufb02ow.\nThe simultaneous GMM we use includes a 200 component GMM to model the intensity together\nwith a 64 component GMM to model the \ufb02ow. We allow a dependence between the hidden variable\nof the intensity GMM and that of the \ufb02ow GMM. 
This is equivalent to a hidden Markov model\n(HMM) with 2 hidden variables: one represents the intensity component and one represents the\n\ufb02ow component (\ufb01gure 7). We learn the HMM using the EM algorithm.\nInitialization is given by independent GMMs learned for the intensity (we actually use the one learned by [14], which is\navailable on their website) and for the \ufb02ow. The intensity GMM is not changed during the learning.\nConditioned on the intensity pattern, the \ufb02ow distribution is still a GMM with 64 components (as in\nthe previous section) but the mixing weights depend on the intensity.\nGiven these two conditional models, we now ask: will the conditional models give better\nperformance than the unconditional ones? The answer, shown in \ufb01gure 7, was surprising (to us).\nConditioning on the intensity gives essentially no improvement in log likelihood and a slight improvement\nin \ufb02ow denoising only for very large amounts of noise. Note that for all models shown in this \ufb01gure,\nthe denoised estimate is the Bayesian Least Squares (BLS) estimate, and is optimal given the learned\nmodels.\nTo investigate this effect, we examine the transition matrix between the intensity components and\nthe \ufb02ow components (\ufb01gure 8). If intensity and \ufb02ow were independent, we would expect all rows\nof the transition matrix to be the same. If an intensity boundary always led to a \ufb02ow boundary,\nwe would expect the bottom rows of the matrix to have only one nonzero element. By examining\nthe learned transition matrix we \ufb01nd that while there is a dependency structure, it is not very strong.\nRegardless of whether the intensity component corresponds to a boundary or not, the most likely\n\ufb02ow components are \ufb02at. When there is an intensity boundary, the \ufb02ow boundary in the same\norientation becomes more likely. 
However, even though it is more likely than in the unconditioned\ncase, it is still less likely than the \ufb02at components.\nTo rule out the possibility that this effect is due to a local optimum found by EM, we conducted additional\nexperiments whereby the emission probabilities were held \ufb01xed to the GMMs learned independently for\nintensity and \ufb02ow and each patch in the training set was assigned one intensity and one \ufb02ow\ncomponent. We then estimated the joint distribution over intensity and \ufb02ow components by simply counting\nthe relative frequency in the training set. The results were nearly identical to those found by EM.\nIn summary, while our learned model supports the standard intuition that motion boundaries are\nmore likely at intensity boundaries, it suggests that when dealing with a large dataset with high\nvariability, there is very little bene\ufb01t (if any) in conditioning \ufb02ow models on the local intensity.\n\nFigure 7 (panels: hidden Markov model; likelihood; denoising with \u03c3 = 90): The hidden Markov model we use to jointly model intensity and \ufb02ow. Both log likelihood\nand inference evaluations show almost no improvement from conditioning \ufb02ow on intensity.\n\nFigure 8 (panel labels: unconditional mixing weights, intensity, conditional mixing weights): Left: the transition matrix learned by the HMM. Right: comparing rows of the matrix\nto the unconditional mixing weights. Conditioned on an intensity boundary, motion boundaries\nbecome more likely but are still less likely than a \ufb02at motion.\n\n5 Discussion\n\nOptical \ufb02ow has been an active area of research for over 30 years in computer vision, with many\nmethods based on assumed priors over \ufb02ow \ufb01elds. In this paper, we have leveraged the availability\nof large ground truth databases to learn priors from data and compare our learned models to the\nassumptions typically made by computer vision researchers. 
We \ufb01nd that many of the assumptions\nare actually supported by the statistics (e.g. the Horn and Schunck model is close to the\noptimal Gaussian model, robust models are better, intensity discontinuities make motion discontinuities\nmore likely). However, a learned GMM model with 64 components signi\ufb01cantly outperforms the\nstandard models used in computer vision, primarily because it explicitly distinguishes between \ufb02at\npatches and boundary patches and then uses a different form of nonlocal smoothness for the different\ncases.\n\nAcknowledgments\n\nSupported by the Israeli Science Foundation, Intel ICRI-CI and the Gatsby Foundation.\n\n[Figure 7 plots: log(fraction of patches) versus log-likelihood and versus PSNR for HS, HSI, GMM and HMM. Figure 8 transition matrix axes: h_flow and h_intensity.]\n\nReferences\n\n[1] M. Bethge. Factorial coding of natural images: how effective are linear models in removing\nhigher-order dependencies? 23(6):1253\u20131268, June 2006.\n\n[2] Michael J. Black and P. Anandan. A framework for the robust estimation of optical \ufb02ow. In\nICCV, pages 231\u2013236, 1993.\n\n[3] Daniel J. Butler, Jonas Wulff, Garrett B. Stanley, and Michael J. Black. A naturalistic open\nsource movie for optical \ufb02ow evaluation. In ECCV (6), pages 611\u2013625, 2012.\n\n[4] David J. Fleet, Michael J. Black, Yaser Yacoob, and Allan D. Jepson. Design and use of linear\nmodels for image motion analysis. International Journal of Computer Vision, 36(3):171\u2013193,\n2000.\n\n[5] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The\nKITTI vision benchmark suite. In CVPR, pages 3354\u20133361, 2012.\n\n[6] Berthold KP Horn and Brian G Schunck. Determining optical \ufb02ow. Arti\ufb01cial Intelligence,\n17(1):185\u2013203, 1981.\n\n[7] Bruce D Lucas, Takeo Kanade, et al. 
An iterative image registration technique with an\napplication to stereo vision. In Proceedings of the 7th international joint conference on Arti\ufb01cial\nintelligence, 1981.\n\n[8] Xiaofeng Ren. Local grouping for optical \ufb02ow. In CVPR, 2008.\n\n[9] Stefan Roth and Michael J. Black. On the spatial statistics of optical \ufb02ow. International\nJournal of Computer Vision, 74(1):33\u201350, 2007.\n\n[10] J Sohl-Dickstein and BJ Culpepper. Hamiltonian annealed importance sampling for partition\nfunction estimation. 2011.\n\n[11] Deqing Sun, Stefan Roth, and Michael J Black. Secrets of optical \ufb02ow estimation and their\nprinciples. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on,\npages 2432\u20132439. IEEE, 2010.\n\n[12] Li Xu, Zhenlong Dai, and Jiaya Jia. Scale invariant optical \ufb02ow. In Computer Vision\u2013ECCV\n2012, pages 385\u2013399. Springer, 2012.\n\n[13] Henning Zimmer, Andr\u00e9s Bruhn, and Joachim Weickert. Optic \ufb02ow in harmony. International\nJournal of Computer Vision, 93(3):368\u2013388, 2011.\n\n[14] Daniel Zoran and Yair Weiss. Natural images, gaussian mixtures and dead leaves. In NIPS,\npages 1745\u20131753, 2012.\n", "award": [], "sourceid": 1130, "authors": [{"given_name": "Dan", "family_name": "Rosenbaum", "institution": "Hebrew University"}, {"given_name": "Daniel", "family_name": "Zoran", "institution": "Hebrew University"}, {"given_name": "Yair", "family_name": "Weiss", "institution": "Hebrew University"}]}