{"title": "Practical Deep Stereo (PDS): Toward applications-friendly deep stereo matching", "book": "Advances in Neural Information Processing Systems", "page_first": 5871, "page_last": 5881, "abstract": "End-to-end deep-learning networks recently demonstrated extremely good performance for stereo matching. However, existing networks are difficult to use for practical applications since (1) they are memory-hungry and unable to process even modest-size images, (2) they have to be fully re-trained to handle a different disparity range.\n\nThe Practical Deep Stereo (PDS) network that we propose addresses both issues: First, its architecture relies on novel bottleneck modules that drastically reduce the memory footprint in inference, and additional design choices allow to handle greater image size during training. This results in a model that leverages large image context to resolve matching ambiguities. Second, a novel sub-pixel cross-entropy loss combined with a MAP estimator make this network less sensitive to ambiguous matches, and applicable to any disparity range without re-training.\n\nWe compare PDS to state-of-the-art methods published over the recent months, and demonstrate its superior performance on FlyingThings3D and KITTI sets.", "full_text": "Practical Deep Stereo (PDS): Toward\n\napplications-friendly deep stereo matching.\n\nStepan Tulyakov\n\nSpace Engineering Center at\n\nAnton Ivanov\n\nSpace Engineering Center at\n\n\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne\n\nstepan.tulyakov@epfl.ch\n\n\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne\n\nanton.ivanov@epfl.ch\n\nFrancois Fleuret\n\n\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne\n\nand Idiap Research Institute\n\nfrancois.fleuret@idiap.ch\n\nAbstract\n\nEnd-to-end deep-learning networks recently demonstrated extremely good perfor-\nmance for stereo matching. 
However, existing networks are difficult to use for practical applications since (1) they are memory-hungry and unable to process even modest-size images, (2) they have to be trained for a given disparity range.\nThe Practical Deep Stereo (PDS) network that we propose addresses both issues: First, its architecture relies on novel bottleneck modules that drastically reduce the memory footprint in inference, and additional design choices allow it to handle greater image sizes during training. This results in a model that leverages large image context to resolve matching ambiguities. Second, a novel sub-pixel cross-entropy loss combined with a MAP estimator makes this network less sensitive to ambiguous matches, and applicable to any disparity range without re-training.\nWe compare PDS to state-of-the-art methods published over the recent months, and demonstrate its superior performance on the FlyingThings3D and KITTI sets.\n\n1 Introduction\n\nStereo matching consists in matching every point from an image taken from one viewpoint to its physically corresponding one in the image taken from another viewpoint. The problem has applications in robotics [22], medical imaging [23], remote sensing [32], virtual reality, 3D graphics, and computational photography [37, 1].\nRecent developments in the field have focused on stereo for hard / uncontrolled environments (wide-baseline, low-lighting, complex lighting, blurry, foggy, non-Lambertian) [36, 11, 3, 5, 27], on the usage of high-order priors and cues [9, 8, 14, 17, 34], and on data-driven methods, in particular those based on deep neural networks [25, 3, 39, 40, 19, 33, 30, 16, 31, 7, 13, 20, 24, 2, 18, 43]. This work improves on this latter line of research.\nThe first successes of neural networks for stereo matching were achieved by substituting hand-crafted similarity measures with deep metrics [3, 39, 40, 19, 33] inside a legacy stereo pipeline handling the post-processing (often [21]). 
Besides deep metrics, neural networks were also used in other subtasks such as predicting a smoothness penalty in a CRF model from a local intensity pattern [30, 16]. In [31] a “global disparity” network smooths the matching cost volume and predicts matching confidences, and in [7] a network detects and fixes incorrect disparities.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nTable 1: Number of parameters, inference memory footprint, 3-pixel error (3PE) and mean absolute error (MAE) on FlyingThings3D (960 × 540 with 192 disparities). DispNetCorr1D [20], CRL [24], iResNet-i2 [18] and LRCR [12] predict disparities as classes and are consequently over-parameterized. GC [13] omits an explicit correlation step, which results in a large memory usage during inference. Our PDS has a small number of parameters and a small memory footprint, the smallest 3PE, and the smallest or second smallest MAE depending on the evaluation protocol, and it is the only method able to handle different disparity ranges without re-training. Note that for our method we report two results. The result outside of brackets is obtained using the protocol of the PSM [2] method, according to which the errors are calculated only for ground-truth pixels with disparity < 192. The result in brackets is calculated according to the protocol of the CRL [24], DispNetCorr1D [20] and iResNet-i2 [18] methods, according to which the error is calculated only for images where less than 25% of pixels have disparity > 300, as explained in [24]. Inference memory footprints are our theoretical estimates based on the network structures and do not include the memory required for storing the networks' parameters (the real memory footprint will depend on the implementation). 
Error rates and numbers of parameters are taken from the respective publications.\n\nMethod | Params [M] | Memory [GB] | 3PE [%] | MAE [px] | Modify. disp.\nPDS (proposed) | 2.2 | 0.4 | 3.38 (2.89) | 1.12 (0.87) | yes\nPSM [2] | 5.2 | 0.6 | n/a | 1.09 | no\nCRL [24] | 78 | 0.2 | 6.20 | 1.32 | no\niResNet-i2 [18] | 43 | 0.2 | 4.57 | 1.40 | no\nDispNetCorr1D [20] | 42 | 0.1 | n/a | 1.68 | no\nLRCR [12] | 30 | 9.0 | 8.67 | 2.02 | no\nGC [13] | 3.5 | 4.5 | 9.34 | 2.02 | no\n\nEnd-to-end deep stereo. Recent works attempt to solve stereo matching using neural networks trained end-to-end, without post-processing [4, 20, 13, 43, 24, 12, 18, 2]. Such a network is typically a pipeline composed of embedding, matching, regularization and refinement modules:\nThe embedding module produces image descriptors for the left and right images, and the (non-parametric) matching module performs an explicit correlation between shifted descriptors to compute a cost volume for every disparity [4, 20, 24, 12, 18]. This matching module may be absent, and the concatenated left-right descriptors directly fed to the regularization module [13, 2, 43]. This strategy uses more context, but the deep network implementing such a module has a larger memory footprint, as shown in Table 1. In this work we reduce the memory use without sacrificing accuracy by introducing a matching module that compresses the concatenated left-right image descriptors into compact matching signatures.\nThe regularization module takes the cost volume, or the concatenation of descriptors, regularizes it, and outputs either disparities [20, 4, 24, 18] or a distribution over disparities [13, 43, 12, 2]. 
In the latter case, sub-pixel disparities can be computed as a weighted average with the SoftArgmin, which is sensitive to erroneous minor modes in the inferred distribution.\nThis regularization module is usually implemented as an hourglass deep network with shortcut connections between the contracting and the expanding parts [20, 4, 24, 13, 43, 2, 18]. In some models [20, 4, 24, 18] it is composed of 2D convolutions and does not treat all disparities symmetrically, which makes the network over-parameterized and prohibits changing the disparity range without modifying its structure and re-training. Alternatively, it can use 3D convolutions that treat all disparities symmetrically [13, 43, 12, 2]. As a consequence these networks have fewer parameters, but their disparity range is still non-adjustable without re-training, due to the SoftArgmin, as we show in § 3.3. In this work, we propose a novel sub-pixel MAP approximation for inference which computes a mean around the disparity with minimum matching cost. It is more robust to erroneous modes in the distribution and allows modifying the disparity range without re-training.\nFinally, some methods [24, 18, 12] also have a refinement module, which refines the initial low-resolution disparity relying on an attention map computed as the left-right warping error. The training of end-to-end networks is usually performed in a fully supervised manner (except for [43]).\n\nAll the described methods [4, 20, 13, 43, 24, 12, 18, 2] use modest-size image patches during training. In this work, we show that training on full-size images boosts the network's ability to utilize a large context and improves its accuracy. Also, these methods, even the ones producing a disparity distribution, rely on the L1 loss, since it allows training the network to produce sub-pixel disparities. 
We instead propose to use a more “natural” sub-pixel cross-entropy loss that ensures faster convergence and better accuracy.\nOur contributions can be summarized as follows:\n\n1. We decrease the memory footprint by introducing a novel bottleneck matching module. It compresses the concatenated left-right image descriptors into compact matching signatures, which are then concatenated and fed to the hourglass network we use as regularization module, instead of the concatenated descriptors themselves as in [13, 2]. The reduced memory footprint makes it possible to process larger images and to train on full-size images, which boosts the network's ability to utilize a large context.\n\n2. Instead of computing the posterior mean of the disparity and training with an L1 penalty [2, 12, 43, 13], we propose for inference a sub-pixel MAP approximation that computes an expectation around the disparity with minimum matching cost, which is robust to erroneous modes in the disparity distribution and allows modifying the disparity range without re-training. For training we similarly introduce a sub-pixel criterion by combining the standard cross-entropy with a kernel interpolation, which provides faster convergence and higher accuracy.\n\nIn the experimental section, we validate our contributions. In § 3.2 we show how the reduced memory footprint allows training on full-size images and leveraging large image contexts to improve performance. 
In § 3.3 we demonstrate that, thanks to the proposed sub-pixel MAP and cross-entropy, we are able to modify the disparity range without re-training, and to improve the matching accuracy. Then, in § 3.4 we compare our method to state-of-the-art baselines and show that it has the smallest 3-pixel error (3PE) and the smallest or second smallest mean absolute error (MAE) on the FlyingThings3D set, depending on the evaluation protocol, and ranks third and fourth on the KITTI'15 and KITTI'12 sets respectively.\n\nFigure 1: Network structure and processing flow during training and inference. Input / output quantities are outlined with thin lines, while processing modules are drawn with thick ones. Following the vocabulary introduced in § 1, the yellow shapes are the embedding modules, the red rectangle the matching module, and the blue shape the regularization module. The matching module is a contribution of our work, as in previous methods [13, 2] the left and shifted right descriptors are directly fed to the regularization module (hourglass network). Note that the concatenated compact matching signature tensor is a 4D tensor, represented here as 3D by combining the feature indexes and disparities along the vertical axis.\n\n2 Method\n\n2.1 Network structure\n\nOur network takes as input the left and right color images {x^L, x^R} of size W × H and produces a “cost tensor” C = Net(x^L, x^R | Θ, D) of size D/2 × W × H, where Θ are the model's parameters, and D ∈ N is the maximum disparity.\nThe computed cost tensor is such that C_{k,i,j} is the cost of matching the pixel x^L_{i,j} in the left image to the pixel x^R_{i−2k,j} in the right image, which is equivalent to assigning the disparity d_{i,j} = 2k to the left image pixel.\nThis cost tensor C can then be converted into an a posteriori probability tensor as\n\nP(d | x^L, x^R) = softmax_k(−C_{k,i,j}).\n\nThe overall structure of the network and the processing flow during training and inference are shown in Figure 1, and we can summarize for clarity the input/output to and from each of the modules:\n• The embedding module takes as input a color image 3 × W × H, and computes an image descriptor 64 × W/4 × H/4.\n• The matching module takes as input, for each disparity d, a left and a (shifted) right image descriptor, both 64 × W/4 × H/4, and computes a compact matching signature 8 × W/4 × H/4. This module is unique to our network and is described in detail in § 2.2.\n• The regularization module is an hourglass 3D convolutional neural network with shortcut connections between the contracting and the expanding parts. It takes a tensor composed of the concatenated compact matching signatures for all disparities, of size 8 × D/4 × W/4 × H/4, and computes a matching cost tensor C of size D/2 × W × H.\n\nAdditional information such as convolution filter sizes or channel numbers is provided in the Supplementary materials.\nAccording to the taxonomy in [28] all traditional stereo matching methods consist of (1) matching cost computation, (2) cost aggregation, (3) optimization, and (4) disparity refinement steps. In the
In the\nproposed network, the embedding and the matching modules are roughly responsible for the step\n(1) and the regularization module for the steps (2-4).\nBesides the matching module, there are several other design choices that reduce test and training\nmemory footprint of our network. In contrast to [13] we use aggressive four-times sub-sampling\nin the embedding module, and the hourglass DNN we use for regularization module produces\nprobabilities only for even disparities. Also, after each convolution and transposed convolution in our\nnetwork we place Instance Normalization (IN) [35] instead of Batch Normalization (BN), since we\nuse individual full-size images during training.\n\n2.2 Matching module\n\nThe core of state-of-the-art methods [13, 43, 12, 2] is the 3D convolutions Hourglass network used\nas regularization module, that takes as input a tensor composed of concatenated left-right image\ndescriptor for all possible disparity values. The size of this tensor makes such networks have a huge\nmemory footprint during inference.\nWe decrease the memory usage by implementing a novel matching with a DNN with a \u201cbottleneck\u201d\narchitecture. This module compresses the concatenated left-right image descriptors into a compact\nmatching signature for each disparity, and the results is then concatenated and fed to the Hourglass\nmodule. This contrasts with existing methods, which directly feed the concatenated descriptors [13,\n43, 12, 2] to the Hourglass regularization module. 
For example, while in [13] the authors feed a 64-channel 3D tensor to the regularization network, we feed an 8-channel tensor and reach a similar accuracy. Reducing the memory footprint allows processing a larger area during inference, and consequently using a larger context to estimate the disparity, which resolves ambiguities and translates directly into better performance.\nThis module is inspired by CRL [24] and DispNetCorr1D [20], which control the memory footprint (as shown in Table 1) by feeding correlation results instead of concatenated embeddings to the hourglass network, and by [38], which shows the superior performance of a joint left-right image embedding. We also borrowed some ideas from the bottleneck module in ResNet [10], since it also encourages compressed intermediate representations.\n\nFigure 2: Comparison of the proposed sub-pixel MAP with the standard SoftArgmin: (a) in the presence of a multi-modal distribution the SoftArgmin blends all the modes and produces an incorrect disparity estimate; (b) when the disparity range is extended (blue area), the SoftArgmin estimate may degrade due to additional modes.\n\nFigure 3: The target distribution of the sub-pixel cross-entropy is a discretized Laplace distribution centered at the sub-pixel ground-truth disparity.\n\n2.3 Sub-pixel MAP\n\nIn state-of-the-art methods, a network produces a posterior disparity distribution and then uses a SoftArgmin module [13, 43, 12, 2], introduced in [13], to compute the predicted sub-pixel disparity as an expectation of this distribution¹:\n\nd̂ = Σ_d d · P(d = d | x^L, x^R)\n\nThis SoftArgmin approximates a sub-pixel maximum a posteriori (MAP) solution when the distribution is unimodal and symmetric. 
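A minimal numerical illustration of why these assumptions matter, in numpy with a made-up cost curve (not the paper's code); the windowed estimator follows the sub-pixel MAP of Eq. (1) below with its δ = 4 setting:

```python
import numpy as np

def softargmin(posterior):
    """Disparity as the expectation of the posterior (the SoftArgmin)."""
    d = np.arange(len(posterior))
    return float((d * posterior).sum())

def subpixel_map(cost, delta=4):
    """Expectation restricted to a window of half-width `delta` around the
    disparity with minimum matching cost, i.e. around the major mode."""
    d = np.arange(len(cost))
    d_hat = int(np.argmin(cost))
    win = np.abs(d - d_hat) <= delta
    p = np.exp(-(cost[win] - cost[win].min()))   # softmax over the window
    p /= p.sum()
    return float((d[win] * p).sum())

# A bimodal cost curve: a major mode at disparity 20, a spurious one at 60.
cost = np.full(80, 10.0)
cost[18:23] = [6.0, 3.0, 0.0, 3.0, 6.0]   # major mode, symmetric around 20
cost[59:62] = [5.0, 2.0, 5.0]             # erroneous minor mode at 60

posterior = np.exp(-cost) / np.exp(-cost).sum()
blended = softargmin(posterior)   # pulled toward 60, away from both modes
robust = subpixel_map(cost)       # stays at the major mode, 20.0
```

Appending further disparities to `cost` (a larger range) only changes `blended`, not `robust`, as long as the major mode keeps the minimum cost, which is what makes the range adjustable without re-training.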
However, as illustrated in Figure 2, this strategy suffers from two key weaknesses: First, when these assumptions are not fulfilled, for instance if the posterior is multi-modal, this averaging blends the modes and produces a disparity estimate far from all of them. Second, if we want to apply the model to a greater disparity range without re-training, the estimate may degrade even more due to additional modes.\nThe authors of [13] argue that when the network is trained with the SoftArgmin, it adapts to it during learning by rescaling its output values to make the distribution unimodal. However, the network learns this rescaling only for the disparity range used during training. If we decide to change the disparity range at test time, we have to re-train the network.\nTo address both of these drawbacks, we propose to use for inference a sub-pixel MAP approximation that computes an expectation around the disparity with minimum matching cost as\n\nd̃ = Σ_d d · P(d = d | x^L, x^R), where P(d | x^L, x^R) = softmax_{d : |d̂ − d| ≤ δ}(−C_{d,x,y}) and d̂ = argmin_d C_{d,x,y}    (1)\n\nwith δ a meta-parameter (in our experiments we chose δ = 4 based on a small-scale grid search on the validation set). The approximation works under the assumption that the distribution is symmetric in the vicinity of the major mode.\nIn contrast to the SoftArgmin, the proposed sub-pixel MAP is used only for inference. During training we use the posterior disparity distribution and the sub-pixel cross-entropy loss discussed in the next section.\n\n2.4 Sub-pixel cross-entropy\n\nMany methods use the L1 loss [2, 12, 43, 13], even though the “natural” choice for these classification-by-design networks, which produce a distribution over discrete disparity values, is the cross-entropy. 
The L1 loss is often selected because it empirically performs better than the cross-entropy [13], and because, when combined with the SoftArgmin, it allows training a network with sub-pixel ground truth.\n\n¹The name SoftArgmin comes from the fact that the function computes the disparity of the match with the minimum matching cost in a “soft” way. The matching cost, unlike a likelihood, is small for correct matches and large for incorrect ones. The SoftArgmin can, however, also be interpreted as the expectation of a probability distribution over disparities.\n\nIn this work, we propose a novel sub-pixel cross-entropy loss that provides faster convergence and better accuracy. The target distribution of our cross-entropy loss is a discretized Laplace distribution centered at the ground-truth disparity d_gt, shown in Figure 3 and computed as\n\nQ_gt(d) = (1/N) exp(−|d − d_gt| / b), where N = Σ_i exp(−|i − d_gt| / b),\n\nwhere b is the diversity of the Laplace distribution (in our experiments we set b = 2, reasoning that the distribution should cover at least several discrete disparities). With this target distribution we compute the cross-entropy as usual\n\nL(Θ) = −Σ_d Q_gt(d) · log P(d = d | x^L, x^R, Θ).    (2)\n\nThe proposed sub-pixel cross-entropy is different from the soft cross-entropy of [19], since in our case the probability at each discrete location of the target distribution is a smooth function of the distance to the sub-pixel ground truth. This allows training the network to produce a distribution from which we can compute sub-pixel disparities using our sub-pixel MAP.\n\n3 Experiments\n\nOur experiments are done with the PyTorch framework [26]. 
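For concreteness, the sub-pixel cross-entropy of § 2.4 used in these experiments can be sketched in a few lines of numpy (an illustration with b = 2 as in the paper; the disparity grid size and ground-truth value are arbitrary, and the network's log-probabilities are stand-ins):

```python
import numpy as np

def laplace_target(d_gt, num_disp, b=2.0):
    """Discretized Laplace distribution centered at the sub-pixel
    ground-truth disparity d_gt: the target of the sub-pixel cross-entropy."""
    d = np.arange(num_disp)
    q = np.exp(-np.abs(d - d_gt) / b)
    return q / q.sum()

def subpixel_cross_entropy(log_p, d_gt, b=2.0):
    """Cross-entropy between the Laplace target and the predicted
    log-probabilities, written as a loss to minimize."""
    q = laplace_target(d_gt, len(log_p), b)
    return float(-(q * log_p).sum())

# The target peaks at the discrete disparities nearest the sub-pixel ground
# truth, so the network is supervised with sub-pixel precision even though
# it outputs a distribution over integer disparities.
q = laplace_target(d_gt=21.3, num_disp=64)
```

By Gibbs' inequality the loss is minimized when the predicted distribution equals the target, which is what lets the sub-pixel MAP of § 2.3 recover sub-pixel disparities from the trained network.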
We initialize the weights and biases of the network using the default PyTorch initialization, and train the network as shown in Table 2. During training we normalize the training patches to zero mean and unit variance. The optimization is performed with the RMSprop method with standard settings.\n\nTable 2: Summary of training settings for every dataset.\n\nSetting | FlyingThings3D | KITTI\nMode | from scratch | fine-tune\nLr. schedule | 0.01 for 120k, halved every 20k | 0.005 for 50k, halved every 20k\nIter. # | 160k | 100k\nTr. image size | 960 × 540 (full-size) | 1164 × 330\nMax disparity | 255 | 255\nAugmentation | not used | mixUp [42], anisotropic zoom, random crop\n\nWe guarantee the reproducibility of all experiments in this section by using only publicly available datasets, and by making our code available online under an open-source license after publication.\n\n3.1 Datasets and performance measures\n\nWe used three datasets for our experiments: KITTI'12 [6] and KITTI'15 [22], which we combined into a single KITTI set, and FlyingThings3D [20], summarized in Table 3. The KITTI'12 and KITTI'15 sets have online scoreboards [15].\nThe FlyingThings3D set suffers from two problems: (1) as noticed in [24, 42], some images have very large (up to 10^3) or negative disparities; (2) some images are rendered with black-dot artifacts. For training we use only the images without artifacts and with disparities ∈ [0, 255].\nTo deal with this problem, in some previous publications the authors process the test set using the ground truth that is used for benchmarking. Such pre-processing may consist of ignoring pixels with disparity > 192 [2], or of discarding images where more than 25% of the pixels have disparity > 300 [24, 18, 20]. For the sake of a fair comparison we computed the error using both protocols during the benchmarking on the FlyingThings3D set. 
In all other experiments we use the unaltered test set.\nWe make validation sets by withholding 500 images from the FlyingThings3D training set, and 58 from the KITTI training set, respectively.\nWe measure the performance of the network using two standard measures: (1) the 3-pixel error (3PE), which is the percentage of pixels for which the predicted disparity is off by more than 3 pixels, and (2) the mean absolute error (MAE), the average absolute difference between the predicted disparity and the ground truth. Note that 3PE and MAE are complementary: 3PE is an error measure robust to outliers, while MAE accounts for sub-pixel error.\n\nTable 3: Datasets used for experiments. During benchmarking, we follow previous works and use a maximum disparity that is different from the absolute maximum for the dataset, given in parentheses.\n\nDataset | Test # | Train # | Size | Max disp. | Ground truth | Web score\nKITTI | 395 | 395 | 1226 × 370 | 192 (230) | sparse, ≤ 3 px | yes\nFlyingThings3D | 4370 | 25756 | 960 × 540 | 192 (6773) | dense, unknown | no\n\nTable 4: Error of the proposed PDS network on the FlyingThings3D test set as a function of training patch size. The network trained on full-size images (last row) outperforms the network trained on small image patches. Note that in this experiment we used the SoftArgmin with the L1 loss during training.\n\nTrain size | Test size | 3PE [%] | MAE [px]\n512 × 256 | 512 × 256 | 8.63 | 4.18\n512 × 256 | 960 × 540 | 5.28 | 3.55\n960 × 540 | 960 × 540 | 4.50 | 3.40\n\n3.2 Training on full-size images\n\nIn this section we show the effectiveness of training on full-size images. For that we train our network till convergence on the FlyingThings3D dataset with the L1 loss and SoftArgmin twice: the first time we use 512 × 256 training patches randomly cropped from the training images as in [13, 2], and the second time we use full-size 960 × 540 training images. 
Note that the latter is possible thanks to the small memory footprint of our network.\nAs seen in Table 4, the network trained on small patches performs better on larger than on smaller test images. This suggests that even a network that has not seen full-size images during training can utilize a larger context. As expected, the network trained on full-size images makes better use of the said context, and performs significantly better.\n\n3.3 Sub-pixel MAP and cross-entropy\n\nFigure 4: Example of disparity estimation errors with the SoftArgmin and sub-pixel MAP on the FlyingThings3D set. The first column shows the input image, the second the ground-truth disparity, the third the SoftArgmin estimate, and the fourth the sub-pixel MAP estimate. Note that the SoftArgmin estimate, though wrong, is closer to the ground truth than the sub-pixel MAP estimate. This can explain the larger MAE of the sub-pixel MAP estimate.\n\nFigure 5: Comparison of the convergence speed on the FlyingThings3D set with the sub-pixel cross-entropy and L1 losses. With the proposed sub-pixel cross-entropy loss (blue) the network converges faster. Note that the error is computed on the validation set, containing 500 examples.\n\nIn this section, we first show the advantages of the sub-pixel MAP over the SoftArgmin. We train our PDS network till convergence on FlyingThings3D with the SoftArgmin, the L1 loss and full-size training images, and then test it twice: the first time with the SoftArgmin for inference, and the second time with our sub-pixel MAP for inference instead.\nAs shown in Table 5, the substitution leads to a reduction of the 3PE and a slight increase of the MAE. 
The latter probably happens because in the erroneous areas the SoftArgmin estimates are wrong but nevertheless closer to the ground truth, since the SoftArgmin blends all the distribution modes, as shown in Figure 4.\n\nTable 5: Performance of the sub-pixel MAP estimator and cross-entropy loss on the FlyingThings3D set. Note that: (1) if we substitute the SoftArgmin with the sub-pixel MAP during the test we get a lower 3PE and a similar MAE; (2) if we double the disparity range, the MAE and 3PE of the network with the sub-pixel MAP barely change, while the errors of the network with the SoftArgmin increase; (3) if we train the network with the sub-pixel cross-entropy it has a much lower 3PE and only a slightly worse MAE.\n\nLoss | Estimator | 3PE [%] | MAE [px]\nStandard disparity range ∈ [0, 255]\nL1 + SoftArgmin | SoftArgmin | 4.50 | 3.40\nL1 + SoftArgmin | Sub-pixel MAP | 4.22 | 3.42\nSub-pixel cross-entropy | Sub-pixel MAP | 3.80 | 3.63\nIncreased disparity range ∈ [0, 511]\nL1 + SoftArgmin | SoftArgmin | 5.20 | 3.81\nL1 + SoftArgmin | Sub-pixel MAP | 4.27 | 3.53\n\nWhen we test the same network with the disparity range increased from 255 to 511 pixels, the performance of the network with the SoftArgmin plummets, while the performance of the network with the sub-pixel MAP remains almost the same, as shown in Table 5. This shows that with the sub-pixel MAP we can modify the disparity range of the network on-the-fly, without re-training.\nNext, we train the network with the sub-pixel cross-entropy loss and compare it to the network trained with the SoftArgmin and the L1 loss. As shown in Table 5, the former network has a much smaller 3PE and only a slightly larger MAE. The convergence with the sub-pixel cross-entropy is also much faster than with the L1 loss, as shown in Figure 5. 
Interestingly, [13] also reports faster convergence with a one-hot cross-entropy than with the L1 loss, but contrary to our results, they found that the L1 loss provided a smaller 3PE.\n\n3.4 Benchmarking\n\nIn this section we show the effectiveness of our method compared to the state-of-the-art methods. For KITTI, we computed disparity maps for the test sets with withheld ground truth, and uploaded the results to the evaluation web site. For the FlyingThings3D set we evaluated the performance on the test set ourselves, following the protocol of [2] as explained in § 3.1.\nFlyingThings3D benchmarking results are shown in Table 1. Notably, the method we propose has the lowest 3PE according to both evaluation protocols, and the lowest or second lowest MAE depending on the protocol. Moreover, in contrast to the other methods, our method has a small memory footprint and few parameters, and it allows changing the disparity range without re-training.\nKITTI'12 and KITTI'15 benchmarking results are shown in Table 6. 
The method we propose ranks third on the KITTI'15 set and fourth on the KITTI'12 set, taking into account the state-of-the-art results published in recent months or not yet officially published: the iResNet-i2 [18], PSMNet [2] and LRCR [12] methods.\n\n4 Conclusion\n\nIn this work we addressed two issues precluding the use of deep networks for stereo matching in many practical situations in spite of their excellent accuracy: their large memory footprint, and their inability to adjust to a different disparity range without complete re-training.\nWe showed that by carefully revising the conventionally used network architecture to control the memory footprint and adapt the network analytically to the disparity range, and by using a new loss and estimator to cope with multi-modal posteriors and sub-pixel accuracy, it is possible to resolve these practical issues and reach state-of-the-art performance.\n\nTable 6: KITTI'15 (top) and KITTI'12 (bottom) snapshots from 15/05/2018 with the top-10 methods, including those published in recent months or not yet officially published: iResNet-i2 [18], PSMNet [2] and LRCR [12]. 
Our method is 3rd on the KITTI'15 and 4th on the KITTI'12 leaderboards.\n\nKITTI'15:\n# | dd/mm/yy | Method | 3PE (all pixels) [%] | Time [s]\n1 | 30/12/17 | PSMNet [2] | 2.16 | 0.4\n2 | 18/03/18 | iResNet-i2 [18] | 2.44 | 0.12\n3 | 15/05/18 | PDS (proposed) | 2.58 | 0.5\n4 | 24/03/17 | CRL [24] | 2.67 | 0.47\n5 | 27/01/17 | GC-NET [13] | 2.87 | 0.9\n6 | 15/11/17 | LRCR [12] | 3.03 | 49\n7 | 15/11/16 | DRR [7] | 3.16 | 0.4\n8 | 08/11/17 | SsSMnet [43] | 3.40 | 0.8\n9 | 15/12/16 | L-ResMatch [31] | 3.42 | 48\n10 | 26/10/15 | Displets v2 [8] | 3.43 | 265\n\nKITTI'12:\n# | dd/mm/yy | Method | 3PE (non-occluded) [%] | Time [s]\n1 | 31/12/17 | PSMNet [2] | 1.49 | 0.4\n2 | 23/11/17 | iResNet-i2 [18] | 1.71 | 0.12\n3 | 27/01/17 | GC-NET [13] | 1.77 | 0.9\n4 | 15/05/18 | PDS (proposed) | 1.92 | 0.5\n5 | 15/12/16 | L-ResMatch [31] | 2.27 | 48\n6 | 11/09/16 | CNNF+SGM [41] | 2.28 | 71\n7 | 15/12/16 | SGM-NET [30] | 2.29 | 67\n8 | 08/11/17 | SsSMnet [43] | 2.30 | 0.8\n9 | 27/04/16 | PBCP [29] | 2.36 | 68\n10 | 26/10/15 | Displets v2 [8] | 2.37 | 265\n\n5 Acknowledgement\n\nWe gratefully acknowledge support from the NCCR PlanetS and the CaSSIS project of the University of Bern, funded through the Swiss Space Office via ESA's PRODEX program. We also acknowledge the support of NVIDIA Corporation with the donation of the GeForce GTX TITAN X used for this research.\n\nReferences\n\n[1] Jonathan T. Barron, Andrew Adams, YiChang Shih, and Carlos Hernández. Fast bilateral-space stereo for synthetic defocus. In CVPR, 2015.\n\n[2] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. CoRR, 2018.\n\n[3] Zhuoyuan Chen, Xun Sun, and Liang Wang. A deep visual correspondence embedding model for stereo matching costs. In ICCV, 2015.\n\n[4] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. 
In CVPR, 2015.

[5] Meirav Galun, Tal Amir, Tal Hassner, Ronen Basri, and Yaron Lipman. Wide baseline stereo matching with convex bounded distortion constraints. In ICCV, 2015.

[6] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.

[7] Spyros Gidaris and Nikos Komodakis. Detect, replace, refine: Deep structured prediction for pixel wise labeling. In CVPR, 2017.

[8] Fatma Güney and Andreas Geiger. Displets: Resolving stereo ambiguities using object knowledge. In CVPR, 2015.

[9] Simon Hadfield and Richard Bowden. Exploiting high level scene cues in stereo reconstruction. In ICCV, 2015.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[11] Hae-Gon Jeon, Joon-Young Lee, Sunghoon Im, Hyowon Ha, and In So Kweon. Stereo matching with color and monochrome cameras in low-light conditions. In CVPR, 2016.

[12] Zequn Jie, Pengfei Wang, Yonggen Ling, Bo Zhao, Yunchao Wei, Jiashi Feng, and Wei Liu. Left-right comparative recurrent model for stereo matching. In CVPR, 2018.

[13] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. End-to-end learning of geometry and context for deep stereo regression. In ICCV, 2017.

[14] K. R. Kim and C. S. Kim. Adaptive smoothness constraints for efficient stereo matching using texture and edge information. In ICIP, 2016.

[15] KITTI. KITTI stereo scoreboards. http://www.cvlibs.net/datasets/kitti/ Accessed: 05 May 2018.

[16] Patrick Knöbelreiter, Christian Reinbacher, Alexander Shekhovtsov, and Thomas Pock. End-to-end training of hybrid CNN-CRF models for stereo. In CVPR, 2017.

[17] Ang Li, Dapeng Chen, Yuanliu Liu, and Zejian Yuan. Coordinating multiple disparity proposals for stereo computation.
In CVPR, 2016.

[18] Zhengfa Liang, Yiliu Feng, Yulan Guo, Hengzhu Liu, Wei Chen, Linbo Qiao, Li Zhou, and Jianfeng Zhang. Learning for disparity estimation through feature constancy. CoRR, 2018.

[19] Wenjie Luo, Alexander G. Schwing, and Raquel Urtasun. Efficient deep learning for stereo matching. In CVPR, 2016.

[20] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.

[21] Xing Mei, Xun Sun, Mingcai Zhou, Shaohui Jiao, Haitao Wang, and Xiaopeng Zhang. On building an accurate stereo matching system on graphics hardware. In ICCV Workshops, 2011.

[22] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In CVPR, 2015.

[23] Kyoung Won Nam, Jeongyun Park, In Young Kim, and Kwang Gi Kim. Application of stereo-imaging technology to medical field. Healthcare Informatics Research, 2012.

[24] Jiahao Pang, Wenxiu Sun, JS Ren, Chengxi Yang, and Qiong Yan. Cascade residual learning: A two-stage convolutional neural network for stereo matching. In ICCVW, 2017.

[25] Min-Gyu Park and Kuk-Jin Yoon. Leveraging stereo matching with learning-based confidence measures. In CVPR, 2015.

[26] PyTorch. PyTorch web site. http://pytorch.org/ Accessed: 05 May 2018.

[27] Yvain Quéau, Tao Wu, François Lauze, Jean-Denis Durou, and Daniel Cremers. A non-convex variational approach to photometric stereo under inaccurate lighting. In CVPR, 2017.

[28] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 2001.

[29] Akihito Seki and Marc Pollefeys. Patch based confidence prediction for dense disparity map. In BMVC, 2016.

[30] Akihito Seki and Marc Pollefeys. SGM-Nets: Semi-global matching with neural networks. In CVPR,
2017.

[31] Amit Shaked and Lior Wolf. Improved stereo matching with constant highway networks and reflective confidence learning. In CVPR, 2017.

[32] David E. Shean, Oleg Alexandrov, Zachary M. Moratto, Benjamin E. Smith, Ian R. Joughin, Claire Porter, and Paul Morin. An automated, open-source pipeline for mass production of digital elevation models (DEMs) from very-high-resolution commercial stereo satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 2016.

[33] S. Tulyakov, A. Ivanov, and F. Fleuret. Weakly supervised learning of deep metrics for stereo reconstruction. In ICCV, 2017.

[34] Ali Osman Ulusoy, Michael J. Black, and Andreas Geiger. Semantic multi-view stereo: Jointly estimating objects and voxels. In CVPR, 2017.

[35] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, 2016.

[36] Cedric Verleysen and Christophe De Vleeschouwer. Piecewise-planar 3D approximation from wide-baseline stereo. In CVPR, 2016.

[37] Ting-Chun Wang, Manohar Srikanth, and Ravi Ramamoorthi. Depth from semi-calibrated stereo and defocus. In CVPR, 2016.

[38] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. In CVPR, 2015.

[39] Jure Žbontar and Yann LeCun. Computing the stereo matching cost with a convolutional neural network. In CVPR, 2015.

[40] Jure Žbontar and Yann LeCun. Stereo matching by training a convolutional neural network to compare image patches. JMLR, 2016.

[41] F. Zhang and B. W. Wah. Fundamental principles on learning new features for effective dense matching. IEEE Transactions on Image Processing, 27(2):822-836, 2018.

[42] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization.
In ICLR, 2018.

[43] Yiran Zhong, Yuchao Dai, and Hongdong Li. Self-supervised learning for stereo matching with self-improving ability. CoRR, 2017.