{"title": "Quadratic Video Interpolation", "book": "Advances in Neural Information Processing Systems", "page_first": 1647, "page_last": 1656, "abstract": "Video interpolation is an important problem in computer vision, which helps overcome the temporal limitation of camera sensors. Existing video interpolation methods usually assume uniform motion between consecutive frames and use linear models for interpolation, which cannot well approximate the complex motion in the real world. To address these issues, we propose a quadratic video interpolation method which exploits the acceleration information in videos. This method allows prediction with curvilinear trajectory and variable velocity, and generates more accurate interpolation results. For high-quality frame synthesis, we develop a flow reversal layer to estimate flow fields starting from the unknown target frame to the source frame. In addition, we present techniques for flow refinement. Extensive experiments demonstrate that our approach performs favorably against the existing linear models on a wide variety of video datasets.", "full_text": "Quadratic Video Interpolation\n\nXiangyu Xu\u2217\u2020\n\nCarnegie Mellon University\nxuxiangyu2014@gmail.com\n\nLi Siyao\u2217\n\nSenseTime Research\n\nWenxiu Sun\n\nSenseTime Research\n\nlisiyao1@sensetime.com\n\nsunwenxiu@sensetime.com\n\nQian Yin\n\nBeijing Normal University\n\nyinqian@bnu.edu.cn\n\nMing-Hsuan Yang\n\nUniversity of California, Merced Google\n\nmhyang@ucmerced.edu\n\nAbstract\n\nVideo interpolation is an important problem in computer vision, which helps\novercome the temporal limitation of camera sensors. Existing video interpolation\nmethods usually assume uniform motion between consecutive frames and use linear\nmodels for interpolation, which cannot well approximate the complex motion in\nthe real world. To address these issues, we propose a quadratic video interpolation\nmethod which exploits the acceleration information in videos. 
This method allows\nprediction with curvilinear trajectory and variable velocity, and generates more\naccurate interpolation results. For high-quality frame synthesis, we develop a \ufb02ow\nreversal layer to estimate \ufb02ow \ufb01elds starting from the unknown target frame to the\nsource frame. In addition, we present techniques for \ufb02ow re\ufb01nement. Extensive\nexperiments demonstrate that our approach performs favorably against the existing\nlinear models on a wide variety of video datasets.\n\n1\n\nIntroduction\n\nVideo interpolation aims to synthesize intermediate frames between the original input images, which\ncan temporally upsample low-frame rate videos to higher-frame rates. It is a fundamental problem in\ncomputer vision as it helps overcome the temporal limitations of camera sensors and can be used in\nnumerous applications, such as motion deblurring [5, 35], video editing [25, 38], virtual reality [1],\nand medical imaging [11].\nMost state-of-the-art video interpolation methods [2, 3, 9, 14, 17] explicitly or implicitly assume\nuniform motion between consecutive frames, where the objects move along a straight line at a\nconstant speed. As such, these approaches usually adopt linear models for synthesizing intermediate\nframes. However, the motion in real scenarios can be complex and non-uniform, and the uniform\nassumption may not always hold in the input videos, which often leads to inaccurate interpolation\nresults. Moreover, the existing models are mainly developed based on two consecutive frames for\ninterpolation, and the higher-order motion information of the video (e.g., acceleration) has not been\nwell exploited. An effective frame interpolation algorithm should use additional input frames and\nestimate the higher-order information for more accurate motion prediction.\nTo this end, we propose a quadratic video interpolation method to exploit additional input frames\nto overcome the limitations of linear models. 
Speci\ufb01cally, we develop a data-driven model which\nintegrates convolutional neural networks (CNNs) [13, 24] and quadratic models [15] for accurate\nmotion estimation and image synthesis. The proposed algorithm is acceleration-aware, and thus\n\n\u2217Equal contributions.\n\u2020Corresponding author.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Exploiting the quadratic model for acceleration-aware video interpolation. The leftmost\nsub\ufb01gure shows four consecutive frames from a video, describing the projectile motion of a football.\nThe other three sub\ufb01gures show the interpolated results between frame 0 and 1 by different algorithms.\nNote that we overlap these results for better visualizing the interpolation trajectories. Since the linear\nmodel [31] assumes uniform motion between the two frames, it does not approximate the movement\nin real world well. In contrast, our quadratic approach can exploit the acceleration information from\nthe four neighboring frames and generate more accurate in-between video frames.\n\nallows predictions with curvilinear trajectory and variable velocity. Although the ideas of our method\nare intuitive and sensible, this task is challenging as we need to estimate the \ufb02ow \ufb01eld from the\nunknown target frame to the source frame (i.e., backward \ufb02ow) for image synthesis, which cannot\nbe easily obtained with existing approaches. To address this issue, we propose a \ufb02ow reversal layer\nto effectively convert forward \ufb02ow to backward \ufb02ow. In addition, we introduce new techniques for\n\ufb01ltering the estimated \ufb02ow maps. As shown in Figure 1, the proposed quadratic model can better\napproximate pixel motion in real world and thus obtain more accurate interpolation results.\nThe contributions of this work can be summarized as follows. 
First, we propose a quadratic inter-\npolation algorithm for synthesizing accurate intermediate video frames. Our method exploits the\nacceleration information of the video, which can better model the nonlinear movements in the real\nworld. Second, we develop a \ufb02ow reversal layer to estimate the \ufb02ow \ufb01eld from the target frame to\nthe source frame, thereby facilitating high-quality frame synthesis. In addition, we present novel\ntechniques for re\ufb01ning \ufb02ow \ufb01elds in the proposed method. We demonstrate that our method performs\nfavorably against the state-of-the-art video interpolation methods on different video datasets. While\nwe focus on quadratic functions in this work, the proposed framework for exploiting the acceleration\ninformation is general, and can be further extended to higher-order models.\n\n2 Related Work\n\nMost state-of-the-art approaches [2, 3, 4, 9, 14, 17, 19] for video interpolation explicitly or implicitly\nassume uniform motion between consecutive frames. As a typical example, Baker et al. [2] use\noptical \ufb02ow and forward warping to linearly move pixels to the intermediate frames. Liu et al.\n[14] develop a CNN model to directly learn the uniform motion for interpolating the middle frame.\nSimilarly, Jiang et al. [9] explicitly assume uniform motion with \ufb02ow estimation networks, which\nenables a multi-frame interpolation model.\nOn the other hand, Meyer et al. [17] develop a phase-based method to combine the phase information\nacross different levels of a multi-scale pyramid, where the phase is modeled as a linear function of\ntime with the implicit uniform motion assumption. Since the above linear approaches do not exploit\nhigher-order information in videos, the interpolation results are less accurate.\nKernel-based algorithms [20, 21, 33] have also been proposed for frame interpolation. 
While these methods are not constrained by the uniform motion model, existing schemes do not handle nonlinear motion in complex scenarios well, as only the visual information of two consecutive frames is used for interpolation.
Closely related to our work is the method by McAllister and Roulier [15], which uses quadratic splines for data interpolation to preserve the convexity of the input. However, this method can only be applied to low-dimensional data, while we solve the problem of video interpolation, which is in much higher dimensions.

Figure 2: Overview of the quadratic video interpolation algorithm. We first use an off-the-shelf model to estimate flow fields for the input frames. Then we introduce quadratic flow prediction and flow reversal layers to estimate ft→0 and ft→1. We describe the estimation process of ft→0 in detail in this paper, and ft→1 can be computed similarly. Finally, we synthesize the in-between frame by warping and fusing the input frames with ft→0 and ft→1.

3 Proposed Algorithm

To synthesize an intermediate frame Ît where t ∈ (0, 1), existing algorithms [9, 14, 21] usually assume uniform motion between the two consecutive frames I0, I1, and adopt linear models for interpolation. However, this assumption cannot approximate the complex motion in the real world well and often leads to inaccurately interpolated results. To solve this problem, we propose a quadratic interpolation method for predicting more accurate intermediate frames. The proposed method is acceleration-aware, and thus can better approximate real-world scene motion.
An overview of our quadratic interpolation algorithm is shown in Figure 2, where we synthesize the frame Ît by fusing pixels warped from I0 and I1. 
We use I−1, I0, and I1 to warp pixels from I0, and describe this part in detail in the following sections; the warping from the other side (i.e., I1) can be performed similarly using I0, I1, and I2. Specifically, we first compute optical flows f0→1 and f0→−1 with the state-of-the-art flow estimation network PWC-Net [31]. We then predict the intermediate flow map f0→t using f0→1 and f0→−1 in Section 3.1. In Section 3.2, we propose a new method to estimate the backward flow ft→0 by reversing the forward flow f0→t. Finally, we synthesize the interpolated results with the backward flow in Section 3.3.

3.1 Quadratic flow prediction

To interpolate frame Ît, we first consider the motion model of a pixel from I0:

f0→t = ∫₀ᵗ [ v0 + ∫₀^κ aτ dτ ] dκ,   (1)

where f0→t denotes the displacement of the pixel from frame 0 to t, v0 is the velocity at frame 0, and aτ represents the acceleration at frame τ.
Existing models [9, 14, 21] usually explicitly or implicitly assume uniform motion and set aτ = 0 between consecutive frames, where (1) can be rewritten as a linear function of t:

f0→t = t · f0→1.   (2)

However, the objects in real scenarios do not always travel in a straight line at a constant velocity. Thus, these linear approaches cannot effectively model the complex non-uniform motion and often lead to inaccurate interpolation results.
In contrast, we take higher-order information into consideration and assume a constant aτ for τ ∈ [−1, 1]. 
Correspondingly, the flow from frame 0 to t can be derived as:

f0→t = (f0→1 + f0→−1)/2 · t² + (f0→1 − f0→−1)/2 · t,   (3)

which is equivalent to temporally interpolating pixels with a quadratic function.

Figure 3: Effectiveness of the flow reversal layer and the adaptive flow filtering. The car is moving along the arrow direction in the frame sequence. (a) is the ft→0 estimated with the naive strategy from [9]. (b) is the backward flow generated by our flow reversal layer. (c) and (d) represent the results of the deep CNNs in [9] and our adaptive flow filtering, respectively.

This formulation relaxes the constraint of constant velocity and rectilinear movement of linear models, and thus allows accelerated and curvilinear motion prediction between frames. In addition, existing methods with linear models only use the two closest frames I0, I1, whereas our algorithm naturally exploits visual information from more neighboring frames.

3.2 Flow reversal layer

While we obtain the forward flow f0→t from quadratic flow prediction, it cannot be easily used for synthesizing images. Instead, we need the backward flow ft→0 for high-quality frame synthesis [9, 14, 16, 31]. To estimate the backward flow, Jiang et al. [9] introduce a simple method which linearly combines f0→1 and f1→0 to approximate ft→0. However, this approach does not perform well around motion boundaries, as shown in Figure 3(a). More importantly, this approach cannot be applied in our quadratic method to exploit the acceleration information.
In this work, we propose a flow reversal layer for better prediction of ft→0. 
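The quadratic prediction in (3) is a pointwise operation on the two estimated flow maps. A minimal NumPy sketch (array shapes are illustrative, not from the paper):

```python
import numpy as np

def quadratic_flow(f_01: np.ndarray, f_0m1: np.ndarray, t: float) -> np.ndarray:
    """Quadratic flow prediction, Eq. (3), assuming constant acceleration.

    f_01  : flow from frame 0 to frame 1, shape (H, W, 2)
    f_0m1 : flow from frame 0 to frame -1, shape (H, W, 2)
    t     : target time in (0, 1)
    """
    # (f_01 + f_0m1)/2 carries the acceleration term (a/2),
    # (f_01 - f_0m1)/2 carries the velocity term (v0).
    return (f_01 + f_0m1) / 2.0 * t ** 2 + (f_01 - f_0m1) / 2.0 * t
```

With zero acceleration (f0→−1 = −f0→1), the expression collapses to t · f0→1, i.e., the linear model (2).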
We first project the flow map f0→t to frame t, where a pixel x on I0 corresponds to x + f0→t(x) on It. Next, we compute the flow of a pixel u on It by reversing and averaging the projected flow values that fall into the neighborhood N(u) of pixel u. Mathematically, this process can be written as:

ft→0(u) = [ Σ_{x+f0→t(x)∈N(u)} w(‖x + f0→t(x) − u‖₂) · (−f0→t(x)) ] / [ Σ_{x+f0→t(x)∈N(u)} w(‖x + f0→t(x) − u‖₂) ],   (4)

where w(d) = e^(−d²/σ²) is the Gaussian weight for each flow. The proposed flow reversal layer is conceptually similar to the surface splatting [39] in computer graphics, where the optical flow in our work is replaced by camera projection. During training, while the reversal layer itself does not have learnable parameters, it is differentiable and allows the gradients to be backpropagated to the flow estimation module in Figure 2, and thus enables end-to-end training of the whole system.
Note that the proposed reversal approach can lead to holes in the estimated flow map ft→0, which is mostly due to objects that are visible in It but occluded in I0. These missing regions are filled with pixels warped from I1, which is on the other side of the interpolation model.
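The reversal in (4) amounts to Gaussian-weighted splatting on a discrete grid. The sketch below is a naive NumPy version that splats each projected flow to its four nearest integer neighbors; this neighborhood choice and the loop-based implementation are our simplifications, not the paper's GPU layer:

```python
import numpy as np

def reverse_flow(f0t: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Flow reversal, Eq. (4): splat -f0t onto frame t with Gaussian weights.

    f0t: forward flow from frame 0 to t, shape (H, W, 2), stored as (dx, dy).
    Returns f_t0 of the same shape; pixels receiving no splat stay zero (holes).
    """
    H, W, _ = f0t.shape
    num = np.zeros((H, W, 2))
    den = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            fx, fy = f0t[y, x]
            tx, ty = x + fx, y + fy  # landing position on frame t
            # splat onto the 4 integer neighbors of the landing position
            for iy in (int(np.floor(ty)), int(np.floor(ty)) + 1):
                for ix in (int(np.floor(tx)), int(np.floor(tx)) + 1):
                    if 0 <= iy < H and 0 <= ix < W:
                        d2 = (tx - ix) ** 2 + (ty - iy) ** 2
                        w = np.exp(-d2 / sigma ** 2)  # Gaussian weight w(d)
                        num[iy, ix] += w * (-f0t[y, x])
                        den[iy, ix] += w
    out = np.zeros_like(f0t)
    mask = den > 0
    out[mask] = num[mask] / den[mask][:, None]
    return out
```

Pixels that receive no splat keep a zero flow, corresponding to the holes discussed above.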
A straightforward way to reduce these artifacts [6, 9, 31] is to train deep CNNs with residual connections to refine the initial flow maps. However, this strategy does not work well in our practice, as shown in Figure 3(c). This is because the artifacts from the flow reversal layer are mostly thin streaks with spike values (Figure 3(b)). Such outliers cannot be easily removed, since the weighted averaging of convolution can be affected by the spiky outliers.
Inspired by the median filter [7], which samples only one pixel from a neighborhood and thus avoids the issues of weighted averaging, we propose a flow filtering network to adaptively sample the flow map for removing outliers. While the classical median filter involves a non-differentiable operation and cannot be easily trained in our end-to-end model, the proposed method learns to sample one pixel in a neighborhood with neural networks and can more effectively reduce the artifacts of the flow map. Specifically, we formulate the adaptive filtering process as follows:

f′t→0(u) = ft→0(u + δ(u)) + r(u),   (5)

where f′t→0 denotes the filtered backward flow, and δ(u) is the learned sampling offset of pixel u. We constrain δ(u) ∈ [−k, k] by using k × tanh(·) as the activation function of δ, such that the proposed flow filter has a local receptive field of 2k + 1. Since the flow map is sparse and smooth in most regions, we do not directly rectify the artifacts with CNNs as in the schemes of [6, 9, 31]. Instead, we rely on the flow values around outliers by sampling in a neighborhood, where δ is trained to find suitable sampling locations. The residual map r is learned for further improvement. 
Our filtering method enables spatially-variant and nonlinear refinement of ft→0, and can be seen as a learnable median filter in spirit. As shown in Figure 3(d), the proposed algorithm can effectively reduce the artifacts in the reversed flow maps. More implementation details are presented in Section 4.2.

Warping and fusing source frames. While we obtain f′t→0 with the input frames I−1, I0, and I1, we can also estimate f′t→1 in a similar way with I0, I1, and I2. Finally, we synthesize the intermediate video frames as:

Ît(u) = [ (1 − t) m(u) I0(u + f′t→0(u)) + t (1 − m(u)) I1(u + f′t→1(u)) ] / [ (1 − t) m(u) + t (1 − m(u)) ],   (6)

where Ii(u + f′t→i(u)) denotes the pixel warped from frame i to t with the bilinear function [8], and m is a mask learned with a CNN to fuse the warped frames. Similar to [9], we also use the temporal distances 1 − t and t for the source frames I0 and I1, such that we give higher confidence to temporally-closer pixels. Note that we do not directly use the pixels in I−1 and I2 for image synthesis, as almost all the contents of the intermediate frame can be found in I0 and I1. Instead, I−1 and I2 are exploited for acceleration-aware motion estimation.
Since all the above steps of our method are differentiable, we can train the proposed interpolation model in an end-to-end manner. The loss function for training our network is a combination of the ℓ1 loss and perceptual loss [10, 37]:

‖Ît − It‖₁ + λ ‖φ(Ît) − φ(It)‖₂,   (7)

where It is the ground truth and φ is the conv4_3 feature extractor of the VGG16 model [28].

4 Experiments

In this section, we first provide implementation details of the proposed model, including training data, network structure, and hyper-parameters. 
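Before moving to the experiments, the fusion step (6) above can be made concrete. A small NumPy sketch (the backward warping itself is omitted; shapes are illustrative, and the denominator stays positive for t ∈ (0, 1) and m ∈ [0, 1]):

```python
import numpy as np

def fuse(t: float, m: np.ndarray, warped0: np.ndarray, warped1: np.ndarray) -> np.ndarray:
    """Frame fusion, Eq. (6): blend the two backward-warped frames.

    m       : learned fusion mask in [0, 1], shape (H, W)
    warped0 : I0 warped to time t with f'_{t->0}, shape (H, W, C)
    warped1 : I1 warped to time t with f'_{t->1}, shape (H, W, C)
    """
    w0 = (1 - t) * m        # confidence for the frame-0 side
    w1 = t * (1 - m)        # confidence for the frame-1 side
    den = w0 + w1           # > 0 whenever t is strictly inside (0, 1)
    return (w0[..., None] * warped0 + w1[..., None] * warped1) / den[..., None]
```

With m ≡ 0.5 the blend reduces to a plain temporal-distance weighting of the two warped frames.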
We then present evaluation results of our algorithm with\ncomparisons to the state-of-the-art methods on video datasets. The source code, data, and the trained\nmodels are available at: https://sites.google.com/view/xiangyuxu/qvi_nips19.\n\n4.1 Training data\n\nTo train the proposed interpolation model, we collect high-quality videos from the Internet, where\neach frame is of 1080\u00d71920 pixels at the frame rate of 960 fps. From the collected videos, we\nselect the clips with both camera shake and dynamic object motion, which are bene\ufb01cial for more\neffective network training. The \ufb01nal training dataset consists of 173 video clips of different scenes\nand 36926 frames in total. In addition, the 960 fps video clips are randomly downsampled to 240 fps\nand 480 fps for data augmentation. During the training process, we extract non-overlapped frame\ngroups from these video clips, where each has 4 input frames I\u22121, I0, I1, I2, and 7 target frames It,\nt = 0.125, 0.25, . . . , 0.875. We resize the frames into 360\u00d7640 and randomly crop 352\u00d7352 patches\nfor training. Image \ufb02ipping and sequence reversal are also performed to fully utilize the video data.\n\n4.2\n\nImplementation details\n\nWe learn the adaptive \ufb02ow \ufb01ltering with a 23-layer U-Net [26, 36] which is an encoder-decoder\nnetwork. The encoder is composed of 12 convolution layers with 5 average pooling layers for\n\n5\n\n\fTable 1: Quantitative evaluations on the GOPRO and Adobe240 datasets. 
“Ours w/o qua.” represents our model without using the quadratic flow prediction.

                GOPRO whole           GOPRO center          Adobe240 whole        Adobe240 center
Method          PSNR   SSIM   IE      PSNR   SSIM   IE      PSNR   SSIM   IE      PSNR   SSIM   IE
Phase           23.95  0.700  17.89   22.05  0.620  22.08   25.60  0.735  16.93   23.65  0.647  20.65
DVF             21.94  0.776  21.30   20.55  0.720  25.14   28.23  0.896  11.76   26.90  0.871  13.30
SepConv         29.52  0.922   9.26   27.69  0.895  11.38   32.19  0.954   7.71   30.87  0.940   8.91
SuperSloMo      29.00  0.918   9.51   27.33  0.892  11.50   31.30  0.949   8.18   30.17  0.935   9.22
Ours w/o qua.   29.57  0.923   9.02   27.86  0.898  10.93   31.64  0.952   7.93   30.48  0.939   8.96
Ours            31.27  0.948   7.23   29.62  0.929   8.73   32.95  0.966   6.84   32.09  0.959   7.47

downsampling, and the decoder has 11 convolution layers with 5 bilinear layers for upsampling. We add skip connections with pixel-wise summation between the same-resolution layers in the encoder and decoder to jointly use low-level and high-level features. The input of our network is a concatenation of I0, I1, It0, It1, f0→1, f1→0, ft→0, and ft→1, where Iti(u) = Ii(u + ft→i(u)) denotes the pixel warped with ft→i. The U-Net produces the outputs δ and r, which are used to estimate the filtered flow maps f′t→0 and f′t→1 with (5). Then we warp I0 and I1 with the flows f′t→0 and f′t→1, and feed the warped images to a 3-layer CNN to estimate the fusion mask m, which is finally used for frame interpolation with (6).
We first train the proposed network with the flow estimation module fixed for 200 epochs, and then finetune the whole system for another 40 epochs. Similar to [34], we use the Adam optimizer [12] for training. 
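The two-stage schedule can be summarized in a few lines. This sketch also folds in the learning-rate decay given in the hyper-parameters below, and it assumes (our simplification; the text does not specify the finetuning rate) that the decayed rate simply carries over into the finetuning stage:

```python
def training_schedule(total_epochs: int = 240):
    """Two-stage schedule from Sec. 4.2: 200 epochs with the flow estimator
    frozen, then 40 epochs of end-to-end finetuning. The learning rate starts
    at 1e-4 and is scaled by 0.1 after epochs 100 and 150 (Adam optimizer)."""
    for epoch in range(total_epochs):
        lr = 1e-4
        if epoch >= 100:
            lr *= 0.1
        if epoch >= 150:
            lr *= 0.1
        finetune_flow = epoch >= 200  # unfreeze the flow module for the last 40 epochs
        yield epoch, lr, finetune_flow
```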
We initialize the learning rate as 10⁻⁴ and further decrease it by a factor of 0.1 at the end of the 100th and 150th epochs. The trade-off parameter λ of the loss function (7) is set to 0.005, and k in the activation function of δ is set to 10. In the flow reversal layer, we set the Gaussian standard deviation σ = 1.
For evaluation, we report PSNR, the Structural Similarity Index (SSIM) [32], and the interpolation error (IE) [2] between the predictions and the ground truth intermediate frames, where IE is defined as the root-mean-squared (RMS) difference between the reference and interpolated images.

4.3 Comparison with the state of the art

We evaluate our model against the state-of-the-art video interpolation approaches, including the phase-based method (Phase) [17], separable adaptive convolution (SepConv) [21], deep voxel flow (DVF) [14], and SuperSloMo [9]. We use the original implementations for Phase, SepConv, and DVF, and the implementation from [22] for SuperSloMo. We retrain DVF and SuperSloMo with our data. We were not able to retrain SepConv as the training code is not publicly available, and instead directly use the original models [21] in our experiments. Note that the proposed quadratic video interpolation can be used for synthesizing arbitrary intermediate frames, which is evaluated on high-frame-rate video datasets such as GOPRO [18] and Adobe240 [30]. We also conduct experiments on the UCF101 [29] and DAVIS [23] datasets for performance evaluation of single-frame interpolation.

Multi-frame interpolation on the GOPRO [18] dataset. This dataset is composed of 33 high-quality videos with a frame rate of 720 fps and image resolution of 720×1280. These videos are recorded with hand-held cameras, which often contain non-linear camera motion. 
In addition, this dataset has dynamic object motion from both indoor and outdoor scenes, which are challenging for existing interpolation algorithms.
We extract 4275 non-overlapped frame sequences with a length of 25 from the GOPRO videos. To evaluate the proposed quadratic model, we use the 1st, 9th, 17th, and 25th frames of each sequence as our inputs, which respectively correspond to I−1, I0, I1, I2 in the proposed model. As discussed in Section 1, the baseline methods only exploit the 9th and 17th frames for video interpolation. We synthesize 7 frames between the 9th and 17th frames, and thus all the corresponding ground truth frames are available for evaluation.
As shown in Table 1, we separately evaluate the scores of the center frame (i.e., the 4th frame, denoted as center) and the average of all the 7 interpolated frames (denoted as whole). The quadratic interpolation model consistently performs favorably against all the other linear methods.

Figure 4: Qualitative results on the GOPRO dataset. The first row of each example shows the overlap of the interpolated center frame and the ground truth. A clearer overlapped image indicates a more accurate interpolation result. The second row of each example shows the interpolation trajectory of all the 7 interpolated frames by feature point tracking.

Noticeably, the PSNRs of our results on either the center frame or the average of the whole frames improve over the second-best method by more than 1 dB.
To understand the effectiveness of the proposed quadratic interpolation algorithm, we visualize the trajectories of the interpolated results in Figure 4 and compare with the baseline methods both qualitatively and quantitatively. 
Specifically, for each test sequence from the GOPRO dataset, we use the classic feature tracking algorithm [27] to select 10000 feature points in the 9th frame, and track them through the 7 synthesized in-between frames. For better performance evaluation, we exclude the points that disappear or move out of the image boundaries during tracking.
We show two typical examples in Figure 4 and visualize the interpolation trajectory by connecting the tracking points (i.e., the red lines). In the first example, the object moves along quite a sharp curve, mostly due to a sudden violent change of the camera's moving direction. All the existing methods fail on this example, as the linear models assume uniform motion and cannot predict the motion change well. In contrast, our quadratic model enables higher-order video interpolation and exploits the acceleration information from the neighboring frames. As shown in Figure 4(e), the proposed method approximates the curvilinear motion well against the ground truth.
In addition, we overlap the predicted center frame with its ground truth to evaluate the interpolation accuracy of different methods (first row of each example in Figure 4). For linear models, the overlapped frames are severely blurred, which demonstrates the large shift between the ground truth and the linearly interpolated results. In contrast, the frames generated by our approach align with the ground truth well, which indicates better interpolation results with smaller errors.
Different from the first example, which contains severe non-linear movements, we present a video with a motion trajectory closer to straight lines in the second example. As shown in Figure 4, although the motion in this video is closer to the uniform assumption of linear models, existing approaches still do not generate accurate interpolation results. 
This demonstrates the importance of the proposed quadratic algorithm, since there are few scenes strictly satisfying uniform motion, and minor perturbations to this strict motion assumption can lead to obvious shifts in the synthesized images. As shown in the second example of Figure 4(e), the proposed quadratic method estimates the moving trajectory well against the ground truth and thus generates more accurate interpolation results.

Table 2: ASFP on the GOPRO dataset.

Method          whole   center
SepConv         1.79    2.17
SuperSloMo      2.04    2.38
Ours w/o qua.   1.33    1.69
Ours            0.97    1.22

Table 3: Evaluations on the UCF101 and DAVIS datasets.

                UCF101                  DAVIS
Method          PSNR    SSIM    IE      PSNR    SSIM    IE
Phase           29.84   0.900   7.97    21.54   0.556   26.76
DVF             29.88   0.916   7.66    22.24   0.742   23.66
SepConv         31.97   0.943   5.89    26.21   0.857   15.84
SuperSloMo      32.04   0.945   5.99    25.76   0.850   15.93
Ours w/o qua.   32.02   0.945   5.99    26.83   0.874   13.69
Ours            32.54   0.948   5.79    27.73   0.894   12.32

Figure 5: Visual results from the DAVIS dataset.

In addition, if we do not consider the acceleration in the proposed method (i.e., using (2) to replace (3)), the interpolation performance of our model decreases drastically to that of linear models (“Ours w/o qua.” in Table 1 and Figure 4), which shows the importance of the higher-order information.
To quantitatively measure the shifts between the synthesized frames and ground truth, we define a new error metric for video interpolation, denoted as the average shift of feature points (ASFP):

ASFP(It, Ît) = (1/N) Σᵢ₌₁ᴺ ‖p(It, i) − p(Ît, i)‖₂,   (8)

where p(It, i) denotes the position of the ith feature point on It, and N is the number of feature points. 
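ASFP in (8) is simply the mean Euclidean distance between matched tracked points; a direct NumPy sketch:

```python
import numpy as np

def asfp(p_gt: np.ndarray, p_pred: np.ndarray) -> float:
    """Average shift of feature points, Eq. (8).

    p_gt, p_pred: (N, 2) arrays of tracked point positions on the ground-truth
    frame It and the interpolated frame, respectively.
    """
    return float(np.mean(np.linalg.norm(p_gt - p_pred, axis=1)))
```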
We respectively compute the average ASFP of the center frame and of the whole 7 interpolated frames on the GOPRO dataset. Table 2 shows that the proposed quadratic algorithm performs favorably against the state-of-the-art methods while significantly reducing the average shift.

Evaluations on the Adobe240 [30] dataset. This dataset consists of 133 videos with a frame rate of 240 fps and image resolution of 720×1280 pixels. The frames are resized to 360×480 during testing. We extract 8702 non-overlapped frame sequences from the videos in the Adobe240 dataset, and each sequence contains 25 consecutive frames, similar to the settings of the GOPRO dataset. We also synthesize 7 in-between frames for 8× temporal upsampling. As shown in Table 1, the proposed quadratic algorithm performs favorably against the state-of-the-art linear interpolation methods.

Single-frame interpolation on the UCF101 [29] and DAVIS [23] datasets. In addition to the multi-frame interpolation evaluated on high-frame-rate videos, we test the proposed quadratic model on single-frame interpolation using videos at 30 fps, i.e., the UCF101 [29] and DAVIS [23] datasets. Liu et al. [14] previously extracted 100 triplets from the videos of the UCF101 dataset as test data, which cannot be used to evaluate our algorithm since we need four consecutive frames as inputs. Thus, we re-generate the test data by first temporally downsampling the original videos to 15 fps and then randomly extracting 4 adjacent frames (i.e., I−1, I0, I1, I2) from these videos. The sequences with static scenes are removed for more accurate evaluations. 
We collect 100 quintuples (4 input frames I−1, I0, I1, I2 and 1 target frame I0.5), where each frame is resized to 225×225 pixels as in [14]. For the DAVIS dataset, we evaluate our method on all 90 video clips, which are divided into 2847 quintuples using the original image resolution.
We interpolate the center frame for these two datasets, which is equivalent to converting a 15 fps video to a 30 fps one. As shown in Table 3, the quadratic interpolation approach performs only slightly better than the baseline models on the UCF101 dataset, as the videos are of relatively low quality with low image resolution and slow motion. For the DAVIS dataset, which contains complex motion from both camera shake and dynamic scenes, our method significantly outperforms other approaches in terms of all evaluation metrics. We show one example from the DAVIS dataset in Figure 5 for visual comparisons.

Figure 6: Adaptive flow filtering reduces the artifacts in (a) and generates a higher-quality image (d).

Overall, the quadratic approach achieves state-of-the-art performance on a wide variety of video datasets for both single-frame and multi-frame interpolation. More importantly, the experimental results demonstrate that it is important and effective to exploit the acceleration information for accurate video frame interpolation.

4.4 Ablation study

Table 4: Ablation study on the DAVIS dataset.

We analyze the contribution of each component of our model on the DAVIS video dataset [23] in Table 4. In particular, we study the impact of quadratic interpolation by replacing the quadratic flow prediction (3) with the linear function (2) (w/o qua.). We further study the effectiveness of the adaptive flow filtering by directly learning residuals for flow refinement, similar to [6, 9, 31] (w/o ada.). 
In addition, we compare the \ufb02ow reversal layer with the linear\ncombination strategy in [9] which approximates ft\u21920 by simply fusing f0\u21921 and f1\u21920 (w/o rev.).\nAs shown in Table 4, removing each of the three components degrades performance in all metrics.\nParticularly, the quadratic \ufb02ow prediction plays a crucial role, which veri\ufb01es our approach to exploit\nthe acceleration information from additional neighboring frames. Note that while the quantitative\nimprovement from the adaptive \ufb02ow \ufb01ltering is small, this component is effective in generating\nhigh-quality interpolation results by reducing artifacts of the \ufb02ow \ufb01elds as shown in Figure 6.\n\nOurs w/o rev.\nOurs w/o qua.\nOurs w/o ada.\nOurs full model\n\n0.873\n0.874\n0.892\n0.894\n\n26.71\n26.83\n27.60\n27.73\n\n13.84\n13.69\n12.41\n12.32\n\n5 Conclusion\n\nIn this paper we propose a quadratic video interpolation algorithm which can synthesize high-quality\nintermediate frames. This method exploits the acceleration information from neighboring frames of\na video for non-linear video frame interpolation, and facilitates end-to-end training. The proposed\nmethod is able to model complex motion in real world more accurately and generate more favorable\nresults than existing linear models on different video datasets. While we focus on quadratic function in\nthis work, the proposed formulation is general and can be extended to even higher-order interpolation\nmethods, e.g., the cubic model. We also expect this framework to be applied to other related tasks,\nsuch as multi-frame optical \ufb02ow and novel view synthesis.\n\nReferences\n[1] R. Anderson, D. Gallup, J. T. Barron, J. Kontkanen, N. Snavely, C. Hern\u00e1ndez, S. Agarwal, and S. M. Seitz.\n\nJump: virtual reality video. ACM Transactions on Graphics (TOG), 35:198, 2016. 1\n\n[2] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski. 
A database and evaluation methodology for optical flow. IJCV, 92:1–31, 2011.
[3] W. Bao, W.-S. Lai, X. Zhang, Z. Gao, and M.-H. Yang. MEMC-Net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement. arXiv:1810.08768, 2018.
[4] J. L. Barron, D. J. Fleet, and S. S. Beauchemin. Performance of optical flow techniques. IJCV, 12:43–77, 1994.
[5] T. Brooks and J. T. Barron. Learning to synthesize motion blur. In CVPR, 2019.
[6] Y. Gan, X. Xu, W. Sun, and L. Lin. Monocular depth estimation with affinity, vertical pooling, and label enhancement. In ECCV, 2018.
[7] R. C. Gonzalez and R. E. Woods. Digital image processing. Prentice Hall, New Jersey, 2002.
[8] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.
[9] H. Jiang, D. Sun, V. Jampani, M. Yang, E. G. Learned-Miller, and J. Kautz. Super SloMo: High quality estimation of multiple intermediate frames for video interpolation. In CVPR, 2018.
[10] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
[11] A. Karargyris and N. Bourbakis. Three-dimensional reconstruction of the digestive wall in capsule endoscopy videos using elastic video interpolation. IEEE Transactions on Medical Imaging, 30:957–971, 2010.
[12] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
[13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324, 1998.
[14] Z. Liu, R. Yeh, X. Tang, Y. Liu, and A. Agarwala. Video frame synthesis using deep voxel flow. In ICCV, 2017.
[15] D. F. McAllister and J. A. Roulier. Interpolation by convex quadratic splines. Mathematics of Computation, 32:1154–1162, 1978.
[16] S. Meister, J. Hur, and S. Roth. UnFlow: Unsupervised learning of optical flow with a bidirectional census loss. In AAAI, 2018.
[17] S. Meyer, O. Wang, H. Zimmer, M. Grosse, and A. Sorkine-Hornung. Phase-based frame interpolation for video. In CVPR, 2015.
[18] S. Nah, T. H. Kim, and K. M. Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In CVPR, 2017.
[19] S. Niklaus and F. Liu. Context-aware synthesis for video frame interpolation. In CVPR, 2018.
[20] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive convolution. In CVPR, 2017.
[21] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive separable convolution. In ICCV, 2017.
[22] A. Paliwal. PyTorch implementation of Super SloMo. https://github.com/avinashpaliwal/Super-SloMo, 2018.
[23] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
[24] W. Ren, S. Liu, L. Ma, Q. Xu, X. Xu, X. Cao, J. Du, and M.-H. Yang. Low-light image enhancement via a deep hybrid network. TIP, 28:4364–4375, 2019.
[25] W. Ren, J. Zhang, X. Xu, L. Ma, X. Cao, G. Meng, and W. Liu. Deep video dehazing with semantic segmentation. TIP, 28:1895–1908, 2018.
[26] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[27] J. Shi and C. Tomasi. Good features to track. In CVPR, 1994.
[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[29] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402, 2012.
[30] S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, and O. Wang. Deep video deblurring for hand-held cameras. In CVPR, 2017.
[31] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, 2018.
[32] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 13:600–612, 2004.
[33] X. Xu, M. Li, and W. Sun. Learning deformable kernels for image and video denoising. arXiv:1904.06903, 2019.
[34] X. Xu, Y. Ma, and W. Sun. Towards real scene super-resolution with raw images. In CVPR, 2019.
[35] X. Xu, J. Pan, Y.-J. Zhang, and M.-H. Yang. Motion blur kernel estimation via deep learning. TIP, 27:194–205, 2017.
[36] X. Xu, D. Sun, S. Liu, W. Ren, Y.-J. Zhang, M.-H. Yang, and J. Sun. Rendering portraitures from monocular camera and beyond. In ECCV, 2018.
[37] X. Xu, D. Sun, J. Pan, Y. Zhang, H. Pfister, and M.-H. Yang. Learning to super-resolve blurry face and text images. In ICCV, 2017.
[38] C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski. High-quality video view interpolation using a layered representation. ACM Transactions on Graphics (TOG), 23:600–608, 2004.
[39] M. Zwicker, H. Pfister, J. Van Baar, and M. Gross. Surface splatting. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, 2001.