{"title": "Memory-oriented Decoder for Light Field Salient Object Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 898, "page_last": 908, "abstract": "Light field data have been demonstrated in favor of many tasks in computer vision, but existing works about light field saliency detection still rely on hand-crafted features. In this paper, we present a deep-learning-based method where a novel memory-oriented decoder is tailored for light field saliency detection. Our goal is to deeply explore and comprehensively exploit internal correlation of focal slices for accurate prediction by designing feature fusion and integration mechanisms. The success of our method is demonstrated by achieving the state of the art on three datasets. We present this problem in a way that is accessible to members of the community and provide a large-scale light field dataset that facilitates comparisons across algorithms. The code and dataset will be made publicly available.", "full_text": "Memory-oriented Decoder for Light Field\n\nSalient Object Detection\n\nMiao Zhang\u2217\n\nJingjing Li\u2217\n\nWei Ji\u2217\n\nYongri Piao\u2020\n\nHuchuan Lu\n\nDalian University of Technology, China\n\nmiaozhang@dlut.edu.cn, {lijingjing, jiwei521}@mail.dlut.edu.cn,\n\n{yrpiao, lhchuan}@dlut.edu.cn\n\nAbstract\n\nLight \ufb01eld data have been demonstrated in favor of many tasks in computer vision,\nbut existing works about light \ufb01eld saliency detection still rely on hand-crafted\nfeatures. In this paper, we present a deep-learning-based method where a novel\nmemory-oriented decoder is tailored for light \ufb01eld saliency detection. Our goal is\nto deeply explore and comprehensively exploit internal correlation of focal slices\nfor accurate prediction by designing feature fusion and integration mechanisms.\nThe success of our method is demonstrated by achieving the state of the art on\nthree datasets. 
We present this problem in a way that is accessible to members\nof the community and provide a large-scale light \ufb01eld dataset that facilitates\ncomparisons across algorithms. The code and dataset are made publicly available\nat https://github.com/OIPLab-DUT/MoLF.\n\n1 Introduction\n\nSalient object detection (SOD) aims to identify the most visually distinctive objects in a scene despite\nsubstantial appearance similarity. This fundamental task has attracted considerable interest due\nto its importance in various applications, such as visual tracking [20, 47], object recognition [43, 10],\nimage segmentation [33], image retrieval [44], and robot navigation [9].\nExisting methods can be categorized into 2D (RGB), 3D (RGB-D) and 4D (light \ufb01eld) saliency\ndetection based on the input data types. 2D methods [15, 23, 8, 18, 21, 36, 27, 63] have achieved great\nsuccess and have long been dominant in the \ufb01eld of saliency detection. However, 2D saliency detection\nmethods may suffer from false positives in challenging scenes such as those shown in Fig. 1. The\nreasons are twofold. First, traditional 2D methods rely on many priors, and violating these priors\nposes a high risk in complex scenes. Second, 2D deep-learning-based methods are limited to\nfeatures extracted from RGB data, which do not contain as much spatial information as RGB-D\ndata or light \ufb01eld data. 3D saliency detection has also attracted a lot of attention because depth maps\nproviding scene layout can improve the saliency accuracy to some extent. However, mediocre-quality\ndepth maps heavily jeopardize the accuracy of saliency detection.\nThe light \ufb01eld provides images of the scene from an array of viewpoints which spread over the extent\nof the lens aperture. These different views can be used to produce a stack of focal slices, containing\nabundant spatial parallax information as well as accurate depth information about the objects in the\nscene. 
Furthermore, focusness is one of the strongest cues, allowing a human observer to\ninstantly understand the order in which objects are arranged along the depth in a scene [24, 59, 29].\nLight \ufb01eld data have been shown to bene\ufb01t many applications in computer vision, such as\ndepth estimation [16, 48, 64], super resolution [67, 55], and material recognition [51]. Due to the\nunique property of light \ufb01eld, it has shown promising prospects in saliency detection [24, 28, 58, 56,\n59, 29]. However, deep-learning-based light \ufb01eld methods have been missing from contemporary\nstudies in saliency detection. We have strong reasons to believe that introducing the CNN framework into\nlight \ufb01eld saliency detection is as important as it has been for 2D and 3D methods in SOD.\n\n\u2217denotes equal contributions.\n\u2020Prof.Piao is the corresponding author.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Left: some challenging scenes, e.g., similar foreground and background, complex background,\ntransparent objects, and low intensity environment. Right: the light \ufb01eld data. (a)-(d) are\nfour focal slices that focus at different depth levels. The green box with red dot represents different\nfocus positions. From our observation, they are bene\ufb01cial for ef\ufb01cient foreground and background\nseparation. (e) shows our model\u2019s saliency results. \u2018GT\u2019 means ground truths.\n\nIn order to incorporate the CNN framework and light \ufb01eld for accurate SOD, three key\nissues need to be considered. First, how do we solve the de\ufb01ciency of training data? Second, how\ndo we effectively and properly fuse light \ufb01eld features generated from different focal slices? Third,\nhow do we comprehensively integrate multi-level features?\nIn this paper, we leverage the ideas from light \ufb01eld to confront these challenges. 
To better adapt our\nnetwork to fuse features from focal slices, we want neither to ignore the greater contribution of the\nfocal slices in which the salient object happens to be in focus, nor to destroy the spatial\ncorrelation between different focal slices. Therefore, we propose a novel memory-oriented spatial\nfusion module (Mo-SFM) to resemble the memory mechanism of how humans fuse information to\nunderstand a scene by going through all pieces of information and emphasizing the most relevant ones.\nOn the other hand, integration of fused features is used for higher cognitive processing. Therefore,\nwe propose a sophisticated multi-level integration mechanism in a top-down manner where high-level\nfeatures are used to guide low-level feature selection, namely the memory-oriented feature integration\nmodule (Mo-FIM). The previous information, referred to as memory, is used in our channel attention to\nupdate the current light \ufb01eld feature, so that important and unnecessary features can be distinguished.\nIn summary, our main contributions are as follows:\n\n\u2022 We introduce a large-scale light \ufb01eld saliency dataset with 1462 samples, each of which contains\nan all-focus image, a focal stack with 12 focal slices, a depth map, and a corresponding\nground truth, genuinely hoping that this could pave the way for light \ufb01eld SOD and enable\nmore advanced research and development.\n\u2022 We propose a novel memory-oriented decoder tailored for light \ufb01eld SOD. The feature fusion\nmechanism in Mo-SFM and the feature integration mechanism in Mo-FIM enable more accurate\nprediction. 
This work is, to the best of our knowledge, the \ufb01rst to exploit the\nunique focal slices in light \ufb01eld data for deep-learning-based saliency detection.\n\u2022 Extensive experiments on three light \ufb01eld datasets show that our method achieves consistently\nsuperior performance over 25 state-of-the-art 2D, 3D and 4D approaches.\n\n2 Related Work\n\nSalient Object Detection. Early works [23, 8, 18, 19, 40, 68, 32, 30, 41, 49] for saliency detection\nmainly rely on hand-crafted features and prior knowledge, such as color contrast and background\npriors. Recently, with the utilization of CNNs, 2D SOD has achieved appealing performance. Li\net al. [27] adopt a CNN to extract multi-scale features to predict saliency for each super-pixel. Wang\net al. [50] propose two CNNs to integrate local super-pixel estimation and global search for SOD.\nZhao et al. [63] utilize two independent CNNs to extract both global and local contexts. Lee et al. [26]\ncombine a low-level distance map with high-level semantic features of deep CNNs for SOD. These\nmethods achieve better performance but suffer from time-consuming computation and impair the\nspatial information of the input images. Afterwards, Liu and Han [35] \ufb01rst generate a coarse saliency\nmap and then re\ufb01ne its details step by step. Hou et al. [21] introduce short connections into multiple\nside-outputs based on HED [54] architecture. Zhang et al. [60] integrate multi-level features in\nmultiple resolutions and combine them for accurate prediction. Luo et al. [37] propose a simpli\ufb01ed\nCNN to combine both local and global information and design a loss to penalize boundary errors.\nZhang et al. [62] and Liu et al. [36] introduce attention mechanism to guide feature integration. Deng\net al. [11] design a residual re\ufb01nement block to learn the complementary saliency information of the\nintermediate prediction. Li et al. 
[31] transfer contour knowledge to saliency detection without using\nany manual saliency masks. Detailed surveys about 2D SOD can be found in [3, 2, 4, 52].\nIn 3D SOD, depth images with abundant spatial information can act as complementary cues for\nsaliency detection [38, 39, 14, 25, 42, 5]. Peng et al. [39] regard the depth data as one channel of\ninput and feed it into a multi-stage saliency detection model. Ju et al. [25] and Feng et al. [14] present\nsaliency methods based on anisotropic center-surround difference or local background enclosure.\nZhu et al. [66] propose a center-dark channel prior for RGB-D SOD. Qu et al. [42] use hand-crafted\nfeatures to train a CNN and achieve better performance than traditional methods. In [17, 7], two-stream\nmodels are used to process the RGB image and depth map separately and cross-modal features are\ncombined to jointly make predictions. Due to limited training sets, they are trained in a stage-wise\nmanner. Chen et al. [5] design a progressive fusion network to fuse cross-modal multi-level features\nto predict saliency maps. Chen et al. [6] propose a three-stream network to extract RGB-D features\nand use an attention mechanism to adaptively select complementary information. Zhu et al. [65] use large-scale RGB\ndatasets to pre-train a prior model and employ depth-induced features to enhance the network.\nPrevious works in light \ufb01eld SOD have shown promising prospects, especially for some complex\nscenarios. Li et al. [29, 28] report a saliency detection approach on the light \ufb01eld data and propose\nthe \ufb01rst light \ufb01eld saliency dataset, LFSD. Zhang et al. [58] propose a saliency method based on depth\ncontrast and focusness-based background priors, and show the effectiveness and superiority of light\n\ufb01eld properties. Li et al. [56] introduce a weighted sparse coding structure for handling heterogeneous\ntypes of input data. Zhang et al. 
[59] integrate multiple visual cues from light \ufb01eld images to detect\nsalient regions. However, deep-learning-based light \ufb01eld methods are still in their infancy, and many\nissues have yet to be explored.\n\n3 Light Field Dataset\n\nTo remedy the data de\ufb01ciency problem, we introduce a large-scale light \ufb01eld saliency dataset with\n1462 selected high-quality samples captured by a Lytro Illum camera. We decode the light \ufb01eld format\n\ufb01le using Lytro Desktop. Each light \ufb01eld consists of an all-focus image, a focal stack with 12 focal\nslices focusing at different depths, a depth image, and a corresponding manually labeled ground truth.\nThe focal stack resembles human perception using eyes, i.e., the eyes can dynamically refocus at\ndifferent focal slices to determine saliency [29]. Fig. 1 shows samples of light \ufb01elds in our proposed\ndataset. From our observation, they are bene\ufb01cial for ef\ufb01cient foreground and background separation.\nDuring annotation, three volunteers are asked to draw a rectangle around the most attractive objects.\nThen, we collect 1462 samples by choosing the images with consensus. We manually label the\nsalient objects from the all-focus image using a commonly used segmentation tool. By supplying\nthe easy-to-understand dataset, we hope to promote the research and make the SOD problem more\naccessible to those familiar with this \ufb01eld. The proposed light \ufb01eld saliency dataset provides the\nunique focal slices that can be used to support the training needs of deep neural networks.\nThis dataset consists of 900 indoor and 562 outdoor scenes captured in the surrounding environments\nof our daily life, e.g., of\ufb01ces, supermarkets, campuses, streets and so on. Besides, this dataset contains\nmany challenging scenes as shown in Fig. 
1, e.g., similar foreground and background (108), complex\nbackground (31), transparent objects (28), multiple objects (95), and low-intensity environments (9).\n\n4 The Proposed Network\n\n4.1 The Overall Architecture\n\nWe adopt the widely utilized VGG-19 net [46] as the backbone architecture, drop the last pooling\nlayer and fully connected layers, and retain \ufb01ve convolutional blocks to better \ufb01t our task, as\nshown in Fig. 2. In the encoder, the RGB image is fed into one stream to generate raw RGB features\nwhile all focal slices are fed into another stream to generate light \ufb01eld features with abundant spatial\ninformation. For simplicity, we just illustrate one single encoder, which represents the two streams\nsimultaneously. As suggested in [5], the Conv1_2 block (i.e., Block1) might be too shallow to make\nreliable predictions. We hereby perform our decoder on deeper layers (i.e., Block2-Block5). More\nspeci\ufb01cally, given the RGB image $I_0$ and the focal slices $\\{I_i\\}_{i=1}^{12}$ with size $H \\times W$, we denote the\noutputs of the last four blocks as $\\{f_m^i, m = 2, 3, 4, 5\\}_{i=0}^{12}$, where $i = 0$ represents features generated\nin the RGB stream, $i = 1, \\cdots, 12$ represents the indexes of focal slices and $m = 2, 3, 4, 5$ represents\nthe last four convolution blocks.\n\nFigure 2: The overall architecture of our proposed network, which contains an encoder and a\nmemory-oriented decoder.\n\n4.2 The Memory-oriented Spatial Fusion Module (Mo-SFM)\n\nWith the raw RGB and light \ufb01eld features generated from the encoder, we aim at fusing all available\ninformation to address the challenging problem of light \ufb01eld SOD. A straightforward solution is to\nsimply concatenate light \ufb01eld features produced by different focal slices. However, two drawbacks\nemerge in this approach. First, it ignores the relative contributions of different focal slices to the\n\ufb01nal results. Focal slices represent images focused at different depths in a scene as shown in Fig. 
Focal slices represent images focused at different depths in a scene as shown in Fig. 1.\nIntuitively, different focal slices have different weights regarding the salient objects. Second, direct\nconcatenation operation seriously damages the spatial correlation of those focal slices. A more proper\nand effective fusion strategy should be considered. Hence, we propose a novel memory-oriented\nspatial fusion module (Mo-SFM) to address this problem. In this module, we introduce an attention\nmechanism shown in Fig. 2 to emphasize the useful features and suppress the unnecessary ones from\nfocused and blurred information. This procedure can be de\ufb01ned as:\n\nAttm = \u03b4(Wm \u2217 AvgP ooling(D[f 0\n\nm; f 1\n\nm;\u00b7\u00b7\u00b7 ; f 12\n\nm ]) + bm),\n\n(1)\n\n(cid:101)f i\n\nm (cid:12) Atti\n\nm, i = 0, 1,\u00b7\u00b7\u00b7 , 12,\n\nm = f i\n\n(2)\nwhere D[ \u00b7 ; \u00b7\u00b7\u00b7 ; \u00b7 ] means concatenation operation. \u2217, Wm and bm represent convolution operator\nand convolution parameters in m-th layers. AvgP ooling(\u00b7) means global average pooling operation\nand \u03b4(\u00b7) means softmax function. Attm \u2208 R1\u00d71\u00d7N means the channel-wise attention map in m-th\nThen those weighted light \ufb01eld features {(cid:101)f i\nlayers. (cid:12) denotes feature-wise multiplication.\nm}12\n\ni=0 are regarded as a sequence of inputs corresponding\nto the consecutive time steps. 
\n\nFigure 3: Visual comparisons in ablation studies. (a) means using RGB image only. (b) means using\nlight \ufb01eld data (concatenation without weighting). (c) means concatenation with weighting. (d)\nmeans ConvLSTM fusion with weighting. (e) represents (d)+GPM (i.e., full Mo-SFM). (f) means our\nwhole network without the SCIM. (g) means the \ufb01nal model.\n\nThese weighted features are fed into a ConvLSTM [45] structure to gradually re\ufb01ne their\nspatial information for accurately identifying the salient objects. 
This procedure can be de\ufb01ned as:\n\n$$\\begin{aligned}\ni_t &= \\sigma(W_{xi} * \\tilde{f}_m^i + W_{hi} * H_{t-1} + W_{ci} \\circ C_{t-1} + b_i), \\\\\nf_t &= \\sigma(W_{xf} * \\tilde{f}_m^i + W_{hf} * H_{t-1} + W_{cf} \\circ C_{t-1} + b_f), \\\\\nC_t &= f_t \\circ C_{t-1} + i_t \\circ \\tanh(W_{xc} * \\tilde{f}_m^i + W_{hc} * H_{t-1} + b_c), \\\\\no_t &= \\sigma(W_{xo} * \\tilde{f}_m^i + W_{ho} * H_{t-1} + W_{co} \\circ C_t + b_o), \\\\\nH_t &= o_t \\circ \\tanh(C_t),\n\\end{aligned} \\tag{3}$$\n\nwhere $\\circ$ denotes the Hadamard product and $\\sigma(\\cdot)$ is the sigmoid function. A memory cell $C_t$ stores the\nearlier information. All $W_*$ and $b_*$ are model parameters to be learned. All the gates $i_t$, $f_t$, $o_t$,\nmemory cell $C_t$, and hidden state $H_t$ are 3D tensors. In this way, after 13 steps, four fused light \ufb01eld\nfeatures $\\{f_2, f_3, f_4, f_5\\}$ are effectively generated: $f_m = H_{13}$. The unique property of the light \ufb01eld\ndata makes it naturally suitable to use ConvLSTM for feature fusion. The ConvLSTM is also\nbene\ufb01cial for making better use of the spatial correlation between multiple focal slices thanks to its\npowerful gate and memory mechanism. At this point, our model reduces the MAE by nearly 14.7% on average\nacross our proposed dataset and the LFSD dataset (b vs d in Tab. 1).\nFurthermore, to capture global contextual information at different scales, we further extend a global\nperception module (GPM) on the top of $f_m$. The GPM can be de\ufb01ned as:\n\n$$F_m = Conv_{1 \\times 1}(D[f_m; \\amalg_{r \\in RS}(Conv_d(f_m; \\theta_m; r))]), \\quad m = 2, 3, 4, 5, \\tag{4}$$\n\nwhere $D[\\cdot; \\cdots; \\cdot]$ denotes concatenation operation. $\\amalg_{r \\in RS}(OP)$ means that the operation OP is\nperformed several times using different dilation rates $r$ in rates_set (denoted as $RS$) and all results\nare returned. $\\theta_m$ denotes the parameters to be learned in the $m$-th layer. $\\{F_m\\}_{m=2}^{5}$ are the \ufb01nal fused light \ufb01eld\nfeatures in multiple layers. 
At the end, we add several intermediate supervisions on $F_m$ in each layer\nto facilitate network convergence and encourage explicit fusion of those light \ufb01eld features.\n\nTable 1: Quantitative results of the ablation analysis for our network. The meaning of indexes has been explained in the caption of Fig. 3.\n\nIndex | Modules | Ours F\u03b2 \u2191 | Ours MAE \u2193 | LFSD F\u03b2 \u2191 | LFSD MAE \u2193\n(a) | RGB | 0.643 | 0.144 | 0.607 | 0.194\n(b) | LF (w/o weighting) | 0.805 | 0.074 | 0.781 | 0.121\n(c) | LF (with weighting) | 0.819 | 0.069 | 0.789 | 0.116\n(d) | +SFM (w/o GPM) | 0.821 | 0.062 | 0.797 | 0.105\n(e) | +SFM (with GPM) | 0.825 | 0.059 | 0.807 | 0.099\n(f) | +FIM (w/o SCIM) | 0.838 | 0.054 | 0.814 | 0.092\n(g) | +FIM (with SCIM) | 0.843 | 0.052 | 0.819 | 0.089\n\n4.3 The Memory-oriented Feature Integration Module (Mo-FIM)\n\nEf\ufb01cient integration of hierarchical deep features is signi\ufb01cant for pixel-wise prediction tasks, e.g., salient\nobject detection [60, 5] and semantic segmentation [34]. We propose a new memory-oriented module, which,\nfrom a novel perspective, utilizes the memory mechanism to effectively integrate multi-level light \ufb01eld features\nin a top-down manner. Speci\ufb01cally, as each channel of a feature map is considered as a \u2018feature detector\u2019 [53, 57],\nwe design a scene context integration module (SCIM) shown in Fig. 2, which utilizes memory information from higher layers to learn a\nchannel attention map and updates the current light \ufb01eld feature by focusing on important channels\nand suppressing unnecessary ones. Then, the ConvLSTM progressively integrates the high-level\nmemories and the current elaborately re\ufb01ned input. 
That is to say, the high-level features with\nabundant semantic information are gradually summarized as memory and then used to guide\nthe selection of low-level spatial details for precise saliency prediction.\nMore speci\ufb01cally, in the SCIM shown in Fig. 2, $H_{t-1}$ represents the previous scene understanding\n(i.e., hidden state of ConvLSTM in time step $t-1$) and $F_m$ means the fused light \ufb01eld feature in the\n$m$-th layer. The SCIM can be de\ufb01ned as:\n\n$$\\tilde{F}_m = \\delta(AvgPooling(W_1 * H_{t-1} \\oplus W_2 * F_m)) \\otimes F_m, \\tag{5}$$\n\nwhere $\\oplus$ and $\\otimes$ denote element-wise addition and multiplication, respectively. Then the updated\nfeature $\\tilde{F}_m$ is fed into a ConvLSTM cell to further summarize spatial information from the historical\nmemory and current input $\\tilde{F}_m$. We use the output of Block5 as the initial state of ConvLSTM and\nSCIM, i.e., $H_0 = F_5$. After 4 steps (corresponding to $\\tilde{F}_5$, $\\tilde{F}_4$, $\\tilde{F}_3$, $\\tilde{F}_2$, respectively), the output of the\nConvLSTM is followed by a transition convolutional layer and an up-sample operation to get the\n\ufb01nal saliency map $S$. The calculation procedure is similar to Equ. 3 by replacing the inputs.\nHowever, the top-down structure may cause high-level features to be diluted as they are transmitted\nto the lower layers. To address this problem, inspired by DenseNet [22], we link\nthe features in low and high levels in the way shown in Fig. 2, to alleviate gradient vanishing\nand meanwhile encourage feature reuse. The \ufb01nal light \ufb01eld features to be used can\nbe de\ufb01ned as: $F_m = \\sum_{r=m}^{5} F_r$, where $m$ is set to\n2, 3, 4, 5, successively. 
Besides, in order to\nguarantee each time step of the ConvLSTM\ncan explicitly learn the most important in-\nformation for accurately identifying salient\nobjects, we add intermediate supervisions\non all internal outputs of the ConvLSTM.\nGenerally speaking, those intermediate supervisions can act as instruction to guide the SCIM and\nConvLSTM to accurately \ufb01lter the non-salient areas and retain salient areas. Intermediate results are\nillustrated in Fig. 4. Full details about codes will be made publicly available.\n\nFigure 4: Visual results of the intermediate supervi-\nsions. In such a complex scene, our model can gradu-\nally optimize the saliency maps and produce a precise\nprediction.\n\n5 Experiments\n\n5.1 Datasets\n\nTo evaluate the performance of our proposed network, we conduct experiments on our proposed\ndataset and the only two public light \ufb01eld saliency datasets: LFSD [29] and HFUT [59].\nOurs: This dataset consists of 1462 light \ufb01eld samples. We randomly select 1000 samples for training\nand the remaining 462 samples for testing. More details can be found in Sec. 3.\nLFSD: This dataset contains 100 light \ufb01elds captured by Lytro camera. This dataset is proposed by\nLi et al., in [29], which pioneered the use of light \ufb01eld for solving challenging problems in SOD.\nHFUT: HFUT consists of 255 samples captured by Lytro camera. It is a challenging dataset, with\nthe real-life scenarios at various distances, sensors noises, lighting conditions, and so on.\nAll samples in LFSD and HFUT are used for testing to evaluate the generalization abilities of saliency\nmodels. To avoid over\ufb01tting, we augment the training set by \ufb02ipping, cropping and rotating.\n\n5.2 Experiments Setup\n\nEvaluation Metrics. 
We adopt \ufb01ve metrics for comprehensive evaluation, including the Precision-Recall\n(PR) curve, F-measure [1], Mean Absolute Error (MAE), S-measure [12] and E-measure [13].\nThey are widely accepted, standard metrics for evaluating a SOD model and are well explained in the\nliterature. Due to limited space, we do not give detailed descriptions.\n\nFigure 5: Illustration of the baseline network. Using RGB or light \ufb01eld data as input corresponds to\n(a) and (b) in Fig. 3 and Tab. 1, respectively. In terms of light \ufb01eld input, here, we use the \u2018concatenation\nwithout weighting\u2019 strategy to fuse light \ufb01eld features from different focal slices in each Conv-Block.\nFor fairness, the intermediate supervisions are the same as in our proposed network.\n\nFigure 6: The PR curves of our proposed method and other CNNs-based methods. Obviously, ours is\nconsistently outstanding over other approaches.\n\nImplementation Details. Our network is implemented in the PyTorch framework and trained with a\nGTX 2080 Ti GPU. All training and test images are uniformly resized to 256 \u00d7 256. Our network\nis trained in an end-to-end manner, in which the momentum, weight decay and learning rate are set\nto 0.9, 0.0005, 1e-10, respectively. During the training phase, we use softmax entropy loss, and\nthe network is trained by standard SGD and converges after 40 epochs with batch size of 1. The\ntwo backbone networks of the RGB and focal stack streams are all initialized with corresponding\npre-trained VGG-19 net [46]. Other parameters are initialized with Gaussian kernels.\n\n5.3 Ablation Studies\n\nThe Effectiveness of Light Field Data. Tab. 1 (a) and (b) show the detection results of our baseline\nnetwork illustrated in Fig. 5 with RGB data and with light \ufb01eld data, respectively. 
Numerical results\nmeasured by F-measure and MAE demonstrate that the network using light \ufb01eld data outperforms\nthe one only using RGB data. Fig. 3 (a) and (b) show the visual comparisons of the two aforementioned\nnetworks, respectively. This also indicates that light \ufb01eld data improve prediction performance under\nchallenging circumstances. Moreover, we conduct an experiment by repeating the RGB input-frame\n12 times, in such a way that the model architecture is identical to the 4D version but the input data is\nonly 2D. The quantitative results in terms of F-measure and MAE are 0.819 / 0.089 (focal slices) and\n0.740 / 0.140 (RGB) respectively. This further con\ufb01rms the effectiveness of the focusness information\nand our spatial fusion module.\nThe Effectiveness of Mo-SFM. To give evidence for the effectiveness of the Mo-SFM, we compare\nthe baseline network with the same network plus the Mo-SFM. Signi\ufb01cant improvement can be visually observed\nbetween them, as shown in Fig. 3 (b) and (e). Numerically, our Mo-SFM reduces the MAE\nby nearly 19.2% on average over the two datasets (b vs e in Tab. 1). To conduct further investigation, we provide internal inspection on\nthe Mo-SFM. The gradual improvements, as we add our feature weighting mechanism, ConvLSTM\nintegrator and the GPM into the Mo-SFM shown in Fig. 3 (c), (d) and (e), are consistent with our\nassertion that different contributions and spatial correlation of different focal slices are bene\ufb01cial to\nSOD. Also, GPM is shown to be able to adaptively detect objects of different scales. Quantitative\nresults in Tab. 1 also numerically show the accumulative accuracy gains from the three components.\nThe Effectiveness of Mo-FIM. The Mo-FIM is proposed for higher cognitive processing. Fig. 3\n(g) visually shows the in\ufb02uence of the addition of the Mo-FIM. We observe that considerable gains\n(the MAE is reduced by 11.8% and 10.1%, as shown in Tab. 1) are achieved. 
This result is logical since\nhigh-level features are gradually summarized as memory and then being used to guide the selection\nof low-level spatial details by using the Mo-FIM. Results in Fig. 3 show that removing the SCIM\nfrom the Mo-FIM may lead to false positives. This suggests that the SCIM effectively updates the\noriginal input according to memory-oriented scene understanding and may greatly bias the results.\n\nTable 2: Quantitative comparisons on the light \ufb01eld datasets. The best three results are shown in\nboldface, red, and green fonts respectively. \u2217 means non-deep-learning. 
- means no available results.\n\nTypes | Methods | Years | Ours: Es \u2191 S\u03b1 \u2191 F\u03b2 \u2191 MAE \u2193 | HFUT [59]: Es \u2191 S\u03b1 \u2191 F\u03b2 \u2191 MAE \u2193 | LFSD [29]: Es \u2191 S\u03b1 \u2191 F\u03b2 \u2191 MAE \u2193\n4D | Ours | - | 0.923 0.887 0.843 0.052 | 0.785 0.742 0.627 0.095 | 0.886 0.830 0.819 0.089\n4D | LFS\u2217 [29] | TPAMI\u201917 | 0.728 0.563 0.484 0.240 | 0.650 0.559 0.416 0.222 | 0.771 0.680 0.740 0.208\n4D | MCA\u2217 [59] | TOMM\u201917 | - - - - | 0.714 0.652 0.558 0.139 | 0.841 0.749 0.815 0.150\n4D | WSC\u2217 [56] | CVPR\u201915 | - - - - | - - - - | 0.794 0.706 0.706 0.156\n4D | DILF\u2217 [58] | IJCAI\u201915 | 0.805 0.705 0.641 0.168 | 0.701 0.669 0.529 0.148 | 0.810 0.755 0.728 0.168\n3D | TANet [6] | TIP\u201919 | 0.861 0.803 0.771 0.096 | 0.761 0.711 0.605 0.111 | 0.849 0.803 0.804 0.112\n3D | MMCI [7] | PR\u201919 | 0.853 0.785 0.750 0.116 | 0.748 0.711 0.608 0.116 | 0.848 0.799 0.796 0.128\n3D | PCA [5] | CVPR\u201918 | 0.857 0.800 0.762 0.100 | 0.757 0.730 0.619 0.104 | 0.846 0.807 0.801 0.112\n3D | PDNet [65] | arXiv\u201918 | 0.864 0.803 0.763 0.111 | 0.758 0.741 0.608 0.112 | 0.849 0.786 0.780 0.116\n3D | CTMF [17] | TCyb\u201917 | 0.881 0.823 0.790 0.100 | 0.747 0.723 0.596 0.119 | 0.856 0.801 0.791 0.119\n3D | DF [42] | TIP\u201917 | 0.838 0.716 0.733 0.151 | 0.701 0.641 0.531 0.156 | 0.816 0.751 0.756 0.162\n3D | CDCP\u2217 [66] | ICCVW\u201917 | 0.795 0.690 0.639 0.159 | 0.696 0.653 0.528 0.159 | 0.739 0.659 0.642 0.201\n3D | ACSD\u2217 [25] | ICIP\u201915 | 0.629 0.385 0.151 0.321 | 0.665 0.559 0.421 0.201 | 0.803 0.731 0.764 0.185\n3D | NLPR\u2217 [39] | ECCV\u201914 | 0.768 0.564 0.659 0.177 | 0.706 0.579 0.567 0.148 | 0.744 0.553 0.712 0.216\n2D | PiCANet [36] | CVPR\u201918 | 0.892 0.829 0.821 0.083 | 0.762 0.719 0.600 0.115 | 0.780 0.729 0.671 0.158\n2D | PAGRN [62] | CVPR\u201918 | 0.878 0.822 0.828 0.084 | 0.758 0.704 0.619 0.116 | 0.805 0.727 0.725 0.147\n2D | C2S [31] | ECCV\u201918 | 0.874 0.844 0.791 0.084 | 0.762 0.736 0.618 0.112 | 0.820 0.806 0.749 0.113\n2D | R3Net [11] | IJCAI\u201918 | 0.833 0.819 0.783 0.113 | 0.697 0.720 0.606 0.151 | 0.838 0.789 0.781 0.128\n2D | Amulet [60] | ICCV\u201917 | 0.882 0.847 0.805 0.083 | 0.737 0.739 0.615 0.118 | 0.821 0.773 0.757 0.135\n2D | UCF [61] | ICCV\u201917 | 0.850 0.837 0.769 0.107 | 0.729 0.736 0.596 0.144 | 0.776 0.762 0.710 0.169\n2D | NLDF [37] | CVPR\u201917 | 0.862 0.786 0.778 0.103 | 0.761 0.685 0.583 0.107 | 0.810 0.745 0.748 0.138\n2D | DSS [21] | CVPR\u201917 | 0.827 0.764 0.728 0.128 | 0.759 0.699 0.606 0.138 | 0.749 0.677 0.644 0.190\n2D | DHS [35] | CVPR\u201916 | 0.872 0.841 0.801 0.090 | 0.720 0.642 0.542 0.129 | 0.836 0.770 0.761 0.133\n2D | MST\u2217 [49] | CVPR\u201916 | 0.785 0.686 0.629 0.157 | 0.693 0.641 0.529 0.156 | 0.754 0.659 0.631 0.191\n2D | BSCA\u2217 [41] | CVPR\u201915 | 0.811 0.720 0.690 0.180 | 0.693 0.651 0.530 0.193 | 0.777 0.718 0.688 0.203\n2D | DSR\u2217 [30] | ICCV\u201913 | 0.799 0.678 0.645 0.164 | 0.695 0.655 0.518 0.153 | 0.736 0.633 0.631 0.208\n\nThe Limitations of Our Approach. In this paper, we present a deep-learning-based light field saliency detection method that deeply explores and comprehensively exploits the internal correlation of focal slices. We demonstrate the success of our method by achieving state-of-the-art results on three datasets. We see this work as opening two potential directions for future study. The first is building a larger and more versatile dataset for training and validating different models. We present one dataset, on which we train our model and test other 2D, 3D and 4D models, but a bigger one could improve the generalization ability of all models trained on it. The other direction is developing a more computation-efficient and memory-efficient method, as the focal stack is employed in the training process. 
We present the first deep-learning-based method for light field saliency detection, but there are other lightweight models that could potentially benefit from the light field data.\n\n5.4 Comparisons with State-of-the-arts\n\nWe compare results from our method with those of 25 other 2D, 3D and 4D methods, including both deep-learning-based and non-deep-learning ones (marked with \u2217). There are 4 4D light field methods: LFS\u2217 [29], MCA\u2217 [59], WSC\u2217 [56], DILF\u2217 [58]; 9 3D RGB-D methods: TANet [6], MMCI [7], PCA [5], PDNet [65], CTMF [17], DF [42], CDCP\u2217 [66], ACSD\u2217 [25], NLPR\u2217 [39]; and 12 top-ranking RGB methods: PiCANet [36], PAGRN [62], C2S [31], R3Net [11], Amulet [60], UCF [61], NLDF [37], DSS [21], DHS [35], MST\u2217 [49], BSCA\u2217 [41], DSR\u2217 [30]. For fair comparison, the results of competing methods are generated by authorized code or directly provided by the authors.\nQuantitative Evaluation. Quantitative results are shown in Tab. 2. The proposed model consistently achieves the best scores on all three datasets across the four evaluation metrics. An important observation should be noted: compared to the latest CNN-based RGB SOD methods trained on large training sets, our method achieves significant advantages with a relatively small training set. This indicates that light field data are significant and promising. Fig. 6 shows that the PR curves of our method outperform those of the top-ranking approaches.\nQualitative Evaluation. Fig. 7 shows some representative results comparing our method with current state-of-the-art methods. Our method is able to handle a wide range of challenging scenes, including small objects (1st row), similar foreground and background (2nd, 4th and 9th rows), cluttered backgrounds (3rd-5th and 8th rows), and other difficult scenes (6th and 7th rows). 
In those complex cases, we can see that our predicted results benefit from the light field data and from our proposed network, in which the light field features from different focal slices are effectively fused and the multi-level global semantic information and local detail cues are sufficiently integrated.\n\nFigure 7: Visual comparisons of our method with top-ranking CNNs-based methods in some challenging cases. Our model can generate precise saliency results even in those complex scenes, which indicates that our method takes full advantage of light fields for accurate saliency prediction.\n\n6 Conclusion\n\nIn this paper, we develop a novel memory-oriented decoder tailored for light field saliency detection. Our Mo-SFM resembles the memory mechanism by which humans fuse information and effectively excavates the various contributions and spatial correlations of different focal slices. The Mo-FIM also sufficiently integrates multi-level features by leveraging high-level memory to guide low-level selection. Additionally, we introduce a large-scale light field saliency dataset to pave the way for future studies. Experiments show that our method achieves superior performance over 25 methods including 2D, 3D and 4D ones, especially in complex scenarios.\n\nAcknowledgements\n\nThis work was supported by the National Natural Science Foundation of China (61605022 and 61976035) and the Fundamental Research Funds for the Central Universities (DUT19JC58). The authors are grateful to the reviewers for their suggestions in improving the quality of the paper.\n\nReferences\n[1] R. Achanta, S. S. Hemami, F. J. Estrada, and S. S\u00fcsstrunk. Frequency-tuned salient region detection. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1597\u20131604, 2009.\n\n[2] A. Borji, M. Cheng, H. Jiang, and J. Li. Salient object detection: A survey. 
arXiv preprint arXiv:1411.5878,\n\n2014.\n\n[3] A. Borji, M.-M. Cheng, H. Jiang, and J. Li. Salient object detection: A benchmark. IEEE Transactions on\n\nImage Processing (TIP), 24(12):5706\u20135722, 2015.\n\n[4] A. Borji, D. N. Sihite, and L. Itti. Salient object detection: a benchmark. In European Conference on\n\nComputer Vision (ECCV), pages 414\u2013429, 2012.\n\n[5] H. Chen and Y. Li. Progressively complementarity-aware fusion network for rgb-d salient object detection.\n\nIn Conference on Computer Vision and Pattern Recognition (CVPR), pages 3051\u20133060, 2018.\n\n[6] H. Chen and Y. Li. Three-stream attention-aware network for rgb-d salient object detection.\n\nIEEE\n\nTransactions on Image Processing (TIP), 28(6):2825\u20132835, 2019.\n\n[7] H. Chen, Y. Li, and D. Su. Multi-modal fusion network with multi-scale multi-path and cross-modal\n\ninteractions for rgb-d salient object detection. Pattern Recognition, 86:376\u2013385, 2019.\n\n[8] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu. Global contrast based salient region\ndetection. In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), volume 37, pages\n409\u2013416, 2011.\n\n[9] C. Craye, D. Filliat, and J.-F. Goudou. Environment exploration for object-based visual saliency learning.\n\nIn IEEE International Conference on Robotics and Automation (ICRA), pages 2303\u20132309, 2016.\n\n[10] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: object detection via region-based fully convolutional networks.\n\nInternational Conference on Neural Information Processing Systems (NIPS), pages 379\u2013387, 2016.\n\n[11] Z. Deng, X. Hu, L. Zhu, X. Xu, J. Qin, G. Han, and P.-A. Heng. R3net: Recurrent residual re\ufb01nement\nnetwork for saliency detection. 
In International Joint Conference on Arti\ufb01cial Intelligence (IJCAI), pages\n684\u2013690, 2018.\n\n[12] D.-P. Fan, M.-M. Cheng, Y. Liu, T. Li, and A. Borji. Structure-measure: A new way to evaluate foreground\n\nmaps. In International Conference on Computer Vision (ICCV), pages 4558\u20134567, 2017.\n\n[13] D.-P. Fan, C. Gong, Y. Cao, B. Ren, M.-M. Cheng, and A. Borji. Enhanced-alignment measure for binary\nforeground map evaluation. In International Joint Conference on Arti\ufb01cial Intelligence (IJCAI), pages\n698\u2013704, 2018.\n\n[14] D. Feng, N. Barnes, S. You, and C. McCarthy. Local background enclosure for rgb-d salient object\ndetection. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2343\u20132350, 2016.\n[15] D. Gao, V. Mahadevan, and N. Vasconcelos. The discriminant centersurround hypothesis for bottom-up\n\nsaliency. In International Conference on Neural Information Processing Systems (NIPS), 2007.\n\n[16] X. Guo, Z. Chen, S. Li, Y. Yang, and J. Yu. Deep depth inference using binocular and monocular cues.\n\narXiv preprint arXiv:1711.10729, 2017.\n\n[17] J. Han, H. Chen, N. Liu, C. Yan, and X. Li. Cnns-based rgb-d saliency detection via cross-view transfer\n\nand multiview fusion. IEEE Transactions on Systems, Man, and Cybernetics, 48(11):3171\u20133183, 2018.\n[18] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In International Conference on Neural\n\nInformation Processing Systems (NIPS), pages 545\u2013552, 2006.\n\n[19] B. Hariharan, P. A. Arbel\u00e1ez, R. B. Girshick, and J. Malik. Hypercolumns for object segmentation and\n\ufb01ne-grained localization. In Conference on Computer Vision and Pattern Recognition (CVPR), pages\n447\u2013456, 2015.\n\n[20] S. Hong, T. You, S. Kwak, and B. Han. 
Online tracking by learning discriminative saliency map with\nconvolutional neural network. International Conference on Machine Learning (ICML), pages 597\u2013606,\n2015.\n\n[21] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. H. S. Torr. Deeply supervised salient object detection\nwith short connections. In Conference on Computer Vision and Pattern Recognition (CVPR), volume 41,\npages 815\u2013828, 2017.\n\n[22] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks.\n\nIn Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261\u20132269, 2017.\n\n[23] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE\n\nTransactions on Pattern Analysis and Machine Intelligence (TPAMI), 20(11):1254\u20131259, 1998.\n\n[24] P. Jiang, H. Ling, J. Yu, and J. Peng. Salient region detection by ufo: Uniqueness, focusness and objectness.\n\nIn International Conference on Computer Vision (ICCV), pages 1976\u20131983, 2013.\n\n[25] R. Ju, L. Ge, W. Geng, T. Ren, and G. Wu. Depth saliency based on anisotropic center-surround difference.\n\nIn International Conference on Image Processing (ICIP), pages 1115\u20131119, 2014.\n\n[26] G. Lee, Y.-W. Tai, and J. Kim. Deep saliency with encoded low level distance map and high level features.\n\nIn Conference on Computer Vision and Pattern Recognition (CVPR), pages 660\u2013668, 2016.\n\n[27] G. Li and Y. Yu. Visual saliency based on multiscale deep features. In Conference on Computer Vision and\n\nPattern Recognition (CVPR), pages 5455\u20135463, 2015.\n\n[28] N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu. Saliency detection on light \ufb01eld. In Conference on Computer Vision\n\nand Pattern Recognition (CVPR), pages 2806\u20132813, 2014.\n\n[29] N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu. Saliency detection on light \ufb01eld. 
IEEE Transactions on Pattern\n\nAnalysis and Machine Intelligence (TPAMI), 39(8):1605\u20131616, 2017.\n\n[30] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang. Saliency detection via dense and sparse reconstruction.\n\nIn International Conference on Computer Vision (ICCV), pages 2976\u20132983, 2013.\n\n[31] X. Li, F. Yang, H. Cheng, W. Liu, and D. Shen. Contour knowledge transfer for salient object detection. In\n\nEuropean Conference on Computer Vision (ECCV), pages 370\u2013385, 2018.\n\n[32] X. Li, L. Zhao, L. Wei, M.-H. Yang, F. Wu, Y. Zhuang, H. Ling, and J. Wang. Deepsaliency: Multi-task\ndeep neural network model for salient object detection. IEEE Transactions on Image Processing (TIP),\n25(8):3919\u20133930, 2016.\n\n[33] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille. The secrets of salient object segmentation. In\n\nConference on Computer Vision and Pattern Recognition (CVPR), pages 280\u2013287, 2014.\n\n[34] G. Lin, A. Milan, C. Shen, and I. D. Reid. Re\ufb01nenet: Multi-path re\ufb01nement networks for high-resolution\nIn Conference on Computer Vision and Pattern Recognition (CVPR), pages\n\nsemantic segmentation.\n5168\u20135177, 2017.\n\n[35] N. Liu and J. Han. Dhsnet: Deep hierarchical saliency network for salient object detection. In Conference\n\non Computer Vision and Pattern Recognition (CVPR), pages 678\u2013686, 2016.\n\n[36] N. Liu, J. Han, and M.-H. Yang. Picanet: Learning pixel-wise contextual attention for saliency detection.\n\nIn Conference on Computer Vision and Pattern Recognition (CVPR), pages 3089\u20133098, 2018.\n\n[37] Z. Luo, A. K. Mishra, A. Achkar, J. A. Eichel, S. Li, and P.-M. Jodoin. Non-local deep features for salient\nobject detection. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 6593\u20136601,\n2017.\n\n[38] Y. Niu, Y. Geng, X. Li, and F. Liu. Leveraging stereopsis for saliency analysis. 
In Conference on Computer\n\nVision and Pattern Recognition (CVPR), pages 454\u2013461, 2012.\n\n[39] H. Peng, B. Li, W. Xiong, W. Hu, and R. Ji. Rgbd salient object detection: A benchmark and algorithms.\n\nIn European Conference on Computer Vision (ECCV), pages 92\u2013109, 2014.\n\n[40] F. Perazzi, P. Kr\u00e4henb\u00fchl, Y. Pritch, and A. Hornung. Saliency \ufb01lters: Contrast based \ufb01ltering for salient\nregion detection. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 733\u2013740,\n2012.\n\n[41] Y. Qin, H. Lu, Y. Xu, and H. Wang. Saliency detection via cellular automata. In Conference on Computer\n\nVision and Pattern Recognition (CVPR), pages 110\u2013119, 2015.\n\n[42] L. Qu, S. He, J. Zhang, J. Tian, Y. Tang, and Q. Yang. Rgbd salient object detection via deep fusion. IEEE\n\nTransactions on Image Processing (TIP), 26(5):2274\u20132285, 2017.\n\n[43] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster r-cnn: towards real-time object detection with region\nproposal networks. In International Conference on Neural Information Processing Systems (NIPS), volume\n2015, pages 91\u201399, 2015.\n\n[44] L. Shao and M. Brady. Speci\ufb01c object retrieval based on salient regions. Pattern Recognition, 39(10):1932\u2013\n\n1948, 2006.\n\n[45] X. Shi, Z. Chen, H. Wang, D. Y. Yeung, W. K. Wong, and W. Woo. Convolutional lstm network: a\nmachine learning approach for precipitation nowcasting. International Conference on Neural Information\nProcessing Systems (NIPS), pages 802\u2013810, 2015.\n\n[46] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.\n\nInternational Conference on Learning Representations (ICLR), 2015.\n\n[47] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking:\nAn experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI),\n36(7):1442\u20131468, 2014.\n\n[48] G. Song and K. M. Lee. 
Depth estimation network for dual defocused images with different depth-of-\ufb01eld.\n\nIn International Conference on Image Processing (ICIP), pages 1563\u20131567, 2018.\n\n[49] W.-C. Tu, S. He, Q. Yang, and S.-Y. Chien. Real-time salient object detection with a minimum spanning\n\ntree. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2334\u20132342, 2016.\n\n[50] L. Wang, H. Lu, X. Ruan, and M.-H. Yang. Deep networks for saliency detection via local estimation and\nglobal search. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 3183\u20133192,\n2015.\n\n[51] T.-C. Wang, J.-Y. Zhu, E. Hiroaki, M. Chandraker, A. A. Efros, and R. Ramamoorthi. A 4d light-\ufb01eld\ndataset and cnn architectures for material recognition. European Conference on Computer Vision (ECCV),\npages 121\u2013138, 2016.\n\n[52] W. Wang, Q. Lai, H. Fu, J. Shen, and H. Ling. Salient object detection in the deep learning era: An in-depth\n\nsurvey. arXiv preprint arXiv:1904.09146, 2019.\n\n[53] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon. Cbam: Convolutional block attention module. European\n\nConference on Computer Vision (ECCV), pages 3\u201319, 2018.\n\n[54] S. Xie and Z. Tu. Holistically-nested edge detection. International Journal of Computer Vision (IJCV),\n\n125(1-3):3\u201318, 2015.\n\n[55] H. W. F. Yeung, J. Hou, X. Chen, J. Chen, Z. Chen, and Y. Y. Chung. Light \ufb01eld spatial super-resolution\nusing deep ef\ufb01cient spatial-angular separable convolution. IEEE Transactions on Image Processing (TIP),\n28(5):2319\u20132330, 2019.\n\n[56] N. yi Li, B. Sun, and J. Yu. A weighted sparse coding framework for saliency detection. In Conference on\n\nComputer Vision and Pattern Recognition (CVPR), pages 5216\u20135223, 2015.\n\n[57] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. European Conference\n\non Computer Vision (ECCV), pages 818\u2013833, 2014.\n\n[58] J. Zhang, M. Wang, J. Gao, Y. Wang, X. 
Zhang, and X. Wu. Saliency detection with a deeper investigation\n\nof light \ufb01eld. In International Conference on Arti\ufb01cial Intelligence (IJCAI), pages 2212\u20132218, 2015.\n\n[59] J. Zhang, M. Wang, L. Lin, X. Yang, J. Gao, and Y. Rui. Saliency detection on light \ufb01eld: A multi-cue\napproach. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM),\n13(3):32, 2017.\n\n[60] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan. Amulet: Aggregating multi-level convolutional features\nfor salient object detection. In International Conference on Computer Vision (ICCV), pages 202\u2013211,\n2017.\n\n[61] P. Zhang, D. Wang, H. Lu, H. Wang, and B. Yin. Learning uncertain convolutional features for accurate\n\nsaliency detection. In International Conference on Computer Vision (ICCV), pages 212\u2013221, 2017.\n\n[62] X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang. Progressive attention guided recurrent network for salient\nobject detection. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 714\u2013722,\n2018.\n\n[63] R. Zhao, W. Ouyang, H. Li, and X. Wang. Saliency detection by multi-context deep learning. In Conference\n\non Computer Vision and Pattern Recognition (CVPR), pages 1265\u20131274, 2015.\n\n[64] W. Zhou, L. Liang, H. Zhang, A. Lumsdaine, and L. Lin. Scale and orientation aware epi-patch learning for\nlight \ufb01eld depth estimation. In International Conference on Pattern Recognition (ICPR), pages 2362\u20132367,\n2018.\n\n[65] C. Zhu, X. Cai, K. Huang, T. H. Li, and G. Li. Pdnet: Prior-model guided depth-enhanced network for\n\nsalient object detection. arXiv preprint arXiv:1803.08636, 2018.\n\n[66] C. Zhu, G. Li, W. Wang, and R. Wang. An innovative salient object detection using center-dark channel\n\nprior. In International Conference on Computer Vision Workshops (ICCVW), pages 1509\u20131515, 2017.\n\n[67] H. Zhu, M. Guo, H. Li, Q. Wang, and A. Robles-Kelly. 
Breaking the spatio-angular trade-off for light \ufb01eld\n\nsuper-resolution via lstm modelling on epipolar plane images. arXiv preprint arXiv:1902.05672, 2019.\n\n[68] W. Zhu, S. Liang, Y. Wei, and J. Sun. Saliency optimization from robust background detection. In\n\nConference on Computer Vision and Pattern Recognition (CVPR), pages 2814\u20132821, 2014.", "award": [], "sourceid": 480, "authors": [{"given_name": "Miao", "family_name": "Zhang", "institution": "Dalian University of Technology"}, {"given_name": "Jingjing", "family_name": "Li", "institution": "Dalian University of Technology"}, {"given_name": "JI", "family_name": "WEI", "institution": "Dalian University of Technology"}, {"given_name": "Yongri", "family_name": "Piao", "institution": "Dalian University of Technology"}, {"given_name": "Huchuan", "family_name": "Lu", "institution": "Dalian University of Technology"}]}