{"title": "A Self Validation Network for Object-Level Human Attention Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 14729, "page_last": 14740, "abstract": "Due to the foveated nature of the human vision system, people can focus their visual attention on a small region of their visual field at a time, which usually contains only a single object. Estimating this object of attention in first-person (egocentric) videos is useful for many human-centered real-world applications such as augmented reality applications and driver assistance systems. A straightforward solution for this problem is to pick the object whose bounding box is hit by the gaze, where eye gaze point estimation is obtained from a traditional eye gaze estimator and object candidates are generated from an off-the-shelf object detector. However, such an approach can fail because it addresses the where and the what problems separately, even though they are highly related chicken-and-egg problems. In this paper, we propose a novel unified model that incorporates both spatial and temporal evidence in identifying as well as locating the attended object in first-person videos. It introduces a novel Self Validation Module that enforces and leverages consistency of the where and the what concepts. 
We evaluate on two public datasets, demonstrating that the Self Validation Module significantly benefits both training and testing and that our model outperforms the state-of-the-art.", "full_text": "A Self Validation Network for Object-Level Human Attention Estimation\n\nZehua Zhang,1 Chen Yu,2 David Crandall1\n\n1Luddy School of Informatics, Computing, and Engineering\n\n2Department of Psychological and Brain Sciences\n\nIndiana University Bloomington\n\n{zehzhang, chenyu, djcran}@indiana.edu\n\nAbstract\n\nDue to the foveated nature of the human vision system, people can focus their visual\nattention on only a small region of their visual \ufb01eld at a time, which usually contains\na single object. Estimating this object of attention in \ufb01rst-person (egocentric) videos\nis useful for many human-centered real-world applications such as augmented\nreality and driver assistance systems. A straightforward solution for this problem\nis to \ufb01rst estimate the gaze with a traditional gaze estimator and generate object\ncandidates from an off-the-shelf object detector, and then pick the object that the\nestimated gaze falls in. However, such an approach can fail because it addresses\nthe where and the what problems separately, even though they are highly related\nchicken-and-egg problems. In this paper, we propose a novel uni\ufb01ed model that\nincorporates both spatial and temporal evidence in identifying as well as locating\nthe attended object in \ufb01rst-person videos. 
It introduces a novel Self Validation\nModule that enforces and leverages consistency of the where and the what concepts.\nWe evaluate on two public datasets, demonstrating that the Self Validation Module\nsigni\ufb01cantly bene\ufb01ts both training and testing and that our model outperforms the\nstate-of-the-art.\n\n1\n\nIntroduction\n\nHumans can focus their visual attention on only a small part of their surroundings at any moment,\nand thus have to choose what to pay attention to in real time [43]. Driven by the tasks and intentions\nwe have in mind, we manage attention with our foveated visual system by adjusting our head pose\nand our gaze point in order to focus on the most relevant object in the environment at any moment in\ntime [8, 17, 29, 47, 62].\nThis close relationship between intention, attention, and semantic objects has inspired a variety\nof work in computer vision, including image classi\ufb01cation [26], object detection [27, 46, 50, 52],\naction recognition [4, 36, 42, 48], action prediction [53], video summarization [30], visual search\nmodeling [51], and irrelevant frame removal [38], in which the attended object estimation serves as\nauxiliary information. Despite being a key component of these papers, how to identify and locate the\nimportant object is seldom studied explicitly. This problem in and of itself is of broad potential use in\nreal-world applications such as driver assistance systems and intelligent human-like robots.\nIn this paper, we discuss how to identify and locate the attended object in \ufb01rst-person videos. Recorded\nby head-mounted cameras along with eye trackers, \ufb01rst-person videos capture an approximation\nof what people see in their \ufb01elds of view as they go about their lives, yielding interesting data for\nstudying real-time human attention. 
In contrast to gaze studies of static images or pre-recorded\nvideos, \ufb01rst-person video is unique in that there is exactly one correct point of attention in each frame,\nas a camera wearer can only gaze at one point at a time. Accordingly, one and only one gazed object\nexists for each frame, re\ufb02ecting the camera wearer\u2019s real-time attention and intention. We will use\nthe term object of interest to refer to the attended object in our later discussion.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Among the many objects appearing in an egocentric video frame of a person\u2019s \ufb01eld of view, we want\nto identify and locate the object to which the person is visually attending. Combining traditional eye gaze estimators\nand existing object detectors can fail when eye gaze prediction (blue dot) is slightly incorrect, such as when\n(a) it falls in the intersection of two object bounding boxes or (b) it lies between two bounding boxes sharing\nthe same class. Red boxes show the actual attended object according to ground truth gaze and yellow dashed boxes\nshow incorrect predictions.\n\nSome recent work [22, 66, 68] has discussed estimating probability maps of ego-attention\nor predicting gaze points in egocentric videos. However, people think not in terms of points\nin their \ufb01eld of view, but in terms of the objects that they are attending to. Of course, the\nobject of interest could be obtained by \ufb01rst estimating the gaze with a gaze estimator and\ngenerating object candidates from an off-the-shelf object detector, and then picking the object\nthat the estimated gaze falls in. Because this bottom-up approach estimates where and\nwhat separately, it could be doomed to fail if the eye gaze prediction is slightly inaccurate,\nsuch as falling between two objects or in the intersection of multiple object bounding boxes\n(Figure 1). 
To assure consistency, one may think of performing anchor-level attention estimation\nand directly predicting the attended box by modifying existing object detectors. The class can be\npredicted either simultaneously with the anchor-level attention estimation using the same set of\nfeatures, as in SSD [40], or afterwards using the\nfeatures pooled within the attended box, as in Faster-RCNN [49]. Either way, these methods still\ndo not yield satisfying performance, as we will show in Sec. 4.2, because they lack the ability to\nleverage the consistency to re\ufb01ne the results.\nWe propose to identify and locate the object of interest by jointly estimating where it is within the\nframe as well as recognizing what its identity is. In particular, we propose a novel model \u2014 which\nwe cheekily call Mindreader Net or Mr. Net \u2014 to jointly solve the problem. Our model incorporates\nboth spatial evidence within frames and temporal evidence across frames, in a network architecture\n(which we call the Cogged Spatial-Temporal Module) with separate spatial and temporal branches to\navoid feature entanglement.\nA key feature of our model is that it explicitly enforces and leverages a simple but extremely useful\nconstraint: our estimate of what is being attended should be located in exactly the position of where\nwe estimate the attention to be. Our Self Validation Module \ufb01rst computes similarities between the\nglobal object of interest class prediction vector and each local anchor box class prediction vector, and uses them as the\nattention validation score to update the anchor attention score prediction. Then, with the updated\nanchor attention score, it selects the attended anchor and uses its corresponding class prediction score\nto update the global object of interest class prediction. 
With global context originally incorporated by\nextracting features from the whole clip using 3D convolution, the Self Validation Module helps the\nnetwork focus on the local context in a spatially-local anchor box and a temporally-local frame.\nWe evaluate the approach on two existing \ufb01rst-person video datasets that include attended object\nground truth annotations. We show our approach outperforms baselines, and that our Self Validation\nModule not only improves performance by re\ufb01ning the outputs with visual consistency during testing,\nbut also helps bridge multiple components together during training to guide the model to learn a\nhighly meaningful latent representation. More information is available at http://vision.soic.indiana.edu/mindreader/.\n\nFigure 2: The architecture of our proposed Mindreader Net. Numbers indicate output size of each component\n(where c is the number of object classes). Softmax is applied before computing the losses on global classi\ufb01cation\nLglobalclass, anchor box classi\ufb01cation Lboxclass, and attention Lattn (which is \ufb01rst \ufb02attened to be 8732-d).\nPlease refer to supplementary materials for details about the Cogged Spatial-Temporal Module.\n\n2 Related Work\n\nCompared with many efforts to understand human attention by modeling eye gaze [2, 7, 16, 20\u201322, 24, 34, 35, 45, 59, 60, 64, 66, 68] or saliency [19, 25, 31\u201333, 39, 55, 67, 69], there are relatively\nfew papers that detect object-level attention. Lee et al. [30] address video summarization with\nhand-crafted features to detect important people and objects, while object-level reasoning plays a key\nrole in Baradel et al.\u2019s work on understanding videos through interactions of important objects [4].\nIn the particular case of egocentric video, Pirsiavash and Ramanan [48] and Ma et al. [42] detect\nobjects in hands as a proxy for attended objects to help action recognition. 
However, eye gaze\nusually precedes hand motion and thus objects in hand are not always those being visually attended\n(Fig. 1a). Shen et al. [53] combine eye gaze ground truth and detected object bounding boxes to\nextract attended object information for future action prediction. EgoNet [5], among the \ufb01rst papers to\nfocus on important object detection in \ufb01rst-person videos, combines visual appearance and 3D layout\ninformation to generate probability maps of object importance. Multiple objects can be detected in a\nsingle frame, making their results more similar to saliency than human attention in egocentric videos.\nPerhaps the most related work to ours is Bertasius et al.\u2019s Visual-Spatial Network (VSN) [6],\nwhich proposes an unsupervised method for important object detection in \ufb01rst-person videos that\nincorporates the idea of consistency between the where and what concepts to facilitate learning.\nHowever, VSN requires a much more complicated training strategy of switching the cascade order\nof the two pathways multiple times, whereas we present a uni\ufb01ed framework that can be learned\nend-to-end.\n3 Our approach\n\nGiven a video captured with a head-mounted camera, our goal is to detect the object that is visually\nattended in each frame. This is challenging because egocentric videos can be highly cluttered, with\nmany competing objects vying for attention. We thus incorporate temporal cues that consider multiple\nframes at a time. We \ufb01rst consider performing detection for the middle frame of a short input\nsequence (as in [42]), and then further develop it to work online (considering only past information)\nby performing detection on the last frame. Our novel model consists of two main parts (Figure 2),\nwhich we call the Cogged Spatial-Temporal Module and the Self Validation Module.\n\n3.1 Cogged Spatial-Temporal Module\n\nThe Cogged Spatial-Temporal Module consists of a spatial and a temporal branch. 
The \u201ccogs\u201d refer\nto the way that the outputs of each layer of the two branches are combined together, reminiscent of\nthe interlocking cogs of two gears (Figure 2). Please see supplementary material for more details.\nThe Spatial Gear Branch, inspired by SSD300 [40], takes a single frame $I_t$ of size h \u00d7 w and\nperforms spatial prediction of local anchor box offsets and anchor box classes. It is expected to work\nas an object detector, although we only have ground truth for the objects of interest to train it, so\nwe do not add an extra background class as in [40], and only compute losses for the spatial-based\ntasks on the matched positive anchors. We use atrous [10, 65] VGG16 [54] as the backbone and\nfollow a similar anchor box setting as [40]. We also apply the same multi-anchor matching strategy.\nWith the spatial branch, we obtain anchor box offset predictions $O \\in \\mathbb{R}^{a \\times 4}$ and class predictions\n$C_{box} \\in \\mathbb{R}^{a \\times c}$, where a is the number of anchor boxes and c is the number of classes in our problem.\nFollowing SSD300 [40], we have a = 8732, h = 300, and w = 300.\n\n[Figure 2 diagram labels. Inputs: single RGB frame (300 x 300 x 3), RGB sequence (15 x 300 x 300 x 3), optical flows (15 x 300 x 300 x 2). Cogged Spatial-Temporal Module (Spatial Gear Branch, Temporal Gear Branch) outputs: object of interest class logits (1 x c), anchor box class logits (8732 x c), anchor box attention logits (8732 x 1), anchor box offsets (8732 x 4). Self Validation Module: cosine similarity (8732 x 1), updated anchor box attention logits (8732 x 1), attended anchor box\u2019s offsets (1 x 4), location (1 x 4), and class logits (1 x c), updated object of interest class logits (1 x c).]\n\nThe Temporal Gear Branch takes N continuous RGB frames $I_{t-\\frac{N-1}{2},\\, t+\\frac{N-1}{2}}$ as well as N corresponding\noptical \ufb02ow \ufb01elds $F_{t-\\frac{N-1}{2},\\, t+\\frac{N-1}{2}}$, both of spatial resolution h \u00d7 w (with N = 15, set\nempirically). 
We use Inception-V1 [58] I3D [9] as the backbone of our temporal branch. With\naggregated global features from 3D convolution, we obtain global object of interest class predictions\n$C_{global} \\in \\mathbb{R}^{1 \\times c}$ and anchor box attention predictions $A \\in \\mathbb{R}^{a \\times 1}$. We match the ground truth box only\nto the anchor with the greatest overlap (intersection over union). The matching strategy is empirical\nand discussed in Section 4.3.\n\n3.2 Self Validation Module\n\nThe Self Validation Module connects the above branches and delivers global and local context\nbetween the two branches at both spatial (e.g., whole frame versus an anchor box) and temporal (e.g.,\nwhole sequence versus a single frame) levels. It incorporates the constraint on consistency between\nwhere and what by embedding a double validation mechanism: what\u2212\u2192where and where\u2212\u2192what.\nWhat\u2212\u2192where. With the outputs of the Cogged Spatial-Temporal Module, we compute the cosine\nsimilarities between the global class prediction $C_{global}$ and the class prediction for each anchor box,\n$C_{box_i}$, yielding an attention validation score for each box i,\n\n$$V^{i}_{attn} = \\frac{C_{global}\\, C_{box_i}^{T}}{\\|C_{global}\\| \\times \\|C_{box_i}\\|}. \\qquad (1)$$\n\nThen the attention validation vector $V_{attn} = [V^{1}_{attn}, V^{2}_{attn}, ..., V^{a}_{attn}] \\in \\mathbb{R}^{a \\times 1}$ is used to update the\nanchor box attention scores A by element-wise summation, $A' = A + V_{attn}$. Since $-1 \\le V^{i}_{attn} \\le 1$,\nwe make the optimization easier by rescaling each $A_i$ to the range [\u22121, 1],\n\n$$A' = R(A) + V_{attn} = \\frac{A - (\\max(A) + \\min(A))/2}{\\max(A) - (\\max(A) + \\min(A))/2} + V_{attn}, \\qquad (2)$$\n\nwhere max() and min() are element-wise vector operations.\nWhere\u2212\u2192what. Intuitively, obtaining the attended anchor box index m is a simple matter of\ncomputing $m = \\mathrm{argmax}(A')$, and the class validation score is simply $V_{class} = C_{box_m}$. 
Similarly, after\nrescaling with R(\u00b7) from Equation 2, we take the element-wise sum of $V_{class}$ and $C_{global}$ to update the global object of\ninterest class prediction, $C'_{global} = R(C_{global}) + R(V_{class})$. However, the hard\nargmax is not differentiable, and thus gradients are not able to backpropagate properly during training.\nWe thus use soft argmax. Softmax is applied to the updated anchor box attention score $A'$ to produce\na weighting vector $\\tilde{A}'$ for class validation score estimation,\n\n$$\\hat{V}_{class} = \\sum_{i=1}^{a} \\tilde{A}'_i\\, C_{box_i}, \\quad \\text{with} \\quad \\tilde{A}'_i = \\frac{e^{A'_i}}{\\sum_{j=1}^{a} e^{A'_j}}. \\qquad (3)$$\n\nNow we replace $V_{class}$ with $\\hat{V}_{class}$ to update $C_{global}$, $C'_{global} = R(C_{global}) + R(\\hat{V}_{class})$.\nThis soft what\u2212\u2192where validation is closely related to the soft attention mechanism widely used in\nmany recent papers [3, 11, 41, 56, 61, 63]. While soft attention learns the mapping itself inside the\nmodel, we explicitly incorporate the coherence of the where and what concepts into our model to\nself-validate the output during both training and testing. In contrast to soft attention, which describes\nrelationships between e.g. words, graph nodes, etc., this self-validation mechanism naturally mirrors\nthe visual consistency of our foveated vision system.\n\n3.3 Implementation and training details\n\nWe implemented our model with Keras [12] and TensorFlow [1]. A batch normalization layer [23]\nis inserted after each layer in both spatial and temporal backbones, and momentum for batch\nnormalization is 0.8. Batch normalization is not used in the four prediction heads. We found\npretraining the spatial branch helps the model converge faster. No extra data is introduced as we\nstill only use the labels of the objects of interest for pretraining. VGG16 [54] is initialized with\nweights pretrained on ImageNet [14]. 
We use Sun et al.\u2019s method [44, 57] to extract optical \ufb02ow and\nfollow [9] to truncate the maps to [\u221220, 20] and then rescale them to [\u22121, 1]. The RGB input to the\nTemporal Gear Branch is rescaled to [\u22121, 1] [9], while for the Spatial Gear Branch the RGB input is\nnormalized to have 0 mean and the channels are permuted to BGR.\nWhen training the whole model, the spatial branch is initialized with the pretrained weights from\nabove. The I3D backbone is initialized with weights pretrained on Kinetics [28] and ImageNet [14],\nwhile other parts are randomly initialized. We use stochastic gradient descent with learning rate\n0.03, momentum 0.9, decay 0.0001, and L2 regularizer 5e\u22125. The loss function consists of four\nparts: global classi\ufb01cation $L_{globalclass}$, attention $L_{attn}$, anchor box classi\ufb01cation $L_{boxclass}$, and box\nregression $L_{box}$,\n\n$$L_{total} = \\alpha L_{globalclass} + \\beta L_{attn} + \\frac{1}{N_{pos}} (\\gamma L_{boxclass} + L_{box}), \\qquad (4)$$\n\nwhere we empirically set \u03b1 = \u03b2 = \u03b3 = 1, and $N_{pos}$ is the total number of matched anchors for\ntraining the anchor box class predictor and anchor box offset predictor. $L_{globalclass}$ and $L_{attn}$ apply\ncross entropy loss, computed on the updated predictions of object of interest class and anchor box\nattention. $L_{boxclass}$ is the total cross entropy loss and $L_{box}$ is the total box regression loss, both computed over only\nthe matched anchors. The box regression loss follows [40, 49] and we refer readers there for\ndetails. Our full model has 64M trainable parameters, while the Self Validation Module contains no\nparameters, making it very \ufb02exible so that it can be added to training or testing anytime. It is even\npossible to stack multiple Self Validation Modules or use only half of it.\nDuring testing, the anchor with the highest anchor box attention score $A'_i$ is selected as the attended\nanchor. 
The corresponding anchor box offset prediction $O_i$ indicates where the object of interest is,\nwhile the argmax of the global object of interest class score $C'_{global}$ gives its class.\n\n4 Experiments\n\nWe evaluate our model on identifying attended objects in two \ufb01rst-person datasets collected in very\ndifferent contexts: child and adult toy play, and adults in kitchens.\nATT [68] (Adult-Toddler Toy play) consists of \ufb01rst-person videos from head-mounted cameras of\nparents and toddlers playing with 24 toys in a simulated home environment. The dataset consists of\n20 synchronized video pairs (child head cameras and parent head cameras), although we only use the\nparent videos. The object being attended is determined using gaze tracking. We randomly select 90%\nof the samples in each object class for training and use the remaining 10% for testing, resulting in\nabout 17,000 training and 1,900 testing samples, each with 15 continuous frames. We do not restrict\nthe object of interest to remain the same in each sample sequence and only use the label of the object\nof interest for training.\nEpic-Kitchen Dataset [13] contains 55 hours of \ufb01rst-person video from 32 participants in their own\nkitchens. The dataset includes annotations on the \u201cactive\u201d objects related to the person\u2019s current\naction. We use this as a proxy for the attended object by selecting only frames containing one active\nobject and assuming that this object is attended. Object classes with fewer than 1000 samples are also\nexcluded, resulting in 53 classes. We randomly select 90% of samples for training, yielding about\n120,000 training and 13,000 testing samples.\nFor evaluation, we report accuracy \u2014 the number of correct predictions over the number of samples. A\nprediction is considered correct if it has both (a) the correct class prediction and (b) an IoU between\nthe estimated and the ground truth boxes above a threshold. 
Similar to [37], we report accuracies at\nIoU thresholds of 0.5 and 0.75, as well as a mean accuracy mAcc computed by averaging accuracies\nat 10 IoU thresholds evenly distributed from 0.5 to 0.95. Accuracy thus measures the ability to correctly\npredict both what is being attended and where.\n\nTable 1: Accuracy of our method compared to others, on the ATT dataset. OIH represents Object-in-Hand, while WH means Which-Hand.\nMethod | Acc0.5 \u2191 | Acc0.75 \u2191 | mAcc \u2191\nOur Mr. Net | 74.27 | 46.78 | 44.78\nGaze [68] + GT Box + Hit | 25.26 | 25.26 | 25.26\nGaze [68] + GT Box + Closest | 35.86 | 35.86 | 35.86\nI3D [9]-based SSD [40] | 70.11 | 42.10 | 40.85\nCascade Model | 66.97 | 45.10 | 41.93\nOIH Detectors + WH Classi\ufb01er | 37.16 | 37.16 | 37.16\nLeft Handed Model | 38.31 | 38.31 | 38.31\nRight Handed Model | 39.00 | 39.00 | 39.00\nOIH GT + WH Classi\ufb01er | 40.83 | 40.83 | 40.83\nEither Handed Model | 42.94 | 42.94 | 42.94\nCenter GT Box | 23.97 | 23.97 | 23.97\n\nTable 2: Ablation results. Testing with half means that the model is tested with only what\u2212\u2192where validation.\nStreams | Self validation in training? | Self validation in testing? | Acc0.5 \u2191 | Acc0.75 \u2191 | mAcc \u2191\nTwo | yes | yes | 74.27 | 46.78 | 44.78\nTwo | yes | half | \u2014 | \u2014 | 43.88\nTwo | yes | no | 68.19 | 42.83 | 41.18\nTwo | no | yes | 67.18 | 40.06 | 39.48\nTwo | no | half | \u2014 | \u2014 | 37.87\nTwo | no | no | 62.33 | 38.31 | 37.18\nRGB | yes | yes | 74.59 | 43.15 | 42.48\nFlow | yes | yes | 64.30 | 38.63 | 37.60\nFlow | no | yes | \u2014 | \u2014 | 25.10\nFlow | no | no | \u2014 | \u2014 | 18.40\n\n4.1 Baselines\n\nWe evaluate against several strong baselines. Gaze + GT bounding box, inspired by Li et al. [35],\napplies Zhang et al.\u2019s gaze prediction method [68] (since it has state-of-the-art performance on\nATT) and directly uses ground truth object bounding boxes. This is equivalent to having a perfect\nobject detector (with mAP = 100%), resulting in a very strong baseline. 
We use two different\nmethods to match the predicted eye gaze to the object boxes: (1) Hit: only boxes in which the gaze\nfalls are considered matched, and if the estimated gaze point is within multiple boxes, the accuracy\nscore is averaged over the matched boxes; and (2) Closest: the box whose center is the\nclosest to the predicted gaze is considered to be matched. I3D [9]-based SSD [40] tries to overcome\nthe discrepancy caused by solving the where and what problems separately by directly performing\nanchor-level attention estimation with an SSD [40] built on an I3D [9] backbone. The anchor box setting is\nsimilar to SSD300 [40]. For each anchor we predict an attention score, a class score, and box offsets.\nThe Cascade model contains a temporal branch with an I3D backbone and a spatial branch with a VGG16\nbackbone. From the temporal branch, the important anchor as well as its box offsets are predicted,\nand then features are pooled [18, 49] from the spatial branch for classi\ufb01cation. Object in hands +\nGT bounding box, inspired by [15, 42, 48], tries to detect the object of interest by detecting the object\nin hand. We use several variants; the \u201ceither handed model\u201d is strongest, and uses both the ground\ntruth object boxes and the ground truth label of the object in hands. When two hands hold different\nobjects, the model always picks the one yielding higher accuracy, thus re\ufb02ecting the best performance\nwe can obtain with this baseline. Please refer to the supplementary materials for details of other\nvariants. Center GT box uses the ground truth object boxes and labels to select the object closest to\nthe frame center, inspired by the fact that people tend to adjust their head pose so that their gaze is\nnear the center of their view [34].\n\n4.2 Results on ATT dataset\n\nTable 1 presents quantitative results of our Mindreader Net and baselines on the ATT dataset. 
By both\nenforcing and leveraging visual consistency, our method outperforms, in terms of mAcc, even the either-handed\nmodel, which is built upon several strong oracles \u2014 a perfect object detector, two\nperfect object-in-hand detectors, and a perfect which-hand classi\ufb01er. Other methods without perfect\nobject detectors suffer from a rapid drop in Acc as the IoU threshold becomes higher. For example,\nwhen the IoU threshold reaches 0.75, the either-handed model already has no obvious advantage\ncompared with I3D-based SSD, and the Cascade model achieves a much higher score. When the\nthreshold becomes 0.5, not only our Mindreader Net but also Cascade and I3D-based SSD outperform\nthe either-handed model by a signi\ufb01cant margin. Though the Acc0.5 of the cascade model is lower\nthan that of I3D-based SSD by about 3 percentage points, its mAcc and Acc0.75 are higher, suggesting that bad box predictions\nwith low IoU confuse the class head of the cascade model, but that having a separate spatial branch to\novercome feature entanglement improves the overall performance with higher-quality predictions.\nWe also observed that the Closest variant of the Gaze + GT Box model is about 40% better in relative terms than the\nHit variant (35.86 versus 25.26). This suggests that gaze prediction often misses the ground truth box a bit or may fall\nin the intersection of several bounding boxes, re\ufb02ecting the discrepancy between the where and the\nwhat concepts in existing eye gaze estimation algorithms.\nSample results of our model compared with other baselines are shown in Figure 3. 
Regular gaze\nprediction models fail in (c) & (d), supporting our hypothesis about the drawback of estimating where\nand what independently \u2014 the model is not robust to small errors in gaze estimation (recall the\ngaze-based baseline uses ground truth bounding boxes so failures must be caused by gaze estimation).\nIn particular, the estimated gaze falls on 3 objects in (c), slightly closer to the center of the rabbit; in\n(d), eye gaze does not fall on any object. More uni\ufb01ed models (I3D-based SSD, the cascade model,\nand our model) thus achieve better performance. In (a) & (b), our model outperforms I3D-based\nSSD and Cascade. Because a Self Validation Module is applied to inject consistency, our Mr. Net\nperforms better when many objects including the object of interest are close to each other.\nFigure 4 illustrates how various parts of our model work. Image (a) shows the intermediate anchor\nattention score $A \\in \\mathbb{R}^{a \\times 1}$ from the temporal branch, visualized as the top 5 attended anchors with\nattention scores. These are anchor-level attention scores and no box offsets are predicted here. Image\n(b) shows visualizations of the predicted anchor offsets $O \\in \\mathbb{R}^{a \\times 4}$ and box class score $C_{box} \\in \\mathbb{R}^{a \\times c}$\nfrom the spatial branch (only of the top 5 attended anchors). We do not have negative samples or a\nbackground class for training the spatial branch and thus there are some false positives. Image (c)\ncombines output from both branches; this is also the \ufb01nal prediction of the model trained with the\nSelf Validation Module but tested without it in the ablation studies in Section 4.3. The predicted class\nis obtained from $C_{global}$ and we combine A and O to get the location. A discrepancy arises in this\nexample, as the class prediction is correct but the location is not. Image (d) shows the prediction of our full\nmodel. 
By applying double self validation, the full model correctly predicts location and class.\nSome failure cases of our model are shown in Figure 5: (a) heavy occlusion, (b) ambiguity of which\nheld object is attended, (c) the model favors the object that is reached for, and (d) an extremely\ndif\ufb01cult case where parent\u2019s reach is occluded by an object held by the child.\n\n4.3 Ablation studies\n\nWe conduct several ablation studies to evaluate the importance of the parts of our model.\nHard argmax vs. soft argmax during testing. The soft version of what\u2212\u2192where is necessary for\ngradient backpropagation during training, but there is no such issue in testing. Our full model achieves\nmAcc = 44.78% when tested with hard argmax, versus mAcc = 44.13% when tested with soft\nargmax. When doing the same experiments with other model settings, we observed similar results.\nSelf Validation Module. To study the importance of the Self Validation Module, we conduct \ufb01ve\nexperiments: (1) Train and test the model without the Self Validation Module; (2) Train the model\nwithout the Self Validation Module but test with only the what\u2212\u2192where validation (the \ufb01rst step of\nSelf Validation); (3) Train the model without Self Validation but test with it; (4) Train the model with\nSelf Validation but test with only what\u2212\u2192where validation; (5) Train the model with Self Validation\nbut test without it. As shown in Table 2, the Self Validation Module yields consistent performance\ngain. If we train the model with Self Validation but remove it during testing, the remaining model\nstill outperforms other models trained without the module. This implies that embedding the Self\nValidation Module during training helps learn a better model by bridging each component and\nproviding guidance of how components are related to each other. 
Even when Self Validation is\nremoved during testing, consistency is still maintained between the temporal and the spatial branches.\nAlso, recall that when training the model with the Self Validation Module, the loss is computed\nbased on the \ufb01nal output, and thus when we test the full model without Self Validation, the output\nis actually a latent representation in our full model. This suggests that our Self Validation Module\nencourages the model to learn a highly semantically-meaningful latent representation. Furthermore,\nthe consistency injected by Self Validation helps prevent over\ufb01tting, while signi\ufb01cant over\ufb01tting was\nobserved without the Self Validation Module during training.\nValidation method for what\u2212\u2192where. We used element-wise summation for what\u2212\u2192where validation.\nAnother strategy is to treat $V_{attn}$ as an attention vector in which rescaling is unnecessary,\n\n$$A'_i = A_i \\cdot \\tilde{V}^{i}_{attn}, \\quad \\text{with} \\quad \\tilde{V}^{i}_{attn} = \\frac{e^{V^{i}_{attn}}}{\\sum_{j=1}^{a} e^{V^{j}_{attn}}}. \\qquad (5)$$\n\nWe repeated experiments using this technique and obtained mAcc = 43.30%, a slight drop that may\nbe because the double softmax inside the Self Validation Module increases optimization dif\ufb01culty.\nSingle stream versus two streams. We conducted experiments to study the effect of each stream\nin our task. As Table 2 shows, a single optical \ufb02ow stream performs much worse than single RGB\nor two-stream, indicating that object appearance is very important for problems related to object\ndetection. However, it still achieved acceptable results since the network can refer to the spatial\nbranch for appearance information through the Self Validation Module. To test this, we removed the\nSelf Validation Module from the single \ufb02ow stream model during training. When testing this model\ndirectly, we observed a very poor result of mAcc = 18.4%; adding the Self Validation Module back\nduring testing yields a large gain to mAcc = 25.1%.\n\n
Figure 3: Sample results of our Mr. Net and baselines on ATT dataset. Detections are in blue, ground truth in\nred, and the predicted gaze of gaze-based methods in yellow.\n[Figure 3 panel columns: Our model; Gaze + GT Box + Hit; Gaze + GT Box + Close; I3D-based SSD; Cascade Model. Rows: (a) to (d).]\n\nFigure 4: Illustration of how parts of our model work.\n\nFigure 5: Some failure cases of our model, with detections in blue and ground truth in red.\n\nFigure 6: Sample results of Mr. Net on Epic-Kitchens.\n\nTable 3: Results of online detection.\nModel | Acc0.5 \u2191 | Acc0.75 \u2191 | mAcc \u2191\nMr. Net | 71.34 | 38.26 | 39.04\nGaze [68] + GT Boxes Hit | 26.46 | 26.46 | 26.46\nGaze [68] + GT Boxes Closest | 36.81 | 36.81 | 36.81\nI3D [9]-based SSD [40] | 67.43 | 37.90 | 37.22\nCascade Model | 65.96 | 38.01 | 37.93\n\nTable 4: Accuracies on the Epic-Kitchen dataset.\nMethod | Acc0.5 \u2191 | Acc0.75 \u2191 | mAcc \u2191\nOur Mr. Net | 57.18 | 31.00 | 31.20\nI3D [9]-based SSD [40] | 47.58 | 24.38 | 25.42\nCascade Model | 51.20 | 28.18 | 28.36\n\nAlternative matching strategy for box attention prediction. For the anchor box attention predictor,\nwe perform experiments with different anchor matching strategies. 
When multi-anchor matching is\nused, we do hard negative mining as suggested in [40] with the negative:positive ratio set to 3. The\nmodel with the multi-anchor matching strategy achieves mAcc = 44.27%, versus mAcc = 44.78%\nwith one-best anchor matching. We tried other negative:positive ratios (e.g., 5, 10, 20) and\nstill found that the one-best anchor matching strategy works better. This may be because we have an\nacceptable number of anchor boxes; with more anchor boxes, multi-anchor matching may work better.\nObject of interest class prediction. We explore where to place the global object of interest class\npredictor. When we connect it to the temporal branch after the fused block 5, we obtain mAcc =\n44.78%; when placed after the conv block 8 at the end of the temporal branch, we achieve mAcc =\n43.69%. This implies that for detecting the object of interest among others, a higher spatial resolution\nof the feature map is helpful.\n\n4.4 Online Detection\n\nOur model can be easily modified to do online detection, in which only previous frames are available.\nWe modified the model to detect the object of interest in the last frame of a given sequence. As shown\nin Table 3, except for the Gaze + GT Boxes models, all models suffer a drop in Acc scores,\nindicating that online detection is more difficult. However, since the gaze prediction model that we\nuse [68] is trained to predict eye gaze in each frame of the video sequence and thus works for both\nonline and offline tasks, its performance remains stable.\n\n4.5 Results on Epic-Kitchens Dataset\n\nWe show the generalizability of our model by performing experiments on Epic-Kitchens [13]. Results\nof applying our model as well as the I3D-based SSD model and the cascade model to this dataset\nare shown in Table 4. On this dataset, the Acc0.5 of the Cascade model is higher than that of the I3D-based\nSSD model. 
The reason may be that objects are sparser in this dataset and thus poorly predicted\nboxes are less likely to lead to wrong classifications. Sample results are shown in Figure 6.\n\n5 Conclusion\n\nWe considered the problem of detecting the attended object in cluttered first-person views. We proposed\na novel unified model with a Self Validation Module to leverage the visual consistency of the human\nvision system. The module jointly optimizes the class and the attention estimates as self validation.\nExperiments on two public datasets show our model outperforms other state-of-the-art methods by a\nlarge margin.\n\n6 Acknowledgements\n\nThis work was supported in part by the National Science Foundation (CAREER IIS-1253549), the\nNational Institutes of Health (R01 HD074601, R01 HD093792), NVidia, Google, and the IU Office\nof the Vice Provost for Research, the College of Arts and Sciences, and the School of Informatics,\nComputing, and Engineering through the Emerging Areas of Research Project \u201cLearning: Brains,\nMachines, and Children.\u201d\n\nReferences\n\n[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale machine learning. In USENIX Conference on Operating Systems Design and Implementation, pages 265\u2013283, 2016.\n\n[2] S. O. Ba and J. M. Odobez. Multiperson visual focus of attention from head pose and meeting contextual cues. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 33(1):101\u2013116, Jan 2011.\n\n[3] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.\n\n[4] F. Baradel, N. Neverova, C. Wolf, J. Mille, and G. Mori. Object level visual reasoning in videos. 
In Proceedings of the European Conference on Computer Vision (ECCV), pages 105\u2013121, 2018.\n\n[5] G. Bertasius, H. S. Park, S. X. Yu, and J. Shi. First person action-object detection with egonet. arXiv preprint arXiv:1603.04908, 2016.\n\n[6] G. Bertasius, H. Soo Park, S. X. Yu, and J. Shi. Unsupervised learning of important objects from first-person videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 1956\u20131964, 2017.\n\n[7] A. Borji, D. N. Sihite, and L. Itti. Probabilistic learning of task-specific visual attention. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.\n\n[8] M. C. Bowman, R. S. Johannson, and J. R. Flanagan. Eye\u2013hand coordination in a sequential target contact task. Experimental brain research, 195(2):273\u2013283, 2009.\n\n[9] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299\u20136308, 2017.\n\n[10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834\u2013848, 2018.\n\n[11] K. Cho, A. Courville, and Y. Bengio. Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17(11):1875\u20131886, 2015.\n\n[12] F. Chollet, J. Allaire, et al. R interface to keras. https://github.com/rstudio/keras, 2017.\n\n[13] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Scaling egocentric vision: The epic-kitchens dataset. In European Conference on Computer Vision (ECCV), 2018.\n\n[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 
Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248\u2013255. IEEE, 2009.\n\n[15] A. Furnari, S. Battiato, K. Grauman, and G. M. Farinella. Next-active-object prediction from egocentric videos. J. Vis. Comun. Image Represent., 49(C):401\u2013411, Nov. 2017.\n\n[16] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In Advances in Neural Information Processing Systems (NeurIPS), pages 545\u2013552, 2007.\n\n[17] M. Hayhoe and D. Ballard. Eye movements in natural behavior. Trends in cognitive sciences, 9(4):188\u2013194, 2005.\n\n[18] K. He, G. Gkioxari, P. Doll\u00e1r, and R. Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961\u20132969, 2017.\n\n[19] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. H. Torr. Deeply supervised salient object detection with short connections. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3203\u20133212, 2017.\n\n[20] X. Hou, J. Harel, and C. Koch. Image signature: Highlighting sparse salient regions. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 34(1):194\u2013201, Jan 2012.\n\n[21] X. Huang, C. Shen, X. Boix, and Q. Zhao. Salicon: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 262\u2013270, 2015.\n\n[22] Y. Huang, M. Cai, Z. Li, and Y. Sato. Predicting gaze in egocentric video by learning task-dependent attention transition. In European Conference on Computer Vision (ECCV), pages 754\u2013769, 2018.\n\n[23] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.\n\n[24] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 20(11):1254\u20131259, 1998.\n\n[25] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In IEEE International Conference on Computer Vision (ICCV), pages 2106\u20132113. IEEE, 2009.\n\n[26] N. Karessli, Z. Akata, B. Schiele, and A. Bulling. Gaze embeddings for zero-shot image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4525\u20134534, 2017.\n\n[27] S. Karthikeyan, V. Jagadeesh, R. Shenoy, M. Ecksteinz, and B. Manjunath. From where and how to what we see. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 625\u2013632, 2013.\n\n[28] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.\n\n[29] S. Lazzari, D. Mottet, and J.-L. Vercher. Eye-hand coordination in rhythmical pointing. Journal of motor behavior, 41(4):294\u2013304, 2009.\n\n[30] Y. J. Lee, J. Ghosh, and K. Grauman. Discovering important people and objects for egocentric video summarization. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1346\u20131353. IEEE, 2012.\n\n[31] G. Li, Y. Xie, L. Lin, and Y. Yu. Instance-level salient object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2386\u20132395, 2017.\n\n[32] G. Li, Y. Xie, T. Wei, K. Wang, and L. Lin. Flow guided recurrent neural encoder for video salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3243\u20133252, 2018.\n\n[33] X. Li, F. Yang, H. Cheng, W. Liu, and D. Shen. Contour knowledge transfer for salient object detection. In European Conference on Computer Vision (ECCV), pages 355\u2013370, 2018.\n\n[34] Y. Li, A. Fathi, and J. M. Rehg. 
Learning to predict gaze in egocentric video. In IEEE International Conference on Computer Vision (ICCV), 2013.\n\n[35] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille. The secrets of salient object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 280\u2013287, 2014.\n\n[36] Y. Li, M. Liu, and J. M. Rehg. In the eye of beholder: Joint learning of gaze and actions in first person video. In Proceedings of the European Conference on Computer Vision (ECCV), pages 619\u2013635, 2018.\n\n[37] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll\u00e1r, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740\u2013755. Springer, 2014.\n\n[38] D. Liu, G. Hua, and T. Chen. A hierarchical visual model for video object summarization. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 32(12):2178\u20132190, 2010.\n\n[39] N. Liu and J. Han. Dhsnet: Deep hierarchical saliency network for salient object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 678\u2013686, 2016.\n\n[40] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision (ECCV), pages 21\u201337. Springer, 2016.\n\n[41] M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.\n\n[42] M. Ma, H. Fan, and K. M. Kitani. Going deeper into first-person activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1894\u20131903, 2016.\n\n[43] M. C. Mozer and M. Sitton. Computational modeling of spatial attention. Attention, 9:341\u2013393.\n\n[44] S. Niklaus. A reimplementation of PWC-Net using PyTorch. https://github.com/sniklaus/pytorch-pwc, 2018.\n\n[45] J. 
Pan, E. Sayrol, X. Gir\u00f3 i Nieto, K. McGuinness, and N. E. O\u2019Connor. Shallow and deep convolutional networks for saliency prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 598\u2013606, 2016.\n\n[46] D. P. Papadopoulos, A. D. Clarke, F. Keller, and V. Ferrari. Training object class detectors from eye tracking data. In European Conference on Computer Vision (ECCV), pages 361\u2013376. Springer, 2014.\n\n[47] S. Perone, K. L. Madole, S. Ross-Sheehy, M. Carey, and L. M. Oakes. The relation between infants\u2019 activity with objects and attention to object appearance. Developmental psychology, 44(5):1242, 2008.\n\n[48] H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2847\u20132854, 2012.\n\n[49] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91\u201399, 2015.\n\n[50] U. Rutishauser, D. Walther, C. Koch, and P. Perona. Is bottom-up attention useful for object recognition? In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages II\u2013II. IEEE, 2004.\n\n[51] H. Sattar, S. Muller, M. Fritz, and A. Bulling. Prediction of search targets from fixations in open-world settings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 981\u2013990, 2015.\n\n[52] I. Shcherbatyi, A. Bulling, and M. Fritz. Gazedpm: Early integration of gaze information in deformable part models. arXiv preprint arXiv:1505.05753, 2015.\n\n[53] Y. Shen, B. Ni, Z. Li, and N. Zhuang. Egocentric activity prediction via event modulated attention. In Proceedings of the European Conference on Computer Vision (ECCV), pages 197\u2013212, 2018.\n\n[54] K. Simonyan and A. Zisserman. 
Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.\n\n[55] H. Song, W. Wang, S. Zhao, J. Shen, and K.-M. Lam. Pyramid dilated deeper convlstm for video salient object detection. In European Conference on Computer Vision (ECCV), pages 715\u2013731, 2018.\n\n[56] S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 2440\u20132448, 2015.\n\n[57] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.\n\n[58] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1\u20139, 2015.\n\n[59] A. Torralba, M. S. Castelhano, A. Oliva, and J. M. Henderson. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological Review, 113:2006, 2006.\n\n[60] A. M. Treisman and G. Gelade. A feature-integration theory of attention. Cognitive Psychology, 12(1):97\u2013136, 1980.\n\n[61] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, \u0141. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pages 5998\u20136008, 2017.\n\n[62] E. D. Vidoni, J. S. McCarley, J. D. Edwards, and L. A. Boyd. Manual and oculomotor performance develop contemporaneously but independently during continuous tracking. Experimental brain research, 195(4):611\u2013620, 2009.\n\n[63] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. 
In International Conference on Machine Learning, pages 2048\u20132057, 2015.\n\n[64] K. Yamada, Y. Sugano, T. Okabe, Y. Sato, A. Sugimoto, and K. Hiraki. Attention prediction in egocentric video using motion and visual saliency. In Y.-S. Ho, editor, Advances in Image and Video Technology, pages 277\u2013288, 2012.\n\n[65] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.\n\n[66] M. Zhang, K. T. Ma, J. H. Lim, Q. Zhao, and J. Feng. Deep future gaze: Gaze anticipation on egocentric videos using adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.\n\n[67] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan. Amulet: Aggregating multi-level convolutional features for salient object detection. In IEEE International Conference on Computer Vision (ICCV), pages 202\u2013211, 2017.\n\n[68] Z. Zhang, S. Bambach, C. Yu, and D. J. Crandall. From coarse attention to fine-grained gaze: A two-stage 3d fully convolutional network for predicting eye gaze in first person video. In British Machine Vision Conference (BMVC), 2018.\n\n[69] R. Zhao, W. Ouyang, H. Li, and X. Wang. Saliency detection by multi-context deep learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1265\u20131274, 2015.", "award": [], "sourceid": 8329, "authors": [{"given_name": "Zehua", "family_name": "Zhang", "institution": "Indiana University Bloomington"}, {"given_name": "Chen", "family_name": "Yu", "institution": "Indiana University"}, {"given_name": "David", "family_name": "Crandall", "institution": "Indiana University"}]}