{"title": "Memorability of Image Regions", "book": "Advances in Neural Information Processing Systems", "page_first": 296, "page_last": 304, "abstract": null, "full_text": "Memorability of Image Regions\n\nAditya Khosla\n\nAude Oliva\n\nJianxiong Xiao\nMassachusetts Institute of Technology\n\nAntonio Torralba\n\n{khosla,xiao,torralba,oliva}@csail.mit.edu\n\nAbstract\n\nWhile long-term human visual memory can store a remarkable amount of visual\ninformation, it tends to degrade over time. Recent works have shown that image\nmemorability is an intrinsic property of an image that can be reliably estimated\nusing state-of-the-art image features and machine learning algorithms. However,\nthe class of features and image information that is forgotten has not been explored\nyet. In this work, we propose a probabilistic framework that models how and\nwhich local regions from an image may be forgotten using a data-driven approach\nthat combines local and global image features. The model automatically discovers\nmemorability maps of individual images without any human annotation. We\nincorporate multiple image region attributes in our algorithm, leading to improved\nmemorability prediction of images as compared to previous works.\n\n1 Introduction\n\nHuman long-term memory can store a remarkable amount of visual information and remember\nthousands of different pictures even after seeing each of them only once [25, 1]. However, it appears to\nbe the fate of visual memories that they degrade [13, 30]. While most of the work in visual cognition\nhas examined how people forget for general classes of visual or verbal stimuli [30], little work has\nlooked at which image information is forgotten and which is retained. Does all visual information\nfade alike? Are there some features, image regions or objects that are forgotten more easily than\nothers? 
Inspired by work in visual cognition showing that humans selectively forget some objects\nand regions from an image while retaining others [22], we propose a novel probabilistic framework\nfor modeling image memorability, based on the fading of local image information.\nRecent work on image memorability [6, 7, 12] has shown that there are large differences between\nthe memorabilities of different images, and these differences are consistent across context and\nobservers, suggesting that memory differences are intrinsic to the images themselves. Using machine\nlearning tools such as support vector regression and a fully annotated dataset of images with\nhuman memorability scores, Isola et al. [7] show that an automatic image ranking algorithm matches\nindividual image memory scores quite well, with dynamic scenes with people interacting as most\nmemorable, static indoor environments and human-scale objects as somewhat less memorable, and\noutdoor vistas as forgettable. In addition, using manual annotation, Isola et al. quantified the\ncontribution of segmented regions to the image memorability score, creating a memorability map for each\nindividual image that identifies objects that are correlated with high or low memorability scores.\nHowever, this previous work did not attempt to discover in an automatic fashion which parts of the\nimage are memorable and which regions are forgettable.\nIn this paper, we introduce a novel framework for predicting image memorability that is able to\naccount for how the memorability of image regions and different types of features fades over time,\noffering memorability maps that are more interpretable than those of [7]. 
The current work offers three original\ncontributions: (1) a probabilistic model that simulates the forgetting of local image regions, (2) the\nautomatic discovery of memorability maps of individual images that reveal which regions are\nmemorable/forgettable, and (3) an improved overall image memorability prediction over [7], using an\nautomatic, data-driven approach combining local and global image features.\n\nFigure 1: Overview of our probabilistic framework. This figure illustrates a possible external or\n\u2018observed\u2019 representation of an image. The conversion to an internal representation in memory can\nbe thought of as a noisy process where some elements of the image are changed probabilistically as\ndescribed by \u03b1 and \u03b2 (Sec. 3.1). The image on the right illustrates a possible internal representation:\nthe green and blue regions remain unchanged, while the red region is forgotten and the pink region\nis hallucinated. Note that the internal representation cannot be observed and is only shown here for\nillustrating the framework.\n\n2 Related work\n\nLarge scale visual memory experiments [26, 25, 1, 13, 14, 28] have shown that humans can\nremember specific images they have seen among thousands of images, hours to days later, even after\nbeing exposed to each picture only once. 
In addition, humans seem to have a massive capacity in\nlong-term memory to store specific details about these images, like remembering whether the glass\nof orange juice they saw thousands of images earlier was full or half full [1] or which specific door\npicture they saw after being exposed to hundreds of pictures of doors [28].\nHowever, not all images are equally memorable as shown by the Memory Game experiment\ndescribed in [7, 12], and importantly, not all kinds of local information are equally retained from an\nimage: on average, observers will more likely remember visual details attached to objects that have\na specific semantic label or a distinctive interpretation (for example, observers will remember\ndifferent types of cars by tagging each car with a different brand name, but would more likely confuse\ndifferent types of apples, which only differ by their color [14]). This suggests that different features,\nobjects and regions in an image may themselves have different memorability status: indeed, works\nby Isola et al. [7, 6] have shown that different individual features, objects, local regions and attributes\nare correlated with images that are highly memorable or forgettable. For instance, indoor spaces,\npictures containing people (particularly if their face is visible), close-up views of objects, and animals\nare more memorable than buildings, pictures of natural landscapes, and natural surfaces in general\n(like mountains, grass, field). However, to date, there is no work which has attempted to predict\nwhich local information from an image is memorable or forgettable, in an automatic manner.\n\n3 Modeling memorability using image regions\n\nWe propose to predict memorability using a noisy memory process of encoding images in our\nmemory, illustrated in Fig. 1. In our setting, an image consists of different types of image regions and\nfeatures. 
After a delay between the first and second presentation of an image, people are likely to\nremember some image regions and objects more than others. For example, as shown in [7], people\nand close-up views of objects tend to be more memorable than natural objects and regions of\nlandscapes, suggesting for instance that an image region containing a person is less likely to be forgotten\nthan an image region containing a tree. It is well established that stored visual information decays\nover time [30, 31, 14], which can be represented in a model by a novel image vector with missing\nglobal and local information. We postulate that the farther the stored representation of the image is\nfrom its veridical representation, the less likely it is to be remembered.\nHere, we propose to model this noisy memorability process in a probabilistic framework. We assume\nthat the representation of an image is composed of image regions where different regions of an\nimage correspond to different sets of objects. These regions have different probabilities of being\nforgotten and some regions have a probability of being imagined or hallucinated. We postulate that\nthe likelihood of an image to be remembered depends on the distance between the initial image\nrepresentation and its internal degraded version. An image with a larger distance to the internal\nrepresentation is more likely to be forgotten, and thus should have a lower memorability\nscore. In our algorithm, we model this probabilistic process and show its effectiveness at predicting\nimage memorability and at producing interpretable memorability maps.\n\n3.1 Formulation\n\nGiven some image Ij, we define vj and \u1e7dj as the external and internal\nrepresentations of the image, respectively. 
The external representation refers to the original image which is\nobserved, while the internal representation refers to the noisy representation of the same image that is\nstored in the observer\u2019s memory. Assume that there are N types of regions or objects an image can\ncontain. We define $v_j \\in \\{0, 1\\}^N$ as a binary vector of size N containing a 1 at index n when the\ncorresponding region is present in image Ij and 0 otherwise. Similarly, the internal representation\nconsists of the same set of region types, but has different presence and absence values as memory is\nnoisy.\nIn this setting, one of two things can happen when the external representation of an image is\nobserved: (1) An image region that was shown is forgotten, i.e. \u1e7dj(i) = 0 when vj(i) = 1, where vj(i)\nrefers to the ith element of vj, or (2) An image region is hallucinated, i.e. an image region that did\nnot exist in the image is believed to be present. We expect this to happen with different probabilities\nfor different types of image regions. Therefore, we define two probability vectors $\\vec{\\alpha}, \\vec{\\beta} \\in [0, 1]^N$,\nwhere \u03b1i corresponds to the probability of region type i being forgotten while \u03b2i corresponds to the\nprobability of hallucinating a region of type i.\nUsing this representation, we define the distance between the internal and external representation as\n$D_j = D(v_j, \\tilde{v}_j) = \\|v_j - \\tilde{v}_j\\|_1$. Dj is inversely related to the memorability score sj of an image;\nthe higher the distance of an image in the brain from its true representation, the less likely it is\nto be remembered, i.e. when D increases, s decreases. Thus, we can compute the expected distance\nE(Dj|vj) of an image as:\n\n$$E(D_j \\mid v_j) = \\sum_{i=1}^{N} \\alpha_i^{v_j(i)} \\cdot \\beta_i^{1 - v_j(i)} = v_j^T \\vec{\\alpha} + (\\neg v_j)^T \\vec{\\beta} \\qquad (1)$$\n\nThis represents the expected number of modifications in v from 1 to 0 (\u03b1) or from 0 to 1 (\u03b2). 
Thus,\nover all images, we can define the expected distance E(D|v) as\n\n$$E(D \\mid v) = \\begin{pmatrix} v_1^T & \\neg v_1^T \\\\ \\vdots & \\vdots \\\\ v_M^T & \\neg v_M^T \\end{pmatrix} \\cdot \\begin{pmatrix} \\vec{\\alpha} \\\\ \\vec{\\beta} \\end{pmatrix} \\propto_{\\text{rank}} -\\vec{s} \\qquad (2)$$\n\nwhere \u03b1i, \u03b2i \u2208 [0, 1], \u221drank indicates that the proportionality concerns only the relative\nranking of the image memorability scores, and M is the total number of images. We do not explicitly\npredict a memorability score, but rather the ranking of scores between images.\nThe above equation represents a typical ordinal rank regression setting with additional constraints\non the learning parameters $\\vec{\\alpha}$ and $\\vec{\\beta}$. Since we are only interested in the rank, we can rescale the\nlearned parameters to lie in [0, 1], allowing us to use standard solvers such as SVM-Rank [9].\nWe note that $\\vec{\\beta}$ cannot be uniquely determined when considering ranking of images alone, and thus\nwe focus our attention on $\\vec{\\alpha}$ for the rest of this paper.\n\nFigure 2: Illustration of multiple feature integration. Refer to Sec. 3.2 for details.\n\nImplementation details: To generate the region types automatically, we randomly sample\nrectangular regions of arbitrary width and height from the training images. The regions can overlap\nwith each other. For each region, we compute a particular feature (described in Sec. 4.2), ensuring\nthe same dimension for all regions of different shapes and sizes (using Bag-of-Words-like\nrepresentations). Then we perform k-means clustering to learn the dictionary of region types as cluster\ncentroids. The region type is determined by the closest cluster centroid. This method allows us to\nbypass the human annotation used in [7]. The details of the dictionary size and feature\ntypes used are provided in Sec. 4. 
As we sample overlapping regions, we only encode the\npresence of a region type by 1 or 0. There may be more than one sampled region that corresponds to a\nparticular region type.\nWe evaluate our algorithm on test images by applying the same procedure as for the training images.\nIn this case, we assume the dictionary of region types is given, and we simply assign the randomly\nsampled image regions to region types, and use the learned parameters ($\\vec{\\alpha}$, $\\vec{\\beta}$) to compute a score.\n\n3.2 Multiple feature integration\n\nWe incorporate multiple attributes of each region type such as color, texture and gradient in the form\nof image features into our algorithm. Our method is illustrated in Fig. 2. For each attribute, we learn\na separate dictionary of region types. An image region is encoded using each feature dictionary\nindependently, and the $\\vec{\\alpha}$, $\\vec{\\beta}$ parameters are learned jointly in our learning algorithm. Subsequently,\nwe use each set of $\\vec{\\alpha}$ and $\\vec{\\beta}$ for individual features to construct memorability maps that are later\ncombined using weighted pooling (see footnote 1) to produce an overall memorability map as shown in Fig. 2.\nWe demonstrate experimentally (Sec. 4) that multiple feature integration both improves the\nmemorability score prediction and produces visually more consistent memorability maps.\n\n4 Experiments\n\nIn this section, we describe the experimental setup and dataset used (Sec. 4.1), provide details about\nthe region attributes used in our experiments (Sec. 4.2) and describe the experimental results on\nthe image memorability dataset (Sec. 4.3). Experimental results show that our method outperforms\nstate-of-the-art methods on this dataset while providing automatic memorability maps of images that\ncompare favorably to when ground truth segmentation is used.\n\n4.1 Setup\n\nDataset: We use the dataset proposed by Isola et al. [7] consisting of 2222 images from the SUN\ndataset [32]. 
The images are fully annotated with segmented object regions and randomly sampled\nfrom different scene categories. The images are cropped and resized to 256 \u00d7 256 and a memorability\nscore corresponding to each image is provided. The memorability score is defined as the percentage\nof correct detections by participants in their study.\nPerformance evaluation: The performance is evaluated using Spearman\u2019s rank correlation (\u03c1). We\nevaluate our performance on 25 different training/testing splits of the data (same splits as [7]) with\nan equal number of images for training and testing (1111). The train splits are scored by one half\nof the participants and the test splits are scored by the other half of the participants with a human\nconsistency of \u03c1 = 0.75. This can be thought of as an upper bound on the performance of automatic\nmethods.\n\nFootnote 1: We weight the importance of individual features by summing the $\\vec{\\alpha}$ corresponding to the particular feature.\n\nAlgorithmic details: We sample 2000 patches per image with size 0.2 \u00d7 0.2 to 0.7 \u00d7 0.7 with random\naspect ratios in normalized image coordinates. To speed up convergence of SVM-Rank, we do not\ninclude rank constraints for memorability scores that lie within 0.001 of each other. We find that\nthis does not affect the performance significantly. The hyperparameter of the SVM-Rank algorithm\nis set using 5-fold cross-validation.\n\n4.2 Image region attributes\n\nOur goal is to choose various features as attributes that humans likely use to represent image regions.\nIn this work, we consider six common attributes, namely gradient, color, texture, shape, saliency\nand semantic meaning of the images. The attributes are extracted for each region and assigned to a\nregion type as described in Sec. 
3.2 with a dictionary size of 1024 for each feature. For each of the\nattributes, we describe our motivation and the method used for extraction.\nGradient: In the human visual system, much evidence suggests that retinal ganglion cells and the receptive\nfields of cells in the visual cortex V1 essentially act as gradient-based feature detectors. Furthermore, the recent\nsuccess of many computer vision algorithms [2, 4] has also demonstrated the power of such features. In\nthis work, we use the powerful Histogram of Oriented Gradients (HOG) features for our task. We\ndensely sample HOG [2] with a cell size of 2 \u00d7 2 at a grid spacing of 4 and learn a dictionary of size\n256. The descriptors for a given image region are max-pooled at 2 spatial pyramid levels [15] using\nLocality-Constrained Linear Coding (LLC) [29].\nColor: Color is an important part of human vision. Color usually has large variations caused by\nchanges in illumination, shadows, etc., and these variations make the task of robust color description\ndifficult. Isola et al. [7] show that simple image color features, such as mean hue, saturation and\nintensity, exhibit only very weak correlation with memorability. In contrast to this, color has been\nshown to yield excellent results in combination with shape features for image classification [11].\nFurthermore, many studies show that color names are actually linguistic labels that humans assign\nto regions of the color spectrum. In this paper, we use the color names feature [27] to better exploit the\ncolor information. We densely sample the feature at multiple scales (12, 16, 24 and 32) with a grid\nspacing of 4. Then we learn a dictionary of size 100 and apply LLC at a 2-level spatial pyramid to\nobtain the color descriptor for each region.\nTexture: We interact with a variety of materials on a daily basis and we constantly assess their\ntexture properties by visual means and tactile touch. 
To encode visual texture perception information,\nwe make use of a popular texture feature, the Local Binary Pattern (LBP) [21]. We use a 2-level\nspatial pyramid of the non-uniform LBP descriptor.\nSaliency: Image saliency is a biologically inspired model to capture the regions that attract more\nvisual attention and fixation focus [8]. Inspired by this, we extract a saliency value for each pixel\nusing natural statistics [10]. Then we perform average pooling at a 3-level spatial pyramid to obtain\nthe descriptor for each region.\nShape: Humans constantly use geometric patterns to determine the similarity between visual\nentities, and the layout of shapes is directly relevant to mid-level representations of the image. We represent\nshape as a histogram of local Self-Similarity geometric patterns (SSIM [23]). We densely sample the\nSSIM descriptor with a grid spacing of 4 and learn a dictionary of size 256. The descriptors for a\ngiven image region are max-pooled at 2 spatial pyramid levels using LLC.\nSemantic: High-level semantic meaning contained in images has been shown to be strongly\ncorrelated to image memorability [7], where manual annotation of object labels leads to great performance\nin predicting image memorability. Here, our goal is to design a fully automatic approach to predict\nimage memorability, while still exploiting the semantic information. Thus, we use the automatic\nObject Bank [17] feature to model the presence/absence of various objects in the images. We reduce\nthe feature dimension by using simple max pooling instead of spatial pyramid pooling.\n\nTable 1: Images are sorted into sets according to predictions made on the basis of a variety of\nfeatures (denoted by column headings). Average measured memorabilities are reported for each set.\nFor example, the Top 20 row reports the average measured memorability of the 20 highest predicted images. 
\u03c1\nis the Spearman rank correlation between predictions and measurements.\n\nSet | Multiple global features [7] | Our Global | Our Local | Our Full Model\nTop 20 | 83% | 84% | 83% | 85%\nTop 100 | 80% | 80% | 80% | 81%\nBottom 100 | 57% | 56% | 57% | 55%\nBottom 20 | 55% | 53% | 54% | 52%\n\u03c1 | 0.46 | 0.48 | 0.45 | 0.50\n\nFigure 3: Visualization of region types and corresponding \u03b1 learned by our algorithm for gradient\nand semantic features. The histograms represent the distribution of memorability scores\ncorresponding to the particular region type. We observe that high-scoring images tend to have a small value of\n\u03b1 while low-scoring regions have a high value. This corresponds well with the proposed framework.\nThe color of the bounding boxes corresponds to the memorability score of the image shown (using\na jet color scheme).\n\n4.3 Results\n\nIn this section, we evaluate the performance of our model with single and multiple features, and later\nexplore what the model has learned using memorability maps and the ranking of different types of\nimage regions.\nSingle + multiple features: Fig. 6(a) and Table 1 summarize the performance of our algorithm when\nusing single and multiple features. We compare our results with [7], and find that our algorithm\noutperforms the automatic methods from [7] by 4% (\u03c1 = 0.50 vs. 0.46), and achieves performance comparable to when\nground truth annotation is used. This shows the effectiveness of our method at predicting\nmemorability. Further, we note that our model provides complementary information to global features as\nit focuses on local image regions, increasing performance by 2% (\u03c1 = 0.50 vs. 0.48) when combined with our global\nfeatures. We use the same set of attributes described in Sec. 4.2 as global features in our model.\nThe global features are learned independently using SVM-Rank and the predicted score is\ncombined with the predicted scores of our local model using the SVM-Rank algorithm. 
Despite using the same\nset of features, we obtain a performance gain, suggesting that our algorithm is effective at\ncapturing local information in the image that was overlooked by the global features.\nMemorability maps: We obtain memorability maps using max-pooling of the \u03b1 from different\nimage regions. Fig. 4 shows the memorability maps obtained when using different features and\nthe overall memorability map when combining multiple features. Despite using no annotation, the\nlearned maps are similar to those obtained using ground truth objects and segments. From the\nimages shown, we observe that there is no single attribute that is always effective at producing\nmemorability maps, but the combination of the attributes leads to a significantly improved version.\nWe show additional results in Fig. 5.\n\n[Figure 3 panels: Gradient (HOG) and Semantic (ObjectBank); example \u03b1 values shown: 0.107, 0.909, 0.048, 0.931]\n\nFigure 4: Visualization of the memorability maps obtained using different features, and the overall\nmemorability map. Additionally, we also include the memorability map obtained when using ground\ntruth segmentation on the right. We observe that it resembles our automatically generated maps.\n\nFigure 5: Additional examples of memorability maps generated by our algorithm.\n\nImage region types: In Fig. 3, we rank the image region types by their \u03b1 value and visualize the\nregions for the corresponding region type when \u03b1 is close to 0 or 1. We observe that the region types\nare consistent with our intuition of what is memorable from [7]. People often appear in region types\nwith low \u03b1 (i.e. low probability of being forgotten) while natural scenes and plain backgrounds are\nobserved in region types with high \u03b1.\nFurther, we analyze the image region types by computing the standard deviation of the memorability\nscores of the image regions that correspond to the particular type. Fig. 6(b) and 6(c) show the\nresults. 
The results are encouraging as regions that have high standard deviation tend to have a value\nof \u03b1 close to 0.5, which means they are not very informative for prediction. The same behavior is\nobserved for multiple feature types, and we find that the overall performance for individual features\n(shown in Fig. 6(a)) corresponds well with the distance of the peaks in Fig. 6(b) from \u03b1 = 0.5. This\nsuggests that our algorithm is effective at learning the regions with high and low probability of being\nforgotten as proposed in our framework.\n\n[Figure 4 panel legend: 1 Gradient, 2 Saliency, 3 Color, 4 Texture, 5 Shape, 6 Semantic; columns: original image, feature memorability maps, overall memorability map, ground truth segments]\n\nFigure 6: Plot of various results and analysis of our method. Fig. 6(b) and Fig. 6(c) are explained in\ngreater detail in Sec. 4.3.\n\n(a) Comparison of results averaged across the 25 splits. Images are ranked by predicted memorability\nand plotted against the cumulative average of measured memorability scores.\n\n(b) Standard deviation of memorability score of all region types averaged across the 25 splits for all\nfeatures, sorted by \u03b1. 
Graphs are smoothed using a median filter.\n\n(c) Standard deviation of region types for the Gradient feature averaged across the 25 splits.\nNo smoothing is applied in this case.\n\n5 Conclusion\n\nWith the emergence of large scale photo collections and growing demand for storing, organizing,\ninterpreting, and summarizing large amounts of digital information, it becomes essential to be able\nto automatically annotate images on various novel dimensions that are interpretable to human users.\nRecently, learning algorithms have been proposed to automatically interpret whether an image is\naesthetically pleasant or not [20, 3], memorable or forgettable [7, 6], and the role that other high\nlevel photographic properties play in image interpretation (photo quality [19], attractiveness [16],\ncomposition [5, 18], and object importance [24]). Here, we propose a novel probabilistic\nframework for automatically constructing memorability maps, discovering regions in the image that are\nmore likely to be memorable or forgettable by human observers. We demonstrate an effective yet\ninterpretable framework to model the process of forgetting. Future development of such automatic\nalgorithms of image memorability could have many exciting and far-reaching applications in\ncomputer science, graphics, media, design, gaming and entertainment industries in general.\n\nAcknowledgements\nWe thank Phillip Isola and the reviewers for helpful discussions. This work is funded by NSF grant\n(1016862) to A.O., Google research awards to A.O. and A.T., ONR MURI N000141010933 and NSF\nCareer Award (0747120) to A.T. J.X. is supported by a Google U.S./Canada Ph.D. Fellowship in\nComputer Vision.\n\nReferences\n\n[1] T. F. Brady, T. Konkle, G. A. Alvarez, and A. Oliva. Visual long-term memory has a massive storage capacity for object details. PNAS, pages 14325\u201314329, 2008.\n\n[2] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. 
In CVPR, volume 1, pages 886\u2013893. IEEE, 2005.\n\n[3] S. Dhar, V. Ordonez, and T.L. Berg. High level describable attributes for predicting aesthetics and interestingness. In CVPR, pages 1657\u20131664. IEEE, 2011.\n\n[4] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010.\n\n[5] B. Gooch, E. Reinhard, C. Moulding, and P. Shirley. Artistic composition for image creation. In Rendering Techniques 2001: Proceedings of the Eurographics Workshop in London, United Kingdom, June 25-27, 2001, page 83. Springer Verlag Wien, 2001.\n\n[6] P. Isola, D. Parikh, A. Torralba, and A. Oliva. Understanding the intrinsic memorability of images. In Advances in Neural Information Processing Systems (NIPS), 2011.\n\n[Figure 6 plot legend, with rank correlation in brackets: Other Human [0.75], Isola et al. [0.46], Objects and Scenes [0.50], Our Final Model [0.50], Global Only [0.48], Local Only [0.45], Gradient [0.40], Shape [0.38], Semantic [0.37], Texture [0.34], Color [0.29], Saliency [0.28]. Panel (a) axes: image rank (N) vs. average memorability for top N ranked images (%). Panels (b) and (c) axes: \u03b1 (from 0 to 1) vs. standard deviation.]\n\n[7] P. Isola, J. Xiao, A. Torralba, and A. Oliva. What makes an image memorable? In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 145\u2013152, 2011.\n\n[8] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40:1489\u20131506, 2000.\n\n[9] T. Joachims. Training linear SVMs in linear time. In ACM SIGKDD, pages 217\u2013226, 2006.\n\n[10] C. Kanan, M.H. Tong, L. Zhang, and G.W. Cottrell. Sun: Top-down saliency using natural statistics. Visual Cognition, 17(6-7):979\u20131003, 2009.\n\n[11] F. S. Khan, J. van de Weijer, A. D. Bagdanov, and M. Vanrell. Portmanteau vocabularies for multi-cue image representation. 
In NIPS, Granada, Spain, 2011.\n\n[12] A. Khosla\u2217, J. Xiao\u2217, P. Isola, A. Torralba, and A. Oliva. Image memorability and visual inception. In SIGGRAPH Asia, 2012. \u2217 indicates equal contribution.\n\n[13] T. Konkle, T.F. Brady, G.A. Alvarez, and A. Oliva. Conceptual distinctiveness supports detailed visual long-term memory for real-world objects. Journal of Experimental Psychology, (139):558\u2013578, 3 2010.\n\n[14] T. Konkle, T.F. Brady, G.A. Alvarez, and A. Oliva. Scene memory is more detailed than you think: the role of categories in visual long-term memory. Psychological Science, (21):1551\u20131556, 11 2010.\n\n[15] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, volume 2, pages 2169\u20132178. IEEE, 2006.\n\n[16] T. Leyvand, D. Cohen-Or, G. Dror, and D. Lischinski. Data-driven enhancement of facial attractiveness. In ACM Transactions on Graphics (TOG), volume 27, page 38. ACM, 2008.\n\n[17] L.-J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object bank: A high-level image representation for scene classification & semantic feature sparsification. In NIPS, Vancouver, Canada, December 2010.\n\n[18] L. Liu, R. Chen, L. Wolf, and D. Cohen-Or. Optimizing photo composition. In Computer Graphics Forum, volume 29, pages 469\u2013478. Wiley Online Library, 2010.\n\n[19] Y. Luo and X. Tang. Photo and video quality evaluation: Focusing on the subject. In Proceedings of the 10th European Conference on Computer Vision: Part III, pages 386\u2013399. Springer-Verlag, 2008.\n\n[20] L. Marchesotti, F. Perronnin, D. Larlus, and G. Csurka. Assessing the aesthetic quality of photographs using generic image descriptors. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1784\u20131791. IEEE, 2011.\n\n[21] T. Ojala, M. Pietikainen, and T. Maenpaa. 
Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. Pattern Analysis and Machine Intelligence, 24(7):971\u2013987, 2002.\n\n[22] R. A. Rensink, J. K. O\u2019Regan, and J. J. Clark. To See or not to See: The Need for Attention to Perceive Changes in Scenes. Psychological Science, 8(5):368\u2013373, September 1997.\n\n[23] E. Shechtman and M. Irani. Matching local self-similarities across images and videos. In Computer Vision and Pattern Recognition, 2007. CVPR\u201907. IEEE Conference on, pages 1\u20138. IEEE, 2007.\n\n[24] M. Spain and P. Perona. Some objects are more equal than others: Measuring and predicting importance. Computer Vision\u2013ECCV 2008, pages 523\u2013536, 2008.\n\n[25] L. Standing. Learning 10000 pictures. The Quarterly Journal of Experimental Psychology, 25(2):207\u2013222, 1973.\n\n[26] L. Standing, J. Conezio, and R.N. Haber. Perception and memory for pictures: Single-trial learning of 2500 visual stimuli. Psychonomic Science, 1970.\n\n[27] J. Van De Weijer, C. Schmid, and J. Verbeek. Learning color names from real-world images. In Computer Vision and Pattern Recognition, 2007. CVPR\u201907. IEEE Conference on, pages 1\u20138. IEEE, 2007.\n\n[28] S. Vogt and S. Magnussen. Long-term memory for 400 pictures on a common theme. Experimental Psychology (formerly Zeitschrift f\u00fcr Experimentelle Psychologie), 54(4):298\u2013303, 2007.\n\n[29] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, pages 3360\u20133367. IEEE, 2010.\n\n[30] J. T. Wixted. The Psychology and Neuroscience of Forgetting. Annual Review of Psychology, 55(1), 2004.\n\n[31] J. T. Wixted and S. K. Carpenter. The Wickelgren Power Law and the Ebbinghaus Savings Function. Psychological Science, 18(2):133\u2013134, February 2007.\n\n[32] J. Xiao, J. Hays, K.A. Ehinger, A. Oliva, and A. 
Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, pages 3485\u20133492. IEEE, 2010.", "award": [], "sourceid": 4570, "authors": [{"given_name": "Aditya", "family_name": "Khosla", "institution": null}, {"given_name": "Jianxiong", "family_name": "Xiao", "institution": null}, {"given_name": "Antonio", "family_name": "Torralba", "institution": null}, {"given_name": "Aude", "family_name": "Oliva", "institution": null}]}