{"title": "Understanding the Intrinsic Memorability of Images", "book": "Advances in Neural Information Processing Systems", "page_first": 2429, "page_last": 2437, "abstract": "Artists, advertisers, and photographers are routinely presented with the task of creating an image that a viewer will remember. While it may seem like image memorability is purely subjective, recent work shows that it is not an inexplicable phenomenon: variation in memorability of images is consistent across subjects, suggesting that some images are intrinsically more memorable than others, independent of a subjects' contexts and biases. In this paper, we used the publicly available memorability dataset of Isola et al., and augmented the object and scene annotations with interpretable spatial, content, and aesthetic image properties. We used a feature-selection scheme with desirable explaining-away properties to determine a compact set of attributes that characterizes the memorability of any individual image. We find that images of enclosed spaces containing people with visible faces are memorable, while images of vistas and peaceful scenes are not. Contrary to popular belief, unusual or aesthetically pleasing scenes do not tend to be highly memorable. This work represents one of the first attempts at understanding intrinsic image memorability, and opens a new domain of investigation at the interface between human cognition and computer vision.", "full_text": "Understanding the Intrinsic Memorability of Images\n\nPhillip Isola\n\nMIT\n\nDevi Parikh\nTTI-Chicago\n\nAntonio Torralba\n\nMIT\n\nAude Oliva\n\nMIT\n\nphillipi@mit.edu\n\ndparikh@ttic.edu\n\ntorralba@mit.edu\n\noliva@mit.edu\n\nAbstract\n\nArtists, advertisers, and photographers are routinely presented with the task of\ncreating an image that a viewer will remember. While it may seem like image\nmemorability is purely subjective, recent work shows that it is not an inexplicable\nphenomenon: variation in memorability of images is consistent across subjects,\nsuggesting that some images are intrinsically more memorable than others, inde-\npendent of a subjects\u2019 contexts and biases. In this paper, we used the publicly\navailable memorability dataset of Isola et al. [13], and augmented the object and\nscene annotations with interpretable spatial, content, and aesthetic image proper-\nties. We used a feature-selection scheme with desirable explaining-away proper-\nties to determine a compact set of attributes that characterizes the memorability of\nany individual image. We \ufb01nd that images of enclosed spaces containing people\nwith visible faces are memorable, while images of vistas and peaceful scenes are\nnot. Contrary to popular belief, unusual or aesthetically pleasing scenes do not\ntend to be highly memorable. This work represents one of the \ufb01rst attempts at\nunderstanding intrinsic image memorability, and opens a new domain of investi-\ngation at the interface between human cognition and computer vision.\n\n1\n\nIntroduction\n\n(a)\n(f)\nFigure 1: Which of these images are the most memorable? See footnote 1 for the answer key.\n\n(b)\n\n(c)\n\n(d)\n\n(e)\n\nWhen glancing at a magazine or browsing the Internet we are continuously exposed to photographs\nand images. Despite this over\ufb02ow of visual information, humans are extremely good at remembering\nthousands of pictures and a surprising amount of their visual details [1, 15, 16, 25, 30]. But, while\nsome images stick in our minds, others are ignored or quickly forgotten. 
Artists, advertisers, and photographers are routinely challenged by the question "what makes an image memorable?" and are then presented with the task of creating an image that will be remembered by the viewer. While psychologists have studied the human capacity to remember visual stimuli [1, 15, 16, 25, 30], little work has systematically studied the differences in stimuli that make them more or less memorable. In a recent paper [13], we quantified the memorability of 2222 photographs as the rate at which subjects detect a repeat presentation of the image a few minutes after its initial presentation. The memorability of these images was found to be consistent across subjects and across a variety of contexts, making some of these images intrinsically more memorable than others, independent of the subjects' past experiences or biases. Thus, while image memorability may seem like a quality that is hard to quantify, our recent work suggests that it is not an inexplicable phenomenon.

Figure 2: Distribution of memorability M of photographs with respect to unusualness U (left, corr: -0.12), aesthetics A (middle, corr: -0.29), and subjects' guesses m of how memorable an image is (right, corr: -0.19). All 2222 images from the memorability dataset were rated along these three aspects by 10 subjects each. Contrary to popular belief, unusual and aesthetically pleasing images are not predominantly the most memorable ones. Also shown are example images that demonstrate this: (b) ↑U ↓M, (c) ↓U ↑M, (e) ↑A ↓M, (f) ↓A ↑M, (h) ↑m ↓M, (i) ↓m ↑M (e.g., (f) shows an image that is very aesthetic, but not memorable). Clearly, which images are memorable is not intuitive, as seen by the poor estimates from subjects (g).

But then again, subjective intuitions of what makes an image memorable may need to be revised. For instance, look at the photographs of Figure 1. Which images do you think are more memorable? (See footnote 1.) We polled various human and computer vision experts to get ideas as to what people think drives memorability. Among the most frequent responses were unusualness (8 out of 16) and aesthetic beauty (7 out of 16). Surprisingly, as shown in Figure 2, we find that these are only weakly (and, in fact, negatively) correlated with memorability as measured in [13]. Further, when subjects were asked to rate how memorable they think an image would be, their responses were also weakly (negatively) correlated with true memorability (Figure 2)!
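These correlations are straightforward to compute from the per-image memorability scores and the per-image mean ratings. A minimal sketch on stand-in data follows (the arrays are placeholders, and we assume the figure's "corr" is a rank correlation, which the paper uses elsewhere; the values reported above come from the real ratings):

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_images = 2222

memorability = rng.random(n_images)  # placeholder for the measured scores
mean_ratings = {                     # placeholders for 10-subject mean ratings
    "unusualness (U)": rng.random(n_images),
    "aesthetics (A)": rng.random(n_images),
    "guessed memorability (m)": rng.random(n_images),
}

for name, values in mean_ratings.items():
    rho, _ = spearmanr(values, memorability)
    print(f"{name} vs. M: corr = {rho:.2f}")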
While our previous work aimed at predicting memorability [13], here we aim to better understand memorability. Any realistic use of the memorability of images requires an understanding of the key factors that underlie memorability, be it for cognitive scientists seeking to discover the mechanisms behind memory or for advertisement designers seeking to create more effective visual media.

Thus, the goal of this paper is to identify a collection of human-understandable visual attributes that are highly informative about image memorability. First, we annotate the memorability dataset [13] with interpretable and semantic attributes. Second, we employ a greedy feature selection algorithm with desirable explaining-away properties that allows us to explicitly determine a compact set of characteristics that make an image memorable. Finally, we train automatic detectors that predict these characteristics, which are in turn used to predict memorability.

2 Related work

Visual memory: People have been shown to have a remarkable ability to remember particular images in long-term memory, be they everyday scenes, objects, and events [30], or the shapes of arbitrary forms [25]. As most of us would expect, image memorability depends on the viewer's context and is likely to be subject to some inter-subject variability [12]. However, in our previous work [13], we found that despite this expected variability, there is also a large degree of agreement between viewers. This suggests that there is something intrinsic to images that makes some more memorable than others, and in [13] we developed a computer vision algorithm to predict this intrinsic memorability. While prediction is a useful goal, prediction systems are often uninterpretable, giving us little insight into what makes an image memorable. Hence, in this work, we focus on identifying the characteristics of images that make them memorable. A discussion of the different models of memory retrieval [3, 11, 27] and formation [22] is beyond the scope of this paper.

Attributes for interpretability: Attribute-based visual recognition has received a lot of attention in the computer vision literature in recent years. Attributes can be thought of as mid-level interpretable features such as "furry" and "spacious". Attributes are attractive because they allow for transfer learning among categories that share attributes [18]. Attributes also allow for descriptions of previously unseen images [8]. In this work, we exploit attributes to understand which properties of an image make it memorable.

Predicting image properties: While image memorability is largely unexplored, many other photographic properties have been studied in the literature, such as photo quality [21], saliency [14], attractiveness [20], composition [10, 24], color harmony [5], and object importance [29]. Most related to our work is the recent work of Dhar et al. [7], who use attributes to predict the aesthetic quality of an image. Towards the goal of improved prediction, they use a list of attributes known to influence the aesthetic quality of an image. In our work, since it is not known what makes an image memorable, we use an exhaustive list of attributes, and use a feature selection scheme to identify which attributes make an image memorable.

Footnote 1: Images (a, d, e) are among the most memorable images in our dataset, while (b, c, f) are among the least.

Figure 3: Example images depicting varying values of a subset of the attributes annotated by subjects: (a)-(e) show images rated high on attractive, funny, makes-sad, quality photo, and peaceful, respectively; (f)-(j) show images rated low on the same attributes.

3 Attribute annotations

We investigate memorability using the memorability dataset from [13]. The dataset consists of 2222 natural images of everyday scenes and events selected from the SUN dataset [32], as well as memorability scores for each image. The memorability scores were obtained via 665 subjects playing a 'memory game' on Amazon's Mechanical Turk. A series of natural images were flashed for 1 second each. Subjects were instructed to press a key whenever they detected a repeat presentation of an image. The memorability score of an image corresponds to the number of subjects that correctly detected a repeat presentation of the image. The rank correlation between scores computed from two halves of the subjects was found to be 0.75, providing evidence for intrinsic image memorability. Example images from this dataset can be seen throughout the paper.
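For concreteness, the scoring and the split-half consistency check can be sketched as follows (a minimal illustration on stand-in data; the actual game logs, their format, and the normalization by exposures are assumptions):

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_images, n_subjects = 2222, 665

# detected[s, i] = 1.0 if subject s correctly detected the repeat of image i,
# 0.0 if they missed it, and NaN if they were never shown that repeat.
detected = rng.choice([0.0, 1.0], size=(n_subjects, n_images))
detected[rng.random((n_subjects, n_images)) < 0.7] = np.nan  # sparse exposure

# Score of an image: detection count, here normalized by the number of
# subjects who were actually shown the repeat.
scores = np.nanmean(detected, axis=0)

# Split-half consistency: re-score from two disjoint halves of the subjects
# and rank-correlate the two score vectors ([13] reports rho = 0.75 on the
# real data; this random placeholder data will of course give rho near 0).
perm = rng.permutation(n_subjects)
half1, half2 = perm[: n_subjects // 2], perm[n_subjects // 2 :]
rho, _ = spearmanr(np.nanmean(detected[half1], axis=0),
                   np.nanmean(detected[half2], axis=0))
print(f"split-half rank correlation: {rho:.2f}")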
The images in the memorability dataset come from ~700 scene categories [32]. They have been labeled via the LabelMe [26] online annotation tool, and contain ~1300 object categories. While the scene and object categories depicted in an image may very well influence its memorability, there are many other properties of an image that could be at play. To get a handle on these, we constructed an extensive list of image properties, or attributes, and had the 2222 images annotated with these properties using Amazon's Mechanical Turk. An organization of the attributes collected is shown in Table 1. Binary attributes are listed with a '?', while multi-valued attributes (on a scale of 1-5) are listed with a ';'. Each image was annotated by 10 subjects for each of the attributes. The average response across the subjects was stored as the value of the attribute for an image. The 'Length of description' attribute was computed as the average number of words subjects used to describe the image (free-form). The spatial layout attributes were based on the work of Oliva and Torralba [23]. Many of the aesthetic attributes are based on the work of Dhar et al. [7].

We noticed that images containing people tend to be highly memorable. However, even among images containing people, there is a variation in memorability that is consistent across subjects (split-half rank correlation = 0.71). In an effort to better understand the memorability of images containing people, we collected several attributes that are specific to people. These are listed in Table 2. The annotations of these attributes were collected only on images containing people (and are considered to be absent for images not containing people). This is compactly captured by the 'contains a person' attribute.

Some questions had multiple-choice answers (for example, Age can take four values: child, teenager, adult, and senior). When applicable, the multiple choices are listed in parentheses in Table 2. Each choice was treated as a separate binary attribute (e.g., is-child). Some of the people attributes refer to the entire image ('whole image') while others refer to each person in the image ('per-person'). The per-person attributes were aggregated across all subjects and all people in the image. See Figure 3 for example attribute annotations.

Table 1: General attributes

Spatial layout: Enclosed space vs. Open space; Perspective view vs. Flat view; Empty space vs. Cluttered space; Mirror symmetry vs. No mirror symmetry (cf. [23])
Aesthetics: Post-card like? Buy this painting? Hang on wall? Is aesthetic? Pleasant vs. Unpleasant; Unusual or strange vs. Routine or mundane; Boring vs. Striking colors; High quality (expert photography) vs. Poor quality photo; Attractive vs. Dull photo; Memorable vs. Not memorable; Sky present? Clear vs. Cloudy sky; Blue vs. Sunset sky;
Picture of mainly one object vs. Whole scene; Single focus vs. Many foci; Zoomed-in vs. Zoomed-out; Top-down view vs. Side view (cf. [7])
Emotions: Frightening? Arousing? Funny? Engaging? Peaceful? Exciting? Interesting? Mysterious? Strange? Striking? Makes you happy? Makes you sad?
Dynamics: Action going on? Something moving in scene? Picture tells a story? About to happen? Lot going on? Dynamic scene? Static scene? Have a lot to say; Length of description
Location: Famous place? Recognize place? Like to be present in scene? Many people go here? Contains a person?

For further analysis, we utilize the most frequent 106 of the ~1300 objects present in the images (their presence, count, area in the image, and, for a subset of these objects, the area occupied in the four quadrants of the image), 237 of the ~700 scene categories, and the 127 attributes listed in Tables 1 and 2. We also append the image annotations with a scene hierarchy provided with the SUN dataset [32] that groups similar categories into meta-categories (e.g., indoor), as well as an object hierarchy derived from WordNet [9] that includes meta-categories such as organism and furniture. The scene hierarchy resulted in 19 additional scene meta-categories, while the object hierarchy resulted in 134 additional meta-categories. From here on, we will refer to all these annotations as features. We have a total of 923 features. The goal now is to determine a concise subset of these features that characterizes the memorability of an image. Since all our features are human-interpretable, this allows us to gain an understanding of what makes an image memorable. Figure 4 shows the correlation of different feature types with memorability.

Figure 4: Correlation of attribute, scene, and object annotations with memorability. We see that the attributes are most strongly correlated with memorability. Many of the features are correlated with each other (e.g., face visible and eye contact), suggesting a need for our feature selection strategy to have explaining-away properties. [Plot: 922 features sorted by the magnitude of their correlation with memorability (maximum ~0.36), grouped as attributes, scenes, and objects; the top features include enclosed space, person: face visible, person: eye contact, number of people in image, and sky.]

4 Feature selection

Our goal is to identify a compact set of features that characterizes the memorability of an image. We note that several of our features are redundant: some by design (such as pleasant and aesthetic), to better capture subjective notions, but others due to contextual relationships that prevail in our visual world (e.g., outdoor images typically contain sky). Hence, it becomes crucial that our feature selection algorithm have explaining-away properties, so as to determine a set of distinct characteristics that make an image memorable. Not only is this desirable from the Occam's razor view, it is also practical from an applications standpoint.

Moreover, we note that some features in our set subsume other features. For example, since the person attributes (e.g., hair color) are only labeled for images containing people, they include the person presence/absence information in them. If a naive feature selection approach picked 'hair color' as an informative feature, it would be unclear whether the mere presence or absence of a person in the image is what contributes to memorability, or if the color of the hair really matters. This issue of miscalibration of the information contained in a feature also manifests itself in a more subtle manner.
Our set of features includes inherently multi-valued information (e.g., the mood of the image), as well as inherently binary information, such as "a car is present in the image". It is important to calibrate the features by the amount of information captured by them.

Table 2: Attributes describing people in the image

Visibility (per-person): Face visible? Making eye contact?
Demographics (per-person): Gender (male, female)? Age (child, teenager, adult, senior)? Race (Caucasian, SouthEast-Asian, East-Asian, African-American, Hispanic)?
Appearance (per-person): Hair length (short, medium, long, bald)? Hair color (blonde, black, brown, red, grey)? Facial hair?
Clothing (per-person): Attire (casual, business-casual, formal)? Shirt? T-shirt? Blouse? Tie? Jacket? Sweater? Sweat-shirt? Skirt? Trousers? Shorts? A uniform?
Accessories (per-person): Dark eye-glasses? Clear eye-glasses? Hat? Earrings? Watch? Wrist jewelry? Neck jewelry? Belt? Finger ring(s)? Make-up?
Activity (per-person): Standing? Sitting? Walking? Running? Working? Smiling? Eating? Clapping? Engaging in art? Professional activity? Buying? Selling? Giving a speech? Holding?
Activity (whole image): Sports? Adventurous? Tourist? Engaging in art? Professional? Group?
Subject (whole image): Audience? Crowd? Group? Couple? Individual? Individuals interacting?
Scenario (whole image): Routine/mundane? Unusual/strange? Pleasant? Unpleasant? Top-down?

Employing an information-theoretic approach to feature selection allows us to naturally capture both of these goals: selecting a compact set of non-redundant features, and calibrating features based on the information they contain.

4.1 Information-theoretic

We formulate our problem as that of selecting features that maximize mutual information with memorability, such that the total number of bits required to encode all selected features (i.e., the number of bits required to describe an image using the selected features) does not exceed B. Formally,

F* = arg max_F I(F; M)   s.t.   C(F) ≤ B        (1)

where F is a subset of the features, I(F; M) is the mutual information between F and memorability M, B is the budget (in bits), and C(F) is the total number of bits required to encode F. We assume that each feature is encoded independently, and thus

C(F) = Σ_{i=1}^{n} C(f_i),   f_i ∈ F        (2)

where C(f_i) is the number of bits required to encode feature f_i, computed as H(f_i), the entropy of feature f_i across the training images.

This optimization is combinatorial in nature and is NP-hard to solve. Fortunately, the work of Krause et al. [17] and Leskovec et al. [19] provides us with a computationally feasible algorithm to solve the problem. Krause et al. [17] showed mutual information to be a submodular function. A greedy optimization scheme to maximize submodular functions was shown to be optimal up to a constant approximation factor of (1 - 1/e); i.e., no polynomial-time algorithm can provide a tighter bound. Subsequently, Leskovec et al. [19] presented a similar greedy algorithm to select features where each feature has a different cost associated with it (as in our set-up). The algorithm selects features with the maximum ratio of improvement in mutual information to their cost, while the total cost of the features does not exceed the allotted budget. In parallel, the cost-less version of the greedy algorithm is also used to select features (still not exceeding the budget). Finally, of the two, the set of features that provides the higher mutual information is retained. This solution is at most a constant factor (1/2)(1 - 1/e) away from the optimal solution [19]. Moreover, Leskovec et al. [19] also provided a lazy evaluation scheme that provides significant computational benefits in practice, while still maintaining the bound.

However, this lazy-greedy approach still requires the computation of mutual information between memorability and subsets of features. At each iteration, the additional information provided by a candidate feature f_i over an existing set of features F would be the following:

I_G(f_i) = I(F ∪ f_i; M) − I(F; M)        (3)

This computation is not feasible given our large number of features and limited training data. Hence, we greedily add features that maximize an approximation to the mutual information between a subset of features and memorability, as also employed by Ullman et al. [31]. The additional information provided by a candidate feature f_i over an existing set of features F is approximated as:

Î_G(f_i) = min_{f_j ∈ F} ( I(f_j ∪ f_i; M) − I(f_j; M) )        (4)

The ratio of this approximation to the cost of the feature is used as the score to evaluate the usefulness of features during greedy selection. Intuitively, this ensures that the feature selected at each iteration maximizes the per-bit minimal gain in mutual information over each of the individual features already selected.

In order to maximize the mutual information (approximation) beyond what the greedy algorithm achieves, we employ multiple passes over the feature set. Given a budget B, we first greedily add features using a budget of 2B, and then greedily remove the features that reduce the mutual information the least, until we fall within the allotted budget B. This allows features that were added greedily early in the forward pass, but are explained away by subsequently added features, to be dropped. These forward and backward passes are repeated 4 times each. Note that at each pass the objective function cannot decrease, and the final solution is still guaranteed to have a total cost within the allotted budget B.
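To make this concrete, here is a minimal sketch of the budgeted greedy selection with the approximate gain of Eq. (4), using histogram estimators with the bin counts reported in Section 5 (7 bins per feature, 10 for memorability). All function and variable names are ours; the lazy evaluation scheme, the parallel cost-agnostic run, and the forward/backward passes are omitted for brevity.

import numpy as np

def discretize(x, bins):
    # Equal-width binning of a 1-D array into integer codes 0..bins-1.
    edges = np.histogram_bin_edges(x, bins=bins)
    return np.clip(np.digitize(x, edges[1:-1]), 0, bins - 1)

def entropy(codes):
    # Plug-in entropy estimate, in bits, from integer codes.
    p = np.bincount(codes) / len(codes)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_info(f_codes, m_codes):
    # I(F; M) = H(F) + H(M) - H(F, M), via a joint code.
    joint = f_codes * (int(m_codes.max()) + 1) + m_codes
    return entropy(f_codes) + entropy(m_codes) - entropy(joint)

def select(features, memorability, budget, n_bins=7, m_bins=10):
    # Greedy selection scoring candidates by the approximate gain of Eq. (4)
    # divided by their cost C(f) = H(f) from Eq. (2).
    F = {k: discretize(np.asarray(v, float), n_bins) for k, v in features.items()}
    M = discretize(np.asarray(memorability, float), m_bins)
    cost = {k: entropy(c) for k, c in F.items()}
    mi = {k: mutual_info(c, M) for k, c in F.items()}
    selected, spent = [], 0.0
    while True:
        best, best_score = None, -np.inf
        for name, c in F.items():
            if name in selected or spent + cost[name] > budget:
                continue
            if not selected:
                gain = mi[name]
            else:
                # Eq. (4): minimal gain over each already-selected feature,
                # where F[j] * n_bins + c jointly codes the feature pair.
                gain = min(mutual_info(F[j] * n_bins + c, M) - mi[j]
                           for j in selected)
            score = gain / max(cost[name], 1e-9)
            if score > best_score:
                best, best_score = name, score
        if best is None:
            return selected
        selected.append(best)
        spent += cost[best]

Here, features maps each annotation name to its per-image values; e.g., select(features, memorability, budget=16) would return the names of the chosen features.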
4.2 Predictive

The behavior of the above approximation to mutual information has not been formally studied. While it may provide a good means to prune out many candidate features, it is unclear how close to optimal the selections will be. Feature selection within the realm of a predictive model allows us to better capture features that achieve a concrete and practical measure of performance: "which set of features allows us to make the best predictions about an image's memorability?" While selecting such features would be computationally expensive to do over all 923 of our features, using a pruned set of features obtained via information-theoretic selection makes this feasible. We employ a support vector regressor (SVR, [28]) as our predictive model.

Given a set of features selected by the information-theoretic method above, we greedily select features (again, while maintaining a budget) that provide the biggest boost in regression performance (Spearman's rank correlation between predicted and ground-truth memorabilities) over the training set. The same cost-based lazy-greedy selection algorithm is used as above, except with only a single pass over the feature set. This is inspired by the recent work of Das et al. [6], who analyzed the performance of greedy approaches for maximizing submodular-like functions. They found that the submodularity ratio of a function is the best predictor of how well a greedy algorithm performs. Moreover, they found that, in practice, regression performance has a high submodularity ratio, justifying the use of a greedy approach.

An alternative to greedy feature selection would be to learn a sparse regressor. However, the parameter that controls the sparsity of the weight vector is neither intuitive nor interpretable. In the greedy feature selection approach, the budget of bits, which is interpretable, can be explicitly enforced.
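A sketch of this predictive stage, under stated assumptions: the pruned candidate matrix, per-feature bit costs C(f) = H(f), and memorability scores are given; we stand in scikit-learn's default RBF-kernel SVR for the LIBSVM-based ε-SVR [4], use simple random half/half splits in place of the paper's exact protocol, and stop when no affordable feature improves the score (a simplification).

import numpy as np
from scipy.stats import spearmanr
from sklearn.svm import SVR

def cv_rank_corr(X, y, n_trials=3, seed=0):
    # Mean Spearman correlation over random half/half train/test splits.
    rng = np.random.default_rng(seed)
    rhos = []
    for _ in range(n_trials):
        idx = rng.permutation(len(y))
        tr, te = idx[: len(y) // 2], idx[len(y) // 2 :]
        pred = SVR().fit(X[tr], y[tr]).predict(X[te])
        rhos.append(spearmanr(pred, y[te])[0])
    return float(np.mean(rhos))

def predictive_select(X, y, names, cost, budget):
    # Greedy selection by per-bit improvement in rank correlation.
    # X: images x candidate features; names: column names, in order;
    # cost: bits per feature; budget: total bits allowed.
    selected, spent, cur_rho = [], 0.0, 0.0
    while True:
        best, best_score, best_rho = None, -np.inf, None
        for j, name in enumerate(names):
            if name in selected or spent + cost[name] > budget:
                continue
            cols = [names.index(s) for s in selected] + [j]
            rho = cv_rank_corr(X[:, cols], y)
            score = (rho - cur_rho) / max(cost[name], 1e-9)
            if score > best_score:
                best, best_score, best_rho = name, score, rho
        if best is None or best_rho <= cur_rho:
            return selected
        selected.append(best)
        spent += cost[best]
        cur_rho = best_rho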
5 Results

Attribute annotations help: We first tested the degree to which each general feature-type annotation in our feature set is effective at predicting memorability. We split the dataset from [13] into 2/3 training images, scored by half the subjects, and 1/3 test images, scored by the left-out half of the subjects. We trained ε-SVRs [4] to predict memorability, using grid search to select the cost and ε hyperparameters. For the new attributes we introduced, and for the object and scene hierarchy features, we used RBF kernels, while for the rest of the features we used the same kernel functions as in [13]. We report performance as Spearman's rank correlation (ρ) between predicted and ground-truth memorabilities, averaged over 10 random splits of the data.

Results are shown in Table 3. We found that our new attribute annotations performed quite well (ρ = 0.528): they outperform the higher-dimensional object and scene annotations.

Table 3: Performance (rank correlation) of different types of features at predicting image memorability.

Feature type                      Perf.
Object annotations                0.494
Scene annotations                 0.415
Attribute annotations             0.528
Objects + Scenes + Attributes     0.554

Feature selection: We next selected the individual best features in our set according to the feature selection algorithms described above. To compute feature entropies and mutual information, we used histogram estimators on our training data, with 7 bins per feature and 10 bins for memorability. Using these estimators, and measuring feature set cost according to (2), our entire set of 923 features has a total cost of 252 bits. We selected reduced feature sets by running both information-theoretic selection and predictive selection on our 2/3 training splits, for budgets ranging from 1 to 100 bits.

For predictive selection, we further split our training set in half and trained SVRs on one half to predict memorability on the other half. At each iteration of selection, we greedily selected the feature that maximized predictive performance averaged over 3 random-split trials, with predictive performance again measured as the rank correlation between predictions and ground-truth memorabilities. Since predictive selection is computationally expensive, we reduced our candidate feature set by first pruning with information-theoretic selection. We took as candidates the union of all features that were selected using our information-theoretic approach for budgets of 1, 2, ..., 100 bits. Taking this union, rather than just the features selected at a 100-bit budget, ensures that candidates are not missed when they are only effective in small-budget sets.
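The train/evaluate protocol used throughout this section can be sketched as follows (a hypothetical wiring using scikit-learn; the original experiments used LIBSVM [4], per-feature-type kernels, and their own grid values, none of which are reproduced here):

import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

def rank_corr(y_true, y_pred):
    return spearmanr(y_true, y_pred)[0]

def evaluate(X, y, n_splits=10):
    # Mean held-out Spearman's rho over random 2/3-1/3 splits, with the
    # cost (C) and epsilon hyperparameters chosen by grid search.
    rhos = []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=1 / 3, random_state=seed)
        search = GridSearchCV(
            SVR(kernel="rbf"),
            param_grid={"C": [0.1, 1, 10, 100],        # illustrative grid
                        "epsilon": [0.01, 0.05, 0.1]},  # illustrative grid
            scoring=make_scorer(rank_corr))
        search.fit(X_tr, y_tr)
        rhos.append(rank_corr(y_te, search.predict(X_te)))
    return float(np.mean(rhos))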
Figure 5: Regression performance vs. log bit budget for the various types of feature selection. The diminishing-returns (submodular-like) behavior is evident. [Plot: Spearman rank correlation (0.10-0.55) vs. log2 bit budget (1-7) for predictive, information-theoretic, and random selection.]

Next, we validated our selections on our 1/3 test set. We trained SVRs using each of our selected feature sets and made predictions on the test set. Both selection algorithms create feature sets that are similarly effective at predicting memorability (Figure 5). Using just a 16-bit budget, information-theoretic selection achieves ρ = 0.472, and predictive selection achieves ρ = 0.490 (this budget resulted in selected sets of 6 to 11 features). This performance is comparable to the performance we get using much costlier features, such as our full list of object annotations (540 features, ~106 bits, ρ = 0.490). As a baseline, we also compared against randomly selecting feature sets up to the same budget, which, for 16 bits, gives only ρ = 0.119.

We created a final list of features by running the above feature selection methods on the entire dataset (no held-out data) for a budget of 10 bits. This produced the sets listed in Table 4. If one is trying to understand memorability, these features are a good place to start. In Figure 6, we explore these features further by hierarchically clustering our images according to the predictive set. Each cluster can be thought of as specifying a type of image with respect to memorability. For example, on the far right we have highly memorable "pictures of people in an enclosed space" and on the far left we have forgettable "peaceful, open, unfamiliar spaces, devoid of people."

Table 4: Information-theoretic and predictive feature selections for a budget of 10 bits. Correlations with memorability are listed after each feature (the arrow indicates the direction of correlation). Selections and correlations were run on the entire dataset.

Information-theoretic: ↑ enclosed space (0.39); ↑ face visible (0.37); ↓ peaceful (-0.33); ↓ sky present (-0.35)
Predictive: ↑ enclosed space (0.39); ↑ face visible (0.37); ↑ tells a story (0.18); ↑ recognize place (0.16); ↓ peaceful (-0.33)

Figure 6: Hierarchical clustering of images in 'memorability space', as achieved via a regression tree [2], along with example images from each cluster. The memorability of each cluster is given at the leaf nodes, and is also depicted as the shade of the cluster image borders (darker borders correspond to lower memorability than brighter borders). [Tree: splits on enclosed_space > 0.47, face_visible > 0.21/0.30/0.47, peaceful > 0.75, and recognize_place > 0.45/0.55; leaf memorabilities range from 0.57 to 0.85.]

Automatic prediction: While our focus in this paper is on understanding memorability, we hope that by understanding the phenomenon we may also be able to build better automatic predictors of it. The only previous work predicting memorability is our recent paper [13]. In that paper, we made predictions on the basis of a suite of global image features: pixel histograms, GIST, SIFT, HOG, and SSIM [13]. Running the same methods on our current 2/3 data splits achieves ρ = 0.468. Here we attempt to do better by using our selected features as an abstraction layer between raw images and memorability. We trained a suite of SVRs to predict annotations from images, and another SVR to predict memorability from these predicted annotations. For the image features, we used the same methods as [13]. For the annotation types, we used the feature types selected by our 100-bit predictive selection on the 2/3 training sets. To predict the annotations for each image in our training set, we split the training set in half and predicted annotations for one half by training on the other half, and vice versa, covering both halves with predictions. We then trained a final SVR to predict memorability on the test set in three ways: 1) using only image features (Direct), 2) using only predicted annotations (Indirect), and 3) using both (Direct + Indirect) (Table 5). Combining indirect predictions with direct predictions performed best (ρ = 0.479), slightly outperforming the direct prediction method of our previous work [13] (ρ = 0.468).

Table 5: Performance (rank correlation) of automatic memorability prediction methods.

Features              Perf.
Direct [13]           0.468
Indirect              0.436
Direct + indirect     0.479
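A minimal sketch of this two-stage 'indirect' pipeline (the variable names and default RBF kernels are our assumptions; the global image features and kernel functions of [13] are not reproduced here):

import numpy as np
from sklearn.svm import SVR

def predict_annotations(X_tr, A_tr, X_te):
    # One SVR per annotation dimension: image features -> annotation value.
    return np.column_stack([SVR().fit(X_tr, A_tr[:, k]).predict(X_te)
                            for k in range(A_tr.shape[1])])

def indirect_predict(X_train, A_train, y_train, X_test):
    # The half-swap mirrors the text: each training half's annotations are
    # predicted by models fit on the other half, so the final regressor is
    # trained on predicted (not ground-truth) annotations.
    n = len(X_train)
    h1, h2 = np.arange(n // 2), np.arange(n // 2, n)
    A_hat = np.empty_like(A_train, dtype=float)
    A_hat[h2] = predict_annotations(X_train[h1], A_train[h1], X_train[h2])
    A_hat[h1] = predict_annotations(X_train[h2], A_train[h2], X_train[h1])
    final = SVR().fit(A_hat, y_train)          # annotations -> memorability
    A_test = predict_annotations(X_train, A_train, X_test)
    return final.predict(A_test)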
6 Conclusion

The goal of this work was to characterize the aspects of an image that make it memorable. Understanding these characteristics is crucial for anyone hoping to work with memorability, be they psychologists, advertisement designers, or photographers. We augmented the object and scene annotations of the dataset of Isola et al. [13] with attribute annotations describing the spatial layout, content, and aesthetic properties of the images. We employed a greedy feature selection scheme to obtain compact lists of features that are highly informative about memorability and highly predictive of memorability. We found that images of enclosed spaces containing people with visible faces are memorable, while images of vistas and peaceful settings are not. Contrary to popular belief, unusualness and aesthetic beauty attributes are not associated with high memorability (in fact, they are negatively correlated with it), and these attributes are not among our top few selections, indicating that other features more concisely describe memorability (Figure 4).

Through this work, we have begun to uncover some of the core features that contribute to image memorability. Understanding how these features interact to actually produce memories remains an important direction for future research. We hope that by parsing memorability into a concise and understandable set of attributes, we have provided a description that will interface well with other domains of knowledge and may provide fodder for future theories and applications of memorability.

Acknowledgements: We would like to thank Jianxiong Xiao for providing the global image features. This work is supported by the National Science Foundation under Grant No. 1016862 to A.O., and CAREER Awards No. 0546262 to A.O. and No. 0747120 to A.T. A.T. was supported in part by the Intelligence Advanced Research Projects Activity via Department of the Interior contract D10PC20023, and by ONR MURI N000141010933.

References
[1] T. F. Brady, T. Konkle, G. A. Alvarez, and A. Oliva. Visual long-term memory has a massive storage capacity for object details. Proceedings of the National Academy of Sciences, 2008.
[2] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Boca Raton, FL: CRC Press, 1984.
[3] G. D. A. Brown, I. Neath, and N. Chater. A temporal ratio model of memory. Psychological Review, 2007.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001.
[5] D. Cohen-Or, O. Sorkine, R. Gal, T. Leyvand, and Y.-Q. Xu. Color harmonization. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH), 2006.
[6] A. Das and D. Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. arXiv:1102.3975v2 [stat.ML], 2011.
[7] S. Dhar, V. Ordonez, and T. L. Berg. High level describable attributes for predicting aesthetics and interestingness. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[8] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[9] C. Fellbaum. WordNet: An Electronic Lexical Database. The MIT Press, 1998.
[10] B. Gooch, E. Reinhard, C. Moulding, and P. Shirley. Artistic composition for image creation. In Eurographics Workshop on Rendering, 2001.
[11] M. W. Howard and M. J. Kahana. A distributed representation of temporal context. Journal of Mathematical Psychology, 2001.
[12] R. R. Hunt and J. B. Worthen. Distinctiveness and Memory. New York: Oxford University Press, 2006.
[13] P. Isola, J. Xiao, A. Torralba, and A. Oliva. What makes an image memorable? In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[14] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998.
[15] T. Konkle, T. F. Brady, G. A. Alvarez, and A. Oliva. Conceptual distinctiveness supports detailed visual long-term memory for real-world objects. Journal of Experimental Psychology: General, 2010.
[16] T. Konkle, T. F. Brady, G. A. Alvarez, and A. Oliva. Scene memory is more detailed than you think: the role of categories in visual long-term memory. Psychological Science, 2010.
[17] A. Krause and C. Guestrin. Near-optimal nonmyopic value of information in graphical models. In Conference on Uncertainty in Artificial Intelligence, 2005.
[18] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[19] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007.
[20] T. Leyvand, D. Cohen-Or, G. Dror, and D. Lischinski. Data-driven enhancement of facial attractiveness. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH 2008), 2008.
[21] Y. Luo and X. Tang. Photo and video quality evaluation: Focusing on the subject. In European Conference on Computer Vision, 2008.
[22] J. L. McClelland, B. L. McNaughton, and R. C. O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 1995.
[23] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision, 2001.
[24] L. Renjie, C. L. Wolf, and D. Cohen-Or. Optimizing photo composition. Technical report, Tel-Aviv University, 2010.
[25] I. Rock and P. Englestein. A study of memory for visual form. The American Journal of Psychology, 1959.
[26] B. C. Russell, A. Torralba, K. Murphy, and W. T. Freeman. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 2008.
[27] R. M. Shiffrin and M. Steyvers. A model for recognition memory: REM - retrieving effectively from memory. Psychonomic Bulletin and Review, 1997.
[28] A. J. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14:199-222, 2004.
[29] M. Spain and P. Perona. Some objects are more equal than others: measuring and predicting importance. In Proceedings of the European Conference on Computer Vision, 2008.
[30] L. Standing. Learning 10,000 pictures. Quarterly Journal of Experimental Psychology, 1973.
[31] S. Ullman, M. Vidal-Naquet, and E. Sali. Visual features of intermediate complexity and their use in classification. Nature Neuroscience, 2002.
[32] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.
", "award": [], "sourceid": 1297, "authors": [{"given_name": "Phillip", "family_name": "Isola", "institution": null}, {"given_name": "Devi", "family_name": "Parikh", "institution": null}, {"given_name": "Antonio", "family_name": "Torralba", "institution": null}, {"given_name": "Aude", "family_name": "Oliva", "institution": null}]}