{"title": "Visual Concept-Metaconcept Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 5001, "page_last": 5012, "abstract": "Humans reason with concepts and metaconcepts: we recognize red and blue from visual input; we also understand that they are colors, i.e., red is an instance of color. In this paper, we propose the visual concept-metaconcept learner (VCML) for joint learning of concepts and metaconcepts from images and associated question-answer pairs. The key is to exploit the bidirectional connection between visual concepts and metaconcepts. Visual representations provide grounding cues for predicting relations between unseen pairs of concepts. Knowing that red and blue are instances of color, we generalize to the fact that green is also an instance of color since they all categorize the hue of objects. Meanwhile, knowledge about metaconcepts empowers visual concept learning from limited, noisy, and even biased data. From just a few examples of purple cubes we can understand a new color purple, which resembles the hue of the cubes instead of the shape of them. Evaluation on both synthetic and real-world datasets validates our claims.", "full_text": "Visual Concept-Metaconcept Learning\n\nChi Han\u2217\n\nMIT CSAIL and IIIS, Tsinghua University\n\nJiayuan Mao\u2217\nMIT CSAIL\n\nChuang Gan\n\nMIT-IBM Watson AI Lab\n\nJoshua B. Tenenbaum\n\nMIT BCS, CBMM, CSAIL\n\nJiajun Wu\nMIT CSAIL\n\nAbstract\n\nHumans reason with concepts and metaconcepts: we recognize red and green\nfrom visual input; we also understand that they describe the same property of\nobjects (i.e., the color). In this paper, we propose the visual concept-metaconcept\nlearner (VCML) for joint learning of concepts and metaconcepts from images\nand associated question-answer pairs. The key is to exploit the bidirectional\nconnection between visual concepts and metaconcepts. Visual representations\nprovide grounding cues for predicting relations between unseen pairs of concepts.\nKnowing that red and green describe the same property of objects, we generalize to\nthe fact that cube and sphere also describe the same property of objects, since they\nboth categorize the shape of objects. Meanwhile, knowledge about metaconcepts\nempowers visual concept learning from limited, noisy, and even biased data. From\njust a few examples of purple cubes we can understand a new color purple, which\nresembles the hue of the cubes instead of the shape of them. Evaluation on both\nsynthetic and real-world datasets validates our claims.\n\n1\n\nIntroduction\n\nLearning to group objects into concepts is an essential human cognitive process, supporting composi-\ntional reasoning over scenes and sentences. To facilitate learning, we have developed metaconcepts\nto describe the abstract relations between concepts [Speer et al., 2017, McRae et al., 2005]. Learning\nboth concepts and metaconcepts involves categorization at various levels, from concrete visual at-\ntributes such as red and cube, to abstract relations between concepts, such as synonym and hypernym.\nIn this paper, we focus on the problem of learning visual concepts and metaconcepts with a linguistic\ninterface, from looking at images and reading paired questions and answers.\nFigure 1a gives examples of concept learning and metaconcept learning in the context of answering\nvisual reasoning questions and purely textual questions about metaconcepts. 
We learn to distinguish\nred objects from green ones by their hues, by looking at visual reasoning (type I) examples. We also\nlearn metaconcepts, e.g., red and green describe the same property of objects, by reading metaconcept\n(type II) questions and answers.\nConcept learning and metaconcept learning help each other. Figure 1b illustrates the idea. First,\nmetaconcepts enable concept learning from limited, noisy, and even biased examples, with general-\nization to novel compositions of attributes at test time. Assuming only a few examples of red cubes\nwith the label red, the visual grounding of the word red is ambiguous: it may refer to the hue red\nor cube-shaped objects. We can resolve such ambiguities knowing that red and green describe the\nsame property of objects. During test (Figure 1b-I), we can then generalize to red cylinders. Second,\nconcept learning provides visual cues for predicting relations between unseen pairs of concepts.\nAfter learning that red and green describe the same property of objects, one may hypothesize a\n\nFirst two authors contributed equally. Work was done when Chi Han was a visiting student at MIT CSAIL.\nProject Page: http://vcml.csail.mit.edu.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Our model learns concepts and metaconcepts from images and two types of questions. The\nlearned knowledge helps visual concept learning (generalizing to unseen visual concept compositions,\nor to concepts with limited visual data) and metaconcept generalization (generalizing to relations\nbetween unseen pairs of concepts.)\n\ngeneralization of the notion \u201csame property\u201d to cube and sphere, since they both categorize objects\nby their shapes (in comparison to red and green both categorizing the hue, sees Figure 1b-II).\nBased on this observation, we propose the visual concept-metaconcept learner (VCML, see Figure 2)\nfor joint learning of visual concepts (red and cube) and metaconcepts (e.g., red and green describe\nthe same property of objects). VCML consists of three modules: a visual perception module\nextracting an object-based representation of the input image, a semantic parser translating the natural\nlanguage question into a symbolic program, and a neuro-symbolic program executor executing the\nprogram based on the visual representation to answer the question. VCML bridges the learning of\nvisual concepts and metaconcepts with a latent vector space. Concepts are associated with vector\nembeddings, whereas metaconcepts are neural operators that predict relations between concepts.\nBoth concept embeddings and metaconcept operators are learned by looking at images and reading\nquestion-answer pairs. Our training data are composed of two parts: 1) questions about the visual\ngrounding of concepts (e.g., is there any red cube?), and 2) questions about the abstract relations\nbetween concepts (e.g., do red and green describe the same property of objects?).\nVCML generalizes well in two ways, by learning from the two types of questions. It can successfully\ncategorize objects with new combinations of visual attributes, or objects with attributes with limited\ntraining data (Figure 1b, type I); it can also predict relations between unseen pairs of concepts\n(Figure 1b, type II). 
We present a systematic evaluation on both synthetic and real-world images, with a focus on learning efficiency and strong generalization.

2 Related Work

Learning visual concepts from language or other forms of symbols, such as class labels or tags, serves as a prerequisite for a broad set of downstream visual-linguistic applications, including cross-modal retrieval [Kiros et al., 2014], caption generation [Karpathy and Fei-Fei, 2015], and visual question answering [Malinowski et al., 2015]. The existing literature has focused on improving visual concept learning by introducing new representations [Wu et al., 2017], new forms of supervision [Johnson et al., 2016, Ganju et al., 2017], new training algorithms [Faghri et al., 2018, Shi et al., 2018], structured and geometric embedding spaces [Ren et al., 2016, Vendrov et al., 2016], and extra knowledge bases [Thoma et al., 2017].

Our model learns visual concepts by reasoning over question-answer pairs. Prior works on visual reasoning have proposed to use end-to-end neural networks for jointly learning visual concepts and reasoning [Malinowski et al., 2015, Yang et al., 2016, Xu and Saenko, 2016, Andreas et al., 2016, Gan et al., 2017, Mascharka et al., 2018, Hudson and Manning, 2018, Hu et al., 2018], whereas some recent papers [Yi et al., 2018, Mao et al., 2019, Yi et al., 2019] attempt to disentangle visual concept learning from reasoning. The disentanglement brings better data efficiency and generalization.

In this paper, we study the new challenge of incorporating metaconcepts, i.e., relational concepts about concepts, into visual concept learning. Beyond just learning from questions regarding visual scenes (e.g., is there any red cube?), our model learns from questions about metaconcepts (e.g., do red and yellow describe the same property of objects?). Both concepts and metaconcepts are learned with a unified neuro-symbolic reasoning process. Our evaluation focuses on revealing bidirectional connections between visual concept learning and metaconcept learning. Specifically, we study both visual compositional generalization and metaconcept generalization (i.e., relational knowledge generalization).

Figure 2: The Visual Concept-Metaconcept Learner. The model comprises three modules: (a) a perception module for extracting object-based visual representations, (b) a semantic parsing module for recovering latent programs from natural language, and (c) a neuro-symbolic reasoning module that executes the program to answer the question.

Visual compositional generalization focuses on exploiting the compositional nature of visual categories. For example, the compositional concept red cube can be factorized into two concepts: red and cube. Such compositionality suggests the ability to generalize to unseen combinations of concepts: e.g., from red cube and yellow sphere to red sphere. Many approaches towards compositional visual concept learning have been proposed, including compositional embeddings [Misra et al., 2017], neural operators [Nagarajan and Grauman, 2018], and neural module networks [Purushwalkam et al., 2019]. 
In this paper, we go one step further towards compositional visual concept learning by\nintroducing metaconcepts, which empower learning from biased data. As an example, by looking at\nonly examples of red cubes, our model can recover the visual concept red accurately, as a category of\nchromatic color, and generalizes to unseen compositions such as red sphere.\nGeneralizing from known relations between concepts to unseen pairs of concepts can be cast as an\ninstance of relational knowledge base completion. The existing literature has focused on learning\nvector embeddings from known knowledge [Socher et al., 2013, Bordes et al., 2013, Wang et al., 2014],\nrecovering logic rules between metaconcepts [Yang et al., 2017], and learning from corpora [Lin\net al., 2017]. In this paper, we propose a visually-grounded metaconcept generalization framework.\nThis allows our model, for example, to generalize from red and yellow describe the same property of\nobjects to green and yellow also describe the same property of objects by observing that both red and\ngreen classify objects by their hues based on vision.\n\n3 Visual Concept-Metaconcept Learning\n\nWe present the visual concept-metaconcept learner (VCML), a uni\ufb01ed framework for learning visual\nconcepts and metaconcepts by reasoning over questions and answers about scenes. It answers both\nvisual reasoning questions (e.g., is there any red object?) and text-only metaconcept questions (e.g.,\ndo red and green describe the same property of objects?) with a uni\ufb01ed neuro-symbolic framework.\nVCML comprises three modules (Figure 2):\n\u2022 A perception module (Figure 2a) for extracting an object-based representation of the scene, where\n\neach object is represented as a vector embedding of a \ufb01xed dimension.\n\n\u2022 A semantic parsing module (Figure 2b) for translating the input question into a symbolic executable\nprogram. Each program consists of hierarchical primitive functional modules for reasoning.\nConcepts are represented as vector embeddings in a latent vector space, whereas metaconcepts\nare small neural networks predicting relations between concepts. The concept embeddings are\nassociated with the visual representations of objects.\n\n\u2022 A neuro-symbolic reasoning module (Figure 2c) for executing the program to answer the question,\nbased on the scene representation, concept embeddings, and metaconcept operators. 
During training, it also receives the ground-truth answer as the supervision and back-propagates the training signals.

Table 1: Our extension of the visual-reasoning DSL [Johnson et al., 2017, Hudson and Manning, 2019], including one new primitive function for metaconcepts.

MetaVerify
  Signature: Concept, Concept, MetaConcept → Bool
  Semantics: Returns whether the two input concepts have the specified metaconcept relation.
  Example:   MetaVerify(Sphere, Ball, Synonym) → True

3.1 Motivating Examples

In Figure 2, we illustrate VCML by walking through two motivating examples of visual reasoning and metaconcept reasoning.

Visual reasoning. Given the input image, the perception module generates object proposals for two objects and extracts vector representations for them individually. Meanwhile, the question is there any red object is translated by the semantic parsing module into a two-step program: Exist(Filter(red)). The neuro-symbolic reasoning module executes the program. It first computes the similarities between the concept embedding red and the object embeddings to classify both objects. Then, it answers the question by checking whether a red object has been filtered out.

Metaconcept questions. Metaconcept questions are text-only. Each metaconcept question queries a metaconcept relation between a pair of concepts. Specifically, consider the question do red and green describe the same property of objects. We say that two concepts are related by the same_kind metaconcept if they describe the same property of objects. To answer this question, we first run the semantic parsing module to translate it into a symbolic program: MetaVerify(red, green, same_kind). The neuro-symbolic reasoning module then answers the question by inspecting the latent embeddings of the two concepts (red and green) with the metaconcept operator for same_kind.

3.2 Model Details

Perception module. Given the input image, VCML builds an object-based representation of the scene. This is done by using a Mask R-CNN [He et al., 2017] to generate object proposals, followed by a ResNet-34 [He et al., 2015] to extract region-based feature representations of individual objects.

Semantic parsing module. The semantic parsing module takes the question as input and recovers a latent program. The program has a hierarchical structure of primitive operations, such as filtering out a set of objects with a specific concept, or evaluating whether two concepts are same_kind. The domain-specific language (DSL) of the semantic parser extends the DSL used by prior works on visual reasoning [Johnson et al., 2017, Hudson and Manning, 2019] by introducing new functional modules for metaconcepts (see Table 1).

Concept embeddings. The concept embedding space lies at the core of this model. It is a joint vector embedding space of object representations and visual concepts. Metaconcept operators are small neural networks built on top of it. 
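To make this organization concrete, below is a minimal sketch (not the authors' released implementation; the class name ConceptMetaconceptSpace, the hidden size, and the embedding dimension are hypothetical choices) of how concept embeddings and per-metaconcept classifiers could be held together in PyTorch:

```python
import torch
import torch.nn as nn

class ConceptMetaconceptSpace(nn.Module):
    """Sketch: every visual concept is a vector in R^N; every metaconcept is a
    small MLP that classifies a relation between a pair of concepts."""

    def __init__(self, concept_names, metaconcept_names, dim=64):
        super().__init__()
        self.concept_index = {name: i for i, name in enumerate(concept_names)}
        # One embedding vector per concept; object features from the perception
        # module are mapped into the same latent space.
        self.concept_embeddings = nn.Embedding(len(concept_names), dim)
        # One classifier per metaconcept (e.g., synonym, same_kind). It takes
        # three entailment-based statistics between two concepts (Section 3.2)
        # and outputs a relation score in [0, 1].
        self.metaconcept_mlps = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(3, 32), nn.ReLU(),
                                nn.Linear(32, 1), nn.Sigmoid())
            for name in metaconcept_names
        })

    def concept(self, name):
        # Look up the embedding vector of a concept by name.
        return self.concept_embeddings(torch.tensor(self.concept_index[name]))

# Usage sketch:
# space = ConceptMetaconceptSpace(["red", "green", "cube"], ["synonym", "same_kind"])
# red_embedding = space.concept("red")
```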
Figure 3 gives a graphical illustration of the embedding space.

We build the concept embedding space drawing inspiration from the order-embedding framework [Vendrov et al., 2016] and its extensions [Lai and Hockenmaier, 2017, Vilnis et al., 2018]. By explicitly defining a partial order or entailment probabilities between vector embeddings, these models are capable of learning a well-structured embedding space that captures certain kinds of relations among the concept embeddings.

In VCML, we design another probabilistic order embedding in a high-dimensional space R^N. Each object and each visual concept (e.g., red) is associated with an embedding vector x ∈ R^N. This vector defines a half-space V(x) = {y ∈ R^N | (y − x)^T x > 0}. We place a standard normal distribution N(0, I) over the whole space. We further define a denotational probability for each entity a, which can be either an object or a concept, with associated embedding vector x_a, as the measure of V_a = V(x_a) under this distribution:

    Pr(a) = Vol_{N(0,I)}(V_a) = ∫_{z ∈ V_a} (2π)^{−N/2} e^{−‖z‖²/2} dz = (1/2) [1 − erf(‖x_a‖₂ / √2)].

Similarly, the joint probability of two entities a, b can be computed as the measure of the intersection of their half-spaces:

    Pr(a, b) = Vol_{N(0,I)}(V_a ∩ V_b) = ∫_{z ∈ V_a ∩ V_b} (2π)^{−N/2} e^{−‖z‖²/2} dz.

Figure 3: A graphical illustration of our concept-metaconcept embedding space. Each concept or object is embedded as a high-dimensional vector, which is associated with a half-space supported by this vector. Each metaconcept is associated with a multi-layer perceptron as a classifier. The programs are executed based on the scene representation, concept embeddings and metaconcept operators.

We can therefore define the entailment probability between two entities as

    Pr(b | a) = Pr(a, b) / Pr(a) = Vol_{N(0,I)}(V_b ∩ V_a) / Vol_{N(0,I)}(V_a).

This entailment probability can then be used to give (asymmetric) similarity scores between objects and concepts. For example, to classify whether an object o is red, we compute Pr(object o is red) = Pr(red | o).

Metaconcept operators. For simplicity, we only consider metaconcepts defined over a pair of concepts, such as synonym and same_kind. Each metaconcept is associated with a multi-layer perceptron (e.g., f_synonym for the metaconcept synonym). To classify whether two concepts (e.g., red and cube) are related by a metaconcept (e.g., synonym), we first compute several denotational entailment probabilities between them. Mathematically, we define two helper functions g1 and g2:

    g1(a, b) = logit(Pr(a | b)),    g2(a, b) = ln [ Pr(a, b) / (Pr(a) Pr(b)) ],

where logit(·) is the logit function. These values are then fed into the perceptron to predict the relation:

    MetaVerify(red, cube, synonym) = f_synonym(g1(red, cube), g1(cube, red), g2(red, cube)).

Neuro-symbolic reasoning module. VCML executes the program recovered by the semantic parsing module with a neuro-symbolic reasoning module [Mao et al., 2019]. It contains a set of deterministic functional modules and does not require any training. The high-level idea is to relax the Boolean values during execution into soft scores ranging from 0 to 1. 
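Before walking through the execution, here is a small self-contained Python sketch of the quantities defined above and of the relaxed operations used below (the function names are hypothetical and this is not the authors' implementation; since no closed form is given for the joint probability, it is estimated here by Monte Carlo sampling):

```python
import numpy as np
from math import erf, log, sqrt

def concept_prob(x):
    # Pr(a): standard-normal measure of the half-space {y : (y - x)^T x > 0};
    # closed form: (1 - erf(||x||_2 / sqrt(2))) / 2.
    return 0.5 * (1.0 - erf(np.linalg.norm(x) / sqrt(2.0)))

def joint_prob(xa, xb, n_samples=200_000, seed=0):
    # Pr(a, b): measure of the intersection of the two half-spaces, estimated
    # by Monte Carlo since no closed form is given above.
    z = np.random.default_rng(seed).standard_normal((n_samples, xa.shape[0]))
    return float(np.mean(((z - xa) @ xa > 0) & ((z - xb) @ xb > 0)))

def entailment_prob(xa, xb):
    # Pr(a | b) = Pr(a, b) / Pr(b): the asymmetric similarity score.
    return joint_prob(xa, xb) / concept_prob(xb)

def logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)
    return log(p / (1.0 - p))

def filter_op(object_embeddings, concept_embedding):
    # Soft Filter(concept): one score per object, Pr(concept | object).
    return np.array([entailment_prob(concept_embedding, o) for o in object_embeddings])

def exist_op(scores):
    # Soft Exist(.): the max filtered score answers "is there any ...?".
    return float(scores.max())

def meta_verify(xa, xb, metaconcept_mlp):
    # MetaVerify(a, b, metaconcept): feed the g1/g2 statistics into the
    # metaconcept's classifier (here, any callable mapping R^3 to [0, 1]).
    features = np.array([
        logit(entailment_prob(xa, xb)),                                  # g1(a, b)
        logit(entailment_prob(xb, xa)),                                  # g1(b, a)
        log(max(joint_prob(xa, xb), 1e-12)
            / (concept_prob(xa) * concept_prob(xb))),                    # g2(a, b)
    ])
    return metaconcept_mlp(features)
```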
As illustrated in Figure 2, given a scene of two objects, the result of a Filter(red) operation is a vector of length two, where the i-th element denotes the probability that the i-th object is red. The Exist(·) operation takes the max value of the input vector as the answer to the question. The neuro-symbolic execution allows us to disentangle visual concept learning and metaconcept learning from reasoning. The derived answer to the question is fully differentiable with respect to the concept-metaconcept representations.

3.3 Training

There are five modules to be learned in VCML: the object proposal generator, object representations, the semantic parser, concept embeddings, and metaconcept operators. Since we focus on the concept-metaconcept learning problem, we assume access to a pre-trained object proposal generator and a semantic parser. The ResNet-34 model for extracting object features is pretrained on ImageNet [Deng et al., 2009] and fine-tuned during training. Unless otherwise stated, the concepts and metaconcepts are learned with an Adam optimizer [Kingma and Ba, 2015] based on learning signals back-propagated from the neuro-symbolic reasoning module. We use a learning rate of 0.001 and a batch size of 10.

4 Experiments

This section is organized as follows. Section 4.1 and Section 4.2 introduce the datasets we use and the baselines we compare our model with, respectively. We then evaluate the generalization performance of various models from two perspectives. First, in Section 4.3, we show that incorporating metaconcept learning improves the data efficiency of concept learning and suggests solutions to compositional generalization of visual concepts. Second, in Section 4.4, we show how concept grounding can help predict metaconcept relations between unseen pairs of concepts.

All results in this paper are averaged over four runs; ± denotes the standard deviation.

4.1 Datasets

We evaluate different models on both synthetic (CLEVR [Johnson et al., 2017]) and natural-image (GQA [Hudson and Manning, 2019], CUB [Wah et al., 2011]) datasets. To have fine-grained control over question splits, we use programs to generate synthetic questions and answers based on the ground-truth annotations of visual concepts.

The CLEVR dataset [Johnson et al., 2017] is a diagnostic dataset for visual reasoning. It contains synthetic images of objects with different sizes, colors, materials, and shapes. There are a total of 22 visual concepts in CLEVR (including concepts and their synonyms). We use a subset of 70K images from the CLEVR training split for learning visual concepts and metaconcepts. 15K images from the validation split are used for testing.

The GQA dataset [Hudson and Manning, 2019] contains visual reasoning questions for natural images. Each image is associated with a scene graph annotating the visual concepts in the scene. It contains a diverse set of 2K visual concepts. We truncate the long-tail distribution of concepts by selecting a subset of the 584 most frequent concepts. We use 75K images for training and another 10K images for testing. In GQA, we use the ground-truth bounding boxes provided in the dataset. 
Attribute labels are not used for training our model or any baselines.

The CUB dataset [Wah et al., 2011] contains around 12K images of 200 bird species. Each image contains one bird, with classification and body-part attribute annotations (used for the meronym metaconcept). We use a subset of 8K images for training and another 1K images for testing.

In addition to the visual reasoning questions, we extend the datasets to include metaconcept questions. These questions are generated according to external knowledge bases. For CLEVR, we use the original ontology. For GQA, we use the synsets and same_kind relations from WordNet [Miller, 1995]. For CUB, we extend the ontology with 166 bird taxa of higher categories (families, genera, etc.), and use the hypernym relations from the eBird Taxonomy [Sullivan et al., 2009] to build a hierarchy of all the taxonomic concepts. We also use the attribute annotations in CUB.

4.2 Baselines

We compare VCML with the following baselines: general visual reasoning frameworks (GRU-CNN and MAC), concept learning frameworks (NS-CL), and language-only frameworks (GRU and BERT).

GRU-CNN. The GRU-CNN baseline is a simple baseline for visual question answering [Zhou et al., 2015]. It consists of a ResNet-34 [He et al., 2015] encoder for images and a GRU [Cho et al., 2014] encoder for questions. To answer a question, image features and question features are concatenated, followed by a softmax classifier to predict the answer. For text-only questions, we use only question features.

MAC. We replicate the result of the MAC network [Hudson and Manning, 2018] for visual reasoning, which is a dual attention-based model. For text-only questions, the MAC network takes a blank image as the input.

Table 2: The metaconcept synonym provides abstract-level supervision for concepts. This enables zero-shot learning of novel concepts.

           GRU-CNN     MAC         NS-CL       VCML
CLEVR      50.0±0.0    68.7±3.8    80.2±3.1    94.1±4.6
GQA        50.0±0.0    49.5±0.2    49.3±0.6    50.5±0.1

Table 3: The metaconcept same_kind helps the model learn from biased data and generalize to novel combinations of visual attributes.

            GRU-CNN     MAC         NS-CL       VCML
CLEVR-200   50.0±0.0    94.2±3.3    98.5±0.3    98.9±0.2
CLEVR-20    50.0±0.0    79.7±2.6    95.7±0.0    95.1±1.6

Table 4: The metaconcept hypernym enables few-shot learning of new concepts.

           GRU-CNN     MAC         NS-CL       VCML
CUB        50.0±0.0    70.8±3.4    80.0±2.3    80.2±1.7

Table 5: Application of VCML to the referential expression task on the CLEVR dataset.

Ref. Expr.   #Train    w/.         w/o.
             10K       74.9±0.1    73.8±1.7
             1K        59.7±0.2    51.6±2.6

NS-CL. The NS-CL framework is proposed by Mao et al. [2019] for learning visual concepts from visual reasoning. Similar to VCML, it works on object-based representations for scenes and program-like representations for questions. We extend it to support functional modules of metaconcepts; they are implemented as two-layer feed-forward neural networks that take concept embeddings as input.

GRU (language only). We include a language-only baseline that uses a GRU [Cho et al., 2014] to encode the question and a softmax layer to predict the answer. We use pre-trained GloVe [Pennington et al., 2014] word embeddings as concept embeddings. We fix the word embeddings during training, and only train the GRU weights on a language modeling task on the training questions.

BERT. 
We also include BERT [Devlin et al., 2019] as a language-only baseline. Two variants of BERT are considered here. Variant I encodes the natural language question with BERT and uses an extra single-layer perceptron to predict the answer. Variant II works with the same pre-trained semantic parser as VCML. To predict metaconcept relations, it encodes concept words or phrases into embedding vectors, concatenates them, and applies a single-layer perceptron to answer the question. During training, the parameters of the BERT encoders are always fixed.

4.3 Metaconcepts Help Concept Learning

Metaconcepts help concept learning by providing extra supervision at an abstract level. Three types of generalization tests are studied in this paper. First, we show that the metaconcept synonym enables the model to learn a novel concept without any visual examples (i.e., zero-shot learning). Second, we demonstrate how the metaconcept same_kind supports learning from biased visual data. Third, we evaluate the performance of few-shot learning with the support of the metaconcept hypernym. Finally, we provide extra results demonstrating that metaconcepts can improve the overall data efficiency of visual concept learning. For more details on the data splits, please refer to the supplementary material.

4.3.1 synonym Supports Zero-Shot Learning of Novel Concepts

The metaconcept synonym provides abstract-level supervision for concept learning. With the visual grounding of the concept cube and the fact that cube is a synonym of block, we can easily generalize to recognize blocks. To evaluate this, we hold out a set of concepts C_test^syn that are synonyms of other concepts. The training dataset contains synonym metaconcept questions about the C_test^syn concepts but no visual reasoning questions about them. In contrast, all test questions are visual reasoning questions involving C_test^syn.

Dataset. For the CLEVR dataset, we hold out 3 of the 22 concepts. For the GQA dataset, we hold out 30 concepts.

Results. Quantitative results are summarized in Table 2. Our model significantly outperforms all metaconcept-agnostic baselines on the synthetic dataset CLEVR. It also outperforms all other methods on the real-world dataset GQA, but the advantage is smaller. We attribute this result to the complex visual features and the vast number of objects in real-world scenes, which degrade the performance of concept learning. As an ablation, we test the trained model on a validation set that has the same data distribution as the training set. The train-validation gap on GQA (training: 83.1%; validation: 52.3%) is one order of magnitude larger than the gap on CLEVR (training: 99.4%; validation: 99.4%).

4.3.2 same_kind Supports Learning from Biased Data

The metaconcept same_kind supports visual concept learning from biased data. Here, we focus on biased visual attribute compositions in the training set. For example, from just a few examples of purple cubes, the model should learn a new color purple, which resembles the hue of the cubes instead of their shape.

Dataset. We replicate the setting of CLEVR-CoGenT [Johnson et al., 2017], and create two splits of the CLEVR dataset: in split A, all cubes are gray, blue, brown, or yellow, whereas in split B, cubes are red, green, purple, or cyan. 
In training, we use all the images in split A, together with a few images from split B (200 images from split B in the CLEVR-200 group, and 20 in the CLEVR-20 group; see Table 3). During training, we use metaconcept questions to indicate that cube categorizes the shape of objects rather than their color. The held-out images in split B are used for testing.

Results. Table 3 shows the results. VCML and NS-CL successfully learn visual concepts from biased data through the concept-metaconcept integration. We also evaluate all trained models on a validation set that has the same data distribution as the training set. We find that most models perform equally well on this validation set (for example, MAC gets 99.1% accuracy on the validation set, while both NS-CL and VCML get 99.7%). The contrast between the validation and test accuracies supports the claim that a better concept-metaconcept integration is what enables visual concept learning in such biased settings. Please refer to the supplementary material for more details.

4.3.3 hypernym Supports Few-Shot Learning of Concepts

The abstract-level supervision provided by the metaconcept hypernym supports learning visual concepts from limited data. After having learned the visual concept Sterna, and the fact that Sterna is a hypernym of Arctic Tern, we can narrow down the possible visual grounding of Arctic Tern. This helps the model learn the concept Arctic Tern from fewer examples.

Dataset. We select 91 of the 366 taxonomic concepts in the CUB dataset [Wah et al., 2011] as C_test^hyp. In the training set, there are only 5 images per concept for the concepts in C_test^hyp. In contrast, each of the other concepts is associated with around 40 images during training. We evaluate different models with visual reasoning questions about concepts in C_test^hyp, based on the held-out images. All visual reasoning questions are generated based on the class labels of images.

Results. The results in Table 4 show that our model outperforms both GRU-CNN and MAC. NS-CL [Mao et al., 2019], augmented with metaconcept operators, also achieves performance comparable to VCML. To further validate the effectiveness of the extra metaconcept knowledge, we also test the performance of different models when the metaconcept questions are absent. Almost all models show degraded performance to various degrees (for example, MAC gets a test accuracy of 60.4%, NS-CL gets 79.6%, and VCML gets 78.5%).

4.3.4 Application of Concept Embeddings to a Downstream Task

We supplement extra results on the CLEVR referential expression task. The task is to select a specific object from a scene given a description (e.g., the red cube). We compare VCML with and without metaconcept information using Recall@1 for referential expressions.

Results. Table 5 suggests that the metaconcept information significantly improves visual concept learning in low-resource settings, using only 10K or even 1K visually-grounded questions. We found that as the number of visually-grounded questions increases, the gap between training with and without metaconcept questions gets smaller. 
We conjecture that this is an indication of the model\nrelying more on visual information instead of metaconcept knowledge when there is larger visual\nreasoning dataset.\n4.4 Concepts Help Metaconcept Generalization\nConcept learning provides visual cues for predicting the relations between unseen pairs of concepts.\nWe quantitatively evaluate different models by their accuracy of predicting metaconcept relations\nbetween unseen pairs. Four representative metaconcepts are studied here: synonym, same_kind,\nhypernym and meronym.\n\n8\n\n\fTable 6: Metaconcept generalization evaluation on the CLEVR, GQA and CUB dataset. (Two variants\nof BERT are shown here; see Section 4.2 for details.)\n\nCLEVR Synonym\nSame-kind\nSynonym\nSame-kind\nHypernym\nMeronym\n\nGQA\n\nCUB\n\nQ.Type GRU (Lang. Only) GRU-CNN BERT (Variant I ; Variant II)\n50.0\n50.0\n50.0\n50.0\n50.0\n50.0\n\n76.2\u00b110.2 ; 80.2\u00b116.1\n75.4\u00b15.4 ; 80.1\u00b110.0\n76.2\u00b12.4 ; 83.1\u00b11.5\n59.5\u00b12.7 ; 68.2\u00b14.0\n75.6\u00b11.2 ; 61.7\u00b110.3\n63.1\u00b13.2 ; 72.9\u00b19.9\n\n60.9\u00b110.6\n61.5\u00b16.6\n76.2\u00b10.8\n57.3\u00b15.3\n76.7\u00b18.8\n78.1\u00b14.8\n\n66.3\u00b11.4\n64.7\u00b15.1\n80.8\u00b11.0\n56.3\u00b12.3\n74.3\u00b15.2\n80.1\u00b15.9\n\nNS-CL\n\nVCML\n\n100.0\u00b10.0\n92.3\u00b14.9\n81.2\u00b12.8\n66.8\u00b14.1\n80.1\u00b17.3\n97.7\u00b11.1\n\n100.0\u00b10.0\n99.3\u00b11.0\n91.1\u00b11.7\n69.1\u00b11.7\n94.8\u00b11.3\n92.5\u00b11.0\n\nFigure 4: Data split used for metaconcept generalization tests. The models are required to leverage\nthe visual analogy between concepts to predict metaconcepts about unseen pairs of concepts (shown\nin blue). More details for other metaconcepts can be found in the supplementary material.\n\nDataset. Figure 4 shows the training-test split for the metaconcept generalization test. For each\nmetaconcept, a subset of concepts C[metaconcept]_gen\nare selected as test concepts . The rest concepts\nform the training concept set C[metaconcept]_gen\n. Duing training, only metaconcept questions with\nboth queried concepts in C[metaconcept]_gen\nare used. Metaconcept questions with both concepts in\nC[metaconcept]_gen\nare used for test. Models should leverage the visual grounding of concepts to predict\ntest\nthe metaconcept relation between unseen pairs.\n\ntrain\n\ntest\n\ntrain\n\nResults. The results for metaconcept generalization on the three datasets are summarized in Table 6.\nThe question type baseline (shown as Q. Type) is the best-guess baseline for all metaconcepts. Overall,\nVCML achieves the best metaconcept generalization, and only shows inferior performance to NS-CL\non the meronym metaconcept. Note that the NS-CL baseline used here is our re-implementation that\naugments the original version with similar metaconcept operators as VCML.\n\n5 Conclusion\n\nIn this paper, we propose the visual concept-metaconcept learner (VCML) for bridging visual concept\nlearning and metaconcept learning (i.e., relational concepts about concepts). The model learns\nconcepts and metaconcepts with a uni\ufb01ed neuro-symbolic reasoning procedure and a linguistic\ninterface. We demonstrate that connecting visual concepts and abstract relational metaconcepts\nbootstraps the learning of both. Concept grounding provides visual cues for predicting relations\nbetween unseen pairs of concepts, while the metaconcepts, in return, facilitate the learning of concepts\nfrom limited, noisy, and even biased data. 
Systematic evaluation on the CLEVR, CUB, and GQA\ndatasets shows that VCML outperforms metaconcept-agnostic visual concept learning baselines as\nwell as visual reasoning baselines.\n\n9\n\nQ: Is there any airplane?A: Yes.Q: Is there any child?A: Yes.Q: Is airplane a synonymof plane?A: Yes.TrainingQ: Dored andyellowdescribethesamepropertyofobjects?A: Yes.Training(a) Synonym Metaconcept Generalization(b) Same_kindMetaconceptGeneralizationTestQ: Is kid a synonymof child?A: Yes.Q: Is there any plane?A: Yes.Q: Is there any truck?A: Yes.Q: Is there any kid?A: Yes.Q: Is there any yellow object?A: Yes.Q: Is there any redobject?A: Yes.TestQ: Dobusandtruckdescribe the same property of objects?A: Yes.Q: Is there any bus?A: Yes.\fAcknowledgement. We thank Jon Gauthier for helpful discussions and suggestions. This work\nwas supported in part by the Center for Brains, Minds and Machines (CBMM, NSF STC award\nCCF-1231216), ONR MURI N00014-16-1-2007, MIT-IBM Watson AI Lab, and Facebook.\n\nReferences\n\nJacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural networks for\n\nquestion answering. In NAACL-HLT, 2016. 2\n\nAntoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating\n\nEmbeddings for Modeling Multi-Relational Data. In NeurIPS, 2013. 3\n\nKyunghyun Cho, Bart Van Merri\u00ebnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk,\nand Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine\ntranslation. In EMNLP, 2014. 6, 7\n\nJia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical\n\nimage database. In CVPR, 2009. 6\n\nFartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. VSE++: Improving Visual-Semantic\n\nEmbeddings with Hard Negatives. In BMVC, 2018. 2\n\nChuang Gan, Yandong Li, Haoxiang Li, Chen Sun, and Boqing Gong. VQS: Linking segmentations to questions\nand answers for supervised attention in vqa and question-focused semantic segmentation. In ICCV, 2017. 2\n\nSiddha Ganju, Olga Russakovsky, and Abhinav Gupta. What\u2019s in a question: Using visual questions as a form of\n\nsupervision. In CVPR, 2017. 2\n\nKaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In\n\nCVPR, 2015. 4, 6\n\nKaiming He, Georgia Gkioxari, Piotr Doll\u00e1r, and Ross Girshick. Mask R-CNN. In ICCV, 2017. 4\n\nRonghang Hu, Jacob Andreas, Trevor Darrell, and Kate Saenko. Explainable Neural Computation via Stack\n\nNeural Module Networks. In ECCV, 2018. 2\n\nDrew A Hudson and Christopher D Manning. Compositional attention networks for machine reasoning. In\n\nICLR, 2018. 2, 6\n\nDrew A Hudson and Christopher D Manning. GQA: A New Dataset for Compositional Question Answering\n\nover Real-World Images. CVPR, 2019. 4, 6\n\nSanja Fidler Raquel Urtasun Ivan Vendrov, Ryan Kiros. Order-Embeddings of Image and Language. In ICLR,\n\n2016. 4\n\nKenton Lee Kristina Toutanova Jacob Devlin, Ming-Wei Chang. BERT: Pre-training of Deep Bidirectional\n\nTransformers for Language Understanding. In NAACL-HLT, 2019. 7\n\nJustin Johnson, Andrej Karpathy, and Li Fei-Fei. DenseCap: Fully Convolutional Localization Networks for\n\nDense Captioning. In CVPR, 2016. 2\n\nJustin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick.\nCLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. 
In CVPR, 2017.\n4, 6, 8\n\nAndrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR,\n\n2015. 2\n\nDiederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 6\n\nJamie Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with\n\nmultimodal neural language models. NeurIPS Workshop, 2014. 2\n\nAlice Lai and Julia Hockenmaier. Learning to predict denotational probabilities for modeling entailment. In\n\nEACL, 2017. 4\n\nYankai Lin, Zhiyuan Liu, and Maosong Sun. Neural Relation Extraction with Multi-lingual Attention. In ACL,\n\n2017. 3\n\n10\n\n\fMateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask Your Neurons: A Neural-Based Approach to\n\nAnswering Questions about Images. In CVPR, 2015. 2\n\nJiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B Tenenbaum, and Jiajun Wu. The Neuro-Symbolic Concept\n\nLearner: Interpreting Scenes, Words, and Sentences From Natural Supervision. In ICLR, 2019. 2, 5, 7, 8\n\nDavid Mascharka, Philip Tran, Ryan Soklaski, and Arjun Majumdar. Transparency by design: Closing the gap\n\nbetween performance and interpretability in visual reasoning. In CVPR, 2018. 2\n\nKen McRae, George S. Cree, Mark S. Seidenberg, and Chris Mcnorgan. Semantic Feature Production Norms for\n\na Large Set of Living and Nonliving Things. Behavior Research Methods, 37(4):547\u2013559, Nov 2005. 1\n\nGeorge A Miller. WordNet: A Lexical Database for English. Commun. ACM, 38(11):39\u201341, 1995. 6\n\nIshan Misra, Abhinav Gupta, and Martial Hebert. From Red Wine to Red Tomato: Composition with Context.\n\nIn CVPR, 2017. 3\n\nTushar Nagarajan and Kristen Grauman. Attributes as Operators: Factorizing Unseen Attribute-Object Composi-\n\ntions. In ECCV, 2018. 3\n\nJeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation.\n\nIn EMNLP, 2014. 7\n\nSenthil Purushwalkam, Maximilian Nickel, Abhinav Gupta, and Marc\u2019Aurelio Ranzato. Task-Driven Modular\n\nNetworks for Zero-Shot Compositional Learning. arXiv:1905.05908, 2019. 3\n\nZhou Ren, Hailin Jin, Zhe Lin, Chen Fang, and Alan Yuille. Joint Image-Text Representation by Gaussian\n\nVisual-Semantic Embedding. In ACM MM, 2016. 2\n\nHaoyue Shi, Jiayuan Mao, Tete Xiao, Yuning Jiang, and Jian Sun. Learning Visually-Grounded Semantics from\n\nContrastive Adversarial Samples. In COLING, 2018. 2\n\nRichard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning with neural tensor networks\n\nfor knowledge base completion. In NeurIPS, 2013. 3\n\nRobert Speer, Joshua Chin, and Catherine Havasi. Conceptnet 5.5: An Open Multilingual Graph of General\n\nKnowledge. In AAAI, 2017. 1\n\nIliff MJ Bonney RE Fink D Kelling S Sullivan BL, Wood CL. eBird: A citizen-based bird observation network\n\nin the biological sciences. Biological Conservation, 142:2282\u20132292, 2009. 6\n\nSteffen Thoma, Achim Rettinger, and Fabian Both. Towards holistic concept representations: Embedding\n\nrelational knowledge, visual attributes, and distributional word semantics. In ISWC, 2017. 2\n\nIvan Vendrov, Jamie Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-Embeddings of Images and Language.\n\nICLR, 2016. 2\n\nLuke Vilnis, Xiang Li, Shikhar Murty, and Andrew McCallum. Probabilistic embedding of knowledge graphs\n\nwith box lattice measures. In ACL, 2018. 4\n\nC. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. 
The Caltech-UCSD Birds-200-2011 Dataset.\n\nTechnical Report CNS-TR-2011-001, California Institute of Technology, 2011. 6, 8\n\nZhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge Graph Embedding by Translating on\n\nHyperplanes. In AAAI, 2014. 3\n\nJiajun Wu, Joshua B Tenenbaum, and Pushmeet Kohli. Neural scene de-rendering. In CVPR, 2017. 2\n\nHuijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual\n\nquestion answering. In ECCV, 2016. 2\n\nFan Yang, Zhilin Yang, and William W Cohen. Differentiable Learning of Logical Rules for Knowledge Base\n\nCompletion. In NeurIPS, 2017. 3\n\nZichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image\n\nquestion answering. In CVPR, 2016. 2\n\nKexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Joshua B Tenenbaum. Neural-\nSymbolic VQA: Disentangling Reasoning from Vision and Language Understanding. In NeurIPS, 2018.\n2\n\n11\n\n\fKexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum.\nClevrer: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442, 2019. 2\n\nBolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Simple Baseline for Visual\n\nQuestion Answering. arXiv:1512.02167, 2015. 6\n\n12\n\n\f", "award": [], "sourceid": 2764, "authors": [{"given_name": "Chi", "family_name": "Han", "institution": "Tsinghua University"}, {"given_name": "Jiayuan", "family_name": "Mao", "institution": "MIT"}, {"given_name": "Chuang", "family_name": "Gan", "institution": "MIT-IBM Watson AI Lab"}, {"given_name": "Josh", "family_name": "Tenenbaum", "institution": "MIT"}, {"given_name": "Jiajun", "family_name": "Wu", "institution": "MIT"}]}