{"title": "Hierarchical Object Representation for Open-Ended Object Category Learning and Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1948, "page_last": 1956, "abstract": "Most robots lack the ability to learn new objects from past experiences. To migrate a robot to a new environment one must often completely re-generate the knowledge- base that it is running with. Since in open-ended domains the set of categories to be learned is not predefined, it is not feasible to assume that one can pre-program all object categories required by robots. Therefore, autonomous robots must have the ability to continuously execute learning and recognition in a concurrent and interleaved fashion. This paper proposes an open-ended 3D object recognition system which concurrently learns both the object categories and the statistical features for encoding objects. In particular, we propose an extension of Latent Dirichlet Allocation to learn structural semantic features (i.e. topics) from low-level feature co-occurrences for each category independently. Moreover, topics in each category are discovered in an unsupervised fashion and are updated incrementally using new object views. The approach contains similarities with the organization of the visual cortex and builds a hierarchy of increasingly sophisticated representations. Results show the fulfilling performance of this approach on different types of objects. 
Moreover, this system demonstrates the capability of learning from few training examples and competes with state-of-the-art systems.", "full_text": "Hierarchical Object Representation for Open-Ended Object Category Learning and Recognition\n\nS. Hamidreza Kasaei, Ana Maria Tom\u00e9, Lu\u00eds Seabra Lopes\n\nIEETA - Instituto de Engenharia Electr\u00f3nica e Telem\u00e1tica de Aveiro\nUniversity of Aveiro, Aveiro, 3810-193, Portugal\n{seyed.hamidreza, ana, lsl}@ua.pt\n\nAbstract\n\nMost robots lack the ability to learn new objects from past experiences. To migrate a robot to a new environment, one must often completely regenerate the knowledge base it runs with. Since in open-ended domains the set of categories to be learned is not predefined, it is not feasible to assume that one can pre-program all object categories required by robots. Therefore, autonomous robots must have the ability to execute learning and recognition continuously, in a concurrent and interleaved fashion. This paper proposes an open-ended 3D object recognition system which concurrently learns both the object categories and the statistical features for encoding objects. In particular, we propose an extension of Latent Dirichlet Allocation to learn structural semantic features (i.e. topics) from low-level feature co-occurrences for each category independently. Moreover, topics in each category are discovered in an unsupervised fashion and are updated incrementally using new object views. The approach shares similarities with the organization of the visual cortex and builds a hierarchy of increasingly sophisticated representations. Results show the promising performance of this approach on different types of objects. 
Moreover, this system demonstrates the capability of learning from few training examples and competes with state-of-the-art systems.\n\n1 Introduction\n\nOpen-ended learning theory in cognitive psychology has been a topic of considerable interest for many researchers. The general principle is that humans learn to recognize object categories ceaselessly over time. This ability allows them to adapt to new environments by enhancing their knowledge from the accumulation of experiences and the conceptualization of new object categories [1]. In humans, there is evidence of hierarchical models for object recognition in the cortex [2]. Moreover, in humans, object recognition skills and the underlying capabilities are developed concurrently [2]. In hierarchical recognition theories, the human sequentially processes information about the target object, leading to the recognition result. This processing begins with lower-level cortical processors, such as the elementary visual cortex, and goes \u201cup\u201d to the inferotemporal cortex (IT), where recognition occurs. Taking this as inspiration, an autonomous robot should process visual information continuously and perform learning and recognition concurrently. In other words, apart from learning from a batch of labelled training data, the robot should continuously update its knowledge and learn new object categories while working in the environment in an open-ended manner. In this paper, \u201copen-ended\u201d implies that the set of object categories to be learned is not known in advance. The training instances are extracted from on-line experiences of a robot, and thus become gradually available over time, rather than being completely available at the beginning of the learning process.\n\nClassical object recognition systems are often designed for static environments, i.e. training (off-line) and testing (online) are two separate phases. 
If limited training data is used, this might lead to non-discriminative object representations and, as a consequence, to poor object recognition performance. Therefore, building a discriminative object representation is a challenging step towards improving object recognition performance. Moreover, time and memory efficiency are also important. Comparing 3D objects directly based on their local features is computationally expensive. Topic modelling is suitable for open-ended learning because it not only provides short object descriptions (i.e. optimizing memory), but also enables efficient processing of large collections.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nThis paper proposes a 3D object recognition system capable of concurrently learning, in an open-ended manner, both the object categories and the topics used to encode them. We propose an extension of Latent Dirichlet Allocation to incrementally learn topics for each category independently. Moreover, topics in each category are discovered in an unsupervised fashion and updated incrementally using new object views. As depicted in Fig. 1, the approach is designed to be used by a service robot working in a domestic environment. Fig. 1 (left) shows a PR2 robot looking at some objects on the table. Fig. 1 (right) shows the point cloud of the scene obtained through the robot\u2019s Kinect and the representations used. Tabletop objects are tracked (marked by different colors) and processed through a hierarchy of five layers. 
For instance, to describe an object view: in the feature layer, a spin-image shape descriptor [3] is used to represent the local shape of the object at different key points; afterwards, in the Bag-of-Words (BoW) layer, the given object view is described by histograms of local shape features, as defined in Bag-of-Words models; in the topic layer, each topic is defined as a discrete distribution over visual words, and each object view is described as a random mixture over the latent topics of the category and stored into the memory (view layer). Finally, the category model is updated by adding the obtained representation (category layer).\n\nThe remainder of this paper is organized as follows. In Section 2, we discuss related work. Section 3 provides a system overview. The methodology for constructing the visual word dictionary is presented in Section 4. Section 5 describes the proposed object representation. Object category learning and recognition are then explained in Section 6. The evaluation of the proposed system is presented in Section 7. Finally, conclusions are presented and future research is discussed.\n\nFigure 1: The proposed multi-layer object representation being tested on a service robot. It consists of a five-layer hierarchy including the feature layer, BoW layer, topic layer, object view layer and category layer.\n\n2 Related work\n\nOne of the important tasks in the field of assistive and service robots is to achieve human-like object category learning and recognition. Riesenhuber and Poggio [2] proposed a hierarchical approach for object recognition consistent with physiological data, in which objects are modelled in a hierarchy of increasingly sophisticated representations.\n\nSivic et al. [4] proposed an approach to discover objects in images using Probabilistic Latent Semantic Indexing (pLSI) modelling [5]. Blei et al. [6] argued that pLSI is incomplete in that it provides no probabilistic model at the level of documents. 
They extended the pLSI model, calling the new approach Latent Dirichlet Allocation (LDA). Similar to pLSI and LDA, we discover topics in an unsupervised fashion. Unlike our approach, however, pLSI and LDA do not incorporate class information. Several works have been presented to incorporate a class label in the generative model [7][8][9]. Blei et al. [7] extended LDA and proposed Supervised LDA (sLDA). The sLDA model was first used for supervised text prediction. Later, Wang et al. [8] extended sLDA to classification problems. Another popular extension of LDA is classLDA (cLDA) [9]. Similar to our approach, the only supervision used by sLDA and cLDA is the category label of each training object. However, there are two main differences. First, the learned topics in sLDA and cLDA are shared among all categories, while we propose to learn specific topics per category. Second, the sLDA and cLDA approaches follow a standard train-and-test procedure (i.e. the set of classes, train data and test data are known or available in advance), whereas our approach can incrementally update topics using new observations and the set of classes is continuously growing. There are some topic-supervised approaches, e.g. Labeled LDA [10] and semiLDA [11], that consider class labels for topics. On one hand, these approaches need tens of hours of manual annotation. On the other hand, a human cannot provide a specific category label for a 3D local shape description (e.g. a spin-image [3]).\n\nThere are some LDA approaches that support incremental learning of object categories. The difference between incremental and open-ended learning is that the set of classes is predefined in incremental learning, while in open-ended learning the set of classes is continuously growing. 
Banerjee et al. [12] proposed online LDA (o-LDA), a simple modification of the batch collapsed Gibbs sampler. The o-LDA approach first applies the batch Gibbs sampler to the full dataset and then samples new topics for each newly observed word using the information observed so far. Canini et al. [13] extended o-LDA and proposed an incremental Gibbs sampler for LDA (here referred to as I-LDA). I-LDA does not need a batch initialization phase like o-LDA. In o-LDA and I-LDA, the number of categories is fixed, while in our approach the number of categories is growing. Moreover, o-LDA and I-LDA are used to discover topics shared among all categories, while our approach discovers specific topics per category.\n\nCurrently, a popular approach in object recognition is deep learning. However, there are several limitations to using Deep Neural Networks (DNNs) in open-ended domains. Deep networks are incremental by nature but not open-ended, since the inclusion of novel categories enforces a restructuring of the topology of the network. Moreover, DNNs usually need a lot of training data and long training times to obtain an acceptable accuracy. Schwarz et al. [14] used a DNN for 3D object category learning. They clearly showed that the performance of the DNN degrades when the size of the dataset is reduced.\n\n3 System overview\n\nThe main motivation of this work is to achieve a multi-layered object representation that builds an increasingly complex object representation (see Fig. 1). Particularly, a statistical model is used to get structural semantic features from low-level feature co-occurrences. The basic idea is that each object view is described as a random mixture over a set of latent topics, and each topic is defined as a discrete distribution over visual words (i.e. local shape features). It must be pointed out that we are using shape features rather than semantic properties to encode the statistical structure of object categories [15]. 
It is easier to explain the details using an example. We start by selecting a category label, for example Mug. To represent a new instance of Mug, a distribution over Mug topics is drawn that specifies which intermediate topics should be selected for generating each visual word of the object. According to this distribution, a particular topic is selected out of the mixture of possible topics of the Mug category for generating each visual word in the object. For instance, a Mug usually has a handle, and a \u201chandle\u201d topic refers to some visual words that frequently occur together in handles. The process of drawing both the topic and the visual word is repeated several times to choose a set of visual words that would construct a Mug. We use statistical inference techniques to invert this process and automatically find a set of topics for each category from a collection of instances. In other words, we try to learn a model for each category (a set of latent variables) that explains how each object obtains its visual words. In our approach, the characteristics of surfaces belonging to objects are described by local shape features called spin-images [3].\n\n4 Dictionary construction\n\nComparing 3D objects based on their local features is computationally expensive. The topic modelling approach directly addresses this concern. It requires a dictionary with V visual words. Usually, the dictionary is created via off-line clustering of training data, while in open-ended learning there is no training data available at the beginning of the learning process. To cope with this limitation, we propose that the robot freely explores several scenes and collects several object experiences.\n\nIn general, object exploration is a challenging task because of the ill-definition of the objects [16]. 
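The generative story sketched in the previous section can be written down in a few lines. The following is an illustrative toy, not the authors' implementation; `generate_object_view`, its arguments and the uniform topic-word matrix used in the test are hypothetical names chosen for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_object_view(alpha, phi, n_words):
    """Toy per-category generative process: phi is the K x V topic-word
    matrix of one category (e.g. Mug); each of its K rows sums to 1."""
    K, V = phi.shape
    # draw this instance's mixture over the category's topics
    theta = rng.dirichlet(alpha * np.ones(K))
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)      # pick a topic (e.g. "handle")
        w = rng.choice(V, p=phi[z])     # pick a visual word from that topic
        words.append(w)
    return theta, words
```

Inverting this process, i.e. recovering the topics and mixtures from observed visual words, is exactly the inference problem the paper addresses with collapsed Gibbs sampling.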
Since a system of boolean equations can represent any expression or any algorithm, it is particularly well suited for encoding the world and object candidates. Similar to Collet\u2019s work [16], we have used boolean algebra based on three logical operators, namely AND \u2227, OR \u2228 and NOT \u00ac. A set of constraints, C, is then defined. Each constraint has been implemented as a function that returns either true or false (see Table 1).\n\nTable 1: List of used constraints with a short description for each one.\n\nC_table (\u201cis this candidate on a table?\u201d): The interest object candidate is placed on top of a table.\nC_track (\u201cis this candidate being tracked?\u201d): Used to infer whether the segmented object is already being tracked.\nC_size (\u201cis this candidate manipulatable?\u201d): Reject large object candidates.\nC_instructor (\u201cis this candidate part of the instructor\u2019s body?\u201d): Reject candidates that belong to the user\u2019s body.\nC_robot (\u201cis this candidate part of the robot\u2019s body?\u201d): Reject candidates that belong to the robot\u2019s body.\nC_edge (\u201cis this candidate near the edge of the table?\u201d): Reject candidates that are near the edge of the table.\nC_key_view (\u201cis this candidate a key view?\u201d): Only key views are stored into the Perceptual Memory.\n\nNote that storing all object views while the object is static would lead to unnecessary accumulation of highly redundant data. Therefore, C_key_view is used to optimize memory usage and computation while keeping potentially relevant and distinctive information. An object view is selected as a key view whenever the tracking of an object is initialized (C_track), or when the object becomes static again after being moved. In case hands are detected near the object, storing key views is postponed until the hands are withdrawn [17]. 
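As a rough sketch, the constraints of Table 1 reduce to boolean flags on an object candidate that can be combined into the exploration and recognition expressions. The flag names below are hypothetical; in the real system each constraint is a perception routine:

```python
# Sketch: each Table 1 constraint as a boolean flag on a candidate dict.
def psi_exploration(c):
    # C_table AND C_track AND C_key_view AND NOT (C_instructor OR C_robot)
    return (c["on_table"] and c["tracked"] and c["key_view"]
            and not (c["instructor_part"] or c["robot_part"]))

def psi_recognition(c):
    # C_table AND C_track AND NOT (C_instructor OR C_robot OR C_edge)
    return (c["on_table"] and c["tracked"]
            and not (c["instructor_part"] or c["robot_part"] or c["near_edge"]))
```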
Using these constraints, boolean expressions, \u03c8, are built to encode object candidates for the Object Exploration and Object Recognition purposes (see equations 1 and 2):\n\n\u03c8_exploration = C_table \u2227 C_track \u2227 C_key_view \u2227 \u00ac(C_instructor \u2228 C_robot),    (1)\n\n\u03c8_recognition = C_table \u2227 C_track \u2227 \u00ac(C_instructor \u2228 C_robot \u2228 C_edge).    (2)\n\nThe basic perception infrastructure, which is strongly based on the Point Cloud Library (PCL), has been described in detail in previous publications [18][19]. A table is detected by finding the dominant plane in the point cloud. This is done using the RANSAC algorithm. A polygonal prism extraction mechanism is used for collecting the points which lie directly above the table. Afterwards, a Euclidean Cluster Extraction algorithm is used to segment each scene into individual clusters. Every cluster that satisfies the exploration expression is selected. The output of this object exploration is a pool of object candidates. Subsequently, to construct a pool of features, spin-images [3] are computed for the selected points extracted from the pool of object candidates. We computed around 32000 spin-images from the point clouds of the 194 object views. Finally, the dictionary is constructed by clustering the features using the k-means algorithm. The centers of the V extracted clusters are used as visual words, w_t (1 \u2264 t \u2264 V). A video of the robot exploring an environment\u00b9 is available at: https://youtu.be/MwX3J6aoAX0.\n\n5 Object representation\n\nA hierarchical system is presented which follows the organization of the visual cortex and builds an increasingly complex object representation. Plasticity and learning can occur at all layers, and certainly at the top-most layers of the hierarchy. In this paper, object view representation in the feature layer involves two main phases: keypoint extraction and computation of spin images for the keypoints. 
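The dictionary construction just described, and the BoW assignment it enables, can be sketched as follows. This is a toy: a plain Euclidean k-means over pooled descriptors, with hypothetical function names, standing in for the actual clustering of spin-images:

```python
import numpy as np

def build_dictionary(features, V, iters=20, seed=0):
    """Sketch of dictionary construction: cluster the pooled descriptors
    with k-means; the V cluster centers become the visual words."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), V, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest center (visual word)
        d = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for k in range(V):
            pts = features[labels == k]
            if len(pts):
                centers[k] = pts.mean(axis=0)
    return centers

def bag_of_words(view_features, centers):
    """BoW layer: histogram of nearest visual words for one object view."""
    d = np.linalg.norm(view_features[:, None] - centers[None], axis=2)
    return np.bincount(d.argmin(axis=1), minlength=len(centers))
```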
\u00b9The ROS bag file used in this video was created by the Knowledge-Based Systems Group, Institute of Computer Science, University of Osnabrueck.\n\nFor keypoint extraction, a voxelized grid approach is used to obtain a smaller set of points by taking only the nearest neighbor point for each voxel center. Afterwards, the spin-image descriptor is used to encode the surrounding shape of each keypoint using the original point cloud (i.e. feature layer). Subsequently, the spin images go \u201cup\u201d to the BoW layer, where each spin image is assigned to a visual word by searching for the nearest neighbor in the dictionary. Afterwards, each object is represented as a set of visual words. The obtained representation is then presented as input to the topic layer. The LDA model consists of three levels of parameters: category-level parameters (i.e. \u03b1), which are sampled once in the process of generating a category of objects; object-level variables (i.e. \u03b8_d), which are sampled once per object; and word-level variables (i.e. z_{d,n} and w_{d,n}), which are sampled every time a feature is extracted. The variables \u03b8, \u03c6 and z are latent variables that should be inferred. Assume everything is observed and a category label is selected for each object, i.e. each object belongs to one category. The joint distribution of all hidden and observed variables for a category is defined as follows:\n\np^(c)(w, z, \u03b8, \u03c6 | \u03b1, \u03b2) = \u220f_{z=1}^{K} p^(c)(\u03c6_z | \u03b2) \u220f_{d=1}^{|c|} p^(c)(\u03b8_d | \u03b1) \u220f_{n=1}^{N} p^(c)(z_{d,n} | \u03b8_d) p^(c)(w_{d,n} | z_{d,n}, \u03c6),    (3)\n\nwhere \u03b1 and \u03b2 are Dirichlet prior hyper-parameters that affect the sparsity of the distributions, K is the number of topics, |c| is the number of known objects in category c, and N is the number of words in object d. Each \u03b8_d represents an instance of category c in topic space as a Cartesian histogram (i.e. 
topic layer); w represents an object as a vector of visual words, w = {w_1, w_2, ..., w_N}, where each entry represents one of the V words of the dictionary (i.e. BoW layer); z is a vector of topics, where z_i = j means that w_i was generated from the j-th topic. It should be noticed that there is a topic for each word, and \u03c6 is a K \u00d7 V matrix which represents the word-probability matrix for each topic, where V is the size of the dictionary and \u03c6_{i,j} = p^(c)(w_i | z_j); thus, the posterior distribution of the latent variables given the observed data is computed as follows:\n\np^(c)(z, \u03b8, \u03c6 | w, \u03b1, \u03b2) = p^(c)(w, z, \u03b8, \u03c6 | \u03b1, \u03b2) / p^(c)(w | \u03b1, \u03b2).    (4)\n\nUnfortunately, the denominator of equation 4 is intractable and cannot be computed exactly. A collapsed Gibbs sampler is used to solve the inference problem. Since \u03b8 and \u03c6 can be derived from z, they are integrated out from the sampling procedure. In this work, an incremental LDA model is created for each category. Whenever a new training instance is presented, collapsed Gibbs sampling is employed to update the parameters of the model. The collapsed Gibbs sampler estimates the probability of topic z_i being assigned to a word w_i, given all other topics assigned to all other words:\n\np^(c)(z_i = k | z_{\u00aci}, w) \u221d p^(c)(z_i = k | z_{\u00aci}) \u00d7 p^(c)(w_i | z_{\u00aci}, w_{\u00aci}) \u221d (n_{d,k,\u00aci} + \u03b1) / ([\u2211_{k=1}^{K} n_{d,k} + \u03b1] \u2212 1) \u00d7 (n^(c)_{w,k,\u00aci} + \u03b2) / (\u2211_{w=1}^{V} n^(c)_{w,k} + \u03b2),    (5)\n\nwhere z_{\u00aci} means all hidden variables except z_i, and z = {z_i, z_{\u00aci}}; n_{d,k} is the number of times topic k is assigned to some visual word in object d, and n^(c)_{w,k} is the number of times visual word w is assigned to topic k. In addition, the denominator of p^(c)(z_i = k | z_{\u00aci}) is omitted because it does not depend on z_i. 
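A minimal sketch of this collapsed-Gibbs update for a single object view is given below. The count-array layout is an assumption of the sketch; the denominator of the first factor is dropped since it is constant in k, and the second factor uses the standard V-smoothed collapsed-Gibbs form:

```python
import numpy as np

def gibbs_sweep(words, z, n_dk, n_wk, alpha, beta, rng):
    """One collapsed-Gibbs sweep over one object view (sketch).
    words: visual-word ids of the view; z: current topic of each word;
    n_dk: per-view topic counts, shape (K,);
    n_wk: per-category word-topic counts, shape (V, K)."""
    K = n_dk.shape[0]
    V = n_wk.shape[0]
    for i, w in enumerate(words):
        k_old = z[i]
        n_dk[k_old] -= 1            # remove the current assignment
        n_wk[w, k_old] -= 1
        # unnormalized p(z_i = k | z_-i, w) for all k at once
        p = (n_dk + alpha) * (n_wk[w] + beta) / (n_wk.sum(axis=0) + V * beta)
        k_new = rng.choice(K, p=p / p.sum())
        z[i] = k_new                # record the new assignment
        n_dk[k_new] += 1
        n_wk[w, k_new] += 1
    return z
```

Because the update only touches the current counts, the same sweep can be run on a newly arrived object view to update a category model incrementally, which is the property the approach relies on.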
The multinomial parameter sets \u03b8^(c) and \u03c6^(c) can be estimated using the following equations:\n\n\u03b8^(c)_{k,d} = (n_{d,k} + \u03b1) / (n_d + K\u03b1),  and  \u03c6^(c)_{w,k} = (n^(c)_{w,k} + \u03b2) / (n^(c)_k + V\u03b2),    (6)\n\nwhere n^(c)_k is the number of times a word is assigned to topic k in category c, and n_d is the number of words in object d. Since in this approach what happens next depends only on the current state of the system, and not on the sequence of previous states, whenever a new object view \u03b8^(c)_d is added to category c, n^(c)_k and n^(c)_{w,k} are updated incrementally.\n\n6 Object category learning and recognition\n\nWhenever a new object view is added to a category [17], the object conceptualizer retrieves the current model of the category as well as the representation of the new object view, and creates a new category or updates the existing one. To exemplify the strength of the object representation, an instance-based learning approach is used in the current system, i.e. object categories are represented by sets of known instances. The instance-based approach is used because it is a baseline method for category representation; however, more advanced approaches, such as Bayesian learning, can easily be adopted. An advantage of the instance-based approach is that it facilitates incremental learning in an open-ended fashion. Similarly, a baseline recognition mechanism in the form of a nearest neighbour classifier with a simple thresholding approach is used to recognize a given object view.\n\nThe query object view, O_q, is first represented using the topic distribution of each category, \u03b8^(c)_q. Afterwards, to assess the dissimilarity between the query object and the stored instances of category c, \u03b8_p, the symmetric Kullback-Leibler divergence, i.e. D_KL(\u03b8^(c)_q, \u03b8_p), is used to measure the difference between the two distributions. 
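This dissimilarity measure, and the nearest-instance minimum taken over a category's stored instances, can be sketched as follows (the epsilon smoothing is an assumption of the sketch, added to guard against zero probabilities):

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetric Kullback-Leibler divergence between two topic
    distributions: KL(p||q) + KL(q||p)."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def ocd(theta_q, instances):
    """Distance from a query's topic distribution to the nearest
    stored instance of a category."""
    return min(sym_kl(theta_q, theta_p) for theta_p in instances)
```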
Subsequently, the minimum distance between the query object and all instances of category c is considered as the Object-Category Distance, OCD(.):\n\nOCD(\u03b8^(c)_q, c) = min_{\u03b8_p \u2208 c} D_KL(\u03b8^(c)_q, \u03b8_p),  c \u2208 {1, . . . , C}.    (7)\n\nConsequently, the query object is classified based on the minimum OCD(.). If, for all categories, the OCD(.) is larger than a given Classification Threshold (e.g. CT = 0.75), then the object is classified as unknown; otherwise, it is classified as the category that has the highest similarity.\n\n7 Experimental results\n\nThe proposed approach was evaluated using a standard cross-validation protocol as well as an open-ended protocol. We also report on a demonstration of the system.\n\n7.1 Off-line evaluation\n\nAn object dataset has been used [18], which contains 339 views of 10 categories of objects. The system has five different parameters that must be well selected to provide a good balance between recognition performance and memory usage. To examine the performance of different configurations of the proposed approach, 10-fold cross-validation has been used. A total of 180 experiments were performed for different values of the five parameters of the system, namely the voxel size (VS), which determines the number of keypoints extracted from each object view, the image width (IW) and support length (SL) of the spin images, the dictionary size (DS) and the number of topics (NT). Results are presented in Table 2. The parameter values that obtained the best average accuracy were selected as the default configuration: VS = 0.03, IW = 4, SL = 0.05, DS = 90 and NT = 30. In all experiments, the number of iterations for Gibbs sampling was 30, and the \u03b1 and \u03b2 parameters were set to 1 and 0.1 respectively. The accuracy of the proposed system with the default configuration was 0.87. 
Therefore, this configuration displays a good balance between recognition performance and memory usage. The remaining results were obtained using this configuration.\n\nThe accuracy of the system in each layer has been calculated individually. For comparison, the accuracy of a topic layer with topics shared among all categories is also computed. Results are presented in Table 3. One important observation is that the overall performance of the recognition system based on topic modelling is promising and the proposed representation is capable of providing a distinctive representation for a given object. Moreover, it was observed that the discriminative power of the proposed representation was better than that of the other layers. In addition, independent topics for each category provide a better representation than topics shared among all categories. Furthermore, it has been observed that the discriminative power of shared topics depends on the order of introduction of the categories.\n\nTable 3: Object recognition performance\n\nRepresentation | Accuracy\nFeature Layer | 0.12\nBoW Layer | 0.79\nTopic Layer (shared topics) | 0.79\nTopic Layer (our approach) | 0.87\n\nThe accuracy of object recognition based on pure shape features (i.e. the feature layer) is very low. The BoW representation obtains an acceptable performance. The topic layer provides a good balance between memory usage and descriptiveness with 30 floats (i.e. NT = 30). The length of the BoW representation is around three times larger than that of the topic layer. The feature layer is the least compact representation. These results show that the hierarchical object representation builds an increasingly complex representation.\n\n7.2 Open-ended evaluation\n\nOff-line evaluation methodologies (e.g. k-fold cross-validation) are not well suited to evaluate open-ended learning systems, because they do not respect the simultaneous nature of learning and recognition. 
Those methodologies imply that the set of categories must be predefined. An evaluation protocol for open-ended learning systems was proposed in [20]. The idea is to emulate the interactions of a recognition system with the surrounding environment over long periods of time. A simulated teacher was developed to follow the evaluation protocol and autonomously interact with the recognition system using three basic actions: teach, for teaching a new object category; ask, to ask the system what the category of an object view is; and correct, for providing corrective feedback, i.e. the ground truth label of a misclassified object view. The idea is that, for each newly taught category, the simulated teacher repeatedly picks unseen object views of the currently known categories from a dataset and presents them to the system. It progressively estimates the recognition accuracy of the system and, in case this accuracy exceeds a given threshold (marked by the horizontal line in Fig. 2), introduces an additional object category (marked by the vertical lines and labels in Fig. 2). This way, the system is trained, and at the same time the accuracy of the system is continuously estimated.\n\nTable 2: Object recognition performance for different parameters (average accuracy, %).\n\nVS (m): 0.03 \u2192 85, 0.04 \u2192 81\nIW (bins): 4 \u2192 83, 8 \u2192 83\nSL (m): 0.04 \u2192 82, 0.05 \u2192 83, 0.06 \u2192 83\nDS (visual words): 50 \u2192 82, 60 \u2192 82, 70 \u2192 83, 80 \u2192 84, 90 \u2192 84\nNT: 30 \u2192 84, 40 \u2192 83, 50 \u2192 82\n\nTable 4: Summary of experiments.\n\nEXP# | #QCI | #TLC | #AIC | GCA (%) | APA (%)\n1 | 1740 | 39 | 18.38 | 65 | 71\n2 | 803 | 30 | 11.07 | 69 | 79\n3 | 1099 | 35 | 13.20 | 67 | 77\n4 | 1518 | 38 | 16.29 | 66 | 73\n5 | 1579 | 42 | 15.12 | 67 | 72\n\nFigure 2: Evolution of accuracy vs. number of question/correction iterations in the first 200 iterations of the third experiment. Vertical red lines and labels indicate when and which categories are introduced to the system.
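The teach/ask/correct protocol can be sketched as a loop. This is a toy rendering: the `system` and `dataset` interfaces, the window size and the exact accuracy bookkeeping are assumptions of the sketch, not the protocol of [20] verbatim:

```python
import random

def simulated_teacher(system, dataset, categories,
                      tau=0.67, window=30, max_iters=10000):
    """Sketch of the open-ended evaluation loop. `system` is assumed to
    expose teach(label, view) and classify(view); `dataset` exposes
    sample(cat), returning an unseen view of that category."""
    known, results = [], []
    for cat in categories:
        system.teach(cat, dataset.sample(cat))        # teach a new category
        known.append(cat)
        for _ in range(max_iters):
            true_cat = random.choice(known)
            view = dataset.sample(true_cat)
            if system.classify(view) == true_cat:     # ask
                results.append(1)
            else:
                system.teach(true_cat, view)          # correct with ground truth
                results.append(0)
            recent = results[-window:]
            # introduce the next category once running accuracy exceeds tau
            if len(recent) >= window and sum(recent) / len(recent) > tau:
                break
    return len(known), results
```

This also makes explicit why the number of question/correction iterations (QCI) and the number of learned categories (TLC) are natural measures: they are respectively the total loop count and the final size of `known`.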
The simulated teacher must be connected to an object dataset. In this work, the simulated teacher was connected to the largest available dataset, namely the RGB-D Object Dataset, consisting of 250,000 views of 300 common household objects organized into 51 categories [21].\n\nSince the performance of an open-ended learning system is not limited to the object recognition accuracy, when an experiment is carried out, learning performance is evaluated using three distinct measures: (i) the number of learned categories at the end of an experiment (TLC), an indicator of \u201cHow much does it learn?\u201d; (ii) the number of question/correction iterations (QCI) required to learn those categories and the average number of stored instances per category (AIC), indicators of \u201cHow fast does it learn?\u201d (see Fig. 3 (right)); (iii) the global classification accuracy (GCA), computed using all predictions in a complete experiment, and the average protocol accuracy (APA), indicators of \u201cHow well does it learn?\u201d (see Fig. 3 (left)). Since the order in which categories are introduced may have an effect on the performance of the system, five experiments were carried out in which categories were introduced in random sequences. Results are reported in Table 4. Figure 2 shows the performance of the system in the initial 200 iterations of the third experiment. Comparing all experiments, it is visible that in the fifth experiment the system learned more categories than in the other experiments. Figure 3 (left) shows the global classification accuracy obtained by the proposed approach as a function of the number of learned categories. In experiments 1, 4 and 5, the accuracy first decreases, and then starts going slightly up again as more categories are introduced. The initial decrease is expected, since a growing number of known categories makes the classification task more difficult. 
However, as the number of learned categories increases, the number of instances per category also increases, which augments the category models (topics) and therefore improves the performance of the system. Fig. 3 (right) gives a measure of how fast learning occurred in each of the experiments, showing the number of question/correction iterations required to learn a certain number of categories. Our approach learned faster than that of Schwarz et al. [14], i.e. it requires far fewer examples. Furthermore, we achieved an accuracy around 75% while storing fewer than 20 instances per category (see Table 4), whereas Schwarz et al. [14] stored more than 1000 training instances per category (see Fig. 8 in [14]). In addition, they clearly showed that the performance of their deep network degrades when the size of the dataset is reduced.

Figure 3: System performance during simulated user experiments.

Figure 4: Three snapshots showing object recognition results in two scenarios: the first two snapshots show that the proposed system supports (a) classical learning from a batch of labelled training data and (b) open-ended learning from on-line experiences. Snapshot (c) shows object recognition results on a scene of the Washington RGB-D Scenes Dataset.

7.3 System demonstration
To show the strength of the object representation, a real demonstration was performed in which the proposed approach was integrated into the object perception system presented in [18].
In this demonstration, a table is placed in front of a robot and two users interact with the system. Initially, the system only had prior knowledge of the Vase and Dish categories, learned from batch data (i.e. a set of observations with ground-truth labels), and had no information about the other categories (i.e. Mug, Bottle, Spoon). Throughout the session, the system must be able to recognize instances of learned categories and incrementally learn new object categories. Figure 4 illustrates the behaviour of the system:

(a) The instructor puts object TID6 (a Mug) on the table. It is classified as Unknown because mugs are not yet known to the system; the instructor labels TID6 as a Mug. The system conceptualizes Mug, and TID6 is then correctly recognized. The instructor places a Vase on the table. The system learned the Vase category from batch data; therefore, the Vase is properly recognized (Fig. 4 (a)).

(b) Later, another Mug is placed on the table. This particular Mug had not been seen before, but the system can recognize it because the Mug category was previously taught (Fig. 4 (b)).

This demonstration shows that the system is capable of using prior knowledge to recognize new objects in the scene and of learning new object categories in an open-ended fashion. A video of this demonstration is available at: https://youtu.be/J0QOc_Ifde4.
Another demonstration was performed using the Washington RGB-D Scenes Dataset v2. This dataset consists of 14 scenes containing a subset of the objects in the RGB-D Object Dataset, including bowls, caps, mugs, soda cans, and cereal boxes. Initially, the system had no prior knowledge. The first four objects are introduced to the system using the first scene, and the system conceptualizes those categories. The system is then tested on the second scene of the dataset and recognizes all objects except the cereal box, because this category was not previously taught.
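The unknown-category handling used throughout these demonstrations (classify a view, report Unknown for unseen categories, and conceptualize a new category on corrective feedback) can be sketched as below. The `classify`/`conceptualize` interface is a hypothetical stand-in for the perception system of [18], not its actual API.

```python
def process_view(recognizer, view, instructor_label=None):
    """Open-ended handling of a single object view (sketch).

    `recognizer` is assumed to expose classify(view), returning a category
    name or "Unknown", and conceptualize(label, view), creating or updating
    a category model. Both method names are illustrative assumptions.
    """
    prediction = recognizer.classify(view)
    if prediction == "Unknown" and instructor_label is not None:
        # Corrective feedback: conceptualize the category from this view so
        # that later views of the same category can be recognized.
        recognizer.conceptualize(instructor_label, view)
        prediction = instructor_label
    return prediction
```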
The instructor provided corrective feedback and the system conceptualized the cereal box category. Afterwards, all objects are classified correctly in all 12 remaining scenes (Fig. 4 (c)). This evaluation illustrates the process of acquiring categories in an open-ended fashion. A video of this demonstration is online at: https://youtu.be/pe29DYNolBE.
8 Conclusion
This paper presented a multi-layered object representation to support concurrent 3D object category learning and recognition. In this work, to optimize the recognition process and memory usage, each object view is hierarchically described as a random mixture over a set of latent topics, and each topic is defined as a discrete distribution over visual words. The paper focused in detail on unsupervised object exploration to construct a dictionary, and on supervised open-ended object category learning using an extension of topic modelling. We transformed objects from the bag-of-words space into a local semantic space and used a distribution-over-distributions representation to obtain a powerful representation and to bridge the semantic gap between low-level features and high-level concepts. Results showed that the proposed system supports both classical learning from a batch of labelled training data and open-ended learning from the actual experiences of a robot.
Acknowledgements
This work was funded by National Funds through FCT project PEst-OE/EEI/UI0127/2016 and FCT scholarship SFRH/BD/94183/2013.
References
[1] Sungmoon Jeong and Minho Lee. Adaptive object recognition model using incremental feature representation and hierarchical classification. Neural Networks, 25:130–140, 2012.
[2] Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–1025, 1999.
[3] A. E. Johnson and M. Hebert.
Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5):433–449, May 1999.
[4] Josef Sivic, Bryan C. Russell, Alexei A. Efros, Andrew Zisserman, and William T. Freeman. Discovering objects and their location in images. In Tenth IEEE International Conference on Computer Vision (ICCV 2005), volume 1, pages 370–377. IEEE, 2005.
[5] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57. ACM, 1999.
[6] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[7] Jon D. McAuliffe and David M. Blei. Supervised topic models. In Advances in Neural Information Processing Systems, pages 121–128, 2008.
[8] Chong Wang, David Blei, and Fei-Fei Li. Simultaneous image classification and annotation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), pages 1903–1910. IEEE, 2009.
[9] Li Fei-Fei and Pietro Perona. A Bayesian hierarchical model for learning natural scene categories. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), volume 2, pages 524–531. IEEE, 2005.
[10] Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 248–256, 2009.
[11] Yang Wang, Payam Sabzmeydani, and Greg Mori. Semi-latent Dirichlet allocation: A hierarchical model for human action recognition. In Human Motion: Understanding, Modeling, Capture and Animation, pages 240–254.
Springer, 2007.
[12] Arindam Banerjee and Sugato Basu. Topic models over text streams: A study of batch and online unsupervised learning. In Proceedings of the SIAM International Conference on Data Mining (SDM), volume 7, pages 437–442. SIAM, 2007.
[13] Kevin R. Canini, Lei Shi, and Thomas L. Griffiths. Online inference of topics with latent Dirichlet allocation. In International Conference on Artificial Intelligence and Statistics, pages 65–72, 2009.
[14] Max Schwarz, Hannes Schulz, and Sven Behnke. RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features. In IEEE International Conference on Robotics and Automation (ICRA), pages 1329–1335. IEEE, 2015.
[15] Jiye G. Kim, Irving Biederman, Mark D. Lescroart, and Kenneth J. Hayworth. Adaptation to objects in the lateral occipital complex (LOC): Shape or semantics? Vision Research, 49(18):2297–2305, 2009.
[16] Alvaro Collet, Bo Xiong, Corina Gurau, Martial Hebert, and Siddhartha S. Srinivasa. HerbDisc: Towards lifelong robotic object discovery. The International Journal of Robotics Research, 34(1):3–25, 2015.
[17] Gi Hyun Lim, M. Oliveira, V. Mokhtari, S. Hamidreza Kasaei, A. Chauhan, L. Seabra Lopes, and A. M. Tomé. Interactive teaching and experience extraction for learning about objects and robot activities. In The 23rd IEEE International Symposium on Robot and Human Interactive Communication, 2014.
[18] S. Hamidreza Kasaei, Miguel Oliveira, Gi Hyun Lim, Luís Seabra Lopes, and Ana Maria Tomé. Interactive open-ended learning for 3D object recognition: An approach and experiments. Journal of Intelligent & Robotic Systems, 80(3):537–553, 2015.
[19] Miguel Oliveira, Luís Seabra Lopes, Gi Hyun Lim, S. Hamidreza Kasaei, Ana Maria Tomé, and Aneesh Chauhan. 3D object perception and perceptual learning in the RACE project. Robotics and Autonomous Systems, 75, Part B:614–626, 2016.
[20] Aneesh Chauhan and Luís Seabra Lopes.
Using spoken words to guide open-ended category formation. Cognitive Processing, 12(4):341–354, 2011.
[21] K. Lai, Liefeng Bo, Xiaofeng Ren, and D. Fox. A large-scale hierarchical multi-view RGB-D object dataset. In IEEE International Conference on Robotics and Automation (ICRA), pages 1817–1824, 2011.