{"title": "Deep Functional Dictionaries: Learning Consistent Semantic Structures on 3D Models from Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 485, "page_last": 495, "abstract": "Various 3D semantic attributes such as segmentation masks, geometric features, keypoints, and materials can be encoded as per-point probe functions on 3D geometries. Given a collection of related 3D shapes, we consider how to jointly analyze such probe functions over different shapes, and how to discover common latent structures using a neural network \u2014 even in the absence of any correspondence information. Our network is trained on point cloud representations of shape geometry and associated semantic functions on that point cloud. These functions express a shared semantic understanding of the shapes but are not coordinated in any way. For example, in a segmentation task, the functions can be indicator functions of arbitrary sets of shape parts, with the particular combination involved not known to the network. Our network is able to produce a small dictionary of basis functions for each shape, a dictionary whose span includes the semantic functions provided for that shape. Even though our shapes have independent discretizations and no functional correspondences are provided, the network is able to generate latent bases, in a consistent order, that reflect the shared semantic structure among the shapes. 
We demonstrate the effectiveness of our technique in various segmentation and keypoint selection applications.", "full_text": "Deep Functional Dictionaries: Learning Consistent\nSemantic Structures on 3D Models from Functions\n\nMinhyuk Sung\nStanford University\n\nmhsung@cs.stanford.edu\n\nHao Su\n\nUniversity of California San Diego\n\nhaosu@eng.ucsd.edu\n\nRonald Yu\n\nUniversity of California San Diego\n\nronaldyu@ucsd.edu\n\nLeonidas Guibas\nStanford University\n\nguibas@cs.stanford.edu\n\nAbstract\n\nVarious 3D semantic attributes such as segmentation masks, geometric features,\nkeypoints, and materials can be encoded as per-point probe functions on 3D geome-\ntries. Given a collection of related 3D shapes, we consider how to jointly analyze\nsuch probe functions over different shapes, and how to discover common latent\nstructures using a neural network \u2014 even in the absence of any correspondence\ninformation. Our network is trained on point cloud representations of shape geome-\ntry and associated semantic functions on that point cloud. These functions express\na shared semantic understanding of the shapes but are not coordinated in any way.\nFor example, in a segmentation task, the functions can be indicator functions of\narbitrary sets of shape parts, with the particular combination involved not known to\nthe network. Our network is able to produce a small dictionary of basis functions\nfor each shape, a dictionary whose span includes the semantic functions provided\nfor that shape. Even though our shapes have independent discretizations and no\nfunctional correspondences are provided, the network is able to generate latent\nbases, in a consistent order, that re\ufb02ect the shared semantic structure among the\nshapes. 
We demonstrate the effectiveness of our technique in various segmentation and keypoint selection applications.

1 Introduction

Understanding 3D shape semantics from a large collection of 3D geometries has been a popular research direction over the past few years in both the graphics and vision communities. Many applications such as autonomous driving, robotics, and bio-structure analysis depend on the ability to analyze 3D shape collections and the information associated with them.

Background It is common practice to encode 3D shape information such as segmentation masks, geometric features, keypoints, reflectance, materials, etc. as per-point functions defined on the shape surface, known as probe functions. We are interested, in a joint analysis setting, in discovering common latent structures among such probe functions defined on a collection of related 3D shapes. With the emergence of large 3D shape databases [7], a variety of data-driven approaches, such as cycle-consistency-based optimization [17] and spectral convolutional neural networks [6], have been applied to a range of tasks including semi-supervised part co-segmentation [16, 17] and supervised keypoint/region correspondence estimation [41].

However, one major obstacle in joint analysis is that each 3D shape has its own individual functional space, and linking related functions across shapes is challenging. To clarify this point, we contrast 3D shape analysis with 2D image processing. From the functional point of view, each 2D image is a function defined on the regular 2D lattice, so all images are functions over a common underlying parameterizing domain. In contrast, with discretized 3D shapes, the probe functions are generally defined on heterogeneous shape graphs/meshes, whose nodes are points on each individual shape and whose edges link adjacent points.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
Therefore, the functional spaces on different 3D shapes are independent and not naturally aligned, making joint analysis over the probe functions non-trivial.

To cope with this problem, in the classical framework, ideas from manifold harmonics and linear algebra have been introduced. To analyze meaningful functions that are often smooth, a compact set of basis functions is computed by the eigen-decomposition of the shape graph/mesh Laplacian matrix. Then, to relate basis functions across shapes, additional tools such as functional maps must be introduced [29] to handle the conversions among functional bases. This, however, raises further difficulties, since functional map estimation is challenging for non-isometric shapes, and errors are often introduced in this step. In fact, functional maps are computed from corresponding sets of probe functions on the two shapes, something which we neither assume nor need.

Approach Instead of a two-stage procedure that first builds independent functional spaces and then relates them through correspondences (functional or traditional), we propose a novel correspondence-free framework that directly learns consistent bases across a shape collection that reflect the shared structure of the set of probe functions. We produce a compact encoding for meaningful functions over a collection of related 3D shapes by learning a small functional basis for each shape using neural networks. The set of functional bases of each shape, a.k.a. a shape-dependent dictionary, is computed as a set of functions on a point cloud representing the underlying geometry — a functional set whose span will include probe functions on that shape. The training is accomplished in a very simple manner by giving the network sequences of pairs consisting of a shape geometry (as point clouds) and a semantic probe function on that geometry (that should be in the associated basis span).
Our shapes are correlated, and thus the semantic functions we train on reflect the consistent structure of the shapes. The neural network will maximize its representational capacity by learning consistent bases that reflect this shared functional structure, leading in turn to consistent sparse function encodings. Thus, in our setting, consistent functional bases emerge from the network without explicit supervision.

We also demonstrate how to impose different constraints on the network optimization problem so that atoms in the dictionary exhibit desired properties adapted to the application scenario. For instance, we can encourage the atoms to indicate the smallest parts in segmentation, or single points in keypoint detection. This implies that our model can serve as a collaborative filter that takes any mixture of semantic functions as inputs, and finds the finest granularity that is the shared latent structure. Such a possibility can be particularly useful when the annotations in the training data are incomplete or corrupted. For example, users may desire to decompose shapes into specific parts, but all shapes in the training data have only partial decomposition data without labels on parts. Our model can aggregate the partial information across the shapes and learn the full decomposition.

We remark that our network can be viewed as a function autoencoder, where the decoding is required to be in a particular format (a basis selection in which our function is compactly expressible).
The resulting canonicalization of the basis (the consistency we have described above) is something also recently seen in other autoencoders, for example in the quotient-space autoencoder of [10] that generates shape geometry in a canonical pose.

In experiments, we test our model with existing neural network architectures, and demonstrate the performance on labeled/unlabeled segmentation and keypoint correspondence problems on various datasets. In addition, we show how our framework can be utilized in learning synchronized basis functions from random continuous functions.

Contribution Though simple, our model has advantages over the previous basis synchronization works [37, 36, 41] in several aspects. First, our model does not require precomputed basis functions. Typical bases such as Laplacian (on graphs) or Laplace-Beltrami (on mesh surfaces) eigenfunctions need extra preprocessing time to compute, and small perturbations or corruptions of the shapes can lead to large differences in them. We avoid such preprocessing overhead by predicting dictionaries and synchronizing them simultaneously. Second, our dictionaries are application-driven, so each atom of the dictionary can itself attain a semantic meaning associated with small-scale geometry, such as a small part or a keypoint, while LB eigenfunctions are only suitable for approximating continuous and smooth functions (due to basis truncation). Third, the previous works define canonical bases, and their synchronization is achieved through the mapping between each individual set of bases and the canonical bases. In our model, the neural network becomes the synchronizer, without any explicit canonical bases. Lastly, compared with classical dictionary learning works that assume a universal dictionary for all data instances, we obtain a data-dependent dictionary that allows non-linear distortion of atoms but still preserves consistency. This gives us additional modeling power without sacrificing model interpretability.

Figure 1: Inputs and outputs of the various applications introduced in Section 3: (a) co-segmentation, (b) keypoint correspondence, and (c) smooth function approximation problems. The inputs of (a) and (b) are a random set of segments/keypoints (without any labels), and the outputs are a single segment/keypoint per atom in the dictionaries, consistent across the shapes. The input of (c) is a random linear combination of LB bases, and the outputs are synchronized atomic functions.

1.1 Related Work

Since much has already been discussed above, we only cover the remaining important related work here. Learning compact representations of signals has been widely studied in many forms such as factor analysis and sparse dictionaries. Sparse dictionary methods learn an overcomplete basis of a collection of data that is as succinct as possible, and have been studied in natural language processing [9, 12], time-frequency analysis [8, 22], video [25, 1], and images [21, 42, 5]. Encoding sparse and succinct representations of signals has also been observed in biological neurons [27, 26, 28].

Since the introduction of functional maps [29], shape analysis on functional spaces has been further developed in a variety of settings [30, 20, 17, 11, 34, 24], and mappings between pre-computed functional spaces have been studied in a deep learning context as well [23]. In addition to our work, deep learning on point clouds has also been applied to shape classification [32, 33, 19, 39], semantic scene segmentation [15], instance segmentation [38], and 3D amodal object detection [31].
We bridge these areas of research in a novel framework that learns, in a data-driven end-to-end manner, data-adaptive dictionaries on the functional space of 3D shapes.

2 Problem Statement

Given a collection of shapes {Xi}, each of which has a sample function {fi} with a specific semantic meaning (e.g. an indicator of a subset of semantic parts or keypoints), we consider the problem of sharing the semantic information across the shapes, and predicting a functional dictionary A(X; Θ) for each shape that linearly spans all plausible semantic functions on the shape (Θ denotes the neural network weights). We assume that a shape is given as n points sampled on its surface, a function f is represented with a vector in R^n (a scalar per point), and the atoms of the dictionary are represented as columns of a matrix A(X; Θ) ∈ R^{n×k}, where k is a sufficiently large number for the size of the dictionary. Note that the column space of A(X; Θ) can include any function f if it has the Dirac delta functions of all points as columns. We aim at finding a much lower-dimensional vector space that also contains all plausible semantic functions. We also force the columns of A(X; Θ) to encode atomic semantics in applications, such as atomic instances in segmentation, by adding appropriate constraints.

3 Deep Functional Dictionary Learning Framework

General Framework We propose a simple yet effective loss function, which can be applied to any neural network architecture processing a 3D geometry as input. The neural network takes pairs of a shape X, consisting of n points, and a function f ∈ R^n as inputs during training, and outputs a matrix A(X; Θ) ∈ R^{n×k} as a dictionary of functions on the shape.
1: function SINGLE-STEP GRADIENT ITERATION(X, f, Θ_t, η)
2:     Compute: A_t = A(X; Θ_t).
3:     Solve: x_t = argmin_x ‖A_t x − f‖₂² s.t. C(x).
4:     Update: Θ_{t+1} = Θ_t − η ∇L(A(X; Θ_t); f, x_t).
5: end function
Algorithm 1: Single-Step Gradient Iteration. X is an input shape (n points), f is an input function defined on X, Θ_t is the neural network weights at time t, A(X; Θ_t) is an output dictionary of functions on X, C(x) is the set of constraints on x, and η is the learning rate. See Sections 2 and 3 for details.

The loss function needs to be designed to minimize both 1) the projection error from the input function f to the vector space spanned by A(X; Θ), and 2) the number of atoms in the dictionary matrix. This gives us the following loss function:

    L(A(X; Θ); f) = min_x F(A(X; Θ), x; f) + γ ‖A(X; Θ)‖_{2,1}   s.t. C(A(X; Θ), x),
    where F(A(X; Θ), x; f) = ‖A(X; Θ) x − f‖₂²,   (1)

where x ∈ R^k is a linear combination weight vector, and γ is a regularization weight. F(A(X; Θ), x; f) measures the projection error, and the l_{2,1}-norm is a regularizer inducing structured sparsity, encouraging more columns to be zero vectors. We may have a set of constraints C(A(X; Θ), x) on both A(X; Θ) and x depending on the application. For example, when the input function is an indicator (binary) function, we constrain all elements of both A(X; Θ) and x to lie in the [0, 1] range.
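Algorithm 1 and the loss in Equation 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: projected gradient descent stands in for the paper's QP/simplex solver for the inner box-constrained step, and all function names and parameters (`solve_coefficients`, `dictionary_loss`, the step counts) are assumptions.

```python
import numpy as np

def solve_coefficients(A, f, steps=2000, lr=0.01):
    # Inner step of Algorithm 1: x = argmin_x ||A x - f||_2^2 s.t. 0 <= x <= 1.
    # Projected gradient descent stands in for the paper's QP/simplex solver.
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        grad = 2.0 * A.T @ (A @ x - f)
        x = np.clip(x - lr * grad, 0.0, 1.0)  # project onto the box constraint
    return x

def dictionary_loss(A, f, gamma):
    # Equation 1: projection error plus the l_{2,1} structured-sparsity term,
    # i.e. the sum of column-wise l2 norms (pushing whole atoms to zero).
    x = solve_coefficients(A, f)
    projection_error = np.sum((A @ x - f) ** 2)
    l21 = np.sum(np.linalg.norm(A, axis=0))
    return projection_error + gamma * l21, x
```

In the full method the outer update backpropagates this loss through the network producing A(X; Θ); here A is treated directly as the free variable for brevity.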
Other constraints for specific applications are also introduced at the end of this section. Note that our loss minimization is a min-min optimization problem; the inner minimization, which is embedded in our loss function in Equation 1, optimizes the reconstruction coefficients based on the shape-dependent dictionary predicted by the network, and the outer minimization, which minimizes our loss function, updates the neural network weights to predict the best shape-dependent dictionary. The nested minimization generally does not have an analytic solution due to the constraint on x. Thus, it is not possible to directly compute the gradient of L(A(X; Θ); f) without x. We solve this with an alternating minimization scheme, as described in Algorithm 1. In a single gradient descent step, we first minimize F(A(X; Θ), x; f) over x with the current A(X; Θ), and then compute the gradient of L(A(X; Θ); f) while fixing x. The minimization of F(A(X; Θ), x; f) over x is a convex quadratic program, and its scale is very small since A(X; Θ) is a very thin matrix (n ≫ k). Hence, a simplex method can very quickly solve the problem in every gradient iteration.

Adaptation in Weakly-supervised Co-segmentation Some constraints on both A(X; Θ) and x can be induced from the assumptions on the input function f and the desired properties of the dictionary atoms. In the segmentation problem, we take an indicator function of a set of segments as an input, and we desire that each atom in the output dictionary indicates an atomic part (Figure 1 (a)). Thus, we restrict both A(X; Θ) and x to have values in the [0, 1] range. Also, the atomic parts in the dictionary must partition the shape, meaning that each point must be assigned to one and only one atom. Thus, we add a sum-to-one constraint for every row of A(X; Θ).
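These row constraints on A(X; Θ) (entries in [0, 1], each row summing to one) can be enforced directly by a row-wise softmax over the network's raw per-point outputs. A minimal sketch, with an assumed function name and array shapes:

```python
import numpy as np

def segmentation_dictionary(logits):
    # Map raw per-point network outputs (n x k) to a dictionary A satisfying
    # the segmentation row constraints: entries in [0, 1] and each row summing
    # to one, via a row-wise softmax (a last-layer activation).
    z = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```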
The set of constraints for the segmentation problem is defined as follows:

    Cseg(A(X; Θ), x) = { 0 ≤ x ≤ 1;  0 ≤ A(X; Θ) ≤ 1;  ∑_j A(X; Θ)_{i,j} = 1 for all i },   (2)

where A(X; Θ)_{i,j} is the (i, j)-th element of the matrix A(X; Θ), and 0 and 1 are vectors/matrices of the appropriate size. The first constraint, on x, is incorporated in solving the inner minimization problem, and the second and third constraints, on A(X; Θ), can simply be implemented by using a softmax activation at the last layer of the network.

Adaptation in Weakly-supervised Keypoint Correspondence Estimation As in the segmentation problem, the input function in the keypoint correspondence problem is an indicator function of a set of points (Figure 1 (b)). Thus, we use the same [0, 1] range constraint for both A(X; Θ) and x. Also, each atom needs to represent a single point, so we add a sum-to-one constraint for every column of A(X; Θ):

    Ckey(A(X; Θ), x) = { 0 ≤ x ≤ 1;  0 ≤ A(X; Θ) ≤ 1;  ∑_i A(X; Θ)_{i,j} = 1 for all j }.   (3)

For robustness, a distance function from the keypoints can be used as input instead of the binary indicator function. In particular, some neural network architectures such as PointNet [32] do not exploit local geometric context, so a spatially localized distance function can avoid overfitting to the Dirac delta function. We use a normalized Gaussian-weighted distance function g in our experiment: g_i(s) = exp(−d(p_i, s)²/σ) / ∑_{i'} exp(−d(p_{i'}, s)²/σ), where g_i(s) is the i-th element of the distance function from the keypoint s, p_i is the i-th point's coordinates, d(·, ·) is Euclidean distance, and σ is the Gaussian-weighting parameter (0.001 in our experiment). The distance function is normalized to sum to one, which is consistent with our constraints in Equation 3. The sum of any subset of the keypoint distance functions becomes an input function in our training.

Adaptation in Smooth Function Approximation and Mapping For predicting atomic functions whose linear combinations can approximate any smooth function, we generate the input function by taking a random linear combination of LB basis functions (Figure 1 (c)). We also use a unit-norm constraint for each atom of the dictionary:

    Cmap(A(X; Θ), x) = { ∑_i A(X; Θ)²_{i,j} = 1 for all j }.   (4)

4 Experiments

We demonstrate the performance of our model on keypoint correspondence and segmentation problems with different datasets. We also provide qualitative results of synchronizing atomic functions on non-rigid shapes. While any neural network architecture processing 3D geometry can be employed in our model (e.g. PointNet [32], PointNet++ [33], KD-NET [19], DGCNN [39], ShapePFCN [18]), we use the PointNet [32] architecture in the experiments due to its simplicity. Note that our output A(X; Θ) is a set of k-dimensional row vectors for all points, so we can use the PointNet segmentation architecture without any modification. Code for all experiments below is available at https://github.com/mhsung/deep-functional-dictionaries.

4.1 ShapeNet Keypoint Correspondence

Yi et al. [41] provide keypoint annotations on 6,243 chair models in ShapeNet [7]. The keypoints are manually annotated by experts, and all of them are matched and aligned across the shapes. Each shape has up to 10 keypoints, while most of the shapes have missing keypoints.
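The keypoint input functions used here — sums of normalized Gaussian-weighted distance functions over a chosen subset of keypoints, as described in Section 3 — can be sketched as follows (NumPy, with illustrative names; the σ = 0.001 default follows the experiment):

```python
import numpy as np

def keypoint_input_function(points, keypoints, sigma=0.001):
    # Input function f for keypoint training: for each keypoint s, a normalized
    # Gaussian-weighted distance function over the n points (summing to one),
    # accumulated over the chosen subset of keypoints.
    f = np.zeros(len(points))
    for s in keypoints:
        d2 = np.sum((points - s) ** 2, axis=1)  # squared Euclidean distances
        g = np.exp(-d2 / sigma)
        f += g / g.sum()  # normalize so each keypoint contributes mass one
    return f
```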
In the training, we take a random subset of the keypoints of each shape to form an input function, and predict a function dictionary in which the atoms indicate every single keypoint. In the experiment, we use an 80-20 random split for the training/test sets¹, train the network with the 2k point clouds provided by [41], and set k = 10 and γ = 0.0.

Figure 2 (at the top) illustrates examples of predicted keypoints when picking the point having the maximum value in each atom. The colors denote the order of atoms in the dictionaries, which is consistent across all shapes despite their different geometries. The outputs are also evaluated by the percentage of correct keypoints (PCK) metric as done in [41], while varying the Euclidean distance threshold (Figure 2 at the bottom). We report the results both when finding the best one-to-one correspondences between the ground truth and predicted keypoints for each shape (red line) and when finding the correspondences between ground truth labels and atom indices for all shapes (green line). These two plots are identical, meaning that the order of predicted keypoints is rarely changed in different shapes. Our results also outperform the previous works [14, 41] by a big margin.

Figure 2: ShapeNet keypoint correspondence result visualizations and PCK curves.

¹Yi et al. [41] use a select subset of models in their experiment, but this subset is not provided by the authors. Thus, we use the entire dataset and make our own train/test split.

Table 1: ShapeNet part segmentation comparison with PointNet segmentation (same backbone network architecture as ours). Note that PointNet has additional supervision (class labels) compared with ours (Sec. 4.2). The average mean IoU of our method is measured by finding the correspondences between ground truth and predicted segments for each shape. k = 10 and γ = 1.0.

               mean  air-  bag   cap   car   chair ear-  guitar knife lamp  laptop motor- mug   pistol rocket skate- table
                     plane                         phone                           bike                        board
PointNet [32]  82.4  81.4  81.1  59.0  75.6  87.6  69.7  90.3   83.9  74.6  94.2   65.5   93.2  79.3   53.2   74.5   81.3
Ours           84.6  81.2  72.7  79.9  76.5  88.3  70.4  90.0   80.5  76.1  95.1   60.5   89.8  80.8   57.1   78.3   88.1

Table 2: ShapeNet part segmentation results. The first row is when finding the correspondences between ground truth and predicted segments per shape. The second row is when finding the correspondences between part labels and indices of atoms per category. k = 10 and γ = 1.0.

                 mean  air-  bag   cap   car   chair ear-  guitar knife lamp  laptop motor- mug   pistol rocket skate- table
                       plane                         phone                           bike                        board
Ours (per shape) 84.6  81.2  72.7  79.9  76.5  88.3  70.4  90.0   80.5  76.1  95.1   60.5   89.8  80.8   57.1   78.3   88.1
Ours (per cat.)  77.3  79.0  67.5  66.9  75.4  87.8  58.7  90.0   79.7  37.1  95.0   57.1   88.8  78.4   46.0   75.8   78.4

4.2 ShapeNet Semantic Part Segmentation

ShapeNet [7] contains 16,881 shapes in 16 categories, and each shape has semantic part annotations [40] for up to six segments. Qi et al. [32] train PointNet segmentation using shapes in all categories, and the loss function is defined as the cross entropy per point over all labels. We follow their experimental setup by using the same split of training/validation/test sets and the same 2k sampled point clouds as inputs.
The difference is that we do not leverage the labels of segments in training, and consider the parts as unlabeled segments. We also deal with the more general situation in which each shape may have an incomplete segmentation, by taking an indicator function of a random subset of segments as an input.

Evaluation For evaluation, we binarize A(X; Θ) by finding the maximum value in each row, and consider each column as an indicator of a segment. The accuracy is measured based on the average of each shape's mean IoU, similarly to Qi et al. [32], but with a modification since our method does not exploit labels. In ShapeNet, some categories have optional labels, and shapes may or may not have a part with these optional labels (e.g. armrests of chairs). Qi et al. [32] take the optional labels into account even when the segment does not exist in a shape². But we do not predict labels of points, and thus such cases are ignored in our evaluation.

We first measure the performance of segmentation by finding the correspondences between ground truth and predicted segments for each shape. The best one-to-one correspondences are found by running the Hungarian algorithm on the mean IoU values. Table 1 shows the results of our method when using k = 10 and γ = 1.0, and the results of the label-based PointNet segmentation [32]. When only considering the segmentation accuracy, our approach outperforms the original PointNet segmentation trained with labels.

We also report the average mean IoUs when finding the best correspondences between part labels and the indices of dictionary atoms per category. As shown in Table 2, the accuracy is still comparable in most categories, indicating that the order of column vectors in A(X; Θ) is mostly consistent with the semantic labels. There are a few exceptions; for example, lamps are composed of a shade, a base, and a tube, and half of the lamps are ceiling lamps while the others are standing lamps.
Since PointNet learns per-point features from the global coordinates of the points, shades and bases are easily confused when their locations are switched (Figure 3). Such problems could be resolved by using a different neural network architecture that learns more from local geometric context. For more analytic experiments, refer to the supplementary material.

²IoU becomes zero if the label is assigned to any point in the prediction, and one otherwise.

Figure 3: Examples of ShapeNet part segmentation results. The colors indicate the indices of atoms in the dictionaries. The order of atoms is consistent in most shapes except when the part geometries are not distinguishable. See the confusion of a ceiling lamp shade (first row) and a standing lamp base (second row) highlighted with red circles.

Figure 4: S3DIS instance segmentation proposal recall comparison while varying the IoU threshold.

Figure 5: S3DIS instance segmentation confusion matrix for ground truth object labels.

Figure 6: Comparison of S3DIS instance segmentation results. Left is SGPN [38], and right is ours.

Table 3: S3DIS instance segmentation proposal recall comparison per class. IoU threshold is 0.5.

           mean  ceiling floor wall  beam  column window door  table chair sofa  bookcase board
SGPN [38]  64.7  67.0    71.4  66.8  54.5  45.4   51.2   69.9  63.1  67.6  64.0  54.4     60.5
Ours       69.1  95.4    99.2  77.3  48.0  39.2   68.2   49.2  56.0  53.2  35.3  31.6     42.2

4.3 S3DIS Instance Segmentation

The Stanford 3D Indoor Semantic Dataset (S3DIS) [2] is a collection of real scan data of indoor scenes with annotations of instance segments and their semantic labels. When segmenting instances in such data, the main difference from the semantic segmentation of ShapeNet is that there can exist multiple instances of the same semantic label. Thus, the approach of classifying points with labels is not applicable. Recently, Wang et al.
[38] tried to solve this problem by leveraging the PointNet architecture. Their framework, named SGPN, learns a similarity metric among points, enabling every point to generate an instance proposal based on proximity in the learned feature space. The per-point proposals are further merged in a heuristic post-processing step. We compare the performance of our method under the same experimental setup as SGPN. The input is a 4k point cloud of a 1m × 1m floor block in the scenes, and each block contains up to 150 instances. Thus, we use k = 150 and γ = 1.0. Refer to [38] for the details of the data preparation. In the experiments of both methods, all 6 areas of scenes except area 5 are used as the training set, and area 5 is used as the test set.

Evaluation We evaluate the performance of instance proposal prediction in each block of the scenes.³ As an evaluation metric, we use proposal recall [13], which measures the percentage of ground truth instances covered by any prediction within a given IoU threshold. In both SGPN and our model, the outputs are non-overlapping segments, thus we do not consider the number of proposals in the evaluation.

³Wang et al. [38] propose a heuristic process of merging the prediction results of each block and generating instance proposals in a scene, but we measure the performance for each block in order to factor out the effect of this post-processing step.

Figure 7: Output atomic functions with random continuous functions on MPI-FAUST human shapes [4]. k = 10 and γ = 0.0. The order of atoms is consistent.

Figure 8: Five parts transferred from the base shape (left) to other shapes (each row).
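The proposal recall metric just described can be sketched as follows; a minimal NumPy version, where the boolean-mask representation and names are assumptions for illustration:

```python
import numpy as np

def proposal_recall(gt_masks, pred_masks, iou_threshold=0.5):
    # Fraction of ground-truth instances covered by some predicted segment
    # with IoU at or above the threshold. Masks are boolean arrays over points.
    covered = 0
    for gt in gt_masks:
        best = 0.0
        for pred in pred_masks:
            inter = np.logical_and(gt, pred).sum()
            union = np.logical_or(gt, pred).sum()
            if union > 0:
                best = max(best, inter / union)
        if best >= iou_threshold:
            covered += 1
    return covered / len(gt_masks)
```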
Figure 4 depicts the proposal recall of both methods when varying the IoU threshold from 0.5 to 1.0. The recall of our method is greater than the baseline's throughout all threshold levels. The recalls for each semantic part label with IoU threshold 0.5 are reported in Table 3. Our method performs especially well for large objects such as ceilings, floors, walls, and windows. Note that Wang et al. [38] start their training from a pretrained model for semantic label prediction, and their framework also consumes point labels as supervision during training to jointly predict labels and segments. Our model is trained from scratch and label-free.

Consistency with semantic labels Although it is hard to expect strong correlations between semantic part labels and the indices of dictionary atoms in this experiment, due to the large variation of the scene data, we still observe weak consistency between them. Figure 5 illustrates confusion among semantic part labels. This confusion is calculated by first creating a vector for each label in which the i-th element indicates the count of the label in the i-th atom, normalizing this vector, and taking a dot product for every pair of labels. Ceilings and floors are clearly distinguished from the others due to their unique positions and scales. Some groups of objects having similar heights (e.g. doors, bookcases, and boards; chairs and sofas) are confused with each other frequently, but objects in different groups are discriminated well.

4.4 MPI-FAUST Human Shape Bases Synchronization

In this experiment, we aim at finding synchronized atomic functions in a collection of shapes whose linear combinations can approximate any continuous function. Such synchronized atomic functions can be utilized in transferring any information on one shape to another without having point-wise correspondences. Here, we test with 100 non-rigid human body shapes in the MPI-FAUST dataset [4].
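Such a transfer can be sketched with plain least squares, assuming dictionaries with consistently ordered atoms (the function name and shapes are illustrative, not the paper's code):

```python
import numpy as np

def transfer_function(f_src, A_src, A_tgt):
    # Project a function on the source shape onto its dictionary by least
    # squares, then reconstruct it on the target shape by applying the same
    # coefficients to the target's consistently ordered atoms.
    x, *_ = np.linalg.lstsq(A_src, f_src, rcond=None)
    return A_tgt @ x
```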
Since the shapes are deformable, it is not appropriate to process the Euclidean coordinates of a point cloud as inputs. Hence, instead of a point cloud and PointNet, we use HKS [35] and WKS [3] point descriptors for every vertex, and process them using seven residual layers shared across all points, as proposed in [23]. The point descriptors cannot clearly distinguish symmetric parts of a shape, so the output atomic functions also become symmetric. To break the ambiguity, we sample four points using farthest point sampling in each shape, find their one-to-one correspondences in the other shapes using the same point descriptors, and use the geodesic distances from these points as additional point features. As input functions, we compute the Laplace-Beltrami operator on each shape and take a random linear combination of its first ten eigenbases.

Figure 7 visualizes the output atomic functions when we train the network with k = 10 and γ = 0.0. The order of the atomic functions is consistent across all shapes. In Figure 8, we show how information on one shape is transferred to the other shapes using our atomic functions. We project the indicator function of each segment (left in the figure) onto the function dictionary space of the base shape, and unproject it in the function dictionary spaces of the other shapes. The transferred segment functions are blurry, since the network is trained with only continuous functions, but they still indicate the proper areas of the segments.

5 Conclusion

We have investigated the problem of jointly analyzing probe functions defined on different shapes and finding a common latent space through a neural network. The proposed learning framework predicts a function dictionary for each shape that spans the input semantic functions, and finds the atomic functions in a consistent order without any correspondence information.
Our framework is very general, enabling easy adaptation to any neural network architecture and any application scenario. We have shown examples of constraints in the loss function that allow the atomic functions to have desired properties in specific applications: the smallest parts in segmentation, and single points in keypoint correspondence.

In the future, we will further explore the potential of our framework in various applications and even in different data domains. We will also investigate how the power of a neural network to decompose a function space into atoms can be enhanced through different architectures and a hierarchical basis structure.

Acknowledgments

We thank the anonymous reviewers for their comments and suggestions. This project was supported by a DoD Vannevar Bush Faculty Fellowship, NSF grants CHS-1528025 and IIS-1763268, and an Amazon AWS AI Research gift.

References

[1] Anali Alfaro, Domingo Mery, and Alvaro Soto. Action recognition in video using sparse coding and relative features. In CVPR, 2016.

[2] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3D semantic parsing of large-scale indoor spaces. In CVPR, 2016.

[3] M. Aubry, U. Schlickewei, and D. Cremers. The wave kernel signature: A quantum mechanical approach to shape analysis. In ICCV Workshops, 2011.

[4] Federica Bogo, Javier Romero, Matthew Loper, and Michael J. Black. FAUST: Dataset and evaluation for 3D mesh registration. In CVPR, 2014.

[5] Hilton Bristow, Anders Eriksson, and Simon Lucey. Fast convolutional sparse coding. In CVPR, 2013.

[6] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In ICLR, 2014.

[7] Angel X. Chang, Thomas A. Funkhouser, Leonidas J. Guibas, Pat Hanrahan, Qi-Xing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu.
ShapeNet: An information-rich 3D model repository. CoRR, abs/1512.03012, 2015.

[8] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 2001.

[9] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990.

[10] E. Mehr, V. Guitteny, N. Thome, and M. Cord. Manifold learning in quotient spaces. In CVPR, 2018.

[11] Davide Eynard, Emanuele Rodolà, Klaus Glashoff, and Michael M. Bronstein. Coupled functional maps. In 3DV, pages 399–407, 2016.

[12] Thomas Hofmann. Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999.

[13] Jan Hosang, Rodrigo Benenson, Piotr Dollár, and Bernt Schiele. What makes for effective detection proposals? IEEE TPAMI, 2016.

[14] Qi-Xing Huang, Hao Su, and Leonidas Guibas. Fine-grained semi-supervised labeling of large shape collections. In SIGGRAPH Asia, 2013.

[15] Qiangui Huang, Weiyue Wang, and Ulrich Neumann. Recurrent slice networks for 3D segmentation on point clouds. In CVPR, 2018.

[16] Qixing Huang, Vladlen Koltun, and Leonidas Guibas. Joint shape segmentation with linear programming. In SIGGRAPH Asia, 2011.

[17] Qixing Huang, Fan Wang, and Leonidas Guibas. Functional map networks for analyzing and exploring large shape collections. In SIGGRAPH, 2014.

[18] Evangelos Kalogerakis, Melinos Averkiou, Subhransu Maji, and Siddhartha Chaudhuri. 3D shape segmentation with projective convolutional networks. In CVPR, 2017.

[19] Roman Klokov and Victor S. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3D point cloud models. In ICCV, 2017.

[20] Artiom Kovnatsky, Michael M. Bronstein, Xavier Bresson, and Pierre Vandergheynst. Functional correspondence by matrix completion.
In CVPR, 2015.

[21] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. In NIPS, 2007.

[22] Michael S. Lewicki and Terrence J. Sejnowski. Learning overcomplete representations. Neural Computation, 2000.

[23] Or Litany, Tal Remez, Emanuele Rodolà, Alex Bronstein, and Michael Bronstein. Deep functional maps: Structured prediction for dense shape correspondence. In CVPR, 2017.

[24] Dorian Nogneng and Maks Ovsjanikov. Informative descriptor preservation via commutativity for shape matching. In Eurographics, 2017.

[25] Bruno A. Olshausen. Sparse coding of time-varying natural images. Journal of Vision, 2002.

[26] Bruno A. Olshausen and David J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 1996.

[27] Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 1997.

[28] Bruno A. Olshausen and David J. Field. Sparse coding of sensory inputs. Current Opinion in Neurobiology, 2004.

[29] Maks Ovsjanikov, Mirela Ben-Chen, Justin Solomon, Adrian Butscher, and Leonidas Guibas. Functional maps: A flexible representation of maps between shapes. In SIGGRAPH, 2012.

[30] Jonathan Pokrass, Alexander M. Bronstein, Michael M. Bronstein, Pablo Sprechmann, and Guillermo Sapiro. Sparse modeling of intrinsic correspondences. In Eurographics, 2013.

[31] Charles R. Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J. Guibas. Frustum PointNets for 3D object detection from RGB-D data. In CVPR, 2018.

[32] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.

[33] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space.
In NIPS, 2017.

[34] Emanuele Rodolà, Luca Cosmo, Michael M. Bronstein, Andrea Torsello, and Daniel Cremers. Partial functional correspondence. In SGP, 2016.

[35] Jian Sun, Maks Ovsjanikov, and Leonidas Guibas. A concise and provably informative multi-scale signature based on heat diffusion. In SGP, 2009.

[36] F. Wang, Q. Huang, M. Ovsjanikov, and L. J. Guibas. Unsupervised multi-class joint image segmentation. In CVPR, 2014.

[37] Fan Wang, Qixing Huang, and Leonidas J. Guibas. Image co-segmentation via consistent functional maps. In ICCV, 2013.

[38] Weiyue Wang, Ronald Yu, Qiangui Huang, and Ulrich Neumann. SGPN: Similarity group proposal network for 3D point cloud instance segmentation. In CVPR, 2018.

[39] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds. arXiv, 2018.

[40] Li Yi, Vladimir G. Kim, Duygu Ceylan, I-Chao Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, and Leonidas Guibas. A scalable active framework for region annotation in 3D shape collections. In SIGGRAPH Asia, 2016.

[41] Li Yi, Hao Su, Xingwen Guo, and Leonidas J. Guibas. SyncSpecCNN: Synchronized spectral CNN for 3D shape segmentation. In CVPR, 2017.

[42] Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, and Rob Fergus. Deconvolutional networks. In CVPR, 2010.