{"title": "PointCNN: Convolution On X-Transformed Points", "book": "Advances in Neural Information Processing Systems", "page_first": 820, "page_last": 830, "abstract": "We present a simple and general framework for feature learning from point cloud. The key to the success of CNNs is the convolution operator that is capable of leveraging spatially-local correlation in data represented densely in grids (e.g. images). However, point cloud are irregular and unordered, thus a direct convolving of kernels against the features associated with the points will result in deserting the shape information while being variant to the orders. To address these problems, we propose to learn a X-transformation from the input points, which is used for simultaneously weighting the input features associated with the points and permuting them into latent potentially canonical order. Then element-wise product and sum operations of typical convolution operator are applied on the X-transformed features. The proposed method is a generalization of typical CNNs into learning features from point cloud, thus we call it PointCNN. Experiments show that PointCNN achieves on par or better performance than state-of-the-art methods on multiple challenging benchmark datasets and tasks.", "full_text": "PointCNN: Convolution On X -Transformed Points\n\nYangyan Li\u2020\u21e4 Rui Bu\u2020 Mingchao Sun\u2020 Wei Wu\u2020 Xinhan Di\u2021 Baoquan Chen\u00a7\n\n\u2020Shandong University\n\n\u2021Huawei Inc.\n\n\u00a7Peking University\n\nAbstract\n\nWe present a simple and general framework for feature learning from point clouds.\nThe key to the success of CNNs is the convolution operator that is capable of\nleveraging spatially-local correlation in data represented densely in grids (e.g. im-\nages). However, point clouds are irregular and unordered, thus directly convolving\nkernels against features associated with the points will result in desertion of shape\ninformation and variance to point ordering. 
To address these problems, we propose to learn an X-transformation from the input points to simultaneously promote two causes: the first is the weighting of the input features associated with the points, and the second is the permutation of the points into a latent and potentially canonical order. Element-wise product and sum operations of the typical convolution operator are subsequently applied on the X-transformed features. The proposed method is a generalization of typical CNNs to feature learning from point clouds, thus we call it PointCNN. Experiments show that PointCNN achieves on par or better performance than state-of-the-art methods on multiple challenging benchmark datasets and tasks.

1 Introduction

Spatially-local correlation is a ubiquitous property of various types of data that is independent of the data representation. For data that is represented in regular domains, such as images, the convolution operator has been shown to be effective in exploiting that correlation as the key contributor to the success of CNNs on a variety of tasks [25]. However, for data represented in point cloud form, which is irregular and unordered, the convolution operator is ill-suited for leveraging spatially-local correlations in the data.

[Figure 1: a regular 2 × 2 grid patch (i) and three point neighborhoods (ii-iv), with the same four points visited in different orders.]

Figure 1: Convolution input from regular grids (i) and point clouds (ii-iv). In (i), each grid cell is associated with a feature.
In (ii-iv), the points are sampled from local neighborhoods, in analogy to local patches in (i), and each point is associated with a feature, an order index, and coordinates.

fii = Conv(K, [fa, fb, fc, fd]T),
fiii = Conv(K, [fa, fb, fc, fd]T),    (1a)
fiv = Conv(K, [fc, fa, fb, fd]T).

fii = Conv(K, Xii × [fa, fb, fc, fd]T),
fiii = Conv(K, Xiii × [fa, fb, fc, fd]T),    (1b)
fiv = Conv(K, Xiv × [fc, fa, fb, fd]T).

We illustrate the problems and challenges of applying convolutions on point clouds in Figure 1. Suppose the unordered set of the C-dimensional input features is the same F = {fa, fb, fc, fd} in all the cases ((i)-(iv)), and we have one kernel K = [kα, kβ, kγ, kδ]T of shape 4 × C. In (i), by following the canonical order given by the regular grid structure, the features in the local 2 × 2 patch can be cast into [fa, fb, fc, fd]T of shape 4 × C, for convolving with K, yielding fi = Conv(K, [fa, fb, fc, fd]T), where Conv(·, ·) is simply an element-wise product followed by a sum². In (ii), (iii), and (iv), the points are sampled from local neighborhoods, and thus their ordering may be arbitrary. By following the orders as illustrated in the figure, the input feature set F can be cast into [fa, fb, fc, fd]T in (ii) and (iii), and [fc, fa, fb, fd]T in (iv). Based on this, if the convolution operator is directly applied, the output features for the three cases could be computed as depicted in Eq. 1a. Note that fii ≡ fiii holds for all cases, while fiii ≠ fiv holds for most cases.

*Part of the work was done during Yangyan's Autodesk Research 2017 summer visit.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
This example illustrates that a direct convolution results in deserting shape information (i.e., fii ≡ fiii), while retaining variance to the ordering (i.e., fiii ≠ fiv).

In this paper, we propose to learn a K × K X-transformation for the coordinates of K input points (p1, p2, ..., pK), with a multilayer perceptron [39], i.e., X = MLP(p1, p2, ..., pK). Our aim is to use it to simultaneously weight and permute the input features, and subsequently apply a typical convolution on the transformed features. We refer to this process as X-Conv, and it is the basic building block for our PointCNN. The X-Conv for (ii), (iii), and (iv) in Figure 1 can be formulated as in Eq. 1b, where the Xs are 4 × 4 matrices, as K = 4 in this figure. Note that since Xii and Xiii are learned from points of different shapes, they can differ so as to weight the input features accordingly, and achieve fii ≠ fiii. For Xiii and Xiv, if they are learned to satisfy Xiii = Xiv × Π, where Π is the permutation matrix for permuting (c, a, b, d) into (a, b, c, d), then fiii ≡ fiv can be achieved.

From the analysis of the example in Figure 1, it is clear that, with ideal X-transformations, X-Conv is capable of taking the point shapes into consideration, while being invariant to ordering. In practice, we find that the learned X-transformations are far from ideal, especially in terms of the permutation equivariance aspect. Nevertheless, PointCNN built with X-Conv is still significantly better than a direct application of typical convolutions on point clouds, and on par or better than state-of-the-art neural networks designed for point cloud input data, such as PointNet++ [35].

Section 3 contains the details of X-Conv, as well as PointCNN architectures.
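The order-variance argument above can also be checked numerically. Below is a minimal NumPy sketch of Eq. 1: Conv(·, ·) is an element-wise product followed by a sum, the kernel and features are random stand-ins, and the ideal X for case (iv) is constructed exactly as the inverse permutation (in PointCNN it is learned rather than constructed):

```python
import numpy as np

rng = np.random.default_rng(0)
K, C = 4, 3                              # 4 points per neighborhood, C feature channels
kernel = rng.normal(size=(K, C))         # stand-in for K = [k_alpha, ..., k_delta]^T

def conv(kernel, feats):
    # The paper's Conv(., .): element-wise product followed by a sum.
    return float(np.sum(kernel * feats))

feats_iii = rng.normal(size=(K, C))      # rows: fa, fb, fc, fd (order in case iii)
perm = [2, 0, 1, 3]                      # case iv order: fc, fa, fb, fd
P = np.eye(K)[perm]                      # permutation matrix: P @ feats_iii == feats_iii[perm]
feats_iv = feats_iii[perm]

f_iii = conv(kernel, feats_iii)          # direct convolution, case iii
f_iv = conv(kernel, feats_iv)            # direct convolution, case iv: differs in general

X_iv = P.T                               # an ideal X for case iv: undoes the permutation
f_iv_canon = conv(kernel, X_iv @ feats_iv)
assert np.isclose(f_iii, f_iv_canon)     # order invariance restored
```

With the exact inverse permutation as X, fiii ≡ fiv holds; the point of PointCNN is that such an X is produced by a learned MLP of the point coordinates instead.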
We show our results on multiple challenging benchmark datasets and tasks in Section 4, together with ablation experiments and visualizations for a better understanding of PointCNN.

2 Related Work

Feature Learning from Regular Domains. CNNs have been very successful for leveraging spatially-local correlation in images — pixels in 2D regular grids [26]. There has been work in extending CNNs to higher dimensional regular domains, such as 3D voxels [52]. However, as both the input and convolution kernels are of higher dimensions, the amount of both computation and memory inflates dramatically. Octree [37, 47], Kd-Tree [22] and Hash [41] based approaches have been proposed to save computation by skipping convolution in empty space. The activations are kept sparse in [13] to retain sparsity in convolved layers. [17] and [4] partition point clouds into grids and represent each grid with grid mean points and Fisher vectors, respectively, for convolving with 3D kernels. In these approaches, the kernels themselves are still dense and of high dimension. Sparse kernels are proposed in [28], but this approach cannot be applied recursively for learning hierarchical features. Compared with these methods, PointCNN is sparse in both input representation and convolution kernels.

Feature Learning from Irregular Domains. Stimulated by the rapid advances and demands in 3D sensing, there have been quite a few recent developments in feature learning from 3D point clouds. PointNet [33] and Deep Sets [58] proposed to achieve input order invariance by the use of a symmetric function over inputs. PointNet++ [35] and SO-Net [27] apply PointNet hierarchically for better capturing of local structures. Kernel correlation and graph pooling are proposed for improving PointNet-like methods in [42]. An RNN is used in [18] for processing features aggregated by pooling from ordered point cloud slices.
[50] proposed to leverage neighborhood structures in both point and feature spaces. While these symmetric pooling based approaches, as well as those in [10, 58, 36], have guarantees of achieving order invariance, they come at the price of throwing away information.

² Actually, this is a special instance of convolution — a convolution that is applied in one spatial location. For simplicity, we call it convolution as well.

[43, 3, 44] propose to first “interpolate” or “project” features into predefined regular domains, where typical CNNs can be applied. In contrast, the regular domain is latent in our method. CNN kernels are represented as parametric functions of neighborhood point positions to generalize CNNs for point clouds in [48, 14, 53]. The kernels associated with each point are parametrized individually in these methods, while the X-transformations in our method are learned from each neighborhood, thus could potentially be more adaptive to local structures.

Besides point clouds, sparse data in irregular domains can be represented as graphs or meshes, and a few works have been proposed for feature learning from such representations [31, 55, 30]. We refer the interested reader to [5] for a comprehensive survey of work along these directions. Spectral graph convolution on a local graph is used for processing point clouds in [46].

Invariance vs. Equivariance. A line of pioneering work aiming at achieving equivariance has been proposed to address the information loss problem of pooling in achieving invariance [16, 40]. The X-transformations in our formulation, ideally, are capable of realizing equivariance, and are demonstrated to be effective in practice.
We also found similarity between PointCNN and Spatial Transformer Networks [20], in the sense that both of them provide a mechanism to “transform” input into latent canonical forms for further processing, with no explicit loss or constraint enforcing the canonicalization. In practice, it turns out that the networks find their own ways to leverage the mechanism for learning better. In PointCNN, the X-transformation is supposed to serve for both weighting and permutation, and thus is modelled as a general matrix. This is different from [8], where a permutation matrix is the desired output, which is approximated by a doubly stochastic matrix.

3 PointCNN

The hierarchical application of convolutions is essential for learning hierarchical representations via CNNs. PointCNN shares the same design and generalizes it to point clouds. First, we introduce hierarchical convolutions in PointCNN, in analogy to those of image CNNs; then, we explain the core X-Conv operator in detail; and finally, we present PointCNN architectures geared toward various tasks.

3.1 Hierarchical Convolution

[Figure 2: hierarchical convolution on a 4 × 4 grid (upper) and on a point cloud of 9 points (lower).]

Figure 2: Hierarchical convolution on regular grids (upper) and point clouds (lower). In regular grids, convolutions are recursively applied on local grid patches, which often reduces the grid resolution (4 × 4 → 3 × 3 → 2 × 2), while increasing the channel number (visualized by dot thickness). Similarly, in point clouds, X-Conv is recursively applied to “project”, or “aggregate”, information from neighborhoods into fewer representative points (9 → 5 → 2), but each with richer information.

Before we introduce the hierarchical convolution in PointCNN, we briefly go through its well known version for regular grids, as illustrated in Figure 2 upper.
The input to grid-based CNNs is a feature map F1 of shape R1 × R1 × C1, where R1 is the spatial resolution, and C1 is the feature channel depth. The convolution of kernels K of shape K × K × C1 × C2 against local patches of shape K × K × C1 from F1 yields another feature map F2 of shape R2 × R2 × C2. Note that in Figure 2 upper, R1 = 4, K = 2, and R2 = 3. Compared with F1, F2 is often of lower resolution (R2 < R1) and of deeper channels (C2 > C1), and encodes higher level information. This process is recursively applied, producing feature maps with decreasing spatial resolution (4 × 4 → 3 × 3 → 2 × 2 in Figure 2 upper), but deeper channels (visualized by increasingly thicker dots in Figure 2 upper).

The input to PointCNN is F1 = {(p1,i, f1,i) : i = 1, 2, ..., N1}, i.e., a set of points {p1,i : p1,i ∈ R^Dim}, each associated with a feature {f1,i : f1,i ∈ R^C1}. Following the hierarchical construction of grid-based CNNs, we would like to apply X-Conv on F1 to obtain a higher level representation F2 = {(p2,i, f2,i) : f2,i ∈ R^C2, i = 1, 2, ..., N2}, where {p2,i} is a set of representative points of {p1,i} and F2 is of a smaller spatial resolution and deeper feature channels than F1, i.e., N2 < N1, and C2 > C1.

[Figure 3: a neighborhood in global coordinates (a), in the local coordinates of its representative point (b), and after lifting the local coordinates into features (c).]

Figure 3: The process for converting point coordinates to features. Neighboring points are transformed to the local coordinate systems of the representative points (a and b). The local coordinates of each point are then individually lifted and combined with the associated features (c).

When the X-Conv process of turning F1 into F2 is recursively applied, the input points with features are “projected”, or “aggregated”, into fewer points (9 → 5 → 
2 in Figure 2 lower), but\neach with increasingly richer features (visualized by increasingly thicker dots in Figure 2 lower).\nThe representative points {p2,i} should be the points that are bene\ufb01cial for the information \u201cprojection\u201d\nor \u201caggregation\u201d. In our implementation, they are generated by random down-sampling of {p1,i} in\nclassi\ufb01cation tasks, and farthest point sampling in segmentation tasks, since segmentation tasks are\nmore demanding on a uniform point distribution. We suspect some more advanced point selections\nwhich have shown promising performance in geometry processing, such as Deep Points [51], could\n\ufb01t in here as well. We leave the exploration of better representative point generation methods for\nfuture work.\n\n3.2 X -Conv Operator\nX -Conv is the core operator for turning F1 into F2. In this section, we \ufb01rst introduce the input, output\nand procedure of the operator, and then explain the rationale behind the procedure.\n\nALGORITHM 1: X -Conv Operator\nInput\n:K, p, P, F\nOutput :Fp\n1: P0 P  p.\n2: F MLP (P0)\n3: F\u21e4 [F, F]\n4: X MLP(P0)\n5: FX X\u21e5 F\u21e4\n6: Fp Conv(K, FX )\n\n. Features \u201cprojected\u201d, or \u201caggregated\u201d, into representative point p\nMove P to local coordinate system of p\n. Individually lift each point into C dimensional space\n. Concatenate F and F, F\u21e4 is a K \u21e5 (C + C1) matrix\n. Learn the K \u21e5 K X -transformation matrix\n. Weight and permute F\u21e4 with the learnt X\n. Finally, typical convolution between K and FX\n\nTo leverage spatially-local correlation, similar to convolution in grid-based CNNs, X -Conv operates\nin local regions. Since the output features are supposed to be associated with the representative\npoints {p2,i}, X -Conv takes their neighborhood points in {p1,i}, as well as the associated features,\nas input to convolve with. 
For simplicity, we denote a representative point in {p2,i} as p, the feature associated with p as f, and its K neighbors in {p1,i} as N; thus the X-Conv input for this specific p is S = {(pi, fi) : pi ∈ N}. Note that S is an unordered set. Without loss of generality, S can be cast into a K × Dim matrix P = (p1, p2, ..., pK)T and a K × C1 matrix F = (f1, f2, ..., fK)T, and K denotes the trainable convolution kernels. With these inputs, we would like to compute the features Fp, which are the “projection”, or “aggregation”, of the input features into the representative point p. We detail the X-Conv operator in Algorithm 1, and summarize it concisely as:

Fp = X-Conv(K, p, P, F) = Conv(K, MLP(P − p) × [MLPδ(P − p), F]),    (2)

where MLPδ(·) is a multilayer perceptron applied individually on each point, as in PointNet [33]. Note that all the operations involved in building X-Conv, i.e., Conv(·, ·), MLP(·), matrix multiplication (·) × (·), and MLPδ(·), are differentiable. Accordingly, X-Conv is differentiable, and can be plugged into a neural network for training by back propagation.

Lines 4-6 in Algorithm 1 are the core X-transformation as described in Eq. 1b in Section 1. Here, we explain the rationale behind lines 1-3 of Algorithm 1 in detail. X-Conv is designed to work on local point regions, and the output should not be dependent on the absolute positions of p and its neighboring points, but on their relative positions. To that end, we position local coordinate systems at the representative points (line 1 of Algorithm 1, Figure 3b). It is the local coordinates of the neighboring points, together with their associated features, that define the output features.
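As a concrete illustration of Algorithm 1 and Eq. 2, the following NumPy sketch runs one X-Conv evaluation with random weights standing in for the trained MLPδ and MLP. For brevity it uses a single output channel (so the kernel K is one K × (Cδ + C1) matrix) and a point-wise map to produce X; it is a simplified stand-in, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def point_mlp(x, layers):
    # Shared MLP applied to each row (point) independently, ReLU activations.
    for W, b in layers:
        x = np.maximum(x @ W + b, 0.0)
    return x

def x_conv(kernel, p, P, F, mlp_delta, mlp_x):
    K_pts = P.shape[0]
    P_local = P - p                                  # line 1: local coordinates
    F_delta = point_mlp(P_local, mlp_delta)          # line 2: lift to C_delta dims
    F_star = np.concatenate([F_delta, F], axis=1)    # line 3: K x (C_delta + C1)
    X = point_mlp(P_local, mlp_x)                    # line 4: K x K transform
    assert X.shape == (K_pts, K_pts)
    F_X = X @ F_star                                 # line 5: weight and permute
    return float(np.sum(kernel * F_X))               # line 6: elem-wise product + sum

K_pts, Dim, C1, Cd = 4, 3, 8, 16
mlp_delta = [(0.1 * rng.normal(size=(Dim, Cd)), np.zeros(Cd))]
mlp_x = [(0.1 * rng.normal(size=(Dim, K_pts)), np.zeros(K_pts))]
kernel = rng.normal(size=(K_pts, Cd + C1))

p = rng.normal(size=Dim)                             # representative point
P = rng.normal(size=(K_pts, Dim))                    # its K neighbors
F = rng.normal(size=(K_pts, C1))                     # neighbor features
fp = x_conv(kernel, p, P, F, mlp_delta, mlp_x)
```

Every step is differentiable, which is what allows the real X-Conv to be trained end-to-end by back propagation.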
However, the local coordinates are of a different dimensionality and representation than the associated features. To address this issue, we first lift the coordinates into a higher dimensional and more abstract representation (line 2 of Algorithm 1), and then combine it with the associated features (line 3 of Algorithm 1) for further processing (Figure 3c).

Lifting coordinates into features is done through a point-wise MLPδ(·), as in PointNet-based methods. Unlike in those methods, however, the lifted features are not processed by a symmetric function. Instead, along with the associated features, they are weighted and permuted by the X-transformation that is jointly learned across all neighborhoods. The resulting X is dependent on the order of the points, and this is desired, as X is supposed to permute F* according to the input points, and therefore has to be aware of the specific input order. For an input point cloud without any additional features, i.e., where F is empty, the first X-Conv layer uses only Fδ. PointCNN can thus handle point clouds with or without additional features in a robust, uniform fashion.

For more details about the X-Conv operator, including the actual definitions of MLPδ(·), MLP(·) and Conv(·, ·), please refer to Supplementary Material Section 1.

3.3 PointCNN Architectures

From Figure 2, we can see that the Conv layers in grid-based CNNs and the X-Conv layers in PointCNN only differ in two aspects: the way the local regions are extracted (K × K patches vs. K neighboring points around representative points) and the way the information from local regions is learned (Conv vs. X-Conv).
Otherwise, the process of assembling a deep network with X-Conv layers highly resembles that of grid-based CNNs.

[Figure 4: three PointCNN architectures built from stacked X-Conv layers.]

Figure 4: PointCNN architecture for classification (a and b) and segmentation (c), where N and C denote the output representative point number and feature dimensionality, K is the neighboring point number for each representative point, and D is the X-Conv dilation rate.

Figure 4a depicts a simple PointCNN with two X-Conv layers that gradually transform the input points (with or without features) into fewer representative points, but each with richer features. After the second X-Conv layer, there is only one representative point left, and it aggregates information from all the points of the previous layer. In PointCNN, we can roughly define the receptive field of each representative point as the ratio K/N, where K is the neighboring point number, and N is the point number in the previous layer. With this definition, the final point “sees” all the points from the previous layer, thus has a receptive field of 1.0 — it has a global view of the entire shape, and its features are informative for semantic understanding of the shape. We can add fully connected layers on top of the last X-Conv layer output, followed by a loss, for training the network.

Note that the number of training samples for the top X-Conv layers drops rapidly (Figure 4a), making it inefficient to train them thoroughly. To address this problem, we propose PointCNN with denser connections (Figure 4b), where more representative points are kept in the X-Conv layers.
However, we aim to maintain the depth of the network while keeping the receptive field growth rate, such that the deeper representative points “see” increasingly larger portions of the entire shape. We achieve this goal by employing the dilated convolution idea from grid-based CNNs in PointCNN. Instead of always taking the K nearest points as input, we uniformly sample K input points from the K × D nearest points, where D is the dilation rate. In this case, the receptive field increases from K/N to (K × D)/N, without increasing the actual neighboring point count or kernel size.

In the second X-Conv layer of the PointCNN in Figure 4b, dilation rate D = 2 is used, thus all four remaining representative points “see” the entire shape, and all of them are suitable for making predictions. Note that, in this way, we can train the top X-Conv layers more thoroughly, as many more connections are involved in the network, compared to the PointCNN in Figure 4a.

Table 1: Comparisons of mean per-class accuracy (mA) and overall accuracy (OA) (%) on ModelNet40 [52] and ScanNet [9]. The reported performances are based on 1024 input points, unless otherwise noted by P# (# input points) or PN# (# input points with normals).

Method | ModelNet40 pre-aligned mA | ModelNet40 pre-aligned OA | ModelNet40 unaligned mA | ModelNet40 unaligned OA | ScanNet mA | ScanNet OA
Flex-Convolution [14] | - | 90.2 | - | - | - | -
KCNet [42] | - | 91 | - | - | - | -
Kd-Net [22] | 88.5 | 90.6 (91.8 w/ P32768) | - | - | - | -
SO-Net [27] | - | 90.7 (93.4 w/ PN5000) | - | - | - | -
3DmFV-Net [4] | - | 91.4 (91.6 w/ P2048) | - | - | - | -
PCNN [3] | - | 92.3 | - | - | - | -
PointNet [33] | - | - | 86.2 | 89.2 | - | -
PointNet++ [35] | - | - | - | 90.7 (91.9 w/ PN5000) | - | -
SpecGCN [46] | - | - | - | 91.5 (92.1 w/ PN2048) | - | -
SpiderCNN [53] | - | - | - | - (92.4 w/ PN1024) | - | -
DGCNN [50] | - | - | 90.2 | 92.2 | - | 76.1
PointCNN | 88.8 | 92.5 | 88.1 | 92.2 | 55.7 | 79.7
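The dilated neighbor selection described above is easy to state in code: take the K × D nearest points and uniformly sample K of them, so the receptive field grows from K/N to (K × D)/N at no extra kernel cost. A minimal sketch (our own illustration, not the released code):

```python
import numpy as np

rng = np.random.default_rng(2)

def dilated_knn(points, rep_point, K, D):
    # Uniformly sample K neighbor indices out of the K*D nearest points
    # to rep_point; D is the dilation rate (D=1 recovers plain K-NN).
    dists = np.linalg.norm(points - rep_point, axis=1)
    candidates = np.argsort(dists)[:K * D]
    return rng.choice(candidates, size=K, replace=False)

pts = rng.normal(size=(64, 3))               # N = 64 points in the previous layer
idx = dilated_knn(pts, pts[0], K=8, D=2)     # receptive field (K*D)/N = 16/64
```

The kernel still convolves only K points; the dilation only widens the region those K points are drawn from.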
At test time, the outputs from the multiple representative points are averaged right before the softmax to stabilize the prediction. This design is similar to that of Network in Network [29]. The denser version of PointCNN (Figure 4b) is the one we used for classification tasks.

For segmentation tasks, high resolution point-wise output is required, and this can be realized by building PointCNN following the Conv-DeConv [32] architecture, where the DeConv part is responsible for propagating global information into high resolution predictions (see Figure 4c). Note that both the “Conv” and “DeConv” parts of the PointCNN segmentation network use the same X-Conv operator. The only differences between the “Conv” and “DeConv” layers are that the latter has more points but fewer feature channels in its output vs. its input, and that its higher resolution points are forwarded from earlier “Conv” layers, following the design of U-Net [38].

Dropout is applied before the last fully connected layer to reduce over-fitting. We also employed the “subvolume supervision” idea from [34] to further address the over-fitting problem. In the last X-Conv layers, the receptive field is set to be less than 1, such that only partial information is “seen” by the representative points. The network is pushed to learn harder from the partial information during training, and performs better at test time. In this case, the global coordinates of the representative points matter, thus they are lifted into the feature space R^Cg with MLPg(·) (detailed in Supp. Material Section 1) and concatenated into X-Conv for further processing by the follow-up layers.

Data augmentation. To train the parameters in X-Conv, it is evidently not beneficial to keep using the same set of neighboring points, in the same order, for a specific representative point.
To improve generalization, we propose to randomly sample and shuffle the input points, such that both the neighboring point sets and the point order may differ from batch to batch. To train a model that takes N points as input, N(N, (N/8)²) points are used for training, where N(·, ·) denotes a Gaussian distribution. We found that this strategy is crucial for the successful training of PointCNN.

4 Experiments

We conducted an extensive evaluation of PointCNN for shape classification on six datasets (ModelNet40 [52], ScanNet [9], TU-Berlin [11], Quick Draw [15], MNIST, CIFAR10), and for the segmentation task on three datasets (ShapeNet Parts [54], S3DIS [2], and ScanNet [9]). The details of the datasets, and how we convert and feed data into PointCNN, are described in Supp. Material Section 2, and the PointCNN architectures for the tasks on these datasets can be found in Supp. Material Section 3.

4.1 Classification and Segmentation Results

We summarize our 3D point cloud classification results on ModelNet40 and ScanNet in Table 1, and compare to several neural network methods designed for point clouds. Note that a large portion of the 3D models from ModelNet40 are pre-aligned to a common up direction and horizontal facing direction. If a random horizontal rotation is not applied on either the training or testing sets, then the relatively consistent horizontal facing direction is leveraged, and the metrics based on this setting are not directly comparable to those with the random horizontal rotation. For this reason, we ran PointCNN and report its performance in both settings.
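The N(N, (N/8)²) sampling-and-shuffling augmentation described at the end of Section 3.3 can be sketched as follows (a simplified stand-in for the actual training pipeline):

```python
import numpy as np

rng = np.random.default_rng(3)

def augment_cloud(cloud, n_target):
    # Draw a Gaussian-distributed point count around n_target with sigma
    # n_target/8, then sample that many points in a shuffled order, so both
    # the neighborhood sets and the point order vary from batch to batch.
    n = max(1, int(round(rng.normal(n_target, n_target / 8))))
    idx = rng.choice(len(cloud), size=n, replace=n > len(cloud))
    return cloud[idx]

cloud = rng.normal(size=(2048, 3))
batch_item = augment_cloud(cloud, n_target=1024)
```

Because each batch sees a different subset and ordering of the points, the learned X-transformation cannot overfit to one fixed neighborhood order.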
Note that PointCNN achieved top performance on both ModelNet40 and ScanNet.

We evaluate PointCNN on the segmentation of the ShapeNet Parts, S3DIS, and ScanNet datasets, and summarize the results in Table 2. More detailed segmentation result comparisons can be found in Supplementary Material Section 4. We note that PointCNN outperforms all the compared methods, including SSCN [12], SPGraph [24] and SGPN [49], which are specialized segmentation networks with state-of-the-art performance. Note that the part averaged IoU metric for ShapeNet Parts is the one used in [56]. Compared with mean IoU, the part averaged IoU puts more emphasis on the correct prediction of small parts.

Table 2: Segmentation comparisons on ShapeNet Parts in part-averaged IoU (pIoU, %) and mean per-class pIoU (mpIoU, %), S3DIS in mean per-class IoU (mIoU, %) and ScanNet in per voxel overall accuracy (OA, %).

Method | ShapeNet Parts pIoU | ShapeNet Parts mpIoU | S3DIS mIoU | ScanNet OA
SyncSpecCNN [55] | 84.74 | 82.0 | - | -
Pd-Network [22] | 85.49 | 82.7 | - | -
SSCN [12] | 85.98 | 83.3 | - | -
SPLATNet [43] | 85.4 | 83.7 | - | -
SpiderCNN [53] | 85.3 | 81.7 | - | -
SO-Net [27] | 84.9 | 81.0 | - | -
PCNN [3] | 85.1 | 81.8 | - | -
KCNet [42] | 83.7 | 82.2 | - | -
SpecGCN [46] | 85.4 | - | - | -
Kd-Net [22] | 82.3 | 77.4 | - | -
3DmFV-Net [4] | 84.3 | 81.0 | - | -
RSNet [18] | 84.9 | 81.4 | 56.47 | -
DGCNN [50] | 85.1 | 82.3 | 56.1 | -
PointNet [33] | 83.7 | 80.4 | 47.6 | 73.9
PointNet++ [35] | 85.1 | 81.9 | - | 84.5
SGPN [49] | 85.8 | 82.8 | 50.37 | -
SPGraph [24] | - | - | 62.1 | -
TCDP [44] | - | - | - | 80.9
PointCNN | 86.14 | 84.6 | 65.39 | 85.1

Sketches are 1D curves in 2D space, and thus can be more effectively represented with point clouds rather than with 2D images. We evaluate PointCNN on TU-Berlin and Quick Draw sketches, and present results in Table 3, where we compare its performance with the competitive PointNet++, as well as with image CNN based methods. PointCNN outperforms PointNet++ on both datasets, with a more prominent advantage on Quick Draw (25M data samples), which is significantly larger than TU-Berlin (0.02M data samples). On the TU-Berlin dataset, while the performance of PointCNN is slightly better than that of the generic image CNN AlexNet [23], there is still a gap with the specialized Sketch-a-Net [57]. It is interesting to study whether architectural elements from Sketch-a-Net can be adopted and integrated into PointCNN to improve its performance on the sketch datasets.

Method | TU-Berlin | Quick Draw
Sketch-a-Net [57] | 77.95 | -
AlexNet [23] | 68.60 | -
PointNet++ [35] | 66.53 | 51.58
PointCNN | 70.57 | 59.13

Table 3: Sketch classification results.

Since X-Conv is a generalization of Conv, ideally, PointCNN should perform on par with CNNs if the underlying data is the same, but only represented differently. To verify this, we evaluate PointCNN on the point cloud representation of MNIST and CIFAR10, and show the results in Table 4. For MNIST data, PointCNN achieved comparable performance with the other methods, indicating its effective learning of the digits' shape information. For CIFAR10 data, where there is mostly no “shape” information, PointCNN has to learn mostly from the spatially-local correlation in the RGB features, and it performed reasonably well on this task, though there is a large gap between PointCNN and the mainstream image CNNs. From this experiment, we can conclude that CNNs are still the better choice for general images.

Method | MNIST | CIFAR10
LeNet [26] | 99.20 | 84.07
Network in Network [29] | 99.53 | 91.20
PointNet++ [35] | 99.49 | 10.03
PointCNN | 99.54 | 80.22

Table 4: Image classification results.

4.2 Ablation Experiments and Visualizations

Ablation test of the core X-Conv operator.
To verify the effectiveness of the X-transformation, we propose PointCNN without it as a baseline, where lines 4-6 of Algorithm 1 are replaced by Fp ← Conv(K, F*). Compared with PointCNN, the baseline has fewer trainable parameters, and is more “shallow” due to the removal of MLP(·) in line 4 of Algorithm 1. For a fair comparison, we further propose PointCNN w/o X-W/D, which is wider/deeper, and has approximately the same number of parameters as PointCNN. The model depth of PointCNN w/o X (deeper) also compensates for the decrease in depth caused by the removal of MLP(·) from PointCNN. The comparison results are summarized in Table 5. Clearly, PointCNN outperforms the proposed variants by a significant margin, and the gap between PointCNN and PointCNN w/o X is not due to model parameter number or model depth. With these comparisons, we conclude that X-Conv is the key to the performance of PointCNN.

³ PointNet++ performs no better than random choice on CIFAR10. We suspect the reason is that, in PointNet++, the RGB features become in-discriminative after being processed by the max-pooling. Together with the lack of “shape” information, PointNet++ fails completely on this task.

 | PointCNN | w/o X | w/o X-W | w/o X-D
Core Layers | X-Conv×4 | Conv×4 | Conv×4 | Conv×5
# Parameters | 0.6M | 0.54M | 0.61M | 0.63M
Accuracy (%) | 92.2 | 90.7 | 90.8 | 90.7

Table 5: Ablation tests on ModelNet40.

Figure 5: T-SNE visualization of features without (a/Fo), before (b/F*) and after (c/FX) X-transformation.

Visualization of X-Conv features. Each representative point, with its neighboring points in a particular order, has a corresponding F* and FX in R^(K×C*), where C* = Cδ + C1. For the same representative point, if its neighboring points in different orders are fed into the network, we get a set of F* matrices and a set of FX matrices, which we denote as F* and FX.
Similarly, we define the set of F∗ in PointCNN w/o X as Fo. Clearly, F∗ can be quite scattered in the R^{K×C} space, since differences in the input point order result in different F∗. On the other hand, if the learned X could perfectly canonize F∗, FX would stay at a canonical point in the space.
To verify this, we show the T-SNE visualization of Fo, F∗ and FX for 15 randomly picked representative points from the ModelNet40 dataset in Figure 5, each with one color that is consistent across the sub-figures. Note that Fo is quite "blended", which indicates that the features from different representative points are not discriminative against each other (Figure 5a). While F∗ is better than Fo, it is still "fuzzy" (Figure 5b). In Figure 5c, FX is "concentrated" by X, and the features of each representative point become highly discriminative. To give a quantitative reference for the "concentration" effect, we first compute the feature centers of the different representative points, and then classify all the feature points to the representative points they belong to by a nearest-neighbor search to the centers. The classification accuracies are 76.83%, 89.29% and 94.72% for Fo, F∗ and FX, respectively.
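The nearest-center classification used as the quantitative "concentration" reference above can be sketched as follows. The synthetic features and the 15-class/64-dimension setup are stand-ins for illustration, not the paper's actual data:

```python
import numpy as np

def center_classification_accuracy(features, labels):
    """Classify every feature vector to its nearest per-class feature center and
    return the fraction assigned back to the representative point it came from."""
    classes = np.unique(labels)
    centers = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Squared distance from each of the N features to each center: shape (N, n_classes).
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    predicted = classes[np.argmin(d2, axis=1)]
    return float((predicted == labels).mean())

# Synthetic stand-in for F_X: 15 representative points, 64-dim features,
# tightly concentrated around per-point centers.
rng = np.random.default_rng(42)
n_points, n_samples, dim = 15, 40, 64
true_centers = rng.normal(size=(n_points, dim))
labels = np.repeat(np.arange(n_points), n_samples)
features = true_centers[labels] + 0.1 * rng.normal(size=(labels.size, dim))

print(center_classification_accuracy(features, labels))  # well-concentrated -> near 1.0
```

The more concentrated the feature sets are around their centers, the higher this accuracy, which is why it serves as a proxy for the "concentration" effect of the X-transformation.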
With the qualitative visualization and the quantitative investigation, we conclude that though the "concentration" is far from reaching a single point, the improvement is significant, and it explains the performance of PointCNN in feature learning.

Methods            PointNet [33]  PointNet++ [35]  3DmFV-Net [4]  DGCNN [50]  SpecGCN [46]  PCNN [3]  PointCNN
Parameters         3.48M          1.48M            45.77M         1.84M       2.05M         8.2M      0.6M
FLOPs (Training)   43.82B         67.94B           48.57B         131.37B     49.97B        6.49B     93.03B
FLOPs (Inference)  14.70B         26.94B           16.89B         44.27B      17.79B        4.70B     25.30B
Time (Training)    0.068s         0.091s           0.101s         0.171s      14.640s       0.476s    0.031s
Time (Inference)   0.015s         0.027s           0.039s         0.064s      11.254s       0.226s    0.012s

Table 6: Parameter number, FLOPs and running time comparisons.

Optimizer, model size, memory usage and timing. We implemented PointCNN in TensorFlow [1], and use the ADAM optimizer [21] with an initial learning rate of 0.01 for the training of our models. As shown in Table 6, we summarize the running statistics of the classification model with batch size 16 and 1024 input points on an NVIDIA Tesla P100 GPU, in comparison with several other methods. PointCNN achieves 0.031/0.012 seconds per batch for training/inference in this setting. In addition, the segmentation model with 2048 input points has 4.4M parameters and runs on an NVIDIA Tesla P100 with batch size 12 at 0.61/0.25 seconds per batch for training/inference.

5 Conclusion

We proposed PointCNN, a generalization of CNNs to leveraging spatially-local correlation in data represented as point clouds. The core of PointCNN is the X-Conv operator, which weights and permutes the input points and features before they are processed by a typical convolution. While X-Conv is empirically demonstrated to be effective in practice, a rigorous understanding of it, especially when composited into a deep neural network, is still an open problem for future work.
It is also interesting to study how to combine PointCNN and image CNNs to jointly process paired point clouds and images, probably in the early stages. We open-source our code at https://github.com/yangyanli/PointCNN to encourage further development.

Acknowledgments

Yangyan would like to thank Leonidas Guibas from Stanford University and Mike Haley from Autodesk Research for insightful discussions, and Noa Fish from Tel Aviv University and Thomas Schattschneider from the Technical University of Hamburg for proofreading. The work is supported in part by the National Key Research and Development Program of China grant No. 2017YFB1002603, the National Basic Research grant (973) No. 2015CB352501, the National Science Foundation of China General Program grant No. 61772317, and the "Qilu" Young Talent Program of Shandong University.

References

[1] Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Iro Armeni, Ozan Sener, Amir R. Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In CVPR, pages 1534–1543, 2016.

[3] Matan Atzmon, Haggai Maron, and Yaron Lipman. Point convolutional neural networks by extension operators. ACM Trans. Graph., 37(4):71:1–71:12, July 2018.

[4] Yizhak Ben-Shabat, Michael Lindenbaum, and Anath Fischer. 3d point cloud classification and segmentation using 3d modified fisher vector representation for convolutional neural networks. arXiv preprint arXiv:1711.08241, 2018.

[5] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

[6] François Chollet. Xception: Deep learning with depthwise separable convolutions.
arXiv preprint arXiv:1610.02357, 2016.

[7] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). In ICLR, 2016.

[8] Rodrigo Santa Cruz, Basura Fernando, Anoop Cherian, and Stephen Gould. Deeppermnet: Visual permutation learning. In CVPR, July 2017.

[9] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.

[10] Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 1889–1898. JMLR.org, 2016.

[11] Mathias Eitz, James Hays, and Marc Alexa. How do humans sketch objects? ToG, 31(4):44:1–44:10, 2012.

[12] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. arXiv preprint arXiv:1711.10275, 2017.

[13] Benjamin Graham and Laurens van der Maaten. Submanifold sparse convolutional networks. arXiv preprint arXiv:1706.01307, 2017.

[14] Fabian Groh, Patrick Wieschollek, and Hendrik P. A. Lensch. Flex-convolution (deep learning beyond grid-worlds). arXiv preprint arXiv:1803.07289, 2018.

[15] David Ha and Douglas Eck. A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477, 2017.

[16] Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.

[17] Binh-Son Hua, Minh-Khoi Tran, and Sai-Kit Yeung. Point-wise convolutional neural network. In CVPR, 2018.

[18] Qiangui Huang, Weiyue Wang, and Ulrich Neumann. Recurrent slice networks for 3d segmentation on point clouds.
In CVPR, 2018.

[19] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.

[20] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.

[21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2014.

[22] Roman Klokov and Victor Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In ICCV, 2017.

[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, pages 1097–1105, 2012.

[24] Loïc Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. CoRR, abs/1711.09869, 2017.

[25] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[26] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[27] Jiaxin Li, Ben M. Chen, and Gim Hee Lee. So-net: Self-organizing network for point cloud analysis. In CVPR, 2018.

[28] Yangyan Li, Sören Pirk, Hao Su, Charles R Qi, and Leonidas J Guibas. Fpnn: Field probing neural networks for 3d data. In NeurIPS, pages 307–315, 2016.

[29] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. In ICLR, 2014.

[30] Haggai Maron, Meirav Galun, Noam Aigerman, Miri Trope, Nadav Dym, Ersin Yumer, Vladimir G. Kim, and Yaron Lipman. Convolutional neural networks on surfaces via seamless toric covers.
ACM Trans. Graph., 36(4):71:1–71:10, July 2017.

[31] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Jan Svoboda, and Michael M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In CVPR, July 2017.

[32] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In ICCV, pages 1520–1528, Washington, DC, USA, 2015. IEEE Computer Society.

[33] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, pages 77–85, July 2017.

[34] Charles R. Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In CVPR, pages 5648–5656, 2016.

[35] Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, pages 5105–5114, 2017.

[36] Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500, 2016.

[37] Gernot Riegler, Ali Osman Ulusoys, and Andreas Geiger. Octnet: Learning deep 3d representations at high resolutions. In CVPR, 2017.

[38] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, MICCAI, pages 234–241, Cham, 2015. Springer International Publishing.

[39] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by error propagation. In David E. Rumelhart, James L. McClelland, and CORPORATE PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pages 318–362. MIT Press, Cambridge, MA, USA, 1986.

[40] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. In NeurIPS, pages 3859–3869, 2017.

[41] Tianjia Shao, Yin Yang, Yanlin Weng, Qiming Hou, and Kun Zhou. H-CNN: spatial hashing based CNN for 3d shape analysis. arXiv preprint arXiv:1803.11385, 2018.

[42] Yiru Shen, Chen Feng, Yaoqing Yang, and Dong Tian. Mining point cloud local structures by kernel correlation and graph pooling. In CVPR, 2018.

[43] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. Splatnet: Sparse lattice networks for point cloud processing. In CVPR, 2018.

[44] Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, and Qian-Yi Zhou. Tangent convolutions for dense prediction in 3d. In CVPR, 2018.

[45] Lyne P. Tchapmi, Christopher B. Choy, Iro Armeni, JunYoung Gwak, and Silvio Savarese. Segcloud: Semantic segmentation of 3d point clouds. In 3DV, 2017.

[46] Chu Wang, Babak Samari, and Kaleem Siddiqi. Local spectral graph convolution for point set feature learning. arXiv preprint arXiv:1803.05827, 2018.

[47] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Trans. Graph., 36(4):72:1–72:11, July 2017.

[48] Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep parametric continuous convolutional neural networks. In CVPR, 2018.

[49] Weiyue Wang, Ronald Yu, Qiangui Huang, and Ulrich Neumann. SGPN: similarity group proposal network for 3d point cloud instance segmentation. In CVPR, 2018.

[50] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.

[51] Shihao Wu, Hui Huang, Minglun Gong, Matthias Zwicker, and Daniel Cohen-Or. Deep points consolidation. ToG, 34(6):176:1–176:13, October 2015.

[52] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, pages 1912–1920, 2015.

[53] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. Spidercnn: Deep learning on point sets with parameterized convolutional filters. arXiv preprint arXiv:1803.11527, 2018.

[54] Li Yi, Vladimir G. Kim, Duygu Ceylan, I-Chao Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, and Leonidas Guibas. A scalable active framework for region annotation in 3d shape collections. ToG, 35(6):210:1–210:12, November 2016.

[55] Li Yi, Hao Su, Xingwen Guo, and Leonidas Guibas. Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation. In CVPR, pages 6584–6592, July 2017.

[56] Li Yi, Hao Su, Lin Shao, Manolis Savva, Haibin Huang, Yang Zhou, Benjamin Graham, Martin Engelcke, Roman Klokov, Victor Lempitsky, et al. Large-scale 3d shape reconstruction and segmentation from shapenet core55. arXiv preprint arXiv:1710.06104, 2017.

[57] Qian Yu, Yongxin Yang, Feng Liu, Yi-Zhe Song, Tao Xiang, and Timothy M. Hospedales. Sketch-a-net: A deep neural network that beats humans. IJCV, 122(3):411–425, May 2017.

[58] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R. Salakhutdinov, and Alexander J Smola. Deep sets. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R.
Garnett, editors, NeurIPS, pages 3394–3404, 2017.